Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JS: Parse regular expressions from string literals #2252

Merged
merged 45 commits into from
Nov 15, 2019

Conversation

asger-semmle
Copy link
Contributor

This is an experiment to see what we can do if we parse all string literals as regular expressions.

The overhead in extraction time and database size is about 5%.

I've adapted the security-relevant regexp queries to use ASTs. Preserving the exact behaviour of these queries would be possible, but also defeat the purpose of the experiment, so I've tried to take full advantage of the AST to see how it works out.

To be fair, some of these adjustments could possibly be back-ported to the original version using meta-regexps, like the fact that we now flag ^foo|bar$ as two misleading anchor precedence alerts.

Results so far.

@asger-semmle asger-semmle added JS WIP This is a work-in-progress, do not merge yet! labels Nov 4, 2019
@asger-semmle asger-semmle requested a review from a team as a code owner November 4, 2019 16:23
@esbena
Copy link
Contributor

esbena commented Nov 5, 2019

Comments on the linked results

This is amazing. Well done!
Some of those results will really impress programmers.

Comments inspired by looking at the results

This may need to be done in a separate PR, but if you want to squeeze something into this PR, feel free to do that.

The message for js/regex/missing-regexp-anchor

"This hostname pattern may match any domain name, as it is missing a '$' or '/' at the end"

Do we want to mention : as the port-separator here, or \? as the query separator?

The message links for js/regex/missing-regexp-anchor

The query checks if the regexp is used in a location that does not implicitly add the anchors. We could link to those locations in the alert message.

Browser extensions that may innocently be run on malicious pages

We flag a ton of browser extensions that may be run on malicious pages due to an overly permissive regexp. This may be a problem in theory due to potential information leaks, but in practice it seems benign as the extensions only modify the content of the page they are run on. I think it will be hard to whitelist these cases.

@asger-semmle
Copy link
Contributor Author

Thanks for the comments!

Do we want to mention : as the port-separator here, or ? as the query separator?

I admit I haven't tested for browser-specific quirks, but it seems to me that browsers and URL libraries tend to normalize URLs so there is always a / after the authority, so that should be fairly robust in practice.

Another thing I worry about is implicitly suggesting that people fix the issue by writing more complicated regexps involving #, ?, and :. IMO the two fixes we should recommend are:

  1. parse the URL and match against the origin using the $ anchor, or
  2. put a / after the hostname

The query checks if the regexp is used in a location that does not implicitly add the anchors. We could link to those locations in the alert message.

It does? I don't recall anything like that happening, but maybe I've accidentally removed it? I've actually been worrying FPs from about anchors being dynamically added to a string.

As far as I can tell, none of the native regexp-matching functions have implicit anchoring. Anchors can be added by string manipulation, and by checking the match-index after the match was found. Am I missing anything here?

but in practice it seems benign as the extensions only modify the content of the page they are run on

This can certainly lead to information leak. The spoofed version of the page will listen for these DOM modifications and phone home (this has happened).

With that said, I admit the results on userscript aren't very interesting, but it's a bit of an outlier, and the alerts are technically correct so I'm not too worried.

@esbena
Copy link
Contributor

esbena commented Nov 5, 2019

It does? I don't recall anything like that happening, but maybe I've accidentally removed it?

Sorry about that, I was mistaken here. I just checked: those call sites are only used as a FP filter.

As far as I can tell, none of the native regexp-matching functions have implicit anchoring

I think you are right, I think I just remember going through all of them way back to avoid exactly that case.

--

I agree with your other replies.

--

So all in all, I think the results are fine, and I look forward to merging this.

Copy link
Collaborator

@max-schaefer max-schaefer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work!

A few comments, mostly on the extractor bits; I'll get back to the QL changes later.

char ch = rawLiteral.charAt(pos + 1);
if ('0' <= ch && ch <= '7') {
// Octal escape: \NNN
length = 4;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this account for \0 and similar? I think three digits is just the upper limit, there may be fewer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, good catch!

class RegExpRoot extends @regexpterm {
class RegExpRoot extends RegExpTerm {
RegExpParent parent;

// RegExpTerm is abstract, so do not extend it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outdated comment?

(
i = 0
or
i = [ 1 .. t.(RegExpConstant).getValue().length() - 1 ]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively: exists(t.(RegExpConstant).getValue().charAt(i))

Copy link
Collaborator

@max-schaefer max-schaefer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more thoughts.


/**
* Holds if this term is part of a regular expression literal or a string literal
* that is used as a regular expression.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth explaining the difference between this and the previous two predicates in a bit more detail.

Copy link
Contributor Author

@asger-semmle asger-semmle Nov 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. These predicates were meant as strawmen until a more elegant solution could be found, but I'm not sure we'll find anything better.

javascript/ql/src/semmle/javascript/Regexp.qll Outdated Show resolved Hide resolved
@@ -409,6 +409,9 @@ class BigIntLiteral extends @bigintliteral, Literal {
*/
class StringLiteral extends @stringliteral, Literal {
override string getStringValue() { result = getValue() }

/** Gets the value of this string literal parsed as a regular expression. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the extractor changes correctly, not all string literals are parsed as regular expressions. Would it be worth explaining this briefly in the doc comment?

*/
predicate isLikelyCaptureGroup(RegExpGroup group) {
group.isCapture() and
not isInsideChoice(group)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add not being inside a (negated?) lookahead/lookbehind as a criterion?

Copy link
Collaborator

@max-schaefer max-schaefer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments, but overall I'd be in favour of merging this soon and then let it go through a distribution upgrade or two. I think it would be good to have a change note to call out the fact that strings are now parsed as regex literals. Also, I assume this needs an internal dbscheme-hash bump PR?

@@ -0,0 +1,14 @@
+ semmlecode-javascript-queries/DOM/TargetBlank.ql: /Security/CWE/CWE-200
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this query included? It doesn't seem to use regular expressions much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suite is just here for benchmarking purposes, and the first query is just to pay the cost of computing cached predicates. I'll remove the suite in a rebase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suite has been removed

* A mapping from integers to integers, is encoded as a sequence of consecutive intervals and their
* corresponding output intervals.
*/
public class OffsetTranslationBuilder {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class seems to be an unused duplicate of OffsetTranslation.

if (endPos != startPos + 1
&& endPos < src.length()
&& "*+?{".indexOf(src.charAt(endPos)) != -1) {
endPos--; // Last constant belongs under an upcoming quantifier.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this deal correctly with non-BMP characters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, interesting question and it touches on something I forgot to bring up in the PR description.

The parsing here depends on whether the RegExp has the unicode flag u. If it does, they're seen as one character, otherwise two separate characters. The same issue comes up when non-BMP characters occur in a character class (in the absence of u, they'll be interpreted as two independent characters).

For string literals we don't know if it's going to have the unicode flag. My plan for addressing this was to parse with u by default (c.f. the change to parsing of character classes), and then have a query that warns about non-BMP characters that are "split up" by the parse tree at runtime (albeit not in our own AST). This query currently only exists in the test suite, as I didn't want to go through the dance of adding a new query in the same PR, and it's not a very important query, but nice to have.

In line with the handling of character classes, I'll update this to assume the u flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed now

javascript/ql/src/semmle/javascript/Expr.qll Outdated Show resolved Hide resolved
max-schaefer
max-schaefer previously approved these changes Nov 13, 2019
Copy link
Contributor

@esbena esbena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rewritten queries are much hard to get a holistic understanding of than before, although their components seem reasonable in isolation. This is expected, and we have reasonable results, so I am happy to see this merged, and fix surprising FPs and TPs later.

I have one small concern about template literals becoming second-class regexp-citizens. See the comment.

OffsetTranslation offsets = new OffsetTranslation();
offsets.set(0, 1); // skip the initial '/'
regexpExtractor.extract(source.substring(1, source.lastIndexOf('/')), offsets, nd, false);
} else if (nd.isStringLiteral() && !c.isInsideType()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we parse "foo" and 'bar', but not `baz`? If not, then the instanceof StringLiteral bits in the QL parts of this PR should be adjusted (unless we already landed the unification of template literals and string literals somewhere else).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constant template literals aren't parsed as regular expressions. There are few legitimate use-cases for the StringLiteral class that don't have the edge case of constant template literals to worry about, which is why I would like to extract constant template literals as string literals instead of special-casing it at the use-sites. You and Max both expressed dislike for that idea so I left it out of this PR to avoid getting stuck on that issue, though I still think it's the right way forward.

@@ -0,0 +1,105 @@
import javascript
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deserves a minor module docstring.

(can any of the predicates in this file be made private?)

Copy link
Contributor

@esbena esbena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final sanity check: the fact that we parse the strings in the extractor is not exposed at the QL level, is it? We can potentially switch to QL-parsing without changing any APIs (or even adding a change note).

@asger-semmle
Copy link
Contributor Author

the fact that we parse the strings in the extractor is not exposed at the QL level, is it? We can potentially switch to QL-parsing without changing any APIs

Unfortunately not. QL-parsing would require switching to a newtype-backed AST, and while this PR doesn't technically change that, there are some possible designs that are less feasible now that RegExpTerm is shared between regexp and string literals. We won't be able to switch to QL-parsing without breaking changes as there is no way to maintain that RegExpTerm exists for string literal-regexes and is compatible with ASTNode.

@esbena
Copy link
Contributor

esbena commented Nov 14, 2019

Right. I forgot about the obvious newtype requirement. Approving anyway.

esbena
esbena previously approved these changes Nov 14, 2019
@asger-semmle asger-semmle added depends on internal PR This PR should only be merged in sync with an internal Semmle PR and removed WIP This is a work-in-progress, do not merge yet! labels Nov 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
depends on internal PR This PR should only be merged in sync with an internal Semmle PR JS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants