-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JS: Parse regular expressions from string literals #2252
Conversation
Comments on the linked resultsThis is amazing. Well done! Comments inspired by looking at the resultsThis may need to be done in a separate PR, but if you want to squeeze something into this PR, feel free to do that. The message for
|
Thanks for the comments!
I admit I haven't tested for browser-specific quirks, but it seems to me that browsers and URL libraries tend to normalize URLs so there is always a Another thing I worry about is implicitly suggesting that people fix the issue by writing more complicated regexps involving
It does? I don't recall anything like that happening, but maybe I've accidentally removed it? I've actually been worrying FPs from about anchors being dynamically added to a string. As far as I can tell, none of the native regexp-matching functions have implicit anchoring. Anchors can be added by string manipulation, and by checking the match-index after the match was found. Am I missing anything here?
This can certainly lead to information leak. The spoofed version of the page will listen for these DOM modifications and phone home (this has happened). With that said, I admit the results on userscript aren't very interesting, but it's a bit of an outlier, and the alerts are technically correct so I'm not too worried. |
Sorry about that, I was mistaken here. I just checked: those call sites are only used as a FP filter.
I think you are right, I think I just remember going through all of them way back to avoid exactly that case. -- I agree with your other replies. -- So all in all, I think the results are fine, and I look forward to merging this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic work!
A few comments, mostly on the extractor bits; I'll get back to the QL changes later.
char ch = rawLiteral.charAt(pos + 1); | ||
if ('0' <= ch && ch <= '7') { | ||
// Octal escape: \NNN | ||
length = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this account for \0
and similar? I think three digits is just the upper limit, there may be fewer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, good catch!
class RegExpRoot extends @regexpterm { | ||
class RegExpRoot extends RegExpTerm { | ||
RegExpParent parent; | ||
|
||
// RegExpTerm is abstract, so do not extend it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outdated comment?
( | ||
i = 0 | ||
or | ||
i = [ 1 .. t.(RegExpConstant).getValue().length() - 1 ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively: exists(t.(RegExpConstant).getValue().charAt(i))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more thoughts.
|
||
/** | ||
* Holds if this term is part of a regular expression literal or a string literal | ||
* that is used as a regular expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth explaining the difference between this and the previous two predicates in a bit more detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. These predicates were meant as strawmen until a more elegant solution could be found, but I'm not sure we'll find anything better.
@@ -409,6 +409,9 @@ class BigIntLiteral extends @bigintliteral, Literal { | |||
*/ | |||
class StringLiteral extends @stringliteral, Literal { | |||
override string getStringValue() { result = getValue() } | |||
|
|||
/** Gets the value of this string literal parsed as a regular expression. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand the extractor changes correctly, not all string literals are parsed as regular expressions. Would it be worth explaining this briefly in the doc comment?
*/ | ||
predicate isLikelyCaptureGroup(RegExpGroup group) { | ||
group.isCapture() and | ||
not isInsideChoice(group) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps add not being inside a (negated?) lookahead/lookbehind as a criterion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments, but overall I'd be in favour of merging this soon and then let it go through a distribution upgrade or two. I think it would be good to have a change note to call out the fact that strings are now parsed as regex literals. Also, I assume this needs an internal dbscheme-hash bump PR?
@@ -0,0 +1,14 @@ | |||
+ semmlecode-javascript-queries/DOM/TargetBlank.ql: /Security/CWE/CWE-200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this query included? It doesn't seem to use regular expressions much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suite is just here for benchmarking purposes, and the first query is just to pay the cost of computing cached predicates. I'll remove the suite in a rebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suite has been removed
* A mapping from integers to integers, is encoded as a sequence of consecutive intervals and their | ||
* corresponding output intervals. | ||
*/ | ||
public class OffsetTranslationBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class seems to be an unused duplicate of OffsetTranslation
.
if (endPos != startPos + 1 | ||
&& endPos < src.length() | ||
&& "*+?{".indexOf(src.charAt(endPos)) != -1) { | ||
endPos--; // Last constant belongs under an upcoming quantifier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this deal correctly with non-BMP characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, interesting question and it touches on something I forgot to bring up in the PR description.
The parsing here depends on whether the RegExp has the unicode flag u
. If it does, they're seen as one character, otherwise two separate characters. The same issue comes up when non-BMP characters occur in a character class (in the absence of u
, they'll be interpreted as two independent characters).
For string literals we don't know if it's going to have the unicode flag. My plan for addressing this was to parse with u
by default (c.f. the change to parsing of character classes), and then have a query that warns about non-BMP characters that are "split up" by the parse tree at runtime (albeit not in our own AST). This query currently only exists in the test suite, as I didn't want to go through the dance of adding a new query in the same PR, and it's not a very important query, but nice to have.
In line with the handling of character classes, I'll update this to assume the u
flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rewritten queries are much hard to get a holistic understanding of than before, although their components seem reasonable in isolation. This is expected, and we have reasonable results, so I am happy to see this merged, and fix surprising FPs and TPs later.
I have one small concern about template literals becoming second-class regexp-citizens. See the comment.
OffsetTranslation offsets = new OffsetTranslation(); | ||
offsets.set(0, 1); // skip the initial '/' | ||
regexpExtractor.extract(source.substring(1, source.lastIndexOf('/')), offsets, nd, false); | ||
} else if (nd.isStringLiteral() && !c.isInsideType()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that we parse "foo"
and 'bar'
, but not `baz`
? If not, then the instanceof StringLiteral
bits in the QL parts of this PR should be adjusted (unless we already landed the unification of template literals and string literals somewhere else).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constant template literals aren't parsed as regular expressions. There are few legitimate use-cases for the StringLiteral
class that don't have the edge case of constant template literals to worry about, which is why I would like to extract constant template literals as string literals instead of special-casing it at the use-sites. You and Max both expressed dislike for that idea so I left it out of this PR to avoid getting stuck on that issue, though I still think it's the right way forward.
@@ -0,0 +1,105 @@ | |||
import javascript |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This deserves a minor module docstring.
(can any of the predicates in this file be made private?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One final sanity check: the fact that we parse the strings in the extractor is not exposed at the QL level, is it? We can potentially switch to QL-parsing without changing any APIs (or even adding a change note).
Unfortunately not. QL-parsing would require switching to a newtype-backed AST, and while this PR doesn't technically change that, there are some possible designs that are less feasible now that |
Right. I forgot about the obvious |
Co-Authored-By: Max Schaefer <54907921+max-schaefer@users.noreply.github.com>
Co-Authored-By: Max Schaefer <54907921+max-schaefer@users.noreply.github.com>
Co-Authored-By: Max Schaefer <54907921+max-schaefer@users.noreply.github.com>
This is an experiment to see what we can do if we parse all string literals as regular expressions.
The overhead in extraction time and database size is about 5%.
I've adapted the security-relevant regexp queries to use ASTs. Preserving the exact behaviour of these queries would be possible, but also defeat the purpose of the experiment, so I've tried to take full advantage of the AST to see how it works out.
To be fair, some of these adjustments could possibly be back-ported to the original version using meta-regexps, like the fact that we now flag
^foo|bar$
as two misleading anchor precedence alerts.Results so far.