JS: Parse regular expressions from string literals #2252

asger-semmle · 2019-11-04T16:22:59Z

This is an experiment to see what we can do if we parse all string literals as regular expressions.

The overhead in extraction time and database size is about 5%.

I've adapted the security-relevant regexp queries to use ASTs. Preserving the exact behaviour of these queries would be possible, but would also defeat the purpose of the experiment, so I've tried to take full advantage of the AST to see how it works out.

To be fair, some of these adjustments could possibly be back-ported to the original version using meta-regexps, like the fact that we now flag ^foo|bar$ as two misleading anchor precedence alerts.
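For readers unfamiliar with the anchor precedence issue, here is a minimal sketch of why `^foo|bar$` is misleading. The variable names and sample inputs are illustrative assumptions, not taken from the query or its tests.

```javascript
// Alternation binds more loosely than anchors, so /^foo|bar$/ parses as (^foo)|(bar$):
// "starts with foo" OR "ends with bar", not "is exactly foo or bar".
const misleading = /^foo|bar$/;
const intended = /^(foo|bar)$/;

const inputs = ["foo", "bar", "foobaz", "bazbar", "xfoox"];
const report = inputs.map((s) => ({
  input: s,
  misleading: misleading.test(s),
  intended: intended.test(s),
}));
console.log(report);
```

Both halves of the alternation carry only one of the two anchors, which is presumably why the pattern now produces two alerts.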

Results so far.

esbena · 2019-11-05T11:35:50Z

Comments on the linked results

This is amazing. Well done!
Some of those results will really impress programmers.

Comments inspired by looking at the results

This may need to be done in a separate PR, but if you want to squeeze something into this PR, feel free to do that.

The message for `js/regex/missing-regexp-anchor`

"This hostname pattern may match any domain name, as it is missing a '$' or '/' at the end"

Do we want to mention : as the port-separator here, or ? as the query separator?

The message links for `js/regex/missing-regexp-anchor`

The query checks if the regexp is used in a location that does not implicitly add the anchors. We could link to those locations in the alert message.

Browser extensions that may innocently be run on malicious pages

We flag a ton of browser extensions that may be run on malicious pages due to an overly permissive regexp. This may be a problem in theory due to potential information leaks, but in practice it seems benign as the extensions only modify the content of the page they are run on. I think it will be hard to whitelist these cases.

Example: https://github.com/ywzhaiqi/userscript/blob/db2346878aaa809cb3e31e35b1e68b0d3b70fb49/scripts/Super_preloaderPlus/old/Super_preloader_one.user.js#L735
- appears to scrape and modify content on http://www.xiaoshuo570.com
- may also be run on http://www.xiaoshuo570.com.evil.com
- does not send confidential information to http://www.xiaoshuo570.com.evil.com
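The situation in the example can be sketched as follows. The pattern below is a hypothetical host check in the style of a userscript; it is illustrative and not copied from the linked script.

```javascript
// Hypothetical host check: nothing after ".com" is constrained, so any
// hostname that merely starts with the trusted name is accepted.
const unanchoredHost = /^https?:\/\/www\.xiaoshuo570\.com/;

const genuine = "http://www.xiaoshuo570.com/book/1.html";
const spoofed = "http://www.xiaoshuo570.com.evil.com/book/1.html";

console.log(unanchoredHost.test(genuine)); // true
console.log(unanchoredHost.test(spoofed)); // true
```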

asger-semmle · 2019-11-05T13:26:28Z

Thanks for the comments!

> Do we want to mention : as the port-separator here, or ? as the query separator?

I admit I haven't tested for browser-specific quirks, but it seems to me that browsers and URL libraries tend to normalize URLs so there is always a / after the authority, so that should be fairly robust in practice.

Another thing I worry about is implicitly suggesting that people fix the issue by writing more complicated regexps involving #, ?, and :. IMO the two fixes we should recommend are:

- parse the URL and match against the origin using the $ anchor, or
- put a / after the hostname
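A rough sketch of the two recommended fixes, assuming a runtime that provides the WHATWG `URL` class (browsers and Node.js). The host `www.example.com` and the names `isTrustedOrigin` and `hostWithSlash` are placeholders for illustration.

```javascript
// Fix 1: parse the URL and match the origin with a '$' anchor.
function isTrustedOrigin(href) {
  let origin;
  try {
    origin = new URL(href).origin;
  } catch (e) {
    return false; // not a valid absolute URL
  }
  return /^https?:\/\/www\.example\.com$/.test(origin);
}

// Fix 2: keep matching the URL string, but require '/' after the hostname.
// This relies on the URL having been normalized, as `URL.href` does.
const hostWithSlash = /^https?:\/\/www\.example\.com\//;

console.log(isTrustedOrigin("http://www.example.com/page"));
console.log(isTrustedOrigin("http://www.example.com.evil.com/page"));
console.log(hostWithSlash.test(new URL("http://www.example.com?q=1").href));
```

Note that the origin includes a non-default port, so the first fix as written rejects `http://www.example.com:8080/`; whether that is desirable depends on the application.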

> The query checks if the regexp is used in a location that does not implicitly add the anchors. We could link to those locations in the alert message.

It does? I don't recall anything like that happening, but maybe I've accidentally removed it? I've actually been worrying about FPs from anchors being dynamically added to a string.

As far as I can tell, none of the native regexp-matching functions have implicit anchoring. Anchors can be added by string manipulation, and by checking the match-index after the match was found. Am I missing anything here?
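The two ways of adding anchoring mentioned here can be sketched like this. `hostPattern` and `matchesAtStart` are hypothetical names used only for illustration.

```javascript
// A pattern stored as a string literal; on its own it looks unanchored.
const hostPattern = "example\\.com";

// 1. Anchors added by string manipulation before the regexp is compiled.
const anchored = new RegExp("^" + hostPattern + "$");

// 2. Anchoring emulated by checking the match index after the match was found.
function matchesAtStart(re, s) {
  const m = re.exec(s);
  return m !== null && m.index === 0;
}

console.log(anchored.test("example.com.evil.com"));
console.log(matchesAtStart(/example\.com/, "evil.example.com"));
```

In both cases the string or regexp literal itself carries no anchor, which is the source of the false-positive concern.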

> but in practice it seems benign as the extensions only modify the content of the page they are run on

This can certainly lead to an information leak. The spoofed version of the page will listen for these DOM modifications and phone home (this has happened).

With that said, I admit the results on userscript aren't very interesting, but it's a bit of an outlier, and the alerts are technically correct so I'm not too worried.

esbena · 2019-11-05T13:40:52Z

> It does? I don't recall anything like that happening, but maybe I've accidentally removed it?

Sorry about that, I was mistaken here. I just checked: those call sites are only used as a FP filter.

> As far as I can tell, none of the native regexp-matching functions have implicit anchoring

I think you are right, I think I just remember going through all of them way back to avoid exactly that case.

--

I agree with your other replies.

--

So all in all, I think the results are fine, and I look forward to merging this.

max-schaefer

Fantastic work!

A few comments, mostly on the extractor bits; I'll get back to the QL changes later.

max-schaefer · 2019-11-06T10:48:03Z

javascript/extractor/src/com/semmle/js/extractor/ASTExtractor.java

+        char ch = rawLiteral.charAt(pos + 1);
+        if ('0' <= ch && ch <= '7') {
+          // Octal escape: \NNN
+          length = 4;


Does this account for \0 and similar? I think three digits is just the upper limit, there may be fewer.

cf. https://www.ecma-international.org/ecma-262/10.0/index.html#prod-annexB-LegacyOctalEscapeSequence

You're right, good catch!
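As a rough illustration of the point about variable length, the following sketch computes the length of a legacy octal escape according to the Annex B grammar linked above. It is written in JavaScript for illustration only; `legacyOctalEscapeLength` is a hypothetical helper, not the extractor's Java code.

```javascript
// Returns the length (including the backslash) of the legacy octal escape
// starting at raw[pos], or 0 if there is no octal escape at that position.
function legacyOctalEscapeLength(raw, pos) {
  const isOctal = (ch) => ch !== undefined && ch >= "0" && ch <= "7";
  if (raw[pos] !== "\\" || !isOctal(raw[pos + 1])) return 0;
  let length = 2; // backslash + first digit
  if (isOctal(raw[pos + 2])) {
    length = 3;
    // A third digit is only allowed when the first digit is 0-3 (value <= 0o377).
    if (raw[pos + 1] <= "3" && isOctal(raw[pos + 3])) length = 4;
  }
  return length;
}

// The inputs are raw source text, so the backslash is written as "\\" here.
for (const raw of ["\\0", "\\07", "\\012", "\\400", "\\08"]) {
  console.log(raw, legacyOctalEscapeLength(raw, 0));
}
```

Under this grammar `\400` is the two-digit escape `\40` followed by a literal `0`, and `\08` is `\0` followed by a literal `8`.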

javascript/extractor/src/com/semmle/js/extractor/ASTExtractor.java

max-schaefer · 2019-11-06T10:58:10Z

javascript/ql/src/Performance/ReDoS.ql

-class RegExpRoot extends @regexpterm {
+class RegExpRoot extends RegExpTerm {
+  RegExpParent parent;
+
  // RegExpTerm is abstract, so do not extend it.


Outdated comment?

max-schaefer · 2019-11-06T10:59:55Z

javascript/ql/src/Performance/ReDoS.ql

+    (
+      i = 0
+      or
+      i = [ 1 .. t.(RegExpConstant).getValue().length() - 1 ]


Alternatively: exists(t.(RegExpConstant).getValue().charAt(i))

max-schaefer

A few more thoughts.

max-schaefer · 2019-11-06T16:57:14Z

javascript/ql/src/semmle/javascript/Regexp.qll

+
+  /**
+   * Holds if this term is part of a regular expression literal or a string literal
+   * that is used as a regular expression.


It may be worth explaining the difference between this and the previous two predicates in a bit more detail.

Done. These predicates were meant as strawmen until a more elegant solution could be found, but I'm not sure we'll find anything better.

javascript/ql/src/semmle/javascript/Regexp.qll

max-schaefer · 2019-11-06T17:00:29Z

javascript/ql/src/semmle/javascript/Expr.qll

@@ -409,6 +409,9 @@ class BigIntLiteral extends @bigintliteral, Literal {
 */
 class StringLiteral extends @stringliteral, Literal {
  override string getStringValue() { result = getValue() }
+
+  /** Gets the value of this string literal parsed as a regular expression. */


If I understand the extractor changes correctly, not all string literals are parsed as regular expressions. Would it be worth explaining this briefly in the doc comment?

max-schaefer · 2019-11-06T17:09:10Z

javascript/ql/src/Security/CWE-020/IncompleteHostnameRegExp.ql

+ */
+predicate isLikelyCaptureGroup(RegExpGroup group) {
+  group.isCapture() and
+  not isInsideChoice(group)


Perhaps add not being inside a (negated?) lookahead/lookbehind as a criterion?

javascript/ql/src/Security/CWE-020/HostnameRegexpShared.qll

max-schaefer

A few more comments, but overall I'd be in favour of merging this soon and then let it go through a distribution upgrade or two. I think it would be good to have a change note to call out the fact that strings are now parsed as regex literals. Also, I assume this needs an internal dbscheme-hash bump PR?

max-schaefer · 2019-11-13T12:23:02Z

javascript/config/suites/javascript/regexp

@@ -0,0 +1,14 @@
+ semmlecode-javascript-queries/DOM/TargetBlank.ql: /Security/CWE/CWE-200


Why is this query included? It doesn't seem to use regular expressions much.

The suite is just here for benchmarking purposes, and the first query is just to pay the cost of computing cached predicates. I'll remove the suite in a rebase.

Suite has been removed

javascript/extractor/src/com/semmle/js/extractor/ASTExtractor.java

max-schaefer · 2019-11-13T12:39:26Z

javascript/extractor/src/com/semmle/js/extractor/OffsetTranslationBuilder.java

+ * A mapping from integers to integers, is encoded as a sequence of consecutive intervals and their
+ * corresponding output intervals.
+ */
+public class OffsetTranslationBuilder {


This class seems to be an unused duplicate of OffsetTranslation.
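For context, the idea described in the quoted doc comment (a mapping between offsets encoded as intervals) can be sketched as below. This is a hypothetical JavaScript illustration; `OffsetMap` and its methods are assumptions and do not reproduce the extractor's `OffsetTranslation` class.

```javascript
// Hypothetical interval-based offset mapping: each anchor says "input offset
// `from` maps to output offset `to`", and later offsets shift by the same delta.
class OffsetMap {
  constructor() {
    this.anchors = []; // [from, to] pairs, kept sorted by `from`
  }
  set(from, to) {
    this.anchors.push([from, to]);
    this.anchors.sort((a, b) => a[0] - b[0]);
  }
  get(offset) {
    let result = offset; // identity before the first anchor
    for (const [from, to] of this.anchors) {
      if (from > offset) break;
      result = to + (offset - from);
    }
    return result;
  }
}

// Raw source text "a\nb" (six characters including quotes) has the value a, newline, b.
const map = new OffsetMap();
map.set(0, 1); // skip the opening quote
map.set(2, 4); // 'b' follows the two-character escape \n
console.log([0, 1, 2].map((i) => map.get(i)));
```

Such a table lets positions inside the parsed string value be reported at the right place in the source file even when escapes make the two lengths differ.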

max-schaefer · 2019-11-13T12:41:25Z

javascript/extractor/src/com/semmle/js/parser/RegExpParser.java

+    if (endPos != startPos + 1
+        && endPos < src.length()
+        && "*+?{".indexOf(src.charAt(endPos)) != -1) {
+      endPos--; // Last constant belongs under an upcoming quantifier.


Does this deal correctly with non-BMP characters?

Hm, interesting question and it touches on something I forgot to bring up in the PR description.

The parsing here depends on whether the RegExp has the unicode flag u. If it does, a non-BMP character (a surrogate pair) is seen as one character; otherwise it is seen as two separate characters. The same issue comes up when non-BMP characters occur in a character class (in the absence of u, they'll be interpreted as two independent characters).

For string literals we don't know if it's going to have the unicode flag. My plan for addressing this was to parse with u by default (cf. the change to parsing of character classes), and then have a query that warns about non-BMP characters that are "split up" by the parse tree at runtime (albeit not in our own AST). This query currently only exists in the test suite, as I didn't want to go through the dance of adding a new query in the same PR, and it's not a very important query, but nice to have.

In line with the handling of character classes, I'll update this to assume the u flag.

Should be fixed now
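The runtime difference discussed in this thread can be observed with a small sketch. The character U+1F600 and the variable names are arbitrary choices for illustration.

```javascript
const face = "\u{1F600}"; // a non-BMP character, stored as the surrogate pair \uD83D\uDE00

// Without 'u', the quantifier applies only to the low surrogate.
const noFlag = new RegExp("^" + face + "+$");
// With 'u', the whole code point is a single atom.
const withFlag = new RegExp("^" + face + "+$", "u");

console.log(noFlag.test(face + face)); // false
console.log(withFlag.test(face + face)); // true

// In a character class without 'u', each surrogate is an independent member.
const classNoFlag = new RegExp("[" + face + "]");
const classWithFlag = new RegExp("[" + face + "]", "u");
console.log(classNoFlag.test("\uD83D")); // true
console.log(classWithFlag.test("\uD83D")); // false
```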

javascript/ql/src/Security/CWE-020/IncompleteHostnameRegExp.ql

javascript/ql/src/semmle/javascript/Expr.qll

esbena

The rewritten queries are much harder to get a holistic understanding of than before, although their components seem reasonable in isolation. This is expected, and we have reasonable results, so I am happy to see this merged, and fix surprising FPs and TPs later.

I have one small concern about template literals becoming second-class regexp-citizens. See the comment.

esbena · 2019-11-14T12:03:36Z

javascript/extractor/src/com/semmle/js/extractor/ASTExtractor.java

+        OffsetTranslation offsets = new OffsetTranslation();
+        offsets.set(0, 1); // skip the initial '/'
+        regexpExtractor.extract(source.substring(1, source.lastIndexOf('/')), offsets, nd, false);
+      } else if (nd.isStringLiteral() && !c.isInsideType()) {


Does this mean that we parse "foo" and 'bar', but not `baz`? If not, then the instanceof StringLiteral bits in the QL parts of this PR should be adjusted (unless we already landed the unification of template literals and string literals somewhere else).

Constant template literals aren't parsed as regular expressions. There are few legitimate use-cases for the StringLiteral class that don't have the edge case of constant template literals to worry about, which is why I would like to extract constant template literals as string literals instead of special-casing it at the use-sites. You and Max both expressed dislike for that idea so I left it out of this PR to avoid getting stuck on that issue, though I still think it's the right way forward.

esbena · 2019-11-14T12:10:54Z

javascript/ql/src/Security/CWE-020/HostnameRegexpShared.qll

@@ -0,0 +1,105 @@
+import javascript


This deserves a minor module docstring.

(can any of the predicates in this file be made private?)

esbena

One final sanity check: the fact that we parse the strings in the extractor is not exposed at the QL level, is it? We can potentially switch to QL-parsing without changing any APIs (or even adding a change note).

asger-semmle · 2019-11-14T14:16:20Z

> the fact that we parse the strings in the extractor is not exposed at the QL level, is it? We can potentially switch to QL-parsing without changing any APIs

Unfortunately not. QL-parsing would require switching to a newtype-backed AST, and while this PR doesn't technically change that, there are some possible designs that are less feasible now that RegExpTerm is shared between regexp and string literals. We won't be able to switch to QL-parsing without breaking changes as there is no way to maintain that RegExpTerm exists for string literal-regexes and is compatible with ASTNode.

esbena · 2019-11-14T22:15:03Z

Right. I forgot about the obvious newtype requirement. Approving anyway.

Co-Authored-By: Max Schaefer <54907921+max-schaefer@users.noreply.github.com>

asger-semmle added the JS and WIP labels Nov 4, 2019

asger-semmle requested a review from a team as a code owner November 4, 2019 16:23

max-schaefer requested changes Nov 6, 2019


asger-semmle force-pushed the regexp branch from 991d294 to f20f1db November 8, 2019 13:22

max-schaefer added this to the 1.23 milestone Nov 13, 2019

max-schaefer requested changes Nov 13, 2019


max-schaefer previously approved these changes Nov 13, 2019


asger-semmle dismissed max-schaefer’s stale review via beaddfb November 13, 2019 16:16

asger-semmle force-pushed the regexp branch from beaddfb to b899a0f November 13, 2019 17:33

esbena reviewed Nov 14, 2019


asger-semmle force-pushed the regexp branch from d86ef44 to 06f799f November 14, 2019 16:33

esbena previously approved these changes Nov 14, 2019


asger-semmle added 10 commits November 15, 2019 09:27

JS: Bump extractor version string

0cf191f

JS: Extract RegExp ASTs from string literals

0e1246c

JS: Merge consecutive constants in RegExps

6e1c995

JS: Extract surrogate pairs as one constant node

68d23bc

JS: Add test case for wide constants in char class

591fffc

JS: Update TRAP

c327ee5

JS: Add OffsetTranslation table (preserving behavior)

2901b5e

JS: Fix offsets in regexes parsed from strings with escapes

d3302c3

JS: Update QL API

57de638

JS: Update ReDoS query

97e5da1

asger-semmle and others added 22 commits November 15, 2019 09:27

JS: Port IncompleteHostnameRegExp

e45c361

JS: Use type inference to refine regexp string tracking

20fb771

JS: Fix FPs from TLDs without a domain name

17ad978

JS: Add missing post-anchor case to MissingRegExpAnchor

8c5b9b9

JS: Fix a FP

153d346

JS: Update test annotations

e01a984

JS: Remove outdated comment

c01005a

JS: Simplify charpred of Match

4680e3a

JS: Fix offsets of octal and unicode escape

57a9cad

JS: Do not extract string literal types as regexps

c2e0c8c

JS: Docs regarding regexp terms in string literals

dd9274e

Update javascript/ql/src/semmle/javascript/Regexp.qll

dc6c15c

Co-Authored-By: Max Schaefer <54907921+max-schaefer@users.noreply.github.com>

JS: More qldoc

2242df9

JS: Disregard capture groups in lookaround assertions

a7a90b4

JS: Check for [^.]

4d1f783

JS: Remove unused OffsetTranslationBuilder class

8fcf7a2

JS: Fix parsing of non-BMP chars before a quantifier

37aa85f

Update javascript/ql/src/Security/CWE-020/IncompleteHostnameRegExp.ql

77e5305

Co-Authored-By: Max Schaefer <54907921+max-schaefer@users.noreply.github.com>

Update javascript/ql/src/semmle/javascript/Expr.qll

607aed3

Co-Authored-By: Max Schaefer <54907921+max-schaefer@users.noreply.github.com>

JS: Stats and upgrade script

6809eed

JS: Add qldoc to HostnameRegexpShared

66db382

JS: Add change note

7a489af

asger-semmle dismissed esbena’s stale review via 7a489af November 15, 2019 09:27

asger-semmle force-pushed the regexp branch from 06f799f to 7a489af November 15, 2019 09:27

asger-semmle added the "depends on internal PR" label and removed the WIP label Nov 15, 2019

max-schaefer approved these changes Nov 15, 2019


max-schaefer merged commit 217eda3 into github:master Nov 15, 2019

max-schaefer mentioned this pull request Nov 17, 2019

Extension fails to realise that database needs to be upgraded github/vscode-codeql#168

Closed

asgerf mentioned this pull request Feb 11, 2022

Ruby: IncompleteHostnameRegExp.ql #7917

Merged
