JS: Make ReDoS.ql understand lots of character classes #4585
Conversation
Could these range-reasoning improvements be used in codeql/javascript/ql/src/semmle/javascript/security/performance/SuperlinearBackTracking.qll (lines 98 to 110 in 9249444)?
Very impressive results! 👏
Thanks for breaking it up into mostly independent commits. It made it much easier to review, although I ended up deleting most of my comments again as they became obsolete.
The first commit removes a redundant check that had no influence on the final result of the query.
I'm not convinced that this was strictly true in the first commit, as `intersect` was heuristic and two compatible terms may have come up without an intersection. But with the more powerful `intersect` predicate in later commits I agree the check becomes unnecessary.
@@ -447,7 +371,7 @@ predicate isFork(State q, InputSymbol s1, InputSymbol s2, State r1, State r2) {
// Use pragma[noopt] to prevent compatible(s1,s2) from being the starting point of the join.
Outdated comment refers to the `compatible` predicate.
@@ -94,8 +94,17 @@ var good9 = '(a|aa?)*b';
// NOT GOOD
var bad18 = /(([^]|[^a])*)"/;

// NOT GOOD
// NOT GOOD - but not flagged
I don't understand why we lost this result. The change in this commit looks like a strict improvement.
I understand we win it back in a later commit, but I'd still like to understand why we lost it.
but I'd still like to understand why we lost it.
So would I.
I never figured out why it was happening.
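For context, the test case itself is a genuine true positive; a quick sketch (not part of the PR, timings vary by engine) of why `bad18` is labelled NOT GOOD:

```js
// Every non-"a" character can be consumed by either branch of ([^]|[^a]),
// so when the closing quote is missing the engine explores roughly 2^n splits.
const bad18 = /(([^]|[^a])*)"/;
const input = 'b'.repeat(28); // no closing quote anywhere

console.time('bad18');
bad18.test(input); // takes noticeably long; grows exponentially with the repeat count
console.timeEnd('bad18');
```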
// TODO: also supposed to match \f and vertical tab (\x0B).
char = [" ", "\t", "\r", "\n"]
(
char = [" ", "\t", "\r", "\n", "\\u000c", "\\u000b"]
Given the trick below (which lgtm), do we gain anything from having the strings `\u000c` and `\u000b` in this list?
We don't.
It was a mistake from my end to include those entries in that list.
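For reference, a quick check in Node (not part of the PR) confirming the TODO's claim that JavaScript's `\s` matches form feed and vertical tab:

```js
console.log(/\s/.test('\f'));   // true: form feed is matched by \s
console.log(/\s/.test('\x0B')); // true: vertical tab is matched by \s
```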
@@ -777,6 +783,7 @@ predicate isPumpable(State fork, string w) {
*/
State process(State fork, string w, int i) {
isPumpable(fork, w) and
min(string s | isPumpable(fork, s)).prefix(w.length()) = w and
This could do with a comment. So we're only processing strings that are prefixes of the minimum pumpable string. Why is that? And how is this different from requiring that `w` is the minimum pumpable string? (Given that we already use `i` to address a specific prefix of `w`, we're not at risk of losing the base case.)
That one got removed later, exactly because of the reasons you mention. (See the change in the `select`, where I moved some `not` out of the `min(..)`.)
We did in fact only look at the minimum pumpable string, so we did not lose any results when this change was made.
* Gets any char that could possibly be matched by a regular expression.
* Includes all printable ascii chars, all constants mentioned in a regexp, and all chars matches by the regexp `/\s|\d|\w/`.
*/
private string getAnyPossiblyMatchedChar() {
Not exactly true, as the internal values of long non-ascii intervals aren't included (nor should they be). Maybe just `getARelevantChar`?
@@ -873,7 +878,7 @@ from RegExpTerm t, string c, int i
where
c = min(string w | isPumpable(Match(t, i), w)) and
not isPumpable(epsilonSucc+(Match(t, i)), _) and
not epsilonSucc*(process(Match(t, i), c, c.length() - 1)) = Accept(_)
not epsilonSucc*(process(Match(t, i), c, [0 .. c.length() - 1])) = Accept(_)
Could `[0 .. c.length() - 1]` simply be replaced with `_` here?
Also I'm not sure I understand the change. The condition is that there must exist a suffix `x` that causes the string to be rejected after matching many repetitions of `w`. The commit message reads:
remove false positives where the analysis would wrongly conclude that the accept state could not be reached
The analysis isn't trying to prove that the accept state can't be reached. It's trying to prove that for some input, an accept state can't be reached. Unfortunately this is hard in an NFA due to non-determinism. (Simply checking if a rejecting state is reachable doesn't work, as for a given input string there may be another choice of edges that leads to an accept state.)
Originally we just checked that the empty string is rejected after pumping. That guarantees that such a suffix exists (the empty string fits the bill), so the check can't cause FPs but may have FNs.
With this change, we're now checking multiple strings (all the prefixes of `w`) and saying that they must all be rejected. That's not right. We only need one such rejecting suffix. It would be sufficient to check that one of them is rejected. The check now causes more FNs than before, and it didn't have any FPs to begin with.
The commit message talks about fixing FPs but modifies a check that couldn't cause FPs. I'd therefore argue that the root cause for these FPs is to be found elsewhere, and this change is hiding them for the wrong reason (due to its increased FN rate).
I think we should try to hunt down the remaining FPs (see the other comment regarding anchors) and then revisit this and see if it can be reverted once the root cause is fixed.
Could `[0 .. c.length() - 1]` simply be replaced with `_` here?
Yes.
and it didn't have any FPs to begin with.
It did have FPs.
And I don't think we got any FNs due to this change (we lost results in the test case, but those were FPs, even though some were marked as TPs).
(I might not have worded the commit message in the best way.)
What I'm doing is checking if the accepting state can be reached while processing the witness string (repeatedly).
Consider this regexp: `/(A*A*X)*/`.
If we change `[0 .. c.length() - 1]` back to `c.length() - 1`, then the analysis concludes that it can have exponential backtracking with the witness string `AXA`.
If the witness string is processed, then we end up in the state that is about to match `X`, which is not an accepting state. But while processing `AXA` we "go through" an accepting state.
The JavaScript regular expression evaluator appears to be greedy, because it will not do catastrophic backtracking if an accepting state has been reached at some point.
I cannot construct a string that will cause the above regexp to perform badly. (I can get quadratic runtime if the `g` flag is set.)
If we instead consider the regexp `/(A*A*X)*Y/` or `/(A*A*X)*$/` then it is trivial to construct a badly performing string.
I don't think I've missed anything, but I'm happy to be proved wrong if you can construct a string where performance is exponential in the length when matched against `/(A*A*X)*/`.
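A rough way to see the difference (a sketch, not from the PR; exact timings depend on the engine and machine):

```js
const input = 'AXA'.repeat(15); // repetitions of the witness string "AXA"

console.time('no suffix');
/(A*A*X)*/.test(input); // fast: the engine can stop at an accepting position
console.timeEnd('no suffix');

console.time('with suffix');
/(A*A*X)*Y/.test(input); // slow: no "Y" exists, so every way of splitting the A-runs
                         // between the two A*'s is explored before rejecting;
                         // increase the repeat count to make the blow-up dramatic
console.timeEnd('with suffix');
```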
If we change [0 .. c.length() - 1] back to c.length() - 1, then the analysis concludes that it can have exponential backtracking with the witness string AXA.
Yes, but I didn't think this check was the root cause of that FP. I saw the check and thought it was suspicious that it didn't match the theory laid out at the top of the query. It wasn't obvious why passing through an accept state mattered here. In such cases, it often pays off to get to the bottom of it, even if it's tempting to just use what appears to work and move on.
After giving it some more thought, I believe it has to do with how we model anchors (or lack of anchors). There's an ambiguity around what the accept state means. Because of our loose handling of anchors, being in the accept state could mean one of two things:
- I accept if the string ends here (the text book interpretation of the accept state), or
- I can just stop matching here if I want (the unanchored regexp interpretation).
The old check made sense assuming (1), but given how we map JS regexps to an NFA, (2) is closer to reality.
If we were to introduce proper handling of anchoring, I think the old check would need to be reinstated, but with the current NFA construction, we need to emulate the (2) interpretation. I see that the modified check here was a way to do that, and I'm happy if you want to leave it at that (but please explain in a comment). A more direct way to do the same might be to add an `Any()` transition from the accept state to itself.
As an aside, we could also try to implement better anchor support. That might be well out of scope for this PR, but just a few thoughts:
We could introduce a new state `MkAcceptAnySuffix()` which accepts an arbitrary suffix from that point on. It can be modelled via an epsilon transition to `MkAccept()` and an `Any()` edge to itself.
With that state in hand, we have a way to model anchors:
- At a `$` anchor, add a transition to `MkAccept()` (which only succeeds if the string ends here and now).
- At the end of the regexp, transition to `MkAcceptAnySuffix()`.
We might lose some TPs though, since only testing the empty string probably won't be enough anymore. Maybe the empty string check only works so well because $-anchored regexps have an unreachable accept state.
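The anchor effect is easy to reproduce with the textbook example (not from the PR; timings vary by engine):

```js
const s = 'a'.repeat(26) + '!';

/(a+)+/.test(s);  // fast: the engine can stop as soon as it has matched some a's
/(a+)+$/.test(s); // slow: the $ forces every way of splitting the run of a's to be
                  // tried and rejected, which grows exponentially with the repeat count
```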
Great writeup ❤️
Yes, I had to some extent overlooked how the accept state works compared to the theory.
I see that the modified check here was a way to do that, and I'm happy if you want to leave it at that (but please explain in a comment)
I think we should keep the implementation as is for now, and I'll put in a comment at the `process()` predicate about which assumptions it makes about the accept state.
I'll look into anchors again sometime after merging this PR.
Note: I'm somewhat sure we don't need to model the start anchor `^`.
A full run on LGTM revealed 12 projects that failed for some reason. Our search space would sometimes blow up with infeasible state pairs when there were cycles where the minimum length was long, for example with `/\s{0,3}(\s*<em class="gbifHl">\s*<\/em>\s*)+([^\s<]+\s){0,2}([^\s<]*)/`. Limiting the search space to only the state pairs from which it is feasible to reach the "goal" within the given length solved the issue.
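A generic sketch of that pruning idea (not the QL implementation; the names `shortestDistancesTo`, `reverseEdges`, and `feasible` are made up for illustration):

```js
// Precompute, for every state, the shortest distance to the goal state,
// then refuse to expand any state whose remaining length budget cannot cover it.
function shortestDistancesTo(goal, reverseEdges) {
  // BFS over reversed edges gives dist(state -> goal) for every state.
  const dist = new Map([[goal, 0]]);
  const queue = [goal];
  while (queue.length > 0) {
    const s = queue.shift();
    for (const pred of reverseEdges.get(s) ?? []) {
      if (!dist.has(pred)) {
        dist.set(pred, dist.get(s) + 1);
        queue.push(pred);
      }
    }
  }
  return dist;
}

function feasible(state, stepsLeft, distToGoal) {
  // Only keep states from which the goal is still reachable within the budget.
  const d = distToGoal.get(state);
  return d !== undefined && d <= stepsLeft;
}
```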
If you're looking for more ways to prune, perhaps we could disregard roots whose only repetition is the root itself, or is at the end of a sequence at the root. In this case, it can't possibly reject after matching (in terms of our NFA, it will always pass through the accept state when looping around).
I am. But after that last I've added your suggestion to things I'll look at for that followup.
A large evaluation came back fine.
I got prefix/suffix construction (and another round of performance improvements) ready on another branch. I can add them here, or wait and put them in another PR.
This seems ready to land, so let's merge this first and do the other improvements in a new PR.
Gets a TP/TN for CVE-2020-7662 and CVE-2018-3738.
Adds support for character classes in ReDoS (the existing support was very basic).
We can now flag a regexp like `/((\Sx|[\da-z])*)"/`, which will do exponential backtracking on many repetitions of a string like `"9x"`.
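For illustration, a small Node snippet (not part of the PR; exact timings depend on the engine):

```js
const evil = /((\Sx|[\da-z])*)"/;
const input = '9x'.repeat(26); // no closing quote, so the match must eventually fail

console.time('backtracking');
evil.test(input); // each "9x" can be consumed as one \Sx step or as two [\da-z] steps,
                  // so roughly 2^26 splits are explored before the engine gives up
console.timeEnd('backtracking');
```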
The first commit removes a redundant check that had no influence on the final result of the query. (Removing that check had no effect on test results or performance.)
Also does some performance improvements along the way. Without the performance improvements, the query would perform at least 15X worse on our test case.
Some of the commits have `// TODO:` in them; that is on purpose. (The final commit should have none.)
Each commit should be understandable on its own, and going through them in order should help understand the implementation.
Note: some of the code is refactored in later commits.
An evaluation looks good in terms of performance.
The total time taken is pretty close to being exactly the same, with some benchmarks being slower and some faster.
This commit made performance twice as bad on our test case, but apparently it is not bad in the real world.
There are many new results - overwhelmingly TPs.
And I haven't found a lost TP (some of them are still reported with a different witness string).