Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harden heuristics against Regexp::TimeoutError errors #6518

Merged
merged 5 commits into from
Sep 7, 2023

Conversation

lildude
Copy link
Member

@lildude lildude commented Aug 16, 2023

Description

Ruby 3.2 allows you to set a process global Regexp.timeout = <time> to limit the impact of poorly written regexes and prevent ReDoS attacks. GitHub is now enforcing a timeout which has revealed...

a) we're not playing nicely and rescuing the Regexp::TimeoutError errors so GitHub may sometime respond with a 500 error or not render a file at all
b) #6417 introduced a regex which suffers from catastrophic backtracking which ultimately causes the Regexp::TimeoutError errors. This was missed as the issue only comes to light when analysing largish files.

This PR addresses both by rescuing the Regexp::TimeoutError exceptions when analysing using the heuristics strategy - we return an empty result as misidentifcation is better than a 500 error - and replaces the bad regex that brought this to light.

Checklist:

  • I am adding new or changing current functionality
    • I have added or updated the tests for the new or changed functionality.

Ruby 3.2 allows you to set a process global `Regexp.timeout = <time>` to limit the impact of poor regex and prevent ReDoS attacks. GitHub is now enforcing a timeout so we need to be sure to play nicely.
@lildude lildude requested a review from a team as a code owner August 16, 2023 13:59
@DecimalTurn
Copy link
Contributor

DecimalTurn commented Aug 16, 2023

This looks good to me. On top of solving backtracking issues, [\s\S]* also makes things much more readable as it's clear to me that the original regex was written to match "anything including line endings" until we get to endsolid.

In a future PR, we might also want to replace other instances of (?:.|[\r\n])* even if they aren't causing trouble just yet.

EDIT: Minor difference I see is that the original insisted that endsolid be on a seperate line. The new regex (by removing the ^) allows solid and endsolid to be on the same line and still matches.

@lildude
Copy link
Member Author

lildude commented Aug 16, 2023

In a future PR, we might also want to replace other instances of (?:.|[\r\n])* even if they aren't causing trouble just yet.

Looks like we might only have two more uses of this pattern.

EDIT: Minor difference I see is that the original insisted that endsolid be on a seperate line. The new regex (by removing the ^) allows solid and endsolid to be on the same line and still matches.

Oh yes. I'll add back the ^ - it doesn't make much difference adding it back.

@@ -686,7 +686,7 @@ disambiguations:
- extensions: ['.stl']
rules:
- language: STL
pattern: '\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)'
pattern: '\A\s*solid[\s\S]*^endsolid(?:$|\s)'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. First of all, solid[\s\S] will match stuff like solidobject (which I presume isn't valid STL syntax), so either (?=$|\s) or (?:$|\s) is necessary here.

  2. Second, because this heuristic is anchored to the first-line of input (\A), it's possible to match the terminating endsolid directive using a separate regex:

    Suggested change
    pattern: '\A\s*solid[\s\S]*^endsolid(?:$|\s)'
    and:
    - pattern: '\A\s*solid(?:$|\s)'
    - pattern: '^\s*endsolid(?:$|\s)'

    (Nota bene: I haven't tested these changes locally)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why (?:$|\s) is preferred over \b in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2. Second, because this heuristic is anchored to the first-line of input (\A), it's possible to match the terminating endsolid directive using a separate regex:

Splitting things makes this much more expensive. The second regex here is really expensive (11144 steps) and quite possibly going to lead to another case of things running amok as the files grow compared my initial suggestion (29 steps), which now you point it out, needs improving for the situation you mentioned.

These all seem to match as be much more performant:

  • \A\s*solid(?:$|\s)[\s\S]*^endsolid(?:$|\s) => 33 steps
  • \A\s*solid\b[\s\S]*^endsolid(?:$|\s) => 30 steps
  • \A\s*solid\b[\s\S]*^endsolid\b => 28 steps

Copy link
Contributor

@jorendorff jorendorff Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the perf problems, everyone.

Does regex101 reliably approximate the cost to run these regexps in Ruby? The good original, before my bad version, was '\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)'; the closest thing that works in regex101 is /\A\s*solid(?=$|\s)(?:.*?)\nendsolid(?:$|\s)/gms, and it claims that is super expensive (107792 steps).

Copy link
Contributor

@jorendorff jorendorff Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I tested the regexps in Ruby.

None of them perform badly on the test case, or on a 5000 line file I found.

Here's how long each regex takes to run on the 5000-line file, in milliseconds:

original: 1.9683429999859072
current:  2.581663000048138
proposed: 1.2971370000159368

regex101 is using a different regex engine.

Copy link
Contributor

@jorendorff jorendorff Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numbers above are for Ruby 3.2.2 on my laptop. Here's the code:

source code
require 'benchmark'

ORIGINAL_RE  = /\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)/
CURRENT_RE   = /\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)/
PROPOSED_RE  = /\A\s*solid[\s\S]*^endsolid(?:$|\s)/

text = File.read("SV05_bed.stl")

t = Benchmark.measure {
  1000.times do
    ORIGINAL_RE.match?(text)
  end
}
puts "original: #{t.real}"

t = Benchmark.measure {
  1000.times do
    CURRENT_RE.match?(text)
  end
}
puts "current:  #{t.real}"

t = Benchmark.measure {
  1000.times do
    PROPOSED_RE.match?(text)
  end
}
puts "proposed: #{t.real}"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regex101 is using a different regex engine.

Ruby and PCRE differ in their interpretation of "multiline" modifiers (the parts scoped with (?m:…)). Specifically:

(?m:...)

means…

(?s:...)

means…
Ruby . matches any character, including newlines N/A (No such modifier)
PCRE/Perl ^ and $ respectively match the beginning and end of each line, rather than the input. . matches any character, including newlines; implies (?m).

See, the behaviour that (?m) normally enables in Perl/PCRE is instead the default in Ruby; the \A and \Z/\z assertions are used to match the start and end of the input string, respectively (which would normally be achieved using ^ and $ without multiline mode).

Now, why is this relevant? Because benchmarks could be skewed in favour of whichever regex happens to fail first (if it won't match . because the dotall modifier is needed, then the regex engine logically has no reason to continue reading past the first newline).

@Alhadis Alhadis changed the title Improve heuristic strategy to rescue Regexp::TimeoutError errors and fix a bad regex Harden heuristics against Regexp::TimeoutError errors Aug 19, 2023
@Alhadis
Copy link
Collaborator

Alhadis commented Aug 19, 2023

@lildude Hope you don't mind my changing the PR's title; it exceeded the 72-character limit of well-formed Git commit-messages. 😉

@Alhadis
Copy link
Collaborator

Alhadis commented Aug 26, 2023

Linguist's heuristics are trying to

  1. Scale well (natively-supported regex constructs like \R and . in multiline mode will likely be more performant than something like (?:.|\r?\n)*),
  2. Be implemented purely as regular expressions, something that greatly restricts our available options when dealing with issues like these. Recall that Linguist's heuristics were historically implemented in Ruby (à la Linguist::Generated), albeit mostly with regular expressions,
  3. Satisfy the limitations of a foreign regex engine; i.e., by removing lookarounds and replacing \R with (?:.|\r?\n)*.

To quote Matthew 6:24-34: No one can serve two masters. If @jesuschrist were active on GitHub, I'm sure he'd agree that we're trying to serve three masters (i.e., conflicting objectives). That third objective is starting to become a liability, IMHO.

/cc @jorendorff

@DecimalTurn
Copy link
Contributor

DecimalTurn commented Aug 26, 2023

... by removing lookarounds and replacing \R with (?:.|\r?\n)*

I thought we were replacing \R by (?:\r?\n|\r) like discussed here.

In the case that this PR addresses, we tried replacing (?m).* with (?:.|[\r\n])*, but it turns out that [\s\S]* is much better for performance.

Based on the Regex101 debugger, it seems like [\s\S]* allows the engine to skip all the way to the end (just like with (?m).*) which is quite helpful when the file is large:

2023-08-26_19-04-25

2023-08-26_19-05-05

@jorendorff
Copy link
Contributor

Scale well (natively-supported regex constructs like \R and . in multiline mode will likely be more performant than something like (?:.|\r?\n)*),

The portable regex in this PR is 50% faster than the original (which uses \R and . in multiline mode). So there aren't really conflicting priorities here.

I didn't meant to introduce one -- of course Linguist is a Ruby library above all, and you must do what is right for the project in that light.

@lildude lildude requested a review from Alhadis August 31, 2023 02:35
@DecimalTurn
Copy link
Contributor

DecimalTurn commented Sep 6, 2023

@lildude are you considering a minor release for these changes (ie. v7.26.1) since this can affect other projects (such as go-enry) that are dependant on Linguist or were you planning to wait for the next major release?

Also, should there be a warning about this issue in the notes for https://github.com/github-linguist/linguist/releases/tag/v7.26.0 ?

@Alhadis
Copy link
Collaborator

Alhadis commented Sep 7, 2023

@lildude
Copy link
Member Author

lildude commented Sep 7, 2023

@lildude are you considering a minor release for these changes (ie. v7.26.1) since this can affect other projects (such as go-enry) that are dependant on Linguist or were you planning to wait for the next major release?

@DecimalTurn I was going to make a patch release, however it's taken so long to finish this PR (I've been distracted by higher priorities of my day job) that it's now time for a major release anyway as the freeze date for the next GitHub Enterprise Server (GHES) release is fast approaching and I like the Linguist updates to bake for a bit in prod before the freeze in case any niggles pop-up (it's a PITA to backport them to GHES 😁).

Gonna merge this and start getting things in line for a new release early next week.

@lildude lildude merged commit a0a6d59 into master Sep 7, 2023
5 checks passed
@lildude lildude deleted the lildude/regex-improvements branch September 7, 2023 08:22
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants