Eliminate backtracking in the interpreter for patterns with .* #51508

pgovind · 2021-04-19T18:33:39Z

I spent the last week looking at some potential optimizations in the RegexInterpreter and found this improvement. This PR doesn't change current behavior and is a straightforward optimization. Here is how it works:

Given a pattern such as .*foo and a text such as abfoocde, the RegexInterpreter currently sees the .* and zips to the end of the text (technically, to the first \n character). It then checks for foo starting from the end, backtracking one character at a time from e to f until it finds foo in the text, at which point it stops and returns a match. That amounts to 6 backtracking (and text compare) operations (e, d, c, o, o, f). With this change, after zipping to the end, we use LastIndexOf to find the last occurrence of foo, which is the first position where backtracking could succeed, and reset the current position to that index. If LastIndexOf returns -1, we reset to the position we were at before zipping to the end, which saves all of that backtracking work.
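As a rough illustration of the difference (a minimal sketch, not the interpreter's actual code; the class and method names are hypothetical, and newline handling is left out), the two strategies can be compared side by side:

```csharp
using System;

public static class DotStarBacktrackSketch
{
    // Old behavior, simplified: ".*" has consumed through the end of 'text', so step
    // back one character at a time and test 'literal' at each position.
    // Returns the index where 'literal' matches (or -1); 'steps' counts positions tried.
    public static int BacktrackOneByOne(string text, int loopStart, string literal, out int steps)
    {
        steps = 0;
        for (int pos = text.Length - 1; pos >= loopStart; pos--)
        {
            steps++;
            bool fits = pos + literal.Length <= text.Length;
            if (fits && string.CompareOrdinal(text, pos, literal, 0, literal.Length) == 0)
            {
                return pos;
            }
        }
        return -1;
    }

    // New behavior, simplified: a single LastIndexOf call jumps straight to the last
    // occurrence of 'literal' at or after 'loopStart' (or reports -1).
    public static int JumpWithLastIndexOf(string text, int loopStart, string literal)
    {
        int index = text.AsSpan(loopStart).LastIndexOf(literal.AsSpan());
        return index < 0 ? -1 : loopStart + index;
    }
}
```

For abfoocde and foo, BacktrackOneByOne tries 6 positions before landing on index 2, which lines up with the count in the description, while JumpWithLastIndexOf reaches index 2 with one call.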

Required follow up to this PR:

Equivalent changes to RegexCompiler
Add a benchmark to dotnet/performance

Potential follow up to investigate after this PR:

Same optimization for patterns with oneloop and setloop nodes.

Fixes the "Optimize .*" item in #1349

Perf numbers on my machine:

_backtracking = new Regex(".*(ss)");
[Benchmark] public void Backtracking() => _backtracking.Match("Essential services are provided by regular exprs.");

> dotnet run --base "D:\repos\before_backtracking\" --diff "D:\repos\after_backtracking\" --threshold 0.001%
summary:
better: 3, geomean: 6.500
total diff: 3

No Slower results for the provided threshold = 0.001% and noise filter = 0.3ns.

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Text.RegularExpressions.Tests.Perf_Regex_Common.Backtracking(Options: Non |      6.56 |          1877.06 |           286.19 | several?|

There are already ~130 tests with various .* patterns, so I'm not adding any new ones yet. I'm investigating whether there are potentially interesting patterns missing from our unit tests, but I'm reasonably confident that we have a good spread already.

cc @tannergooding @danmoseley @jeffhandley

ghost · 2021-04-19T18:33:47Z

Tagging subscribers to this area: @eerhardt, @pgovind
See info in area-owners.md if you want to be subscribed.

Author:	pgovind
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

danmoseley · 2021-04-20T16:11:46Z

Is this an alternative approach to #42408 ? I need to think about it.

danmoseley · 2021-04-20T16:14:25Z

There are no patterns in our regex perf tests that would be impacted by this optimization. You might consider adding one in the perf repo before committing this, so the change before and after is on the record. In fact, given our limited set of perf tests, it might be a good idea for us to always make sure there's a perf test that would benefit before committing any interesting regex optimization. I suggest in this case several variations.

danmoseley · 2021-04-20T16:30:56Z

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs

@@ -1217,6 +1233,8 @@ protected override void Go()
                            if (len > i && _operator == RegexCode.Notoneloop)


I wonder whether this should also happen for Notoneloopatomic.

danmoseley · 2021-04-20T16:54:04Z

I would need to spend some time refamiliarizing myself with the code. It would probably be good for @stephentoub to look at it as well as Tanner as he touched it last.

pgovind · 2021-04-20T17:18:28Z

> Is this an alternative approach to #42408 ? I need to think about it.

Not really. I think it's more of a generalization of #42408. #42408 strictly optimized only patterns starting with .*, so a pattern such as hi.*foo would still have backtracked. With this PR, caching runtextpos in _maxBacktrackPosition after processing hi lets the optimization work anywhere a .* is encountered.
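To make the hi.*foo case concrete, here is a minimal sketch under stated assumptions: the helper below is hypothetical and only handles a literal prefix, a .*, and a literal suffix. The position recorded after the prefix plays the role that the comment above assigns to _maxBacktrackPosition, namely the lower bound of the region LastIndexOf is allowed to search.

```csharp
using System;

public static class BoundedDotStarSketch
{
    // Hypothetical helper: matches prefix + ".*" + suffix starting at 'start'.
    // Returns the index just past the match, or -1 if there is no match at 'start'.
    public static int MatchPrefixDotStarSuffix(string text, int start, string prefix, string suffix)
    {
        ReadOnlySpan<char> span = text.AsSpan();
        if (!span.Slice(start).StartsWith(prefix.AsSpan()))
        {
            return -1;
        }

        // Lower bound for backtracking: nothing before this point belongs to ".*".
        int loopStart = start + prefix.Length;

        // ".*" stops at the first '\n' (or the end of the text).
        int newline = span.Slice(loopStart).IndexOf('\n');
        int loopEnd = newline < 0 ? span.Length : loopStart + newline;

        // Search only the region that ".*" consumed.
        int index = span.Slice(loopStart, loopEnd - loopStart).LastIndexOf(suffix.AsSpan());
        return index < 0 ? -1 : loopStart + index + suffix.Length;
    }
}
```

Calling MatchPrefixDotStarSuffix("hifooabcd", 0, "hi", "foo") returns 5, the end of hifoo. With the text foohi and a start of 3, the call returns -1 because the foo before the prefix lies outside the searched region.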

pgovind · 2021-04-20T17:24:36Z

> I would need to spend some time refamiliarizing myself with the code.

If you have the time, I suggest working with this pattern and text:
Pattern: hi.*foo
Text: hifooabcd

Put a breakpoint at the start of the while loop in Go() and breakpoints in case RegexCode.Notoneloop:, case RegexCode.Multi: and case RegexCode.Notoneloop | RegexCode.Back: to see the backtracking in action.

danmoseley · 2021-04-20T18:33:10Z

There are multiple somewhat related optimizations that concern .*

1. This PR
2. Remove implicit anchoring optimization from Regex #42408 proposal
3. Auto atomification
4. The bump ahead mechanism.

And that is what I don't currently have clearly understood in my head right now. 😀

stephentoub · 2021-04-20T19:41:48Z

> There are multiple somewhat related optimizations that concern .*
> And that is what I don't currently have clearly understood in my head right now.

(2) and (4) are related. No matter the expression, we start at pos X, find the next place the expression could possibly start, run the match there, and if it fails, bump pos X to be X+1 and try again. That's the bump-ahead mechanism. #42408 optimizes a case where the .* occurs at the beginning of the pattern; if the .* fails to match, we know it can't possibly match until the next newline (as that's where the .* stops), so there's no point in trying again until that point, and rather than bumping by 1, we can bump until the next \n.

(3) would be if you had a pattern like .*\n, in which case the .* could become atomic because there's nothing it could "give up" that would match \n, since .*\n is (by default) the same as [^\n]*\n.

(1) doesn't require .*abc to be at the beginning of the string; it's just vectorizing via LastIndexOf the search for "abc" backwards through the search space carved out by the .*... normally we'd back up by 1 to try to match the remainder of the pattern, and this instead searches for the next backtracking spot faster.

In other words, these are all mostly orthogonal:

(2)/(4) reduce the number of places we need to run the whole Go routine
(3) avoids having to backtrack into a .* in some cases
(1) makes it faster to backtrack into a .*.
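The equivalence behind (3) can be checked through the public Regex API. This is only an illustrative sketch (the class and method names are made up for the example): it shows that .*\n, [^\n]*\n, and the atomic form (?>.*)\n agree, and that atomicity is not safe in general when the text after the loop could be satisfied by something the loop gives back.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicDotStarSketch
{
    // Returns the matched text for three spellings of "rest of the line, then newline".
    public static string[] LineMatches(string input)
    {
        string[] patterns = { @".*\n", @"[^\n]*\n", @"(?>.*)\n" };
        var values = new string[patterns.Length];
        for (int i = 0; i < patterns.Length; i++)
        {
            values[i] = Regex.Match(input, patterns[i]).Value;
        }
        return values;
    }

    // Contrast case: ".*c" needs the loop to give back a 'c', so forcing the loop
    // to be atomic changes whether the pattern matches.
    public static bool AtomicChangesResult(string input)
    {
        return Regex.IsMatch(input, @".*c") != Regex.IsMatch(input, @"(?>.*)c");
    }
}
```

For the input "abc\ndef", all three patterns in LineMatches yield "abc\n". AtomicChangesResult("abc") returns true, which is why the atomic rewrite only applies when nothing the .* could give up would match what follows.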

danmoseley · 2021-04-20T19:47:41Z

Thanks, that's helpful.

> we start at pos X, find the next place the expression could possibly start, run the match there, and if it fails, bump pos X to be X+1 and try again. That's the bump-ahead mechanism

In general, when a match fails, we inevitably bump 1 forward, whereas the bump-ahead mechanism, as I recall, was an optimization (which I believe I proposed, but have paged out) to restart from more than 1 forward. Is this correct: if you have .*abc against xyabc this would run to the first abc and continue, and if that match fails, continue from the b rather than the y ?

stephentoub · 2021-04-20T19:58:14Z

> Is this correct: if you have .*abc against xyabc this would run to the first abc and continue, and if that match fails, continue from the b rather than the y ?

If you're matching against xyabcabcabcabc, you actually need to first try to match starting at the last abc rather than the first, and then if the rest of the pattern can't match there, back up to the next to last abc, and then the next to next to last abc, and so on.
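The order of candidates described here can be sketched as a loop over LastIndexOf calls on a shrinking window. This is a simplified model with hypothetical names, not the runtime's code; the rest of the pattern is represented by a caller-supplied predicate.

```csharp
using System;

public static class LastOccurrenceFirstSketch
{
    // Walks occurrences of 'literal' inside [loopStart, loopEnd) from last to first and
    // returns the index of the first one after which 'restMatchesAt' succeeds, or -1.
    public static int FindBacktrackPoint(
        string text, int loopStart, int loopEnd, string literal, Func<int, bool> restMatchesAt)
    {
        if (literal.Length == 0)
        {
            throw new ArgumentException("literal must be non-empty", nameof(literal));
        }

        ReadOnlySpan<char> window = text.AsSpan(loopStart, loopEnd - loopStart);
        while (true)
        {
            int index = window.LastIndexOf(literal.AsSpan());
            if (index < 0)
            {
                return -1; // no earlier occurrence left to try
            }

            int afterLiteral = loopStart + index + literal.Length;
            if (restMatchesAt(afterLiteral))
            {
                return loopStart + index;
            }

            // Shrink the window so the next search can only find an occurrence that
            // starts earlier; overlapping occurrences remain reachable.
            window = window.Slice(0, index + literal.Length - 1);
        }
    }
}
```

With the text xyabcabcabcabc, the literal abc, and a predicate that requires an a right after the literal, the last abc (index 11) is tried first and rejected, and the search then settles on the next-to-last one at index 8.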

But regardless, if you can prove that you can't possibly match starting earlier than X, sure, you can jump to X. #42408 is an example of that for the case where the pattern starts with .*, and you can bump to the next \n rather than +1. In your example, with #42408 you don't even have to try again at y or b, but rather look for the next \n, find it doesn't exist, and you're done.
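A toy model of the scan loop may help separate the two bump strategies. Everything here is hypothetical (the names, the delegate shape, and the choice to resume just past the \n are assumptions made for the sketch); it only counts how many start positions get tried.

```csharp
using System;

public static class BumpAheadSketch
{
    // Tries 'tryMatchAt' at successive start positions and returns the first start that
    // matches, or -1. 'attempts' reports how many start positions were tried.
    public static int Scan(
        string text, Func<string, int, bool> tryMatchAt, bool startsWithDotStar, out int attempts)
    {
        attempts = 0;
        int pos = 0;
        while (pos <= text.Length)
        {
            attempts++;
            if (tryMatchAt(text, pos))
            {
                return pos;
            }

            if (startsWithDotStar)
            {
                // A leading ".*" already covered everything up to the next '\n', so no
                // start position on this line can succeed. Resume after the newline
                // (an assumption of this sketch), or give up if there is none.
                int newline = text.IndexOf('\n', pos);
                if (newline < 0)
                {
                    break;
                }
                pos = newline + 1;
            }
            else
            {
                pos++; // default: bump by one
            }
        }
        return -1;
    }

    // Stand-in for running ".*abc" at 'start': succeeds if "abc" occurs between
    // 'start' and the end of that line.
    public static bool DotStarAbcAt(string text, int start)
    {
        int newline = text.IndexOf('\n', start);
        int lineEnd = newline < 0 ? text.Length : newline;
        return text.AsSpan(start, lineEnd - start).LastIndexOf("abc".AsSpan()) >= 0;
    }
}
```

On the text "xyz\nxabc", bumping by one tries 5 start positions before matching at index 4, while the newline jump tries 2. On a single line without abc, the newline jump gives up after one attempt, as in the xyabc discussion above.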

pgovind · 2021-04-28T16:10:51Z

@stephentoub : I fixed the CI issues. Not super urgent to review this right away. Just making sure it doesn't get lost in your notifications :)

danmoseley · 2021-04-28T16:31:20Z

@pgovind you'll need to ping him when he's back May 21st if you want his review. Maybe one of us can review before then so that you can merge though.

pgovind · 2021-04-28T16:35:46Z

> Maybe one of us can review before then so that you can merge though

Ok, sounds good to me. I'll wait for your sign off then. It's not urgent whatsoever, but I don't want the PR to get too stale either

jeffhandley · 2021-05-21T22:29:19Z

@stephentoub If possible once you're back, it'd be great to get your review of this before the Preview 6 snap.

pgovind · 2021-07-14T23:37:43Z

> is that feasible?

Ok, this is done now and I've addressed the last comment @stephentoub

stephentoub · 2021-07-15T15:33:31Z

Can you please make sure we have tests that cover various situations here? e.g.

.* followed by something other than a string and then a string
.* followed by a string run against input that can match that string in multiple locations and where the last occurrence may or may not result in a match for the whole pattern
.* followed by a string matched with and without case sensitivity and with and without RTL
etc.
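The scenarios listed above might be exercised along these lines. This is a sketch using only the public Regex API with made-up inputs, not the tests that were added in the PR; each case is run with both the interpreter (RegexOptions.None) and RegexOptions.Compiled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotStarScenarioSketch
{
    public static string MatchValue(string pattern, string input, RegexOptions options)
        => new Regex(pattern, options).Match(input).Value;

    public static void Run()
    {
        foreach (RegexOptions engine in new[] { RegexOptions.None, RegexOptions.Compiled })
        {
            // ".*" followed by something other than a string (\d), then a string.
            Check(MatchValue(@".*\dfoo", "a1foo 2foo x", engine), "a1foo 2foo");

            // Several occurrences; the last one does not complete the whole pattern.
            Check(MatchValue(@".*foo\d", "foo1 foo2 foo", engine), "foo1 foo2");

            // Several occurrences; the last one does complete the whole pattern.
            Check(MatchValue(@".*foo\d", "foo1 foo2", engine), "foo1 foo2");

            // No occurrence at all.
            Check(MatchValue(@".*foo", "abcde", engine), "");

            // With and without case sensitivity.
            Check(MatchValue(@".*foo", "abFOOcd", engine | RegexOptions.IgnoreCase), "abFOO");
            Check(MatchValue(@".*foo", "abFOOcd", engine), "");

            // Right-to-left matching.
            Check(MatchValue(@".*foo", "abfoocde", engine | RegexOptions.RightToLeft), "abfoo");

            // ".*" must not run past a newline.
            Check(MatchValue(@".*foo", "foo\nbar", engine), "foo");
        }
    }

    private static void Check(string actual, string expected)
    {
        if (actual != expected)
        {
            throw new InvalidOperationException($"Expected \"{expected}\" but got \"{actual}\".");
        }
    }
}
```

Calling DotStarScenarioSketch.Run() throws on the first mismatch and returns normally otherwise.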

pgovind · 2021-07-16T00:38:57Z

Working on the unit tests. Will have them up tomorrow

ghost · 2021-07-16T22:01:44Z

Hello @pgovind!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (`@msftbot`) and give me an instruction to get started! Learn more here.

pgovind · 2021-07-19T17:34:22Z

@danmoseley : Can I get sign off from you to backport this to P7 please?

jeffhandley · 2021-07-19T19:28:36Z

@pgovind You can request backport approval by doing the following:

Use the backport bot to create the backport PR to the preview7 branch
Fill out the template, especially highlighting the test coverage and risk potential
Email tactics with a link to the backport PR and copy the PR description into the email
CC Dan, myself, and Stephen

pgovind · 2021-07-19T19:52:56Z

/backport to release/6.0-preview7

github-actions · 2021-07-19T19:53:11Z

Started backporting to release/6.0-preview7: https://github.com/dotnet/runtime/actions/runs/1046477355

dotnet-issue-labeler bot added the area-System.Text.RegularExpressions label Apr 19, 2021

pgovind marked this pull request as draft April 19, 2021 18:33

pgovind force-pushed the explore_better_FindFirstChar branch from 26d39ee to c8f3778 Compare April 19, 2021 18:46

pgovind marked this pull request as ready for review April 19, 2021 18:54

pgovind requested review from stephentoub and eerhardt April 19, 2021 18:55

tannergooding approved these changes Apr 20, 2021

danmoseley reviewed Apr 20, 2021

danmoseley reviewed Apr 20, 2021

danmoseley reviewed Apr 20, 2021

danmoseley closed this Apr 20, 2021

danmoseley reopened this Apr 20, 2021

jeffhandley added this to the 6.0.0 milestone Apr 21, 2021

stephentoub reviewed Apr 23, 2021

This was referenced Apr 26, 2021

Add benchmarks for upcoming optimizations dotnet/performance#1788

Merged

Eliminate backtracking in the One node for some regex patterns #51883

Closed

Remove debug unit tests

d8e73cc

pgovind force-pushed the explore_better_FindFirstChar branch from ccd6643 to d8e73cc Compare July 14, 2021 19:07

Add a length to the AsSpan call

cb3f3f2

stephentoub approved these changes Jul 15, 2021

Address RegexCompiler comments and add unit tests

3ba7f45

pgovind added the auto-merge label Jul 16, 2021

stephentoub approved these changes Jul 18, 2021

ghost merged commit 7eb749c into dotnet:main Jul 18, 2021

github-actions bot mentioned this pull request Jul 19, 2021

[release/6.0-preview7] Eliminate backtracking in the interpreter for patterns with .* #55960

Merged

This was referenced Jul 20, 2021

[Perf] Changes at 7/18/2021 2:18:25 PM DrewScoggins/performance-2#7497

Open

[Perf] Changes at 7/18/2021 2:18:25 PM DrewScoggins/performance-2#7514

Open

[Perf] Regressions in System.Text.RegularExpressions.Tests.Perf_Regex_Common #56018

Closed

BrennanConroy mentioned this pull request Jul 20, 2021

[main] Update dependencies from dotnet/efcore dotnet/runtime dotnet/aspnetcore#34491

Merged

ManickaP mentioned this pull request Jul 20, 2021

[QUIC] Remove AppContext switch from S.N.Quic #56027

Merged

stephentoub added a commit that referenced this pull request Jul 20, 2021

Revert "Eliminate backtracking in the interpreter for patterns with .* (#51508)"

8dd098f

This reverts commit 7eb749c.

stephentoub mentioned this pull request Jul 20, 2021

Revert "Eliminate backtracking in the interpreter for patterns with .* " #56031

Merged

pgovind pushed a commit to pgovind/runtime that referenced this pull request Jul 20, 2021

Revert "Eliminate backtracking in the interpreter for patterns with .* (dotnet#51508)"

5491121

This reverts commit 7eb749c.

pgovind mentioned this pull request Jul 20, 2021

Revert "Eliminate backtracking in the interpreter for patterns with .… #56034

Closed

stephentoub added a commit that referenced this pull request Jul 20, 2021

Revert "Eliminate backtracking in the interpreter for patterns with .* (#51508)" (#56031)

3847790

This reverts commit 7eb749c.

stephentoub mentioned this pull request Jul 21, 2021

Add back regex tests from reverted .* optimization #56070

Merged

jeffhandley mentioned this pull request Jul 28, 2021

Add a unit test from aspnetcore #56108

Merged

ghost locked as resolved and limited conversation to collaborators Aug 18, 2021

This pull request was closed.
