Centralize regex tree analysis for atomic/capture/backtracking detection #65734

stephentoub · 2022-02-22T20:49:24Z

We currently either guess at some of this state based on the immediately surrounding nodes (e.g. whether the immediate child backtracks), or we do potentially expensive walks each time we need to check (e.g. walking all ancestors up to the root to determine whether a given node is to be considered atomic). This changes the code to do a pass over the graph to compute the relevant information, which the code generators can then use any time they need it. The net effect is threefold: in some cases where we were generating code to handle backtracking we'll no longer emit that code; we're no longer susceptible to O(N^2) behavior in some places we previously were for oddly shaped trees (e.g. a loop deeply nested inside of an atomic node); and things are a little bit cleaner.

Fixes #62451
Note that the issue also talks about tracking not just which nodes contain captures, but which nodes are followed by captures, as that would allow us to avoid emitting uncapturing code for nodes in expressions that contain captures but where the captures were before that node in the graph. However, having written the logic to track that, I realized it was both a little complicated and it doesn't really buy us all that much, so I decided not to go ahead with it.
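To make the idea in the description concrete, here is a minimal sketch of one pass that records, per node, whether it is atomic because of an ancestor, whether it contains a capture, and whether it may backtrack. It is an illustration in Python, not the code in this PR: the `Node` and `Analysis` types, the node kinds, and the propagation rules are simplified assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity hashing, so nodes can live in sets
class Node:
    # Hypothetical node kinds: "char", "concat", "alternate", "loop", "capture", "atomic"
    kind: str
    children: list = field(default_factory=list)


@dataclass
class Analysis:
    atomic_by_ancestor: set = field(default_factory=set)
    contains_capture: set = field(default_factory=set)
    may_backtrack: set = field(default_factory=set)


def analyze(root):
    """Walk the tree once and return the three precomputed sets."""
    result = Analysis()
    _visit(root, False, result)
    return result


def _visit(node, atomic_by_ancestor, result):
    if atomic_by_ancestor:
        result.atomic_by_ancestor.add(node)
    elif node.kind in ("alternate", "loop"):
        # Assumption: only these kinds introduce backtracking in this sketch.
        result.may_backtrack.add(node)
    if node.kind == "capture":
        result.contains_capture.add(node)

    last = len(node.children) - 1
    for i, child in enumerate(node.children):
        if node.kind == "atomic":
            child_atomic = True
        elif node.kind in ("alternate", "capture"):
            child_atomic = atomic_by_ancestor
        elif node.kind == "concat":
            # Only the final node of an atomic concatenation is itself atomic.
            child_atomic = atomic_by_ancestor and i == last
        else:
            child_atomic = False
        _visit(child, child_atomic, result)

        # Bubble child facts up so each query later is a set lookup.
        if child in result.contains_capture:
            result.contains_capture.add(node)
        if not atomic_by_ancestor and child in result.may_backtrack:
            result.may_backtrack.add(node)


# Roughly the shape of (?>a*)(b|c)
loop = Node("loop", [Node("char")])
atomic = Node("atomic", [loop])
alternate = Node("alternate", [Node("char"), Node("char")])
capture = Node("capture", [alternate])
root = Node("concat", [atomic, capture])

result = analyze(root)
```

With the sets built up front, a code generator asking "is this loop atomic?" does a set lookup instead of walking ancestors, which is where the O(N^2) behavior for deeply nested trees would otherwise come from. In this example the loop ends up in `atomic_by_ancestor` and not in `may_backtrack`, so no backtracking code would need to be emitted for it.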

ghost · 2022-02-22T20:49:33Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`
Milestone:	7.0.0

...aries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexTreeAnalyzer.cs

We currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic). This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information. This provides the code with faster and more accurate answers.

stephentoub · 2022-02-25T15:13:54Z

@joperezr, did you have any more feedback on this? Thanks.

joperezr

Do we want to add a few unit tests for RegexTreeAnalyzer that parse a few expressions and ensure it successfully calculates IsAtomic, MayContainCapture, and MayBacktrack? Possibly one where _complete is false too, to ensure that MayContainCapture and MayBacktrack always return true?

I know that all of the new code is covered by existing unit tests already since CodeGen engines will be using it every time, but I wonder if it would be beneficial to have focused tests for this helper class.
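As a rough illustration of the kind of focused test meant here, the sketch below models an analysis that can give up partway and then answers conservatively. It is a Python stand-in, not the real RegexTreeAnalyzer: the `Results` class, the `max_depth` cutoff used to force an incomplete analysis, and the node kinds are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity hashing so nodes can be stored in sets
class Node:
    kind: str  # hypothetical kinds: "char", "concat", "loop", "capture"
    children: list = field(default_factory=list)


class Results:
    def __init__(self):
        self.complete = False
        self._contains_capture = set()
        self._may_backtrack = set()

    def may_contain_capture(self, node):
        # If the analysis did not finish, assume the worst.
        return (not self.complete) or node in self._contains_capture

    def may_backtrack(self, node):
        return (not self.complete) or node in self._may_backtrack


def analyze(root, max_depth=64):
    """max_depth is a made-up stand-in for 'analysis had to stop early'."""
    results = Results()
    results.complete = _try_visit(root, results, max_depth)
    return results


def _try_visit(node, results, depth_left):
    if depth_left == 0:
        return False  # give up; callers must treat results as incomplete
    if node.kind == "capture":
        results._contains_capture.add(node)
    if node.kind == "loop":
        results._may_backtrack.add(node)
    for child in node.children:
        if not _try_visit(child, results, depth_left - 1):
            return False
        if child in results._contains_capture:
            results._contains_capture.add(node)
        if child in results._may_backtrack:
            results._may_backtrack.add(node)
    return True


# Roughly the shape of a(b)
leaf_a = Node("char")
capture = Node("capture", [Node("char")])
root = Node("concat", [leaf_a, capture])

full = analyze(root)                 # finishes: answers are precise
partial = analyze(root, max_depth=1)  # stops early: answers are conservative
```

A test along these lines would assert precise answers on `full` (for instance, `leaf_a` contains no capture) and assert that every query on `partial` returns true.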

Other than that, this LGTM, thanks @stephentoub

stephentoub · 2022-02-25T21:42:59Z

> Do we want to add a few unit tests for RegexTreeAnalyzer that parse a few expressions and ensure it successfully calculates IsAtomic, MayContainCapture, and MayBacktrack? Possibly one where _complete is false too, to ensure that MayContainCapture and MayBacktrack always return true?

Sure, we can add some. I'd like to do so separately, though.

stephentoub added the area-System.Text.RegularExpressions and tenet-performance labels Feb 22, 2022

stephentoub added this to the 7.0.0 milestone Feb 22, 2022

stephentoub requested a review from joperezr February 22, 2022 20:49

ghost assigned stephentoub Feb 22, 2022

joperezr reviewed Feb 24, 2022

...aries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexTreeAnalyzer.cs (resolved)

stephentoub force-pushed the regexanalysis branch from 995a617 to 2380596 February 24, 2022 13:05

joperezr approved these changes Feb 25, 2022

stephentoub merged commit 2ce0af0 into dotnet:main Feb 25, 2022

stephentoub deleted the regexanalysis branch February 25, 2022 21:43

This was referenced Mar 1, 2022

Regressions in System.Text.RegularExpressions.Tests.Perf_Regex_Common #66014

Closed

[Perf] Changes at 2/25/2022 12:04:02 PM dotnet/perf-autofiling-issues#3776

Closed

dotnet locked as resolved and limited conversation to collaborators Mar 28, 2022
