Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Centralize regex tree analysis for atomic/capture/backtracking detection #65734

Merged
merged 1 commit into from Feb 25, 2022

Conversation

stephentoub
Copy link
Member

We currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic). This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information. The net effect of this is that in some cases where we were generating code to handle backtracking we'll no longer emit that code, we're not susceptible to O(N^2) behavior in some places we previously were for oddly shaped trees (e.g. a loop deeply nested inside of an atomic node), and things are a little bit cleaner.

Fixes #62451
Note that the issue also talks about tracking not just which nodes contain captures, but which nodes are followed by captures, as that would allow us to avoid emitting uncapturing code for nodes in expressions that contain captures but where the captures were before that node in the graph. However, having written the logic to track that, I realized it was both a little complicated and it doesn't really buy us all that much, so I decided not to go ahead with it.

@ghost
Copy link

ghost commented Feb 22, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

We currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic). This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information. The net effect of this is that in some cases where we were generating code to handle backtracking we'll no longer emit that code, we're not susceptible to O(N^2) behavior in some places we previously were for oddly shaped trees (e.g. a loop deeply nested inside of an atomic node), and things are a little bit cleaner.

Fixes #62451
Note that the issue also talks about tracking not just which nodes contain captures, but which nodes are followed by captures, as that would allow us to avoid emitting uncapturing code for nodes in expressions that contain captures but where the captures were before that node in the graph. However, having written the logic to track that, I realized it was both a little complicated and it doesn't really buy us all that much, so I decided not to go ahead with it.

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance

Milestone: 7.0.0

We currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic).  This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information.  This provides the code with faster and more accurate answers.
@stephentoub
Copy link
Member Author

@joperezr, did you have any more feedback on this? Thanks.

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add few unit tests for RegexTreeAnalyzer to parse a few expressions and ensure it succesfully calculates IsAtomic, MayContainCapture, and MayBacktrack? Possible one where _complete is false too that ensures MayContainCapture and MayBacktrack always return true?

I know that all of the new code is covered by existing unit tests already since CodeGen engines will be using it every time, but I wonder if it would be beneficial to have focused tests for this helper class.

Other than that, this LGTM, thanks @stephentoub

@stephentoub
Copy link
Member Author

Do we want to add few unit tests for RegexTreeAnalyzer to parse a few expressions and ensure it succesfully calculates IsAtomic, MayContainCapture, and MayBacktrack? Possible one where _complete is false too that ensures MayContainCapture and MayBacktrack always return true?

Sure, we can add some. I'd like to do so separately, though.

@stephentoub stephentoub merged commit 2ce0af0 into dotnet:main Feb 25, 2022
@stephentoub stephentoub deleted the regexanalysis branch February 25, 2022 21:43
@dotnet dotnet locked as resolved and limited conversation to collaborators Mar 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve Regex's knowledge of captures and atomicity in the tree
2 participants