New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Centralize regex tree analysis for atomic/capture/backtracking detection #65734
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsWe currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic). This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information. The net effect of this is that in some cases where we were generating code to handle backtracking we'll no longer emit that code, we're not susceptible to O(N^2) behavior in some places we previously were for oddly shaped trees (e.g. a loop deeply nested inside of an atomic node), and things are a little bit cleaner. Fixes #62451
|
...aries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexTreeAnalyzer.cs
Show resolved
Hide resolved
We currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic). This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information. This provides the code with faster and more accurate answers.
995a617
to
2380596
Compare
@joperezr, did you have any more feedback on this? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add few unit tests for RegexTreeAnalyzer to parse a few expressions and ensure it succesfully calculates IsAtomic, MayContainCapture, and MayBacktrack? Possible one where _complete is false too that ensures MayContainCapture and MayBacktrack always return true?
I know that all of the new code is covered by existing unit tests already since CodeGen engines will be using it every time, but I wonder if it would be beneficial to have focused tests for this helper class.
Other than that, this LGTM, thanks @stephentoub
Sure, we can add some. I'd like to do so separately, though. |
We currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic). This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information. The net effect of this is that in some cases where we were generating code to handle backtracking we'll no longer emit that code, we're not susceptible to O(N^2) behavior in some places we previously were for oddly shaped trees (e.g. a loop deeply nested inside of an atomic node), and things are a little bit cleaner.
Fixes #62451
Note that the issue also talks about tracking not just which nodes contain captures, but which nodes are followed by captures, as that would allow us to avoid emitting uncapturing code for nodes in expressions that contain captures but where the captures were before that node in the graph. However, having written the logic to track that, I realized it was both a little complicated and it doesn't really buy us all that much, so I decided not to go ahead with it.