New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial lookbehind support for RegexCompiler / source generator #66127
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsFor .NET 7 we rewrote RegexCompiler as we were writing the source generator, and the primary feature that was left out in doing so was support for the rarely-used RegexOptions.RightToLeft. However, lookbehinds are implemented using RightToLeft support, and that's a more interesting gap. While adding back full lookbehind support would essentially entail adding back full RightToLeft support, we can get more than half of lookbehinds used in practice (based on our corpus) by supporting only a few of the fixed-width constructs. This PR adds that support back in to both RegexCompiler and the source generator. It also sets things up so we can incrementally add in more constructs as we see fit. Of the 18,964 regexes in our corpus, 17,840 are supported with the compiler / source generator today. This PR increases the supported count to 18,346, cutting the gap appx in half. To get the bulk of the remaining ~600, we'd need to add support for loops (in particular set loops). Contributes to #62345
|
This is awesome. Correct me if I’m wrong but I believe most other engines only support fixed-length look behinds anyway, and we were among the only ones that supported variable length ones (mostly because of our support of rigth-to-left) which means after this change is in, most patterns with look behinds that work on other engines will now work on our codegen ones. |
I believe we'd still need to add support for repeaters, alternations, and anchors... basically everything other than variable-length loops. Those aren't particularly hard, though, I just didn't prioritize them. Even loops aren't "hard"... it's just work and some amount of duplicated code :) I suggest we get this in, and then we can incrementally add support for additional constructs until we're happy. |
For .NET 7 we rewrote RegexCompiler as we were writing the source generator, and the primary feature that was left out in doing so was support for the rarely-used RegexOptions.RightToLeft. However, lookbehinds are implemented using RightToLeft support, and that's a more interesting gap. While adding back full lookbehind support would essentially entail adding back full RightToLeft support, we can get more than half of lookbehinds used in practice (based on our corpus) by supporting only a few of the fixed-width constructs. This PR adds that support back in to both RegexCompiler and the source generator. It also sets things up so we can incrementally add in more constructs as we see fit.
It was basically already supported after the previous round of changes.
Turns out it already was supported, since nothing relied on or updated positions. Just marked it as supported and added tests.
2306cea
to
99958fd
Compare
I had to rebase anyway to deal with merge conflicts, and as long as I was doing that, I added some more lookbehind support. Turns out anchors, alternations, and conditionals were trivial. Yay. What's left are all the various kinds of loops and backreferences. Backreferences should be straightforward but will require a bit of additional code. Repeaters (as a subset of loops) will require some tweaks but should be straightforward. The rest of loops will likely require a bunch more edits. |
@@ -166,9 +165,12 @@ private static ImmutableArray<Diagnostic> EmitRegexMethod(IndentedTextWriter wri | |||
// If we can't support custom generation for this regex, spit out a Regex constructor call. | |||
if (!SupportsCodeGeneration(rm, out string? reason)) | |||
{ | |||
// Note that we remove Compiled from the options, as it has the same support as the source generator, so if the source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I'm understanding this, then we shouldn't expect any perf regressions with this change since things that will now fallback, would have already fallback during Compiled construction right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
void EmitPositiveLookaheadAssertion(RegexNode node) | ||
// Emits the code to handle a positive lookaround assertion. This is a positive lookahead | ||
// for left-to-right and a positive lookbehind for right-to-left. | ||
void EmitPositiveLookaroundAssertion(RegexNode node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: Isn't this name a bit misleading though? PositiveLookaround should in theory mean both positive lookahead or positive lookbehind for both left-to-right and right-to-left, yet from the description above means this is really always a positive lookahead except that in right-to-left the direction is opposite. Anyway, small detail which probably doesn't matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PositiveLookaround should in theory mean both positive lookahead or positive lookbehind for both left-to-right and right-to-left
It does. The node's RegexOptions defines the direction, so it's lookahead if (node.Options & RegexOptions.RightToLeft) == 0 and it's lookbehind if (node.Options & RegexOptions.RightToLeft) != 0.
bool useSwitchedBranches = isAtomic; | ||
if (!useSwitchedBranches) | ||
bool useSwitchedBranches = false; | ||
if ((node.Options & RegexOptions.RightToLeft) == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should strictly only be true for a lookbehind in a left-to-right pattern, right? I ask because if there are any other cases where we are now planning to actually support right-to-left, we would need to modify the Scan() override to emit code that moves in the right direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If (node.Options & RegexOptions.RightToLeft) == 0, it's not a lookbehind. Basically there's just one concept: process forward or backward, dictated by whether the node has RightToLeft set. If you're in a lookbehind, the nodes it contains will all be RightToLeft.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, yes, once we enable the top-level RightToLeft option, at that point we also need to be able to make the bumpalong process backwards.
For .NET 7 we rewrote RegexCompiler as we were writing the source generator, and the primary feature that was left out in doing so was support for the rarely-used RegexOptions.RightToLeft. However, lookbehinds are implemented using RightToLeft support, and that's a more interesting gap. While adding back full lookbehind support would essentially entail adding back full RightToLeft support, we can get more than half of lookbehinds used in practice (based on our corpus) by supporting only a few of the fixed-width constructs. This PR adds that support back in to both RegexCompiler and the source generator. It also sets things up so we can incrementally add in more constructs as we see fit.
Of the 18,964 regexes in our corpus, 17,840 are supported with the compiler / source generator today. This PR increases the supported count to 18,346, cutting the gap appx in half. To get the bulk of the remaining ~600, we'd need to add support for loops (in particular set loops).
Contributes to #62345