Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial lookbehind support for RegexCompiler / source generator #66127

Closed
wants to merge 5 commits into from

Conversation

stephentoub
Copy link
Member

For .NET 7 we rewrote RegexCompiler as we were writing the source generator, and the primary feature that was left out in doing so was support for the rarely-used RegexOptions.RightToLeft. However, lookbehinds are implemented using RightToLeft support, and that's a more interesting gap. While adding back full lookbehind support would essentially entail adding back full RightToLeft support, we can get more than half of lookbehinds used in practice (based on our corpus) by supporting only a few of the fixed-width constructs. This PR adds that support back in to both RegexCompiler and the source generator. It also sets things up so we can incrementally add in more constructs as we see fit.

Of the 18,964 regexes in our corpus, 17,840 are supported with the compiler / source generator today. This PR increases the supported count to 18,346, cutting the gap appx in half. To get the bulk of the remaining ~600, we'd need to add support for loops (in particular set loops).

Contributes to #62345

@ghost
Copy link

ghost commented Mar 3, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

For .NET 7 we rewrote RegexCompiler as we were writing the source generator, and the primary feature that was left out in doing so was support for the rarely-used RegexOptions.RightToLeft. However, lookbehinds are implemented using RightToLeft support, and that's a more interesting gap. While adding back full lookbehind support would essentially entail adding back full RightToLeft support, we can get more than half of lookbehinds used in practice (based on our corpus) by supporting only a few of the fixed-width constructs. This PR adds that support back in to both RegexCompiler and the source generator. It also sets things up so we can incrementally add in more constructs as we see fit.

Of the 18,964 regexes in our corpus, 17,840 are supported with the compiler / source generator today. This PR increases the supported count to 18,346, cutting the gap appx in half. To get the bulk of the remaining ~600, we'd need to add support for loops (in particular set loops).

Contributes to #62345

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: 7.0.0

@joperezr
Copy link
Member

joperezr commented Mar 3, 2022

This is awesome. Correct me if I’m wrong but I believe most other engines only support fixed-length look behinds anyway, and we were among the only ones that supported variable length ones (mostly because of our support of rigth-to-left) which means after this change is in, most patterns with look behinds that work on other engines will now work on our codegen ones.

@stephentoub
Copy link
Member Author

I believe we'd still need to add support for repeaters, alternations, and anchors... basically everything other than variable-length loops. Those aren't particularly hard, though, I just didn't prioritize them. Even loops aren't "hard"... it's just work and some amount of duplicated code :) I suggest we get this in, and then we can incrementally add support for additional constructs until we're happy.

For .NET 7 we rewrote RegexCompiler as we were writing the source generator, and the primary feature that was left out in doing so was support for the rarely-used RegexOptions.RightToLeft.  However, lookbehinds are implemented using RightToLeft support, and that's a more interesting gap.  While adding back full lookbehind support would essentially entail adding back full RightToLeft support, we can get more than half of lookbehinds used in practice (based on our corpus) by supporting only a few of the fixed-width constructs.  This PR adds that support back in to both RegexCompiler and the source generator.  It also sets things up so we can incrementally add in more constructs as we see fit.
It was basically already supported after the previous round of changes.
Turns out it already was supported, since nothing relied on or updated positions.  Just marked it as supported and added tests.
@stephentoub
Copy link
Member Author

I had to rebase anyway to deal with merge conflicts, and as long as I was doing that, I added some more lookbehind support. Turns out anchors, alternations, and conditionals were trivial. Yay. What's left are all the various kinds of loops and backreferences. Backreferences should be straightforward but will require a bit of additional code. Repeaters (as a subset of loops) will require some tweaks but should be straightforward. The rest of loops will likely require a bunch more edits.

@@ -166,9 +165,12 @@ private static ImmutableArray<Diagnostic> EmitRegexMethod(IndentedTextWriter wri
// If we can't support custom generation for this regex, spit out a Regex constructor call.
if (!SupportsCodeGeneration(rm, out string? reason))
{
// Note that we remove Compiled from the options, as it has the same support as the source generator, so if the source
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I'm understanding this, then we shouldn't expect any perf regressions with this change since things that will now fallback, would have already fallback during Compiled construction right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

void EmitPositiveLookaheadAssertion(RegexNode node)
// Emits the code to handle a positive lookaround assertion. This is a positive lookahead
// for left-to-right and a positive lookbehind for right-to-left.
void EmitPositiveLookaroundAssertion(RegexNode node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Isn't this name a bit misleading though? PositiveLookaround should in theory mean both positive lookahead or positive lookbehind for both left-to-right and right-to-left, yet from the description above means this is really always a positive lookahead except that in right-to-left the direction is opposite. Anyway, small detail which probably doesn't matter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PositiveLookaround should in theory mean both positive lookahead or positive lookbehind for both left-to-right and right-to-left

It does. The node's RegexOptions defines the direction, so it's lookahead if (node.Options & RegexOptions.RightToLeft) == 0 and it's lookbehind if (node.Options & RegexOptions.RightToLeft) != 0.

bool useSwitchedBranches = isAtomic;
if (!useSwitchedBranches)
bool useSwitchedBranches = false;
if ((node.Options & RegexOptions.RightToLeft) == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should strictly only be true for a lookbehind in a left-to-right pattern, right? I ask because if there are any other cases where we are now planning to actually support right-to-left, we would need to modify the Scan() override to emit code that moves in the right direction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If (node.Options & RegexOptions.RightToLeft) == 0, it's not a lookbehind. Basically there's just one concept: process forward or backward, dictated by whether the node has RightToLeft set. If you're in a lookbehind, the nodes it contains will all be RightToLeft.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, yes, once we enable the top-level RightToLeft option, at that point we also need to be able to make the bumpalong process backwards.

@stephentoub stephentoub closed this Mar 7, 2022
@stephentoub stephentoub deleted the lookbehindsupport branch March 20, 2022 23:48
@dotnet dotnet locked as resolved and limited conversation to collaborators Apr 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants