Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use IndexOfAnyValues in the RegexCompiler and source gen #78927

Merged
merged 8 commits into from Jan 5, 2023

Conversation

MihaZupan
Copy link
Member

Closes #78203

For the source generator, we emit IndexOfAnyValues definitions under Utilities

file static class Utilities
{
    internal static readonly IndexOfAnyValues<char> AsciiHexDigit =
        IndexOfAnyValues.Create("0123456789ABCDEFabcdef");

    internal static readonly IndexOfAnyValues<char> Ascii_800600008006 =
        IndexOfAnyValues.Create("WYZwyz");
}

Whereas for the RegexCompiler, I added a new IndexOfAnyValues<char>[] field to CompiledRegexRunner and the callers end up looking something like

span.IndexOfAny(this._indexOfAnyValues[42]);

A few issues I came across while working on this:

  • Sets of 3-5 values that fall in a single range should likely prefer using the range
    • Sets like "abcd" now turn into IndexOfAnyValues instead of InRange('a', 'd')
  • The Chars for the FixedDistanceSet optimization don't support negated sets
    • Sets like [^ab] now end up using IndexOfAnyValues instead of IndexOfAnyExcept('a', 'b')
    • The number of places using IndexOfAnyValues is probably inflated because of this

@stephentoub Can you please take a look at this to see if it makes any sense?
The tests are passing, but I don't have in-depth knowledge about all the tradeoffs being made in Regex.

How do you go about benchmarking a change like this? Just pick a few patterns that you know will be affected?

@ghost
Copy link

ghost commented Nov 28, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #78203

For the source generator, we emit IndexOfAnyValues definitions under Utilities

file static class Utilities
{
    internal static readonly IndexOfAnyValues<char> AsciiHexDigit =
        IndexOfAnyValues.Create("0123456789ABCDEFabcdef");

    internal static readonly IndexOfAnyValues<char> Ascii_800600008006 =
        IndexOfAnyValues.Create("WYZwyz");
}

Whereas for the RegexCompiler, I added a new IndexOfAnyValues<char>[] field to CompiledRegexRunner and the callers end up looking something like

span.IndexOfAny(this._indexOfAnyValues[42]);

A few issues I came across while working on this:

  • Sets of 3-5 values that fall in a single range should likely prefer using the range
    • Sets like "abcd" now turn into IndexOfAnyValues instead of InRange('a', 'd')
  • The Chars for the FixedDistanceSet optimization don't support negated sets
    • Sets like [^ab] now end up using IndexOfAnyValues instead of IndexOfAnyExcept('a', 'b')
    • The number of places using IndexOfAnyValues is probably inflated because of this

@stephentoub Can you please take a look at this to see if it makes any sense?
The tests are passing, but I don't have in-depth knowledge about all the tradeoffs being made in Regex.

How do you go about benchmarking a change like this? Just pick a few patterns that you know will be affected?

Author: MihaZupan
Assignees: MihaZupan
Labels:

area-System.Text.RegularExpressions

Milestone: 8.0.0

@stephentoub
Copy link
Member

Sets of 3-5 values that fall in a single range should likely prefer using the range

Is this something you plan to address in this PR, or we should address it subsequently?

The Chars for the FixedDistanceSet optimization don't support negated sets

Probably worth opening an issue for this. That said, I've seen diminishing returns in the past around the negated sets, as it's often (not always) the case that the thing you're then searching for occurs really soon.

Can you please take a look at this to see if it makes any sense?

It's lovely. Nice work.

The tests are passing

Can you run outerloop locally as well? That'll run another ~18,000 patterns through the source generator and compiler.

How do you go about benchmarking a change like this? Just pick a few patterns that you know will be affected?

For a sweeping change like this, I'll typically run all the regex tests in dotnet/performance, in particular https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs, https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Common.cs, and https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/runtime/BenchmarksGame/regex-redux-5.cs.

@MihaZupan

This comment was marked as outdated.

@MihaZupan
Copy link
Member Author

Method Toolchain Options Mean Error Ratio MannWhitney(5%)
OneNodeBacktracking \main\corerun.exe IgnoreCase, Compiled 58.89 ns 0.370 ns 1.00 Base
OneNodeBacktracking \pr\corerun.exe IgnoreCase, Compiled 77.09 ns 0.458 ns 1.31 Slower

Cases like OneNodeBacktracking ([^a]+\.[^z]+) regress for IgnoreCase + Compiled where it starts using IndexOfAnyExcept(this._array[42]).

I'd have to check what this means for source-generated cases where there's less per-call overhead for IndexOfAnyValues compared to the compiled version.

I'll keep digging...

@MihaZupan
Copy link
Member Author

MihaZupan commented Dec 5, 2022

  • I did make the "prefer the range over chars for sets like [abc]" here.
  • I also removed the negated AsciiSet variant from the primary set as it only seemed to have regressed things.
    • I had the change to allow negated sets of 2-5 chars as well, but the benchmarks weren't happy. I'll open an issue to track that.
  • The only negated primary set is the range (as before this PR), I just made a change to use IndexOfAnyExcept('a') instead of IndexOfAnyExceptInRange('a', 'a') for length=1 sets.
  • I tweaked the primary set sorting again as my change to check the length of chars first regressed some patterns by preferring things like IndexOf(' ') over IndexOfAny('k', 'K', 'K') (where the space has a much higher frequency according to the table).

Here are all the Regex benchmarks from dotnet/performance: https://gist.github.com/MihaZupan/533a8b9c57815077cec768b59f17b835

The largest regression is for the \p{Sm} pattern when using the interpreter/NonBacktracking engine - not sure how I did that.

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@EgorBo
Copy link
Member

EgorBo commented Jan 10, 2023

@dotnet dotnet locked as resolved and limited conversation to collaborators Feb 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use new IndexOfAnyValues in RegexCompiler and source generator
3 participants