Process "multi" strings in compiled regexes 4 or 2 chars at a time when possible #1654

Merged 5 commits on Jan 14, 2020

@stephentoub stephentoub commented Jan 12, 2020

When we encounter a series of characters in a regex pattern, we emit a comparison check per character. This updates that codegen to compare 2 or 4 characters at a time, when possible. In general these sequences are not long, but they can easily be a handful of characters, and comparing with ints and longs instead of chars slightly improves both throughput and the size of the IL and JIT'd asm.
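To make the idea concrete, here is a minimal hand-written sketch of comparing a literal against input 4 chars, then 2 chars, then 1 char at a time. It is not the IL this PR emits: the class and method names (`MultiCompareSketch`, `Read4`, `Read2`, `MatchesLiteral`) are hypothetical, and it uses `MemoryMarshal.Read` for the unaligned reads as an assumption, whereas the generated code would bake the literal side in as a constant rather than reading it at run time.

```csharp
using System;
using System.Runtime.InteropServices;

public static class MultiCompareSketch
{
    // Hypothetical helper: reinterpret 4 UTF-16 chars (8 bytes) as one 64-bit value.
    private static ulong Read4(ReadOnlySpan<char> chars) =>
        MemoryMarshal.Read<ulong>(MemoryMarshal.AsBytes(chars));

    // Hypothetical helper: reinterpret 2 UTF-16 chars (4 bytes) as one 32-bit value.
    private static uint Read2(ReadOnlySpan<char> chars) =>
        MemoryMarshal.Read<uint>(MemoryMarshal.AsBytes(chars));

    // Returns true if input begins with literal, comparing wide chunks first.
    public static bool MatchesLiteral(ReadOnlySpan<char> input, ReadOnlySpan<char> literal)
    {
        if (input.Length < literal.Length)
        {
            return false;
        }

        int i = 0;

        // Compare 4 chars at a time as a single 64-bit comparison.
        for (; literal.Length - i >= 4; i += 4)
        {
            if (Read4(input.Slice(i, 4)) != Read4(literal.Slice(i, 4)))
            {
                return false;
            }
        }

        // At most 3 chars remain: compare 2 of them as a single 32-bit comparison.
        if (literal.Length - i >= 2)
        {
            if (Read2(input.Slice(i, 2)) != Read2(literal.Slice(i, 2)))
            {
                return false;
            }
            i += 2;
        }

        // At most 1 char remains: fall back to a plain char comparison.
        return i == literal.Length || input[i] == literal[i];
    }
}
```

For a 9-character literal such as "beautiful", this performs two 64-bit comparisons and one char comparison instead of nine char comparisons. Both sides are read the same way, so the result does not depend on byte order.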

@jkotas, any concerns with the change (philosophically due to the Unsafe usage, or otherwise)? Note that I first used StartsWith, but the overhead from that was very pronounced, with the cross-over point not being until around 40 or so characters.

Trivial microbenchmark:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

[MemoryDiagnoser]
public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    private readonly Regex _regex = new Regex(@"Oh what a beautiful morning.  Oh what a beautiful day.  I've got a beautiful feeling everything's going my way.", RegexOptions.Compiled);

    [Benchmark] public bool Match() => _regex.IsMatch("Oh what a beautiful morning.  Oh what a beautiful day.  I've got a beautiful feeling everything's going my way.");
}

| Method | Toolchain | Mean | Error | StdDev | Ratio |
|--------|-----------|------|-------|--------|-------|
| Match | \master\corerun.exe | 74.27 ns | 0.155 ns | 0.130 ns | 1.00 |
| Match | \pr\corerun.exe | 61.73 ns | 0.195 ns | 0.182 ns | 0.83 |

Contributes to #1349

@stephentoub stephentoub added this to the 5.0 milestone Jan 12, 2020
@stephentoub stephentoub added the tenet-performance Performance related issue label Jan 12, 2020
jkotas commented Jan 12, 2020

> @jkotas, any concerns with the change (philosophically due to the Unsafe usage, or otherwise)?

It sounds reasonable to me. High-performance parsers always end up using a pattern like this. Convenience methods like StartsWith won't cut it.

…en possible

When we encounter a series of characters in a regex pattern, we emit a comparison check per character.  This updates that codegen to compare 2 or 4 characters at a time, when possible.  In general these sequences are not long, but they can easily be a handful of characters, and comparing with ints and longs instead of chars slightly improves both throughput and the size of the IL and JIT'd asm.
And add a test.
Avoids potentially very long generated Go methods (the method the regex compiler emits, not the Go language), along with the resulting long JIT times and potential stack overflows at invocation time.  Prior to this change, a test added for a 100K-character string took a very long time to run and then overflowed the stack, whereas with this change it's very fast.
@stephentoub stephentoub merged commit 9236c93 into dotnet:master Jan 14, 2020
@stephentoub stephentoub deleted the emitmultistartswith branch January 14, 2020 00:43
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020