Skip to content

Normalize Regex char class containing a single negated character#1597

Merged
stephentoub merged 2 commits into
dotnet:masterfrom
stephentoub:invertcharclass
Jan 11, 2020
Merged

Normalize Regex char class containing a single negated character#1597
stephentoub merged 2 commits into
dotnet:masterfrom
stephentoub:invertcharclass

Conversation

@stephentoub

@stephentoub stephentoub commented Jan 10, 2020

Copy link
Copy Markdown
Member

When RegexFC scans the regex node tree looking for what characters can be the first character, if it encounters a negated character, it handles it by adding everything else to the set. That then ends up defeating optimizations based on IsSingletonInverse, which expects a negated character class containing only the character, as opposed to a non-negated character class containing everything but the character. There is already a RegexCharClass.Canonicalize method; this PR just adds the check and fix-up for this case.

Method Toolchain Mean Error StdDev Ratio
Test \master\corerun.exe 1,087.4 ns 3.47 ns 3.24 ns 1.00
Test \pr\corerun.exe 425.0 ns 1.43 ns 1.19 ns 0.39
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

[MemoryDiagnoser]
public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    private readonly Regex _regex = new Regex("[^b]+b", RegexOptions.Compiled);
    private readonly string _input = new string('b', 1000);
    [Benchmark] public bool Test() => _regex.IsMatch(_input);
}

Best reviewed without whitespace: https://github.com/dotnet/runtime/pull/1597/files?w=1

cc: @ViktorHofer, @danmosemsft, @pgovind, @eerhardt

@eerhardt eerhardt left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I just had a quick question on one part.

When RegexFC scans the regex node tree looking for what characters can be the first character, if it encounters a negated character, it handles it by adding everything else to the set.  That then ends up defeating optimizations based on IsSingletonInverse, which expects a negated character class containing only the character, as opposed to a non-negated character class containing everything but the character.  There is already a RegexCharClass.Canonicalize method; this PR just adds the check and fix-up for this case.
@stephentoub stephentoub merged commit 6a6ddf3 into dotnet:master Jan 11, 2020
@stephentoub stephentoub deleted the invertcharclass branch January 11, 2020 11:57
@stephentoub stephentoub added the tenet-performance Performance related issue label Jan 12, 2020
@stephentoub stephentoub added this to the 5.0 milestone Jan 12, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants