Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve choice of IndexOfXx routine for some TryFindNextStartingPosition implementations #89099

Merged
merged 1 commit into from Jul 18, 2023

Conversation

stephentoub
Copy link
Member

@stephentoub stephentoub commented Jul 18, 2023

Earlier in .NET 8, we updated the Regex compiler and source generator to be able to vectorize a search for any set, not just simple ones. When one of the main routines couldn't be used, we emit a specialized IndexOfAny helper that uses SearchValues to search for any matching ASCII character or a Unicode character, and if it encounters a Unicode character, it falls back to a linear scan. This meant that a bunch of sets that wouldn't previously have taken these paths now do, but some of those sets have more efficient means of searching; for example, for the set [^aA] that searches case-insensitive for anything other than an 'A', with these scheme we'll emit a whole routine that uses SearchValues with a fallback, but we could just use IndexOfAnyExcept('A', 'a'). This fixes the compiler / source generator to prefer such helpers instead when available.

For example, previously [GeneratedRegex(@"[^Aa]")] would result in this being emitted:

        /// <summary>Finds the next index of any character that matches a character in the set [^Aa].</summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static int IndexOfNonAsciiOrAny_43339B0AA38B69F44701E535D8179738784663016A5CB17A6B1AEB2FB5F9D08F(this ReadOnlySpan<char> span)
        {
            int i = span.IndexOfAnyExcept(Utilities.s_ascii_200000002000000);
            if ((uint)i < (uint)span.Length)
            {
                if (char.IsAscii(span[i]))
                {
                    return i;
                }
        
                do
                {
                    if (((span[i] | 0x20) != 'a'))
                    {
                        return i;
                    }
                    i++;
                }
                while ((uint)i < (uint)span.Length);
            }
        
            return -1;
        }
        
        /// <summary>Supports searching for characters in or not in "Aa".</summary>
        internal static readonly SearchValues<char> s_ascii_200000002000000 = SearchValues.Create("Aa");

which is then used like:

int i = inputSpan.Slice(pos).IndexOfNonAsciiOrAny_43339B0AA38B69F44701E535D8179738784663016A5CB17A6B1AEB2FB5F9D08F();

Now, that method isn't emitted, and the usage ends up just being:

int i = inputSpan.Slice(pos).IndexOfAnyExcept('A', 'a');

Fixes #84150

…ion implementations

Earlier in .NET 8, we updated the Regex compiler and source generator to be able to vectorize a search for any set, not just simple ones.  When one of the main routines couldn't be used, we emit a specialized IndexOfAny helper that uses SearchValues to search for any matching ASCII character or a Unicode character, and if it encounters a Unicode character, it falls back to a linear scan.  This meant that a bunch of sets that wouldn't previously have taken these paths now do, but some of those sets have more efficient means of searching; for example, for the set `[^aA]` that searches case-insensitive for anything other than an 'A', with these scheme we'll emit a whole routine that uses SearchValues with a fallback, but we could just use IndexOfAnyExcept('A', 'a').  This fixes the compiler / source generator to prefer such helpers instead when available.
@ghost
Copy link

ghost commented Jul 18, 2023

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Earlier in .NET 8, we updated the Regex compiler and source generator to be able to vectorize a search for any set, not just simple ones. When one of the main routines couldn't be used, we emit a specialized IndexOfAny helper that uses SearchValues to search for any matching ASCII character or a Unicode character, and if it encounters a Unicode character, it falls back to a linear scan. This meant that a bunch of sets that wouldn't previously have taken these paths now do, but some of those sets have more efficient means of searching; for example, for the set [^aA] that searches case-insensitive for anything other than an 'A', with these scheme we'll emit a whole routine that uses SearchValues with a fallback, but we could just use IndexOfAnyExcept('A', 'a'). This fixes the compiler / source generator to prefer such helpers instead when available.

For example, previously [GeneratedRegex(@"[^Aa]")] would result in this being emitted:

        /// <summary>Finds the next index of any character that matches a character in the set [^Aa].</summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static int IndexOfNonAsciiOrAny_43339B0AA38B69F44701E535D8179738784663016A5CB17A6B1AEB2FB5F9D08F(this ReadOnlySpan<char> span)
        {
            int i = span.IndexOfAnyExcept(Utilities.s_ascii_200000002000000);
            if ((uint)i < (uint)span.Length)
            {
                if (char.IsAscii(span[i]))
                {
                    return i;
                }
        
                do
                {
                    if (((span[i] | 0x20) != 'a'))
                    {
                        return i;
                    }
                    i++;
                }
                while ((uint)i < (uint)span.Length);
            }
        
            return -1;
        }
        
        /// <summary>Supports searching for characters in or not in "Aa".</summary>
        internal static readonly SearchValues<char> s_ascii_200000002000000 = SearchValues.Create("Aa");

which is then used like:

int i = inputSpan.Slice(pos).IndexOfNonAsciiOrAny_43339B0AA38B69F44701E535D8179738784663016A5CB17A6B1AEB2FB5F9D08F();

Now, that method isn't emitted, and the usage ends up just being:

int i = inputSpan.Slice(pos).IndexOfAnyExcept('A', 'a');
Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

Copy link
Member

@MihaZupan MihaZupan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

I appreciate the extra comments :)

@stephentoub stephentoub merged commit 74d69fd into dotnet:main Jul 18, 2023
103 checks passed
@stephentoub stephentoub deleted the regexsearch branch July 18, 2023 21:38
@dotnet dotnet locked as resolved and limited conversation to collaborators Aug 18, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regex should explicitly target IndexOfAnyExcept(char {, char, char}) when handling non-ASCII vectorized sets
2 participants