Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize IndexOfAny on more than 5 chars #68328

Closed
stephentoub opened this issue Apr 21, 2022 · 87 comments · Fixed by #78093
Closed

Vectorize IndexOfAny on more than 5 chars #68328

stephentoub opened this issue Apr 21, 2022 · 87 comments · Fixed by #78093
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory tenet-performance Performance related issue
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Apr 21, 2022

EDITED BY @stephentoub on 10/26/2022 to add revised API shape:

namespace System
{
    public static class IndexOfAnyValues
    {
        public static IndexOfAnyValues<T> Create<T>(ReadOnlySpan<T> values)  where T : IEquatable<T>?;
        // in the future, could add an `IndexOfAnyValues<char> Create(ReadOnlySpan<string> values) overload
    }

    public class IndexOfAnyValues<T> where T : IEquatable<T>?
    {
        // no public/protected members, including ctors
    }

    public static partial class MemoryExtensions
    {
        // These are identical to the existing overloads, just with IndexOfAnyValues<T> instead of ReadOnlySpan<T> for the values parameter type
        public static int IndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int LastIndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int LastIndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        // ... repeat for Span overloads
    }
}

API Usage

internal static class HttpRuleParser
{
    private static readonly IndexOfAnyValues<char> s_tokenChars =
        IndexOfAnyValues.Create("!#$%&'*+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz|~");
    
    public static bool IsToken(ReadOnlySpan<char> input) =>
        input.IndexOfAnyExcept(s_tokenChars) < 0;
}

EDITED BY @MihaZupan on 10/24/2022 to add proposed API shape:

API Proposal

namespace System
{
    public static partial class MemoryExtensions
    {
        public static int IndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
        public static int IndexOfAnyExcept(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAnyExcept(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);

        public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        // ... repeat for Span overloads
    }
}

namespace System.Buffers.Text
{    
    public sealed class IndexOfAnyAsciiValues
    {
        public IndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues);
        // No other public members
    }
}

API Usage

internal static class HttpRuleParser
{
    private static readonly IndexOfAnyAsciiValues s_tokenChars =
        new("!#$%&'*+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz|~");
    
    public static bool IsToken(ReadOnlySpan<byte> input) =>
        input.IndexOfAnyExcept(s_tokenChars) < 0;
}

Alternative Designs

Static members on Ascii instead of extension methods in MemoryExtensions.
Based on feedback from API review on 2022-10-25.

namespace System.Buffers.Text;

public static class Ascii
{
    public static int IndexOfAny(ReadOnlySpan<byte> span, IndexOfAnyAsciiValues values);
    public static int IndexOfAny(ReadOnlySpan<char> span, IndexOfAnyAsciiValues values);
    // ...
}

public sealed class IndexOfAnyAsciiValues
{
    public IndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues);
}

Original post:

Earlier in the .NET 7 release, we combined the best of string.IndexOfAny and MemoryExtensions.IndexOfAny to provide a single implementation used by both. As part of this, whereas <= 5 characters are vectorized, > 5 characters use a "probabilistic map":

private static unsafe int IndexOfAnyProbabilistic(ref char searchSpace, int searchSpaceLength, ref char values, int valuesLength)
{
Debug.Assert(searchSpaceLength >= 0);
Debug.Assert(valuesLength >= 0);
ReadOnlySpan<char> valuesSpan = new ReadOnlySpan<char>(ref values, valuesLength);
ProbabilisticMap map = default;
uint* charMap = (uint*)&map;
ProbabilisticMap.Initialize(charMap, valuesSpan);
ref char cur = ref searchSpace;
while (searchSpaceLength != 0)
{
int ch = cur;
if (ProbabilisticMap.IsCharBitSet(charMap, (byte)ch) &&
ProbabilisticMap.IsCharBitSet(charMap, (byte)(ch >> 8)) &&
ProbabilisticMap.SpanContains(valuesSpan, (char)ch))
{
return (int)((nint)Unsafe.ByteOffset(ref searchSpace, ref cur) / sizeof(char));
}
searchSpaceLength--;
cur = ref Unsafe.Add(ref cur, 1);
}
return -1;
}

essentially a Bloom filter to quickly screen out characters that are definitely not in the set so that the more expensive check need only be done if it's likely the character is in the set. While faster than comparing every input character against every character in the set, we still have to walk the input character by character.

@MihaZupan's xoofx/markdig@da756f4 is an implementation of http://0x80.pl/articles/simd-byte-lookup.html#universal-algorithm, and highlights that it's possible to vectorize this operation instead. It would require startup overhead of determining whether all of the set characters are ASCII and generating a bitmap from them, but we're already doing such a loop in order to build the probabilistic map:

public static unsafe void Initialize(uint* charMap, ReadOnlySpan<char> values)
{
#if DEBUG
for (int i = 0; i < Size; i++)
{
Debug.Assert(charMap[i] == 0, "Expected charMap to be zero-initialized.");
}
#endif
bool hasAscii = false;
uint* charMapLocal = charMap; // https://github.com/dotnet/runtime/issues/9040
for (int i = 0; i < values.Length; ++i)
{
int c = values[i];
// Map low bit
SetCharBit(charMapLocal, (byte)c);
// Map high bit
c >>= 8;
if (c == 0)
{
hasAscii = true;
}
else
{
SetCharBit(charMapLocal, (byte)c);
}
}
if (hasAscii)
{
// Common to search for ASCII symbols. Just set the high value once.
charMapLocal[0] |= 1u;
}
}

We could simply augment that to also build up the ASCII bitmap, and if after initialization determine that every character was indeed ASCII, take the vectorized path of using that ASCII bitmap, otherwise fall back to using the probabilistic map as we do today.

@stephentoub stephentoub added area-System.Memory tenet-performance Performance related issue help wanted [up-for-grabs] Good issue for external contributors labels Apr 21, 2022
@stephentoub stephentoub added this to the 7.0.0 milestone Apr 21, 2022
@ghost
Copy link

ghost commented Apr 21, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Earlier in the .NET 7 release, we combined the best of string.IndexOfAny and MemoryExtensions.IndexOfAny to provide a single implementation used by both. As part of this, whereas <= 5 characters are vectorized, > 5 characters use a "probabilistic map":

private static unsafe int IndexOfAnyProbabilistic(ref char searchSpace, int searchSpaceLength, ref char values, int valuesLength)
{
Debug.Assert(searchSpaceLength >= 0);
Debug.Assert(valuesLength >= 0);
ReadOnlySpan<char> valuesSpan = new ReadOnlySpan<char>(ref values, valuesLength);
ProbabilisticMap map = default;
uint* charMap = (uint*)&map;
ProbabilisticMap.Initialize(charMap, valuesSpan);
ref char cur = ref searchSpace;
while (searchSpaceLength != 0)
{
int ch = cur;
if (ProbabilisticMap.IsCharBitSet(charMap, (byte)ch) &&
ProbabilisticMap.IsCharBitSet(charMap, (byte)(ch >> 8)) &&
ProbabilisticMap.SpanContains(valuesSpan, (char)ch))
{
return (int)((nint)Unsafe.ByteOffset(ref searchSpace, ref cur) / sizeof(char));
}
searchSpaceLength--;
cur = ref Unsafe.Add(ref cur, 1);
}
return -1;
}

essentially a Bloom filter to quickly screen out characters that are definitely not in the set so that the more expensive check need only be done if it's likely the character is in the set. While faster than comparing every input character against every character in the set, we still have to walk the input character by character.

@MihaZupan's xoofx/markdig@da756f4 is an implementation of http://0x80.pl/articles/simd-byte-lookup.html#universal-algorithm, and highlights that it's possible to vectorize this operation instead. It would require startup overhead of determining whether all of the set characters are ASCII and generating a bitmap from them, but we're already doing such a loop in order to build the probabilistic map:

public static unsafe void Initialize(uint* charMap, ReadOnlySpan<char> values)
{
#if DEBUG
for (int i = 0; i < Size; i++)
{
Debug.Assert(charMap[i] == 0, "Expected charMap to be zero-initialized.");
}
#endif
bool hasAscii = false;
uint* charMapLocal = charMap; // https://github.com/dotnet/runtime/issues/9040
for (int i = 0; i < values.Length; ++i)
{
int c = values[i];
// Map low bit
SetCharBit(charMapLocal, (byte)c);
// Map high bit
c >>= 8;
if (c == 0)
{
hasAscii = true;
}
else
{
SetCharBit(charMapLocal, (byte)c);
}
}
if (hasAscii)
{
// Common to search for ASCII symbols. Just set the high value once.
charMapLocal[0] |= 1u;
}
}

We could simply augment that to also build up the ASCII bitmap, and if after initialization determine that every character was indeed ASCII, take the vectorized path of using that ASCII bitmap, otherwise fall back to using the probabilistic map as we do today.

Author: stephentoub
Assignees: -
Labels:

area-System.Memory, tenet-performance, up-for-grabs

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 21, 2022
@stephentoub
Copy link
Member Author

stephentoub commented Apr 23, 2022

Instead of or in addition to this, we might also consider a new overload:

IndexOfAny(ReadOnlySpan<char>, UInt128)

(or potentially a ReadOnlySpan<byte> or Vector128<byte> instead of the UInt128). The existing IndexOfAny overload could use it, but it could also be used directly by code that could precompute the ASCII bitmap, reducing the per-call overhead, e.g. Regex could use this to vectorize set searches too big for the current IndexOfAny.

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Apr 25, 2022
@danmoseley danmoseley self-assigned this May 22, 2022
@danmoseley
Copy link
Member

I’d like to take a look, but if someone can do this sooner, just let me know.

It looks like the algorithm cannot be immediately transliterated into Vector128 because (at least) it relies on Ssse3.Shuffle and the generic shuffle on Vector128 doesn’t have the same behavior. Therefore we would rely on software fallback for Arm64 and potentially improve later.

@deeprobin
Copy link
Contributor

@stephentoub Is the constant 5 dependent on other parameters? Or was this constant only selected based on benchmarks?

In other words, in any case, if the length is > 5, is it guaranteed that the performance will be better than the probablistic map?


I was thinking in general about these constants that we have everywhere. Especially that we somehow evaluate them dynamically in the startup phase of the JIT, etc., because certainly in many cases parameters like microarchitecture or the use of NativeAOT or ReadyToRun make a crucial difference (i.e. a (lightweight): Runtime Benchmarking).

@stephentoub
Copy link
Member Author

Is the constant 5 dependent on other parameters?

It's not just a constant. If you look at the implementation, you'll see there's work that has to be done specific to each additional character, each one getting its own vector that needs to be factored in. It's not like there's some number to just change and magically higher counts of chars are vectorized.

@MihaZupan
Copy link
Member

we might also consider a new overload:
IndexOfAny(ReadOnlySpan<char>, UInt128)
... it could also be used directly by code that could precompute the ASCII bitmap

The only problem I see with such an API is that the pre-computed bitmap used by the vectorized algorithm isn't the ASCII bitmap itself (described a bit under the universal algorithm description).
My implementation is somewhat different from the "universal algorithm" as it only deals with the ASCII range.
For each character, the nibbles are swapped and packed together (int position = (c >> 4) | ((c & 0x0F) << 3)).

   01234567
=>
    4567     ((c & 0xF) << 3)
&      0123  (c >> 4)
==
   X4567123  (the X bit is 0)

The 0th (highest) bit is assumed to be 0 (ASCII), so it can overlap with the 7th bit.
With this trick, the whole bitmap fits into a single Vector128<byte>.

I don't know if it's possible to quickly turn an input bitmap representing the set of characters into this modified one.
For the lowest overhead, the caller would want to precompute the modified bitmap, but that may not be obvious from the API signature.

As Dan pointed out, the above approach depends on the behavior of Ssse3's Shuffle. I'm not familiar with ARM instructions, but if there's no "4 bit => 4 bit lookup" equivalent, the optimal bitmap may be different.

@stephentoub
Copy link
Member Author

For the lowest overhead, the caller would want to precompute the modified bitmap, but that may not be obvious from the API signature.

There could be a separate method that computes and returns the input to be used.

@adamsitnik adamsitnik modified the milestones: 7.0.0, 8.0.0 Aug 9, 2022
@adamsitnik
Copy link
Member

I am moving it to 8 as I don't see us implementing it before we snap. Please feel free to prove me wrong by sending a PR ;)

@gfoidl
Copy link
Member

gfoidl commented Sep 19, 2022

Re: dotnet/aspnetcore#33776 (comment)

Prelude and motivating example

Some context for the problem domain in ASP.NET Core:
Kestrel needs to verify that authority, host, etc. is valid. This is done via the helpers in HttpCharacters.
The validation is done via lookup-tables, which get initialized when kestrel starts, e.g. InitializeAuthority. The actual validation is done per plain for-loop.
In dotnet/aspnetcore#33776 (comment) I thought about vectorizing these checks, which resulted after a prototype and long wait for xplat-intrinsics in dotnet/aspnetcore#44041.
@stephentoub then reminded me about this issue, and asked what IndexOfXyz methods in runtime could be needed to avoid the custom code in aspnetcore-repo. Thus I created another prototype for GeneratedIndexOfAny.

Source generator for IndexOfAny{Except}

https://github.com/gfoidl-Tests/SourceGenerator-Vectors is quickly written prototype for a Roslyn source generator* to create vectorized code for IndexOfAny and IndexOfExcept.

* I'm sure some pieces may be written in a better way, as I don't have too much experience with writing generators.

The aforementioned validation of e.g. Authority in ASP.NET Core could be written as

private const string AlphaNumeric        = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
private const string AuthoritySpecific   = ":.-[]@";
private const string ValidAuthorityChars = AlphaNumeric + AuthoritySpecific;

// ...

public bool IsValidByGeneratedIndexOfAnyExcept() => IsAuthorityValid(this.Authority) < 0;

[GeneratedIndexOfAny(ValidAuthorityChars, FindAnyExcept = true)]
private static partial int IsAuthorityValid(ReadOnlySpan<char> host);

so basically as two-liner.

In a benchmark compared with the shipped IndexOfExcept it looks like

|                             Method |            Authority |      Mean | Ratio |
|----------------------------------- |--------------------- |----------:|------:|
|          IsValidByIndexOfAnyExcept |        hostname:8080 |  66.20 ns |  1.00 |
| IsValidByGeneratedIndexOfAnyExcept |        hostname:8080 |  12.28 ns |  0.18 |
|                                    |                      |           |       |
|          IsValidByIndexOfAnyExcept | www.t(...)e.com [71] | 339.29 ns |  1.00 |
| IsValidByGeneratedIndexOfAnyExcept | www.t(...)e.com [71] |  18.18 ns |  0.05 |
benchmark code
using BenchmarkDotNet.Attributes;

Bench bench = new();
Console.WriteLine(bench.Authority);
Console.WriteLine(bench.IsValidByIndexOfAnyExcept());
Console.WriteLine(bench.IsValidByGeneratedIndexOfAnyExcept());

#if !DEBUG
BenchmarkDotNet.Running.BenchmarkRunner.Run<Bench>();
#endif

public partial class Bench
{
    private const string AlphaNumeric        = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private const string AuthoritySpecific   = ":.-[]@";
    private const string ValidAuthorityChars = AlphaNumeric + AuthoritySpecific;

    [Params("hostname:8080", "www.thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com")]
    public string Authority { get; set; } = "hostname:8080";

    [Benchmark(Baseline = true)]
    public bool IsValidByIndexOfAnyExcept() => this.Authority.AsSpan().IndexOfAnyExcept(ValidAuthorityChars) < 0;

    [Benchmark]
    public bool IsValidByGeneratedIndexOfAnyExcept() => IsAuthorityValid(this.Authority) < 0;

    [GeneratedIndexOfAny(ValidAuthorityChars, FindAnyExcept = true)]
    private static partial int IsAuthorityValid(ReadOnlySpan<char> host);
}

For code like

internal static partial class Demo1
{
    [GeneratedIndexOfAny("1234567890")]
    public static partial int FirstIndexOfNumber(ReadOnlySpan<char> value);

    [GeneratedIndexOfAny("drvxyz")]
    public static partial int FirstIndexOfSet0(ReadOnlySpan<char> value);

    [GeneratedIndexOfAny("12🌄34")]
    public static partial int FirstIndexOfNonAsciiSet(ReadOnlySpan<char> value);
}

namespace Demo.Internal
{
    internal static partial class Demo2
    {
        [GeneratedIndexOfAny("abcd", FindAnyExcept = true)]
        public static partial int FirstIndexOfNotSet0(ReadOnlySpan<char> value);

        [GeneratedIndexOfAny("abcdef", FindAnyExcept = true)]
        public static partial int FirstIndexOfNotSet1(ReadOnlySpan<char> value);
    }
}

the generator emits

// ...
partial class Demo1
{
    [GeneratedCode("Generator", "1.0.0.0")]
    [EditorBrowsable(EditorBrowsableState.Never)]
    public static partial int FirstIndexOfNumber(System.ReadOnlySpan<char> value)
    {
        ReadOnlySpan<bool> lookup = new bool[128] { false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false };
        Vector128<sbyte> mask     = Vector128.Create(0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00).AsSByte();

        return Vector128.IsHardwareAccelerated && value.Length >= Vector128<short>.Count
            ? Core.IndexOfMatchCharVectorized<Core.Negate>(value, mask)
            : Core.IndexOfMatchCharScalar<Core.DontNegate>(value, lookup);
    }

    [GeneratedCode("Generator", "1.0.0.0")]
    [EditorBrowsable(EditorBrowsableState.Never)]
    public static partial int FirstIndexOfSet0(System.ReadOnlySpan<char> value)
    {
        ReadOnlySpan<bool> lookup = new bool[128] { false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, false, false, false, false, false, false, false, false, false, false, false, false, false, true, false, false, false, true, false, true, true, true, false, false, false, false, false };
        Vector128<sbyte> mask     = Vector128.Create(0x00, 0x00, 0x80, 0x00, 0x40, 0x00, 0x80, 0x00, 0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00).AsSByte();

        return Vector128.IsHardwareAccelerated && value.Length >= Vector128<short>.Count
            ? Core.IndexOfMatchCharVectorized<Core.Negate>(value, mask)
            : Core.IndexOfMatchCharScalar<Core.DontNegate>(value, lookup);
    }

    [GeneratedCode("Generator", "1.0.0.0")]
    [EditorBrowsable(EditorBrowsableState.Never)]
    public static partial int FirstIndexOfNonAsciiSet(System.ReadOnlySpan<char> value)
    {
        // Given SetChars consist not of all ASCII, can't handle vectorized, so use fallback
        return value.IndexOfAny("12🌄34");
    }
}

namespace Demo.Internal
{
    partial class Demo2
    {
        [GeneratedCode("Generator", "1.0.0.0")]
        [EditorBrowsable(EditorBrowsableState.Never)]
        public static partial int FirstIndexOfNotSet0(System.ReadOnlySpan<char> value)
        {
            // Given SetChars <= 5, so use specialized method from IndexOfAny{Except}
            return value.IndexOfAnyExcept("abcd");
        }

        [GeneratedCode("Generator", "1.0.0.0")]
        [EditorBrowsable(EditorBrowsableState.Never)]
        public static partial int FirstIndexOfNotSet1(System.ReadOnlySpan<char> value)
        {
            ReadOnlySpan<bool> lookup = new bool[128] { false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false };
            Vector128<sbyte> mask     = Vector128.Create(0x00, 0x40, 0x40, 0x40, 0x40, 0x40, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00).AsSByte();

            return Vector128.IsHardwareAccelerated && value.Length >= Vector128<short>.Count
                ? Core.IndexOfMatchCharVectorized<Core.DontNegate>(value, mask)
                : Core.IndexOfMatchCharScalar<Core.Negate>(value, lookup);
        }
    }
}

// ... + helper-type with the actual workhorse-methods
emitted helper-method
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int IndexOfMatchCharScalar<TNegator>(ReadOnlySpan<char> value, ReadOnlySpan<bool> lookup)
        where TNegator : struct, INegator
    {
        for (int i = 0; i < value.Length; ++i)
        {
            char c     = value[i];
            bool match = TNegator.NegateIfNeeded(lookup[c]);

            if (match)
            {
                return i;
            }
        }

        return -1;
    }

    public static int IndexOfMatchCharVectorized<TNegator>(ReadOnlySpan<char> value, Vector128<sbyte> bitMaskLookup)
        where TNegator : struct, INegator
    {
        Debug.Assert(Vector128.IsHardwareAccelerated);
        Debug.Assert(value.Length >= Vector128<short>.Count);

        // To check if a bit in a bitmask from the Bitmask is set, in a sequential code
        // we would do ((1 << bitIndex) & bitmask) != 0
        // As there is no hardware instrinic for such a shift, we use a lookup that
        // stores the shifted bitpositions.
        // So (1 << bitIndex) becomes BitPosLook[bitIndex], which is simd-friendly.
        //
        // A bitmask from the bitMaskLookup is created only for values 0..7 (one byte),
        // so to avoid a explicit check for values outside 0..7, i.e.
        // high nibbles 8..F, we use a bitpos that always results in escaping.
        Vector128<sbyte> bitPosLookup = Vector128.Create(
            0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,     // high-nibble 0..7
            0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF      // high-nibble 8..F
        ).AsSByte();

        Vector128<sbyte> nibbleMaskSByte = Vector128.Create((sbyte)0xF);
        Vector128<sbyte> zeroMaskSByte   = Vector128<sbyte>.Zero;

        nuint idx = 0;
        uint mask;
        ref short ptr = ref Unsafe.As<char, short>(ref MemoryMarshal.GetReference(value));

        if (value.Length >= 2 * Vector128<short>.Count)
        {
            nuint end = (uint)(value.Length - 2 * Vector128<short>.Count);

            do
            {
                Vector128<short> source0 = Vector128.LoadUnsafe(ref ptr, idx);
                Vector128<short> source1 = Vector128.LoadUnsafe(ref ptr, idx + 8);
                Vector128<sbyte> values  = NarrowWithSaturation(source0, source1);

                mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
                if (mask != 0)
                {
                    goto Found;
                }

                idx += 2 * (uint)Vector128<short>.Count;
            }
            while (idx <= end);
        }

        // Here we know that 8 to 15 chars are remaining. Process the first 8 chars.
        if (idx <= (uint)(value.Length - Vector128<short>.Count))
        {
            Vector128<short> source = Vector128.LoadUnsafe(ref ptr, idx);
            Vector128<sbyte> values = NarrowWithSaturation(source, source);

            mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
            if (mask != 0)
            {
                goto Found;
            }

            idx += (uint)Vector128<short>.Count;
        }

        // Here we know that < 8 chars are remaining. We shift the space around to process
        // another full vector.
        nuint remaining = (uint)value.Length - idx;
        if ((nint)remaining > 0)
        {
            remaining -= (uint)Vector128<short>.Count;

            Vector128<short> source = Vector128.LoadUnsafe(ref ptr, idx + remaining);
            Vector128<sbyte> values = NarrowWithSaturation(source, source);

            mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
            if (mask != 0)
            {
                idx += remaining;
                goto Found;
            }
        }

        goto NotFound;

    Found:
        idx += GetIndexOfFirstNeedToEscape(mask);
        return (int)idx;

    NotFound:
        return -1;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint GetIndexOfFirstNeedToEscape(uint mask)
    {
        // Found at least one byte that needs to be escaped, figure out the index of
        // the first one found that needs to be escaped within the 16 bytes.
        Debug.Assert(mask > 0 && mask <= 65_535);
        uint tzc = uint.TrailingZeroCount(mask);
        Debug.Assert(tzc < 16);

        return tzc;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint CreateEscapingMask<TNegator>(
        Vector128<sbyte> values,
        Vector128<sbyte> bitMaskLookup,
        Vector128<sbyte> bitPosLookup,
        Vector128<sbyte> nibbleMaskSByte,
        Vector128<sbyte> nullMaskSByte)
        where TNegator : struct, INegator
    {
        // To check if an input byte matches or not, we use a bitmask-lookup.
        // Therefore we split the input byte into the low- and high-nibble, which will get
        // the row-/column-index in the bit-mask.
        // The bitmask-lookup looks like:
        //                                     high-nibble
        // low-nibble  0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
        //         0   1   1   0   0   0   0   1   0   1   1   1   1   1   1   1   1
        //         1   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         2   1   1   1   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         3   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         4   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         5   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         6   1   1   1   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         7   1   1   1   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         8   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         9   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         A   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         B   1   1   1   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         C   1   1   0   1   0   1   0   0   1   1   1   1   1   1   1   1
        //         D   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         E   1   1   0   1   0   0   0   0   1   1   1   1   1   1   1   1
        //         F   1   1   0   0   0   0   0   1   1   1   1   1   1   1   1   1
        //
        // where 1 denotes a matach, while 0 means no match.
        // For high-nibbles in the range 8..F every input needs to be escaped, so we
        // can omit them in the bit-mask, thus only high-nibbles in the range 0..7 need
        // to be considered, hence the entries in the bit-mask can be of type byte.
        //
        // In the bitmask-lookup for each row (= low-nibble) a bit-mask for the
        // high-nibbles (= columns) is created.

        Debug.Assert(Vector128.IsHardwareAccelerated);

        // Perf: the shift needs to be done as Int32, as there exists a hw-instruction and no sw-emulation needs to be done.
        // Cf. https://github.com/dotnet/runtime/issues/75770
        // Due to the Int32-shift we need to mask out any remaining bits from the shifted nibble.
        Vector128<sbyte> highNibbles = Vector128.ShiftRightLogical(values.AsInt32(), 4).AsSByte() & nibbleMaskSByte;
        Vector128<sbyte> lowNibbles  = values & nibbleMaskSByte;

        Vector128<sbyte> bitMask      = Shuffle(bitMaskLookup, lowNibbles);
        Vector128<sbyte> bitPositions = Shuffle(bitPosLookup , highNibbles);

        Vector128<sbyte> mask       = bitPositions & bitMask;
        Vector128<sbyte> comparison = Vector128.Equals(nullMaskSByte, mask);
        comparison                  = TNegator.NegateIfNeeded(comparison);

        return comparison.ExtractMostSignificantBits();
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<sbyte> NarrowWithSaturation(Vector128<short> v0, Vector128<short> v1)
    {
        Debug.Assert(Vector128.IsHardwareAccelerated);

        // TODO: https://github.com/dotnet/runtime/issues/75724

        if (Sse2.IsSupported)
        {
            return Sse2.PackSignedSaturate(v0, v1);
        }
        else
        {
            // This is not the exact algorithm for saturation, but for the use-case
            // here it's does what it should do. I.e. eliminate non-ASCII chars in the
            // results.

            Vector128<short> v0HighNibbles = Vector128.ShiftRightLogical(v0, 8);
            Vector128<short> v1HighNibbles = Vector128.ShiftRightLogical(v1, 8);

            Vector128<short> ascii0 = Vector128.Equals(Vector128<short>.Zero, v0HighNibbles);
            Vector128<short> ascii1 = Vector128.Equals(Vector128<short>.Zero, v1HighNibbles);

            v0 &= ascii0;
            v1 &= ascii1;

            return Vector128.Narrow(v0, v1);
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices)
    {
        // Perf: Ssse3.Shuffle produces better code
        return Sse3.IsSupported
            ? Ssse3.Shuffle(vector, indices)
            : Vector128.Shuffle(vector, indices);
    }

    public interface INegator
    {
        static abstract bool NegateIfNeeded(bool equals);
        static abstract Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals);
    }

    public struct DontNegate : INegator
    {
        public static bool NegateIfNeeded(bool equals) => equals;
        public static Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals) => equals;
    }

    public struct Negate : INegator
    {
        public static bool NegateIfNeeded(bool equals) => !equals;
        public static Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals) => ~equals;
    }

The behavior is as follows:

  • less than 5 chars in the set to search for --> fallback to vectorized implementation from MemoryExtensions
  • if the set to search for contains non-ASCII --> fallback to MemoryExtensions
  • otherwise: emit code that uses either vectorization or a table-based lookup

Right now only char is implemented, but it could be quite easily extended to other (primitive) types.
From ASP.NET Core's use case at least byte in addition would be desireable.

Side note: for ASP.NET Core's validation there is an "extended" case which allows chars > 0x7F (non-ASCII) too. I don't know how hot this method is, and if it is worth to vectorize it, as it's only executed if opted into non-ASCII, which I believe is seldom (maybe wrong here).

Reflection of the generator approach

Preliminary word: due my work for dotnet/aspnetcore#44041 I'm a bit biased towards the generator at the moment, so it may not be the most objective reflection.

Benefits

No startup and runtime overhead

The generator spits out code that is ready to use. All the validation, analysis is already done.

For the lowest overhead, the caller would want to precompute the modified bitmap, but that may not be obvious from the API signature.

There could be a separate method that computes and returns the input to be used.

Hm, here I see similar problems as with "Discoverability" below. At least it seems a bit unwieldy to me, especially if the result of that init-method is a data-structure which needs to be kept around (and alive).

Performance

The current state of the generator emits ROS<bool> which will be read from assembly's data section, and inline Vector128.Create(sbyte, ...) which results in best code-gen.

As the generator has enough time for analysis, more sophisticated searches could be applied for special cases or contiguous sets, which would add more overhead to a runtime-based version. Though one with init-method could amortized that.

Downsides

Discoverability

Not so intuitive and discoverable as just typing IndexOfAny / IndexOfAnyExcept.
Remedy: there could be an analyzer to suggest the use of the generator iff > 5 chars and all are ASCII

Use in Regex generator

As alluded in previous comments the regex generator would be a principal beneficiary of faster IndexOfAnyXyz-variants.

In a quick test (.NET 7 daily build from today, VS 17.3.4) it didn't work that one generator emits code that needs another generator (which I hoped it worked).
I'm not aware if this should be possible or if there are any measures to enable such a scenario -- though I didn't follow along on that topic.

If this is a show-stopper then the generator-based approach has to be abandoned.

Only for contant sets

I don't know if this is a real downside, as most if not all (?) sets to check are known at development-time.

More maintenance work

Updating the implementation in MemoryExtensions is cheaper than maintaining a source generator.

Could be provided w/o generator too

As cited above a init-method could be used to create a data-structe that contains the relevant pieces and that is handed to an inbox-implementation.
The emitted code from the generator is no magick, more or less one entry-method (for each scalar and vectorized) to which the correct mask is handed over and that's it.

Summary

The generator maybe like taking a sledgehammer to crack a nut, but as it has more time to analyse the given "pattern" it can emit more ideal code. So (at the moment) a generator-based approach is what I'm leaning towards.

Major pain-point is on how to align with the use in the regex-generator.

Notes

  • The generator is far from ideal, and there are some TODOs left which (I) need to address.

  • The sets to search are given by a (constant) string, but I could images to extend that with a regex-like character-group/range. I.e. to write the example for validation the authority could be given as [A-Za-z0-9:.-[]@].
    (although that may be overkill...)

  • The emitted helper-methods (see above under the details-toggle) could be provided inbox, so that they don't need to be emitted each time. But I think similar was discussed for the Regex-generator...same could / should apply here too.

  • Perf from the generator looks like, so the threshould of 5 in MemoryExtensions should be re-evaluated


PS: Now I'm pretty sure to have missed a valid point, as I didn't take notes while developing the generator...mea culpa.

@stephentoub
Copy link
Member Author

stephentoub commented Sep 19, 2022

Thanks for all your thinking / work on this!

Use in Regex generator

Yes, generators currently can't depend on other generators. I don't know if / when / how that will change, but it's definitely a known limitation.

As cited above a init-method could be used to create a data-structe that contains the relevant pieces and that is handed to an inbox-implementation.

If we were going to utilize a generator as part of a solution here, this is the path I'd want to take: a usable in-box implementation, and then a generator that makes it easier to produce the right inputs and call that helper but that doesn't require a generator to do so.

I don't know if this is a real downside, as most if not all (?) sets to check are known at development-time.

It is. For example, RegexOptions.Compiled should be able to use this as well.

Downsides

A huge downside for me is I'd really, really, really like to avoid spitting use of unsafe code / VectorXx into user assemblies. It's a large amount of code, likely processing untrusted input, and I think we should strongly err on the side of having our generators avoid such output whenever possible. In some situations it might be unavoidable, but I don't believe this is one of those cases.

@gfoidl
Copy link
Member

gfoidl commented Sep 19, 2022

Yep, thanks for your thoughts on this. So I'll try to create an inbox-solution (after a little break 😉).

@gfoidl
Copy link
Member

gfoidl commented Sep 19, 2022

Idea for pure inbox-solution with an init-method:

API

(please don't mind the naming, my brain isn't creative enough for this today)

public static class MemoryExtensions
{
    // ...
    public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan<char> values);

    public readonly struct IndexOfAnyInitData
    {
        public bool IsAllAscii       { get; }
        public UInt128 Lookup        { get; }
        public Vector128<sbyte> Mask { get; }

        public IndexOfAnyInitData(bool isAllAscii, UInt128 lookup, Vector128<sbyte> mask);
    }
}

Usage example

private const string AlphaNumeric        = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
private const string AuthoritySpecific   = ":.-[]@";
private const string ValidAuthorityChars = AlphaNumeric + AuthoritySpecific;

private static readonly MemoryExtension.IndexOfAnyInitData s_authorityInitData = MemoryExtension.IndexOfAnyInitialize(ValidAuthorityChars);

public bool IsAuthoryValid(ReadOnlySpan<char> authoritiy) => authority.IndexOfAnyExcept(ValidAuthorityChars, s_authorityInitData) < 0;

Perf-number are the same as above (it's the same worker-code).

Implementation

Workhorse-code is the same above emitted from the generator. Here for reference / completeness.

using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

namespace StrippedCoreLib;

public static class MemoryExtensions
{
    public readonly struct IndexOfAnyInitData
    {
        public bool IsAllAscii       { get; }
        public UInt128 Lookup        { get; }
        public Vector128<sbyte> Mask { get; }

        public IndexOfAnyInitData(bool isAllAscii, UInt128 lookup, Vector128<sbyte> mask)
        {
            this.IsAllAscii = isAllAscii;
            this.Lookup     = lookup;
            this.Mask       = mask;
        }
    }

    public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan<char> values)
    {
        if (values.Length <= 5)
        {
            return default;
        }

        foreach (char c in values)
        {
            if (!char.IsAscii(c))
            {
                return default;
            }
        }

        UInt128 lookup = 0;
        sbyte[] mask   = new sbyte[128 / 8];

        foreach (char c in values)
        {
            SetBitInMask(ref lookup, mask, c);
        }

        Vector128<sbyte> vecMask = Vector128.LoadUnsafe(ref MemoryMarshal.GetArrayDataReference(mask));
        return new IndexOfAnyInitData(true, lookup, vecMask);

        static void SetBitInMask(ref UInt128 lookup, sbyte[] mask, int c)
        {
            Debug.Assert(c < 128);

            lookup |= (1UL << c);

            int highNibble = c >> 4;
            int lowNibble  = c & 0xF;

            mask[lowNibble] |= (sbyte)(1 << highNibble);
        }
    }

    public static int IndexOfAny(this ReadOnlySpan<char> span, ReadOnlySpan<char> values, IndexOfAnyInitData initData)
    {
        if (values.Length <= 5 || !initData.IsAllAscii)
        {
            return span.IndexOfAny(values);
        }

        return Vector128.IsHardwareAccelerated && span.Length >= Vector128<sbyte>.Count
            ? IndexOfMatchVectorized<Negate>(span, initData.Mask)
            : IndexOfMatchScalar<DontNegate>(span, initData.Lookup);
    }

    public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, ReadOnlySpan<char> values, IndexOfAnyInitData initData)
    {
        if (values.Length <= 5 || !initData.IsAllAscii)
        {
            return span.IndexOfAnyExcept(values);
        }

        return Vector128.IsHardwareAccelerated && span.Length >= Vector128<sbyte>.Count
            ? IndexOfMatchVectorized<DontNegate>(span, initData.Mask)
            : IndexOfMatchScalar<Negate>(span, initData.Lookup);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static int IndexOfMatchScalar<TNegator>(ReadOnlySpan<char> value, UInt128 lookup)
        where TNegator : struct, INegator
    {
        for (int i = 0; i < value.Length; ++i)
        {
            char c     = value[i];
            bool match = ((1UL << c) & lookup) != 0;
            match      = TNegator.NegateIfNeeded(match);

            if (match)
            {
                return i;
            }
        }

        return -1;
    }

    private static int IndexOfMatchVectorized<TNegator>(ReadOnlySpan<char> value, Vector128<sbyte> bitMaskLookup)
        where TNegator : struct, INegator
    {
        Debug.Assert(Vector128.IsHardwareAccelerated);
        Debug.Assert(value.Length >= Vector128<short>.Count);

        // To check if a bit in a bitmask from the Bitmask is set, in a sequential code
        // we would do ((1 << bitIndex) & bitmask) != 0
        // As there is no hardware instrinic for such a shift, we use a lookup that
        // stores the shifted bitpositions.
        // So (1 << bitIndex) becomes BitPosLook[bitIndex], which is simd-friendly.
        //
        // A bitmask from the bitMaskLookup is created only for values 0..7 (one byte),
        // so to avoid a explicit check for values outside 0..7, i.e.
        // high nibbles 8..F, we use a bitpos that always results in escaping.
        Vector128<sbyte> bitPosLookup = Vector128.Create(
            0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,     // high-nibble 0..7
            0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF      // high-nibble 8..F
        ).AsSByte();

        Vector128<sbyte> nibbleMaskSByte = Vector128.Create((sbyte)0xF);
        Vector128<sbyte> zeroMaskSByte   = Vector128<sbyte>.Zero;

        nuint idx = 0;
        uint mask;
        ref short ptr = ref Unsafe.As<char, short>(ref MemoryMarshal.GetReference(value));

        if (value.Length >= 2 * Vector128<short>.Count)
        {
            nuint end = (uint)(value.Length - 2 * Vector128<short>.Count);

            do
            {
                Vector128<short> source0 = Vector128.LoadUnsafe(ref ptr, idx);
                Vector128<short> source1 = Vector128.LoadUnsafe(ref ptr, idx + 8);
                Vector128<sbyte> values  = NarrowWithSaturation(source0, source1);

                mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
                if (mask != 0)
                {
                    goto Found;
                }

                idx += 2 * (uint)Vector128<short>.Count;
            }
            while (idx <= end);
        }

        // Here we know that 8 to 15 chars are remaining. Process the first 8 chars.
        if (idx <= (uint)(value.Length - Vector128<short>.Count))
        {
            Vector128<short> source = Vector128.LoadUnsafe(ref ptr, idx);
            Vector128<sbyte> values = NarrowWithSaturation(source, source);

            mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
            if (mask != 0)
            {
                goto Found;
            }

            idx += (uint)Vector128<short>.Count;
        }

        // Here we know that < 8 chars are remaining. We shift the space around to process
        // another full vector.
        nuint remaining = (uint)value.Length - idx;
        if ((nint)remaining > 0)
        {
            remaining -= (uint)Vector128<short>.Count;

            Vector128<short> source = Vector128.LoadUnsafe(ref ptr, idx + remaining);
            Vector128<sbyte> values = NarrowWithSaturation(source, source);

            mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
            if (mask != 0)
            {
                idx += remaining;
                goto Found;
            }
        }

        goto NotFound;

    Found:
        idx += GetIndexOfFirstNeedToEscape(mask);
        return (int)idx;

    NotFound:
        return -1;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint GetIndexOfFirstNeedToEscape(uint mask)
    {
        // Found at least one byte that needs to be escaped, figure out the index of
        // the first one found that needs to be escaped within the 16 bytes.
        Debug.Assert(mask > 0 && mask <= 65_535);
        uint tzc = uint.TrailingZeroCount(mask);
        Debug.Assert(tzc < 16);

        return tzc;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint CreateEscapingMask<TNegator>(
        Vector128<sbyte> values,
        Vector128<sbyte> bitMaskLookup,
        Vector128<sbyte> bitPosLookup,
        Vector128<sbyte> nibbleMaskSByte,
        Vector128<sbyte> nullMaskSByte)
        where TNegator : struct, INegator
    {
        // To check if an input byte matches or not, we use a bitmask-lookup.
        // Therefore we split the input byte into the low- and high-nibble, which will get
        // the row-/column-index in the bit-mask.
        // The bitmask-lookup looks like:
        //                                     high-nibble
        // low-nibble  0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
        //         0   1   1   0   0   0   0   1   0   1   1   1   1   1   1   1   1
        //         1   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         2   1   1   1   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         3   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         4   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         5   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         6   1   1   1   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         7   1   1   1   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         8   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         9   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         A   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         B   1   1   1   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         C   1   1   0   1   0   1   0   0   1   1   1   1   1   1   1   1
        //         D   1   1   0   0   0   0   0   0   1   1   1   1   1   1   1   1
        //         E   1   1   0   1   0   0   0   0   1   1   1   1   1   1   1   1
        //         F   1   1   0   0   0   0   0   1   1   1   1   1   1   1   1   1
        //
        // where 1 denotes a matach, while 0 means no match.
        // For high-nibbles in the range 8..F every input needs to be escaped, so we
        // can omit them in the bit-mask, thus only high-nibbles in the range 0..7 need
        // to be considered, hence the entries in the bit-mask can be of type byte.
        //
        // In the bitmask-lookup for each row (= low-nibble) a bit-mask for the
        // high-nibbles (= columns) is created.

        Debug.Assert(Vector128.IsHardwareAccelerated);

        // Perf: the shift needs to be done as Int32, as there exists a hw-instruction and no sw-emulation needs to be done.
        // Cf. https://github.com/dotnet/runtime/issues/75770
        // Due to the Int32-shift we need to mask out any remaining bits from the shifted nibble.
        Vector128<sbyte> highNibbles = Vector128.ShiftRightLogical(values.AsInt32(), 4).AsSByte() & nibbleMaskSByte;
        Vector128<sbyte> lowNibbles  = values & nibbleMaskSByte;

        Vector128<sbyte> bitMask      = Shuffle(bitMaskLookup, lowNibbles);
        Vector128<sbyte> bitPositions = Shuffle(bitPosLookup, highNibbles);

        Vector128<sbyte> mask       = bitPositions & bitMask;
        Vector128<sbyte> comparison = Vector128.Equals(nullMaskSByte, mask);
        comparison                  = TNegator.NegateIfNeeded(comparison);

        return comparison.ExtractMostSignificantBits();
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<sbyte> NarrowWithSaturation(Vector128<short> v0, Vector128<short> v1)
    {
        Debug.Assert(Vector128.IsHardwareAccelerated);

        // TODO: https://github.com/dotnet/runtime/issues/75724

        if (Sse2.IsSupported)
        {
            return Sse2.PackSignedSaturate(v0, v1);
        }
        else
        {
            // This is not the exact algorithm for saturation, but for the use-case
            // here it's does what it should do. I.e. eliminate non-ASCII chars in the
            // results.

            Vector128<short> v0HighNibbles = Vector128.ShiftRightLogical(v0, 8);
            Vector128<short> v1HighNibbles = Vector128.ShiftRightLogical(v1, 8);

            Vector128<short> ascii0 = Vector128.Equals(Vector128<short>.Zero, v0HighNibbles);
            Vector128<short> ascii1 = Vector128.Equals(Vector128<short>.Zero, v1HighNibbles);

            v0 &= ascii0;
            v1 &= ascii1;

            return Vector128.Narrow(v0, v1);
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices)
    {
        // Perf: Ssse3.Shuffle produces better code
        return Sse3.IsSupported
            ? Ssse3    .Shuffle(vector, indices)
            : Vector128.Shuffle(vector, indices);
    }

    public interface INegator
    {
        static abstract bool NegateIfNeeded(bool equals);
        static abstract Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals);
    }

    public struct DontNegate : INegator
    {
        public static bool NegateIfNeeded(bool equals) => equals;
        public static Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals) => equals;
    }

    public struct Negate : INegator
    {
        public static bool NegateIfNeeded(bool equals) => !equals;
        public static Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals) => ~equals;
    }
}

Notes

If an init-method is used to "prepare" IndexOfAny{Except} so all necessary data should be returned. Thus instead of plain UInt128 a bit more info is in the IndexOfAnyInitData:

  • IsAllAscii needed check if the vectorized approach can be used or if a fallback to IndexOfAny is needed -- I think this flag should be passed to the fallback too, in order to avoid the try for the probabilistic map
  • Lookup for the table lookup if span is too short to vectorized -- maybe we don't need this as if span is too short to vectorize just fallback to scalar-scan there?
  • Mask the bitmask for the vectorized approach

@stephentoub
Copy link
Member Author

Thanks, @gfoidl!

lookup |= (1UL << c);

Is that actually doing the "right thing" with UInt128? I'd have expected this to produce the ulong, which will overflow after 64 bits, and then the result of that is what'll be or'd in.

IsAllAscii

This feels like extra complication we could instead address just by naming, e.g. putting this on the new Ascii class, or having Ascii in the method name like IndexOfAnyAscii / IndexOfAnyExceptAscii. If you passed a non-ASCII value to the initialize method, it would throw. My expectation is the 99% use case for this method is a known set of input characters, and if you know any isn't ASCII, then don't use this method.

Vector128<sbyte>

Why sbyte rather than byte?

UInt128 Lookup

Does this significantly help? If the only thing you had was a Vector128<byte> produced by the init method, what would the implementation do differently? I'm trying to simplify this, and wondering if we could get to the point where the init method just returns a Vector128<byte>, IndexOfAnyAscii just takes a Vector128<byte>, and the scheme for populating that Vector128<byte> is documented so you wouldn't even need the InitializeIndexOfAnyAscii method if you didn't want it.

@gfoidl
Copy link
Member

gfoidl commented Sep 21, 2022

UInt128 Lookup

If the only thing you had was a Vector128<byte>

In the implementation (last comment) there is a difference between the scalar- and vectorized-path as perf-optimization.

Scalar is basically bit-test ((1 << c) & lookup) != 0. There's no intrinsic to do such a thing vectorized, so the approach there is to split the byte into nibbles, then use these nibbles as shuffle-mask for the lookup and bit-test.

The scalar-path could to the same nibble-splitting to get rid of the UInt128-mask, but I guess this decreases perf. The question is that it matter?
Talking about {byte, char} vectorization kicks in when span.Length >= {16, 8}, so I think it matters but it should be measured to have numbers for this.

init method just returns a Vector128<byte>, IndexOfAnyAscii just takes a Vector128<byte>, and the scheme for populating that Vector128<byte> is documented

I think to understand your motivation for this, but I see these drawbacks:

  • Implementation tied to the API-surface
    Especially if the scheme for populating the mask is documented and part of the public API, then we can't change the implementation in a easy way.

  • Can't use more sophisticated algorithms
    Say we find a clever way to handle special set-chars in more performant way that doesn't use Vector128<byte> as mask but something different. That can't be expressed with such an API.

For this API I imagine that based on the shape of set-chars different algorithms could be applied. E.g. the set consists of a few numbers, or control-characters, etc. There may be more specific ways to handle that than the "generic" algorithm based on nibble-splitting and the lookup.

Thus for this iteration of my thinking about this the API should look like:

public static class MemoryExtensions
{
    public abstract class IndexOfAnyInitData { }

    public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan<char> values);
}

So we gain the freedom to return for IndexOfAnyInitData whatever seems to be the best fit for the given values.

Downside is that IndexOfAnyInitData needs to be allocated. For storing in a static readonly not really a problem, but maybe for RegexOptions.Compiled not as easy as passing a value-type around?!

An implementation could look like then:

private sealed class BitMaskIndexOfAnyInitData : IndexOfAnyInitData
{
    public UInt128 Lookup        { get; init; }
    public Vector128<sbyte> Mask { get; init; }
}

public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan<char> values)
{
    foreach (char c in values)
    {
        if (!char.IsAscii(c))
        {
            throw new InvalidOperationException("Non ASCII here");
        }
    }

    // Analyze the values here and determine the best algorithm to use.
    // Return the init-data specific to that algorithm.
    // E.g.
    return new BitMaskIndexOfAnyInitData { Lookup = lookup, Mask = vecMask };
}

public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyInitData initData)
{
    // Switch on the initData to select the proper algorithm

    if (initData is BitMaskIndexOfAnyInitData bitMaskIndexOfAnyInitData)
    {
        return Vector128.IsHardwareAccelerated && span.Length >= Vector128<sbyte>.Count
            ? BitMaskIndexOfMatchVectorized<Negate>(span, bitMaskIndexOfAnyInitData.Mask)
            : BitMaskIndexOfMatchScalar<DontNegate>(span, bitMaskIndexOfAnyInitData.Lookup);
    }

    throw new NotSupportedException("At the moment only the bitmask-appraoch is supported");
}

public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyInitData initData)
{
    // Analogous to above method
}

Why sbyte rather than byte?

The code only works for ASCII-sets, so sbyte as the >= 0 values are the ASCII-range.
Though from an API point of view it doesn't really matter, as Vector128<byte> can be re-interpreted as Vector128<sbyte> easily.
sbyte is mainly motivated from the implementation which works on Vector128<sbyte>.
I have no strong stance here, and whatever seems better for the API I'm fine with.

If you passed a non-ASCII value to the initialize method, it would throw. My expectation is the 99% use case for this method is a known set of input characters, and if you know any isn't ASCII, then don't use this method.

👍
I'd expect the same.

Now I also think this method shouldn't fallback to the current IndexOfAny-methods. In the implementation from last comment it would fallback if <= 5 set-chars or if not all are ASCII.
For the ASCII I like what you suggested (just throw from init in case of non-ASCII encountered).

Don't fallback for the case of <= 5 set-chars provides another choice for the consumer which API to use which can be based on custom benchmarks.
In any way we should reconsider the threshould of 5 set-chars. And / or maybe also move the probabilistic map initialization into the init-method here (with its own IndexOfAnyInitData derived-type).

lookup |= (1UL << c);

Oh, you are right. I though to have checked that (can't remember what I tested actually).

UInt128 a = 0;
UInt128 b = UInt128.MaxValue;
UInt128 c = (1UL << 127);
UInt128 d = (UInt128.One << 127);

Print(a);
Print(b);
Print(c);
Print(d);

static void Print(UInt128 value) => Console.WriteLine($"0x{value:X32}");
0x00000000000000000000000000000000
0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

0x00000000000000008000000000000000
                  ^ (overflowed)
0x80000000000000000000000000000000
  ^ (correct)

@stephentoub
Copy link
Member Author

stephentoub commented Sep 21, 2022

UInt128 Lookup

If the only thing you had was a Vector128

In the implementation (last comment) there is a difference between the scalar- and vectorized-path as perf-optimization.

Sure, but the UInt128 and the Vector128<byte> are just two different representations of the same information. How expensive is it to convert from one to the other? My question/suggestion was basically whether we could get away with the method just taking one of them, and if the other was needed, convert it once in IndexOfAnyAscii. For example, just always take a UInt128 with the relevant bits set, and then create a Vector128<byte> from it if there are enough items being searched to warrant vectorization... what's the overhead of that? If it's significant, obviously we don't want to do that. If it's a few instructions, the simplification could be worthwhile, especially since at that point we probably wouldn't need the initialization method at all: anyone who wanted to call this could either hardcode the UInt128 value from having precomputed it, or write the two-line loop themselves:

UInt128 lookup = 0;
foreach (char c in "...") lookup |= (UInt128.One << c);

I'm mainly just trying to understand / explore options at this point. Note that I care about overheads for small inputs and also vectorized larger inputs when the match is found earlier; I care a lot less about overheads when Vector128 isn't hardware accelerated.

Downside is that IndexOfAnyInitData needs to be allocated. For storing in a static readonly not really a problem

It could be. Not that it's impossible to do, but it gets awkward. For example, I'd like to be able to use this in source generated regexes. If this were just relevant to finding the next place a regex could match, we'd have only one of them. But we use {Last}IndexOf{Any} in a bunch of places, e.g. if there's a set after a (lazy) loop, we'll use {Last}IndexOf{Any} to optimize the backtracking, so with this API we'd end up needing to store one of these per such construct.

I also have a visceral negative reaction to a source generator not being able to compute everything it needs at compile time and instead defer work to startup time. (Even if it's so cheap it's unfounded, and even if solutions like NativeAOT's ability to execute static cctors at compile time would obviate even those concerns.)

but maybe for RegexOptions.Compiled not as easy as passing a value-type around?!

Right. Compiled uses DynamicMethod: we can't add fields.

I have no strong stance here, and whatever seems better for the API I'm fine with.

I'd prefer byte.

So we gain the freedom to return for IndexOfAnyInitData whatever seems to be the best fit for the given values.

There are certainly benefits to keeping the data opaque such that the IndexOfAnyAscii and its associated initialization method can be evolved. I'd like to better understand the cost of converting the UInt128 simple lookup form into a Vector128<byte> form; if that's super cheap such that doing it on every IndexOfAnyAscii invocation that needs the vector would be barely measurable, I think that's still my preference. If it's more costly, the opaque approach is probably the right one, even though it complicates the source generator scenario.

@gfoidl
Copy link
Member

gfoidl commented Sep 21, 2022

How expensive is it to convert from one to the other?

Quick benchmark for any except of ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:.-[]@ (authority from kestrel).

WithInitData is the approach outlined in #68328 (comment).

WithUInt128 is foreach (char c in "...") lookup |= (UInt128.One << c); with these APIs:

public static UInt128 IndexOfAnyInitialize(ReadOnlySpan<char> values);
public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, UInt128 initData);

so the vectorized version needs to convert the bit-array into the bit-mask needed at invocation.

WithVector is the nibble-splitting (for both vectorized + scalar) with these APIs:

public static Vector128<byte> IndexOfAnyInitialize(ReadOnlySpan<char> values);
public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, Vector128<byte> initData);

This yields interesting results, which I did not expect to be that way (especially that nibble-splitting is so fast).

|       Method |     Authority |       Mean | Ratio |
|------------- |-------------- |-----------:|------:|
| WithInitData |        [fe80] |  11.649 ns |  1.00 |
|  WithUInt128 |        [fe80] |  10.206 ns |  0.89 |
|   WithVector |        [fe80] |   6.602 ns |  0.57 |
|              |               |            |       |
| WithInitData | hostname:8080 |   5.035 ns |  1.00 |
|  WithUInt128 | hostname:8080 | 252.096 ns | 50.07 |
|   WithVector | hostname:8080 |   4.581 ns |  0.91 |

Explanation first set with authority [fe80]:
WithInitData and WithUInt128 have the same algorithm, but WithUInt128 has less overhead, as no check for the correct type of init-data (see comment above).
Shifting UInt128.One << c is more expensive than splitting into nibbles and doing simple bit-operations. Thus WithVector is way faster (also the loop-size is just about a third of the ones which shifts the UInt128).

Explanation second set with authority hostname:8080:
WithInitData and WithVector have the same algorithm, but WithVector has less overhead, as no check for the correct type of init-data (see comment above).
WithUInt128 is really slow, as the bit-mask needed for the vectorized approach needs to be constructed out of the lookup |= (UInt128.One << c) data.


So concluding I'd go with following approach:

public static class MemoryExtensions
{
    public static Vector128<byte> IndexOfAnyInitialize(ReadOnlySpan<char> values);

    public static int IndexOfAny(this ReadOnlySpan<char> span, Vector128<byte> initData);
    public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, Vector128<byte> initData);
}

Questions to this:

Should the the scheme for populating that Vector128<byte> be documented officially?
This would become part of the API-contract, so any change is prohibited. Although when used from a generator is also like setting that schema into stone, as updates to the scheme won't be reflected in binaries that are created from generatored code.
TBH I can't imagine that this scheme will change, as it's simple and performant (which is what I like).

Should the type for init-data be Vector128<byte> or UInt128?
Perf-wise it doesn't matter, they can be reinterpreted to each other as they have the same bit-size.
Stick to Vector128<byte> to put some emphasis that the code relies on vectorization? On the other hand there's no (?) API that exposes vector-types (expect the intrinsics ifself, of course), so it may be more natural to have UInt128.

There are certainly benefits to keeping the data opaque such that the IndexOfAnyAscii and its associated initialization method can be evolved.

In case we see some common patterns of sets that could use a better algorithms, then another overload with an opaque init-data could be added.

With this approach all our points should be satisfied?
(Mine are 😉)

@stephentoub
Copy link
Member Author

stephentoub commented Sep 21, 2022

Thanks!

Quick benchmark

Can you share the benchmark code?

Should the the scheme for populating that Vector128 be documented officially?

I think so. Whether or not we do, any changes to it are breaking, as the value can and will be persisted in binaries. This is why I was hoping the input could be a simple bitmap that the API would then transform as it needs. I'm surprised that transformation takes 250ns, that there's no way to reinterpret a UInt128 into a Vector128<byte> and then do manipulations on that vector to transform it from the simple bitmap into the required shape much faster than that.

Should the type for init-data be Vector128<byte> or UInt128?

I'd prefer UInt128. The code doesn't require vectorization (it'll function fine if Vector128.IsHardwareAccelerated is false and we take a fallback path), and Vector128<T> looks scary.

@teo-tsirpanis
Copy link
Contributor

You said in #69682 (comment):

It might be counterintuitive, but I would like to avoid adding optimizations to the interpreter that aren't also in Compiled and the source generator, and similarly optimizations to Compiled that aren't in the source generator.

Judging from it the Compiled mode should not use it but the interpreted mode could, right?

@stephentoub
Copy link
Member Author

Judging from it the Compiled mode should not use it but the interpreted mode could, right?

Assuming there's a performance improvement to be had and it doesn't too negatively impact first-use, "it" could apply to all of them, where "it" is using the algorithm. Source generator would use a source generated implementation to avoid all startup costs and optimize throughput, Compiled could potentially use an IL emitted implementation in order to optimize throughpuit but could also use a built-in IndexOfAny if that made more sense, and the interpreter could use the built-in IndexOfAny.

@jkotas
Copy link
Member

jkotas commented Oct 27, 2022

If we are not going to use this API as an implementation detail for source generated regexes, we may consider dropping this API and double-down on efficient source-generated regex support with good usability instead.

It would reduce concept count and it would be similar to how some other language environments work. You just specify a regex and let the compiler worry about the details.

@gfoidl
Copy link
Member

gfoidl commented Oct 27, 2022

consider dropping this API

There will be usages outside of regex, too. E.g. in ASP.NET Core (dotnet/aspnetcore#44041).

@stephentoub
Copy link
Member Author

stephentoub commented Oct 27, 2022

If we are not going to use this API as an implementation detail for source generated regexes, we may consider dropping this API and double-down on efficient source-generated regex support with good usability instead.

I think there's some confusion here. The API currently proposed will be used by regex and many other places. What I was saying won't be used by source gen regex is the hypothetical additional method for aho corasick, mentioned here as being a possible future extension, that @teo-tsirpanis is looking at.

@jkotas
Copy link
Member

jkotas commented Oct 27, 2022

What are all Ts and other special cases that we expect to provide an optimized path for in IndexOfAnyValues.Create<T>? It is the kind of method that can turn into a choke point for trimming, keeping number of unreachable codepaths around unnecessarily.

// in the future, could add an `IndexOfAnyValues Create(ReadOnlySpan values) overload

Nit: This overload would be ambiguous with the existing method. Consider var mask = IndexOfAnyValues.Create(new ReadOnlySpan<string>(new string[] { "Hello", "World" }));. I think it would need to be called something else.

@stephentoub
Copy link
Member Author

stephentoub commented Oct 27, 2022

What are all Ts and other special cases that we expect to provide an optimized path for in IndexOfAnyValues.Create?

At least char and byte. If you think it's better to have explicit overloads for char and byte, we could do that, too. Though I'd hope the trimmer would be able to figure out (if not now, eventually) that a method M<T> only called as M<char> could have its byte-based code paths trimmed away.

And for char at least the cases where all inputs are ASCII (in which case it can use the vectorized implementation) or they aren't (in which case it can use the probabilistic map, or other such lookup tables). I'm hopeful in the future we'll discover additional ways to optimize as well, and that can all be done under the covers without additional APIs. We could also special-case inputs that would be satisfied by other public API, e.g. detect that non-ASCII inputs are all in a single range and then delegate to IndexOfAnyInRange, but for such cases I think we should start simple and just say callers should themselves use those APIs rather than having the API try to detect and "fix" it.

@stephentoub
Copy link
Member Author

stephentoub commented Oct 27, 2022

This overload would be ambiguous with the existing method. Consider var mask = IndexOfAnyValues.Create(new ReadOnlySpan(new string[] { "Hello", "World" }));. I think it would need to be called something else.

It wouldn't be ambiguous from the compiler's perspective: if you used IndexOfAnyValues.Create(new ReadOnlySpan<string>(new string[] { "Hello", "World" })), you'd get an IndexOfAnyValues<char>, and if you used IndexOfAnyValues.Create<string>(new ReadOnlySpan<string>(new string[] { "Hello", "World" })), you'd get an IndexOfAnyValues<string>. However, I agree the distinction is subtle and we might want to use a different name (if we even add such a method in the future; it's not part of this proposal). My main point was that the factory pattern lets us produce an IndexOfAnyValues<char> based on different inputs and that the resulting instance can be specialized for that input without incurring additional per-invocation overheads.

@jkotas
Copy link
Member

jkotas commented Oct 27, 2022

At least char and byte. If you think it's better to have explicit overloads for char and byte, we could do that, too.

I keep thinking about this API primarily as a low-level regex helper that saves us from stamping the complex unsafe vectorized code in every source-generated regex. The API can be used directly too, but I expect only a small number of people are going to be successful in figuring when and how to correctly use it to their advantage.

Having the surface limited and specialized to just what regex and parsers need would be fine with me (ie byte and char only). I am not even sure we need methods for this on MemoryExtensions. I think having a method on IndexOfAnyValues<T> type would be good enough.

My main point was that the factory pattern lets us produce an IndexOfAnyValues based on different inputs and that the resulting instance can be specialized for that input without incurring additional per-invocation overheads.

How would the IndexOfAnyExcept work with the aho corasick? I am not sure what the sensible behavior would be. And if there is a sensible behavior, would it mean that we would need to keep multiple state machines around for different IndexOf variants?

@stephentoub
Copy link
Member Author

stephentoub commented Oct 27, 2022

I keep thinking about this API primarily as a low-level regex helper that saves us from stamping the complex unsafe vectorized code in every source-generated regex.

Why do you believe that's the case? It'll certainly be used by regex, but most places currently doing an IndexOfAny with more than 5 characters can benefit from this (and others that aren't using IndexOfAny because they could do something faster based on precomputing a bitmap). Some examples from a quick search:

expect only a small number of people are going to be successful in figuring when and how to correctly use it to their advantage.

I'm surprised by that. Plenty of folks figured out how to write:

private static readonly char[] s_chars = new char[] { ... };
...
int index = something.IndexOfAny(s_chars);

This follows the exact same form, just changing the line:

private static readonly char[] s_chars = new char[] { ... };

to instead be:

private static readonly IndexOfAnyValues<char> s_chars = IndexOfAnyValues.Create(new char[] { ... });

and as an overload of IndexOfAny, you're looking in IntelliSense through the overloads, you see one that takes an IndexOfAnyValues, you see it has a Create method, and you substitute an IndexOfAnyValues for your s_chars. We can even have an analyzer/fixer for it.

Having the surface limited and specialized to just what regex and parsers need would be fine with me (ie byte and char only). I am not even sure we need methods for this on MemoryExtensions. I think having a method on IndexOfAnyValues type would be good enough.

I think there's real value having these simply as overloads of the existing IndexOfAny: it's just one more way you can provide inputs to the existing functionality. I'd be fine replacing the generic Create<T>(ReadOnlySpan<T>) with Create(ReadOnlySpan<char>) and Create(ReadOnlySpan<byte>) if you believe that'll be better; I personally like it as a single generic method that accomodates any future optimizations we want for other types, but as I don't have any in mind or that are pressing, we could do without it for now, and I understand the trimming concerns (though I still think we should be thinking about imbuing the trimmer with more smarts around this kind of input/type tracking).

@jkotas
Copy link
Member

jkotas commented Oct 27, 2022

Why do you believe that's the case?

I expect that most people will just use the existing IndexOfAny API that is the most straightforward one to use, and not bother with making their code less readable by outlining the pattern to search for.

private static readonly char[] s_chars = new char[] { ... };

int index = something.IndexOfAny(s_chars);

I expect only a fraction of read-world places to have outlined cache like this. I expect that majority are just going to allocate the array of chars to search for inline since that it is the most straightforward way to do it. Our examples show you to allocate the chars to search for inline too: https://learn.microsoft.com/en-us/dotnet/api/system.string.indexofany?view=net-7.0.

Yes, there are certainly going to be cases where this API will be used directly, especially in our own libraries. I see it as an API for advanced performance-oriented developers only.

@stephentoub
Copy link
Member Author

Our examples show you to allocate the chars to search for inline too

Our examples show an entire program in a single snippet focused on highlighting what output you get for what input. They're not real-world.

I expect only a fraction of read-world places to have outlined cache like this.

I see a lot of it in searching around GitHub. But my searches may be skewed; I asked @marklio for help in getting some data.

Regardless, whether it's a majority or minority, there are lots of places that will be able to benefit from it (as outlined in #68328 (comment)). We have multiple IndexOfAny overloads taking inputs in a variety of forms; this adds just one more, and I think it's the right way to expose this. I think trying to bury it elsewhere actually makes it more confusing, as it means there's now a set of IndexOfAny methods elsewhere that aren't obviously connected and thus force people to think about whether they need to consider the other one; as an overload, it's just another way of providing the same inputs.

@terrajobst
Copy link
Member

terrajobst commented Nov 8, 2022

Video

  • Let's move the types to System.Buffers as they are more advanced
  • Let's not have a generic factory method for now as we're only planning to optimize for byte and char
  • We should make sure that the debugger displays something useful for instances of IndexOfAnyValues<T>
namespace System;

public static partial class MemoryExtensions
{
    public static int IndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
    public static int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
    public static int LastIndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
    public static int LastIndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
}
namespace System.Buffers;

public static class IndexOfAnyValues
{
    public static IndexOfAnyValues<byte> Create(ReadOnlySpan<byte> values);
    public static IndexOfAnyValues<char> Create(ReadOnlySpan<char> values);
}

public class IndexOfAnyValues<T> where T : IEquatable<T>?
{
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Nov 8, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Nov 9, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Nov 21, 2022
@gfoidl
Copy link
Member

gfoidl commented Nov 22, 2022

This was an interesting journey. Thanks for all the beatiful discussion and thoughts on this!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants