Consider adding more IndexOf variations #75175

stephentoub · 2022-09-07T02:11:03Z

Searching is a key aspect of many workloads. Historically, we provided just the variations that everyone needed, e.g. IndexOf. Then we expanded to IndexOfAny with a few arguments. We just recently expanded to IndexOfAnyExcept. But beyond these, our general stance has been "if you need something any more customized, it's easy to just write it yourself". And historically, it has been easy. Let's say I wanted a static int IndexOfRange(ReadOnlySpan<char> span, char lowInclusive, highInclusive); it's straightforward to implement that "in just a few lines":

static int IndexOfRange(ReadOnlySpan<char> span, char lowInclusive, char highInclusive)
{
    uint range = (uint)(highInclusive - lowInclusive);
    for (int i = 0; i < span.Length; i++)
        if ((uint)(span[i] - lowInclusive) <= range)
            return i;

    return -1;
}

Not a big deal. But now, the expectation is such operations will take advantage of vectorization, and that's no longer "just a few lines". That ends up being something more like this (untested):

static int IndexOfRange(ReadOnlySpan<char> span, char lowInclusive, char highInclusive)
{
    if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<ushort>.Count)
    {
        if (Vector256.IsHardwareAccelerated && span.Length >= Vector256<ushort>.Count)
        {
            ref ushort searchSpace = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(span));
            ref ushort current = ref searchSpace;

            Vector256<ushort> lowVector = Vector256.Create((ushort)lowInclusive);
            Vector256<ushort> rangeVector = Vector256.Create((ushort)highInclusive) - lowVector;
            Vector256<ushort> lessThanOrEqual;

            ref ushort endMinusOneVector = ref Unsafe.Add(ref current, span.Length - Vector256<ushort>.Count);
            do
            {
                lessThanOrEqual = Vector256.LessThanOrEqual(Vector256.LoadUnsafe(ref current) - lowVector, rangeVector);
                if (lessThanOrEqual != Vector256<ushort>.Zero)
                {
                    return
                        BitOperations.TrailingZeroCount(lessThanOrEqual.ExtractMostSignificantBits()) +
                        (int)(Unsafe.ByteOffset(ref searchSpace, ref current) / Unsafe.SizeOf<ushort>());
                }

                current = ref Unsafe.Add(ref current, Vector256<ushort>.Count);
            }
            while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

            lessThanOrEqual = Vector256.LessThanOrEqual(Vector256.LoadUnsafe(ref endMinusOneVector) - lowVector, rangeVector);
            if (lessThanOrEqual != Vector256<ushort>.Zero)
            {
                return
                    BitOperations.TrailingZeroCount(lessThanOrEqual.ExtractMostSignificantBits()) +
                    (int)(Unsafe.ByteOffset(ref searchSpace, ref endMinusOneVector) / Unsafe.SizeOf<ushort>());
            }
        }
        else
        {
            ref ushort searchSpace = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(span));
            ref ushort current = ref searchSpace;

            Vector128<ushort> lowVector = Vector128.Create((ushort)lowInclusive);
            Vector128<ushort> rangeVector = Vector128.Create((ushort)highInclusive) - lowVector;
            Vector128<ushort> lessThanOrEqual;

            ref ushort endMinusOneVector = ref Unsafe.Add(ref current, span.Length - Vector128<ushort>.Count);
            do
            {
                lessThanOrEqual = Vector128.LessThanOrEqual(Vector128.LoadUnsafe(ref current) - lowVector, rangeVector);
                if (lessThanOrEqual != Vector128<ushort>.Zero)
                {
                    return
                        BitOperations.TrailingZeroCount(lessThanOrEqual.ExtractMostSignificantBits()) +
                        (int)(Unsafe.ByteOffset(ref searchSpace, ref current) / Unsafe.SizeOf<ushort>());
                }

                current = ref Unsafe.Add(ref current, Vector128<ushort>.Count);
            }
            while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

            lessThanOrEqual = Vector128.LessThanOrEqual(Vector128.LoadUnsafe(ref endMinusOneVector) - lowVector, rangeVector);
            if (lessThanOrEqual != Vector128<ushort>.Zero)
            {
                return
                    BitOperations.TrailingZeroCount(lessThanOrEqual.ExtractMostSignificantBits()) +
                    (int)(Unsafe.ByteOffset(ref searchSpace, ref endMinusOneVector) / Unsafe.SizeOf<ushort>());
            }
        }
    }
    else
    {
        uint range = (uint)(highInclusive - lowInclusive);
        for (int i = 0; i < span.Length; i++)
        {
            if ((uint)(span[i] - lowInclusive) <= range)
            {
                return i;
            }
        }
    }

    return -1;
}

which is closer to ~80 lines of non-trivial code, not something someone just whips up.

In addition to finding ways to reduce the development cost of this (e.g. could we make it easier to share 128/256 code paths via having those types implement an interface and employing generic specialization?), it seems like we should consider lowering our bar a bit for what we consider worthy of inclusion, such that we can provide more such operations with good vectorized implementations.

This could include:

IndexOfRange and IndexOfAnyExceptRange (plus "Last" variations)
IndexOfAsciiLetter, IndexOfAsciiDigit, etc.
IndexOfAny and IndexOfAnyExcept where the set of items is provided via a bitmap (ala Vectorize IndexOfAny on more than 5 chars #68328)
... and probably other variations.

We should:

Audit dotnet/runtime and dotnet/aspnetcore to see where such implementations could be used. For example, System.Text.Json implements its own vectorized IndexOfQuoteOrAnyControlOrBackSlash (which is implemented more generally as searching for one of two specific characters or anything less than a third). And HttpValidationHelpers.cs searches for characters less than something or greater than something else (so basically two ranges instead of one).
Audit the regex corpus to see what patterns we're currently unable to vectorize a starting search for and see what would enable us to fix that.
Audit grep.app, GitHub, various 3rd party appropriately-licensed repos, etc. for the same.
Solicit feedback from everyone reading this about variations that would be useful to them.
Consider which subset should be added and do so.

Also, would it be possible to expose a generalized helper for this, where someone just needs to supply the kernel of the thing being searched for and plug that into the overall algorithm? e.g. an interface with a method accepting the vector to analyze and a scaffolded IndexOf search that uses generic specialization to make it fast to plug in the caller's code.

This concept also extends beyond IndexOf, of course. There are multiple places where we've already vectorized implementations that could be generalized to be used by others, e.g.

BitArray implements vectorized and, or, xor, and not algorithms. Could those be exposed for operating on arbitrary spans of bytes, used by BitArray but then also available for others to consume?
WebSocket implements a vectorized xor with a mask. Would anyone else benefit from that being public?
Enumerable has a vectorized Min/Max/Sum/Average. Could those implementations be exposed on MemoryExtensions to be usable with any span and not just an array/list we can fish out of an enumerable?
String.Replace implements a vectorized replace. Should that be exposed on MemoryExtensions as a generalized Replace on a span?

The text was updated successfully, but these errors were encountered:

ghost · 2022-09-07T02:11:25Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Searching is a key aspect of many workloads. Historically, we provided just the variations that everyone needed, e.g. IndexOf. Then we expanded to IndexOfAny with a few arguments. We just recently expanded to IndexOfAnyExcept. But beyond these, our general stance has been "if you need something any more customized, it's easy to just write it yourself". And historically, it has been easy. Let's say I wanted a static int IndexOfRange(ReadOnlySpan<char> span, char lowInclusive, highInclusive); it's straightforward to implement that "in just a few lines":

static int IndexOfRange(ReadOnlySpan<char> span, char lowInclusive, char highInclusive)
{
    uint range = (uint)(highInclusive - lowInclusive);
    for (int i = 0; i < span.Length; i++)
        if ((uint)(span[i] - lowInclusive) <= range)
            return i;

    return -1;
}

Not a big deal. But now, the expectation is such operations will take advantage of vectorization, and that's no longer "just a few lines". That ends up being something more like this (untested):

static int IndexOfRange(ReadOnlySpan<char> span, char lowInclusive, char highInclusive)
{
    if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<ushort>.Count)
    {
        if (Vector256.IsHardwareAccelerated && span.Length >= Vector256<ushort>.Count)
        {
            ref ushort searchSpace = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(span));
            ref ushort current = ref searchSpace;

            Vector256<ushort> lowVector = Vector256.Create((ushort)lowInclusive);
            Vector256<ushort> rangeVector = Vector256.Create((ushort)highInclusive) - lowVector;
            Vector256<ushort> lessThanOrEqual;

            ref ushort endMinusOneVector = ref Unsafe.Add(ref current, span.Length - Vector256<ushort>.Count);
            do
            {
                lessThanOrEqual = Vector256.LessThanOrEqual(Vector256.LoadUnsafe(ref current) - lowVector, rangeVector);
                if (lessThanOrEqual != Vector256<ushort>.Zero)
                {
                    return
                        BitOperations.TrailingZeroCount(lessThanOrEqual.ExtractMostSignificantBits()) +
                        (int)(Unsafe.ByteOffset(ref searchSpace, ref current) / Unsafe.SizeOf<ushort>());
                }

                current = ref Unsafe.Add(ref current, Vector256<ushort>.Count);
            }
            while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

            lessThanOrEqual = Vector256.LessThanOrEqual(Vector256.LoadUnsafe(ref endMinusOneVector) - lowVector, rangeVector);
            if (lessThanOrEqual != Vector256<ushort>.Zero)
            {
                return
                    BitOperations.TrailingZeroCount(lessThanOrEqual.ExtractMostSignificantBits()) +
                    (int)(Unsafe.ByteOffset(ref searchSpace, ref endMinusOneVector) / Unsafe.SizeOf<ushort>());
            }
        }
        else
        {
            ref ushort searchSpace = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(span));
            ref ushort current = ref searchSpace;

            Vector128<ushort> lowVector = Vector128.Create((ushort)lowInclusive);
            Vector128<ushort> rangeVector = Vector128.Create((ushort)highInclusive) - lowVector;
            Vector128<ushort> lessThanOrEqual;

            ref ushort endMinusOneVector = ref Unsafe.Add(ref current, span.Length - Vector128<ushort>.Count);
            do
            {
                lessThanOrEqual = Vector128.LessThanOrEqual(Vector128.LoadUnsafe(ref current) - lowVector, rangeVector);
                if (lessThanOrEqual != Vector128<ushort>.Zero)
                {
                    return
                        BitOperations.TrailingZeroCount(lessThanOrEqual.ExtractMostSignificantBits()) +
                        (int)(Unsafe.ByteOffset(ref searchSpace, ref current) / Unsafe.SizeOf<ushort>());
                }

                current = ref Unsafe.Add(ref current, Vector128<ushort>.Count);
            }
            while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

            lessThanOrEqual = Vector128.LessThanOrEqual(Vector128.LoadUnsafe(ref endMinusOneVector) - lowVector, rangeVector);
            if (lessThanOrEqual != Vector128<ushort>.Zero)
            {
                return
                    BitOperations.TrailingZeroCount(lessThanOrEqual.ExtractMostSignificantBits()) +
                    (int)(Unsafe.ByteOffset(ref searchSpace, ref endMinusOneVector) / Unsafe.SizeOf<ushort>());
            }
        }
    }
    else
    {
        uint range = (uint)(highInclusive - lowInclusive);
        for (int i = 0; i < span.Length; i++)
        {
            if ((uint)(span[i] - lowInclusive) <= range)
            {
                return i;
            }
        }
    }

    return -1;
}

which is closer to ~80 lines of non-trivial code, not something someone just whips up.

In addition to finding ways to reduce the development cost of this (e.g. could we make it easier to share 128/256 code paths via having those types implement an interface and employing generic specialization?), it seems like we should consider lowering our bar a bit for what we consider worthy of inclusion, such that we can provide more such operations with good vectorized implementations.

This could include:

IndexOfRange and IndexOfAnyExceptRange (plus "Last" variations)
IndexOfAsciiLetter, IndexOfAsciiDigit, etc.
IndexOfAny and IndexOfAnyExcept where the set of items is provided via a bitmap (ala Vectorize IndexOfAny on more than 5 chars #68328)
... and probably other variations.

We should:

Audit dotnet/runtime and dotnet/aspnetcore to see where such implementations could be used. For example, System.Text.Json implements its own vectorized IndexOfQuoteOrAnyControlOrBackSlash (which is implemented more generally as searching for one of two specific characters or anything less than a third).
Audit the regex corpus to see what patterns we're currently unable to vectorize a starting search for and see what would enable us to fix that.
Solicit feedback from everyone reading this about variations that would be useful to them.
Consider which subset should be added and do so.

This concept also extends beyond IndexOf, of course. There are multiple places where we've already vectorized implementations that could be generalized to be used by others, e.g.

BitArray implements vectorized and, or, xor, and not algorithms. Could those be exposed for operating on arbitrary spans of bytes, used by BitArray but then also available for others to consume?
WebSocket implements a vectorized xor with a mask. Would anyone else benefit from that being public?
Enumerable has a vectorized Min/Max/Sum/Average. Could those implementations be exposed on MemoryExtensions to be usable with any span and not just an array/list we can fish out of an enumerable?
String.Replace implements a vectorized replace. Should that be exposed on MemoryExtensions as a generalized Replace on a span?

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Memory`, `tenet-performance`
Milestone:	8.0.0

jkotas · 2022-09-07T02:52:24Z

An alternative to providing number of specialized APIs may be to create a source generator or something like that that produces the vectorized version from the simple form.

jkotas · 2022-09-07T02:54:38Z

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/SimdSynth.pdf

stephentoub · 2022-09-07T03:02:58Z

An alternative to providing number of specialized APIs may be to create a source generator or something like that that produces the vectorized version from the simple form

It's an option; Tanner had a prototype of something similar.

But if it's possible to do that with perfect fidelity, it should be possible to do the same in the compiler/JIT with an auto-vectorizing implementation and save the dev the trouble. And if it's not possible with perfect fidelity and the generated code would be more of a starting point the dev would need to copy and mutate, it's not really addressing the core problem this aims to.

I think pursuing both approaches is reasonable, with built-in implementations for the things we ourselves use and others could as well plus the next layer of variation beyond what we already have, and then a source generator to help with the next layer still. As an example, I'd want to use some of these from the regex source generator and compiler, and it'd be nice to avoid injecting a lot of such unsafe code into the assembly that would also likely be duplicated across many assemblies.

MihaZupan · 2022-10-23T00:01:20Z

IndexOfAsciiLetter, IndexOfAsciiDigit, etc.

Just curious, would these just delegate to #68328, or did you have a different approach in mind?

gfoidl · 2022-10-23T09:57:02Z

Also #76106 can be used (e.g. for IndexOfAsciiDigit).

stephentoub · 2022-10-24T01:26:39Z

IndexOfAsciiLetter, IndexOfAsciiDigit, etc.

Just curious, would these just delegate to #68328, or did you have a different approach in mind?

If there were to be an IndexOfAsciiDigit, my assumption is it would just delegate to IndexOfAnyInRange. If there were to be an IndexOfAsciiLetter, my assumption is it could have a more efficient implementation than delegating to the implementation that supports any combination of ASCII chars, though that would need to be measured, e.g. using an implementation that enabled efficiently searching for two ranges.

But, warranting new APIs for these in particular would necessitate both a) showing significant usage potential and b) significantly better performance potential than just using one of the other "existing" APIs.

stephentoub · 2023-12-26T19:41:01Z

The scope of work covered by this issue is done. We can always add more IndexOf helpers on an as-needed basis.

stephentoub added area-System.Memory tenet-performance Performance related issue labels Sep 7, 2022

stephentoub added this to the 8.0.0 milestone Sep 7, 2022

stephentoub mentioned this issue Sep 9, 2022

[API Proposal]: MemoryExtensions.Count / number of occurrences of a value in a span #59466

Closed

dubiousconst282 mentioned this issue Mar 1, 2023

[API Proposal]: Rename IndexOfAnyValues #82622

Closed

tannergooding added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Aug 14, 2023

tannergooding modified the milestones: 8.0.0, Future Aug 14, 2023

stephentoub closed this as completed Dec 26, 2023

github-actions bot locked and limited conversation to collaborators Jan 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider adding more IndexOf variations #75175

Consider adding more IndexOf variations #75175

stephentoub commented Sep 7, 2022 •

edited

Loading

ghost commented Sep 7, 2022

jkotas commented Sep 7, 2022

jkotas commented Sep 7, 2022

stephentoub commented Sep 7, 2022

MihaZupan commented Oct 23, 2022

gfoidl commented Oct 23, 2022

stephentoub commented Oct 24, 2022

stephentoub commented Dec 26, 2023

Consider adding more IndexOf variations #75175

Consider adding more IndexOf variations #75175

Comments

stephentoub commented Sep 7, 2022 • edited Loading

ghost commented Sep 7, 2022

jkotas commented Sep 7, 2022

jkotas commented Sep 7, 2022

stephentoub commented Sep 7, 2022

MihaZupan commented Oct 23, 2022

gfoidl commented Oct 23, 2022

stephentoub commented Oct 24, 2022

stephentoub commented Dec 26, 2023

stephentoub commented Sep 7, 2022 •

edited

Loading