Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add API IndexNotOf #28795

Closed
benaadams opened this issue Feb 26, 2019 · 35 comments · Fixed by #67941
Closed

Add API IndexNotOf #28795

benaadams opened this issue Feb 26, 2019 · 35 comments · Fixed by #67941
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory
Milestone

Comments

@benaadams
Copy link
Member

benaadams commented Feb 26, 2019

public static partial class MemoryExtensions
{
    int IndexNotOf(this Span<T> span, T value) where T : IEquatable<T>;
    int IndexNotOfAny(this Span<T> span, T value0, T value1) where T : IEquatable<T>;
    int IndexNotOfAny(this Span<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
    int IndexNotOfAny(this Span<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;

    int IndexNotOf(this ReadOnlySpan<T> span, T value) where T : IEquatable<T>;
    int IndexNotOfAny(this ReadOnlySpan<T> span, T value0, T value1) where T : IEquatable<T>;
    int IndexNotOfAny(this ReadOnlySpan<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
    int IndexNotOfAny(this ReadOnlySpan<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;

    int LastIndexNotOf(this Span<T> span, T value) where T : IEquatable<T>;
    int LastIndexNotOfAny(this Span<T> span, T value0, T value1) where T : IEquatable<T>;
    int LastIndexNotOfAny(this Span<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
    int LastIndexNotOfAny(this Span<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;

    int LastIndexNotOf(this ReadOnlySpan<T> span, T value) where T : IEquatable<T>;
    int LastIndexNotOfAny(this ReadOnlySpan<T> span, T value0, T value1) where T : IEquatable<T>;
    int LastIndexNotOfAny(this ReadOnlySpan<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
    int LastIndexNotOfAny(this ReadOnlySpan<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;
}

Example usage:

var firstNonSpace = span.IndexNotOf(' ');
if (firstNonSpace > 0)
{
    return span.Slice(firstNonSpace);
}

return span;
@EgorBo
Copy link
Member

EgorBo commented Feb 26, 2019

As an alternative name - IndexOfAnyBut

@grant-d
Copy link
Contributor

grant-d commented Feb 26, 2019

Or IndexOfAnyExcept

@danmoseley
Copy link
Member

Are these more commonly needed on Span than on String? Since we've got by without them on String since day 1 (which is not an argument against, but it's a datapoint)

@benaadams
Copy link
Member Author

benaadams commented Feb 26, 2019

Are these more commonly needed on Span than on String?

var firstNonSpace = text.AsSpan().IndexNotOf(' ');
if (firstNonSpace > 0)
{
    return text.Substring(firstNonSpace, text.Length - firstNonSpace);
}

return text;

😉

string would also presumably require culture and ignorecase variants? Thought I'd start simple :)

@krwq
Copy link
Member

krwq commented Feb 26, 2019

What about Find which takes a predicate? That should cover this scenario?

@benaadams
Copy link
Member Author

What about Find which takes a predicate?

Can't vectorize that for primitive types

@krwq
Copy link
Member

krwq commented Feb 26, 2019

I feel like we should instead try to look into making some of the delegate scenarios vectorizable but I don't feel super strongly on not having this API (just feel like we will start adding up other APIs like: IndexOfValueInRange IndexOfValueNotInRange soon and that can quickly pollute the framework)

@stephentoub
Copy link
Member

If we add this, we can take advantage of it from our Regex implementation. It now uses IndexOf and could easily use IndexNotOf should it be available. For example, building on @benaadams' example from earlier in the issue, if you had a regex "eat *spaces" that would find "eat" followed by any number of spaces followed "spaces", that would end up being represented in our IR as a multi ("eat") followed by a one loop (" *") followed by another multi ("spaces"), and we'd use the IndexNotOf to implement the one loop.

@benaadams
Copy link
Member Author

@danmosemsft Are these more commonly needed on Span than on String? Since we've got by without them on String since day 1 (which is not an argument against, but it's a datapoint)

For string would could re-implement the following apis with these span versions

partial class String
{
    string Trim();
    string Trim(char trimChar);
    string Trim(params char[] trimChars);
    string TrimEnd();
    string TrimEnd(char trimChar);
    string TrimEnd(params char[] trimChars);
    string TrimStart();
    string TrimStart(char trimChar);
    string TrimStart(params char[] trimChars);
}

@stephentoub
Copy link
Member

For string would could re-implement the following apis with these span versions

We could, but I'm skeptical we'd want to... in my experience the number of spaces being trimmed is usually few to none, which could make the overhead here be very pronounced. Also, the trim methods that don't take characters use IsWhitespace and would need to compare against every Unicode space character.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@danmoseley
Copy link
Member

Would we want all the variety of overloads we have for IndexOf, etc? That would give us

public static int IndexNotOf<T>(this ReadOnlySpan<T> span, T value) where T : IEquatable<T>;
public static int IndexNotOf<T>(this Span<T> span, T value) where T : IEquatable<T>;
public static int IndexNotOf<T>(this Span<T> span, ReadOnlySpan<T> value) where T : IEquatable<T>;
public static int IndexNotOf<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> value) where T : IEquatable<T>;
public static int IndexNotOf(this ReadOnlySpan<char> span, ReadOnlySpan<char> value, StringComparison comparisonType);
public static int IndexNotOfAny<T>(this Span<T> span, T value0, T value1) where T : IEquatable<T>;
public static int IndexNotOfAny<T>(this Span<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
public static int IndexNotOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
public static int IndexNotOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1) where T : IEquatable<T>;
public static int IndexNotOfAny<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;
public static int IndexNotOfAny<T>(this Span<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;
public static int LastIndexNotOf<T>(this ReadOnlySpan<T> span, T value) where T : IEquatable<T>;
public static int LastIndexNotOf(this ReadOnlySpan<char> span, ReadOnlySpan<char> value, StringComparison comparisonType);
public static int LastIndexNotOf<T>(this Span<T> span, T value) where T : IEquatable<T>;
public static int LastIndexNotOf<T>(this Span<T> span, ReadOnlySpan<T> value) where T : IEquatable<T>;
public static int LastIndexNotOf<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> value) where T : IEquatable<T>;
public static int LastIndexNotOfAny<T>(this Span<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
public static int LastIndexNotOfAny<T>(this Span<T> span, T value0, T value1) where T : IEquatable<T>;
public static int LastIndexNotOfAny<T>(this Span<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;
public static int LastIndexNotOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
public static int LastIndexNotOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1) where T : IEquatable<T>;
public static int LastIndexNotOfAny<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;

@GrabYourPitchforks GrabYourPitchforks modified the milestones: 5.0.0, Future Jul 28, 2020
@FiniteReality
Copy link

I'm implementing JsonPatch using System.Text.Json (the existing ASP.Net Core JsonPatch support is based on Newtonsoft) and this would be incredibly useful to parse array indexes as part of implementing JSON Pointer.

The JSON Pointer RFC defines a valid array index as:

   array-index = %x30 / ( %x31-39 *(%x30-39) )
                 ; "0", or digits without a leading "0"

So ideally, I'd love to use this as my validation for that:

var isValidArrayIndex =
    span.SequenceEqual(Zero) ||
    (span.IndexOfAny(OneThroughNine) == 0 &&
        span.IndexNotOfAny(Digits) == -1);

@danmoseley
Copy link
Member

Thanks for the info. Perhaps someone wants to share thoughts about whether it should be the full set of overloads or not. Then perhaps this is ready for API review.

@benaadams
Copy link
Member Author

Don't think the StringComparison ones are needed unless someone asks specifically; ordinal should be fine as normally looking for a special delimiter (spaces, slashes, brackets, etc)

IndexNotOf(this ReadOnlySpan<char> span, ReadOnlySpan<char> value, StringComparison comparisonType)

@danmoseley
Copy link
Member

Are there good scenarios for LastIndexNotOfXXX if we wouldn't use them for TrimXXX ?

In the top proposal ther'es a mixture of Span and ReadOnlySpan maybe we could clean up which we need.

@benaadams
Copy link
Member Author

Span and ReadOnly span are both needed as they are extension methods; e.g. string only converts to ReadOnlySpan but array would convert to Span.

Don't have any strong feelings for LastIndexNotOf... dropped from summary

@benaadams
Copy link
Member Author

IndexNotOf<T>(this Span<T> span, ReadOnlySpan<T> value) where T : IEquatable<T> would be a bit strange as is a multichar delimiter

@benaadams
Copy link
Member Author

Another example where it could be used

foreach (char c in path)
{
if (c != ' ')
return false;
}
return true;

return path.IndexNotOf(' ') < 0

@krwq
Copy link
Member

krwq commented Oct 19, 2020

@benaadams for single char/element scenario perhaps might be better to try to optimize Linq to do that fast (perhaps detect patterns similar to path.Any(c => c == ' ')).

@benaadams
Copy link
Member Author

It's a not pattern so would be

path.All(c => c == ' ')

However having Linq autovectorize from a IEnumerable and indirect Func would be a challenge?

public static bool All<char>(this IEnumerable<char> source, Func<char, bool> predicate)

Could change the predicate to be Expression<Func<T>> and then parse it (which would be a big task in itself); however the IEnumerable doesn't imply contiguous data for vectorization.

Span/ReadOnlySpan are also ref structs so can't convert to IEnumerable; so would need to add them explicitly; as which point its just changing namespace from MemoryExtensions where all the Span/ReadOnlySpan extensions currently are to Linq?

Some kind of Linq-style autovectorization for Span would be interesting via C# source generators where it converts the expressions to platform intrinsics e.g.

public static bool All<TSource>(this ReadOnlySpan<TSource> source, Expression<Func<TSource, bool>> predicate)
    where TSource : unmanaged

but it's probably also a multi year research project?

@danmoseley
Copy link
Member

Span and ReadOnly span are both needed as they are extension methods; e.g. string only converts to ReadOnlySpan but array would convert to Span.

Yeah, I was confused because you originally had ROS for LastIndexOfAny, and not IndexOfAny. 😄

It looks good to me -- @GrabYourPitchforks ?

@GrabYourPitchforks
Copy link
Member

Ben's original proposal (plus their Last* counterparts) looks ready to go forward to review. We shouldn't provide StringComparison overloads since we're not comparing a sequence of elements against a sequence of elements. Instead, we're looking at each element in the source sequence in isolation without regard to any adjacent elements in the source sequence. By definition this means we're not providing a linguistic text search.

Steve already provided reasons for not rewriting string.Trim atop these newly-introduced APIs. I agree with his reasoning. I would even go further and suggest that we not vectorize the implementations of these newly-introduced methods, as I believe the use cases for these methods (much like the use cases for Trim) will terminate very early and won't benefit from vectorization.

@benaadams
Copy link
Member Author

Added the Last ones back

@danmoseley
Copy link
Member

suggest that we not vectorize the implementations of these newly-introduced methods,

@stephentoub when you mentioned taking advantage in Regex, were you hoping they'd be faster than the naive loop it currently does? Unfortunately the number of characters that the regex might scan could be anything from one to megabytes. For regex, a vectorized approach might be a better tradeoff across all patterns/texts.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Oct 19, 2020

FWIW the implementation nit I mentioned above doesn't have to be resolved before API review. We can always create the appropriate implementation as we become aware of various scenarios.

The only real requirement from an SDL perspective is that given this:

int idx = source.IndexNotOfAny(values);

The runtime complexity of this method must be bound by O(idx * values.Length). (Unless idx = -1, at which point the runtime complexity may be bound by O(source.Length * values.Length)). Vectorization vs. non-vectorization simply changes the constant factor, but it won't change the runtime complexity. IndexOfAny's current implementation follows the same constraint. The reason for this is that if you call it in a loop, you want the loop's total complexity to be bound by O(source.Length * values.Length), not O(source.Length^2 * values.Length), which could allow algorithmic complexity attacks.

@stephentoub
Copy link
Member

when you mentioned taking advantage in Regex, were you hoping they'd be faster than the naive loop it currently does?

There are two benefits for regex:

  1. Simpler code getting generated whenever the relevant pattern comes up, i.e. being able to emit a method call rather than open-coding it each time it occurs, with that method call performing at least as well as the open-coded loop would have
  2. Having that method call perform better than the open-coded loop would have

Just the first is a reasonable place to start. If that can be done, it's worth using in regex. Then if the implementation can be made even faster, yay, regex just gets better implicitly.

I do agree with Levi, though; I think it's likely that even in the regex scenarios we'd find in common cases the loops to have very short iteration counts. But it'd be worth looking at what the opportunities are for using the method and then looking at common patterns and inputs to see what we think the win would actually be.

@benaadams
Copy link
Member Author

Implementation details... ;)

Sounds like is ready to mark for review?

@GrabYourPitchforks GrabYourPitchforks added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Oct 19, 2020
@GrabYourPitchforks
Copy link
Member

Sounds like is ready to mark for review?

Marked. :)

@terrajobst
Copy link
Member

terrajobst commented Oct 27, 2020

Video

  • These APIs seem fairly specialized; it feels this might open up a large amount of potential new APIs due to combinations
  • These seem to live in user code
  • So unless there is more evidence that suggests this API is useful
namespace System
{
    public static partial class MemoryExtensions
    {
        int IndexNotOf(this Span<T> span, T value) where T : IEquatable<T>;
        int IndexNotOfAny(this Span<T> span, T value0, T value1) where T : IEquatable<T>;
        int IndexNotOfAny(this Span<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
        int IndexNotOfAny(this Span<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;

        int IndexNotOf(this ReadOnlySpan<T> span, T value) where T : IEquatable<T>;
        int IndexNotOfAny(this ReadOnlySpan<T> span, T value0, T value1) where T : IEquatable<T>;
        int IndexNotOfAny(this ReadOnlySpan<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
        int IndexNotOfAny(this ReadOnlySpan<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;

        int LastIndexNotOf(this Span<T> span, T value) where T : IEquatable<T>;
        int LastIndexNotOfAny(this Span<T> span, T value0, T value1) where T : IEquatable<T>;
        int LastIndexNotOfAny(this Span<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
        int LastIndexNotOfAny(this Span<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;

        int LastIndexNotOf(this ReadOnlySpan<T> span, T value) where T : IEquatable<T>;
        int LastIndexNotOfAny(this ReadOnlySpan<T> span, T value0, T value1) where T : IEquatable<T>;
        int LastIndexNotOfAny(this ReadOnlySpan<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
        int LastIndexNotOfAny(this ReadOnlySpan<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;
    }
}

@stephentoub
Copy link
Member

stephentoub commented Mar 23, 2022

These APIs seem fairly specialized; it feels this might open up a large amount of potential new APIs due to combinations

I think we should reconsider this one. There are a variety of places it's useful, e.g.

  • In bit maps, determining whether the set is empty (IndexNotOf(0) < 0) or full (IndexNotOf(0xFF) < 0), or finding the first element that has a bit set in order to start processing from there (int pos = IndexNotOf(0)), etc.
  • Determining whether a set of bytes has been cleared or represents a default value.
  • In text searching, looking for the first non-space (IndexNotOf(' ')), as in an expression like "\w+ *\w+" to find two words with any number of spaces in between.

@stephentoub stephentoub reopened this Mar 23, 2022
@dotnet dotnet unlocked this conversation Mar 23, 2022
@stephentoub stephentoub modified the milestones: Future, 7.0.0 Mar 23, 2022
@EgorBo
Copy link
Member

EgorBo commented Mar 23, 2022

@stephentoub
Copy link
Member

I just had another case, where I wanted to determine whether some memory was all zero'd. We don't have a great helper for that built-in today, but IndexNotOf(0) would do it.

@adamsitnik
Copy link
Member

In the original proposal, there was 16 methods being proposed. @stephentoub from what you have wrote it seems that a single one would be enough for our current needs. Is that correct? If so, would you like to update the proposal and present it in the API review?

@stephentoub
Copy link
Member

stephentoub commented Mar 31, 2022

In the original proposal, there was 16 methods being proposed. @stephentoub from what you have wrote it seems that a single one would be enough for our current needs. Is that correct?

The methods are:

  • Span vs ReadOnlySpan, crossed with
  • IndexOf vs LastIndexOf, crossed with
  • 1/2/3/any number of values being searched for

We have a general policy with MemoryExtensions that we add both the Span and ReadOnlySpan variants because extensions on ReadOnlySpan don't show up for Span implicitly.

The cited use cases cover both IndexOf and LastIndexOf; it's just a matter of which way you need to search in the input.

So at a minimum, we'd need 4 of the 16.

But without the "any" overloads, you don't have any support for more than one value, and those scenarios certainly exist. For example, there are many regex patterns that begin with something like [^ab] to start matching something that begins with anything other than a or b, and for that we'd use an IndexNotOf that took more than one value.

So I think we'd actually want at least the 8 of the 16.

The other 8 (all of the 2/3 arg overloads).are an optimization we could do without if we were concerned about number of overloads. Leaving them out would however also be inconsistent with IndexOf, which has those overloads, and would have at least a little more overhead... most likely we'd still end up having them in the implementation, just as internal, with the any overloads special-casing those lengths and delegating.

@terrajobst
Copy link
Member

terrajobst commented Apr 12, 2022

Video

  • We'd prefer a less-Yoda like name such as IndexOfAnyExcept which would also allow us to fold the single case into the multiple case.
namespace System;

public static partial class MemoryExtensions
{
    int IndexOfAnyExcept(this Span<T> span, T value) where T : IEquatable<T>;
    int IndexOfAnyExcept(this Span<T> span, T value0, T value1) where T : IEquatable<T>;
    int IndexOfAnyExcept(this Span<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
    int IndexOfAnyExcept(this Span<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;

    int IndexOfAnyExcept(this ReadOnlySpan<T> span, T value) where T : IEquatable<T>;
    int IndexOfAnyExcept(this ReadOnlySpan<T> span, T value0, T value1) where T : IEquatable<T>;
    int IndexOfAnyExcept(this ReadOnlySpan<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
    int IndexOfAnyExcept(this ReadOnlySpan<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;

    int LastIndexOfAnyExcept(this Span<T> span, T value) where T : IEquatable<T>;
    int LastIndexOfAnyExcept(this Span<T> span, T value0, T value1) where T : IEquatable<T>;
    int LastIndexOfAnyExcept(this Span<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
    int LastIndexOfAnyExcept(this Span<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;

    int LastIndexOfAnyExcept(this ReadOnlySpan<T> span, T value) where T : IEquatable<T>;
    int LastIndexOfAnyExcept(this ReadOnlySpan<T> span, T value0, T value1) where T : IEquatable<T>;
    int LastIndexOfAnyExcept(this ReadOnlySpan<T> span, T value0, T value1, T value2) where T : IEquatable<T>;
    int LastIndexOfAnyExcept(this ReadOnlySpan<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>;
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Apr 12, 2022
@stephentoub stephentoub self-assigned this Apr 12, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 13, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 19, 2022
@dotnet dotnet locked as resolved and limited conversation to collaborators May 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory
Projects
None yet
Development

Successfully merging a pull request may close this issue.