[API Proposal]: MemoryExtensions.CommonPrefixLength<T> #64271

stephentoub · 2022-01-25T15:47:08Z

Background and motivation

We have MemoryExtensions.SequenceEqual that reports true or false as to whether two inputs contain all the same contents, but we don't have a mechanism for it reporting back the first place a difference occurred. In multiple places we end up open-coding such a loop, e.g.

runtime/src/libraries/System.Text.Json/src/System/Text/Json/Reader/Utf8JsonReader.MultiSegment.cs

Lines 629 to 650 in d910ce3

    
           private int FindMismatch(ReadOnlySpan<byte> span, ReadOnlySpan<byte> literal) 
        
           { 
        
               Debug.Assert(span.Length > 0); 
        
               int indexOfFirstMismatch; 
        
               int minLength = Math.Min(span.Length, literal.Length); 
        
               int i = 0; 
        
               for (; i < minLength; i++) 
        
               { 
        
                   if (span[i] != literal[i]) 
        
                   { 
        
                       break; 
        
                   } 
        
               } 
        
               indexOfFirstMismatch = i; 
        
               Debug.Assert(indexOfFirstMismatch >= 0 && indexOfFirstMismatch < literal.Length); 
        
               return indexOfFirstMismatch; 
        
           }

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs

Lines 109 to 117 in d910ce3

    
           addedLength = Math.Min(addedLength, alternateSb.Length); 
        
           for (int j = 0; j < addedLength; j++) 
        
           { 
        
               if (vsb[initialLength + j] != alternateSb[j]) 
        
               { 
        
                   addedLength = j; 
        
                   break; 
        
               } 
        
           }

runtime/src/libraries/Common/src/System/IO/PathInternal.cs

Lines 53 to 77 in 09c1a1f

    
           internal static unsafe int EqualStartingCharacterCount(string? first, string? second, bool ignoreCase) 
        
           { 
        
               if (string.IsNullOrEmpty(first) || string.IsNullOrEmpty(second)) return 0; 
        
               int commonChars = 0; 
        
               fixed (char* f = first) 
        
               fixed (char* s = second) 
        
               { 
        
                   char* l = f; 
        
                   char* r = s; 
        
                   char* leftEnd = l + first.Length; 
        
                   char* rightEnd = r + second.Length; 
        
                   while (l != leftEnd && r != rightEnd 
        
                       && (*l == *r || (ignoreCase && char.ToUpperInvariant(*l) == char.ToUpperInvariant(*r)))) 
        
                   { 
        
                       commonChars++; 
        
                       l++; 
        
                       r++; 
        
                   } 
        
               } 
        
               return commonChars; 
        
           }

We should have a helper that a) allows someone to write this in a single line rather than writing their own loops and b) does so as efficiently as possible, e.g. vectorized ala SequenceEqual is for many types.

API Proposal

namespace System
{
    public static class MemoryExtensions
    {
+        public static int CommonPrefixLength<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other);
+        public static int CommonPrefixLength<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null);
    }
}

API Usage

// Find prefix overlap between multiple strings
ReadOnlySpan<char> prefix = strings[0];
for (int i = 1; i < strings.Length && !prefix.IsEmpty; i++)
{
    prefix = prefix.Slice(0, prefix.CommonPrefixLength(strings[i]));
}

Alternative Designs

We could add overloads of the existing SequenceEquals that have an additional out int argument, which would be 0 in the case of returning true and the position of the first difference in the case of returning false.

Risks

No response

The text was updated successfully, but these errors were encountered:

ghost · 2022-01-25T15:47:12Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

We have MemoryExtensions.SequenceEqual that reports true or false as to whether two inputs contain all the same contents, but we don't have a mechanism for it reporting back the first place a difference occurred. In multiple places we end up open-coding such a loop, e.g.

runtime/src/libraries/System.Text.Json/src/System/Text/Json/Reader/Utf8JsonReader.MultiSegment.cs

Lines 629 to 650 in d910ce3

    
           private int FindMismatch(ReadOnlySpan<byte> span, ReadOnlySpan<byte> literal) 
        
           { 
        
               Debug.Assert(span.Length > 0); 
        
               int indexOfFirstMismatch; 
        
               int minLength = Math.Min(span.Length, literal.Length); 
        
               int i = 0; 
        
               for (; i < minLength; i++) 
        
               { 
        
                   if (span[i] != literal[i]) 
        
                   { 
        
                       break; 
        
                   } 
        
               } 
        
               indexOfFirstMismatch = i; 
        
               Debug.Assert(indexOfFirstMismatch >= 0 && indexOfFirstMismatch < literal.Length); 
        
               return indexOfFirstMismatch; 
        
           }

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs

Lines 109 to 117 in d910ce3

    
           addedLength = Math.Min(addedLength, alternateSb.Length); 
        
           for (int j = 0; j < addedLength; j++) 
        
           { 
        
               if (vsb[initialLength + j] != alternateSb[j]) 
        
               { 
        
                   addedLength = j; 
        
                   break; 
        
               } 
        
           }

We should have a helper that a) allows someone to write this in a single line rather than writing their own loops and b) does so as efficiently as possible, e.g. vectorized ala SequenceEqual is for many types.

API Proposal

namespace System
{
    public static class MemoryExtensions
    {
+        public static int SequenceEqualUntil<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other);
+        public static int SequenceEqualUntil<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null);
    }
}

API Usage

// Find prefix overlap between multiple strings
ReadOnlySpan<char> prefix = strings[0];
for (int i = 1; i < strings.Length && !prefix.IsEmpty; i++)
{
    prefix = prefix.Slice(0, prefix.SequenceEqualUntil(strings[i]));
}

Alternative Designs

We could add overloads of the existing SequenceEquals that have an additional out int argument, which would be 0 in the case of returning true and the position of the first difference in the case of returning false.

Risks

No response

Author:	stephentoub
Assignees:	-
Labels:	`api-suggestion`, `area-System.Memory`
Milestone:	7.0.0

Tornhoof · 2022-01-25T17:04:58Z

Are there better names? If I read that, I'd except an API taking a length (pointless for span) or some other 'until' parameter. Like sequenceEqual until newline.

Maybe more in the direction of FindFirstDifference or FindIndexOfFirstDifference?

adamsitnik · 2022-01-25T17:41:16Z

What value would be returned if there is no match even for the first item? A 0?
What value would be returned if the other span was empty? Also a 0 or -1?
What value would be returned for a full match? The length of the shorter span?

Sorry if these questions are dumb, I just want to get a good understanding of how it would be consumed by the end users.

I like the FindIndexOfFirstDifference name proposal (I would shorten it to IndexOfFirstDifference). It's more intuitive

stephentoub · 2022-01-25T17:49:33Z

What value would be returned if there is no match even for the first item? A 0?

Yes, 0.

What value would be returned if the other span was empty? Also a 0 or -1?

0

What value would be returned for a full match? The length of the shorter span?

The length of the shorter span.

Pseudo-code implementation would be like:

public static int SequenceEqualUntil<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T> comparer)
{
    int length = Math.Min(span.Length, other.Length);
    for (int i = 0; i < length; i++)
        if (!EqualityComparer<T>.Default.Equals(span[i], other[i]))
            return i;
    return length;
}

So the return value indicates how much was the same between them. If it returns 0, no elements were the same. If it returns 1, the first element of each was the same. If it returns 2, the first two elements of each were the same. Etc.

IndexOfFirstDifference

I don't love that naming because it suggests the result might be a valid index in both spans, but it wouldn't be in the case of an empty span.

adamsitnik · 2022-01-25T19:27:32Z

How about "CommonSuffixLength"? Or is it too string-oriented?

stephentoub · 2022-01-25T19:52:41Z

Presumably it'd be "CommonPrefixLength" 😄

I'd be ok with that or some variation thereof.

FiniteReality · 2022-01-25T21:39:31Z

I've written similar methods in my own code previously, but I chose the name IsPrefixedBy, and I've used it like this:

if (!spanOne.IsPrefixedBy(spanTwo, out var firstDifference))
{
    _logger.LogDebug("No prefix match (first deviance at index {index})", firstDifference);
}

This way, IsPrefixedBy returns true when spanOne.StartsWith(spanTwo), making it rather convenient for string parsing where I need to show a user-facing error if no prefix is shared.

GSPP · 2022-02-04T09:33:41Z

Alternative name: IndexOfDifference. It's an "index of" type of method.

stephentoub · 2022-02-04T12:22:25Z

Alternative name: IndexOfDifference. It's an "index of" type of method.

IndexOfFirstDifference was suggested earlier. My concern with the IndexOfDifference is the same: other IndexOf methods always return a valid index or -1. That wouldn't be the case here if one or both inputs were empty. In fact it's not actually returning an index, but a length.

adamsitnik · 2022-03-14T12:15:56Z

@stephentoub I am supportive of this idea in the current shape and I've marked it as ready for review. Would you like to present it in the API proposal meeting?

stephentoub · 2022-03-14T13:00:16Z

Sure, thanks.

GrabYourPitchforks · 2022-03-15T17:12:54Z

We should consider having specific overloads for string (ReadOnlySpan<char> or whatever), since these would require special handling. Due to surrogate handling - and optional case-insensitivity handling, if you want that as an option - it's not possible to compute a true "common prefix length" answer without looking at pairs of chars rather than chars in isolation. This could result in a naïve implementation that goes element-by-element (char-by-char) giving off-by-one errors in its response.

Code which performs UTF-8 text processing would need similar logic to prevent errors or data corruption in its output.

stephentoub · 2022-03-15T17:16:27Z

We should consider having specific overloads for string (ReadOnlySpan or whatever), since these would require special handling

For the cited examples, Regex specifically doesn't consider surrogates; while that might be considered "wrong", it wouldn't be able to use this if the input did and it had no way to avoid it.

Are the other cited uses in the opening issue already wrong then? Or for them are there other mitigating reasons why going byte-by-byte or char-by-char correct?

I want to make sure that if we "fix" the implementation to handle surrogates and the like that we're not then making it so that no one can actually use it.

GrabYourPitchforks · 2022-03-15T17:34:24Z

For the cited examples, Regex specifically doesn't consider surrogates; while that might be considered "wrong", it wouldn't be able to use this if the input did and it had no way to avoid it.

It could call the <T> overload, which is presumably element-by-element.

Are the other cited uses in the opening issue already wrong then? Or for them are there other mitigating reasons why going byte-by-byte or char-by-char correct?

I don't know what the other cited code samples are used for. But in general, if you're splitting a string (UTF-16 or UTF-8) at arbitrary indexes and treating the resulting splits as binary data, party on. Nobody's going to care what you do with the data.

If you're splitting a string (UTF-16 or UTF-8) at arbitrary indexes and attempting to process those splits as text, bad things will happen. You'll almost certainly end up with data corruption for some inputs.

As a concrete example, consider the two French terms pêcheur ("fisherman") and pécheur ("sinner"). They differ only in the type of accent over the first e character. The UTF-8 representations of these texts are:

[ 70 C3 AA 63 68 65 75 72 ] = "pêcheur"
[ 70 C3 A9 63 68 65 75 72 ] = "pécheur"

The proposed CommonPrefixLength API would return 2 when presented with these inputs. Again, if you're treating this as binary data, whatever. But if you're attempting to do anything string-like with these APIs - such as case conversion, UTF-8 <-> UTF-16 conversion (for writing to i/o) - etc., then these strings once split will turn into:

"p�" + "�cheur"
"p�" + "�cheur"

And hey, look at that! In the "best" case they're corrupted and the meaning has changed. In the "worst" case the data corruption has caused them to be have identical contents once reconstituted, which can cause your application to run down a code path you never intended.

stephentoub · 2022-03-15T18:07:56Z

It could call the overload, which is presumably element-by-element.

It feels like a real pit of failure that

ReadOnlySpan<char> span = ...;
SomeMethod(span)

and

ReadOnlySpan<char> span = ...;
SomeMethod<char>(span);

would have different semantics.

FiniteReality · 2022-03-15T19:29:55Z

IMHO, If the user wishes for the implementation to be unicode-aware, they should pass another parameter indicating as such. The most likely parameter that would indicate that, to me, would be a StringComparison value. That is:

ReadOnlySpan<char> span = ...;
SomeMethod(span)

and

ReadOnlySpan<char> span = ...;
SomeMethod(span, StringComparison.OrdinalIgnoreCase)

have different semantics, and it becomes obvious why, as for the second overload you specified you wanted it to be treated "stringly", rather than as a sequence of char-size values

GrabYourPitchforks · 2022-03-19T19:01:56Z

It feels like a real pit of failure that ... and ... would have different semantics.

You're not wrong. 😄 But that's why I think we really need to think long and hard about the scenarios that this API is intended for. My concern is that the API as currently proposed is already a pit of failure.

When processing raw binary data, the behavior as proposed is just fine. Everything is a bit-by-bit comparison and we're good to go.

When processing variable-length data formats like UTF-8/16, the behavior as proposed makes this API appropriate only for trie-like data structures and other scenarios where the caller never treats the prefix or suffix as anything other than opaque binary data. The only operations that would ever be valid for that opaque data are bit-for-bit equality comparisons. If you wanted to do anything "interesting" with the data (such as interpret it in a meaningful fashion other than as a binary blob), you'd have to concatenate it back into its whole, otherwise you risk misinterpreting the data.

The scenarios listed in the issue description strongly indicate that we expect callers to pass stringy data in here. This makes it very easy for them to fall into a trap where they unwittingly perform contextual processing of the data.

var dict = new Dictionary<string, object>(StringComparison.OrdinalIgnoreCase);
int commonPrefixLength = GetCommonPrefixLength(str1, str2);
string str3 = str1.Substring(0, commonPrefixLength);
dict[str3] = new object(); // <-- !! BUG !! - indirect contextual processing of substring (via StringComparison.OrdinalIgnoreCase.GetHashCode)

If we're still ok with this and don't expect that developers will commonly run afoul of this, let's march forward as-is. But at minimum this deserves a mention in the docs.

IMHO, If the user wishes for the implementation to be unicode-aware, they should pass another parameter indicating as such.

Not sure what you mean by "Unicode-aware". If you're talking about case insensitivity or culture-aware (such as en-US or de-DE), you'd need new overloads anyway, as the current proposed overloads wouldn't suffice. I wouldn't recommend adding culture awareness anyway, since both the API surface and the implementation get very messy very quickly.

FiniteReality · 2022-03-19T19:07:11Z

IMHO, If the user wishes for the implementation to be unicode-aware, they should pass another parameter indicating as such.

Not sure what you mean by "Unicode-aware". If you're talking about case insensitivity or culture-aware (such as en-US or de-DE), you'd need new overloads anyway, as the current proposed overloads wouldn't suffice. I wouldn't recommend adding culture awareness anyway, since both the API surface and the implementation get very messy very quickly.

I meant moreso "dealing with the way UTF-8 or UTF-16 encodes text" rather than "dealing with a sequence of bytes/similar" when I said "Unicode-aware". Sorry for the confusion 😅

GrabYourPitchforks · 2022-03-19T19:22:52Z

I meant moreso "dealing with the way UTF-8 or UTF-16 encodes text" rather than "dealing with a sequence of bytes/similar" when I said "Unicode-aware". Sorry for the confusion 😅

Understood. :)

If we doc the behavior as "don't use this API over structured data (like UTF-* text)" then this should be ok. But otherwise it's hard to separate char from the concept of Unicode, as we literally define the char type as the fundamental unit of UTF-16 data. So it's a reasonable expectation that if you see a sequence of chars, they're meant to be interpreted as a run of UTF-16 text. Anything that violates that expectation is a potential pit of failure.

Separately, I've been pondering for a while whether concerns like this are best handled by analyzers. For example, Array.Reverse<T> is a useful API (as is the API proposed here!), but calls to Array.Reverse<char> should probably be flagged as suspicious. Maybe that's also the best solution here.

GrabYourPitchforks · 2022-04-05T20:18:08Z

Had a chance to take another look at the three consumers mentioned previously.

The JSON usage is fine because it's performing binary (not string-contextual) comparisons, and it's not doing anything with the data other than saying "here's where the error occurred."

The Regex usage I assume is by design because that type already explicitly disavows supplementary character support.

The Path logic is problematic for a handful of reasons, many of which aren't relevant to this issue. (Fun example: the strings "c:\Ꮿ.txt" and "c:\ꮿ.txt" compare as equal-ignore-case, but NTFS treats them as not-equal.) But to the question of surrogates, it handles them incorrectly and could return the wrong result.

terrajobst · 2022-04-12T18:58:31Z

Video

Looks good as proposed
- Need overloads so that it shows up on instances of Span<T>
- We should constrain the first one to IEquatable<T>
- We might need to add an overload that is specific to char and takes StringComparison

namespace System;

public static class MemoryExtensions
{
    public static int CommonPrefixLength<T>(this Span<T> span, ReadOnlySpan<T> other) where T: IEquatable<T>;
    public static int CommonPrefixLength<T>(this Span<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null);

    public static int CommonPrefixLength<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other) where T: IEquatable<T>;
    public static int CommonPrefixLength<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null);
}

stephentoub added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Memory labels Jan 25, 2022

stephentoub added this to the 7.0.0 milestone Jan 25, 2022

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jan 25, 2022

adamsitnik removed the untriaged New issue has not been triaged by the area owner label Jan 25, 2022

stephentoub changed the title ~~[API Proposal]: MemoryExtensions.SequenceEqualUntil<T>~~ [API Proposal]: MemoryExtensions.CommonPrefixLength<T> Feb 17, 2022

adamsitnik added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Mar 14, 2022

adamsitnik assigned stephentoub Mar 14, 2022

terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Apr 12, 2022

stephentoub mentioned this issue Apr 12, 2022

Add MemoryExtensions.CommonPrefixLength #67929

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 12, 2022

stephentoub closed this as completed in #67929 Apr 17, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 17, 2022

dotnet locked as resolved and limited conversation to collaborators May 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[API Proposal]: MemoryExtensions.CommonPrefixLength<T> #64271

[API Proposal]: MemoryExtensions.CommonPrefixLength<T> #64271

stephentoub commented Jan 25, 2022 •

edited

ghost commented Jan 25, 2022

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

Tornhoof commented Jan 25, 2022 •

edited

adamsitnik commented Jan 25, 2022

stephentoub commented Jan 25, 2022 •

edited

adamsitnik commented Jan 25, 2022

stephentoub commented Jan 25, 2022

FiniteReality commented Jan 25, 2022

GSPP commented Feb 4, 2022 •

edited

stephentoub commented Feb 4, 2022 •

edited

adamsitnik commented Mar 14, 2022

stephentoub commented Mar 14, 2022

GrabYourPitchforks commented Mar 15, 2022 •

edited

stephentoub commented Mar 15, 2022

GrabYourPitchforks commented Mar 15, 2022

stephentoub commented Mar 15, 2022

FiniteReality commented Mar 15, 2022

GrabYourPitchforks commented Mar 19, 2022 •

edited

FiniteReality commented Mar 19, 2022

GrabYourPitchforks commented Mar 19, 2022

GrabYourPitchforks commented Apr 5, 2022

terrajobst commented Apr 12, 2022 •

edited by dotnet-api-review bot

[API Proposal]: MemoryExtensions.CommonPrefixLength<T> #64271

[API Proposal]: MemoryExtensions.CommonPrefixLength<T> #64271

Comments

stephentoub commented Jan 25, 2022 • edited

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

ghost commented Jan 25, 2022

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

Tornhoof commented Jan 25, 2022 • edited

adamsitnik commented Jan 25, 2022

stephentoub commented Jan 25, 2022 • edited

adamsitnik commented Jan 25, 2022

stephentoub commented Jan 25, 2022

FiniteReality commented Jan 25, 2022

GSPP commented Feb 4, 2022 • edited

stephentoub commented Feb 4, 2022 • edited

adamsitnik commented Mar 14, 2022

stephentoub commented Mar 14, 2022

GrabYourPitchforks commented Mar 15, 2022 • edited

stephentoub commented Mar 15, 2022

GrabYourPitchforks commented Mar 15, 2022

stephentoub commented Mar 15, 2022

FiniteReality commented Mar 15, 2022

GrabYourPitchforks commented Mar 19, 2022 • edited

FiniteReality commented Mar 19, 2022

GrabYourPitchforks commented Mar 19, 2022

GrabYourPitchforks commented Apr 5, 2022

terrajobst commented Apr 12, 2022 • edited by dotnet-api-review bot

stephentoub commented Jan 25, 2022 •

edited

Tornhoof commented Jan 25, 2022 •

edited

stephentoub commented Jan 25, 2022 •

edited

GSPP commented Feb 4, 2022 •

edited

stephentoub commented Feb 4, 2022 •

edited

GrabYourPitchforks commented Mar 15, 2022 •

edited

GrabYourPitchforks commented Mar 19, 2022 •

edited

terrajobst commented Apr 12, 2022 •

edited by dotnet-api-review bot