Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: MemoryExtensions.CommonPrefixLength<T> #64271

Closed
stephentoub opened this issue Jan 25, 2022 · 21 comments · Fixed by #67929
Closed

[API Proposal]: MemoryExtensions.CommonPrefixLength<T> #64271

stephentoub opened this issue Jan 25, 2022 · 21 comments · Fixed by #67929
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Jan 25, 2022

Background and motivation

We have MemoryExtensions.SequenceEqual that reports true or false as to whether two inputs contain all the same contents, but we don't have a mechanism for it reporting back the first place a difference occurred. In multiple places we end up open-coding such a loop, e.g.

private int FindMismatch(ReadOnlySpan<byte> span, ReadOnlySpan<byte> literal)
{
Debug.Assert(span.Length > 0);
int indexOfFirstMismatch;
int minLength = Math.Min(span.Length, literal.Length);
int i = 0;
for (; i < minLength; i++)
{
if (span[i] != literal[i])
{
break;
}
}
indexOfFirstMismatch = i;
Debug.Assert(indexOfFirstMismatch >= 0 && indexOfFirstMismatch < literal.Length);
return indexOfFirstMismatch;
}

addedLength = Math.Min(addedLength, alternateSb.Length);
for (int j = 0; j < addedLength; j++)
{
if (vsb[initialLength + j] != alternateSb[j])
{
addedLength = j;
break;
}
}

internal static unsafe int EqualStartingCharacterCount(string? first, string? second, bool ignoreCase)
{
if (string.IsNullOrEmpty(first) || string.IsNullOrEmpty(second)) return 0;
int commonChars = 0;
fixed (char* f = first)
fixed (char* s = second)
{
char* l = f;
char* r = s;
char* leftEnd = l + first.Length;
char* rightEnd = r + second.Length;
while (l != leftEnd && r != rightEnd
&& (*l == *r || (ignoreCase && char.ToUpperInvariant(*l) == char.ToUpperInvariant(*r))))
{
commonChars++;
l++;
r++;
}
}
return commonChars;
}

We should have a helper that a) allows someone to write this in a single line rather than writing their own loops and b) does so as efficiently as possible, e.g. vectorized ala SequenceEqual is for many types.

API Proposal

namespace System
{
    public static class MemoryExtensions
    {
+        public static int CommonPrefixLength<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other);
+        public static int CommonPrefixLength<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null);
    }
}

API Usage

// Find prefix overlap between multiple strings
ReadOnlySpan<char> prefix = strings[0];
for (int i = 1; i < strings.Length && !prefix.IsEmpty; i++)
{
    prefix = prefix.Slice(0, prefix.CommonPrefixLength(strings[i]));
}

Alternative Designs

We could add overloads of the existing SequenceEquals that have an additional out int argument, which would be 0 in the case of returning true and the position of the first difference in the case of returning false.

Risks

No response

@stephentoub stephentoub added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Memory labels Jan 25, 2022
@stephentoub stephentoub added this to the 7.0.0 milestone Jan 25, 2022
@ghost
Copy link

ghost commented Jan 25, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

We have MemoryExtensions.SequenceEqual that reports true or false as to whether two inputs contain all the same contents, but we don't have a mechanism for it reporting back the first place a difference occurred. In multiple places we end up open-coding such a loop, e.g.

private int FindMismatch(ReadOnlySpan<byte> span, ReadOnlySpan<byte> literal)
{
Debug.Assert(span.Length > 0);
int indexOfFirstMismatch;
int minLength = Math.Min(span.Length, literal.Length);
int i = 0;
for (; i < minLength; i++)
{
if (span[i] != literal[i])
{
break;
}
}
indexOfFirstMismatch = i;
Debug.Assert(indexOfFirstMismatch >= 0 && indexOfFirstMismatch < literal.Length);
return indexOfFirstMismatch;
}

addedLength = Math.Min(addedLength, alternateSb.Length);
for (int j = 0; j < addedLength; j++)
{
if (vsb[initialLength + j] != alternateSb[j])
{
addedLength = j;
break;
}
}

We should have a helper that a) allows someone to write this in a single line rather than writing their own loops and b) does so as efficiently as possible, e.g. vectorized ala SequenceEqual is for many types.

API Proposal

namespace System
{
    public static class MemoryExtensions
    {
+        public static int SequenceEqualUntil<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other);
+        public static int SequenceEqualUntil<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null);
    }
}

API Usage

// Find prefix overlap between multiple strings
ReadOnlySpan<char> prefix = strings[0];
for (int i = 1; i < strings.Length && !prefix.IsEmpty; i++)
{
    prefix = prefix.Slice(0, prefix.SequenceEqualUntil(strings[i]));
}

Alternative Designs

We could add overloads of the existing SequenceEquals that have an additional out int argument, which would be 0 in the case of returning true and the position of the first difference in the case of returning false.

Risks

No response

Author: stephentoub
Assignees: -
Labels:

api-suggestion, area-System.Memory

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jan 25, 2022
@Tornhoof
Copy link
Contributor

Tornhoof commented Jan 25, 2022

Are there better names? If I read that, I'd except an API taking a length (pointless for span) or some other 'until' parameter. Like sequenceEqual until newline.

Maybe more in the direction of FindFirstDifference or FindIndexOfFirstDifference?

@adamsitnik adamsitnik removed the untriaged New issue has not been triaged by the area owner label Jan 25, 2022
@adamsitnik
Copy link
Member

What value would be returned if there is no match even for the first item? A 0?
What value would be returned if the other span was empty? Also a 0 or -1?
What value would be returned for a full match? The length of the shorter span?

Sorry if these questions are dumb, I just want to get a good understanding of how it would be consumed by the end users.

I like the FindIndexOfFirstDifference name proposal (I would shorten it to IndexOfFirstDifference). It's more intuitive

@stephentoub
Copy link
Member Author

stephentoub commented Jan 25, 2022

What value would be returned if there is no match even for the first item? A 0?

Yes, 0.

What value would be returned if the other span was empty? Also a 0 or -1?

0

What value would be returned for a full match? The length of the shorter span?

The length of the shorter span.

Pseudo-code implementation would be like:

public static int SequenceEqualUntil<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T> comparer)
{
    int length = Math.Min(span.Length, other.Length);
    for (int i = 0; i < length; i++)
        if (!EqualityComparer<T>.Default.Equals(span[i], other[i]))
            return i;
    return length;
}

So the return value indicates how much was the same between them. If it returns 0, no elements were the same. If it returns 1, the first element of each was the same. If it returns 2, the first two elements of each were the same. Etc.

IndexOfFirstDifference

I don't love that naming because it suggests the result might be a valid index in both spans, but it wouldn't be in the case of an empty span.

@adamsitnik
Copy link
Member

How about "CommonSuffixLength"? Or is it too string-oriented?

@stephentoub
Copy link
Member Author

Presumably it'd be "CommonPrefixLength" 😄

I'd be ok with that or some variation thereof.

@FiniteReality
Copy link

I've written similar methods in my own code previously, but I chose the name IsPrefixedBy, and I've used it like this:

if (!spanOne.IsPrefixedBy(spanTwo, out var firstDifference))
{
    _logger.LogDebug("No prefix match (first deviance at index {index})", firstDifference);
}

This way, IsPrefixedBy returns true when spanOne.StartsWith(spanTwo), making it rather convenient for string parsing where I need to show a user-facing error if no prefix is shared.

@GSPP
Copy link

GSPP commented Feb 4, 2022

Alternative name: IndexOfDifference. It's an "index of" type of method.

@stephentoub
Copy link
Member Author

stephentoub commented Feb 4, 2022

Alternative name: IndexOfDifference. It's an "index of" type of method.

IndexOfFirstDifference was suggested earlier. My concern with the IndexOfDifference is the same: other IndexOf methods always return a valid index or -1. That wouldn't be the case here if one or both inputs were empty. In fact it's not actually returning an index, but a length.

@stephentoub stephentoub changed the title [API Proposal]: MemoryExtensions.SequenceEqualUntil<T> [API Proposal]: MemoryExtensions.CommonPrefixLength<T> Feb 17, 2022
@adamsitnik adamsitnik added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Mar 14, 2022
@adamsitnik
Copy link
Member

@stephentoub I am supportive of this idea in the current shape and I've marked it as ready for review. Would you like to present it in the API proposal meeting?

@stephentoub
Copy link
Member Author

Sure, thanks.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Mar 15, 2022

We should consider having specific overloads for string (ReadOnlySpan<char> or whatever), since these would require special handling. Due to surrogate handling - and optional case-insensitivity handling, if you want that as an option - it's not possible to compute a true "common prefix length" answer without looking at pairs of chars rather than chars in isolation. This could result in a naïve implementation that goes element-by-element (char-by-char) giving off-by-one errors in its response.

Code which performs UTF-8 text processing would need similar logic to prevent errors or data corruption in its output.

@stephentoub
Copy link
Member Author

We should consider having specific overloads for string (ReadOnlySpan or whatever), since these would require special handling

For the cited examples, Regex specifically doesn't consider surrogates; while that might be considered "wrong", it wouldn't be able to use this if the input did and it had no way to avoid it.

Are the other cited uses in the opening issue already wrong then? Or for them are there other mitigating reasons why going byte-by-byte or char-by-char correct?

I want to make sure that if we "fix" the implementation to handle surrogates and the like that we're not then making it so that no one can actually use it.

@GrabYourPitchforks
Copy link
Member

For the cited examples, Regex specifically doesn't consider surrogates; while that might be considered "wrong", it wouldn't be able to use this if the input did and it had no way to avoid it.

It could call the <T> overload, which is presumably element-by-element.

Are the other cited uses in the opening issue already wrong then? Or for them are there other mitigating reasons why going byte-by-byte or char-by-char correct?

I don't know what the other cited code samples are used for. But in general, if you're splitting a string (UTF-16 or UTF-8) at arbitrary indexes and treating the resulting splits as binary data, party on. Nobody's going to care what you do with the data.

If you're splitting a string (UTF-16 or UTF-8) at arbitrary indexes and attempting to process those splits as text, bad things will happen. You'll almost certainly end up with data corruption for some inputs.

As a concrete example, consider the two French terms pêcheur ("fisherman") and pécheur ("sinner"). They differ only in the type of accent over the first e character. The UTF-8 representations of these texts are:

[ 70 C3 AA 63 68 65 75 72 ] = "pêcheur"
[ 70 C3 A9 63 68 65 75 72 ] = "pécheur"

The proposed CommonPrefixLength API would return 2 when presented with these inputs. Again, if you're treating this as binary data, whatever. But if you're attempting to do anything string-like with these APIs - such as case conversion, UTF-8 <-> UTF-16 conversion (for writing to i/o) - etc., then these strings once split will turn into:

"p�" + "�cheur"
"p�" + "�cheur"

And hey, look at that! In the "best" case they're corrupted and the meaning has changed. In the "worst" case the data corruption has caused them to be have identical contents once reconstituted, which can cause your application to run down a code path you never intended.

@stephentoub
Copy link
Member Author

It could call the overload, which is presumably element-by-element.

It feels like a real pit of failure that

ReadOnlySpan<char> span = ...;
SomeMethod(span)

and

ReadOnlySpan<char> span = ...;
SomeMethod<char>(span);

would have different semantics.

@FiniteReality
Copy link

IMHO, If the user wishes for the implementation to be unicode-aware, they should pass another parameter indicating as such. The most likely parameter that would indicate that, to me, would be a StringComparison value. That is:

ReadOnlySpan<char> span = ...;
SomeMethod(span)

and

ReadOnlySpan<char> span = ...;
SomeMethod(span, StringComparison.OrdinalIgnoreCase)

have different semantics, and it becomes obvious why, as for the second overload you specified you wanted it to be treated "stringly", rather than as a sequence of char-size values

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Mar 19, 2022

It feels like a real pit of failure that ... and ... would have different semantics.

You're not wrong. 😄 But that's why I think we really need to think long and hard about the scenarios that this API is intended for. My concern is that the API as currently proposed is already a pit of failure.

When processing raw binary data, the behavior as proposed is just fine. Everything is a bit-by-bit comparison and we're good to go.

When processing variable-length data formats like UTF-8/16, the behavior as proposed makes this API appropriate only for trie-like data structures and other scenarios where the caller never treats the prefix or suffix as anything other than opaque binary data. The only operations that would ever be valid for that opaque data are bit-for-bit equality comparisons. If you wanted to do anything "interesting" with the data (such as interpret it in a meaningful fashion other than as a binary blob), you'd have to concatenate it back into its whole, otherwise you risk misinterpreting the data.

The scenarios listed in the issue description strongly indicate that we expect callers to pass stringy data in here. This makes it very easy for them to fall into a trap where they unwittingly perform contextual processing of the data.

var dict = new Dictionary<string, object>(StringComparison.OrdinalIgnoreCase);
int commonPrefixLength = GetCommonPrefixLength(str1, str2);
string str3 = str1.Substring(0, commonPrefixLength);
dict[str3] = new object(); // <-- !! BUG !! - indirect contextual processing of substring (via StringComparison.OrdinalIgnoreCase.GetHashCode)

If we're still ok with this and don't expect that developers will commonly run afoul of this, let's march forward as-is. But at minimum this deserves a mention in the docs.

IMHO, If the user wishes for the implementation to be unicode-aware, they should pass another parameter indicating as such.

Not sure what you mean by "Unicode-aware". If you're talking about case insensitivity or culture-aware (such as en-US or de-DE), you'd need new overloads anyway, as the current proposed overloads wouldn't suffice. I wouldn't recommend adding culture awareness anyway, since both the API surface and the implementation get very messy very quickly.

@FiniteReality
Copy link

IMHO, If the user wishes for the implementation to be unicode-aware, they should pass another parameter indicating as such.

Not sure what you mean by "Unicode-aware". If you're talking about case insensitivity or culture-aware (such as en-US or de-DE), you'd need new overloads anyway, as the current proposed overloads wouldn't suffice. I wouldn't recommend adding culture awareness anyway, since both the API surface and the implementation get very messy very quickly.

I meant moreso "dealing with the way UTF-8 or UTF-16 encodes text" rather than "dealing with a sequence of bytes/similar" when I said "Unicode-aware". Sorry for the confusion 😅

@GrabYourPitchforks
Copy link
Member

I meant moreso "dealing with the way UTF-8 or UTF-16 encodes text" rather than "dealing with a sequence of bytes/similar" when I said "Unicode-aware". Sorry for the confusion 😅

Understood. :)

If we doc the behavior as "don't use this API over structured data (like UTF-* text)" then this should be ok. But otherwise it's hard to separate char from the concept of Unicode, as we literally define the char type as the fundamental unit of UTF-16 data. So it's a reasonable expectation that if you see a sequence of chars, they're meant to be interpreted as a run of UTF-16 text. Anything that violates that expectation is a potential pit of failure.

Separately, I've been pondering for a while whether concerns like this are best handled by analyzers. For example, Array.Reverse<T> is a useful API (as is the API proposed here!), but calls to Array.Reverse<char> should probably be flagged as suspicious. Maybe that's also the best solution here.

@GrabYourPitchforks
Copy link
Member

Had a chance to take another look at the three consumers mentioned previously.

The JSON usage is fine because it's performing binary (not string-contextual) comparisons, and it's not doing anything with the data other than saying "here's where the error occurred."

The Regex usage I assume is by design because that type already explicitly disavows supplementary character support.

The Path logic is problematic for a handful of reasons, many of which aren't relevant to this issue. (Fun example: the strings "c:\Ꮿ.txt" and "c:\ꮿ.txt" compare as equal-ignore-case, but NTFS treats them as not-equal.) But to the question of surrogates, it handles them incorrectly and could return the wrong result.

@terrajobst
Copy link
Member

terrajobst commented Apr 12, 2022

Video

  • Looks good as proposed
    • Need overloads so that it shows up on instances of Span<T>
    • We should constrain the first one to IEquatable<T>
    • We might need to add an overload that is specific to char and takes StringComparison
namespace System;

public static class MemoryExtensions
{
    public static int CommonPrefixLength<T>(this Span<T> span, ReadOnlySpan<T> other) where T: IEquatable<T>;
    public static int CommonPrefixLength<T>(this Span<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null);

    public static int CommonPrefixLength<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other) where T: IEquatable<T>;
    public static int CommonPrefixLength<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null);
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Apr 12, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 12, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 17, 2022
@dotnet dotnet locked as resolved and limited conversation to collaborators May 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants