Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create APIs to deal with processing ASCII text (as bytes) #28230

Closed
Tracked by #64598
GrabYourPitchforks opened this issue Dec 18, 2018 · 56 comments · Fixed by #75012
Closed
Tracked by #64598

Create APIs to deal with processing ASCII text (as bytes) #28230

GrabYourPitchforks opened this issue Dec 18, 2018 · 56 comments · Fixed by #75012
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Buffers
Milestone

Comments

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Dec 18, 2018

EDIT: most up to date proposal can be found here: #28230 (comment)

Most recent API review comment: #28230 (comment)

If you are interested in the original proposal, click "details" below:

To make @KrzysztofCwalina happy. :)

Networking protocols and parsers in particular use ASCII very frequently - even more so than UTF-8. It would be useful to introduce high-performance APIs that work very specifically on ASCII text without trying to interpret it as UTF-8.

using System;
using System.Buffers;

/*
 * This work item proposes a new static class for performing operations
 * on byte buffers assumed to contain ASCII text. For the purposes of
 * these APIs, an ASCII byte / char is any value in the range [ 00 .. 7F ], inclusive;
 * and ASCII text is a sequence of such values. Most APIs in this type will
 * treat non-ASCII values as opaque data unless the API description specifies
 * a different behavior.
 *
 * For APIs which take both a source and a destination buffer, the behavior of
 * the method is undefined if the source and destination buffers overlap,
 * unless the API description specifies otherwise. The behavior of all APIs
 * is undefined if another thread mutates the buffers while these APIs are
 * operating on them.
 *
 * All case conversion APIs are culture-unaware.
 */

namespace System.Buffers.Text
{
    public static class Ascii
    {
        // Compares two ASCII buffers for equality, optionally treating [A-Z] and [a-z] as equal.
        // All non-ASCII bytes / chars are compared for pure binary equivalence.

        public static bool Equals(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right);
        public static bool Equals(ReadOnlySpan<char> left, ReadOnlySpan<char> right);
        public static bool EqualsIgnoreCase(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right);
        public static bool EqualsIgnoreCase(ReadOnlySpan<char> left, ReadOnlySpan<char> right);

        // Compares an ASCII byte buffer and an ASCII char buffer for equality, optionally treating
        // [A-Z] and [a-z] as equal. Returns false if the ASCII byte buffer contains any non-ASCII
        // data or if the char buffer contains any element in the range [ 0080 .. FFFF ], as we
        // wouldn't know what encoding to use to perform the transcode-then-compare operation.

        public static bool Equals(ReadOnlySpan<byte> left, ReadOnlySpan<char> right);
        public static bool Equals(ReadOnlySpan<byte> left, string right);
        public static bool EqualsIgnoreCase(ReadOnlySpan<byte> left, ReadOnlySpan<char> right);
        public static bool EqualsIgnoreCase(ReadOnlySpan<byte> left, string right);

        // Searches for the first occurrence of the target substring within the search space,
        // optionally treating [A-Z] and [a-z] as equal. All non-ASCII bytes are compared for pure
        // binary equivalence. Returns the index of where the first match is found, else returns -1.
        // ADDENDUM: Also assume there are *Last equivalents of the following.

        public static int IndexOf(ReadOnlySpan<byte> text, ReadOnlySpan<byte> value);
        public static int IndexOfIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<byte> value);

        // Searches for the first occurrence of the target substring within the search space,
        // optionally treating [A-Z] and [a-z] as equal. Returns the index of where the first match
        // is found, else returns -1. If the target string contains any non-ASCII chars ([ 0080 .. FFFF ]),
        // the search is assume to have failed, and the method returns -1.
        // ADDENDUM: Also assume there are *Last equivalents of the following.

        public static int IndexOf(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
        public static int IndexOf(ReadOnlySpan<byte> text, string value);
        public static int IndexOfIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
        public static int IndexOfIgnoreCase(ReadOnlySpan<byte> text, string value);

        // Given a buffer, returns the index of the first element in the buffer which
        // is a non-ASCII byte, or -1 if the buffer is empty or all-ASCII. The bool-
        // returning method is a convenience shortcut to perform the same check.

        public static int GetIndexOfFirstNonAsciiByte(ReadOnlySpan<byte> buffer);
        public static int GetIndexOfFirstNonAsciiChar(ReadOnlySpan<char> buffer);
        public static bool IsAllAscii(ReadOnlySpan<byte> buffer);
        public static bool IsAllAscii(ReadOnlySpan<char> buffer);

        // Returns true iff the provided byte is an ASCII byte; i.e., in the range [ 00 .. 7F ];
        // or if the provided char is in the range [ 0000 .. 007F ].

        public static bool IsAsciiByte(byte value);
        public static bool IsAsciiChar(char value);

        // Copies source to destination, converting [A-Z] -> [a-z] or vice versa during
        // the copy. All values outside [A-Za-z] - including non-ASCII values - are unchanged
        // during the copy.
        //
        // If source.Length <= destination.Length, succeeds and returns source.Length (# bytes copied).
        // If source.Length > destination.Length, returns -1.

        public static int ToLowerInvariant(ReadOnlySpan<byte> source, Span<byte> destination);
        public static int ToLowerInvariant(ReadOnlySpan<char> source, Span<char> destination);
        public static int ToUpperInvariant(ReadOnlySpan<byte> source, Span<byte> destination);
        public static int ToUpperInvariant(ReadOnlySpan<char> source, Span<char> destination);

        // Performs case conversion ([A-Z] -> [a-z] or vice versa) in-place. All values
        // outside [A-Za-z] - including non-ASCII values - are unchanged.

        public static void ToLowerInvariantInPlace(Span<byte> buffer);
        public static void ToLowerInvariantInPlace(Span<char> buffer);
        public static void ToUpperInvariantInPlace(Span<byte> buffer);
        public static void ToUpperInvariantInPlace(Span<char> buffer);

        // Performs case conversion on a single value, converting [A-Z] -> [a-z] or vice versa.
        // All values outside [A-Za-z] - including non-ASCII values - are unchanged.

        public static byte ToLowerInvariant(byte value);
        public static byte ToLowerInvariant(char value);
        public static byte ToUpperInvariant(byte value);
        public static byte ToUpperInvariant(char value);

        // Returns a hash code for the provided buffer suitable for use in a dictionary or
        // other keyed collection. For the OrdinalIgnoreCase method, the values [A-Z] and [a-z]
        // are treated as equivalent during hash code computation. All non-ASCII values
        // are treated as opaque data. The hash code is randomized but is not guaranteed to
        // implement any particular algorithm, nor is it guaranteed to be a member of the same
        // PRF family as other GetHashCode routines in the framework.

        public static int GetHashCode(ReadOnlySpan<byte> buffer);
        public static int GetHashCode(ReadOnlySpan<char> buffer);
        public static int GetHashCodeOrdinalIgnoreCase(ReadOnlySpan<byte> buffer);
        public static int GetHashCodeOrdinalIgnoreCase(ReadOnlySpan<char> buffer);

        // Widens an ASCII buffer to UTF-16 or narrows a UTF-16 buffer to ASCII.
        // Returns OperationStatus.InvalidData if the source buffer contains a non-ASCII byte
        // or a char in the range [ 0080 .. FFFF ].
        // OPEN QUESTION: Should we have an equivalent with Latin-1 semantics? Probably doesn't
        // belong on the ASCII class if we do that.

        public static OperationStatus WidenToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesRead, out int charsWritten);
        public static OperationStatus NarrowFromUTf16(ReadOnlySpan<char> source, Span<byte> destination, out int charsRead, out int bytesWritten);

        // Widens the provided buffer to UTF-16, then returns a new string instance from it.
        // Throws ArgumentException if the buffer contains non-ASCII data. If the desired behavior
        // is character substitution instead of throwing an exception, consider instead using
        // Encoding.ASCII.GetString.        

        public static string ToString(ReadOnlySpan<byte> buffer);

        // Trims only ASCII whitespace values from the buffer, returning the trimmed buffer.

        public static ReadOnlySpan<byte> Trim(ReadOnlySpan<byte> buffer);
        public static ReadOnlySpan<char> Trim(ReadOnlySpan<char> buffer);
        public static ReadOnlySpan<byte> TrimStart(ReadOnlySpan<byte> buffer);
        public static ReadOnlySpan<char> TrimStart(ReadOnlySpan<char> buffer);
        public static ReadOnlySpan<byte> TrimEnd(ReadOnlySpan<byte> buffer);
        public static ReadOnlySpan<char> TrimEnd(ReadOnlySpan<char> buffer);

        // Similar to Trim, but returns the Range of the untrimmed data. Useful for
        // scenarios where the resulting ROS<byte> needs to be mapped back to some
        // container data structure.
        // Example: Assume ROM<byte> mem = GetSomeAsciiData();
        // Then ROM<byte> trimmed = mem[Ascii.GetTrimmedRange(mem.Span)];

        public static Range GetTrimmedRange(ReadOnlySpan<byte> buffer);
        public static Range GetTrimmedRange(ReadOnlySpan<char> buffer);
    }
}
@danmoseley
Copy link
Member

Why is it useful to return source length?

@GrabYourPitchforks
Copy link
Member Author

@danmosemsft It's not useful. It's just to match the API shape of MemoryExtensions.ToUpperInvariant, which also unnecessarily returns the source length.

@danmoseley
Copy link
Member

I suggest not to perpetuate the design flaw?

@gfoidl
Copy link
Member

gfoidl commented Dec 19, 2018

In-place APIs - do we need these?

As mentioned that's easy to achieve, but it would be convenient to have these (as common use case?) in corefx.

@GrabYourPitchforks
Copy link
Member Author

@danmosemsft - Sure, no complaints from my end to make it void-returning. We can simply document the behavior.

@tarekgh
Copy link
Member

tarekgh commented Jan 17, 2019

would it be better to have these helper methods on the ASCIIEncoding class instead of creating a new utility class?

@GrabYourPitchforks
Copy link
Member Author

@tarekgh I had this same concern with the UTF-8 APIs. Why not put them on the existing UTF8Encoding class rather than a separate static class?

The argument that pushed me away from putting them on the existing Encoding classes is that the proposed static APIs are very opinionated in how they behave. If you subclass AsciiEncoding, UTF8Encoding, etc., these static APIs would still show up, and their behavior might differ from that of the overridden instance methods on the type. This is not a pit of success.

@tarekgh
Copy link
Member

tarekgh commented Jan 19, 2019

I didn't see anyone sub-classing Ascii encoding at all. this mean is unlikely to happen. even it show up in the list when sub-classing, I don't think this will be issue as these static methods anyway. my whole point having these in the Ascii encoding for easier discoverability as one class encapsulating the Ascii operations.

You don't have to block on that though :-) and we should proceed with the design review.

@vcsjones
Copy link
Member

What is the behavior of this API if source and destination partially overlap?

@GrabYourPitchforks
Copy link
Member Author

@vcsjones I assume undefined. See also https://github.com/dotnet/corefx/issues/34799, where I contemplated this same question with the existing span-based text case conversion APIs.

@ahsonkhan
Copy link
Member

I suggest not to perpetuate the design flaw?

It was done intentionally.

Sure, no complaints from my end to make it void-returning. We can simply document the behavior.

I would keep it consistent and return an int.

Here is the context for why we returned an int: https://github.com/dotnet/corefx/issues/31340#issuecomment-407866739

The API initially returned void but was changed, the reasoning for which can be seen in this GitHub thread.

Essentially, if the presupposition that the ToUpper/ToLower APIs do not change the length of the data was removed, we would have to change the public signature of the API. Having the API return the amount of data written avoids API change in the future.

All four functions always return source.length in case of success. Do you think it can be something different on other platforms?

Hence, this is an implementation detail and not guaranteed, necessarily.

Yes, exceptions are expensive, but this is error prone design, because callers will forget they must check the return value.

That is why the API was designed with a return value. The contract suggests you should do something with the return value (in an attempt to avoid scenario 1). A void returning method gives even less indication that the caller needs to do anything and as you mentioned, exceptions are expensive (especially if we are catching the exception to resize within a loop). The span-resizing pattern is a common practice with a lot of our span-based APIs and hence we haven't used exceptions for length mismatch anywhere.

Hopefully that adds some context to the design decision.

@GSPP
Copy link

GSPP commented Feb 3, 2019

@GrabYourPitchforks regarding undefined behavior: It seems better to detect this and throw or to make it defined (for example, by using a slower but safe algorithm).

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Feb 22, 2019

Essentially, if the presupposition that the ToUpper/ToLower APIs do not change the length of the data was removed, we would have to change the public signature of the API. Having the API return the amount of data written avoids API change in the future.

I don't think this is ever going to happen realistically. String manipulation APIs like ToUpper / ToLower in the Framework have always used simple case folding semantics, and simple case folding semantics are guaranteed per the Unicode Standard never to change the length of a UTF-16 string. The only public APIs which use complex case folding semantics are String comparison APIs (e.g., Equals(..., StringComparison.CurrentCulture)) and CompareInfo.GetSortKey.

An enormous amount of code is written with the assumption that simple case folding semantics are used for all String manipulation APIs, and it would introduce a massive breaking change throughout our ecosystem if we were to suddenly upend this. If we were going to introduce an API to perform full folding (which could change the length of the UTF-16 data), it really should be on a brand new API rather than one that purports to have the same semantics as existing APIs. Something like TextInfo.ToUpperFullFolding(...) would fit the bill if we can think of a good name.

See also https://github.com/dotnet/coreclr/issues/20955#issuecomment-438389558.

Edit: I'd still prefer returning an int just for consistency at this point. And any ToUpper(ROS<char8>) overloads would absolutely need it since the length of UTF-8 data can change, even with simple folding semantics. My intent with this comment was that the original reasoning for making the ToUpper(ROS<char>) overloads return an int appears to have been flawed.

@tarekgh
Copy link
Member

tarekgh commented Apr 3, 2019

@GrabYourPitchforks are we still targeting this for 3.0?

@GrabYourPitchforks
Copy link
Member Author

@tarekgh Probably not unless somebody really wants it. I'll re-milestone.

@scalablecory
Copy link
Contributor

More food for thought: HttpClient could make use of a ToLower(byte) and a ToLower(ReadOnlySpan<char> src, Span<byte> dst).

@GrabYourPitchforks
Copy link
Member Author

@scalablecory are the heterogeneous parameters ROS<char> and Span<byte> intentional? What would the behavior of such a method be?

@scalablecory
Copy link
Contributor

@scalablecory are the heterogeneous parameters ROS<char> and Span<byte> intentional? What would the behavior of such a method be?

Think ASCII.GetBytes() + ToLower. We do such a thing when writing out headers for HTTP/2.

@terrajobst
Copy link
Member

terrajobst commented Dec 10, 2019

Video

Looks good but needs work:

  • We didn't like the name AsciiUtility
  • We felt the type should be in System.Buffers.Text as that's the home for low-level span-based APIs for specific encodings
  • We should add other operations, such as:
    • GetHashCode
    • Compare
    • ...
namespace System.Buffers.Text
{
   public static class Ascii
   {
      public static bool EqualsOrdinalIgnoreCase(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right);
      public static int ToLowerInvariant(ReadOnlySpan<byte> source, Span<byte> destination);
      public static int ToUpperInvariant(ReadOnlySpan<byte> source, Span<byte> destination);
      public static void ToLowerInvariantInPlace(Span<byte> buffer);
      public static void ToUpperInvariantInPlace(Span<byte> buffer);
   }
}

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Dec 12, 2019

(Comment deleted since the updated API surface has been rolled into the proposal at the top of this issue.)

@GrabYourPitchforks
Copy link
Member Author

Marking as ready for review since there's now a more fleshed-out API surface.

@benaadams
Copy link
Member

If one is EqualsOrdinalIgnoreCase should the other be EqualsOrdinal rather than Equals?

@GrabYourPitchforks
Copy link
Member Author

I'm not too attached to the name to be honest, so sure. :)

Though the nice thing about naming it Equals is that most consumers won't run into the trap of typing Ascii.Equals(span1, span2) then receive a cryptic "can't box Span" error.

@benaadams
Copy link
Member

True, also matches behaviour of String.Equals(string, string)

Question then, are other comparisions in the mix other than ordinal (e.g. see some Invariants in the list)

So would a Equals(ReadOnlySpan<byte>, ReadOnlySpan<char>, StringComparison) be something to offer also?

@GrabYourPitchforks
Copy link
Member Author

I don't think it makes sense to put proper culture-aware APIs here. At that point we're talking globalization, which almost always involves non-ASCII data at some point, and which isn't really prevalent for the scenarios we're considering here like network protocol parsing. If globalization is a concern for the developer IMO they'd be better served by the existing APIs on string and CultureInfo.

@benaadams
Copy link
Member

That makes sense; then my final concern from that would be inconsistency here

EqualsOrdinalIgnoreCase
ToLowerInvariant
IndexOfOrdinalIgnoreCase

Why are some Invariant and others OrdinalIgnoreCase

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@terrajobst terrajobst added the blocked Issue/PR is blocked on something - see comments label Feb 14, 2020
@adamsitnik adamsitnik added the api-ready-for-review API is ready for review, it is NOT ready for implementation label Sep 7, 2022
@adamsitnik adamsitnik self-assigned this Sep 13, 2022
@adamsitnik
Copy link
Member

adamsitnik commented Sep 13, 2022

namespace System.Buffers.Text
{
    public static class Ascii
    {
        // Compares an ASCII byte buffer and an ASCII char buffer for value equality.
        
-       public static bool Equals(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right);
-       public static bool Equals(ReadOnlySpan<char> left, ReadOnlySpan<char> right);
-       public static bool Equals(ReadOnlySpan<byte> left, string right);
        public static bool Equals(ReadOnlySpan<byte> left, ReadOnlySpan<char> right);

        // Compare two ASCII buffers for value equality, treating [A-Z] and [a-z] as equal.

-       public static bool EqualsIgnoreCase(ReadOnlySpan<byte> left, string right);
        public static bool EqualsIgnoreCase(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right);
        public static bool EqualsIgnoreCase(ReadOnlySpan<char> left, ReadOnlySpan<char> right);
        public static bool EqualsIgnoreCase(ReadOnlySpan<byte> left, ReadOnlySpan<char> right);

        // Searches for the first (last) occurrence of the target substring within the
        // search space, optionally treating [A-Z] and [a-z] as equal.
        // Returns -1 if not found.
        // Throws an exception if the needle contains non-ASCII data.
        // The haystack is allowed to contain non-ASCII data.
        
-       public static int IndexOf(ReadOnlySpan<byte> text, ReadOnlySpan<byte> value);
-       public static int IndexOf(ReadOnlySpan<byte> text, string value);
-       public static int IndexOf(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
+       public static int IndexOf(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);


-       public static int IndexOfIgnoreCase(ReadOnlySpan<byte> text, string value);
+       public static int IndexOfIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);


+       public static int LastIndexOf(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
+       public static int LastIndexOf(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);
+       public static int LastIndexOfIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<byte> value);
+       public static int LastIndexOfIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<char> value);
+       public static int LastIndexOfIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
+       public static int LastIndexOfIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

        // Given a buffer, determines whether it starts (ends) with the given value,
        // optionally treating [A-Z] and [a-z] as equal.
        // Throws an exception if the needle contains non-ASCII data.
        // The haystack is allowed to contain non-ASCII data.

+       public static bool StartsWith(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
+       public static bool StartsWith(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

+       public static bool EndsWith(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
+       public static bool EndsWith(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

+       public static bool StartsWithIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<byte> value);
+       public static bool StartsWithIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<char> value);
+       public static bool StartsWithIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
+       public static bool StartsWithIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

+       public static bool EndsWithIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<byte> value);
+       public static bool EndsWithIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<char> value);
+       public static bool EndsWithIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
+       public static bool EndsWithIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

        // Given a buffer, returns the index of the first element in the buffer which
        // is a non-ASCII byte, or -1 if the buffer is empty or all-ASCII. The bool-
        // returning method is a convenience shortcut to perform the same check.
        
        public static int GetIndexOfFirstNonAsciiByte(ReadOnlySpan<byte> buffer);
        public static int GetIndexOfFirstNonAsciiChar(ReadOnlySpan<char> buffer);
        public static bool IsAscii(ReadOnlySpan<byte> value);
        public static bool IsAscii(ReadOnlySpan<char> value);

        // Returns true iff the provided byte is an ASCII byte; i.e., in the range [ 00 .. 7F ];
        // or if the provided char is in the range [ 0000 .. 007F ].

        public static bool IsAscii(byte value);
        public static bool IsAscii(char value);
        
        // Copies source to destination, converting [A-Z] <-> [a-z] during the copy.
        // Returns OperationStatus.InvalidData on non-ASCII input.

-       public static int ToLower(ReadOnlySpan<byte> source, Span<byte> destination);
-       public static int ToLower(ReadOnlySpan<char> source, Span<char> destination);
-       public static int ToUpper(ReadOnlySpan<byte> source, Span<byte> destination);
-       public static int ToUpper(ReadOnlySpan<char> source, Span<char> destination);
        
+       public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<byte> destination, out int elementsProcessed);
+       public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<char> destination, out int elementsProcessed);
+       public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<char> destination, out int elementsProcessed);
+       public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<byte> destination, out int elementsProcessed);

+       public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<byte> destination, out int elementsProcessed);
+       public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<char> destination, out int elementsProcessed);
+       public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<char> destination, out int elementsProcessed);
+       public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<byte> destination, out int elementsProcessed);        
        
-       public static byte ToLower(byte value);
-       public static byte ToLower(char value);
-       public static byte ToUpper(byte value);
-       public static byte ToUpper(char value);

        // Performs case conversion ([A-Z] <-> [a-z]) in-place.
        // Returns true on success, false if any non-ASCII data was found (and stops processing).
        // Always outs the number of elements it was able to convert.

-       public static void ToLowerInPlace(Span<byte> value);
+       public static bool TryToUpperInPlace(Span<byte> value, out int bytesProcessed);
-       public static void ToLowerInPlace(Span<char> value);
+       public static bool TryToUpperInPlace(Span<char> value, out int charsProcessed);
-       public static void ToUpperInPlace(Span<byte> value);
+       public static bool TryToLowerInPlace(Span<byte> value, out int bytesProcessed);
-       public static void ToUpperInPlace(Span<char> value);
+       public static bool TryToLowerInPlace(Span<char> value, out int charsProcessed);

        // Returns a hash code for the provided buffer suitable for use in a dictionary or
        // other keyed collection, optionally treating [A-Z] and [a-z] as equal.
        // The hash code is randomized and is not guaranteed to match string.GetHashCode
        // or any other public PRF within the framework. The ROS<byte> and ROS<char> overloads
        // will produce the same hash code if Equals[IgnoreCase](ROS<byte>, ROS<char>) returns
        // true. The Try* methods return false (and out 0) if the input is not ASCII;
        // the non-Try* methods throw an exception.
       
+       public static bool TryGetHashCode(ReadOnlySpan<byte> value, out int hashCode);
+       public static bool TryGetHashCode(ReadOnlySpan<char> value, out int hashCode);
+       public static bool TryGetHashCodeIgnoreCase(ReadOnlySpan<byte> value, out int hashCode);
+       public static bool TryGetHashCodeIgnoreCase(ReadOnlySpan<char> value, out int hashCode);

        public static int GetHashCode(ReadOnlySpan<byte> value);
        public static int GetHashCode(ReadOnlySpan<char> value);
        public static int GetHashCodeIgnoreCase(ReadOnlySpan<byte> value);
        public static int GetHashCodeIgnoreCase(ReadOnlySpan<char> value);

        // Widens an ASCII buffer to UTF-16 or narrows a UTF-16 buffer to ASCII.
        // Returns OperationStatus.InvalidData if the source buffer contains a non-ASCII byte
        // or a char in the range [ 0080 .. FFFF ].

-       public static OperationStatus WidenToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten);
-       public static OperationStatus NarrowFromUTf16(ReadOnlySpan<char> source, Span<byte> destination, out int charsConsumed, out int bytesWritten);
+       public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int elementsProcessed);
+       public static OperationStatus FromUtf16(ReadOnlySpan<char> source, Span<byte> destination, out int elementsProcessed);

        // Trims only ASCII whitespace values from the buffer, returning the trimmed buffer.
        // The full set of trimmed values is: U+0009..U+000D and U+0020.
        // All other elements, including non-ASCII elements, are ignored.

        public static Range Trim(ReadOnlySpan<byte> value);
        public static Range Trim(ReadOnlySpan<char> value);
        public static Range TrimStart(ReadOnlySpan<byte> value);
        public static Range TrimStart(ReadOnlySpan<char> value);
        public static Range TrimEnd(ReadOnlySpan<byte> value);
        public static Range TrimEnd(ReadOnlySpan<char> value);
    }
}

@danmoseley
Copy link
Member

+       public static int IndexOf(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);
+       public static int IndexOfIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);

is the difference between these intentional?

@terrajobst
Copy link
Member

  • There is a concern that System.Buffers.Text isn't discoverable. Should this be in System.Text?
  • We need to clarify the behavior for throwing vs returning false
    • We don't want behavior that are function of lengths
    • It's OK to have APIs where haystack (text) is never validated but needle (value) is
    • It's not OK to have an API where the needle is sometimes not validated, e.g. if if (value.Length > text.Length) return false.
  • The text parameter (the haystack) should probably called source because in some cases we use it as a binary that contains ASCII but also other data.
  • Equals
    • Don't throw if either side isn't ASCII, just return false
    • We should add these to match EqualsIgnoreCase
      • Equals(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right)
      • Equals(ReadOnlySpan<char> left, ReadOnlySpan<char> right)
  • We need to be more deliberate about the types we accept. We don't want to duplicated APIs that already exist, such as SequenceEqual
  • The ToLowerInPlace API doesn't follow the standard try-pattern. It should probably use operation status.

Original API:

namespace System.Buffers.Text;

public static class Ascii
{
    public static bool Equals(ReadOnlySpan<byte> left, ReadOnlySpan<char> right);

    public static bool EqualsIgnoreCase(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right);
    public static bool EqualsIgnoreCase(ReadOnlySpan<char> left, ReadOnlySpan<char> right);
    public static bool EqualsIgnoreCase(ReadOnlySpan<byte> left, ReadOnlySpan<char> right);

    public static int IndexOf(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
    public static int IndexOf(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

    public static int LastIndexOf(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
    public static int LastIndexOf(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

    public static int IndexOfIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<byte> value);
    public static int IndexOfIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<char> value);
    public static int IndexOfIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
    public static int IndexOfIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

    public static int LastIndexOfIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<byte> value);
    public static int LastIndexOfIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<char> value);
    public static int LastIndexOfIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
    public static int LastIndexOfIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

    public static bool StartsWith(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
    public static bool StartsWith(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

    public static bool EndsWith(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
    public static bool EndsWith(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

    public static bool StartsWithIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<byte> value);
    public static bool StartsWithIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<char> value);
    public static bool StartsWithIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
    public static bool StartsWithIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

    public static bool EndsWithIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<byte> value);
    public static bool EndsWithIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<char> value);
    public static bool EndsWithIgnoreCase(ReadOnlySpan<byte> text, ReadOnlySpan<char> value);
    public static bool EndsWithIgnoreCase(ReadOnlySpan<char> text, ReadOnlySpan<byte> value);

    public static int GetIndexOfFirstNonAsciiByte(ReadOnlySpan<byte> buffer);
    public static int GetIndexOfFirstNonAsciiChar(ReadOnlySpan<char> buffer);
    public static bool IsAscii(ReadOnlySpan<byte> value);
    public static bool IsAscii(ReadOnlySpan<char> value);

    public static bool IsAscii(byte value);
    public static bool IsAscii(char value);

    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<byte> destination, out int elementsProcessed);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<char> destination, out int elementsProcessed);
    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<char> destination, out int elementsProcessed);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<byte> destination, out int elementsProcessed);

    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<byte> destination, out int elementsProcessed);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<char> destination, out int elementsProcessed);
    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<char> destination, out int elementsProcessed);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<byte> destination, out int elementsProcessed);

    public static bool TryToUpperInPlace(Span<byte> value, out int bytesProcessed);
    public static bool TryToUpperInPlace(Span<char> value, out int charsProcessed);

    public static bool TryToLowerInPlace(Span<byte> value, out int bytesProcessed);
    public static bool TryToLowerInPlace(Span<char> value, out int charsProcessed);

    public static bool TryGetHashCode(ReadOnlySpan<byte> value, out int hashCode);
    public static bool TryGetHashCode(ReadOnlySpan<char> value, out int hashCode);
    public static bool TryGetHashCodeIgnoreCase(ReadOnlySpan<byte> value, out int hashCode);
    public static bool TryGetHashCodeIgnoreCase(ReadOnlySpan<char> value, out int hashCode);

    public static int GetHashCode(ReadOnlySpan<byte> value);
    public static int GetHashCode(ReadOnlySpan<char> value);
    public static int GetHashCodeIgnoreCase(ReadOnlySpan<byte> value);
    public static int GetHashCodeIgnoreCase(ReadOnlySpan<char> value);

    public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int elementsProcessed);
    public static OperationStatus FromUtf16(ReadOnlySpan<char> source, Span<byte> destination, out int elementsProcessed);

    public static Range Trim(ReadOnlySpan<byte> value);
    public static Range Trim(ReadOnlySpan<char> value);
    public static Range TrimStart(ReadOnlySpan<byte> value);
    public static Range TrimStart(ReadOnlySpan<char> value);
    public static Range TrimEnd(ReadOnlySpan<byte> value);
    public static Range TrimEnd(ReadOnlySpan<char> value);
}

@terrajobst terrajobst added api-needs-work API needs work before it is approved, it is NOT ready for implementation and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Sep 13, 2022
@dakersnar dakersnar modified the milestones: 8.0.0, Future Sep 29, 2022
@adamsitnik
Copy link
Member

adamsitnik commented Nov 29, 2022

To make some progress on #75012 I would like to discuss only a subset of the ASCII APIs. Here is what was changes since last discussion:

- namespace System.Buffers.Text;
+ namespace System.Text;

public static class Ascii
{   
-    public static bool IsAscii(ReadOnlySpan<byte> value);
-    public static bool IsAscii(ReadOnlySpan<char> value);

+    public static bool IsValid(ReadOnlySpan<byte> value);
+    public static bool IsValid(ReadOnlySpan<char> value);

-    public static bool IsAscii(byte value);
-    public static bool IsAscii(char value);

+    public static bool IsValid(byte value);
+    public static bool IsValid(char value);

    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<byte> destination, out int elementsProcessed);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<char> destination, out int elementsProcessed);
    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<char> destination, out int elementsProcessed);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<byte> destination, out int elementsProcessed);

    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<byte> destination, out int elementsProcessed);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<char> destination, out int elementsProcessed);
    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<char> destination, out int elementsProcessed);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<byte> destination, out int elementsProcessed);

    public static bool TryToUpperInPlace(Span<byte> value, out int bytesProcessed);
    public static bool TryToUpperInPlace(Span<char> value, out int charsProcessed);

    public static bool TryToLowerInPlace(Span<byte> value, out int bytesProcessed);
    public static bool TryToLowerInPlace(Span<char> value, out int charsProcessed);

    public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int elementsProcessed);
    public static OperationStatus FromUtf16(ReadOnlySpan<char> source, Span<byte> destination, out int elementsProcessed);

    public static Range Trim(ReadOnlySpan<byte> value);
    public static Range Trim(ReadOnlySpan<char> value);
    public static Range TrimStart(ReadOnlySpan<byte> value);
    public static Range TrimStart(ReadOnlySpan<char> value);
    public static Range TrimEnd(ReadOnlySpan<byte> value);
    public static Range TrimEnd(ReadOnlySpan<char> value);
}

There is a concern that System.Buffers.Text isn't discoverable. Should this be in System.Text?

Very good point, fixed.

The ToLowerInPlace API doesn't follow the standard try-pattern. It should probably use operation status.

OperationStatus defines four values:

  • Done - success.
  • InvalidData - for ASCII it means that we have found some non-ASCII bytes/characters.
  • DestinationTooSmall in case of ASCII this is impossible: for valid data the size is always the same, for invalid we would return InvalidData.
  • NeedMoreData same as above

In my opinion, users would be confused if TryToLowerInPlace/TryToLowerInPlace were defined as returning OperationStatus, but using only OperationStatus.Done and OperationStatus.InvalidData.

Writing the following code is just more natural and obvious to me:

if (Ascii.TryToLowerInPlace(buffer, out int cosumed))
{
    // success
}
else
{
    // non-ascii found at index "consumed"
}

@adamsitnik adamsitnik added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-needs-work API needs work before it is approved, it is NOT ready for implementation labels Nov 29, 2022
@adamsitnik
Copy link
Member

@GrabYourPitchforks @stephentoub while studying ASP.NET code today I've realized that their corresponding helper for ToUtf16 method treats NULL character (0) as an invalid character:

https://github.com/dotnet/aspnetcore/blob/792e021af928d435276ffdb2149082ea3d8ce9c5/src/Servers/Kestrel/Core/src/Internal/Infrastructure/HttpUtilities.cs#L96-L97

The check if performed here: https://github.com/dotnet/aspnetcore/blob/c4b57c452447f70ab874097436226542c70d74c0/src/Shared/ServerInfrastructure/StringUtilities.cs#L799

So what we have right now in the proposal won't meet their requirements. What is your opinion on extending ToUtf16 and FromUtf16 with a boolean parameter that describes how to treat NULL characters?

-    public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int elementsProcessed);
+    public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, bool allowNullCharacters = true, out int elementsProcessed);
-    public static OperationStatus FromUtf16(ReadOnlySpan<char> source, Span<byte> destination, out int elementsProcessed);
+    public static OperationStatus FromUtf16(ReadOnlySpan<char> source, Span<byte> destination, bool allowNullCharacters = true, out int elementsProcessed);

I just want to make sure that they can delete their own copy of ASCIIUtility when we introduce the new Ascii APIs.

@gfoidl
Copy link
Member

gfoidl commented Dec 2, 2022

@adamsitnik in case you missed it: dotnet/aspnetcore#44040 (comment)

@adamsitnik adamsitnik modified the milestones: Future, 8.0.0 Dec 5, 2022
@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Dec 6, 2022

I strongly oppose special-casing NULL (or any other control character, including 0x7F) in the general purpose text processing APIs. These are all perfectly valid ASCII characters. It's fine for protocols to impose their own restrictions above and beyond mere ASCII: some protocols disallow NUL, some protocols disallow CRLF, some protocols disallow VT, some protocols disallow 0x7F, and so on. But general-purpose building block APIs should normally remain agnostic over all of these, leaving the vagaries of protocol details up to whatever higher level component is build atop this. We don't special-case NULL or any other control character anywhere else within the System.Text namespace, for instance.

For the particular API being suggested here, we don't special-case control chars within System.Text.ASCII.GetString(byte[]). Similarly, we should not special-case it here. The behaviors of the two APIs should be perfectly identical when given valid ASCII as input.

@bartonjs
Copy link
Member

bartonjs commented Dec 6, 2022

  • For the case of an OperationStatus API where Consumed is guaranteed equal to Written we decided to go with omitting Consumed (and documenting the behavior), leaving the naming up to the destination type (for bytesWritten/charsWritten/elementsWritten).
  • We discussed, and reiterated, that "Try" is not appropriate if a false return would out a non-default value, so went with OperationStatus everywhere.
  • We discussed, and rejected, adding replacement semantics to the UTF-16->ASCII transformations.
namespace System.Text;

public static class Ascii
{   
    public static bool IsValid(ReadOnlySpan<byte> value);
    public static bool IsValid(ReadOnlySpan<char> value);

    public static bool IsValid(byte value);
    public static bool IsValid(char value);

    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);

    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);

    public static OperationStatus ToUpperInPlace(Span<byte> value, out int bytesWritten);
    public static OperationStatus ToUpperInPlace(Span<char> value, out int charsWritten);

    public static OperationStatus ToLowerInPlace(Span<byte> value, out int bytesWritten);
    public static OperationStatus ToLowerInPlace(Span<char> value, out int charsWritten);

    public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static OperationStatus FromUtf16(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);

    public static Range Trim(ReadOnlySpan<byte> value);
    public static Range Trim(ReadOnlySpan<char> value);
    public static Range TrimStart(ReadOnlySpan<byte> value);
    public static Range TrimStart(ReadOnlySpan<char> value);
    public static Range TrimEnd(ReadOnlySpan<byte> value);
    public static Range TrimEnd(ReadOnlySpan<char> value);
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Dec 6, 2022
@adamsitnik
Copy link
Member

@bartonjs @GrabYourPitchforks @stephentoub I've realized that bytesWritten/charsWritten is not very accurate name for the out parameter for the in-place ToLower/ToUpper conversion when there is no need to convert anything. Imagine that we want to uppercase "ABC" string in-place. There is no need to copy any bytes/characters (charsWritten=0), but we need to process all of them.

What do you think about renaming written to processed for the in-place conversions?

- public static OperationStatus ToUpperInPlace(Span<byte> value, out int bytesWritten);
- public static OperationStatus ToUpperInPlace(Span<char> value, out int charsWritten);
+ public static OperationStatus ToUpperInPlace(Span<byte> value, out int bytesProcessed);
+ public static OperationStatus ToUpperInPlace(Span<char> value, out int charsProcessed);

- public static OperationStatus ToLowerInPlace(Span<byte> value, out int bytesWritten);
- public static OperationStatus ToLowerInPlace(Span<char> value, out int charsWritten);
+ public static OperationStatus ToLowerInPlace(Span<byte> value, out int bytesProcessed);
+ public static OperationStatus ToLowerInPlace(Span<char> value, out int charsProcessed);

@bartonjs
Copy link
Member

bartonjs commented Dec 7, 2022

What do you think about renaming written to processed for the in-place conversions?

The guideline is

CONSIDER naming an out parameter that reports the number of values
written by a buffer-writing method “bytesWritten,” “charsWritten,” or
“valuesWritten” (as appropriate to the data type).

If you see any existing OperationStatus method that has done something other than "Written" and has a similar justification, then maybe... but "Written" is what the pattern suggests, and it isn't so wrong as to feel like it needs something different.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Dec 8, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 21, 2022
@dotnet dotnet locked as resolved and limited conversation to collaborators Jan 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Buffers
Projects
None yet
Development

Successfully merging a pull request may close this issue.