Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MemoryExtensions.GetHashCode(ReadOnlySpan<Char>, StringComparison) Method #50322

Closed
weifenluo opened this issue Mar 27, 2021 · 12 comments
Closed
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Memory

Comments

@weifenluo
Copy link

weifenluo commented Mar 27, 2021

Background and Motivation

MemoryExtensions has Equals extension method to compare two ReadOnlySpan<char> structs:

public static bool Equals (this ReadOnlySpan<char> span, ReadOnlySpan<char> other, StringComparison comparisonType);`

It should also include a GetHashCode method for ReadOnlySpan<char> struct:

public static int GetHashCode(this ReadOnlySpan<char> span, StringComparison comparisonType);`

The reason is obvious: Equals and GetHashCode should appear in pair. This API is helpful when writing libraries targeting legacy .Net Framework or .Net Standard 2.0 for broader audience, where String.GetHashCode(ReadOnlySpan<char>, StringComparison) is not available.

Proposed API

namespace System
{
    public static class MemoryExtensions {
+       public static int GetHashCode(this ReadOnlySpan<char> span, StringComparison comparisonType) {
+           ...    
+       }
    }
}

Usage Examples

ReadOnlySpan<char> span = ...;
int hashCode = span.GetHashCode(StringComparision.OrdinalIgnoreCase);

Alternative Designs

I believe this is the best place to add this API.

Risks

Don't see any risk - already have Equals any way. I believe the implementation is already there, only a few lines of code needs to be added.

@weifenluo weifenluo added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Mar 27, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 27, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@stephentoub
Copy link
Member

To make sure I understand the request here, you're asking for functionality equivalent to the existing string.GetHashCode but shipped in a separate nuget targeting downlevel?

@ghost
Copy link

ghost commented Mar 27, 2021

Tagging subscribers to this area: @tannergooding, @pgovind, @GrabYourPitchforks
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and Motivation

MemoryExtensions has Equals extension method to compare two ReadOnlySpan<char> structs:

public static bool Equals (this ReadOnlySpan<char> span, ReadOnlySpan<char> other, StringComparison comparisonType);`

It should also include a GetHashCode method for ReadOnlySpan<char> struct:

public static bool GetHashCode(this ReadOnlySpan<char> span, StringComparison comparisonType);`

The reason is obvious: Equals and GetHashCode should appear in pair. This API is helpful when writing libraries targeting legacy .Net Framework or .Net Standard 2.0 for broader audience, where String.GetHashCode(ReadOnlySpan<char>, StringComparison) is not available.

Proposed API

namespace System
{
    public static class MemoryExtensions {
+       public static bool GetHashCode(this ReadOnlySpan<char> span, StringComparison comparisonType) {
+           ...    
+       }
    }
}

Usage Examples

ReadOnlySpan<char> span = ...;
int hashCode = span.GetHashCode(StringComparision.OrdinalIgnoreCase);

Alternative Designs

I believe this is the best place to add this API.

Risks

Don't see any risk - already have Equals any way. I believe the implementation is already there, only a few lines of code needs to be added.

Author: weifenluo
Assignees: -
Labels:

api-suggestion, area-System.Buffers, untriaged

Milestone: -

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Mar 27, 2021

I don't think we should add this API. The downlevel performance would be extremely poor, as it would require converting the spans back into a string instance and calling string.GetHashCode on it. (We wouldn't want to ship our own implementation downlevel because then string.GetHashCode would return a different value than this API, which would confuse callers and possibly lead to incorrect app behavior.)

@stephentoub
Copy link
Member

(I agree, just trying to ensure I understand the request. )

@weifenluo
Copy link
Author

weifenluo commented Mar 28, 2021

Yes, the request is to ship functionality equivalent to the existing string.GetHashCode in a separate nuget targeting downlevel.

The real use case is a custom JSON deserializer which requires a custom Dictionary to lookup JSON property by UTF8 span name, compares ordinal or ordinal case insensitive. It's a library should target downlevel to .Net Standard 2.0 and Framework 4.6.1 for broader adoption. Performance is critical so converting the spans back into a string instance is not acceptable (it also destroys the whole point of providing this API). Returning a different value than string.GetHashCode doesn't really matter in this case. Maybe we can have another name to distinguish it from string.GetHashCode, or make it clear in the documentation?

System.Text.Json implements the lookup by first converting the spans back into a string instance for comparison, than cache the original utf8 bytes for upcoming lookups. IMO It's kind of hacky because this may work well for certain test cases, it will revert back to string instance in many real world use cases.

@GrabYourPitchforks
Copy link
Member

Performance is critical so converting the spans back into a string instance is not acceptable

Unfortunately I don't think this is something we can deliver in a downlevel package. It would rely on APIs like MemoryExtensions.ToUpper, which which running downlevel perform significant allocations under the covers. Even if we did try to work around this downlevel, it almost certainly wouldn't be resilient against malicious input, which would make it inappropriate for a use case such as a JSON deserializer.

Honestly I think your best bet is when running downlevel to use a regular string-based dictionary. Yes, it'll take a perf hit, but at least it will work as expected. If that's truly not acceptable for your scenario, you could clone some of the logic from the implementations in this repo. For instance, if all you need is Ordinal and OrdinalIgnoreCase (no InvariantCulture, etc.), there are code paths in this repo that you can clone. You can tailor to these code paths to your specific needs better than we could provide a generalized API suited to your needs. But you may ultimately end up needing to p/invoke into Win32 if you encounter non-ASCII characters, which means you'll need to begin special-casing those possibilities, and you'd need an OS check to ensure you're not going to attempt the p/invoke if you're on a non-Windows platform. All of this is why I'd recommend sticking with tried and true strings.

@ghost
Copy link

ghost commented Mar 28, 2021

Tagging subscribers to this area: @GrabYourPitchforks, @carlossanlop
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and Motivation

MemoryExtensions has Equals extension method to compare two ReadOnlySpan<char> structs:

public static bool Equals (this ReadOnlySpan<char> span, ReadOnlySpan<char> other, StringComparison comparisonType);`

It should also include a GetHashCode method for ReadOnlySpan<char> struct:

public static int GetHashCode(this ReadOnlySpan<char> span, StringComparison comparisonType);`

The reason is obvious: Equals and GetHashCode should appear in pair. This API is helpful when writing libraries targeting legacy .Net Framework or .Net Standard 2.0 for broader audience, where String.GetHashCode(ReadOnlySpan<char>, StringComparison) is not available.

Proposed API

namespace System
{
    public static class MemoryExtensions {
+       public static int GetHashCode(this ReadOnlySpan<char> span, StringComparison comparisonType) {
+           ...    
+       }
    }
}

Usage Examples

ReadOnlySpan<char> span = ...;
int hashCode = span.GetHashCode(StringComparision.OrdinalIgnoreCase);

Alternative Designs

I believe this is the best place to add this API.

Risks

Don't see any risk - already have Equals any way. I believe the implementation is already there, only a few lines of code needs to be added.

Author: weifenluo
Assignees: -
Labels:

api-suggestion, area-System.Memory, untriaged

Milestone: -

@gfoidl
Copy link
Member

gfoidl commented Mar 28, 2021

@weifenluo maybe for interest: #27229 and #2010

@weifenluo
Copy link
Author

weifenluo commented Mar 29, 2021

@GrabYourPitchforks Thanks a lot for your great explanation and advice. I managed to work out the solution which falling back to string only when there is non-ASCII char. After all, JSON property name rarely contains non-ASCII char.

MemoryExtensions.GetHashCode.cs:

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

internal static partial class MemoryExtensions
{
    // A span-based equivalent of String.GetHashCode(StringComparison). Uses the specified comparison type.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int GetHashCode(this ReadOnlySpan<char> value, StringComparison comparisonType)
    {
        switch (comparisonType)
        {
            case StringComparison.Ordinal:
                return GetHashCode(value);

            case StringComparison.OrdinalIgnoreCase:
                return GetHashCodeOrdinalIgnoreCase(value);

            default:
                throw new ArgumentException($"Comparison type '{comparisonType}' is not supported.", nameof(comparisonType));
        }
    }

    // A span-based equivalent of String.GetHashCode(). Computes an ordinal hash code.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static int GetHashCode(ReadOnlySpan<char> value)
    {
        ulong seed = Marvin.DefaultSeed;
        return Marvin.ComputeHash32(MemoryMarshal.AsBytes(value), seed);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static int GetHashCodeOrdinalIgnoreCase(ReadOnlySpan<char> value)
    {
        ulong seed = Marvin.DefaultSeed;
        return Marvin.ComputeHash32OrdinalIgnoreCase(ref MemoryMarshal.GetReference(value), value.Length /* in chars, not bytes */, (uint)seed, (uint)(seed >> 32));
    }
}

This goes down to Marvin.ComputeHash32 and Marvin.ComputeHash32OrdinalIgnoreCase, with two small tricks:

Marvin.GenerateSeed:

private static ulong GenerateSeed()
{
    //***********************************************************
    // Replace original implementation using Interop
    // with hashcode of two GUIDs.
    //***********************************************************

    //ulong seed;
    //Interop.GetRandomBytes((byte*)&seed, sizeof(ulong));
    //return seed;

    var hashCode1 = (ulong)Guid.NewGuid().GetHashCode();
    var hashCode2 = (ulong)Guid.NewGuid().GetHashCode();
    return (hashCode1 << 32) & hashCode2;
}

and Marvin.ComputeHash32OrdinalIgnoreCaseSlow:

private static unsafe int ComputeHash32OrdinalIgnoreCaseSlow(ref char data, int count, uint p0, uint p1)
{
    Debug.Assert(count > 0);

    //****************************************************************************
    // The original implementation does not work because Ordinal.ToUpperOrdinal
    // will hit on GlobalizationMode.Invariant and GlobalizationMode.UseNls
    // which are OS-specific
    //***************************************************************************
    //char[]? borrowedArr = null;
    //Span<char> scratch = (uint)count <= 64 ? stackalloc char[64] : (borrowedArr = ArrayPool<char>.Shared.Rent(count));

    //int charsWritten = System.Globalization.Ordinal.ToUpperOrdinal(new ReadOnlySpan<char>(ref data, count), scratch);
    //Debug.Assert(charsWritten == count); // invariant case conversion should involve simple folding; preserve code unit count

    //// Slice the array to the size returned by ToUpperInvariant.
    //// Multiplication below will not overflow since going from positive Int32 to UInt32.
    //int hash = ComputeHash32(ref Unsafe.As<char, byte>(ref MemoryMarshal.GetReference(scratch)), (uint)charsWritten * 2, p0, p1);

    //// Return the borrowed array if necessary.
    //if (borrowedArr != null)
    //{
    //    ArrayPool<char>.Shared.Return(borrowedArr);
    //}

    //return hash;

    //*****************************************************************************
    // The only way to calculate ordinal ignore case hashcode for non-ASCII string
    // is using StringComparer.OrdinalIgnoreCase.GetHashCode() because
    // there is no ToUpperOrdinal API for string.
    // The result is different from the original implementation, it's ok
    // because we will use the hash code privately only.
    //*****************************************************************************
    var charSpan = new ReadOnlySpan<char>(Unsafe.AsPointer(ref data), count);
    string s = charSpan.ToString();
    int hash = StringComparer.OrdinalIgnoreCase.GetHashCode(s);
    return HashCode.Combine(p0, p1, hash);
}

@weifenluo
Copy link
Author

@gfoidl Wow, didn't expect coming across a world champion here! Thanks a lot for your information, which helped me to realize that ReadOnlySpan<char>, the long lost twin of string, has a long way to catch up. This may take years!

@GrabYourPitchforks
Copy link
Member

@weifenluo I'm glad you found a solution that works for you! Please remember to have your code security-reviewed, as Marvin32 is intended to be a security-sensitive algorithm, and calling StringComparer.OrdinalIgnoreCase.GetHashCode(...) in the middle of the algorithm will defeat the protections afforded by Marvin.

@jkotas jkotas closed this as completed Mar 29, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Apr 28, 2021
@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Jun 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Memory
Projects
None yet
Development

No branches or pull requests

7 participants