Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate perf optimizations of unescape logic of Utf8JsonReader TextEquals #36066

Open
ahsonkhan opened this Issue Mar 15, 2019 · 0 comments

Comments

Projects
None yet
1 participant
@ahsonkhan
Copy link
Member

ahsonkhan commented Mar 15, 2019

Currently, if we ever hit an escaped character, we either stackalloc or rent a buffer and unescape the entire JSON string value from that character onward (using a simple while loop). After we have unescaped the entire token, we then compare to the string text the user passed in to see if they match. An alternate approach could be to compare and unescape as you search, and ping-pong between iterative unescaping and vectorized SequenceEqual.

Is this alternative approach worth it?

For context, see: #35979

Something along the route of:

  • FindEscapedChar (I think @benaadams vectorized the IndexOf methods recently, so you'll profit from all the avx goodness automatically.)
  • SequenceEquals up to this char (this is already vectorized afaik).
  • Unescape and compare
  • Repeat until you hit the end.

Let's say we have something as follows. Here are the performance characteristics of various payloads:
image

We see ~10-15%% improvement in some cases (~20-25% in the best case - first escaped, rest unescaped, especially for really large strings). However, we also see a 10-20% regression in some cases (~30% regression in the worst case - all escaped).

Here's my premise:

  • TextEquals() will primarily be called on matching property names (not value strings)
  • TextEquals() calls will more often return false than true. More property names will not match the lookup string compared to the one that will.
  • Property names tend to be small, I would imagine 3-16 characters being the most common cases
    • Vectorized searches are less helpful here given the input size.
  • Therefore, they will mostly fit within a single segment (and in rare cases, at most into 2, if segment size is reasonable - say 4K).
  • Generally they will be ASCII, with some cases where they will have escaped characters sprinkled within (I would imagine the value string having more escaped characters, especially given the default Utf8JsonWriter behavior, and maybe the entire property name is escaped)
  • If we happen to have escaped characters somewhere in the middle, the first SequenceEqual check will likely reject most search strings, so the chances we are on the "it's a match" path is higher by the time we are unescaping the characters.
  • In conclusion, given the rarity of escaped characters (0 or all being escaped more common than 1-3 sprinkled throughout), the size of property names, and how matching eagerly rejects, I don't see much benefit in going down the iterative approach.

I am sure the implementation can be made faster but I wanted to do an apples to apples comparison between what exists today (unescape all and compare) versus the new change (unescape and compare as you go). I would imagine perf improvements on the new UnescapeAndCompare would translate to the existing UnescapeAndCompare as well.

New Implementation:
private bool UnescapeAndCompare(ReadOnlySpan<byte> other)
{
    Debug.Assert(!HasValueSequence);
    ReadOnlySpan<byte> localSpan = ValueSpan;

    if (localSpan.Length < other.Length || localSpan.Length / JsonConstants.MaxExpansionFactorWhileEscaping > other.Length)
    {
        return false;
    }

    int idx = localSpan.IndexOf(JsonConstants.BackSlash);
    Debug.Assert(idx != -1);

    do
    {
        if (!other.StartsWith(localSpan.Slice(0, idx)))
        {
            return false;
        }

        if (!JsonReaderHelper.UnescapeAndCompare(localSpan.Slice(idx), other.Slice(idx), out int sourceBytesConsumed, out int otherBytesConsumed))
        {
            return false;
        }

        localSpan = localSpan.Slice(idx + sourceBytesConsumed);
        other = other.Slice(idx + otherBytesConsumed);

        if (other.Length == 0 || localSpan.Length == 0)
        {
            return localSpan.Length == other.Length;
        }

        idx = localSpan.IndexOf(JsonConstants.BackSlash);

        if (idx == -1)
        {
            return other.SequenceEqual(localSpan);
        }

    } while (true);
}

public static bool UnescapeAndCompare(ReadOnlySpan<byte> utf8Source, ReadOnlySpan<byte> other, out int sourceBytesConsumed, out int otherBytesConsumed)
{
    Debug.Assert(utf8Source[0] == JsonConstants.BackSlash);

    if (other.Length <= 0)
    {
        sourceBytesConsumed = default;
        otherBytesConsumed = default;
        return false;
    }

    bool result = false;

    byte currentByte = utf8Source[1];
    byte otherByte = other[0];

    sourceBytesConsumed = 2;
    otherBytesConsumed = 1;

    if (currentByte == JsonConstants.Quote)
    {
        result = otherByte == JsonConstants.Quote;
    }
    else if (currentByte == 'n')
    {
        result = otherByte == JsonConstants.LineFeed;
    }
    else if (currentByte == 'r')
    {
        result = otherByte == JsonConstants.CarriageReturn;
    }
    else if (currentByte == JsonConstants.BackSlash)
    {
        result = otherByte == JsonConstants.BackSlash;
    }
    else if (currentByte == JsonConstants.Slash)
    {
        result = otherByte == JsonConstants.Slash;
    }
    else if (currentByte == 't')
    {
        result = otherByte == JsonConstants.Tab;
    }
    else if (currentByte == 'b')
    {
        result = otherByte == JsonConstants.BackSpace;
    }
    else if (currentByte == 'f')
    {
        result = otherByte == JsonConstants.FormFeed;
    }
    else if (currentByte == 'u')
    {
        // The source is known to be valid JSON, and hence if we see a \u, it is guaranteed to have 4 hex digits following it
        // Otherwise, the Utf8JsonReader would have already thrown an exception.
        Debug.Assert(utf8Source.Length >= 6);

        bool parseResult = Utf8Parser.TryParse(utf8Source.Slice(2, 4), out int scalar, out int bytesConsumed, 'x');
        Debug.Assert(parseResult);
        Debug.Assert(bytesConsumed == 4);

        sourceBytesConsumed += 4;

        if (JsonHelpers.IsInRangeInclusive((uint)scalar, JsonConstants.HighSurrogateStartValue, JsonConstants.LowSurrogateEndValue))
        {
            // The first hex value cannot be a low surrogate.
            if (scalar >= JsonConstants.LowSurrogateStartValue)
            {
                ThrowHelper.ThrowInvalidOperationException_ReadInvalidUTF16(scalar);
            }

            Debug.Assert(JsonHelpers.IsInRangeInclusive((uint)scalar, JsonConstants.HighSurrogateStartValue, JsonConstants.HighSurrogateEndValue));

            // We must have a low surrogate following a high surrogate.
            if (utf8Source.Length < 12 || utf8Source[6] != '\\' || utf8Source[7] != 'u')
            {
                ThrowHelper.ThrowInvalidOperationException_ReadInvalidUTF16();
            }

            // The source is known to be valid JSON, and hence if we see a \u, it is guaranteed to have 4 hex digits following it
            // Otherwise, the Utf8JsonReader would have already thrown an exception.
            result = Utf8Parser.TryParse(utf8Source.Slice(8, 4), out int lowSurrogate, out bytesConsumed, 'x');
            Debug.Assert(result);
            Debug.Assert(bytesConsumed == 4);

            // If the first hex value is a high surrogate, the next one must be a low surrogate.
            if (!JsonHelpers.IsInRangeInclusive((uint)lowSurrogate, JsonConstants.LowSurrogateStartValue, JsonConstants.LowSurrogateEndValue))
            {
                ThrowHelper.ThrowInvalidOperationException_ReadInvalidUTF16(lowSurrogate);
            }

            // To find the unicode scalar:
            // (0x400 * (High surrogate - 0xD800)) + Low surrogate - 0xDC00 + 0x10000
            scalar = (JsonConstants.BitShiftBy10 * (scalar - JsonConstants.HighSurrogateStartValue))
                + (lowSurrogate - JsonConstants.LowSurrogateStartValue)
                + JsonConstants.UnicodePlane01StartValue;

            sourceBytesConsumed += 6;
        }

        Span<byte> destination = stackalloc byte[4];
#if BUILDING_INBOX_LIBRARY
        var rune = new Rune(scalar);
        result = rune.TryEncodeToUtf8Bytes(destination, out int bytesWritten);
        Debug.Assert(result);
#else
        EncodeToUtf8Bytes((uint)scalar, destination, out int bytesWritten);
#endif
        Debug.Assert(bytesWritten <= 4);
        result = other.StartsWith(destination.Slice(0, bytesWritten));

        otherBytesConsumed += bytesWritten - 1;
    }

    return result;
}
Benchmark:
public class Perf_TextEquals
{
    private readonly byte[] dataUtf8 = Encoding.UTF8.GetBytes("{\"proper\\u0074y1\": 1, \"proper\\u0074y2\": 2, \"proper\\u0074y3\": 3, \"proper\\u0074y4\": 4, \"proper\\u0074y5\": 5 }");
    private readonly byte[] dataUtf8_Unescaped = Encoding.UTF8.GetBytes("{\"property1\": 1, \"property2\": 2, \"property3\": 3, \"property4\": 4, \"property5\": 5 }");
    private readonly byte[] dataUtf8_One = Encoding.UTF8.GetBytes("{\"\\u0070\\u0072\\u006F\\u0070\\u0065\\u0072\\u0074\\u00791\": 1, \"property2\": 2, \"property3\": 3, \"property4\": 4, \"property5\": 5 }");
    private readonly byte[] dataUtf8_First = Encoding.UTF8.GetBytes("{\"\\u0070roperty1\": 1, \"\\u0070roperty2\": 2, \"\\u0070roperty3\": 3, \"\\u0070roperty4\": 4, \"\\u0070roperty5\": 5 }");
    private readonly byte[] dataUtf8_Many = Encoding.UTF8.GetBytes("{\"\\u0070ro\\u0070erty1\": 1, \"\\u0070ro\\u0070erty2\": 2, \"\\u0070ro\\u0070erty3\": 3, \"\\u0070ro\\u0070erty4\": 4, \"\\u0070ro\\u0070erty5\": 5 }");
    private readonly byte[] dataUtf8_All = Encoding.UTF8.GetBytes("{\"\\u0070\\u0072\\u006F\\u0070\\u0065\\u0072\\u0074\\u00791\": 1, \"\\u0070\\u0072\\u006F\\u0070\\u0065\\u0072\\u0074\\u00792\": 2, \"\\u0070\\u0072\\u006F\\u0070\\u0065\\u0072\\u0074\\u00793\": 3, \"\\u0070\\u0072\\u006F\\u0070\\u0065\\u0072\\u0074\\u00794\": 4, \"\\u0070\\u0072\\u006F\\u0070\\u0065\\u0072\\u0074\\u00795\": 5 }");

    private readonly byte[] lookup = Encoding.UTF8.GetBytes("property5");

    private readonly byte[] dataUtf8_OneProperty = Encoding.UTF8.GetBytes("{\"\\u0070ropertyaaaaaaaaaaaaaaaaaaaa6\": 6 }");
    private readonly byte[] dataUtf8_OnePropertyLargest = Encoding.UTF8.GetBytes("{\"\\u0070ropertyaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa6\": 6 }");

    private readonly byte[] lookupLarge = Encoding.UTF8.GetBytes("propertyaaaaaaaaaaaaaaaaaaaa6");
    private readonly byte[] lookupLargest = Encoding.UTF8.GetBytes("propertyaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa6");

    [Benchmark]
    public int UTF8()
    {
        var json = new Utf8JsonReader(dataUtf8, isFinalBlock: true, default);
        while (json.Read())
        {
            if (json.TokenType == JsonTokenType.PropertyName)
            {
                if (json.TextEquals(lookup))
                {
                    json.Read();
                    int result = json.GetInt32();
                    return result;
                }
            }
        }
        return -1;
    }

    [Benchmark]
    public int UTF8_LargeProperty()
    {
        var json = new Utf8JsonReader(dataUtf8_OneProperty, isFinalBlock: true, default);
        while (json.Read())
        {
            if (json.TokenType == JsonTokenType.PropertyName)
            {
                if (json.TextEquals(lookupLarge))
                {
                    json.Read();
                    int result = json.GetInt32();
                    return result;
                }
            }
        }
        return -1;
    }

    [Benchmark]
    public int UTF8_LargestProperty()
    {
        var json = new Utf8JsonReader(dataUtf8_OnePropertyLargest, isFinalBlock: true, default);
        while (json.Read())
        {
            if (json.TokenType == JsonTokenType.PropertyName)
            {
                if (json.TextEquals(lookupLargest))
                {
                    json.Read();
                    int result = json.GetInt32();
                    return result;
                }
            }
        }
        return -1;
    }

    [Benchmark]
    public int UTF8_Unescaped()
    {
        var json = new Utf8JsonReader(dataUtf8_Unescaped, isFinalBlock: true, default);
        while (json.Read())
        {
            if (json.TokenType == JsonTokenType.PropertyName)
            {
                if (json.TextEquals(lookup))
                {
                    json.Read();
                    int result = json.GetInt32();
                    return result;
                }
            }
        }
        return -1;
    }

    [Benchmark]
    public int UTF8_FirstEscaped()
    {
        var json = new Utf8JsonReader(dataUtf8_First, isFinalBlock: true, default);
        while (json.Read())
        {
            if (json.TokenType == JsonTokenType.PropertyName)
            {
                if (json.TextEquals(lookup))
                {
                    json.Read();
                    int result = json.GetInt32();
                    return result;
                }
            }
        }
        return -1;
    }

    [Benchmark]
    public int UTF8_ManyEscaped()
    {
        var json = new Utf8JsonReader(dataUtf8_Many, isFinalBlock: true, default);
        while (json.Read())
        {
            if (json.TokenType == JsonTokenType.PropertyName)
            {
                if (json.TextEquals(lookup))
                {
                    json.Read();
                    int result = json.GetInt32();
                    return result;
                }
            }
        }
        return -1;
    }

    [Benchmark]
    public int UTF8_AllEscaped()
    {
        var json = new Utf8JsonReader(dataUtf8_All, isFinalBlock: true, default);
        while (json.Read())
        {
            if (json.TokenType == JsonTokenType.PropertyName)
            {
                if (json.TextEquals(lookup))
                {
                    json.Read();
                    int result = json.GetInt32();
                    return result;
                }
            }
        }
        return -1;
    }

    [Benchmark]
    public int UTF8_OneEscaped()
    {
        var json = new Utf8JsonReader(dataUtf8_One, isFinalBlock: true, default);
        while (json.Read())
        {
            if (json.TokenType == JsonTokenType.PropertyName)
            {
                if (json.TextEquals(lookup))
                {
                    json.Read();
                    int result = json.GetInt32();
                    return result;
                }
            }
        }
        return -1;
    }
}

cc @Tornhoof, @GrabYourPitchforks, @stephentoub

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.