Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Add Utf8String type #23209

Merged
merged 11 commits into from Mar 19, 2019
Merged

Add Utf8String type #23209

merged 11 commits into from Mar 19, 2019

Conversation

GrabYourPitchforks
Copy link
Member

This PR contains the type itself and enlightens the VM, the GC, reflection, and certain APIs (Memory / Span) about the new type. Functionality like culture-aware or case-insensitive comparisons are not yet hooked up. Those are implemented in the feature branch, but I didn't include them here because I'm still working on increasing the test coverage.

All Utf8String-related code is currently surrounded by #ifdef statements to allow easy removal from release branches.

All Utf8String-related code is currently surrounded by ifdefs to allow easy removal from release branches
@GrabYourPitchforks
Copy link
Member Author

Experimental package, ref asm, and unit tests at dotnet/corefx#35989.

@@ -2,6 +2,7 @@
<!-- Features we're currently flighting, but don't intend to ship in officially supported releases -->
<PropertyGroup Condition="'$(IsPrerelease)' == 'true'">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please make sure that this is not set in the Release branch? (It is not the case today.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still true in the release/3.0 branch currently, but that's because there's not yet an officially shipping release coming from that branch. This was set correctly to false in the release/2.x branches (see 5b5dca3).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previews shipping out of release/3.0 are proxies for officially shipping releases. We should flip that in release/3.0 now. It has not been flipped because of we do not have anything really depending on it for 3.0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also kicked off a local build with IsPrerelease=false and ensured that no part of the feature inadvertently makes its way in to System.Private.CoreLib.dll.

@GrabYourPitchforks
Copy link
Member Author

I reran the perf benchmarks for the Memory<byte>.Span property getter since that code path was modified. Results are below.

Method Toolchain Mean Error StdDev Ratio RatioSD
GetSpan_MemOfBytesFromArray 2.2-release 4.244 ms 0.0882 ms 0.1474 ms 1.00 0.00
GetSpan_MemOfBytesFromArray 3.0-master 1.933 ms 0.0373 ms 0.0366 ms 0.45 0.02
GetSpan_MemOfBytesFromArray dev-utf8string 2.268 ms 0.0280 ms 0.0262 ms 0.53 0.03
GetSpan_MemOfBytesFromMemMgr 2.2-release 5.306 ms 0.0996 ms 0.0978 ms 1.00 0.00
GetSpan_MemOfBytesFromMemMgr 3.0-master 3.150 ms 0.0719 ms 0.0769 ms 0.59 0.02
GetSpan_MemOfBytesFromMemMgr dev-utf8string 3.101 ms 0.0270 ms 0.0239 ms 0.58 0.01
GetSpan_MemOfBytesEmpty 2.2-release 1.393 ms 0.0163 ms 0.0153 ms 1.00 0.00
GetSpan_MemOfBytesEmpty 3.0-master 1.108 ms 0.0208 ms 0.0240 ms 0.79 0.02
GetSpan_MemOfBytesEmpty dev-utf8string 1.141 ms 0.0227 ms 0.0213 ms 0.82 0.02

Overall the results are about what I'd expect. The byte[] case sees the largest regression because there's now an additional "are you Utf8String?" check where there was none before. However, due to the optimizations introduced earlier in the 3.0 timeframe (see #20386) we're still considerably better than 2.2. The MemoryManager<byte> case sees essentially no impact because that scenario was already expensive to begin with, and the extra few instructions are within the realm of noise.

@GrabYourPitchforks
Copy link
Member Author

(In the latest iteration I merged master manually because GitHub was having trouble auto-resolving merge conflicts.)

@yahorsi
Copy link

yahorsi commented Mar 14, 2019

May I ask silly question, are there any perf tests showing benefits of having special Utf8String string?

@GrabYourPitchforks
Copy link
Member Author

@yahorsi - Not a silly question at all. :) I have some rough numbers showing this, but they're from an earlier iteration of this prototype. Off the top of my head here are the quick vitals:

  • Given a binary buffer that represents UTF-8 data, creating a Utf8String instance from that buffer is approx. 5x faster than creating a String instance.

  • For most payloads, UTF-8 consumes less memory than UTF-16. Our data show that most strings still consist of mainly ASCII data, even in programs used primarily by East Asian users. (East Asian is particularly interesting here because UTF-8 representation tends to expand it, not compact it, when compared to UTF-16.) Assuming you could flow Utf8String instances instead of String instances throughout your applications, you should expect to see a double-digit percentage decrease in overall memory utilization.

  • Utf8String is slightly more compact than using byte[] for the same data. It's only 4 bytes or so on average, but these add up quickly, especially when that represents 10 - 15% of the total object size of a small object.

  • Projecting from Utf8String back to ROS<byte> or ROM<byte> so that the data can be sent across the wire is an O(1) operation, compared to an O(n) operation for regular String.

  • Given ASCII Utf8String instances, certain text operations (ToUpperInvariant and friends) perform at around 2x the throughput of those same operations on String. This makes sense: they contain the same amount of information in half the memory.

There are also some drawbacks.

  • ToUpperInvariant on non-ASCII inputs (or ToUpper in locales like tr-TR, even over ASCII input) are much more expensive on Utf8String than they are on String. This is because the localization tables use UTF-16 representation natively, and calling into them requires temporarily transcoding the input data.

  • If you find yourself switching between Utf8String and String frequently, there will be considerable overhead to changing the representation. The best course of action is to stick with one representation for as long as you can, only switching between the two if / when absolutely necessary. This could make the type a bit rough to use until there's a healthy ecosystem of APIs built up around it. (And most APIs aren't going to be enlightened about this type since there's no real need to do so.)

In short, the best performance and memory utilization is going to be seen when reading data from the wire, keeping it in UTF-8 representation (instead of converting to UTF-16), performing operations on it while remaining in the UTF-8 world, then sending it back out again to wherever it needs to go.

@yahorsi
Copy link

yahorsi commented Mar 14, 2019

Am I right ASP.NET/Kestrel guys didn't try to utilise it yet?

@benaadams
Copy link
Member

Am I right ASP.NET/Kestrel guys didn't try to utilise it yet?

Its not been added yet so ASP.NET/Kestrel can't currently use it

@jkotas
Copy link
Member

jkotas commented Mar 14, 2019

ASP.NET/Kestrel

It is unclear whether this type would help Kestrel much. If you are just using UTF8 as an internal implementation detail, you can get all the performance benefits by simply using byte arrays and Spans.

We do not have data and agreement that it is a good idea to add this type. New string exchange type comes with extremely large cost to the ecosystem. It is why this is added as experimental feature under ifdef that is not going to ship in 3.0.

We do have an agreement that it is a good idea to add various Utf8 helpers that operate on Spans and support using UTF8 as an internal implementation detail. Those helpers are gradually being added.

@yahorsi
Copy link

yahorsi commented Mar 14, 2019

If we think about applications of the Utf8 there are a lot in the WEB world. But, currently, ASP.NET Core is defined using strings everywhere in public interfaces and that cannot be changed. So my bet is that unfortunately we won't be able to get enough benefits as in the final end we'll have to convert Utf8String anyway to just String. So, do you think it make sense to run another research to have same type String but have an option to still store Utf8 internally? Compiler could do something when he sees fixed & string, like call something to convert to utf16 first, or even additionally.

@jkotas
Copy link
Member

jkotas commented Mar 14, 2019

ASP.NET Core is defined using strings everywhere in public interfaces

It is not just ASP.NET Core. It is everything in .NET.

it make sense to run another research to have same type String but have an option to still store Utf8 internally

https://github.com/dotnet/coreclr/issues/7083 has discussions about potential designs and tradeoffs. It is a ton of complex work to make something like this work and the estimated returns are not that big (can get better returns on investment elsewhere)...

@GrabYourPitchforks
Copy link
Member Author

CI was clogged due to some outages from yesterday. I'm hoping that the latest iteration will make it through unscathed.

@GrabYourPitchforks
Copy link
Member Author

@yahorsi This is an area of active investigation runtime-wide and Framework-wide. The Utf8String type is one potential experiment, and it's possible this won't pan out. As @jkotas said this is all being done under an ifdef so that it can be ripped out easily if necessary.

There are discussions about the merits of the various proposals in existing tracking issues (see https://github.com/dotnet/coreclr/issues/7083, https://github.com/dotnet/corefx/issues/34094, https://github.com/dotnet/corefx/issues/30503, dotnet/corefxlab#2350, and dotnet/corefxlab#2368 for various discussions). It would help keep things more focused and organized if we reuse those existing issues for feature discussion and instead limit the conversation here on the actual code under review.

// One which operates on a single-byte (ASCII) search value,
// the other which operates on a multi-byte (non-ASCII) search value.

Span<byte> runeBytes = stackalloc byte[Utf8Utility.MaxBytesPerScalar];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As MaxBytesPerScalar is constant this method could be written without stackalloc as:

public unsafe bool Contains(Rune value)
{
    Debug.Assert(sizeof(int) == Utf8Utility.MaxBytesPerScalar);

    int tmp;
    var runeBytes = new Span<byte>(&tmp, Utf8Utility.MaxBytesPerScalar);
    value.TryEncodeToUtf8Bytes(runeBytes, out int runeBytesWritten);

    return SpanHelpers.IndexOf(
        ref DangerousGetMutableReference(), Length,
        ref MemoryMarshal.GetReference(runeBytes), runeBytesWritten) >= 0;
}

Results in better codegen, and ~5% perf-improvement in a benchmark based on the example-code below.

Contrieved example
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

namespace ConsoleApp3
{
    class Program
    {
        static unsafe void Main(string[] args)
        {
            var rune = new Rune('b');
            var utf8String = new Utf8String();

            bool b1 = utf8String.Contains1(rune);
            bool b2 = utf8String.Contains2(rune);

            if (b1 != b2) Environment.Exit(1);
        }
    }

    internal static class Utf8Utility
    {
        internal const int MaxBytesPerScalar = 4;
    }

    public sealed class Utf8String
    {
        private byte _reference = (byte)'a';

        public bool Contains1(Rune value)
        {
            Span<byte> runeBytes = stackalloc byte[Utf8Utility.MaxBytesPerScalar];
            value.TryEncodeToUtf8Bytes(runeBytes, out int runeBytesWritten);

            return runeBytes.IndexOf(_reference) >= 0;
        }

        public unsafe bool Contains2(Rune value)
        {
            Debug.Assert(sizeof(int) == Utf8Utility.MaxBytesPerScalar);

            int tmp;
            var runeBytes = new Span<byte>(&tmp, Utf8Utility.MaxBytesPerScalar);
            value.TryEncodeToUtf8Bytes(runeBytes, out int runeBytesWritten);

            return runeBytes.IndexOf(_reference) >= 0;
        }
    }

    public readonly struct Rune
    {
        private readonly char _value;

        public Rune(char value) => _value = value;

        [MethodImpl(MethodImplOptions.NoInlining)]
        public bool TryEncodeToUtf8Bytes(Span<byte> runeBytes, out int runeBytesWritten)
        {
            if (runeBytes.Length < sizeof(char))
            {
                runeBytesWritten = 0;
                return false;
            }

            runeBytesWritten = sizeof(char);
            char value = _value;
            MemoryMarshal.Write(runeBytes, ref value);
            return true;
        }
    }
}
; Assembly listing for method Utf8String:Contains1(struct):bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T01] (  3,  3   )     ref  ->  rdi         this class-hnd
;  V01 arg1         [V01    ] (  3,  3   )  struct ( 8) [rbp-0x30]   do-not-enreg[XS] addr-exposed ld-addr-op ptr
;* V02 loc0         [V02    ] (  0,  0   )  struct (16) zero-ref
;  V03 loc1         [V03    ] (  1,  1   )     int  ->  [rbp-0x38]   do-not-enreg[X] must-init addr-exposed ld-addr-op
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V05 tmp1         [V05    ] (  1,  1   )     blk ( 8) [rbp-0x28]   do-not-enreg[X] must-init addr-exposed ptr unsafe-buffer "stackallocLocal"
;* V06 tmp2         [V06    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V07 tmp3         [V07    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;  V08 tmp4         [V08,T11] (  2,  2   )     int  ->  rax         "Inline return value spill temp"
;* V09 tmp5         [V09    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;  V10 tmp6         [V10,T03] (  2,  4   )   ubyte  ->  rsi         ld-addr-op "Inlining Arg"
;* V11 tmp7         [V11    ] (  0,  0   )   byref  ->  zero-ref    ptr "impAppendStmt"
;  V12 tmp8         [V12,T04] (  2,  4   )     int  ->  rsi         "impAppendStmt"
;* V13 tmp9         [V13    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V14 tmp10        [V14    ] (  0,  0   )   byref  ->  zero-ref    ptr "Inlining Arg"
;  V15 tmp11        [V15,T05] (  3,  3   )   byref  ->  r14         V02._pointer(offs=0x00) P-INDEP "field V02._pointer (fldOffset=0x0)"
;  V16 tmp12        [V16,T06] (  2,  2   )     int  ->  rdi         V02._length(offs=0x08) P-INDEP "field V02._length (fldOffset=0x8)"
;  V17 tmp13        [V17,T07] (  2,  2   )   byref  ->  r14         V06._pointer(offs=0x00) P-INDEP "field V06._pointer (fldOffset=0x0)"
;  V18 tmp14        [V18,T12] (  2,  2   )     int  ->  rdi         V06._length(offs=0x08) P-INDEP "field V06._length (fldOffset=0x8)"
;  V19 tmp15        [V19,T08] (  2,  2   )   byref  ->  r14         V07._value(offs=0x00) P-INDEP "field V07._value (fldOffset=0x0)"
;  V20 tmp16        [V20,T09] (  2,  2   )   byref  ->  rdi         V09._pointer(offs=0x00) P-INDEP "field V09._pointer (fldOffset=0x0)"
;* V21 tmp17        [V21,T13] (  0,  0   )     int  ->  zero-ref    ptr V09._length(offs=0x08) P-INDEP "field V09._length (fldOffset=0x8)"
;* V22 tmp18        [V22    ] (  0,  0   )   byref  ->  zero-ref    ptr V13._pointer(offs=0x00) P-INDEP "field V13._pointer (fldOffset=0x0)"
;* V23 tmp19        [V23    ] (  0,  0   )     int  ->  zero-ref    ptr V13._length(offs=0x08) P-INDEP "field V13._length (fldOffset=0x8)"
;  V24 tmp20        [V24    ] (  3,  6   )  struct (16) [rbp-0x48]   do-not-enreg[XSFBA] multireg-arg must-init addr-exposed ptr "by-value struct argument"
;  V25 tmp21        [V25,T00] (  3,  6   )   byref  ->  rdx         stack-byref "BlockOp address local"
;  V26 tmp22        [V26,T02] (  2,  4   )   byref  ->  rsi         "argument with side effect"
;* V27 GsCookie     [V27    ] (  0,  0   )    long  ->  zero-ref    do-not-enreg[X] must-init addr-exposed "GSSecurityCookie"
;  V28 tmp24        [V28,T10] (  2,  2   )     ref  ->  rbx         this "shadowVar"
;  V29 tmp25        [V29    ] (  2,  2   )  struct ( 8) [rbp-0x50]   do-not-enreg[XSF] must-init addr-exposed ptr "shadowVar"
;
; Lcl frame size = 56

G_M25080_IG01:
       55                   push     rbp
       4156                 push     r14
       4155                 push     r13
       53                   push     rbx
       4883EC38             sub      rsp, 56
       488D6C2450           lea      rbp, [rsp+50H]
       4C8BEF               mov      r13, rdi
       488D7DB0             lea      rdi, [rbp-50H]
       B90E000000           mov      ecx, 14
       33C0                 xor      rax, rax
       F3AB                 rep stosd
       498BFD               mov      rdi, r13
       B8DC125B00           mov      eax, 0x5B12DC
       488945E0             mov      qword ptr [rbp-20H], rax
       8975D0               mov      dword ptr [rbp-30H], esi

G_M25080_IG02:
       480FBF75D0           movsx    rsi, word  ptr [rbp-30H]
       668975B0             mov      word  ptr [rbp-50H], si
       488BDF               mov      rbx, rdi

G_M25080_IG03:
       4C8D75D8             lea      r14, bword ptr [rbp-28H]
       BF04000000           mov      edi, 4
       488D75B0             lea      rsi, bword ptr [rbp-50H]
       488D55B8             lea      rdx, bword ptr [rbp-48H]
       4C8932               mov      bword ptr [rdx], r14
       897A08               mov      dword ptr [rdx+8], edi
       488BFE               mov      rdi, rsi
       488B75B8             mov      rsi, bword ptr [rbp-48H]
       488B55C0             mov      rdx, qword ptr [rbp-40H]
       488D4DC8             lea      rcx, bword ptr [rbp-38H]
       E8F3E8FFFF           call     Rune:TryEncodeToUtf8Bytes(struct,byref):bool:this
       498BFE               mov      rdi, r14
       400FB67308           movzx    rsi, byte  ptr [rbx+8]
       BA04000000           mov      edx, 4
       E8B170FFFF           call     SpanHelpers:IndexOf(byref,ubyte,int):int
       85C0                 test     eax, eax
       0F9DC0               setge    al
       0FB6C0               movzx    rax, al
       48817DE0DC125B00     cmp      qword ptr [rbp-20H], 0x5B12DC
       7405                 je       SHORT G_M25080_IG04
       E8F248B778           call     CORINFO_HELP_FAIL_FAST

G_M25080_IG04:
       90                   nop

G_M25080_IG05:
       488D65E8             lea      rsp, [rbp-18H]
       5B                   pop      rbx
       415D                 pop      r13
       415E                 pop      r14
       5D                   pop      rbp
       C3                   ret

; Total bytes of code 154, prolog size 43 for method Utf8String:Contains1(struct):bool:this
; ============================================================
; Assembly listing for method Utf8String:Contains2(struct):bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T01] (  3,  3   )     ref  ->  rbx         this class-hnd
;  V01 arg1         [V01    ] (  4,  4   )  struct ( 8) [rbp-0x20]   do-not-enreg[XS] addr-exposed ld-addr-op
;  V02 loc0         [V02    ] (  1,  1   )     int  ->  [rbp-0x24]   do-not-enreg[X] must-init addr-exposed ld-addr-op
;* V03 loc1         [V03    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op
;  V04 loc2         [V04    ] (  1,  1   )     int  ->  [rbp-0x30]   do-not-enreg[X] must-init addr-exposed ld-addr-op
;# V05 OutArgs      [V05    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;* V06 tmp1         [V06    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;  V07 tmp2         [V07,T09] (  2,  2   )     int  ->  rax         "Inline return value spill temp"
;* V08 tmp3         [V08    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;  V09 tmp4         [V09,T03] (  2,  4   )   ubyte  ->  rsi         ld-addr-op "Inlining Arg"
;* V10 tmp5         [V10    ] (  0,  0   )   byref  ->  zero-ref    "impAppendStmt"
;  V11 tmp6         [V11,T04] (  2,  4   )     int  ->  rsi         "impAppendStmt"
;* V12 tmp7         [V12    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V13 tmp8         [V13    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;  V14 tmp9         [V14    ] (  4,  4   )  ushort  ->  [rbp-0x20]   do-not-enreg[X] addr-exposed V01._value(offs=0x00) P-DEP "field V01._value (fldOffset=0x0)"
;  V15 tmp10        [V15,T05] (  3,  3   )   byref  ->  r14         V03._pointer(offs=0x00) P-INDEP "field V03._pointer (fldOffset=0x0)"
;  V16 tmp11        [V16,T06] (  2,  2   )     int  ->  rdi         V03._length(offs=0x08) P-INDEP "field V03._length (fldOffset=0x8)"
;  V17 tmp12        [V17,T07] (  2,  2   )   byref  ->  r14         V06._value(offs=0x00) P-INDEP "field V06._value (fldOffset=0x0)"
;  V18 tmp13        [V18,T08] (  2,  2   )   byref  ->  rdi         V08._pointer(offs=0x00) P-INDEP "field V08._pointer (fldOffset=0x0)"
;* V19 tmp14        [V19,T10] (  0,  0   )     int  ->  zero-ref    V08._length(offs=0x08) P-INDEP "field V08._length (fldOffset=0x8)"
;* V20 tmp15        [V20    ] (  0,  0   )   byref  ->  zero-ref    V12._pointer(offs=0x00) P-INDEP "field V12._pointer (fldOffset=0x0)"
;* V21 tmp16        [V21    ] (  0,  0   )     int  ->  zero-ref    V12._length(offs=0x08) P-INDEP "field V12._length (fldOffset=0x8)"
;  V22 tmp17        [V22    ] (  3,  6   )  struct (16) [rbp-0x40]   do-not-enreg[XSFBA] multireg-arg must-init addr-exposed "by-value struct argument"
;  V23 tmp18        [V23,T00] (  3,  6   )   byref  ->  rdx         stack-byref "BlockOp address local"
;  V24 tmp19        [V24,T02] (  2,  4   )   byref  ->  rsi         "argument with side effect"
;
; Lcl frame size = 40

G_M25083_IG01:
       55                   push     rbp
       4156                 push     r14
       4155                 push     r13
       53                   push     rbx
       4883EC28             sub      rsp, 40
       488D6C2440           lea      rbp, [rsp+40H]
       4C8BEF               mov      r13, rdi
       488D7DC0             lea      rdi, [rbp-40H]
       B908000000           mov      ecx, 8
       33C0                 xor      rax, rax
       F3AB                 rep stosd
       498BFD               mov      rdi, r13
       8975E0               mov      dword ptr [rbp-20H], esi
       488BDF               mov      rbx, rdi

G_M25083_IG02:
       4C8D75DC             lea      r14, bword ptr [rbp-24H]
       BF04000000           mov      edi, 4
       488D75E0             lea      rsi, bword ptr [rbp-20H]
       488D55C0             lea      rdx, bword ptr [rbp-40H]
       4C8932               mov      bword ptr [rdx], r14
       897A08               mov      dword ptr [rdx+8], edi
       488BFE               mov      rdi, rsi
       488B75C0             mov      rsi, bword ptr [rbp-40H]
       488B55C8             mov      rdx, qword ptr [rbp-38H]
       488D4DD0             lea      rcx, bword ptr [rbp-30H]
       E87DC7FFFF           call     Rune:TryEncodeToUtf8Bytes(struct,byref):bool:this
       498BFE               mov      rdi, r14
       400FB67308           movzx    rsi, byte  ptr [rbx+8]
       BA04000000           mov      edx, 4
       E8BBFCFFFF           call     SpanHelpers:IndexOf(byref,ubyte,int):int
       85C0                 test     eax, eax
       0F9DC0               setge    al
       0FB6C0               movzx    rax, al

G_M25083_IG03:
       488D65E8             lea      rsp, [rbp-18H]
       5B                   pop      rbx
       415D                 pop      r13
       415E                 pop      r14
       5D                   pop      rbp
       C3                   ret

; Total bytes of code 120, prolog size 34 for method Utf8String:Contains2(struct):bool:this
; ============================================================

But I'd like to see the JIT recognizing this and doing this optimization for us.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfoidl I was experimenting in a separate branch with moving some of this into the JIT if possible, using a technique very similar to what you had done. But it'll be a while before I get around to performing these sorts of optimizations. Pseudocode:

public bool Contains(Rune value)
{
    if (JitHelpers.TryConstRuneToAscii(value, out byte valueAscii)
    {
        // if the JIT knows that 'value' is const and ASCII, go down this code path
        // (valueAscii would've gotten baked into the codegen as a constant)
        return SpanHelpers.IndexOf(..., valueAscii) >= 0
    }
    else if (JitHelpers.TryConstRuneToUtf8(value, out int data, out int dataLength))
    {
        // if the JIT knows that 'value' is const and not ASCII, go down this code path
        // (data and dataLength would've gotten baked into the codegen as constants)
        return SpanHelpers.IndexOf(..., ref Unsafe.As<int, byte>(ref data), dataLength) >= 0;
    }
    else
    {
        // if the JIT can't reason about the const-ness of 'value', go down this code path
        // stackalloc and convert at runtime
    }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(In the above example it would be perfectly valid for the TryConstXyz methods to be hardcoded to always return false, and the first two branches would be folded away by the JIT. The end result is that the code would still work; it just wouldn't be as optimized.)

@GrabYourPitchforks GrabYourPitchforks merged commit 1f3f474 into master Mar 19, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corefx that referenced this pull request Mar 19, 2019
Utf8String is an experimental type that is string-like (heap-allocated, immutable, variable-length, null-terminated) but whose inner representation is UTF-8, not UTF-16.

This is a skeleton implementation of the basic API shape. The ecosystem of APIs has not yet been built around it. All Utf8String-related code is currently surrounded by ifdefs to allow easy identification and removal from release branches.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/mono that referenced this pull request Mar 19, 2019
Utf8String is an experimental type that is string-like (heap-allocated, immutable, variable-length, null-terminated) but whose inner representation is UTF-8, not UTF-16.

This is a skeleton implementation of the basic API shape. The ecosystem of APIs has not yet been built around it. All Utf8String-related code is currently surrounded by ifdefs to allow easy identification and removal from release branches.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
marek-safar pushed a commit to mono/mono that referenced this pull request Mar 19, 2019
Utf8String is an experimental type that is string-like (heap-allocated, immutable, variable-length, null-terminated) but whose inner representation is UTF-8, not UTF-16.

This is a skeleton implementation of the basic API shape. The ecosystem of APIs has not yet been built around it. All Utf8String-related code is currently surrounded by ifdefs to allow easy identification and removal from release branches.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
stephentoub pushed a commit to dotnet/corefx that referenced this pull request Mar 19, 2019
Utf8String is an experimental type that is string-like (heap-allocated, immutable, variable-length, null-terminated) but whose inner representation is UTF-8, not UTF-16.

This is a skeleton implementation of the basic API shape. The ecosystem of APIs has not yet been built around it. All Utf8String-related code is currently surrounded by ifdefs to allow easy identification and removal from release branches.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
@jkotas jkotas deleted the dev/utf8string branch March 19, 2019 19:27
@jkotas jkotas restored the dev/utf8string branch March 19, 2019 19:27
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corert that referenced this pull request Mar 21, 2019
Utf8String is an experimental type that is string-like (heap-allocated, immutable, variable-length, null-terminated) but whose inner representation is UTF-8, not UTF-16.

This is a skeleton implementation of the basic API shape. The ecosystem of APIs has not yet been built around it. All Utf8String-related code is currently surrounded by ifdefs to allow easy identification and removal from release branches.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jkotas pushed a commit to dotnet/corert that referenced this pull request Mar 21, 2019
Utf8String is an experimental type that is string-like (heap-allocated, immutable, variable-length, null-terminated) but whose inner representation is UTF-8, not UTF-16.

This is a skeleton implementation of the basic API shape. The ecosystem of APIs has not yet been built around it. All Utf8String-related code is currently surrounded by ifdefs to allow easy identification and removal from release branches.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
alexanderkyte pushed a commit to alexanderkyte/mono that referenced this pull request Mar 27, 2019
Utf8String is an experimental type that is string-like (heap-allocated, immutable, variable-length, null-terminated) but whose inner representation is UTF-8, not UTF-16.

This is a skeleton implementation of the basic API shape. The ecosystem of APIs has not yet been built around it. All Utf8String-related code is currently surrounded by ifdefs to allow easy identification and removal from release branches.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
@stephentoub stephentoub deleted the dev/utf8string branch June 12, 2019 20:08
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Utf8String is an experimental type that is string-like (heap-allocated, immutable, variable-length, null-terminated) but whose inner representation is UTF-8, not UTF-16.

This is a skeleton implementation of the basic API shape. The ecosystem of APIs has not yet been built around it. All Utf8String-related code is currently surrounded by ifdefs to allow easy identification and removal from release branches.

Commit migrated from dotnet/coreclr@1f3f474
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
9 participants