Intrinsicify SpanHelpers.IndexOf{Any}(byte, ...) #22118

Merged
jkotas merged 12 commits into dotnet:master from benaadams:SpanHelpers.IndexOf on Jan 24, 2019

Conversation

9 participants
@benaadams
Collaborator

benaadams commented Jan 21, 2019

Learnings from #22019; improvement on #21073

Applied to

internal static partial class SpanHelpers
{
    static int IndexOf(ref byte searchSpace, byte value, int length);
    static int IndexOfAny(ref byte searchSpace, byte value0, byte value1, int length);
    static int IndexOfAny(ref byte searchSpace, byte value0, byte value1, byte value2, int length);
}

MoveMask/vpmovmskb, used for equality in the vectorized path, has already flagged the bits that match; so if we use the specific hardware intrinsics rather than the generic Vector&lt;T&gt; API, we just need to determine the lowest set bit, rather than do further processing to determine the element offset.
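As an illustrative sketch of the idea (hypothetical code, not the PR's actual implementation; it uses the Sse2 intrinsics plus BitOperations.TrailingZeroCount, which postdates this PR — the PR itself used a BitOps helper with a software fallback):

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class IndexOfSketch
{
    // Searches a single 16-byte chunk starting at p; returns the lane index
    // of the first match, or -1 if no lane matched.
    public static unsafe int IndexOfInChunk(byte* p, byte value)
    {
        Vector128<byte> comparison = Vector128.Create(value);
        Vector128<byte> search = Sse2.LoadVector128(p);

        // CompareEqual sets matching lanes to 0xFF; MoveMask (vpmovmskb) packs
        // each lane's high bit into an int, so bit n is set iff lane n matched.
        int matches = Sse2.MoveMask(Sse2.CompareEqual(search, comparison));
        if (matches == 0)
            return -1;

        // The lowest set bit is the first matching element, so a single tzcnt
        // yields the element offset directly; no per-element re-scan needed.
        return BitOperations.TrailingZeroCount(matches);
    }
}
```

With the generic Vector&lt;T&gt; path there is no mask of match bits to consume, so the scalar epilogue has to locate the matching element itself; consuming the MoveMask result directly is what removes that extra work.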

-; Total bytes of code 597, prolog size 5 for method SpanHelpers:IndexOf(byref,ubyte,int):int
+; Total bytes of code 549, prolog size 5 for method SpanHelpers:IndexOf(byref,ubyte,int):int

Performance measurements:

Array length 512: significant improvements (up to more than double) with the found item in any position. #22118 (comment)

Array lengths >= 32 with the item in the last position: significant improvements (up to more than double). #22118 (comment)

/cc @CarolEidt @fiigii @tannergooding @ahsonkhan


benaadams commented Jan 21, 2019

Working on perf numbers


benaadams commented Jan 21, 2019

Some interesting speed bumps in there; might be able to do better.


benaadams commented Jan 21, 2019

Array length 512

 Method   | Pos |    Current |         PR | % Improvement |
----------|-----|------------|------------|---------------|
  IndexOf |   0 |   6.403 ns |   4.461 ns |        +43.5% |
  IndexOf |   1 |   6.256 ns |   4.516 ns |        +38.5% |
  IndexOf |   2 |   6.192 ns |   4.494 ns |        +37.7% |
  IndexOf |   3 |   6.425 ns |   4.460 ns |        +44.0% |
  IndexOf |   4 |   6.394 ns |   4.420 ns |        +44.6% |
  IndexOf |   7 |   6.176 ns |   4.424 ns |        +39.6% |
  IndexOf |   8 |   6.265 ns |   4.464 ns |        +40.3% |
  IndexOf |   9 |   6.202 ns |   4.564 ns |        +35.8% |
  IndexOf |  10 |   6.274 ns |   4.458 ns |        +40.7% |
  IndexOf |  11 |   6.198 ns |   4.607 ns |        +34.5% |
  IndexOf |  12 |   6.537 ns |   4.469 ns |        +46.2% |
  IndexOf |  13 |   7.124 ns |   4.421 ns |        +61.1% |
  IndexOf |  14 |   7.530 ns |   4.429 ns |        +70.0% |
  IndexOf |  15 |   7.055 ns |   4.401 ns |        +60.3% |
  IndexOf |  16 |   9.521 ns |   4.421 ns |       +115.3% |
  IndexOf |  17 |   9.510 ns |   4.863 ns |        +95.5% |
  IndexOf |  18 |   9.452 ns |   4.531 ns |       +108.6% |
  IndexOf |  19 |   9.518 ns |   4.535 ns |       +109.8% |
  IndexOf |  20 |   9.511 ns |   4.499 ns |       +111.4% |
  IndexOf |  21 |   9.501 ns |   4.322 ns |       +119.8% |
  IndexOf |  22 |   9.458 ns |   4.418 ns |       +114.0% |
  IndexOf |  23 |   9.278 ns |   4.579 ns |       +102.6% |
  IndexOf |  24 |   9.904 ns |   4.637 ns |       +113.5% |
  IndexOf |  25 |   9.697 ns |   4.667 ns |       +107.7% |
  IndexOf |  26 |   9.720 ns |   4.459 ns |       +117.9% |
  IndexOf |  27 |   9.819 ns |   4.380 ns |       +124.1% |
  IndexOf |  28 |   9.795 ns |   4.520 ns |       +116.7% |
  IndexOf |  29 |   9.584 ns |   4.628 ns |       +107.0% |
  IndexOf |  30 |   9.751 ns |   4.457 ns |       +118.7% |
  IndexOf |  31 |   9.778 ns |   4.520 ns |       +116.3% |
  IndexOf |  32 |  10.110 ns |   6.077 ns |        +66.3% |
  IndexOf |  33 |   9.933 ns |   6.129 ns |        +62.0% |
  IndexOf |  34 |   9.922 ns |   6.183 ns |        +60.4% |
  IndexOf |  35 |  10.034 ns |   6.155 ns |        +63.0% |
  IndexOf |  36 |   9.994 ns |   5.244 ns |        +90.5% |
  IndexOf |  37 |  10.059 ns |   6.211 ns |        +61.9% |
  IndexOf |  38 |  10.101 ns |   5.085 ns |        +98.6% |
  IndexOf |  39 |  10.096 ns |   6.255 ns |        +61.4% |
  IndexOf |  40 |  10.147 ns |   5.560 ns |        +82.5% |
  IndexOf |  41 |  10.449 ns |   6.150 ns |        +69.9% |
  IndexOf |  42 |  10.161 ns |   5.542 ns |        +83.3% |
  IndexOf |  43 |  10.296 ns |   5.075 ns |       +102.8% |
  IndexOf |  44 |  10.240 ns |   4.993 ns |       +105.0% |
  IndexOf |  45 |  10.385 ns |   4.956 ns |       +109.5% |
  IndexOf |  46 |  10.246 ns |   6.147 ns |        +66.6% |
  IndexOf |  47 |  10.427 ns |   4.988 ns |       +109.0% |
  IndexOf |  48 |  10.877 ns |   6.229 ns |        +74.6% |
  IndexOf |  49 |  10.688 ns |   6.090 ns |        +75.5% |
  IndexOf |  50 |  10.895 ns |   6.019 ns |        +81.0% |
  IndexOf |  51 |  10.919 ns |   6.206 ns |        +75.9% |
  IndexOf |  52 |  10.591 ns |   6.012 ns |        +76.1% |
  IndexOf |  53 |  10.772 ns |   6.232 ns |        +72.8% |
  IndexOf |  54 |  10.760 ns |   6.212 ns |        +73.2% |
  IndexOf |  55 |  10.898 ns |   5.241 ns |       +107.9% |
  IndexOf |  56 |  10.776 ns |   4.983 ns |       +116.2% |
  IndexOf |  57 |  11.472 ns |   6.150 ns |        +86.5% |
  IndexOf |  58 |  10.756 ns |   7.093 ns |        +51.6% |
  IndexOf |  59 |  10.946 ns |   4.891 ns |       +123.7% |
  IndexOf |  60 |  10.745 ns |   5.260 ns |       +104.2% |
  IndexOf |  61 |  10.889 ns |   6.158 ns |        +76.8% |
  IndexOf |  62 |  10.755 ns |   6.050 ns |        +77.7% |
  IndexOf |  63 |  10.777 ns |   4.971 ns |       +116.7% |
  IndexOf |  64 |  11.007 ns |   5.782 ns |        +90.3% |
  IndexOf |  65 |  11.187 ns |   6.599 ns |        +69.5% |
  IndexOf |  66 |  11.009 ns |   5.690 ns |        +93.4% |
  IndexOf |  67 |  11.129 ns |   5.653 ns |        +96.8% |
  IndexOf |  68 |  11.129 ns |   5.775 ns |        +92.7% |
  IndexOf |  69 |  10.839 ns |   6.633 ns |        +63.4% |
  IndexOf |  70 |  11.207 ns |   5.734 ns |        +95.4% |
  IndexOf |  71 |  11.139 ns |   5.611 ns |        +98.5% |
  IndexOf |  72 |  11.345 ns |   5.701 ns |        +99.0% |
  IndexOf |  73 |  11.559 ns |   6.728 ns |        +71.8% |
  IndexOf |  74 |  11.396 ns |   5.614 ns |       +102.9% |
  IndexOf |  75 |  11.271 ns |   6.672 ns |        +68.9% |
  IndexOf |  76 |  11.282 ns |   6.741 ns |        +67.3% |
  IndexOf |  77 |  11.889 ns |   5.737 ns |       +107.2% |
  IndexOf |  78 |  11.292 ns |   6.755 ns |        +67.1% |
  IndexOf |  79 |  11.417 ns |   6.746 ns |        +69.2% |
  IndexOf |  80 |  11.843 ns |   5.734 ns |       +106.5% |
  IndexOf |  81 |  11.679 ns |   6.817 ns |        +71.3% |
  IndexOf |  82 |  11.755 ns |   5.670 ns |       +107.3% |
  IndexOf |  83 |  11.757 ns |   5.736 ns |       +104.9% |
  IndexOf |  84 |  11.785 ns |   5.743 ns |       +105.2% |
  IndexOf |  85 |  11.625 ns |   6.828 ns |        +70.2% |
  IndexOf |  86 |  11.900 ns |   6.753 ns |        +76.2% |
  IndexOf | 126 |  10.102 ns |   6.467 ns |        +56.2% |
  IndexOf | 127 |  10.030 ns |   6.393 ns |        +56.8% |
  IndexOf | 128 |  10.144 ns |   7.130 ns |        +42.2% |
  IndexOf | 129 |  10.342 ns |   7.978 ns |        +29.6% |
  IndexOf | 130 |  10.213 ns |   8.016 ns |        +27.4% |
  IndexOf | 131 |  10.170 ns |   7.112 ns |        +42.9% |
  IndexOf | 250 |  13.517 ns |   9.686 ns |        +39.5% |
  IndexOf | 251 |  13.416 ns |   9.778 ns |        +37.2% |
  IndexOf | 252 |  13.473 ns |   7.693 ns |        +75.1% |
  IndexOf | 253 |  13.629 ns |   9.824 ns |        +38.7% |
  IndexOf | 254 |  13.355 ns |   9.702 ns |        +37.6% |
  IndexOf | 255 |  26.023 ns |  22.202 ns |        +17.2% |

Benchmark code:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    private static byte[] s_source;
    private const int Iters = 100;

    public static void Main(string[] args)
    {
        var summary = BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    [Params(
        0, 1, 2, 3, 4, 7, 
        8, 9, 10, 11, 12, 13, 14, 15, 
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86,
        126, 127, 128, 129, 130, 131,
        250, 251, 252, 253, 254, 255)]
    public int Position { get; set; }

    [Benchmark(OperationsPerInvoke = Iters)]
    public int IndexOf()
    {
        int total = 0;
        byte value = (byte)Position;
        ReadOnlySpan<byte> span = s_source;
        for (int i = 0; i < Iters; i++)
        {
            total += span.IndexOf(value);
        }

        return total;
    }

    [GlobalSetup]
    public void Setup()
    {
        // 0..255, repeated twice, gives the 512-byte array
        var e = Enumerable.Range(0, 256).Select(b => (byte)b);
        s_source = e.Concat(e).ToArray();
    }
}

benaadams commented Jan 21, 2019

Array length = Length, item at position Length-1 (the generated asm for lengths &lt; 32 is identical)

 Method   | Length |   Current |        PR |  Change |
----------|--------|-----------|-----------|---------|
IndexOf |     32 | 10.656 ns |  4.341 ns | +145.4% |
IndexOf |     33 | 11.519 ns |  6.885 ns |  +67.3% |
IndexOf |     34 | 11.598 ns |  7.919 ns |  +46.4% |
IndexOf |     35 | 12.137 ns |  9.070 ns |  +33.8% |
IndexOf |     36 | 11.576 ns |  6.869 ns |  +68.5% |
IndexOf |     37 | 12.408 ns |  7.759 ns |  +59.9% |
IndexOf |     38 | 12.698 ns |  8.847 ns |  +43.5% |
IndexOf |     39 | 13.556 ns |  9.545 ns |  +42.0% |
IndexOf |     40 | 12.697 ns |  7.737 ns |  +64.1% |
IndexOf |     41 | 13.432 ns |  8.837 ns |  +51.9% |
IndexOf |     42 | 13.722 ns |  9.884 ns |  +38.8% |
IndexOf |     43 | 14.293 ns | 10.869 ns |  +31.5% |
IndexOf |     44 | 13.778 ns |  8.467 ns |  +62.7% |
IndexOf |     45 | 14.098 ns |  9.696 ns |  +45.4% |
IndexOf |     46 | 14.577 ns | 10.531 ns |  +38.4% |
IndexOf |     47 | 14.968 ns | 11.591 ns |  +29.1% |
IndexOf |     48 | 14.570 ns |  9.746 ns |  +49.4% |
IndexOf |     49 | 15.488 ns | 10.501 ns |  +47.4% |
IndexOf |     50 | 15.768 ns | 11.235 ns |  +40.3% |
IndexOf |     51 | 15.911 ns | 12.203 ns |  +30.3% |
IndexOf |     52 | 15.725 ns | 11.275 ns |  +39.4% |
IndexOf |     53 | 16.409 ns | 11.763 ns |  +39.4% |
IndexOf |     54 | 16.784 ns | 12.757 ns |  +31.5% |
IndexOf |     55 | 17.012 ns | 13.623 ns |  +24.8% |
IndexOf |     56 | 16.614 ns | 11.852 ns |  +40.1% |
IndexOf |     57 | 17.657 ns | 12.829 ns |  +37.6% |
IndexOf |     58 | 17.540 ns | 12.969 ns |  +35.2% |
IndexOf |     59 | 18.289 ns | 13.645 ns |  +34.0% |
IndexOf |     60 | 17.954 ns | 13.319 ns |  +34.7% |
IndexOf |     61 | 18.590 ns | 13.571 ns |  +36.9% |
IndexOf |     62 | 18.875 ns | 14.356 ns |  +31.4% |
IndexOf |     63 | 19.249 ns | 15.241 ns |  +26.2% |
IndexOf |     64 | 13.423 ns |  5.675 ns | +136.5% |
IndexOf |     65 | 14.752 ns |  7.573 ns |  +94.7% |
IndexOf |     66 | 17.021 ns |  8.455 ns | +101.3% |
IndexOf |     67 | 17.204 ns |  9.857 ns |  +74.5% |
IndexOf |     68 | 14.591 ns |  7.044 ns | +107.1% |
IndexOf |     69 | 16.171 ns |  8.305 ns |  +94.7% |
IndexOf |     70 | 17.683 ns | 10.001 ns |  +76.8% |
IndexOf |     71 | 17.657 ns |  9.873 ns |  +78.8% |
IndexOf |     72 | 15.210 ns |  7.629 ns |  +99.3% |
IndexOf |     73 | 15.957 ns |  9.059 ns |  +76.1% |
IndexOf |     74 | 17.029 ns | 10.633 ns |  +60.1% |
IndexOf |     75 | 18.380 ns | 11.072 ns |  +66.0% |
IndexOf |     76 | 16.562 ns |  8.931 ns |  +85.4% |
IndexOf |     77 | 17.548 ns |  9.685 ns |  +81.1% |
IndexOf |     78 | 18.575 ns | 10.918 ns |  +70.1% |
IndexOf |     79 | 19.531 ns | 11.723 ns |  +66.6% |
IndexOf |     80 | 11.346 ns |  9.681 ns |  +17.1% |
IndexOf |     81 | 12.372 ns | 11.069 ns |  +11.7% |
IndexOf |     82 | 15.016 ns | 11.506 ns |  +30.5% |
IndexOf |     83 | 14.934 ns | 13.004 ns |  +14.8% |
IndexOf |     84 | 11.788 ns | 10.907 ns |   +8.0% |
IndexOf |     85 | 13.532 ns | 11.802 ns |  +14.6% |
IndexOf |     86 | 15.145 ns | 13.030 ns |  +16.2% |
IndexOf |    126 | 17.694 ns | 15.026 ns |  +17.7% |
IndexOf |    127 | 17.632 ns | 15.711 ns |  +12.2% |
IndexOf |    128 | 14.848 ns |  6.428 ns | +130.9% |
IndexOf |    129 | 16.267 ns |  8.638 ns |  +88.3% |
IndexOf |    130 | 19.239 ns | 10.284 ns |  +87.0% |
IndexOf |    131 | 19.029 ns | 10.843 ns |  +75.4% |
IndexOf |    250 | 19.764 ns | 16.818 ns |  +17.5% |
IndexOf |    251 | 21.105 ns | 16.881 ns |  +25.0% |
IndexOf |    252 | 17.472 ns | 15.944 ns |   +9.5% |
IndexOf |    253 | 19.112 ns | 16.290 ns |  +17.3% |
IndexOf |    254 | 20.948 ns | 17.231 ns |  +21.5% |
IndexOf |    255 | 21.242 ns | 17.344 ns |  +22.4% |

Benchmark code:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    private static byte[] s_source;
    private const int Iters = 100;

    public static void Main(string[] args)
    {
        var summary = BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    [Params(
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86,
        126, 127, 128, 129, 130, 131,
        250, 251, 252, 253, 254, 255)]
    public int Length { get; set; }

    [Benchmark(OperationsPerInvoke = Iters)]
    public int IndexOf()
    {
        int total = 0;
        byte value = (byte)(Length - 1);
        ReadOnlySpan<byte> span = s_source;
        for (int i = 0; i < Iters; i++)
        {
            total += span.IndexOf(value);
        }

        return total;
    }

    [GlobalSetup]
    public void Setup()
    {
        s_source = Enumerable.Range(0, Length).Select(b => (byte)b).ToArray();
    }
}

benaadams commented Jan 22, 2019

Not sure if we need to worry about coreclr-ci yet?

Mostly it seems good, other than what looks to be an infra issue:

Test Pri0 Windows_NT x64 checked Job

tools\Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(87,5): error : 
RestApiException: An Unexpected error occured when processing the request. [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Client.WorkItem.ListInternalAsync(String job, CancellationToken cancellationToken) in /_/src/Microsoft.DotNet.Helix/Client/CSharp/generated-code/WorkItem.cs:line 89 [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Client.WorkItem.ListAsync(String job, CancellationToken cancellationToken) in /_/src/Microsoft.DotNet.Helix/Client/CSharp/generated-code/WorkItem.cs:line 47 [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Client.HelixApi.RetryAsync[T](Func`1 function, Action`1 logRetry, Func`2 isRetryable) [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Sdk.HelixWait.WaitForHelixJobAsync(String jobName) in /_/src/Microsoft.DotNet.Helix/Sdk/HelixWait.cs:line 70 [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Sdk.HelixWait.ExecuteCore() in /_/src/Microsoft.DotNet.Helix/Sdk/HelixWait.cs:line 60 [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Sdk.HelixTask.Execute() in /_/src/Microsoft.DotNet.Helix/Sdk/HelixTask.cs:line 43 [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
[F:\vsagent\11\s\tests\helixpublishwitharcade.proj]

@benaadams benaadams force-pushed the benaadams:SpanHelpers.IndexOf branch from e2177ec to 497cb8c Jan 22, 2019

Move TrailingZeroCountFallback to common SpanHelpers
So it can be used by other types than byte
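Where TZCNT hardware support is unavailable, a trailing-zero-count fallback is commonly implemented with a de Bruijn multiply and a small lookup table; a sketch of that well-known technique (illustrative, not necessarily the exact coreclr implementation):

```csharp
using System;

static class BitHelpers
{
    // 32-entry lookup indexed by the top 5 bits of the de Bruijn product
    // (table from the standard "Bit Twiddling Hacks" construction).
    private static readonly byte[] s_deBruijnTable =
    {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };

    // Returns the number of trailing zero bits in match; match must be non-zero.
    public static int TrailingZeroCountFallback(uint match)
    {
        // Isolate the lowest set bit (two's complement trick), then multiply by
        // the de Bruijn constant so each power of two maps to a unique 5-bit index.
        uint lowestBit = match & (~match + 1u);
        return s_deBruijnTable[(lowestBit * 0x077CB531u) >> 27];
    }
}
```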

benaadams commented Jan 22, 2019

Same change would work for .IndexOf(char), .IndexOfAny(byte,...), .IndexOfAny(char,...) and .SequenceCompareTo(...)


benaadams commented Jan 22, 2019

Added .IndexOfAny(byte,...)

@benaadams benaadams changed the title Speedup SpanHelpers.IndexOf(byte) Speedup SpanHelpers.IndexOf{Any}(byte, ...) Jan 22, 2019


jkotas commented Jan 22, 2019

@fiigii @tannergooding Could you please take a look?

@jkotas jkotas requested review from fiigii and tannergooding Jan 22, 2019

@@ -199,10 +200,22 @@ public static unsafe int IndexOf(ref byte searchSpace, byte value, int length)
IntPtr index = (IntPtr)0; // Use IntPtr for arithmetic to avoid unnecessary 64->32->64 truncations


tannergooding Jan 22, 2019

Member

Do we know why this code is currently using IntPtr rather than the:

#if BIT64
using nint = System.Int64;
#else
using nint = System.Int32;
#endif

code that everywhere else seems to prefer?


jkotas Jan 22, 2019

Member

This is left-over from times when this lived in CoreFX and it was not compiled bitness-specific. It would be good for readability to switch this over to nuint.


benaadams Jan 22, 2019

Author Collaborator

It's a bit messy to clean that up in this PR; will do a follow-up.

nLength = (IntPtr)((Vector<byte>.Count - unaligned) & (Vector<byte>.Count - 1));
if (length >= Vector128<byte>.Count * 2)
{
int unaligned = (int)Unsafe.AsPointer(ref searchSpace) & (Vector128<byte>.Count - 1);


tannergooding Jan 22, 2019

Member

It would be really nice if the JIT would just optimize Unsafe.AsPointer(ref searchSpace) % Vector128<byte>.Count, or if we had a helper function for this type of code.


CarolEidt Jan 22, 2019

Member

@tannergooding - is there an open issue for this?


tannergooding Jan 22, 2019

Member

Not sure, I'll take a look in a little bit (and will log one if we don't). It's probably worth noting that this optimization is only applicable for unsigned types (IIRC).
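As a small illustration of why the % to AND-mask strength reduction only holds for unsigned (or provably non-negative) values: C#'s % keeps the sign of the dividend, so the identity breaks for negative signed inputs (hypothetical example):

```csharp
using System;

static class ModVsMask
{
    public static void Demo()
    {
        uint u = 0xFFFFFFF5u;
        // Unsigned: x % 16 == x & 15 always holds, so the JIT can emit an AND.
        Console.WriteLine(u % 16 == (u & 15)); // True

        int s = -11;
        // Signed: -11 % 16 == -11, but -11 & 15 == 5, so a plain AND would be wrong.
        Console.WriteLine(s % 16); // -11
        Console.WriteLine(s & 15); // 5
    }
}
```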


tannergooding Jan 23, 2019

Member

Looks like the JIT already does this optimization for unsigned types. This is possibly just something that is normally missed due to most inputs being signed by default.


benaadams Jan 23, 2019

Author Collaborator

Does it do it for an int that becomes a const when inlined? (as would be this case)


tannergooding Jan 23, 2019

Member

I'll test that case after lunch. It seems like it would be valid to do when the constant is known to be positive.

Vector256<byte> comparison = Vector256.Create(value);
do
{
Vector256<byte> search = Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.AddByteOffset(ref searchSpace, index));


tannergooding Jan 22, 2019

Member

Why ReadUnaligned rather than LoadVector256? Is it to avoid pinning?


benaadams Jan 22, 2019

Author Collaborator

No pinning here... :)

It's over raw byte data, which is generally the longest type of data, so it's probably best not to get in the way of the GC if it wants to compact the heap, rather than create a giant fixed block.
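For context, the two load styles under discussion can be sketched like this (illustrative helper names; the pointer-based variant is the hypothetical pinned alternative):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class LoadStyles
{
    // Unpinned style: the GC may move the array between iterations, but an
    // unaligned read through a managed ref stays correct either way.
    public static Vector256<byte> LoadByRef(ref byte searchSpace, IntPtr offset) =>
        Unsafe.ReadUnaligned<Vector256<byte>>(
            ref Unsafe.AddByteOffset(ref searchSpace, offset));

    // Pinned style: LoadVector256 takes a raw pointer, which requires fixing
    // the memory (or using pre-pinned/native memory) for the loop's duration.
    public static unsafe Vector256<byte> LoadPinned(byte* searchSpace, long offset) =>
        Avx.LoadVector256(searchSpace + offset);
}
```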


tannergooding Jan 22, 2019

Member

Hmmm, the initial assumption was that HWIntrinsics would likely be used with longer inputs and, in those cases, that pinning would be desirable (so alignment, cache-coherency, etc could be preserved).

If that isn't the case, we might want to revisit whether or not we also expose Load overloads that take ref T rather than just T*.


benaadams Jan 22, 2019

Author Collaborator

That's probably still the case for using them directly; but everything is getting plumbed through the SpanHelpers methods, so they are generic methods for all data sizes. e.g. String, Array, Span, ReadOnlySpan and CompareInfo all use SpanHelpers now rather than their own implementations.


benaadams Jan 22, 2019

Author Collaborator

Fixing the data would be required to ensure that the alignment check remains valid.

The user can pin the data if they want the alignment to remain fixed (though they may not know to do so); Kestrel's arrays are already pre-pinned so a second pin here would add no advantage, native memory pinning here would add no advantage, and for lengths &lt; 32 pinning would add no advantage; yet in all those cases it would be extra unnecessary work to fix the data (GC considerations aside).

Is there a significant performance benefit to using LoadVector256 over ReadUnaligned here?

Otherwise the misalignment issue will only affect the method if the GC moves the data; and then, well, you just had a GC, so that will probably have more impact than being misaligned afterwards; it's also likely a low-frequency event vs the number of times these methods are called.


jkotas Jan 22, 2019

Member

What would the index argument of the API be? We do not have native int public type yet...


tannergooding Jan 22, 2019

Member

What would the index argument of the API be? We do not have native int public type yet...

Presumably just IntPtr, as the existing AddByteOffset method takes: public static ref T AddByteOffset<T>(ref T source, IntPtr byteOffset).

This emits an IL signature using native int and works correctly with languages (like F#) that treat IntPtr as a native sized integer. This should also work with C#, unless they decide to create a new wrapper type (rather than use partial erasure) or unless they decide that the nint type should have a modreq/modopt.


jkotas Jan 22, 2019

Member

This should also work with C#, unless they decide to create a new wrapper type

I do not think there was a decision made on this one yet.


tannergooding Jan 22, 2019

Member

Right, what C# does is still pending further language discussion. However, we have already exposed a number of similar APIs that take IntPtr (mostly on S.R.CS.Unsafe), and the worst case scenario is that we may want to expose an additional convenience overload (likely just implemented in managed code) that casts from whatever C# uses to IntPtr.

It could also just be internal for the time being, to make our own code more readable.


benaadams Jan 23, 2019

Author Collaborator

Added local LoadVector256, LoadVector128, LoadVector and LoadUIntPtr to see how it looks; makes it more readable.


CarolEidt commented Jan 22, 2019

I would second the suggestions to factor out some of the common code. Otherwise it LGTM overall.


fiigii commented Jan 22, 2019

I also second the suggestions to factor out common code, but there are some points you may need to pay attention to:

  1. It is probably okay to call into the helper functions without inlining, since the helper functions seem not to take vector parameters (which means no additional vector copies). This is good for code size (I-cache), but adds a little calling overhead. It is worthwhile to collect perf data (e.g., with VTune) to verify.
  2. Please make sure the caller functions do not have vector variables live after the call sites, to avoid caller-saving vector registers.

benaadams commented Jan 23, 2019

A slight change from using the helper methods: the second use of BitOps.TrailingZeroCount now has an extra mov:

G_M10242_IG09:
       4533D2               xor      r10d, r10d
       F3450FBCD3           tzcnt    r10d, r11d
       438D0411             lea      eax, [r9+r10]
       EB55                 jmp      SHORT G_M10242_IG22

G_M10242_IG10:
-      33C0                 xor      eax, eax
-      F3410FBCC2           tzcnt    eax, r10d
-      4103C1               add      eax, r9d
+      418BC1               mov      eax, r9d
+      4533C0               xor      r8d, r8d
+      F3450FBCC2           tzcnt    r8d, r10d
+      4103C0               add      eax, r8d
       EB49                 jmp      SHORT G_M10242_IG22

Could also be a JIT update since the last time I generated the asm. Otherwise looks good.

benaadams added some commits Jan 23, 2019

@benaadams benaadams force-pushed the benaadams:SpanHelpers.IndexOf branch from 430a825 to 78e1f89 Jan 23, 2019


benaadams commented Jan 23, 2019

Skipped the bounds check in the software fallback (acdd8e9); I don't see any other changes to make. Any feedback?


tannergooding left a comment

LGTM

fiigii approved these changes Jan 23, 2019

fiigii left a comment

LGTM, just one question.

Found7:
return (int)(byte*)(index + 7);
return (int)(byte*)(offset + 7);


fiigii Jan 23, 2019

Collaborator

Can we use a variable to avoid the multiple goto targets? For example

if (uValue == Unsafe.AddByteOffset(ref searchSpace, offset + i))
{
    delta = i;
    goto Found;
}

...

Found: 
    return (int)(byte*)(offset + delta);


benaadams Jan 23, 2019

Author Collaborator

Not sure; that pushes more code into the fast chain (running through the compares) to avoid more in the slow area (the targets)?

Currently these sections look like this:

movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG14
lea      r11, [r9+2]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG15
lea      r11, [r9+3]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG16
lea      r11, [r9+4]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG17
lea      r11, [r9+5]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG18
lea      r11, [r9+6]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG19
lea      r11, [r9+7]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG20


fiigii Jan 23, 2019

Collaborator

pushes more code into the fast chain (running through the compares) to avoid more in the slow area (the targets)?

Hmm, I always prefer smaller code size over "tricky" loop-unrolling (if the loop body does not have long-latency instructions). Sometimes loop-unrolling can show slightly prettier perf data on microbenchmarks, but I believe the I-cache is the most precious resource for large real applications (e.g., ASP.NET servers).

Additionally, if a loop body is small enough, it gets special optimization on Intel architectures (see the "Loop Stream Detector" sections in the Intel optimization manual). I think most of the fast-chain loops could be really small.


benaadams Jan 23, 2019

Author Collaborator

For completeness the exit points are

G_M10207_IG11:
   mov      eax, -1
G_M10207_IG12:
   vzeroupper 
   ret  
G_M10207_IG13:
   mov      eax, r9d
   jmp      SHORT G_M10207_IG21
G_M10207_IG14:
   lea      rax, [r9+1]
   jmp      SHORT G_M10207_IG21
G_M10207_IG15:
   lea      rax, [r9+2]
   jmp      SHORT G_M10207_IG21
G_M10207_IG16:
   lea      rax, [r9+3]
   jmp      SHORT G_M10207_IG21
G_M10207_IG17:
   lea      rax, [r9+4]
   jmp      SHORT G_M10207_IG21
G_M10207_IG18:
   lea      rax, [r9+5]
   jmp      SHORT G_M10207_IG21
G_M10207_IG19:
   lea      rax, [r9+6]
   jmp      SHORT G_M10207_IG21
G_M10207_IG20:
   lea      rax, [r9+7]
G_M10207_IG21:
   vzeroupper 
   ret      


benaadams Jan 23, 2019

Author Collaborator

A follow up turning them to regular for loops might be worth investigating then?


jkotas Jan 23, 2019

Member

The current micro-benchmark-oriented performance culture in the .NET Core repos favors bigger streamlined code because it tends to produce better results in microbenchmarks. It is common wisdom (at Microsoft at least) that smaller code runs faster in real workloads because of the factors @fiigii mentioned. The code-bloating optimizations are worth it in maybe 1% of cases. You have seen me push back on the more extreme cases of code bloat. This case is on the edge; my gut feel is that this one is probably good as it is. We would need data about performance in real workloads to tell with confidence.


fiigii Jan 23, 2019

Collaborator

A follow up turning them to regular for loops might be worth investigating then?

Yes, that should be a follow-up work, not in this PR.


benaadams Jan 24, 2019

Author Collaborator

Definitely worth revisiting; there are lots of remarks that amount to "the CPU might not be happy with 8 conditional branches in quick succession", e.g.

Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four
branches in a 16-byte chunk.


benaadams Jan 24, 2019

Author Collaborator

Though I'm not sure how you could; here the jumps are 6 bytes and the compares are 4 bytes... it would be a push to get 4 compares and 4 jumps into 16 bytes.


benaadams commented Jan 23, 2019

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test


jkotas approved these changes Jan 23, 2019


benaadams commented Jan 23, 2019

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test


benaadams commented Jan 23, 2019

coreclr-ci passed, first time I've seen that happen! 😃

@jkotas jkotas merged commit 07d1e6b into dotnet:master Jan 24, 2019

23 checks passed

CentOS7.1 x64 Checked Innerloop Build and Test Build finished.
CentOS7.1 x64 Debug Innerloop Build Build finished.
Linux-musl x64 Debug Build Build finished.
OSX10.12 x64 Checked Innerloop Build and Test Build finished.
Ubuntu arm Cross Checked crossgen_comparison Build and Test Build finished.
Ubuntu arm Cross Release crossgen_comparison Build and Test Build finished.
Ubuntu x64 Checked CoreFX Tests Build finished.
Ubuntu x64 Checked Innerloop Build and Test Build finished.
Ubuntu x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0) Build finished.
Ubuntu x64 Formatting Build finished.
Windows_NT x64 Checked CoreFX Tests Build finished.
Windows_NT x64 Checked Innerloop Build and Test Build finished.
Windows_NT x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0) Build finished.
Windows_NT x64 Formatting Build finished.
Windows_NT x64 Release CoreFX Tests Build finished.
Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Windows_NT x64 min_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Windows_NT x86 Checked Innerloop Build and Test Build finished.
Windows_NT x86 Checked Innerloop Build and Test (Jit - TieredCompilation=0) Build finished.
Windows_NT x86 Release Innerloop Build and Test Build finished.
Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Windows_NT x86 min_opt ryujit CoreCLR Perf Tests Correctness Build finished.
license/cla All CLA requirements met.

Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corefx that referenced this pull request Jan 24, 2019

Speedup SpanHelpers.IndexOf{Any}(byte, ...) (dotnet/coreclr#22118)
* Speedup SpanHelpers.IndexOf(byte)

* 128 * 2 alignment

* Move TrailingZeroCountFallback to common SpanHelpers

So it can be used by other types than byte

* Speedup SpanHelpers.IndexOfAny(byte, ...)

* Indent for support flags

* More helpers, consistency in local names/formatting, feedback

* Skip bounds check in software fallback

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
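The core trick of the merged change is that `MoveMask`/`vpmovmskb` already packs the per-byte equality results into an integer whose lowest set bit *is* the matching element's offset, so a single trailing-zero count replaces any per-element post-processing. A minimal C analogue of that pattern (hypothetical `IndexOfByte` name, SSE2-only, not the actual C# implementation):

```c
#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>

// Sketch: find the first occurrence of `value` in `data`, 16 bytes at a time.
// _mm_movemask_epi8 gathers the compare result's sign bits into an int, so
// bit N set means element N matched; __builtin_ctz then yields the offset.
static int IndexOfByte(const unsigned char *data, unsigned char value, size_t length)
{
    __m128i target = _mm_set1_epi8((char)value);
    size_t i = 0;
    for (; i + 16 <= length; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(data + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, target));
        if (mask != 0)
            return (int)i + __builtin_ctz(mask);  // bit index == element index
    }
    for (; i < length; i++)  // scalar tail for the last < 16 bytes
        if (data[i] == value)
            return (int)i;
    return -1;
}
```

The real method also handles alignment, Vector256 widths, and a non-intrinsic fallback; this only shows the movemask-then-ctz shape.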

Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/mono that referenced this pull request Jan 24, 2019

Speedup SpanHelpers.IndexOf{Any}(byte, ...) (dotnet/coreclr#22118)
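The `TrailingZeroCountFallback` bullet in the commit message refers to the software path used when `Bmi1.TrailingZeroCount`/`tzcnt` is unavailable. A common way to write such a fallback is the de Bruijn multiply-and-lookup; the sketch below uses the standard published constant and table, not code copied from the coreclr source:

```c
#include <stdint.h>

// Portable trailing-zero count for a non-zero 32-bit value.
// `m & -m` isolates the lowest set bit (a power of two); multiplying by a
// de Bruijn constant makes the top 5 bits a unique index into a 32-entry
// table of bit positions.
static int TrailingZeroCountFallback(uint32_t match)
{
    static const int table[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };
    return table[((match & (0u - match)) * 0x077CB531u) >> 27];
}
```

Keeping this in a shared `SpanHelpers` file, as the commit does, lets the char and byte variants reuse one table instead of duplicating it per element type.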

marek-safar added a commit to mono/mono that referenced this pull request Jan 24, 2019

Speedup SpanHelpers.IndexOf{Any}(byte, ...) (dotnet/coreclr#22118)

stephentoub added a commit to dotnet/corefx that referenced this pull request Jan 24, 2019

Speedup SpanHelpers.IndexOf{Any}(byte, ...) (dotnet/coreclr#22118)

Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corert that referenced this pull request Jan 24, 2019

Speedup SpanHelpers.IndexOf{Any}(byte, ...) (dotnet/coreclr#22118)

@benaadams benaadams deleted the benaadams:SpanHelpers.IndexOf branch Jan 24, 2019

jkotas added a commit to dotnet/corert that referenced this pull request Jan 24, 2019

Speedup SpanHelpers.IndexOf{Any}(byte, ...) (dotnet/coreclr#22118)

@benaadams benaadams changed the title Speedup SpanHelpers.IndexOf{Any}(byte, ...) Intrinsicify SpanHelpers.IndexOf{Any}(byte, ...) Feb 3, 2019
