Adding Span IndexOf that takes an index and count #1278
Conversation
- Also fixing StartsWith impl
So what is the break-even point for IndexOf? I.e., at what point is slicing and then calling the regular IndexOf fast enough that the new IndexOf does not pay off?
Essentially, it is roughly the same for lengths < 10 (+/- 10%), and you get performance gains for lengths > 10 (the gains become significant in all cases by around 500). The results imply it is almost always better to use this new IndexOf rather than Slice followed by IndexOf, especially if the lookup value is close to the beginning of the span.
cc @KrzysztofCwalina Please take a look at the implementation and performance test results.
hmm, this is very surprising. I would think that the longer the search, the less benefit there would be, as the slicing operation gets amortized: Slice is O(1) and search is O(n), so as n gets larger the sum approaches O(n). cc: @vancem, @jkotas, @benaadams
Well, yes, of course Slice is O(1), but if you can slice closer to the index where the value might be found, you can pass that index to the new IndexOf overload too. If you keep the information consistent between the two approaches, the results make sense. In either case the O(n) search is the same: you index through the same 500 bytes or 5 bytes whether you slice first or not. With slicing, though, you pay for an extra Slice operation that you wouldn't have with the new IndexOf. Am I missing something?
Yeah, it's not surprising that slicing makes the operation slower. What's surprising is that the ratio gets more and more in favor of the overloads that take an index as the size of the buffer (and so the index of the found byte) gets larger.
This suggests that something else is going on, and the overload is not actually what's improving the performance.
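For concreteness, the two call shapes being compared can be sketched as follows. This is a minimal stand-alone illustration, not the PR's implementation: the `IndexOfAt` helper is a hypothetical stand-in for the proposed overload, written here in terms of `Slice`, and its return value is assumed to be relative to the original span.

```csharp
using System;

class SliceVsOverload
{
    // Hypothetical stand-in for the proposed IndexOf(value, start, count)
    // overload, expressed via Slice. The real PR implements this without
    // the intermediate slice; the absolute-index semantics are an assumption.
    static int IndexOfAt(ReadOnlySpan<byte> span, byte value, int start, int count)
    {
        int found = span.Slice(start, count).IndexOf(value);
        return found < 0 ? -1 : start + found;
    }

    static void Main()
    {
        var a = new byte[1000];
        a[999] = 99; // lookup value at the very end

        // Both calls scan the same 500 bytes; Slice only adds an O(1)
        // span construction up front, which is why the relative benefit
        // of avoiding it should shrink as the search length grows.
        int viaSlice = new ReadOnlySpan<byte>(a).Slice(500, 500).IndexOf((byte)99);
        int viaOverload = IndexOfAt(a, 99, 500, 500);

        Console.WriteLine(viaSlice);    // 499 (relative to the slice)
        Console.WriteLine(viaOverload); // 999 (relative to the whole span)
    }
}
```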
This seems to be a strange artefact of our benchmark testing. When I add the `InlineData(1000)` attribute, I get different results compared to when I have no `InlineData` attribute.

```csharp
[Benchmark(InnerIterationCount = 1000000)]
[InlineData(1000)]
public int SpanIndexOfWithIndexAndCountComparisonValueAtEnd(int length)
{
    const byte lookupVal = 99;
    var a = new byte[length];
    a[length - 1] = lookupVal;
    var bytes = new Span<byte>(a);
    int startIndex = length / 2;
    int count = length / 2;
    var result = -1;
    foreach (var iteration in Benchmark.Iterations)
    {
        using (iteration.StartMeasurement())
        {
            for (int i = 0; i < Benchmark.InnerIterationCount; i++)
            {
                result = bytes.IndexOf(lookupVal, startIndex, count);
            }
        }
    }
    Console.WriteLine(result);
    return result;
}
```

Runtime: 217.668 (much faster than Slice)

```csharp
[Benchmark(InnerIterationCount = 1000000)]
public int SpanIndexOfWithIndexAndCountComparisonValueAtEnd()
{
    const byte lookupVal = 99;
    var a = new byte[1000];
    a[999] = lookupVal;
    var bytes = new Span<byte>(a);
    const int startIndex = 500;
    const int count = 500;
    var result = -1;
    foreach (var iteration in Benchmark.Iterations)
    {
        using (iteration.StartMeasurement())
        {
            for (int i = 0; i < Benchmark.InnerIterationCount; i++)
            {
                result = bytes.IndexOf(lookupVal, startIndex, count);
            }
        }
    }
    Console.WriteLine(result);
    return result;
}
```

Runtime: 394.825 (similar to Slice)

For some reason the results remain consistent for `SpanIndexOfWithSliceComparisonValueAtEnd`.

cc @jorive, why is this discrepancy occurring?
I talked to @jorive offline and provided him the details of the test runs for investigation.

Going back to the earlier question: I am not sure it is relevant here.

```csharp
_sResult += bytes.IndexOf(lookupVal, startIndex, count);
```

vs.

```csharp
_sResult += bytes.Slice(startIndex, count).IndexOf(lookupVal);
```

Both of these loop through the same number of bytes to find the lookup value, but the second one has the Slice overhead. They should be on par with each other except for that overhead, as seen below:
I have run a few experiments myself, since these numbers do not make sense to me. As far as I can tell, for fast Span, Slice+IndexOf is always the same or better than the IndexOf that you are adding, when everything else is equal. I think the reason you are seeing improvements from the IndexOf that takes an index is some secondary effect (different register allocation). We need to understand and address these secondary effects rather than add this IndexOf overload. Here is the disassembly for the simplest of my experiments; I have also tried more complex cases with varying indices. The results were always the same or better for Slice+IndexOf.
Also, note the inlined code above. Instead of adding this new IndexOf overload, you should look into changing the IndexOf code in corefx to do the same.
I ran tests using Stopwatch and disagree with the results that you see, @jkotas:

```csharp
Stopwatch stopWatch = new Stopwatch();

byte[] a = new byte[100];
Span<byte> span = new Span<byte>(a);
stopWatch.Restart();
for (int i = 0; i < 2000000000; i++)
    span.Slice(1, 10).IndexOf(0);
Output(stopWatch.Elapsed);

a = new byte[100];
span = new Span<byte>(a);
stopWatch.Restart();
for (int i = 0; i < 2000000000; i++)
    span.IndexOf(0, 1, 10);
Output(stopWatch.Elapsed);
```

RunTime 00:00:08.91 // Slice

There is some discrepancy within the results that come from the benchmarks. One thing @jorive recommended to get more consistent results while benchmarking was to run the perf harness at high priority with affinity to a specific core. This does give more consistent results. The next step would be to figure out why the results through xunit.perf show IndexOf as almost 2x faster in some cases.

```csharp
const int length = 9;

GetDataValueAtEnd(length, out Span<byte> bytes, out int startIndex, out int count);
stopWatch.Restart();
for (int i = 0; i < 200000000; i++)
    TestBenchIndexOf(bytes, startIndex, count);
Output(stopWatch.Elapsed);

GetDataValueAtEnd(length, out bytes, out startIndex, out count);
stopWatch.Restart();
for (int i = 0; i < 200000000; i++)
    TestBenchSlice(bytes, startIndex, count);
Output(stopWatch.Elapsed);
```

RunTime 00:00:04.93 // Slice
Is there any insight from looking at the disassembly? If we trust our results, the IndexOf overload isn't faster in any scenario, so there don't seem to be any secondary effects improving performance.
Right, there are not any unexpected secondary effects in the piece of disassembly I shared. My understanding was that we had seen the improvement from the overload somewhere (correct me if I am wrong); if that is the case, it is what to drill into.
@ahsonkhan @KrzysztofCwalina @jkotas have you tried BenchmarkDotNet? We have spent hundreds (or rather thousands) of hours building a reliable library for benchmarking .NET code. Here is how the first of your benchmarks would look with BenchmarkDotNet:

```csharp
public class PerformanceTests
{
    ReadOnlySpan<byte> bytes;
    ReadOnlySpan<byte> slice;

    [Params(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 100, 500, 1000, 5000)]
    public int Length { get; set; }

    [Setup]
    public void Setup()
    {
        byte[] a = new byte[Length];
        byte[] b = { 1, 2, 3, 4, 5 };
        for (int i = 0; i < a.Length; i++)
        {
            a[i] = (byte)(i + 1);
        }
        bytes = new ReadOnlySpan<byte>(a);
        slice = new ReadOnlySpan<byte>(b, 0, Math.Min(Length, b.Length));
    }

    [Benchmark]
    public bool SpanStartsWith() => bytes.StartsWith(slice);
}
```

and then to run it:

```csharp
class Program
{
    static void Main(string[] args)
    {
        BenchmarkRunner.Run<PerformanceTests>();
    }
}
```
Yeah, we have. In general I love BenchmarkDotNet and use it in many cases. It has better front-end reporting capabilities and is more mature from a reliability and statistical-analysis perspective. But it seems BenchmarkDotNet does not support measuring instructions retired, cache misses, and branch mispredictions, which we often find very useful. It would be great to combine the efforts/features :-) cc: @cmckinsey
@KrzysztofCwalina, for Instructions Retired, you need to opt in on the benchmark. I also know cache misses and branch mispredictions are supported, as they are displayed on the internal benchmark comparison site (I'm not sure how to enable those ones, though).
Are you referring to BenchmarkDotNet?
Ah, no, I mixed up https://github.com/dotnet/BenchmarkDotNet with https://github.com/Microsoft/xunit-performance (we use the latter for most of the official perf tests in CoreCLR/CoreFX/Roslyn/etc.).
@KrzysztofCwalina I've been looking at this thread and working with @ahsonkhan. There are quite a few interesting points here, and I'm compiling my results as I'm typing this :)
@KrzysztofCwalina I talked to Valentin, and he provided me with more information about what the benchmarks are trying to achieve; it seems the body of the loop was intentionally made loop invariant. Based on that, I did a few runs and collected some statistics.

First, I modified the benchmarks, and now they look like the code below. Essentially, I moved the whole setup part out of the benchmark so that it is passed as a parameter when the benchmark is invoked:

```csharp
[Benchmark(InnerIterationCount = 100000)]
[MemberData(nameof(InputData))]
public static int IndexOf_(byte[] array, int startIndex, int count)
{
    var bytes = new Span<byte>(array);
    var result = -1;
    foreach (var iteration in Benchmark.Iterations)
    {
        using (iteration.StartMeasurement())
        {
            for (int i = 0; i < Benchmark.InnerIterationCount; ++i)
            {
                result = bytes.IndexOf(LookupVal, startIndex, count);
            }
        }
    }
    return result;
}

[Benchmark(InnerIterationCount = 100000)]
[MemberData(nameof(InputData))]
public static int Slice_(byte[] array, int startIndex, int count)
{
    var bytes = new Span<byte>(array);
    var result = -1;
    foreach (var iteration in Benchmark.Iterations)
    {
        using (iteration.StartMeasurement())
        {
            for (int i = 0; i < Benchmark.InnerIterationCount; ++i)
            {
                result = bytes.Slice(startIndex, count).IndexOf(LookupVal);
            }
        }
    }
    return result;
}

public static IEnumerable<object[]> InputData()
{
    int[] lengths = new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 100, 500, 1000, 5000 };
    foreach (var length in lengths)
    {
        var array = new byte[length];
        array[length - 1] = LookupVal;
        int startIndex = length / 2;
        int count = length - startIndex;
        // NOTE: Currently the types in the array below need to be serializable.
        yield return new object[] { array, startIndex, count };
    }
}

private const byte LookupVal = 99;
```

Second, I ran the benchmarks ten times and collected some results.

The table above compares the IndexOf vs. the Slice implementations. I then computed the ratio of their averages [base/diff == IndexOf/Slice] and added a last column, "IsReal". This column determines whether a result is to be trusted by comparing the ratio against two times the maximum of the computed sample standard deviations for base and diff. In general, the results are very noisy (this is even after ignoring the first iteration of a run for every test, a fix I will add in my next PR). I want to mention that I ran the benchmarks this way:

```bat
REM Running the process at high priority on core #2
START "SIMULATOR_PIPE" /B /WAIT /HIGH /AFFINITY 0x2 dotnet.exe run -c release
```

@KrzysztofCwalina @jkotas Please let me know if you have any questions or concerns regarding the performance API, writing benchmarks, or how I have collected this data (I can share the ETL traces generated by the API). Regards.
Here are the results I found from running tests within my stand-alone application. Given these results, we should push this PR through. @jkotas, thoughts?
@dotnet-bot test this please
As discussed over email, your results are against a CoreCLR build that is more than a month old and does not have any of the recent work on Span code quality. Please rerun the tests against the current CoreCLR.
Essentially, for Span specifically, calling Slice and then IndexOf is just as good (if not better) than calling the IndexOf overload that removes the need to Slice. https://gist.github.com/ahsonkhan/e56281d9d0a0df6cb6e7837602ffeab3
@ahsonkhan, so should we close this?

@KrzysztofCwalina, I believe we should. However, I think some of the tests and possibly benchmarks I wrote are useful, along with some of the minor StartsWith impl changes. I can clean up the changes to keep only the relevant parts.

@ahsonkhan, should we close it?

@KrzysztofCwalina, I haven't gotten a chance to refactor the tests and clean up the useful parts. Would it make more sense to close this PR and start a new one if necessary?

Your call.

I will take these tests and move them to a PR in corefx with StartsWith once the API review is in. I think this can be closed.
Approximately 5% performance gain calling Span IndexOf that takes an index and count versus Slice.
For Span.StartsWith using SequenceEqual, the break-even point is around a Span of size 3-4. Using SequenceEqual for span.Length > 3 gives 30%+ perf gains.
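As a rough sketch of the idea behind the StartsWith change above (assuming the SequenceEqual-based approach under discussion; this is not the exact corefx implementation, and the helper name is hypothetical):

```csharp
using System;

static class StartsWithSketch
{
    // StartsWith expressed via SequenceEqual: a length check followed by
    // a single bulk comparison of the leading bytes, which can beat a
    // byte-by-byte loop once the prefix is longer than a few elements.
    public static bool StartsWithViaSequenceEqual(this ReadOnlySpan<byte> span, ReadOnlySpan<byte> value)
        => value.Length <= span.Length &&
           span.Slice(0, value.Length).SequenceEqual(value);
}

class Demo
{
    static void Main()
    {
        ReadOnlySpan<byte> data = new byte[] { 1, 2, 3, 4, 5 };
        Console.WriteLine(data.StartsWithViaSequenceEqual(new byte[] { 1, 2, 3 })); // True
        Console.WriteLine(data.StartsWithViaSequenceEqual(new byte[] { 2, 3 }));    // False
    }
}
```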