Handle the remainder in MemoryExtensions.Count vectorized #82687

gfoidl · 2023-02-26T20:13:12Z

Description

In the current implementation, the remainder of the vectorized code path is processed by a scalar loop.
This PR processes the remainder vectorized too.

This is done by reading one vector from the end of the span, comparing it with the target vector, and extracting the most significant bits as usual.
Because this last vector may overlap elements that were already counted, the bits of the overlapping elements must be shifted out of the mask to get the correct count.

Benchmarking showed that the cost of doing the remainder vectorized is higher than scalar processing if there are just a few elements.
Thus the remainder is processed vectorized only if the remaining length is more than half of the vector size.
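To make the description above concrete, here is a minimal sketch of the idea for `int` elements only. It is not the code from this PR: `CountSketch` and `CountVectorized` are hypothetical names, the sketch works on a `ReadOnlySpan<int>` with index arithmetic instead of the `ref`-based loop in `SpanHelpers.T.cs`, and it has no Vector128 path.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical illustration, not the actual SpanHelpers.Count implementation.
static class CountSketch
{
    public static int CountVectorized(ReadOnlySpan<int> span, int value)
    {
        int count = 0;
        int i = 0;
        int width = Vector256<int>.Count;

        if (Vector256.IsHardwareAccelerated && span.Length >= width)
        {
            ref int start = ref MemoryMarshal.GetReference(span);
            Vector256<int> target = Vector256.Create(value);

            // Main loop: process full vectors only.
            for (; i <= span.Length - width; i += width)
            {
                uint mask = Vector256.Equals(Vector256.LoadUnsafe(ref start, (nuint)i), target)
                                     .ExtractMostSignificantBits();
                count += BitOperations.PopCount(mask);
            }

            int remaining = span.Length - i;

            // Only worth it if more than half a vector remains (threshold from the PR description).
            if (remaining > width / 2)
            {
                // Load the last full vector of the span. Its first (width - remaining)
                // elements were already counted by the main loop.
                uint mask = Vector256.Equals(Vector256.LoadUnsafe(ref start, (nuint)(span.Length - width)), target)
                                     .ExtractMostSignificantBits();

                // Bit k of the mask belongs to element k of the vector, so the
                // overlapping elements occupy the lowest bits; shift them out.
                int overlaps = width - remaining;
                mask >>= overlaps;

                return count + BitOperations.PopCount(mask);
            }
        }

        // Scalar loop: short inputs, small remainders, or no hardware acceleration.
        for (; i < span.Length; i++)
        {
            if (span[i] == value)
            {
                count++;
            }
        }

        return count;
    }
}
```

As a worked case, assume `Vector256<int>.Count` is 8 and the span has 15 elements: the main loop covers indices 0..7, 7 elements remain (more than 8 / 2), the last vector is loaded from index 7, and exactly one element (index 7) overlaps, so the mask is shifted right by 1 before the pop-count.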

Benchmarks

Benchmark code, run on win-x64 with .NET 8 Preview 1.
Benchmarks are done for the lengths Vector256<T>.Count + 1 and Vector256<T>.Count * 2 - 1, i.e. the two extreme cases for the remainder.

`byte`

|  Method | Length |      Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |----------:|----------:|----------:|------:|--------:|
| Default |     33 |  2.550 ns | 0.0897 ns | 0.0795 ns |  1.00 |    0.00 |
|      PR |     33 |  2.460 ns | 0.0574 ns | 0.0509 ns |  0.97 |    0.04 |
|         |        |           |           |           |       |         |
| Default |     63 | 18.252 ns | 0.2371 ns | 0.2102 ns |  1.00 |    0.00 |
|      PR |     63 |  2.816 ns | 0.0921 ns | 0.0862 ns |  0.15 |    0.00 |

`short`

|  Method | Length |     Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |---------:|----------:|----------:|------:|--------:|
| Default |     17 | 2.729 ns | 0.0630 ns | 0.0589 ns |  1.00 |    0.00 |
|      PR |     17 | 3.255 ns | 0.1044 ns | 0.2358 ns |  1.20 |    0.11 |
|         |        |          |           |           |       |         |
| Default |     31 | 9.784 ns | 0.2347 ns | 0.4172 ns |  1.00 |    0.00 |
|      PR |     31 | 3.707 ns | 0.1104 ns | 0.1473 ns |  0.38 |    0.03 |

`int`

|  Method | Length |     Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |---------:|----------:|----------:|------:|--------:|
| Default |      9 | 1.960 ns | 0.0805 ns | 0.1452 ns |  1.00 |    0.00 |
|      PR |      9 | 2.177 ns | 0.0427 ns | 0.0356 ns |  1.04 |    0.08 |
|         |        |          |           |           |       |         |
| Default |     15 | 4.317 ns | 0.1263 ns | 0.2278 ns |  1.00 |    0.00 |
|      PR |     15 | 2.205 ns | 0.0291 ns | 0.0272 ns |  0.51 |    0.02 |

`long`

|  Method | Length |     Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |---------:|----------:|----------:|------:|--------:|
| Default |      5 | 2.274 ns | 0.0866 ns | 0.2414 ns |  1.00 |    0.00 |
|      PR |      5 | 2.460 ns | 0.0922 ns | 0.2703 ns |  1.09 |    0.16 |
|         |        |          |           |           |       |         |
| Default |      7 | 3.071 ns | 0.1044 ns | 0.2944 ns |  1.00 |    0.00 |
|      PR |      7 | 2.469 ns | 0.0941 ns | 0.2199 ns |  0.79 |    0.09 |

I have some other ideas on how to improve perf for Count, but a) I'd like to keep the changes separate to make it easier to track the improvements, and b) at the moment it's a bit difficult with time for me...

ghost · 2023-02-26T20:59:16Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Author:	gfoidl
Assignees:	-
Labels:	`area-System.Memory`, `community-contribution`
Milestone:	-

stephentoub · 2023-02-27T14:34:29Z

> Benchmarking showed that the cost of doing the remainder vectorized is higher than scalar processing if there are just a few elements.

How much higher? The / 2 feels a bit arbitrary.

gfoidl · 2023-02-28T11:46:03Z

The `/ 2` was chosen based on some (rough) tests, and to keep the code from getting too complicated, i.e. by avoiding thresholds that differ by type and by remainder count.

Benchmarks for `if (remaining > 0)`, i.e. with every non-empty remainder processed vectorized:

`byte`

|  Method | Length |      Mean | Ratio |
|-------- |------- |----------:|------:|
| Default |     33 |  2.318 ns |  1.00 |
|      PR |     33 |  2.465 ns |  1.06 |
|         |        |           |       |
| Default |     63 | 16.462 ns |  1.00 |
|      PR |     63 |  2.404 ns |  0.15 |

`short`

|  Method | Length |     Mean | Ratio |
|-------- |------- |---------:|------:|
| Default |     17 | 2.486 ns |  1.00 |
|      PR |     17 | 3.075 ns |  1.24 |
|         |        |          |       |
| Default |     31 | 8.185 ns |  1.00 |
|      PR |     31 | 3.229 ns |  0.39 |

`int`

|  Method | Length |     Mean | Ratio |
|-------- |------- |---------:|------:|
| Default |      9 | 1.653 ns |  1.00 |
|      PR |      9 | 2.384 ns |  1.44 |
|         |        |          |       |
| Default |     15 | 3.514 ns |  1.00 |
|      PR |     15 | 2.387 ns |  0.68 |

`long`

|  Method | Length |     Mean | Ratio |
|-------- |------- |---------:|------:|
| Default |      5 | 1.936 ns |  1.00 |
|      PR |      5 | 2.585 ns |  1.33 |
|         |        |          |       |
| Default |      7 | 2.835 ns |  1.00 |
|      PR |      7 | 2.799 ns |  0.99 |

So if the remainder is small (just a few elements), the scalar loop seems faster than doing the bitmask + popcount on the full last vector. Thus, for simplicity, `/ 2` was chosen.

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.T.cs

stephentoub

Thanks.

gfoidl · 2023-05-15T18:11:01Z

> I have some other ideas on how to improve perf for Count

I played around with these ideas (back when I created this PR, so they may not be fully fleshed out), but I don't think that's something that should be merged here, as

- the code (from a WIP project) gets really messy
- the benchmark results don't show a clear perf improvement

gfoidl added 4 commits February 26, 2023 17:27

- Re-use targetVector from Vector256 in Vector128 code-path (375fec8): saves an additional vpbroadcastb.
- Process remainder vectorized (325a194)
- Use unsigned division for remaining elements (c3721ed): avoids the signed integer division.
- Process remainder vectorized only if more than half of vector remains (36b74c7): benchmarking showed that the cost is quite high, so for just a few elements the scalar loop seems better.

ghost added the community-contribution label (indicates that the PR has been added by a community member) Feb 26, 2023

vcsjones added the area-System.Memory label Feb 26, 2023

build-analysis bot mentioned this pull request Feb 26, 2023

Roslyn source generator crash on mono/linux/arm64 #81123

Closed

stephentoub reviewed May 15, 2023

stephentoub approved these changes May 15, 2023

gfoidl added 2 commits May 15, 2023 19:45

- Merge branch 'main' into memoryextensions_count (1c33c77)
- Added comment why / 2 got chosen (21dff1b)

stephentoub merged commit 50d4ecb into dotnet:main May 15, 2023
168 checks passed

gfoidl deleted the memoryextensions_count branch May 16, 2023 09:32

EgorBo mentioned this pull request May 18, 2023

[Perf] Windows/arm64: 9 Improvements on 5/16/2023 12:30:38 AM dotnet/perf-autofiling-issues#17926

Closed

kunalspathak mentioned this pull request May 18, 2023

[Perf] Linux/arm64: 3 Improvements on 5/16/2023 12:30:38 AM dotnet/perf-autofiling-issues#17889

Closed

dotnet locked as resolved and limited conversation to collaborators Jun 15, 2023
