Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle the remainder in MemoryExtensions.Count vectorized #82687

Merged
merged 6 commits into from May 15, 2023

Conversation

gfoidl
Copy link
Member

@gfoidl gfoidl commented Feb 26, 2023

Description

In the current implementation the remainder of the vectorized code-path is processed scalar.
This PR processes the remainder vectorized too.

This is done by reading a vector from the end, comparing with the target vector, and extracting the most significant bits as usual.
As some elements may overlap now, we need to shift them off from the mask to get the correct count.

Benchmarking showed that the cost of doing the remainder vectorized is higher than scalar processing if there are just a few elements.
Thus the remainder is done vectorized only if remaining length is more than half of a vector size.

Benchmarks

Benchmark-code, run on win-x64 with .NET 8 Preview 1.
Bencharks are done for the lengths Vector256<T>.Count + 1 and Vector256<T>.Count * 2 - 1, so the both extreme cases for the remainder.

byte

|  Method | Length |      Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |----------:|----------:|----------:|------:|--------:|
| Default |     33 |  2.550 ns | 0.0897 ns | 0.0795 ns |  1.00 |    0.00 |
|      PR |     33 |  2.460 ns | 0.0574 ns | 0.0509 ns |  0.97 |    0.04 |
|         |        |           |           |           |       |         |
| Default |     63 | 18.252 ns | 0.2371 ns | 0.2102 ns |  1.00 |    0.00 |
|      PR |     63 |  2.816 ns | 0.0921 ns | 0.0862 ns |  0.15 |    0.00 |

short

|  Method | Length |     Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |---------:|----------:|----------:|------:|--------:|
| Default |     17 | 2.729 ns | 0.0630 ns | 0.0589 ns |  1.00 |    0.00 |
|      PR |     17 | 3.255 ns | 0.1044 ns | 0.2358 ns |  1.20 |    0.11 |
|         |        |          |           |           |       |         |
| Default |     31 | 9.784 ns | 0.2347 ns | 0.4172 ns |  1.00 |    0.00 |
|      PR |     31 | 3.707 ns | 0.1104 ns | 0.1473 ns |  0.38 |    0.03 |

int

|  Method | Length |     Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |---------:|----------:|----------:|------:|--------:|
| Default |      9 | 1.960 ns | 0.0805 ns | 0.1452 ns |  1.00 |    0.00 |
|      PR |      9 | 2.177 ns | 0.0427 ns | 0.0356 ns |  1.04 |    0.08 |
|         |        |          |           |           |       |         |
| Default |     15 | 4.317 ns | 0.1263 ns | 0.2278 ns |  1.00 |    0.00 |
|      PR |     15 | 2.205 ns | 0.0291 ns | 0.0272 ns |  0.51 |    0.02 |

long

|  Method | Length |     Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |---------:|----------:|----------:|------:|--------:|
| Default |      5 | 2.274 ns | 0.0866 ns | 0.2414 ns |  1.00 |    0.00 |
|      PR |      5 | 2.460 ns | 0.0922 ns | 0.2703 ns |  1.09 |    0.16 |
|         |        |          |           |           |       |         |
| Default |      7 | 3.071 ns | 0.1044 ns | 0.2944 ns |  1.00 |    0.00 |
|      PR |      7 | 2.469 ns | 0.0941 ns | 0.2199 ns |  0.79 |    0.09 |

I have some other ideas on how to improve perf for Count, but a) I'd like to keep the changes separate to make it easier to track the improvements, and b) at the moment it's a bit difficult with time for me...

Avoids the signed integer division.
Benchmarking showed that the cost is quite high, so for just a few elements the scalar loop seems better.
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Feb 26, 2023
@ghost
Copy link

ghost commented Feb 26, 2023

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

In the current implementation the remainder of the vectorized code-path is processed scalar.
This PR processes the remainder vectorized too.

This is done by reading a vector from the end, comparing with the target vector, and extracting the most significant bits as usual.
As some elements may overlap now, we need to shift them off from the mask to get the correct count.

Benchmarking showed that the cost of doing the remainder vectorized is higher than scalar processing if there are just a few elements.
Thus the remainder is done vectorized only if remaining length is more than half of a vector size.

Benchmarks

Benchmark-code, run on win-x64 with .NET 8 Preview 1.
Bencharks are done for the lengths Vector256<T>.Count + 1 and Vector256<T>.Count * 2 - 1, so the both extreme cases for the remainder.

byte

|  Method | Length |      Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |----------:|----------:|----------:|------:|--------:|
| Default |     33 |  2.550 ns | 0.0897 ns | 0.0795 ns |  1.00 |    0.00 |
|      PR |     33 |  2.460 ns | 0.0574 ns | 0.0509 ns |  0.97 |    0.04 |
|         |        |           |           |           |       |         |
| Default |     63 | 18.252 ns | 0.2371 ns | 0.2102 ns |  1.00 |    0.00 |
|      PR |     63 |  2.816 ns | 0.0921 ns | 0.0862 ns |  0.15 |    0.00 |

short

|  Method | Length |     Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |---------:|----------:|----------:|------:|--------:|
| Default |     17 | 2.729 ns | 0.0630 ns | 0.0589 ns |  1.00 |    0.00 |
|      PR |     17 | 3.255 ns | 0.1044 ns | 0.2358 ns |  1.20 |    0.11 |
|         |        |          |           |           |       |         |
| Default |     31 | 9.784 ns | 0.2347 ns | 0.4172 ns |  1.00 |    0.00 |
|      PR |     31 | 3.707 ns | 0.1104 ns | 0.1473 ns |  0.38 |    0.03 |

int

|  Method | Length |     Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |---------:|----------:|----------:|------:|--------:|
| Default |      9 | 1.960 ns | 0.0805 ns | 0.1452 ns |  1.00 |    0.00 |
|      PR |      9 | 2.177 ns | 0.0427 ns | 0.0356 ns |  1.04 |    0.08 |
|         |        |          |           |           |       |         |
| Default |     15 | 4.317 ns | 0.1263 ns | 0.2278 ns |  1.00 |    0.00 |
|      PR |     15 | 2.205 ns | 0.0291 ns | 0.0272 ns |  0.51 |    0.02 |

long

|  Method | Length |     Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |------- |---------:|----------:|----------:|------:|--------:|
| Default |      5 | 2.274 ns | 0.0866 ns | 0.2414 ns |  1.00 |    0.00 |
|      PR |      5 | 2.460 ns | 0.0922 ns | 0.2703 ns |  1.09 |    0.16 |
|         |        |          |           |           |       |         |
| Default |      7 | 3.071 ns | 0.1044 ns | 0.2944 ns |  1.00 |    0.00 |
|      PR |      7 | 2.469 ns | 0.0941 ns | 0.2199 ns |  0.79 |    0.09 |

I have some other ideas on how to improve perf for Count, but a) I'd like to keep the changes separate to make it easier to track the improvements, and b) at the moment it's a bit difficult with time for me...

Author: gfoidl
Assignees: -
Labels:

area-System.Memory, community-contribution

Milestone: -

@stephentoub
Copy link
Member

Benchmarking showed that the cost of doing the remainder vectorized is higher than scalar processing if there are just a few elements.

How much higher? The / 2 feels a bit arbitrary.

@gfoidl
Copy link
Member Author

gfoidl commented Feb 28, 2023

The / 2 was chosen by some (rough) tests, and to make the code not too complicated, i.e. by avoiding type-based and remainder-count different threshoulds.

Benchmarks for if (remaining > 0):

byte

|  Method | Length |      Mean | Ratio |
|-------- |------- |----------:|------:|
| Default |     33 |  2.318 ns |  1.00 |
|      PR |     33 |  2.465 ns |  1.06 |
|         |        |           |       |
| Default |     63 | 16.462 ns |  1.00 |
|      PR |     63 |  2.404 ns |  0.15 |

short

|  Method | Length |     Mean | Ratio |
|-------- |------- |---------:|------:|
| Default |     17 | 2.486 ns |  1.00 |
|      PR |     17 | 3.075 ns |  1.24 |
|         |        |          |       |
| Default |     31 | 8.185 ns |  1.00 |
|      PR |     31 | 3.229 ns |  0.39 |

int

|  Method | Length |     Mean | Ratio |
|-------- |------- |---------:|------:|
| Default |      9 | 1.653 ns |  1.00 |
|      PR |      9 | 2.384 ns |  1.44 |
|         |        |          |       |
| Default |     15 | 3.514 ns |  1.00 |
|      PR |     15 | 2.387 ns |  0.68 |

long

|  Method | Length |     Mean | Ratio |
|-------- |------- |---------:|------:|
| Default |      5 | 1.936 ns |  1.00 |
|      PR |      5 | 2.585 ns |  1.33 |
|         |        |          |       |
| Default |      7 | 2.835 ns |  1.00 |
|      PR |      7 | 2.799 ns |  0.99 |

So if the remainder is small -- just a few elements -- the scalar loop seems faster than doing the bitmask + popcount on the full last vector. Thus for simplicity / 2 was chosen.

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@gfoidl
Copy link
Member Author

gfoidl commented May 15, 2023

I have some other ideas on how to improve perf for Count

Played around with these ideas (back then when creating this PR, may not be fully fleshed out), but I don't think that's something that should be merged here, as

@stephentoub stephentoub merged commit 50d4ecb into dotnet:main May 15, 2023
168 checks passed
@gfoidl gfoidl deleted the memoryextensions_count branch May 16, 2023 09:32
@dotnet dotnet locked as resolved and limited conversation to collaborators Jun 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Memory community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants