Light up Ascii.Equality.Equals and Ascii.Equality.EqualsIgnoreCase with Vector512 code path #88650

Merged
merged 14 commits into from Jul 18, 2023

Conversation

khushal1996
Contributor

@khushal1996 khushal1996 commented Jul 11, 2023

This PR adds Vector512 support to the existing Ascii.Equality.Equals and Ascii.Equality.EqualsIgnoreCase library APIs. The implementation closely mirrors the existing Vector256 path.
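
For context, here is a minimal sketch of calling the comparison APIs from user code. It assumes the public `System.Text.Ascii.Equals` and `Ascii.EqualsIgnoreCase` overloads in .NET 8 are the entry points that reach this code path; the sample strings are made up for illustration.

```csharp
using System;
using System.Text;

// Illustrative inputs only: a UTF-8 byte span and a UTF-16 char span.
ReadOnlySpan<byte> bytes = "Hello, Vector512!"u8;
ReadOnlySpan<char> chars = "hello, vector512!";

// Ordinal comparison: the inputs differ in case, so this is false.
bool exact = Ascii.Equals(bytes, chars);

// Case-insensitive comparison over ASCII letters: this is true.
bool ignoreCase = Ascii.EqualsIgnoreCase(bytes, chars);

Console.WriteLine($"{exact} {ignoreCase}"); // False True
```

Whether a call actually takes the Vector512 path depends on the hardware and on the input length, so short inputs like these would not be expected to exercise it.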

We have changed the implementation of public static Vector512<ushort> Load512 so that we either retain the existing performance or see a performance gain. Please see the comments in the code for a detailed explanation.
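
To illustrate the kind of operation a loader returning Vector512<ushort> has to perform when the input is bytes, here is a rough sketch that widens 32 bytes into 32 16-bit lanes. This is not the PR's `Load512`: the helper name `WidenToUInt16` is hypothetical, and the choice of `Vector256.Widen` plus `Vector512.Create` is an assumption made for clarity, not for performance.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper (not the PR's Load512): zero-extend 32 bytes to 32 ushort lanes.
static Vector512<ushort> WidenToUInt16(ReadOnlySpan<byte> source)
{
    // Reads the first 32 bytes; throws if the span is shorter than that.
    Vector256<byte> narrow = Vector256.Create(source);

    // Split into two halves of 16 widened lanes each.
    (Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(narrow);

    // Recombine the halves into a single 512-bit vector.
    return Vector512.Create(lower, upper);
}

byte[] input = new byte[32];
for (int i = 0; i < input.Length; i++)
{
    input[i] = (byte)('A' + i % 26);
}

Vector512<ushort> wide = WidenToUInt16(input);
Console.WriteLine((char)wide[0]); // A
```

The comments in the merged code remain the authoritative explanation of which instruction sequence was chosen and why.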

PERF


| Method | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Equals_Bytes | Base_impl | 6 | 6.633 ns | 0.0262 ns | 0.0245 ns | 6.636 ns | 6.595 ns | 6.674 ns | 1.00 | 0.01 |
| Equals_Bytes | Diff_impl | 6 | 6.658 ns | 0.0351 ns | 0.0328 ns | 6.658 ns | 6.604 ns | 6.708 ns | 1.00 | 0.01 |
| Equals_Chars | Base_impl | 6 | 5.341 ns | 0.0375 ns | 0.0351 ns | 5.338 ns | 5.272 ns | 5.403 ns | 1.00 | 0.01 |
| Equals_Chars | Diff_impl | 6 | 5.379 ns | 0.0308 ns | 0.0288 ns | 5.379 ns | 5.327 ns | 5.421 ns | 1.00 | 0.01 |
| Equals_Bytes_Chars | Base_impl | 6 | 5.944 ns | 0.0635 ns | 0.0530 ns | 5.960 ns | 5.861 ns | 6.061 ns | 0.99 | 0.02 |
| Equals_Bytes_Chars | Diff_impl | 6 | 6.016 ns | 0.0836 ns | 0.0782 ns | 5.996 ns | 5.859 ns | 6.161 ns | 1.00 | 0.02 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes | Base_impl | 6 | 6.451 ns | 0.0460 ns | 0.0430 ns | 6.444 ns | 6.394 ns | 6.530 ns | 0.99 | 0.01 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes | Diff_impl | 6 | 6.154 ns | 0.0247 ns | 0.0219 ns | 6.165 ns | 6.101 ns | 6.171 ns | 0.95 | 0.01 |
| EqualsIgnoreCase_ExactlyTheSame_Chars | Base_impl | 6 | 6.584 ns | 0.0801 ns | 0.0749 ns | 6.586 ns | 6.484 ns | 6.716 ns | 1.01 | 0.01 |
| EqualsIgnoreCase_ExactlyTheSame_Chars | Diff_impl | 6 | 5.428 ns | 0.0189 ns | 0.0177 ns | 5.436 ns | 5.401 ns | 5.452 ns | 0.83 | 0.01 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes_Chars | Base_impl | 6 | 6.443 ns | 0.0551 ns | 0.0516 ns | 6.455 ns | 6.346 ns | 6.519 ns | 1.00 | 0.01 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes_Chars | Diff_impl | 6 | 6.113 ns | 0.0356 ns | 0.0315 ns | 6.118 ns | 6.062 ns | 6.177 ns | 0.95 | 0.01 |
| EqualsIgnoreCase_DifferentCase_Bytes | Base_impl | 6 | 6.513 ns | 0.0393 ns | 0.0367 ns | 6.510 ns | 6.460 ns | 6.577 ns | 0.98 | 0.01 |
| EqualsIgnoreCase_DifferentCase_Bytes | Diff_impl | 6 | 7.089 ns | 0.0481 ns | 0.0450 ns | 7.079 ns | 7.033 ns | 7.174 ns | 1.07 | 0.01 |
| EqualsIgnoreCase_DifferentCase_Chars | Base_impl | 6 | 6.752 ns | 0.0468 ns | 0.0415 ns | 6.750 ns | 6.659 ns | 6.813 ns | 1.00 | 0.01 |
| EqualsIgnoreCase_DifferentCase_Chars | Diff_impl | 6 | 7.174 ns | 0.0413 ns | 0.0387 ns | 7.177 ns | 7.110 ns | 7.246 ns | 1.06 | 0.01 |
| EqualsIgnoreCase_DifferentCase_Bytes_Chars | Base_impl | 6 | 6.465 ns | 0.0518 ns | 0.0484 ns | 6.461 ns | 6.410 ns | 6.563 ns | 0.93 | 0.01 |
| EqualsIgnoreCase_DifferentCase_Bytes_Chars | Diff_impl | 6 | 7.111 ns | 0.0239 ns | 0.0212 ns | 7.111 ns | 7.063 ns | 7.151 ns | 1.03 | 0.01 |
| Equals_Bytes | Base_impl | 32 | 2.546 ns | 0.0305 ns | 0.0255 ns | 2.545 ns | 2.507 ns | 2.603 ns | 1.01 | 0.02 |
| Equals_Bytes | Diff_impl | 32 | 2.522 ns | 0.0257 ns | 0.0241 ns | 2.515 ns | 2.490 ns | 2.562 ns | 1.00 | 0.01 |
| Equals_Chars | Base_impl | 32 | 2.820 ns | 0.0118 ns | 0.0111 ns | 2.815 ns | 2.810 ns | 2.840 ns | 1.00 | 0.01 |
| Equals_Chars | Diff_impl | 32 | 2.833 ns | 0.0030 ns | 0.0024 ns | 2.834 ns | 2.828 ns | 2.836 ns | 1.00 | 0.00 |
| Equals_Bytes_Chars | Base_impl | 32 | 2.607 ns | 0.0089 ns | 0.0075 ns | 2.608 ns | 2.590 ns | 2.622 ns | 1.00 | 0.00 |
| Equals_Bytes_Chars | Diff_impl | 32 | 2.603 ns | 0.0102 ns | 0.0085 ns | 2.605 ns | 2.585 ns | 2.616 ns | 1.00 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes | Base_impl | 32 | 2.954 ns | 0.0045 ns | 0.0035 ns | 2.954 ns | 2.950 ns | 2.963 ns | 0.85 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes | Diff_impl | 32 | 3.059 ns | 0.0086 ns | 0.0076 ns | 3.061 ns | 3.037 ns | 3.069 ns | 0.88 | 0.01 |
| EqualsIgnoreCase_ExactlyTheSame_Chars | Base_impl | 32 | 4.079 ns | 0.0258 ns | 0.0215 ns | 4.082 ns | 4.042 ns | 4.118 ns | 0.76 | 0.01 |
| EqualsIgnoreCase_ExactlyTheSame_Chars | Diff_impl | 32 | 3.562 ns | 0.0175 ns | 0.0155 ns | 3.561 ns | 3.538 ns | 3.596 ns | 0.67 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes_Chars | Base_impl | 32 | 4.299 ns | 0.0193 ns | 0.0181 ns | 4.301 ns | 4.266 ns | 4.333 ns | 0.76 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes_Chars | Diff_impl | 32 | 3.662 ns | 0.0486 ns | 0.0455 ns | 3.667 ns | 3.559 ns | 3.717 ns | 0.65 | 0.01 |
| EqualsIgnoreCase_DifferentCase_Bytes | Base_impl | 32 | 3.844 ns | 0.0236 ns | 0.0221 ns | 3.844 ns | 3.809 ns | 3.882 ns | 0.65 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes | Diff_impl | 32 | 4.240 ns | 0.0414 ns | 0.0387 ns | 4.226 ns | 4.197 ns | 4.315 ns | 0.72 | 0.01 |
| EqualsIgnoreCase_DifferentCase_Chars | Base_impl | 32 | 6.577 ns | 0.0263 ns | 0.0233 ns | 6.579 ns | 6.537 ns | 6.623 ns | 0.60 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Chars | Diff_impl | 32 | 4.536 ns | 0.0203 ns | 0.0190 ns | 4.542 ns | 4.509 ns | 4.563 ns | 0.42 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes_Chars | Base_impl | 32 | 7.672 ns | 0.0320 ns | 0.0284 ns | 7.681 ns | 7.624 ns | 7.710 ns | 0.71 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes_Chars | Diff_impl | 32 | 5.625 ns | 0.0228 ns | 0.0190 ns | 5.631 ns | 5.580 ns | 5.655 ns | 0.52 | 0.00 |
| Equals_Bytes | Base_impl | 128 | 3.177 ns | 0.0010 ns | 0.0008 ns | 3.177 ns | 3.176 ns | 3.179 ns | 1.00 | 0.00 |
| Equals_Bytes | Diff_impl | 128 | 3.178 ns | 0.0013 ns | 0.0012 ns | 3.178 ns | 3.176 ns | 3.180 ns | 1.00 | 0.00 |
| Equals_Chars | Base_impl | 128 | 5.643 ns | 0.0237 ns | 0.0210 ns | 5.647 ns | 5.578 ns | 5.666 ns | 0.99 | 0.00 |
| Equals_Chars | Diff_impl | 128 | 5.317 ns | 0.0048 ns | 0.0040 ns | 5.317 ns | 5.308 ns | 5.324 ns | 0.94 | 0.00 |
| Equals_Bytes_Chars | Base_impl | 128 | 4.633 ns | 0.0292 ns | 0.0259 ns | 4.631 ns | 4.595 ns | 4.688 ns | 1.00 | 0.01 |
| Equals_Bytes_Chars | Diff_impl | 128 | 4.625 ns | 0.0234 ns | 0.0196 ns | 4.627 ns | 4.588 ns | 4.650 ns | 1.00 | 0.01 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes | Base_impl | 128 | 6.018 ns | 0.0185 ns | 0.0164 ns | 6.011 ns | 6.001 ns | 6.047 ns | 0.63 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes | Diff_impl | 128 | 4.519 ns | 0.0334 ns | 0.0312 ns | 4.534 ns | 4.463 ns | 4.550 ns | 0.47 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Chars | Base_impl | 128 | 11.243 ns | 0.0057 ns | 0.0053 ns | 11.244 ns | 11.234 ns | 11.251 ns | 0.58 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Chars | Diff_impl | 128 | 8.042 ns | 0.0035 ns | 0.0029 ns | 8.041 ns | 8.037 ns | 8.046 ns | 0.42 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes_Chars | Base_impl | 128 | 14.060 ns | 0.0333 ns | 0.0295 ns | 14.053 ns | 14.019 ns | 14.131 ns | 0.68 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes_Chars | Diff_impl | 128 | 8.859 ns | 0.0030 ns | 0.0027 ns | 8.858 ns | 8.855 ns | 8.864 ns | 0.43 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes | Base_impl | 128 | 11.694 ns | 0.0036 ns | 0.0030 ns | 11.695 ns | 11.690 ns | 11.701 ns | 0.60 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes | Diff_impl | 128 | 7.287 ns | 0.0126 ns | 0.0105 ns | 7.287 ns | 7.267 ns | 7.303 ns | 0.38 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Chars | Base_impl | 128 | 20.930 ns | 0.0275 ns | 0.0230 ns | 20.923 ns | 20.903 ns | 20.989 ns | 0.55 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Chars | Diff_impl | 128 | 14.724 ns | 0.0124 ns | 0.0116 ns | 14.725 ns | 14.690 ns | 14.738 ns | 0.38 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes_Chars | Base_impl | 128 | 24.800 ns | 0.0270 ns | 0.0252 ns | 24.802 ns | 24.760 ns | 24.840 ns | 0.61 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes_Chars | Diff_impl | 128 | 14.666 ns | 0.0077 ns | 0.0072 ns | 14.664 ns | 14.656 ns | 14.682 ns | 0.36 | 0.00 |
| Equals_Bytes | Base_impl | 512 | 8.693 ns | 0.0057 ns | 0.0044 ns | 8.693 ns | 8.687 ns | 8.699 ns | 1.00 | 0.00 |
| Equals_Bytes | Diff_impl | 512 | 10.694 ns | 0.0031 ns | 0.0029 ns | 10.693 ns | 10.690 ns | 10.700 ns | 1.23 | 0.00 |
| Equals_Chars | Base_impl | 512 | 17.528 ns | 0.0065 ns | 0.0058 ns | 17.526 ns | 17.520 ns | 17.542 ns | 0.99 | 0.00 |
| Equals_Chars | Diff_impl | 512 | 16.729 ns | 0.0044 ns | 0.0041 ns | 16.727 ns | 16.722 ns | 16.737 ns | 0.95 | 0.00 |
| Equals_Bytes_Chars | Base_impl | 512 | 14.422 ns | 0.0048 ns | 0.0045 ns | 14.423 ns | 14.416 ns | 14.429 ns | 1.01 | 0.00 |
| Equals_Bytes_Chars | Diff_impl | 512 | 14.005 ns | 0.0065 ns | 0.0061 ns | 14.003 ns | 13.993 ns | 14.018 ns | 0.98 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes | Base_impl | 512 | 19.734 ns | 0.0095 ns | 0.0089 ns | 19.733 ns | 19.721 ns | 19.750 ns | 0.44 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes | Diff_impl | 512 | 13.132 ns | 0.0036 ns | 0.0033 ns | 13.132 ns | 13.127 ns | 13.138 ns | 0.29 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Chars | Base_impl | 512 | 51.981 ns | 0.0377 ns | 0.0352 ns | 51.985 ns | 51.908 ns | 52.039 ns | 0.62 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Chars | Diff_impl | 512 | 27.112 ns | 0.0064 ns | 0.0060 ns | 27.110 ns | 27.104 ns | 27.123 ns | 0.32 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes_Chars | Base_impl | 512 | 57.214 ns | 0.0301 ns | 0.0282 ns | 57.213 ns | 57.170 ns | 57.273 ns | 0.61 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes_Chars | Diff_impl | 512 | 31.673 ns | 0.0095 ns | 0.0074 ns | 31.673 ns | 31.661 ns | 31.682 ns | 0.34 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes | Base_impl | 512 | 43.033 ns | 0.0091 ns | 0.0080 ns | 43.034 ns | 43.020 ns | 43.047 ns | 0.50 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes | Diff_impl | 512 | 27.040 ns | 0.1025 ns | 0.0856 ns | 27.064 ns | 26.756 ns | 27.070 ns | 0.32 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Chars | Base_impl | 512 | 89.066 ns | 0.0922 ns | 0.0770 ns | 89.070 ns | 88.975 ns | 89.238 ns | 0.56 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Chars | Diff_impl | 512 | 52.448 ns | 0.0152 ns | 0.0135 ns | 52.448 ns | 52.429 ns | 52.477 ns | 0.33 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes_Chars | Base_impl | 512 | 102.465 ns | 0.0737 ns | 0.0653 ns | 102.459 ns | 102.327 ns | 102.600 ns | 0.60 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes_Chars | Diff_impl | 512 | 56.994 ns | 0.1687 ns | 0.1409 ns | 56.940 ns | 56.871 ns | 57.312 ns | 0.34 | 0.00 |
| Equals_Bytes | Base_impl | 1024 | 15.430 ns | 0.0044 ns | 0.0041 ns | 15.430 ns | 15.421 ns | 15.435 ns | 0.98 | 0.00 |
| Equals_Bytes | Diff_impl | 1024 | 16.261 ns | 0.0044 ns | 0.0034 ns | 16.262 ns | 16.254 ns | 16.265 ns | 1.03 | 0.00 |
| Equals_Chars | Base_impl | 1024 | 38.232 ns | 0.7836 ns | 0.8710 ns | 38.331 ns | 36.608 ns | 39.653 ns | 0.95 | 0.02 |
| Equals_Chars | Diff_impl | 1024 | 39.503 ns | 0.0855 ns | 0.0799 ns | 39.520 ns | 39.219 ns | 39.548 ns | 0.98 | 0.00 |
| Equals_Bytes_Chars | Base_impl | 1024 | 27.625 ns | 0.0398 ns | 0.0353 ns | 27.630 ns | 27.507 ns | 27.652 ns | 1.00 | 0.00 |
| Equals_Bytes_Chars | Diff_impl | 1024 | 27.242 ns | 0.0138 ns | 0.0129 ns | 27.243 ns | 27.213 ns | 27.259 ns | 0.99 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes | Base_impl | 1024 | 49.150 ns | 0.1114 ns | 0.1042 ns | 49.189 ns | 48.969 ns | 49.267 ns | 0.63 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes | Diff_impl | 1024 | 24.301 ns | 0.0097 ns | 0.0091 ns | 24.299 ns | 24.288 ns | 24.319 ns | 0.31 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Chars | Base_impl | 1024 | 94.396 ns | 0.0399 ns | 0.0374 ns | 94.397 ns | 94.327 ns | 94.459 ns | 0.60 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Chars | Diff_impl | 1024 | 60.306 ns | 0.0189 ns | 0.0167 ns | 60.304 ns | 60.274 ns | 60.342 ns | 0.38 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes_Chars | Base_impl | 1024 | 102.150 ns | 0.2300 ns | 0.1920 ns | 102.080 ns | 101.973 ns | 102.638 ns | 0.61 | 0.00 |
| EqualsIgnoreCase_ExactlyTheSame_Bytes_Chars | Diff_impl | 1024 | 72.305 ns | 0.0378 ns | 0.0335 ns | 72.303 ns | 72.239 ns | 72.378 ns | 0.44 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes | Base_impl | 1024 | 92.898 ns | 0.1827 ns | 0.1525 ns | 92.848 ns | 92.676 ns | 93.261 ns | 0.58 | 0.00 |
| EqualsIgnoreCase_DifferentCase_Bytes | Diff_impl | 1024 | 48.550 ns | 0.0189 ns | 0.0177 ns | 48.544 ns | 48.531 ns | 48.585 ns | 0.30 | 0.00 |

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jul 11, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jul 11, 2023
@khushal1996 khushal1996 changed the title from "Adding Vector512 support in Ascii.Equality.Equals and Ascii.Equality.EqualsIgnoreCase" to "Light up Ascii.Equality.Equals and Ascii.Equality.EqualsIgnoreCase with Vector512 code path" Jul 11, 2023
@ghost

ghost commented Jul 13, 2023

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Issue Details

NO NEED FOR REVIEW AT THIS TIME

This PR adds Vector512 support to the existing Ascii.Equality.Equals and Ascii.Equality.EqualsIgnoreCase library APIs. The implementation closely mirrors the existing Vector256 path.

Perf


Author: khushal1996
Assignees: -
Labels:

area-System.Text.Encoding, community-contribution, needs-area-label

Milestone: -

@lewing lewing removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jul 13, 2023
@tannergooding tannergooding added the arch-avx512 Related to the AVX-512 architecture label Jul 14, 2023
Member

@tannergooding tannergooding left a comment

LGTM. Basically a copy/paste of the V256 path and changed to use V512

Member

@MihaZupan MihaZupan left a comment

LGTM, I was also relying on it matching the Vector256 variant.

Comment on lines +85 to +92
while (!Unsafe.IsAddressGreaterThan(ref currentRightSearchSpace, ref oneVectorAwayFromRightEnd));

// If any elements remain, process the last vector in the search space.
if (length % (uint)Vector512<TLeft>.Count != 0)
{
    ref TLeft oneVectorAwayFromLeftEnd = ref Unsafe.Add(ref left, length - (uint)Vector512<TLeft>.Count);
    return TLoader.EqualAndAscii512(ref oneVectorAwayFromLeftEnd, ref oneVectorAwayFromRightEnd);
}
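
The loop shape in this excerpt (full vectors in a do/while, then one overlapping vector for any remainder) can be sketched in a simplified, byte-only form. Everything here is an assumption for illustration: `EqualAndAscii` and `BlockEqualAndAscii` are hypothetical names, offsets replace the ref arithmetic, and the scalar fallback for short inputs is a stand-in for the narrower vector paths in the real code.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical simplified comparer: true only if both spans are equal and all ASCII.
static bool EqualAndAscii(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right)
{
    if (left.Length != right.Length) return false;

    nuint length = (nuint)left.Length;
    nuint count = (nuint)Vector512<byte>.Count; // 64 bytes per vector

    if (length < count)
    {
        // Scalar stand-in for inputs shorter than one vector.
        for (int i = 0; i < left.Length; i++)
        {
            if (left[i] != right[i] || left[i] > 0x7F) return false;
        }
        return true;
    }

    ref byte l = ref MemoryMarshal.GetReference(left);
    ref byte r = ref MemoryMarshal.GetReference(right);
    nuint lastStart = length - count; // start of the final full vector
    nuint offset = 0;

    do
    {
        if (!BlockEqualAndAscii(ref l, ref r, offset)) return false;
        offset += count;
    }
    while (offset <= lastStart);

    // Leftover elements: re-check the last 64 bytes, overlapping already-compared data.
    if (length % count != 0)
    {
        return BlockEqualAndAscii(ref l, ref r, lastStart);
    }

    return true;
}

static bool BlockEqualAndAscii(ref byte left, ref byte right, nuint offset)
{
    Vector512<byte> a = Vector512.LoadUnsafe(ref left, offset);
    Vector512<byte> b = Vector512.LoadUnsafe(ref right, offset);

    // Equal in every lane, and no lane has its high bit set (so every byte is ASCII).
    return a == b && a.ExtractMostSignificantBits() == 0;
}

byte[] x = new byte[100];
Array.Fill(x, (byte)'a');
byte[] y = (byte[])x.Clone();

Console.WriteLine(EqualAndAscii(x, y)); // True
y[99] = (byte)'b';
Console.WriteLine(EqualAndAscii(x, y)); // False
```

With 100 bytes, the loop compares offset 0 and the tail compares offset 36, so bytes 36 to 63 are checked twice. Re-checking is harmless because the comparison is idempotent.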
Member

(Not specific to this PR since it's just following the existing pattern)

Since we're already doing the ref arithmetic here, we might be able to save a few instructions by changing such loops to

while (Unsafe.IsAddressLessThan(ref currentRightSearchSpace, ref oneVectorAwayFromRightEnd))
{ ... }

ref TLeft oneVectorAwayFromLeftEnd = ref Unsafe.Add(ref left, length - (uint)Vector512<TLeft>.Count);
return TLoader.EqualAndAscii512(ref oneVectorAwayFromLeftEnd, ref oneVectorAwayFromRightEnd);
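
As a sanity check on this suggestion, the starting offsets visited by the two loop shapes can be compared with plain integer arithmetic. This is only a sketch: `CurrentPattern` and `SuggestedPattern` are hypothetical helpers that model offsets rather than refs, and both assume the input holds at least one full vector.

```csharp
using System;
using System.Collections.Generic;

// Existing shape: do/while over full vectors, then a conditional overlapping tail.
static List<int> CurrentPattern(int length, int count)
{
    var offsets = new List<int>();
    int lastStart = length - count;
    int offset = 0;
    do
    {
        offsets.Add(offset);
        offset += count;
    }
    while (offset <= lastStart);

    if (length % count != 0) offsets.Add(lastStart);
    return offsets;
}

// Suggested shape: loop while strictly before the last vector, then always process it.
static List<int> SuggestedPattern(int length, int count)
{
    var offsets = new List<int>();
    int lastStart = length - count;
    int offset = 0;
    while (offset < lastStart)
    {
        offsets.Add(offset);
        offset += count;
    }

    offsets.Add(lastStart);
    return offsets;
}

Console.WriteLine(string.Join(",", CurrentPattern(100, 64)));   // 0,36
Console.WriteLine(string.Join(",", SuggestedPattern(100, 64))); // 0,36
Console.WriteLine(string.Join(",", SuggestedPattern(128, 64))); // 0,64
```

Under these assumptions both shapes visit the same offsets; the suggested one drops the modulo check and the branch around the final vector.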

@tannergooding tannergooding merged commit bced584 into dotnet:main Jul 18, 2023
155 of 160 checks passed
@dotnet dotnet locked as resolved and limited conversation to collaborators Aug 18, 2023
Labels
arch-avx512 Related to the AVX-512 architecture area-System.Text.Encoding community-contribution Indicates that the PR has been added by a community member

4 participants