This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

String.StartsWith Ordinal optimization pt 2 #2667

Closed

benaadams
Member

Calling into native comes with some overhead, which is especially significant for short argument strings. (From aspnet/HttpAbstractions#521; a follow-up to #1632, covering the rest of the string.)

This keeps String.StartsWith(string, StringComparison.Ordinal) in managed code for arguments < 512 chars; it also unrolls the loop, uses wider data types, etc.

As it is managed code, the improvements should apply equally to all platforms. I have #ifdef'd it for FEATURE_CORECLR only.
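
To illustrate what "uses wider data types" can mean here, the following is a minimal sketch, not the code from this PR. It uses the span APIs of current .NET (which did not exist when this PR was opened) instead of `fixed` pointers, and the class and method names are hypothetical. It compares four UTF-16 chars per step with a single 64-bit comparison, then finishes the tail one char at a time.

```csharp
using System;
using System.Runtime.InteropServices;

static class OrdinalSketch
{
    // Hypothetical helper for illustration only; not the PR's implementation.
    public static bool StartsWithOrdinal(string text, string prefix)
    {
        if (prefix.Length > text.Length) return false;

        // View both strings as raw bytes (2 bytes per UTF-16 char).
        ReadOnlySpan<byte> a = MemoryMarshal.AsBytes(text.AsSpan(0, prefix.Length));
        ReadOnlySpan<byte> b = MemoryMarshal.AsBytes(prefix.AsSpan());

        int i = 0;

        // Wide step: one 64-bit compare covers four chars.
        for (; i + sizeof(ulong) <= a.Length; i += sizeof(ulong))
        {
            if (MemoryMarshal.Read<ulong>(a.Slice(i)) != MemoryMarshal.Read<ulong>(b.Slice(i)))
                return false;
        }

        // Tail: the remaining 0 to 3 chars, one char (2 bytes) at a time.
        for (; i < a.Length; i += sizeof(char))
        {
            if (MemoryMarshal.Read<ushort>(a.Slice(i)) != MemoryMarshal.Read<ushort>(b.Slice(i)))
                return false;
        }

        return true;
    }
}
```

The wide loop is where the saving would come from: for a matching prefix it performs roughly a quarter as many comparisons as a char-by-char loop.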

Results

This is for the worst-case comparison, where the strings match.

NativeStartsWithOrdinal is the current implementation; ManagedStartsWithOrdinal is this change. Measured on x64.

Method Len AvgTime StdDev op/s Improve
ManagedStartsWithOrdinal 2 4.0052ns 0.0671ns 249,744,319.70 +153.1%
NativeStartsWithOrdinal 2 10.1366ns 0.1697ns 98,679,099.50 -
ManagedStartsWithOrdinal 3 4.0089ns 0.0694ns 249,517,852.81 +174.6%
NativeStartsWithOrdinal 3 11.0067ns 0.1886ns 90,880,104.57 -
ManagedStartsWithOrdinal 4 4.1074ns 0.0219ns 243,466,747.38 +185.5%
NativeStartsWithOrdinal 4 11.7261ns 0.0369ns 85,280,783.99 -
ManagedStartsWithOrdinal 5 4.3801ns 1.4955ns 238,078,604.60 +185.9%
NativeStartsWithOrdinal 5 12.0108ns 0.2094ns 83,282,838.04 -
ManagedStartsWithOrdinal 6 4.3481ns 0.0785ns 230,057,624.51 +149.5%
NativeStartsWithOrdinal 6 10.8458ns 0.0382ns 92,202,381.48 -
ManagedStartsWithOrdinal 7 4.2902ns 0.0756ns 233,158,343.59 +173.7%
NativeStartsWithOrdinal 7 11.7399ns 0.0506ns 85,181,018.69 -
ManagedStartsWithOrdinal 8 4.7742ns 0.1198ns 209,588,833.02 +152.6%
NativeStartsWithOrdinal 8 12.0580ns 0.2386ns 82,963,662.20 -
ManagedStartsWithOrdinal 9 4.7172ns 0.0372ns 212,003,612.92 +170.1%
NativeStartsWithOrdinal 9 12.7465ns 0.2633ns 78,485,475.17 -
ManagedStartsWithOrdinal 10 5.0009ns 0.1068ns 200,055,190.98 +128.8%
NativeStartsWithOrdinal 10 11.4378ns 0.0517ns 87,430,778.79 -
ManagedStartsWithOrdinal 15 5.5410ns 2.0315ns 189,144,906.11 +143.8%
NativeStartsWithOrdinal 15 12.8905ns 0.0076ns 77,576,243.92 -
ManagedStartsWithOrdinal 16 6.7168ns 1.2771ns 151,580,392.59 +104.5%
NativeStartsWithOrdinal 16 13.4929ns 0.0615ns 74,114,322.99 -
ManagedStartsWithOrdinal 17 6.7303ns 1.6666ns 152,640,298.48 +115.2%
NativeStartsWithOrdinal 17 14.1002ns 0.0610ns 70,922,255.44 -
ManagedStartsWithOrdinal 23 7.3643ns 2.3584ns 142,538,976.63 +98.2%
NativeStartsWithOrdinal 23 13.9077ns 0.2307ns 71,922,061.67 -
ManagedStartsWithOrdinal 24 7.1233ns 0.1394ns 140,437,643.35 +103.2%
NativeStartsWithOrdinal 24 14.4703ns 0.2629ns 69,129,538.57 -
ManagedStartsWithOrdinal 25 7.1311ns 0.1326ns 140,277,720.07 +111.3%
NativeStartsWithOrdinal 25 15.0710ns 0.2860ns 66,375,829.39 -
ManagedStartsWithOrdinal 31 7.5545ns 0.2379ns 132,489,110.68 +99.5%
NativeStartsWithOrdinal 31 15.0624ns 0.2548ns 66,409,197.41 -
ManagedStartsWithOrdinal 32 7.3628ns 0.1546ns 135,874,173.21 +115.0%
NativeStartsWithOrdinal 32 15.8253ns 0.0428ns 63,190,410.16 -
ManagedStartsWithOrdinal 33 7.3799ns 0.2234ns 135,615,415.21 +122.5%
NativeStartsWithOrdinal 33 16.4042ns 0.0081ns 60,960,104.45 -
ManagedStartsWithOrdinal 39 7.7385ns 0.2725ns 129,362,550.74 +107.3%
NativeStartsWithOrdinal 39 16.0278ns 0.2819ns 62,410,223.47 -
ManagedStartsWithOrdinal 40 8.5779ns 2.1443ns 121,025,780.58 +105.6%
NativeStartsWithOrdinal 40 16.9862ns 0.0436ns 58,871,649.71 -
ManagedStartsWithOrdinal 41 8.7882ns 2.4842ns 119,135,835.72 +106.9%
NativeStartsWithOrdinal 41 17.3736ns 0.2933ns 57,574,525.30 -
ManagedStartsWithOrdinal 47 8.4154ns 0.2685ns 118,931,636.56 +109.0%
NativeStartsWithOrdinal 47 17.5703ns 0.0295ns 56,914,274.70 -
ManagedStartsWithOrdinal 48 9.4871ns 2.9241ns 112,060,274.14 +103.7%
NativeStartsWithOrdinal 48 18.1746ns 0.0512ns 55,022,369.85 -
ManagedStartsWithOrdinal 49 8.8741ns 2.2628ns 117,111,320.10 +117.0%
NativeStartsWithOrdinal 49 18.5388ns 0.3370ns 53,958,434.56 -
ManagedStartsWithOrdinal 55 8.5552ns 0.3731ns 117,082,289.20 +117.1%
NativeStartsWithOrdinal 55 18.5490ns 0.3260ns 53,927,654.25 -
ManagedStartsWithOrdinal 56 8.7798ns 0.1277ns 113,921,153.05 +117.9%
NativeStartsWithOrdinal 56 19.1315ns 0.3144ns 52,283,496.83 -
ManagedStartsWithOrdinal 57 8.9098ns 0.2010ns 112,289,539.82 +121.0%
NativeStartsWithOrdinal 57 19.6849ns 0.3403ns 50,815,327.94 -
ManagedStartsWithOrdinal 63 8.9971ns 0.0621ns 111,152,591.31 +118.8%
NativeStartsWithOrdinal 63 19.6945ns 0.3447ns 50,790,785.29 -
ManagedStartsWithOrdinal 64 8.6365ns 0.1763ns 115,834,472.25 +137.6%
NativeStartsWithOrdinal 64 20.5101ns 0.0152ns 48,756,472.34 -
ManagedStartsWithOrdinal 65 8.8134ns 0.1938ns 113,515,687.87 +133.7%
NativeStartsWithOrdinal 65 20.5937ns 0.3608ns 48,572,691.61 -
ManagedStartsWithOrdinal 95 11.9802ns 1.7585ns 84,661,262.37 +103.2%
NativeStartsWithOrdinal 95 24.0109ns 0.4561ns 41,662,022.13 -
ManagedStartsWithOrdinal 96 11.9479ns 1.7785ns 85,031,711.79 +141.5%
NativeStartsWithOrdinal 96 28.5653ns 2.1210ns 35,205,870.74 -
ManagedStartsWithOrdinal 97 12.2576ns 1.8786ns 82,943,412.75 +114.0%
NativeStartsWithOrdinal 97 25.8010ns 0.0674ns 38,758,450.70 -
ManagedStartsWithOrdinal 100 13.0203ns 2.2968ns 78,699,075.76 +98.2%
NativeStartsWithOrdinal 100 25.1867ns 0.4323ns 39,714,642.76 -
ManagedStartsWithOrdinal 127 14.6869ns 1.5608ns 68,684,975.90 +101.3%
NativeStartsWithOrdinal 127 29.3134ns 0.0855ns 34,114,370.89 -
ManagedStartsWithOrdinal 128 14.5981ns 1.4783ns 69,070,179.67 +106.5%
NativeStartsWithOrdinal 128 29.9038ns 0.0327ns 33,440,632.27 -
ManagedStartsWithOrdinal 129 15.1645ns 1.6051ns 66,551,676.86 +100.5%
NativeStartsWithOrdinal 129 30.1423ns 0.5061ns 33,185,159.81 -
ManagedStartsWithOrdinal 255 25.1756ns 1.0345ns 39,783,333.21 +107.1%
NativeStartsWithOrdinal 255 52.0581ns 0.1860ns 19,209,537.11 -
ManagedStartsWithOrdinal 256 26.0471ns 1.7805ns 38,550,579.75 +99.5%
NativeStartsWithOrdinal 256 51.7690ns 0.9111ns 19,322,298.01 -
ManagedStartsWithOrdinal 257 26.0985ns 1.5864ns 38,445,624.87 +113.3%
NativeStartsWithOrdinal 257 55.4758ns 0.1344ns 18,025,969.55 -
ManagedStartsWithOrdinal 511 46.7212ns 2.0730ns 21,444,042.76 +71.9%
NativeStartsWithOrdinal 511 80.1409ns 0.0848ns 12,478,037.67 -
ManagedStartsWithOrdinal 512 47.2675ns 1.7022ns 21,182,853.86 +72.1%
NativeStartsWithOrdinal 512 81.2225ns 0.2300ns 12,311,950.20 -
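
Checking a few rows suggests that the Improve column is the relative gain in op/s of the managed version over the native one, rather than a ratio of the average times. A minimal sketch of that calculation, with hypothetical names:

```csharp
using System;

static class BenchmarkMath
{
    // Relative throughput gain of the candidate over the baseline, in percent.
    // Assumption: this is how the Improve column was derived; the PR does not say.
    public static double ImprovePercent(double candidateOpsPerSec, double baselineOpsPerSec)
        => (candidateOpsPerSec / baselineOpsPerSec - 1.0) * 100.0;
}
```

For the length 2 row, `BenchmarkMath.ImprovePercent(249_744_319.70, 98_679_099.50)` comes to about 153.1, which agrees with the +153.1% shown.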

Graphed

The yellow axis assumes each extra char has a fixed cost, set from the cost of ManagedStartsWithOrdinal at length 16.

X-axis tick per test

[graph: tick-per-test]

X-axis uniform

[graph: uniform]

Details

Function RyuJIT x64 asm: https://gist.github.com/benaadams/7b17a4171ec7e9b81bbe

Verify and Benchmark: https://gist.github.com/benaadams/792d2734ef569d45be42

Break vs return false: https://gist.github.com/adamsitnik/9d4f0107bdc15a802bbf#file-x86jit_break_vs_return_false

benaadams changed the title from "String.StartsWith Ordinal performance" to "String.StartsWith Ordinal optimization pt 2" on Jan 15, 2016
@jkotas
Member

jkotas commented Jan 15, 2016

Calling into native comes with some overhead

The overhead of calling an FCall is the same as calling another managed method. That is not where your improvements are coming from. Your improvements are coming from:

  1. Avoiding redundant argument validation
  2. More extensive manual loop unrolling than what's in the current native implementation
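
As a rough illustration of those two points, here is a minimal sketch (hypothetical names, not the PR's or the runtime's code) in which the public entry point validates its arguments once, and a private helper then runs a manually unrolled loop without repeating any checks.

```csharp
using System;

static class ValidationSketch
{
    // Public entry point: validates once, then delegates.
    public static bool StartsWith(string text, string value)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        if (value is null) throw new ArgumentNullException(nameof(value));
        if (value.Length > text.Length) return false;
        if (value.Length == 0) return true;
        return StartsWithCore(text, value);
    }

    // Assumes the caller has already verified: both non-null, value.Length <= text.Length.
    private static bool StartsWithCore(string text, string value)
    {
        int n = value.Length;
        int i = 0;

        // Manually unrolled: four chars per iteration.
        for (; i + 4 <= n; i += 4)
        {
            if (text[i] != value[i] ||
                text[i + 1] != value[i + 1] ||
                text[i + 2] != value[i + 2] ||
                text[i + 3] != value[i + 3])
                return false;
        }

        // Remaining 0 to 3 chars.
        for (; i < n; i++)
        {
            if (text[i] != value[i]) return false;
        }

        return true;
    }
}
```

A benchmark that calls only the inner helper would skip the validation cost that the real `StartsWith` call path pays, which is one way the two measurements could end up not being like for like.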

{
    var byteCount = startsWith.Length << 1;
    // value.Length verified to be less than or equal to this.Length by calling function
    if (byteCount > 512)

If I am reading your data correctly, the fcall is always slower. Why call it for larger strings?

@jkotas
Member

jkotas commented Jan 15, 2016

This change will need to be ported to CoreRT. You may want to take a look at what has been done there, so that the two do not diverge.

@jkotas
Member

jkotas commented Jan 15, 2016

cc @bbowyersmyth

}

fixed (char* cpString = this)
fixed (char* cpStartsWith = startsWith)

The above two lines can be:

fixed (char* cpString = &m_firstChar)
fixed (char* cpStartsWith = &startsWith.m_firstChar)

See #2636

@benaadams
Member Author

The overhead of calling FCall is same as calling another managed method.

From the BlockCopy testing, I was assuming there was a cost in the prolog and epilog for storing and restoring the registers that a managed call didn't necessarily pay: https://github.com/dotnet/coreclr/issues/2430#issuecomment-166594959

Will rework the benchmark to include the extra validation/preamble that happens in .StartsWith

@bbowyersmyth

Unrolling to 64 chars in a StartsWith comparison feels like more of an edge case than the common case. I can only really think of URL comparisons that would benefit from that.

I'm curious whether a specialised compare that worked backwards from the end, where the strings are more likely to differ, would be better for that. A match would probably be slower, though.
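
The backwards idea could look something like the following minimal sketch (hypothetical names, not code from this PR or from CoreRT). It walks the prefix from its last char towards its first, on the assumption that URL-like inputs share a long common head and differ near the end of the prefix.

```csharp
using System;

static class BackwardsSketch
{
    // Compares the prefix region from its last char back to its first.
    public static bool StartsWithFromEnd(string text, string prefix)
    {
        if (prefix.Length > text.Length) return false;

        for (int i = prefix.Length - 1; i >= 0; i--)
        {
            if (text[i] != prefix[i]) return false;
        }

        return true;
    }
}
```

With `"/api/orders/1"` tested against the prefix `"/api/users/"`, this rejects on the first comparison (index 10), whereas a forward scan would first walk the shared `"/api/"` head. Whether that pays off in practice would need measuring; a full match still touches every char.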

@benaadams
Member Author

Need to look into this further.

@benaadams benaadams closed this Jan 18, 2016
@bbowyersmyth

Mind if I take a look at this, @benaadams? It looks like a variation of EqualsHelper performs pretty well up to 100 chars and might be usable for Equals itself while still remaining pretty simple code.

@benaadams
Member Author

@bbowyersmyth sure. I made some changes to exactly mirror the calling function (validation, early comparisons, full case, etc.): changed to gotos, changed statements to if rather than while, added an extra start char* == check to go aligned, and removed the large unroll (changing the second to a while loop)... and...

I saw the start-up perf drop off, so the original and the changed versions were more or less equal, and it behaved as in @jkotas's opening statement: the newer loop unrolling pulled ahead as the strings got longer, but otherwise it was the same.

I'm not currently sure why, though; the validation seems to change the function from a regular pass-through into one that pushes 8+ registers to the stack and then pops the same registers back at the end. I didn't have time to look deeper.

The loop unrolling and gotos seem to produce nice, clean, fast asm though, for that portion at least; unless the use of goto triggers some kind of stack canary / protection and that's what I'm seeing? As might the call to native in the original?
