[stdlib][SR-7556] Re-implement string-to-integer parsing #36623

xwu · 2021-03-28T21:07:59Z

This PR attempts to improve code size and performance for string-to-integer parsing without introducing any additions to the ABI. As noted in the relevant bug:

Constructing an Int from a String eventually involves calling _parseASCII, which is slow and bloated if not properly specialized.

A series of unfortunate events played out over time: this function was overly generic, so it was marked @inline(__always) to be fast; then it was discovered that the function was about 20KB in size, and its callers were all marked with @inline(never) and various @_semantics to disable specialization for size (and perhaps compilation time).

The end result was something bloated and slow that is nevertheless still emitted into the user module, increasing code size.

To confront this issue, this PR introduces a new _parseASCII. Instead of taking an argument of type T: IteratorProtocol where T.Element: UnsignedInteger, it expects a buffer of type UnsafeBufferPointer<UInt8>. This is made possible by the existence of the withContiguousStorageIfAvailable API on StringProtocol.

If contiguous storage is not available, then an @inline(never) fallback is called, which initializes a mutable String value and then makes use of the withUTF8 API. We consign this function to the fallback path in order to avoid creating a mutable copy if we can.
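The split between the fast path and the fallback can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `parseASCIISketch`, `parseViaCopySketch`, and `parseIntegerSketch` are hypothetical, and the sketch assumes that contiguous storage is reached through the `utf8` view of the string.

```swift
// Hypothetical names throughout; this is not the standard library's code.

/// Parses an optional sign followed by ASCII digits in `radix` from a
/// contiguous UTF-8 buffer. Returns nil on empty input, an invalid
/// digit, or overflow.
func parseASCIISketch<Value: FixedWidthInteger>(
  _ utf8: UnsafeBufferPointer<UInt8>, radix: Int
) -> Value? {
  var bytes = utf8[...]
  guard let first = bytes.first else { return nil }
  var isNegative = false
  if first == UInt8(ascii: "-") || first == UInt8(ascii: "+") {
    isNegative = first == UInt8(ascii: "-")
    bytes = bytes.dropFirst()
    guard !bytes.isEmpty else { return nil }
  }
  let multiplicand = Value(radix)
  var result: Value = 0
  for byte in bytes {
    let digit: UInt8
    switch byte {
    case UInt8(ascii: "0")...UInt8(ascii: "9"): digit = byte - UInt8(ascii: "0")
    case UInt8(ascii: "a")...UInt8(ascii: "z"): digit = byte - UInt8(ascii: "a") + 10
    case UInt8(ascii: "A")...UInt8(ascii: "Z"): digit = byte - UInt8(ascii: "A") + 10
    default: return nil
    }
    guard Int(digit) < radix else { return nil }
    // Accumulate downward for negative input so that `Value.min` is reachable.
    let (product, overflow1) = result.multipliedReportingOverflow(by: multiplicand)
    let (sum, overflow2) = isNegative
      ? product.subtractingReportingOverflow(Value(digit))
      : product.addingReportingOverflow(Value(digit))
    guard !overflow1 && !overflow2 else { return nil }
    result = sum
  }
  return result
}

/// Fallback: only reached when no contiguous storage is available, so the
/// mutable copy required by `withUTF8` is made only on this path.
@inline(never)
func parseViaCopySketch<S: StringProtocol, Value: FixedWidthInteger>(
  _ text: S, radix: Int
) -> Value? {
  var copy = String(text)
  return copy.withUTF8 { parseASCIISketch($0, radix: radix) }
}

/// Entry point: try the contiguous buffer first, then fall back.
func parseIntegerSketch<S: StringProtocol, Value: FixedWidthInteger>(
  _ text: S, radix: Int = 10, as type: Value.Type = Value.self
) -> Value? {
  precondition((2...36).contains(radix), "Radix not in range 2...36")
  // The outer optional is nil when contiguous storage is unavailable;
  // the inner optional is the parse result itself.
  let fast: Value?? = text.utf8.withContiguousStorageIfAvailable {
    parseASCIISketch($0, radix: radix)
  }
  if let fast = fast { return fast }
  return parseViaCopySketch(text, radix: radix)
}
```

For example, `parseIntegerSketch("ff", radix: 16, as: UInt8.self)` takes the fast path for a native string, and the copy in `parseViaCopySketch` is only paid for when the `utf8` view cannot vend a buffer.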

The extremely pared-down implementation in this PR is the result of several iterative rounds of benchmarking and simplification described below. It's tempting to consider further specializations for base 8, 10, or 16, but the possible wins would appear to be negligible without a significantly more sophisticated implementation (such as that attempted in #30094, as pointed out in the conversation below). Those are deferred to prioritize addressing some low-hanging fruit.

Resolves SR-7556.

Summary of findings

Baseline

As a baseline, I used #36625 to demonstrate what would occur if the existing implementation had its @inline(never) and @_semantics markings removed. This revealed sizable improvements in microbenchmark performance but a significant regression in code size (excerpted results below):

PERFORMANCE -O Improvement	OLD	NEW	DELTA	RATIO
ParseInt.IntSmall.Decimal	437	228	-47.8%	1.92x
StrToInt	1600	860	-46.2%	1.86x
ParseInt.IntSmall.UncommonRadix	481	263	-45.3%	1.83x
ParseInt.UInt64.Decimal	204	153	-25.0%	1.33x
ParseInt.UInt64.Hex	332	272	-18.1%	1.22x
ParseInt.UIntSmall.Binary	641	542	-15.4%	1.18x

CODE SIZE -O Regression	OLD	NEW	DELTA	RATIO
IntegerParsing.o ⚠️	56677	88581	+56.3%	0.64x
RangeReplaceableCollectionPlusDefault.o	5442	6902	+26.8%	0.79x

CODE SIZE -O Improvement	OLD	NEW	DELTA	RATIO
StrToInt.o	4615	3657	-20.8%	1.26x
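For readers of the tables in this thread, the DELTA and RATIO columns appear to be derived from OLD and NEW as a relative change and a quotient, respectively. The short sketch below (function names are hypothetical) reproduces two rows from the baseline tables above under that reading.

```swift
/// DELTA column: relative change from OLD to NEW, in percent.
/// Negative values mean NEW is smaller (faster, or less code).
func deltaPercent(old: Double, new: Double) -> Double {
  (new - old) / old * 100
}

/// RATIO column: OLD divided by NEW, so values above 1 are improvements.
func ratio(old: Double, new: Double) -> Double {
  old / new
}

/// Rounds to a fixed scale (10 for one decimal place, 100 for two).
func rounded(_ value: Double, scale: Double) -> Double {
  (value * scale).rounded() / scale
}

// ParseInt.IntSmall.Decimal at -O: OLD 437, NEW 228.
print(rounded(deltaPercent(old: 437, new: 228), scale: 10))
print(rounded(ratio(old: 437, new: 228), scale: 100))

// IntegerParsing.o at -O: OLD 56677, NEW 88581.
print(rounded(deltaPercent(old: 56677, new: 88581), scale: 10))
print(rounded(ratio(old: 56677, new: 88581), scale: 100))
```

Under this reading, the first row works out to the table's -47.8% and 1.92x, and the second to +56.3% and 0.64x.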

First attempts

The first attempts to improve upon the status quo yielded similar results to the above.

A manually specialized implementation of _parseASCII and a new _parseASCIIDigits were added, both of which take an UnsafeBufferPointer<UInt8> argument as the source. Additionally, generic versions of the above were maintained; with this setup, I attempted to mark the duplicated implementations with different attributes in the hopes of fine-tuning code size and performance.

However, even after marking the fallback generic helper functions using @inline(never), there were improvements in microbenchmarks but significant code size regressions (excerpted results below):

PERFORMANCE -O Improvement	OLD	NEW	DELTA	RATIO
ParseInt.IntSmall.Decimal	437	206	-52.9%	2.12x
ParseInt.IntSmall.UncommonRadix	481	229	-52.4%	2.10x
ParseInt.UInt64.Decimal	203	117	-42.4%	1.74x
StrToInt	1600	990	-38.1%	1.62x
ParseInt.UIntSmall.Binary	641	418	-34.8%	1.53x
ParseInt.UInt64.Hex	331	272	-17.8%	1.22x

CODE SIZE -O Regression	OLD	NEW	DELTA	RATIO
IntegerParsing.o ⚠️	56677	90807	+60.2%	0.62x
StrToInt.o	4615	6065	+31.4%	0.76x
RangeReplaceableCollectionPlusDefault.o	5442	7070	+29.9%	0.77x
LuhnAlgoEager.o	11370	13372	+17.6%	0.85x
LuhnAlgoLazy.o	11370	13372	+17.6%	0.85x
DictionaryCompactMapValues.o	13853	15770	+13.8%	0.88x

Minimum code size

After removing all manually repeated code and further simplifying the implementation, I attempted to remove the @inline(__always) marking from FixedWidthInteger.init(_:radix:) and to test the effect of explicitly requiring partial specializations for S == String and for S == Substring using @_specialize(kind: partial, ...).

This decreased the compiled size of the standard library itself by ~1% and improved the code size of the IntegerParsing microbenchmarks. However, it wiped out most performance improvements at -O, except for StrToInt (excerpted results below):

PERFORMANCE -O Regression	OLD	NEW	DELTA	RATIO
ParseInt.UInt64.Hex	332	365	+9.9%	0.91x

PERFORMANCE -O Improvement	OLD	NEW	DELTA	RATIO
StrToInt	1600	950	-40.6%	1.68x

CODE SIZE -O Regression	OLD	NEW	DELTA	RATIO
RangeReplaceableCollectionPlusDefault.o	5442	6220	+14.3%	0.87x
LuhnAlgoEager.o	11370	12298	+8.2%	0.92x
LuhnAlgoLazy.o	11370	12298	+8.2%	0.92x
DictionaryCompactMapValues.o	13853	14714	+6.2%	0.94x

CODE SIZE -O Improvement	OLD	NEW	DELTA	RATIO
IntegerParsing.o ✅	56677	55733	-1.7%	1.02x

CODE SIZE: -swiftlibs

Improvement	OLD	NEW	DELTA	RATIO
libswiftCore.dylib	3850240	3801088	-1.3%	1.01x
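The partial specializations tried in this round can be illustrated with a small sketch. The function below is hypothetical and is not the standard library initializer; it only shows the shape of requesting specializations for S == String and S == Substring while leaving the integer type generic. Note that @_specialize is an underscored attribute with no stability guarantee, so the exact spelling here is an assumption about current compiler behavior.

```swift
// Hypothetical function, not FixedWidthInteger.init(_:radix:).
// `@_specialize` is an underscored, unsupported attribute; its spelling
// and behavior are assumed here and may change between compiler versions.
@_specialize(kind: partial, where S == String)
@_specialize(kind: partial, where S == Substring)
func parseDecimalSketch<S: StringProtocol, Value: FixedWidthInteger>(
  _ text: S, as type: Value.Type
) -> Value? {
  guard !text.isEmpty else { return nil }
  var result: Value = 0
  for byte in text.utf8 {
    // Accept only ASCII decimal digits; no sign handling in this sketch.
    guard (UInt8(ascii: "0")...UInt8(ascii: "9")).contains(byte) else {
      return nil
    }
    let digit = Value(byte - UInt8(ascii: "0"))
    let (product, overflow1) = result.multipliedReportingOverflow(by: 10)
    let (sum, overflow2) = product.addingReportingOverflow(digit)
    guard !overflow1 && !overflow2 else { return nil }
    result = sum
  }
  return result
}

// Both call sites can use a pre-specialized entry point for `S`.
let whole = parseDecimalSketch("255", as: UInt8.self)
let slice = parseDecimalSketch("x123y".dropFirst().dropLast(), as: Int.self)
```

The trade-off described above follows from this shape: the specialized bodies live in the defining module rather than being inlined into each caller, which helps code size but gives up optimizations that depend on the concrete integer type at the call site.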

Inlined performance

The final form of this PR restores the @inline(__always) marking to FixedWidthInteger.init(_:radix:) (and simplifies the implementation further). Doing so produces a result where a sizable proportion of the performance benefit seen in the baseline benchmarks can be recovered with a very modest code size increase. As before, the compiled size of the standard library itself is decreased by ~1%.

This implementation relies on no @_semantics annotations and, perhaps relatedly, exhibits performance improvements at -O, -Osize, and -Onone (excerpted -O results below):

PERFORMANCE -O Improvement	OLD	NEW	DELTA	RATIO
StrToInt	1600	1000	-37.5%	1.60x
ParseInt.IntSmall.UncommonRadix	481	328	-31.8%	1.47x
ParseInt.IntSmall.Decimal	437	302	-30.9%	1.45x
ParseInt.UInt64.Decimal	204	166	-18.6%	1.23x
ParseInt.UIntSmall.Binary	641	585	-8.7%	1.10x

CODE SIZE -O Regression	OLD	NEW	DELTA	RATIO
LuhnAlgoEager.o	11370	12178	+7.1%	0.93x
LuhnAlgoLazy.o	11370	12178	+7.1%	0.93x
RangeReplaceableCollectionPlusDefault.o	5442	5804	+6.7%	0.94x
StrToInt.o	4615	4859	+5.3%	0.95x
DictionaryCompactMapValues.o	13853	14570	+5.2%	0.95x
IntegerParsing.o 🏁	56677	59557	+5.1%	0.95x

CODE SIZE: -swiftlibs

Improvement	OLD	NEW	DELTA	RATIO
libswiftCore.dylib	3850240	3801088	-1.3%	1.01x

(All versions of this PR show varying degrees of code size regressions in LuhnAlgoEager, LuhnAlgoLazy, RangeReplaceableCollectionPlusDefault, and DictionaryCompactMapValues. I have to presume that they are attributable to emitting this new implementation into the client; in the final iteration, these code size increases are the most modest yet.)

benrimmington · 2021-03-28T21:31:47Z

There's also #30094 by @PatrickPijnappel

xwu · 2021-03-28T21:43:06Z

@benrimmington It's been long enough that I'd forgotten about that PR 🤦, and the bug doesn't make mention of it. If @PatrickPijnappel wants to finish that one up, happy to set this aside.

The solution presented here is significantly less involved, and I'm curious to see what the benchmarks show. If there's sufficient incremental improvement, this could be landed without blocking a subsequent more sophisticated implementation that makes use of SWAR as @PatrickPijnappel is doing.
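For context on what SWAR (SIMD within a register) means for digit parsing, here is a minimal sketch of a widely known trick that converts eight ASCII decimal digits with a handful of 64-bit operations. It is not taken from #30094, whose approach may differ; the function name is hypothetical.

```swift
/// Hypothetical helper: parses exactly eight ASCII decimal digits using
/// SWAR. Returns nil if the buffer is not eight bytes of "0"..."9".
func parseEightDigitsSWAR(_ bytes: UnsafeBufferPointer<UInt8>) -> UInt32? {
  guard bytes.count == 8 else { return nil }
  // Assemble the bytes little-endian: the first character is the low byte.
  var word: UInt64 = 0
  for (i, byte) in bytes.enumerated() {
    word |= UInt64(byte) << (8 * i)
  }
  // Validate all eight bytes at once: each must lie in 0x30...0x39.
  let highNibbles: UInt64 = 0xF0F0_F0F0_F0F0_F0F0
  let check = (word & highNibbles)
    | (((word &+ 0x0606_0606_0606_0606) & highNibbles) >> 4)
  guard check == 0x3333_3333_3333_3333 else { return nil }
  // Turn ASCII into digit values 0...9 in every byte lane.
  word &-= 0x3030_3030_3030_3030
  // Combine adjacent digits into two-digit values (0...99) in the
  // even-numbered byte lanes.
  word = (word &* 10) &+ (word >> 8)
  // Combine the four two-digit values with two wrapping multiplications;
  // the final result accumulates in the upper 32 bits.
  let mask: UInt64 = 0x0000_00FF_0000_00FF
  let mul1: UInt64 = 100 + (1_000_000 << 32)
  let mul2: UInt64 = 1 + (10_000 << 32)
  word = ((word & mask) &* mul1) &+ (((word >> 16) & mask) &* mul2)
  return UInt32(truncatingIfNeeded: word >> 32)
}

let parsed = Array("12345678".utf8).withUnsafeBufferPointer(parseEightDigitsSWAR)
```

A byte-at-a-time loop performs one multiply and one add per digit; the sketch above replaces eight such steps with three multiplications, at the cost of needing separate handling for inputs that are not a multiple of eight digits long.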

PatrickPijnappel · 2021-03-28T22:23:49Z

I feel bad that I didn't get around to finishing that PR for such a long time; my work situation changed. It was very close to being merged, just stuck on a final simplification that seemed to change retain/release behavior, wiping out the gains. If you're interested, I'm open to collaborating on that one somehow; I can prioritize some time.

Nevertheless, if this PR delivers significant gains it makes sense to merge it first, especially since it doesn't introduce anything that needs to be maintained from an ABI perspective.

xwu · 2021-03-28T23:29:53Z

@PatrickPijnappel I'm also not blessed with a large amount of time these days, sadly. I was more hoping that there was some low-hanging fruit here; if the benchmarks aren't really exciting, I think I'll have to leave this work in others' hands.

xwu · 2021-03-29T02:26:04Z

@swift-ci benchmark

swift-ci · 2021-03-29T03:32:54Z

Performance: -O

Regression	OLD	NEW	DELTA	RATIO
NSStringConversion.UTF8	935	1041	+11.3%	0.90x (?)
ObjectiveCBridgeFromNSArrayAnyObjectToStringForced	30400	33000	+8.6%	0.92x (?)
ObjectiveCBridgeFromNSArrayAnyObjectForced	4420	4780	+8.1%	0.92x (?)
NSStringConversion.MutableCopy.LongUTF8	636	686	+7.9%	0.93x (?)

Improvement	OLD	NEW	DELTA	RATIO
ParseInt.IntSmall.Decimal	437	206	-52.9%	2.12x
ParseInt.IntSmall.UncommonRadix	481	229	-52.4%	2.10x
ParseInt.UInt64.Decimal	203	117	-42.4%	1.74x
StrToInt	1600	990	-38.1%	1.62x
ParseInt.UIntSmall.Binary	641	418	-34.8%	1.53x
StringFromLongWholeSubstring	5	4	-20.0%	1.25x
ParseInt.UInt64.Hex	331	272	-17.8%	1.22x
DictionaryCompactMapValuesOfCastValue	7452	6858	-8.0%	1.09x (?)
Data.hash.Medium	42	39	-7.1%	1.08x (?)
AngryPhonebook.Armenian.Small	877	818	-6.7%	1.07x (?)
String.replaceSubrange.ArrChar.Small	76	71	-6.6%	1.07x (?)

Code size: -O

Regression	OLD	NEW	DELTA	RATIO
IntegerParsing.o	56677	90807	+60.2%	0.62x
StrToInt.o	4615	6065	+31.4%	0.76x
RangeReplaceableCollectionPlusDefault.o	5442	7070	+29.9%	0.77x
LuhnAlgoEager.o	11370	13372	+17.6%	0.85x
LuhnAlgoLazy.o	11370	13372	+17.6%	0.85x
DictionaryCompactMapValues.o	13853	15770	+13.8%	0.88x
DriverUtils.o	129127	133481	+3.4%	0.97x

Performance: -Osize

Regression	OLD	NEW	DELTA	RATIO
RandomShuffleLCG2	416	448	+7.7%	0.93x
DictionaryKeysContainsNative	26	28	+7.7%	0.93x (?)
Array2D	6992	7520	+7.6%	0.93x (?)

Improvement	OLD	NEW	DELTA	RATIO
ParseInt.IntSmall.Decimal	512	202	-60.5%	2.53x
ParseInt.IntSmall.UncommonRadix	569	228	-59.9%	2.50x
StrToInt	1900	970	-48.9%	1.96x
ParseInt.UIntSmall.Binary	692	428	-38.2%	1.62x
ParseInt.UInt64.Decimal	223	141	-36.8%	1.58x
StringFromLongWholeSubstring	5	4	-20.0%	1.25x
DictionaryCompactMapValuesOfCastValue	7506	6858	-8.6%	1.09x
ParseInt.UInt64.Hex	325	301	-7.4%	1.08x (?)
AngryPhonebook.Armenian.Small	883	823	-6.8%	1.07x (?)

Code size: -Osize

Regression	OLD	NEW	DELTA	RATIO
IntegerParsing.o	51810	81283	+56.9%	0.64x
RangeReplaceableCollectionPlusDefault.o	4715	6411	+36.0%	0.74x
StrToInt.o	4404	5400	+22.6%	0.82x
LuhnAlgoEager.o	12327	13929	+13.0%	0.88x
LuhnAlgoLazy.o	12327	13929	+13.0%	0.88x
DictionaryCompactMapValues.o	12248	13804	+12.7%	0.89x
DriverUtils.o	122969	125759	+2.3%	0.98x

Performance: -Onone

Regression	OLD	NEW	DELTA	RATIO
ConvertFloatingPoint.MockFloat64ToInt64	49959	53781	+7.7%	0.93x (?)

Improvement	OLD	NEW	DELTA	RATIO
ParseInt.UIntSmall.Binary	22822	17027	-25.4%	1.34x
ParseInt.UInt64.Decimal	6356	4824	-24.1%	1.32x
ParseInt.UInt64.Hex	5656	4566	-19.3%	1.24x
StrToInt	42680	35060	-17.9%	1.22x
LuhnAlgoEager	4764	4263	-10.5%	1.12x (?)
DictionaryCompactMapValuesOfCastValue	55080	49410	-10.3%	1.11x
RangeReplaceableCollectionPlusDefault	6984	6364	-8.9%	1.10x (?)
ParseInt.IntSmall.UncommonRadix	11331	10336	-8.8%	1.10x (?)
String.replaceSubrange.Substring.Small	87	81	-6.9%	1.07x (?)
ParseInt.IntSmall.Decimal	10017	9359	-6.6%	1.07x (?)

Code size: -swiftlibs

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview

  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

xwu · 2021-04-01T17:19:05Z

@swift-ci test Linux platform

xwu · 2021-04-01T17:19:13Z

@swift-ci test macOS platform

xwu · 2021-04-01T17:19:20Z

@swift-ci benchmark

xwu · 2021-04-01T18:58:52Z

@swift-ci benchmark

swift-ci · 2021-04-01T19:40:19Z

Build failed
Swift Test OS X Platform
Git Sha - 92d492f

swift-ci · 2021-04-01T20:38:48Z

Performance: -O

Regression	OLD	NEW	DELTA	RATIO
String.data.Medium	95	115	+21.1%	0.83x (?)
FlattenListFlatMap	4763	5273	+10.7%	0.90x (?)
NSError	145	160	+10.3%	0.91x (?)

Improvement	OLD	NEW	DELTA	RATIO
StrToInt	1430	920	-35.7%	1.55x
ParseInt.IntSmall.Decimal	392	270	-31.1%	1.45x
ParseInt.IntSmall.UncommonRadix	432	299	-30.8%	1.44x
Breadcrumbs.MutatedUTF16ToIdx.ASCII	4	3	-25.0%	1.33x
Breadcrumbs.MutatedIdxToUTF16.ASCII	4	3	-25.0%	1.33x
ParseInt.UInt64.Decimal	184	145	-21.2%	1.27x
ParseInt.UIntSmall.Binary	575	517	-10.1%	1.11x
ParseInt.UInt64.Hex	299	273	-8.7%	1.10x
FindString.Loop1.Substring	455	424	-6.8%	1.07x (?)

Code size: -O

Regression	OLD	NEW	DELTA	RATIO
LuhnAlgoEager.o	11370	12162	+7.0%	0.93x
LuhnAlgoLazy.o	11370	12162	+7.0%	0.93x
RangeReplaceableCollectionPlusDefault.o	5442	5788	+6.4%	0.94x
DictionaryCompactMapValues.o	13853	14554	+5.1%	0.95x
StrToInt.o	4615	4843	+4.9%	0.95x
IntegerParsing.o	56677	59477	+4.9%	0.95x

Performance: -Osize

Regression	OLD	NEW	DELTA	RATIO
UTF8Decode_InitFromCustom_contiguous_ascii_as_ascii	346	406	+17.3%	0.85x (?)
FlattenListFlatMap	3446	3800	+10.3%	0.91x (?)
DropFirstAnyCollectionLazy	79438	86611	+9.0%	0.92x (?)
DropLastAnyCollectionLazy	26992	29307	+8.6%	0.92x (?)
SuffixAnyCollectionLazy	26343	28487	+8.1%	0.92x (?)

Improvement	OLD	NEW	DELTA	RATIO
StrToInt	1700	930	-45.3%	1.83x
ParseInt.IntSmall.UncommonRadix	510	307	-39.8%	1.66x
ParseInt.IntSmall.Decimal	459	291	-36.6%	1.58x
ParseInt.UInt64.Decimal	197	151	-23.4%	1.30x
ParseInt.UIntSmall.Binary	621	520	-16.3%	1.19x
DictionaryLiteral	3670	3310	-9.8%	1.11x (?)
DictionaryCompactMapValuesOfCastValue	6696	6210	-7.3%	1.08x (?)

Code size: -Osize

Regression	OLD	NEW	DELTA	RATIO
RangeReplaceableCollectionPlusDefault.o	4715	5511	+16.9%	0.86x
LuhnAlgoEager.o	12327	13079	+6.1%	0.94x
LuhnAlgoLazy.o	12327	13079	+6.1%	0.94x
DictionaryCompactMapValues.o	12248	12951	+5.7%	0.95x
IntegerParsing.o	51810	54751	+5.7%	0.95x
StrToInt.o	4404	4547	+3.2%	0.97x

Performance: -Onone

Regression	OLD	NEW	DELTA	RATIO
NSDictionaryCastToSwift	2490	3020	+21.3%	0.82x (?)
StringBuilderWithLongSubstring	3920	4710	+20.2%	0.83x (?)
RandomDoubleLCG	39156	42382	+8.2%	0.92x (?)

Improvement	OLD	NEW	DELTA	RATIO
ParseInt.UIntSmall.Binary	20519	9198	-55.2%	2.23x
ParseInt.UInt64.Decimal	5793	2622	-54.7%	2.21x
StrToInt	38590	18730	-51.5%	2.06x
ParseInt.UInt64.Hex	5133	2645	-48.5%	1.94x
ParseInt.IntSmall.UncommonRadix	10196	5620	-44.9%	1.81x
ParseInt.IntSmall.Decimal	9002	5291	-41.2%	1.70x
DictionaryCompactMapValuesOfCastValue	49410	37746	-23.6%	1.31x
LuhnAlgoEager	4304	3752	-12.8%	1.15x (?)
LuhnAlgoLazy	4218	3788	-10.2%	1.11x (?)
RangeReplaceableCollectionPlusDefault	5884	5356	-9.0%	1.10x (?)

Code size: -swiftlibs

Improvement	OLD	NEW	DELTA	RATIO
libswiftCore.dylib	3850240	3784704	-1.7%	1.02x

Hardware Overview

  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 8-Core Intel Xeon E5
  Processor Speed: 3 GHz
  Number of Processors: 1
  Total Number of Cores: 8
  L2 Cache (per Core): 256 KB
  L3 Cache: 25 MB
  Memory: 64 GB

xwu · 2021-04-01T20:38:48Z

@milseman I think this is ready.

(The failed macOS tests are also failing in other PRs--e.g., #36669--suggesting they're unrelated to this change.)

xwu · 2021-04-01T20:39:11Z

@swift-ci please smoke test windows

xwu · 2021-04-02T14:24:53Z

@swift-ci smoke test

xwu · 2021-04-02T17:27:37Z

Ugh, really?

@swift-ci smoke test macOS platform

xwu · 2021-04-02T17:30:18Z

@swift-ci test macOS platform

xwu · 2021-04-02T21:20:30Z

@swift-ci smoke test Windows platform

xwu · 2021-04-02T22:33:25Z

@milseman Ship it?

milseman

LGTM

[stdlib][SR-7556] Re-implement string-to-integer parsing.

ad992f4