[stdlib] Rewrite integer parsing #30094

Open; wants to merge 13 commits into base: main

PatrickPijnappel
Collaborator

@PatrickPijnappel PatrickPijnappel commented Feb 27, 2020

This is a rewrite of integer parsing for performance and code size. It addresses several major issues with the existing implementation (SR-7556), and also uses a new approach, described below, to achieve significant further gains.

Resolves SR-7556.

[Updated] Latest Benchmark Results

By default, the benchmark suite only runs the tests marked in bold and skips the others.

                Decimal   Hex   Binary   Other
Small   Int       1.5     1.7     2.5     1.0
        UInt      2.1     2.1     5.6     1.4
32-bit  Int       2.0     3.3     4.7     1.1
        UInt      1.9     3.2     8.0     1.0
64-bit  Int       3.5     6.3     7.2     1.4
        UInt      3.8     6.2    11.5     1.7

Notes on Approach

The key idea here is to use the bytes of a UInt64 in a way similar to SIMD. The new UTF-8 backing of Strings makes them perfectly suited to this technique for ASCII operations. Aside from integer parsing, ASCII case manipulation, for example, also sees many-X speed-ups with this approach. Example (details below):

let str: UInt64 = 0x3132_3030_3633_3739 // ASCII "12006379"
let digits = str &- 0x3030_3030_3030_3030 // Subtract "0" from every lane
// … (underflow check omitted for clarity)
let c: UInt64 = (0x7f - 9) &* 0x0101_0101_0101_0101 // Constant to check value > 9
let isAnyAbove9 = (digits &+ c) & 0x8080_8080_8080_8080 != 0

In most cases, each byte is treated as a 7-bit value, with the remaining high bit per byte being considered a flag for tests and to capture under/overflow.
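
As a hedged illustration, the example above can be completed into a runnable check that includes the underflow test it omits. The function name `decimalDigitLanes` and the up-front non-ASCII guard are assumptions made for this sketch; they are not taken from the PR.

```swift
/// Hypothetical helper: returns the eight digit values (one per byte lane)
/// when every byte of `chunk` is an ASCII decimal digit, otherwise nil.
func decimalDigitLanes(_ chunk: UInt64) -> UInt64? {
  let flagBits: UInt64 = 0x8080_8080_8080_8080
  // Assumption of this sketch: reject non-ASCII bytes first, so that every
  // lane holds a 7-bit value before any arithmetic is done.
  guard chunk & flagBits == 0 else { return nil }
  // Subtract "0" from every lane. A lane below "0" underflows, which sets
  // its flag bit (and may garble the more significant lanes).
  let digits = chunk &- 0x3030_3030_3030_3030
  guard digits & flagBits == 0 else { return nil }
  // Adding 0x7f - 9 to every lane sets the flag bit exactly when the lane
  // value is greater than 9.
  let c: UInt64 = (0x7f - 9) &* 0x0101_0101_0101_0101
  guard (digits &+ c) & flagBits == 0 else { return nil }
  return digits
}

let digitLanes = decimalDigitLanes(0x3132_3030_3633_3739) // ASCII "12006379"
```

Each guard only inspects the flag bits, so the three checks could also be OR-ed together and tested once; they are kept separate here for readability.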

Operation Rules

  • Bitwise operations (|, &, ^, ~) are clearly always valid, and shifts only if you know the lanes won't mix.
  • When any of the flag bits are set, further operations except bitwise ones are not generally valid.
  • Addition (when both values are 7-bit) is valid, with the flag bit indicating overflow.
  • Subtraction (when both values are 7-bit) is valid, however upon underflow all more significant lanes will be garbled. The flag bit for the underflowing lane will be reliably set though.
  • Comparisons (which set flags) can be made using value &+ c (>/>=) or c &- value (</<=), by picking c such that for the true case the lane will overflow, i.e. be >= 0x80.
  • The flags can be made into 7-bit masks using flags = value & 0x8080…; mask = flags &- (flags &>> 7). This mask can then be used to branchlessly conditionalize further operations, e.g. value &+= (…) & mask.
  • Multiplying a byte by 0x0101_0101_0101_0101 puts it in every lane.
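
The comparison, mask and broadcast rules can be combined in a small sketch. It uses ASCII upper-casing as the example, since case manipulation is mentioned above as another use of the technique. The names `Lanes`, `broadcast` and `uppercasedASCII` are hypothetical, and the input is assumed to be eight ASCII bytes packed into one UInt64.

```swift
/// Hypothetical namespace for the per-lane constants used below.
enum Lanes {
  static let ones: UInt64 = 0x0101_0101_0101_0101
  static let flagBits: UInt64 = 0x8080_8080_8080_8080

  /// Multiplying a byte by 0x0101… puts it in every lane.
  static func broadcast(_ byte: UInt8) -> UInt64 { UInt64(byte) &* ones }
}

/// Upper-cases every ASCII lowercase letter in `chunk` without branching
/// per lane. Returns nil when a lane is not a 7-bit value.
func uppercasedASCII(_ chunk: UInt64) -> UInt64? {
  guard chunk & Lanes.flagBits == 0 else { return nil }
  // value &+ c sets the flag bit when value >= "a" (0x61).
  let atLeastA = chunk &+ Lanes.broadcast(0x80 - 0x61)
  // c &- value leaves the flag bit set when value <= "z" (0x7a).
  let atMostZ = Lanes.broadcast(0x80 + 0x7a) &- chunk
  let flags = atLeastA & atMostZ & Lanes.flagBits
  // Turn each set flag (0x80) into a 7-bit mask (0x7f) for its lane.
  let mask = flags &- (flags &>> 7)
  // Subtract 0x20 only in the lanes that hold a lowercase letter.
  return chunk &- (Lanes.broadcast(0x20) & mask)
}

let upper = uppercasedASCII(0x6162_6344_4546_317A) // ASCII "abcDEF1z"
```

Neither comparison can carry or borrow across lanes here, because every lane is at most 0x7f on entry.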

@stephentyrone
Member

stephentyrone commented Feb 27, 2020

Hi @PatrickPijnappel, this is exciting, and the benchmark numbers look promising!

I won't have time to do a detailed review until after Friday. @milseman, can you take a look at the stringy-aspects of this PR? @tbkka, can you take a look as well?

Contributor

@tbkka tbkka left a comment

I couldn't find any tests for this in the current stdlib test suite other than some very basic sanity checks. At a minimum, I would like to see:

  • Test for each standard integer type (Int8, UInt8, ... , Int64, UInt64)
  • Include positive and negative values
  • Tests for min and max values
  • Verify that min - 1 and max + 1 are correctly rejected
  • Verify all special-cased radices (10, 16, 2) and at least one non-special-cased radix (3, 36)
  • Test boundary values for this algorithm such as 9999 and 99999999
  • Exercise short-string and non-short-string cases
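
A few of the requested cases can be sketched against the public `init?(_:)` and `init?(_:radix:)` initializers. This is only an illustration of the kind of coverage listed above, not the test suite from the PR; `failedChecks` and `expect` are hypothetical names.

```swift
/// Hypothetical check runner: returns the labels of any failing cases.
func failedChecks() -> [String] {
  var failures: [String] = []
  func expect(_ condition: Bool, _ label: String) {
    if !condition { failures.append(label) }
  }
  // Min and max values, plus the values just outside the range.
  expect(Int8("-128") == Int8.min, "Int8 min")
  expect(Int8("-129") == nil, "Int8 min - 1")
  expect(Int8("127") == Int8.max, "Int8 max")
  expect(Int8("128") == nil, "Int8 max + 1")
  expect(UInt8("255") == UInt8.max, "UInt8 max")
  expect(UInt8("256") == nil, "UInt8 max + 1")
  expect(Int64("-9223372036854775808") == Int64.min, "Int64 min")
  expect(Int64("-9223372036854775809") == nil, "Int64 min - 1")
  expect(UInt64("18446744073709551615") == UInt64.max, "UInt64 max")
  expect(UInt64("18446744073709551616") == nil, "UInt64 max + 1")
  // Special-cased radices (10, 16, 2) and other radices (3, 36).
  expect(Int("ff", radix: 16) == 255, "hex")
  expect(Int("-101", radix: 2) == -5, "binary")
  expect(Int("2", radix: 2) == nil, "digit out of range for radix 2")
  expect(Int("102", radix: 3) == 11, "radix 3")
  expect(Int("zz", radix: 36) == 1295, "radix 36")
  // Boundary values for the chunked algorithm.
  expect(Int("9999") == 9999, "4 digits")
  expect(Int("99999999") == 99_999_999, "8 digits")
  // More than 15 UTF-8 bytes, so the string is not in the small form.
  expect(Int("0000000000000000000042") == 42, "long string")
  expect(Int8("") == nil, "empty string")
  return failures
}

print(failedChecks().isEmpty ? "all checks passed" : "some checks failed")
```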

/// `radix`.
/// - radix: The radix, or base, to use for converting `text` to an integer
/// value. `radix` must be in the range `2...36`. The default is 10.
@inlinable @inline(__always)
Contributor

How do the perf numbers compare if you make this not inlineable at all?

Generally, I'd like to see us be much less aggressive about inlining; we're getting more feedback from folks that code size is a major concern and is in fact causing serious performance problems on RAM-constrained systems. At a minimum, could we reduce this down to a top-level switch on radix and string implementation that selects among non-inlined functions? Then any given invocation should optimize down to just a single relevant function call.

Member

This one maybe should be inlined, but I would still like to see the data.

Contributor

We definitely want to be able to remove the generic type since the vast majority of call-sites will not be generic. The pattern we've often used is to have a small inlinable function that calls out to concrete implementation functions.
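
The pattern described here might look roughly like the following sketch. The names `parseDecimalMagnitude` and `parseUnsignedDecimal` are hypothetical and the parsing loop is deliberately naive. In a library, the wrapper would be `@inlinable public` and the implementation `@usableFromInline`; plain `@inline` attributes are used so the sketch compiles as a standalone file. It is only meaningful for result types of at most 64 bits.

```swift
/// Concrete, non-generic implementation that is kept out of line.
@inline(never)
func parseDecimalMagnitude(_ utf8: UnsafeBufferPointer<UInt8>) -> UInt64? {
  guard !utf8.isEmpty else { return nil }
  var result: UInt64 = 0
  for byte in utf8 {
    // Bytes below "0" wrap around to large values and fail the range check.
    let digit = byte &- UInt8(ascii: "0")
    guard digit <= 9 else { return nil }
    let (scaled, overflow1) = result.multipliedReportingOverflow(by: 10)
    let (sum, overflow2) = scaled.addingReportingOverflow(UInt64(digit))
    guard !overflow1, !overflow2 else { return nil }
    result = sum
  }
  return result
}

/// Thin generic wrapper: the only generic work left is the final narrowing,
/// so specializing it at a call site costs very little code.
@inline(__always)
func parseUnsignedDecimal<Result: FixedWidthInteger & UnsignedInteger>(
  _ text: String, as type: Result.Type = Result.self
) -> Result? {
  var text = text
  guard let wide = text.withUTF8(parseDecimalMagnitude) else { return nil }
  return Result(exactly: wide)
}

let byteValue = parseUnsignedDecimal("255", as: UInt8.self)
```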

}
}

@inlinable @inline(__always)
Contributor

See above. It would be nice to avoid inlining functions at this level.

Contributor

This is definitely way too much inlining; but without inlining the generics don't go away. The fast-paths we care about are mostly around contiguous UTF-8 Strings/Substrings being parsed into results that fit in 64-bits. The special path for small strings might be warranted. However, we should be able to strip all generics fairly early on for fast paths and invoke non-inlined code.

guard _fastPath(count > 0) else { return nil }
}
// Choose specialization based on radix (normally branch is optimized away).
let result_: UInt64?
Member

Given this, the implementation can't actually be generic over Result: FixedWidthInteger, because it doesn't work for a conforming type larger than UInt64, right?

Collaborator Author

It's fine for base 2/10/16, since the max number of characters in a small string is 15 and 16**15 < 2**64. The default case however could be as large as 36**15 so that one indeed shouldn't use that intermediate.
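
The arithmetic behind this reply can be checked with a short sketch. `power` is a hypothetical helper that returns nil once the product no longer fits in a UInt64.

```swift
/// Hypothetical helper: base raised to exponent, or nil on UInt64 overflow.
func power(_ base: UInt64, _ exponent: Int) -> UInt64? {
  var result: UInt64 = 1
  for _ in 0..<exponent {
    let (next, overflow) = result.multipliedReportingOverflow(by: base)
    if overflow { return nil }
    result = next
  }
  return result
}

// 15 is the maximum small-string length quoted above.
let hexFits = power(16, 15) != nil      // 16**15 == 2**60, so it fits
let radix36Fits = power(36, 15) != nil  // 36**15 overflows a UInt64
```

With 36 as the radix, the bound already fails at 13 digits, so a UInt64 intermediate is only safe for at most 12 digits in that case.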

Contributor

@milseman milseman left a comment

I'll try to give a better code review later, but for now I just wanted to say I really appreciate you looking at this. The existing implementation was a lingering pain point and in general I like your approach.

// platforms to maintain the same in-memory order.
var word1 = rawGuts.0
let word2 = rawGuts.1 & 0x00ff_ffff_ffff_ffff
let count = Int((rawGuts.1 &>> 56) & 0x0f)
Contributor

You can get a _SmallString struct from the guts, which has some of these operations on it. In general, we'd want to avoid duplicating these magic flag values.

@PatrickPijnappel
Collaborator Author

Yes, this inlining is definitely not shippable as-is. The best approach I can think of now would be to have a non-generic parseSmallDec/Hex/Bin -> UInt64 + non-generic parseFromContiguousUTF8 -> UInt64 + generic parseFromUTF8Iterator.

However it'd be really valuable if we could expand the specialized dec/hex/bin methods to any contiguous UTF8 (given <= UInt64). Especially since the small case atm doesn't cover substrings, which would be very common when parsing from text files.

The issue is efficiently loading the chunks. I've been thinking that in most cases we can guarantee it's OK to load aligned UInt64 chunks from utf8Ptr & ~7 through (utf8Ptr + count - 1) & ~7, right? That is, all words in which there's a character we need to read. This should be the case for all small strings and, if I understand correctly, for all native-backed strings (note: not 100% sure about the 32-bit case). Bridged strings don't give any guarantees AFAIK, so those would be out. @milseman Thoughts?
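
To make the address arithmetic concrete, the sketch below only computes which 8-byte-aligned word addresses such a scheme would touch; it performs no loads. `alignedWordAddresses` is a hypothetical helper, and addresses are modelled as plain UInt values. Whether reading those words is actually permitted is a separate question, discussed later in the review.

```swift
/// Hypothetical helper: the aligned 8-byte word addresses that cover the
/// byte range start ..< start + count.
func alignedWordAddresses(start: UInt, count: Int) -> [UInt] {
  precondition(count > 0, "an empty range covers no words")
  let first = start & ~7                          // round down to a multiple of 8
  let last = (start &+ UInt(count) &- 1) & ~7     // word holding the last byte
  return Array(stride(from: first, through: last, by: 8))
}

// Ten bytes starting at 0x1003 span the words at 0x1000 and 0x1008.
let words = alignedWordAddresses(start: 0x1003, count: 10)
```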

@benrimmington
Collaborator

> I couldn't find any tests for this in the current stdlib test suite other than some very basic sanity checks. At a minimum, I would like to see:
>
>   • Test for each standard integer type (Int8, UInt8, ... , Int64, UInt64)
>   • Include positive and negative values
>   • Tests for min and max values
>   • Verify that min - 1 and max + 1 are correctly rejected
>   • Verify all special-cased radices (10, 16, 2) and at least one non-special-cased radix (3, 36)
>   • Test boundary values for this algorithm such as 9999 and 99999999
>   • Exercise short-string and non-short-string cases

@tbkka I think most of these can be found in test/stdlib/NumericParsing.swift.gyb

@PatrickPijnappel
Collaborator Author

@benrimmington @tbkka Yeah, there are a few there already; however, I'm currently writing tests to cover more cases, including the ones mentioned that are missing and several more. This rewrite introduces some new code paths and boundary values that should be explicitly covered, and the original tests were fairly limited in general.

@PatrickPijnappel force-pushed the integer-parsing branch 2 times, most recently from c9a36fe to dab8fb7, March 10, 2020 06:02

})
}

@inline(__always)
Contributor

I'd like to see the benchmark results if this function and the following _parseUnsignedBaseXX functions were all @usableFromInline instead of @inline(__always). Could you try that experiment? I suspect that you would still see a healthy speedup over the old version with dramatically smaller code size in the clients.

Collaborator Author

Note that these functions are not ABI-public; they are never inlined in the client, only in the standard library itself. They are just a convenience to generate the specialized @usableFromInline methods above.

Contributor

Ah. Of course, that makes perfect sense. Might be worth checking whether these annotations are really needed; I'm not an expert in compiler optimizations, but I have the impression that it does a pretty good job of inlining within a module.

let wholeGuts = text._wholeGuts
// The specialized paths require that all of the contiguous bytes can
// be read using UInt64 loads from (address & ~7).
if wholeGuts._object.isPreferredRepresentation, Self.bitWidth <= 64 {
Collaborator Author

@milseman Would this be the correct check to guarantee _loadUnalignedChunk is valid? It's clearly valid for small strings, and AFAICT also for natively stored large strings (on both 32 & 64-bit), but I'm not so sure about literals.

Collaborator Author

Or, if we can perhaps even guarantee this for shared and/or bridged UTF-8?

Contributor

I don't think we can make this assumption. At the very least, a CFString may come in unaligned. CC @stephentyrone . We want better utilities for dealing with larger-strides alignment in general.

Collaborator Author

Note that the requirement isn't alignment exactly, but rather that the memory at the address rounded down to an 8-byte multiple (even on 32-bit systems) is valid to read, even if it's garbage.

Per string backing:

  • Small: should be fine
  • Large (native, mortal): should be fine on both 32 & 64 bit AFAICT
  • Large, immortal: not sure about this
  • Shared & bridged: Probably not?
  • Foreign: no

Member

This assumption does not hold. Swift hasn't yet defined a formal memory model that addresses this detail, but absent a guarantee, it is UB to read outside the bounds of an object, even if that's an aligned read, even if it would be allowed by the hardware. If we want to take advantage of hardware support for doing this, it has to be done in assembly at present (C and C++ make this explicitly UB, Swift may or may not in the future, so we can't depend on the semantics at this point).

Contributor

@PatrickPijnappel -- I think @stephentyrone is referring to this header file: stdlib/public/SwiftShims/RuntimeShims.h and the associated implementations which seem to be mostly in stdlib/public/stubs/Stubs.cpp. These files (and the other files in those directories) provide various small C/C++ support functions to the standard library, many of which serve to hide processor- or operating-system-specific details.

Collaborator Author

Do you know whether this should be an inline function, or whether that doesn't matter for the compiler? The stubs seem to be defined in a couple of different places, e.g. RuntimeStubs.h, RuntimeShims.h and Runtime.swift (in stdlib); which is the proper place?

Implementation-wise do we just need

uint64_t loadUInt64Unaligned(void *p) {
  return *(uint64_t *)p;
}

Member

No, that generates an aligned load; by casting the pointer to uint64_t, you are telling the compiler that it is suitably aligned for an 8-byte integer.

You want:

static inline SWIFT_ALWAYS_INLINE
__swift_uint64_t loadUInt64Unaligned(char *p) {
  __swift_uint64_t result;
  memcpy(&result, p, sizeof result);
  return result;
}

You could equivalently write it as follows:

typedef __swift_uint64_t __attribute__((aligned(1))) __swift_unaligned_uint64;

static inline SWIFT_ALWAYS_INLINE
__swift_uint64_t loadUInt64Unaligned(char *p) {
  return *(__swift_unaligned_uint64 *)p;
}

Personally, I prefer memcpy (because the second option is dependent on a clang extension), but they will generate identical code in the Swift toolchain.

It only needs to go in the header, you don't need an implementation in a .cpp file.

Collaborator Author

@stephentyrone The memcpy appears to hit some bug in SIL deserialization during compilation of the benchmarks: https://gist.github.com/PatrickPijnappel/f133f7296eb05b966a7bc9486ded35b7

(Implemented it using the alignment attribute for now)

Member

Ah, right, forgot to account for the fact that you can't use stdlib functions in the shims. If you use __builtin_memcpy instead it will work.
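
For comparison, the same copy-based idea can be sketched on the Swift side. This is an analogue of the memcpy shim above, not the shim itself: `loadLittleEndianUInt64` is a hypothetical name, the source is an ordinary byte array, and the load is bounds-checked so it never reads outside the buffer.

```swift
/// Hypothetical helper: reads 8 bytes starting at `offset`, at any
/// alignment, and interprets them as a little-endian UInt64.
/// Returns nil when fewer than 8 bytes are available.
func loadLittleEndianUInt64(from bytes: [UInt8], at offset: Int) -> UInt64? {
  guard offset >= 0, bytes.count - offset >= 8 else { return nil }
  return bytes.withUnsafeBytes { (src: UnsafeRawBufferPointer) -> UInt64 in
    var result: UInt64 = 0
    // Copy byte by byte into the integer's storage instead of
    // dereferencing a UInt64 pointer, so no alignment is assumed.
    withUnsafeMutableBytes(of: &result) { dst in
      dst.copyMemory(from: UnsafeRawBufferPointer(rebasing: src[offset..<offset + 8]))
    }
    return UInt64(littleEndian: result)
  }
}

let sample: [UInt8] = [0xAA, 1, 2, 3, 4, 5, 6, 7, 8]
let loaded = loadLittleEndianUInt64(from: sample, at: 1)
```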

@eeckstein
Member

@PatrickPijnappel It could just be a code alignment issue, which unfortunately sometimes gives "random" results.
To verify, you can compile the benchmarks locally and see if the generated assembly code really differs.

@PatrickPijnappel
Collaborator Author

PatrickPijnappel commented Apr 26, 2020

The benchmarks seem to have been accidentally deleted in another PR, re-adding them in #31326.

@PatrickPijnappel
Collaborator Author

@swift-ci Please benchmark

@PatrickPijnappel
Collaborator Author

@milseman I did a bit of a reorganization to simplify by forcing UTF-8. However, two issues:

  • The withUTF8, at least in the way it's currently implemented, is causing retain/release traffic that's killing performance. Any ideas?
  • The uncommon radix performance is more than an order of magnitude worse than master.

@swift-ci
Collaborator

Performance: -O

Regression OLD NEW DELTA RATIO
ParseInt.IntSmall.UncommonRadix 331 8649 +2513.0% 0.04x
Set.isStrictSubset.Seq.Int.Empty 107 122 +14.0% 0.88x
Set.subtracting.Empty.Box 8 9 +12.5% 0.89x (?)
DictionaryCompactMapValuesOfCastValue 4968 5508 +10.9% 0.90x (?)
Set.isDisjoint.Box.Empty 93 103 +10.8% 0.90x (?)
Set.isStrictSuperset.Seq.Empty.Int 153 169 +10.5% 0.91x (?)
Set.isDisjoint.Seq.Box.Empty 82 90 +9.8% 0.91x
Set.isSubset.Seq.Int.Empty 113 124 +9.7% 0.91x
LuhnAlgoEager 189 207 +9.5% 0.91x (?)
ObjectiveCBridgeStringHash 77 84 +9.1% 0.92x (?)
ArrayLiteral2 69 75 +8.7% 0.92x (?)
LuhnAlgoLazy 191 207 +8.4% 0.92x
 
Improvement OLD NEW DELTA RATIO
ParseInt.UInt64.Hex 295 145 -50.8% 2.03x
ParseInt.UIntSmall.Binary 502 247 -50.8% 2.03x
StrToInt 1050 860 -18.1% 1.22x
ParseInt.UInt64.Decimal 167 153 -8.4% 1.09x
Calculator 156 143 -8.3% 1.09x (?)
OpenClose 61 56 -8.2% 1.09x (?)
Set.subtracting.Empty.Int 28 26 -7.1% 1.08x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
IntegerParsing.o 58109 62707 +7.9% 0.93x

Performance: -Osize

Regression OLD NEW DELTA RATIO
ParseInt.IntSmall.UncommonRadix 338 8660 +2462.1% 0.04x
Dictionary4 161 198 +23.0% 0.81x (?)
FlattenListLoop 2674 3195 +19.5% 0.84x (?)
Set.isStrictSubset.Seq.Int.Empty 105 120 +14.3% 0.88x
DictionaryCompactMapValuesOfCastValue 5076 5778 +13.8% 0.88x (?)
Set.isSuperset.Seq.Empty.Int 44 50 +13.6% 0.88x (?)
Dictionary4OfObjects 231 262 +13.4% 0.88x
Set.isDisjoint.Seq.Int.Empty 45 51 +13.3% 0.88x
Set.isDisjoint.Seq.Box.Empty 77 87 +13.0% 0.89x (?)
Set.isSubset.Seq.Int.Empty 108 122 +13.0% 0.89x (?)
Set.subtracting.Empty.Box 8 9 +12.5% 0.89x
Set.isStrictSubset.Int.Empty 47 52 +10.6% 0.90x (?)
Set.isDisjoint.Box.Empty 87 96 +10.3% 0.91x (?)
ArraySetElement 262 288 +9.9% 0.91x
Set.isStrictSuperset.Seq.Empty.Int 155 169 +9.0% 0.92x (?)
Set.isSubset.Int.Empty 47 51 +8.5% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
ParseInt.UIntSmall.Binary 787 244 -69.0% 3.23x
ParseInt.UInt64.Hex 365 146 -60.0% 2.50x
ParseInt.UInt64.Decimal 195 153 -21.5% 1.27x
StrToInt 1050 940 -10.5% 1.12x
PrefixWhileAnySeqCntRange 201 184 -8.5% 1.09x (?)
SubstringFromLongStringGeneric 13 12 -7.7% 1.08x
Chars2 3550 3300 -7.0% 1.08x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
RangeReplaceableCollectionPlusDefault.o 4850 6061 +25.0% 0.80x
StrToInt.o 5311 6035 +13.6% 0.88x
IntegerParsing.o 54939 59170 +7.7% 0.93x
DictionaryCompactMapValues.o 12795 13587 +6.2% 0.94x
LuhnAlgoEager.o 16132 16612 +3.0% 0.97x
LuhnAlgoLazy.o 16132 16612 +3.0% 0.97x

Performance: -Onone

Regression OLD NEW DELTA RATIO
ParseInt.IntSmall.UncommonRadix 11702 16095 +37.5% 0.73x
ObjectiveCBridgeStringHash 77 84 +9.1% 0.92x
 
Improvement OLD NEW DELTA RATIO
ParseInt.UInt64.Decimal 5385 1511 -71.9% 3.56x
ParseInt.UInt64.Hex 4894 1504 -69.3% 3.25x
ParseInt.UIntSmall.Binary 19879 6631 -66.6% 3.00x
StrToInt 41590 20440 -50.9% 2.03x
ParseInt.IntSmall.Decimal 10855 7123 -34.4% 1.52x
DictionaryCompactMapValuesOfCastValue 38664 27270 -29.5% 1.42x
ArrayOfGenericPOD2 698 614 -12.0% 1.14x (?)
RangeReplaceableCollectionPlusDefault 4264 3756 -11.9% 1.14x
StringWalk 2920 2640 -9.6% 1.11x (?)
ArrayAppendLatin1Substring 34920 31752 -9.1% 1.10x (?)
ArrayAppendAsciiSubstring 34380 32040 -6.8% 1.07x (?)

Code size: -swiftlibs

Improvement OLD NEW DELTA RATIO
libswiftCore.dylib 4120576 4059136 -1.5% 1.02x

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: 6-Core Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

if S.self == String.self {
var text = text as! String
return text.withUTF8(f)
} else { // StringProtocol requires no additional types conform to it.
Contributor

@karwa karwa Jul 24, 2020

This may not be correct. I think we may technically be leaving the door open to having additional types conform to StringProtocol. In any case, it is worth considering if we want to expose this assumption in an inlinable function.
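
One way to avoid baking that assumption in might be a dispatch with a generic fallback, sketched below. `withUTF8Bytes` is a hypothetical helper, not code from the PR; the fallback copies the text into a String, which is slower but stays correct for any other conforming type.

```swift
/// Hypothetical helper: runs `body` over the UTF-8 bytes of `text`.
func withUTF8Bytes<S: StringProtocol, R>(
  _ text: S, _ body: (UnsafeBufferPointer<UInt8>) -> R
) -> R {
  if var string = text as? String {
    return string.withUTF8(body)
  } else if var substring = text as? Substring {
    return substring.withUTF8(body)
  } else {
    // Any other conforming type: fall back to a copy instead of a forced cast.
    var copy = String(text)
    return copy.withUTF8(body)
  }
}

let whole = withUTF8Bytes("123") { bytes in Array(bytes) }
let slice = withUTF8Bytes("x123y".dropFirst().dropLast()) { bytes in Array(bytes) }
```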

@shahmishal
Member

Please update the base branch to main by Oct 5th otherwise the pull request will be closed automatically.

  • How to change the base branch: (Link)
  • More detail about the branch update: (Link)

@PatrickPijnappel PatrickPijnappel changed the base branch from master to main October 2, 2020 12:10
@PatrickPijnappel
Collaborator Author

> Please update the base branch to main by Oct 5th otherwise the pull request will be closed automatically.
>
>   • How to change the base branch: (Link)
>   • More detail about the branch update: (Link)

Updated. I unfortunately haven't had the time to get back to finishing this up, but hope to have some time soon.
