[stdlib] Rewrite integer parsing #30094
base: main
Conversation
Hi @PatrickPijnappel, this is exciting, and the benchmark numbers look promising! I won't have time to do a detailed review until after Friday. @milseman, can you take a look at the stringy aspects of this PR? @tbkka, can you take a look as well?
I couldn't find any tests for this in the current stdlib test suite other than some very basic sanity checks. At a minimum, I would like to see:
- Test for each standard integer type (Int8, UInt8, ... , Int64, UInt64)
- Include positive and negative values
- Tests for min and max values
- Verify that min - 1 and max + 1 are correctly rejected
- Verify all special-cased radices (10, 16, 2) and at least one non-special-cased radix (e.g., 3 or 36)
- Test boundary values for this algorithm such as 9999 and 99999999
- Exercise short-string and non-short-string cases
/// `radix`.
/// - radix: The radix, or base, to use for converting `text` to an integer
/// value. `radix` must be in the range `2...36`. The default is 10.
@inlinable @inline(__always)
How do the perf numbers compare if you make this not inlineable at all?
Generally, I'd like to see us be much less aggressive about inlining; we're getting more feedback from folks that code size is a major concern and is in fact causing serious performance problems on RAM-constrained systems. At a minimum, could we reduce this down to a top-level switch on radix and string implementation that selects among non-inlined functions? Then any given invocation should optimize down to just a single relevant function call.
This one maybe should be inlined, but I would still like to see the data.
We definitely want to be able to remove the generic type, since the vast majority of call sites will not be generic. The pattern we've often used is to have a small inlinable function that calls out to concrete implementation functions.
}
}

@inlinable @inline(__always)
See above. It would be nice to avoid inlining functions at this level.
This is definitely way too much inlining; but without inlining the generics don't go away. The fast-paths we care about are mostly around contiguous UTF-8 Strings/Substrings being parsed into results that fit in 64-bits. The special path for small strings might be warranted. However, we should be able to strip all generics fairly early on for fast paths and invoke non-inlined code.
/// `radix`.
/// - radix: The radix, or base, to use for converting `text` to an integer
/// value. `radix` must be in the range `2...36`. The default is 10.
@inlinable @inline(__always)
This one maybe should be inlined, but I would still like to see the data.
guard _fastPath(count > 0) else { return nil }
}
// Choose specialization based on radix (normally branch is optimized away).
let result_: UInt64?
Given this, the implementation can't actually be generic over `Result: FixedWidthInteger`, because it doesn't work for a conforming type larger than `UInt64`, right?
It's fine for base 2/10/16, since the max number of characters in a small string is 15 and 16**15 < 2**64. The default case, however, could be as large as 36**15, so that one indeed shouldn't use that intermediate.
I'll try to give a better code review later, but for now I just wanted to say I really appreciate you looking at this. The existing implementation was a lingering pain point and in general I like your approach.
// platforms to maintain the same in-memory order.
var word1 = rawGuts.0
let word2 = rawGuts.1 & 0x00ff_ffff_ffff_ffff
let count = Int((rawGuts.1 &>> 56) & 0x0f)
You can get a `_SmallString` struct from the guts, which has some of these operations on it. In general, we'd want to avoid duplicating these magic flag values.
/// `radix`.
/// - radix: The radix, or base, to use for converting `text` to an integer
/// value. `radix` must be in the range `2...36`. The default is 10.
@inlinable @inline(__always)
We definitely want to be able to remove the generic type, since the vast majority of call sites will not be generic. The pattern we've often used is to have a small inlinable function that calls out to concrete implementation functions.
}
}

@inlinable @inline(__always)
This is definitely way too much inlining; but without inlining the generics don't go away. The fast-paths we care about are mostly around contiguous UTF-8 Strings/Substrings being parsed into results that fit in 64-bits. The special path for small strings might be warranted. However, we should be able to strip all generics fairly early on for fast paths and invoke non-inlined code.
Yes, definitely this inlining is not shippable as-is. The best approach I'm thinking of now would be to have a non-generic …

However, it'd be really valuable if we could expand the specialized dec/hex/bin methods to any contiguous UTF-8 (given <= UInt64), especially since the small case atm doesn't cover substrings, which would be very common when parsing from text files. The issue is efficiently loading the chunks. I've been thinking in most cases we can guarantee it's OK to load aligned UInt64 chunks from …
@tbkka I think most of these can be found in test/stdlib/NumericParsing.swift.gyb
@benrimmington @tbkka Yeah, there are a few there already; however, I'm currently writing tests to cover more cases, including the missing ones mentioned and several more. This rewrite introduces some new code paths and boundary values that should be explicitly covered, and the original tests were fairly limited generally.
})
}

@inline(__always)
I'd like to see the benchmark results if this function and the following `_parseUnsignedBaseXX` functions were all `@usableFromInline` instead of `@inline(__always)`. Could you try that experiment? I suspect that you would still see a healthy speedup over the old version with dramatically smaller code size in the clients.
Note that these functions are not ABI-public; they are never inlined in the client, only in the standard library itself. They are just a convenience to generate the specialized `@usableFromInline` methods above.
Ah. Of course, that makes perfect sense. Might be worth checking whether these annotations are really needed; I'm not an expert in compiler optimizations, but I have the impression that it does a pretty good job of inlining within a module.
let wholeGuts = text._wholeGuts
// The specialized paths require that all of the contiguous bytes can
// be read using UInt64 loads from (address & ~7).
if wholeGuts._object.isPreferredRepresentation, Self.bitWidth <= 64 {
@milseman Would this be the correct check to guarantee `_loadUnalignedChunk` is valid? It's clearly valid for small strings, and AFAICT also for natively stored large strings (both on 32- and 64-bit); not so sure about literals.
Or, if we can perhaps even guarantee this for shared and/or bridged UTF-8?
I don't think we can make this assumption. At the very least, a `CFString` may come in unaligned. CC @stephentyrone. We want better utilities for dealing with larger-stride alignment in general.
Note that the requirement isn't alignment exactly, but rather that the memory at the address rounded down to an 8-byte multiple (even on 32-bit systems) is valid to read, even if it's garbage.
Per string backing:
- Small: should be fine
- Large (native, mortal): should be fine on both 32 & 64 bit AFAICT
- Large, immortal: not sure about this
- Shared & bridged: Probably not?
- Foreign: no
This assumption does not hold. Swift hasn't yet defined a formal memory model that addresses this detail, but absent a guarantee, it is UB to read outside the bounds of an object, even if that's an aligned read, even if it would be allowed by the hardware. If we want to take advantage of hardware support for doing this, it has to be done in assembly at present (C and C++ make this explicitly UB, Swift may or may not in the future, so we can't depend on the semantics at this point).
@PatrickPijnappel -- I think @stephentyrone is referring to this header file: stdlib/public/SwiftShims/RuntimeShims.h, and the associated implementations, which seem to be mostly in stdlib/public/stubs/Stubs.cpp. These files (and the other files in those directories) provide various small C/C++ support functions to the standard library, many of which serve to hide processor- or operating-system-specific details.
Do you know whether this should be an inline function, or whether that doesn't matter for the compiler? The stubs seem to be defined in a couple of different places, e.g. RuntimeStubs.h, RuntimeShims.h and Runtime.swift (in stdlib); which is the proper place?

Implementation-wise, do we just need:

    uint64_t loadUInt64Unaligned(void *p) {
      return *(uint64_t *)p;
    }
No, that generates an aligned load; by casting the pointer to `uint64_t`, you are telling the compiler that it is suitably aligned for an 8-byte integer. You want:

    static inline SWIFT_ALWAYS_INLINE
    __swift_uint64_t loadUInt64Unaligned(char *p) {
      __swift_uint64_t result;
      memcpy(&result, p, sizeof result);
      return result;
    }

You could equivalently write it as follows:

    typedef __swift_uint64_t __attribute__((aligned(1))) __swift_unaligned_uint64;

    static inline SWIFT_ALWAYS_INLINE
    __swift_uint64_t loadUInt64Unaligned(char *p) {
      return *(__swift_unaligned_uint64 *)p;
    }

Personally, I prefer `memcpy` (because the second option is dependent on a clang extension), but they will generate identical code in the Swift toolchain. It only needs to go in the header; you don't need an implementation in a .cpp file.
@stephentyrone The `memcpy` appears to hit some bug in SIL deserialization during compilation of the benchmarks: https://gist.github.com/PatrickPijnappel/f133f7296eb05b966a7bc9486ded35b7 (Implemented it using the alignment attribute for now.)
Ah, right, forgot to account for the fact that you can't use stdlib functions in the shims. If you use `__builtin_memcpy` instead it will work.
@PatrickPijnappel It could just be a code alignment issue, which unfortunately sometimes gives "random" results.
The benchmarks seem to have been accidentally deleted in another PR; re-adding them in #31326.
@swift-ci Please benchmark
@milseman I did a bit of a reorganization to simplify by forcing UTF-8. However, two issues:
Performance and code size results for -O, -Osize, and -Onone (benchmark tables omitted). How to read the data: the tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%. Noise: sometimes the performance results (not code size!) contain false alarms.
if S.self == String.self {
  var text = text as! String
  return text.withUTF8(f)
} else { // StringProtocol requires no additional types conform to it.
This may not be correct. I think we may technically be leaving the door open to having additional types conform to StringProtocol. In any case, it is worth considering if we want to expose this assumption in an `@inlinable` function.
Updated. I unfortunately haven't had the time to get back to finishing this up, but hope to have some time soon.
This is a rewrite of integer parsing for performance and code size. It addresses several major issues with the existing implementation (SR-7556), but also uses a new approach, described below, to achieve significant further gains.
Resolves SR-7556.
[Updated] Latest Benchmark Results
By default, the benchmark suite only runs the tests marked in bold and skips the others.
Notes on Approach

The key idea here is using the bytes of a `UInt64` in a way similar to SIMD. When doing ASCII operations, the new UTF-8 backing of Strings makes them perfectly suited for this technique. Aside from integer parsing, for example, ASCII case manipulation also sees many-X speed-ups when using this approach (details below).

In most cases, each byte is treated as a 7-bit value, with the remaining high bit per byte being considered a flag for tests and to capture under/overflow.
Operation Rules
- Bitwise operations (`|`, `&`, `^`, `~`) are clearly always valid, and shifts only if you know the lanes won't mix.
- Comparisons can be done with `value &+ c` (`>`/`>=`) or `c &- value` (`<`/`<=`), by picking `c` such that for the `true` case the lane will overflow, i.e. be `>= 0x80`.
- Flags can be expanded into byte masks: `flags = value & 0x8080…; mask = flags &- (flags &>> 7)`. This mask can then be used to branchlessly conditionalize further operations, e.g. `value &+= (…) & mask`.
- Multiplying a single-byte value by `0x0101_0101_0101_0101` puts it in every lane.