Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Lemire’s Nearly Divisionless Random Integer Generation #25286

Merged
merged 6 commits into from Jun 11, 2019

Conversation

Projects
None yet
7 participants
@palimondo
Copy link
Collaborator

commented Jun 6, 2019

Implementation of Daniel Lemire’s “Fast Random Integer Generation in Interval”
See https://arxiv.org/pdf/1805.10941.pdf

[stdlib] Lemire’s nearly divisionless random int
Implementation of Daniel Lemire’s “Fast Random Integer Generation in Interval”
See https://arxiv.org/pdf/1805.10941.pdf
@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 6, 2019

I have attempted a Swift port of @lemire's algorithm:

uint64_t nearlydivisionless ( uint64_t s, uint64_t (* random64 ) ( void ) ) {
  uint64_t x = random64 () ;
  __uint128_t m = ( __uint128_t ) x * ( __uint128_t ) s;
  uint64_t l = ( uint64_t ) m;
  if (l < s) {
    uint64_t t = -s % s;
    while (l < t) {
      x = random64 () ;
      m = ( __uint128_t ) x * ( __uint128_t ) s;
      l = ( uint64_t ) m;
    }
  }
  return m >> 64;
}

I'm not sure if I have correctly interpreted the negation of unsigned integer, I had to rely on https://stackoverflow.com/questions/8026694/c-unary-minus-operator-behavior-with-unsigned-operands
@stephentyrone can you please verify (since I've seen you answer that SO question...)?

@inlinable
public mutating func next<T: FixedWidthInteger & UnsignedInteger>(
  upperBound: T
) -> T {
  _precondition(upperBound != 0, "upperBound cannot be zero.")
  var random: T = next()
  var m = random.multipliedFullWidth(by: upperBound)
  if (m.low < upperBound) {
    let t = (T.max - upperBound + 1) % upperBound
    while (m.low < t) {
      random = next()
      m = random.multipliedFullWidth(by: upperBound)
    }
  }
  return m.high
}

Is there better way to express this in Swift than (T.max - upperBound + 1)

In my local test with WyRand on the worst case(?) bound UInt64.max / 2 + 1 I'm seeing ~4x speed improvement in random number generation.

Open Questions

  • Is it even possible to change the implementation of next(upperBound:) this way given ABI stability constraints?
  • How do we evaluate performance of algorithms like this across architectures? (@swift-ci does not perform ARM benchmarking AFAIK, does Apple do this internally?)
@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 6, 2019

@swift-ci please benchmark

I'm expecting to see improvement in RandomShuffleLCG2 benchmark. I don't think RandomShuffleDef2 will pop up as the slow generator should hide the improvement. We'll see...

@Azoy

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2019

We are able to change the implementation of next(upperBound:) within ABI constraints. There's currently a notice on the documentation that states:

The algorithm used to create random values may change in a future version of Swift. If you’re passing a generator that results in the same sequence of integer values each time you run your program, that sequence may change when your program is compiled using a different version of Swift.

cc: @stephentyrone

@lemire

This comment has been minimized.

Copy link

commented Jun 6, 2019

Out of instinct, I would replace

 (T.max -upperBound + 1) % upperBound

by

((~upperBound) &+ 1) % upperBound

or

(0 &- upperBound) % upperBound

depending on how clever you feel.

It might be useful to point to the reference as otherwise this function might look like mysterious magic to some: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019

@Azoy

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2019

I expect to see some performance wins with the floating point random API as well.

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 6, 2019

Something's messed up. I need to investigate locally why has the benchmark crashed… but it's late over here, so I'll come back to this tomorrow.

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 6, 2019

@lemire I have, of course, linked to your paper in the PR's description right at the top 😉

I liked the ((~upperBound) &+ 1) % upperBound as it resembles your original C code, but looking at how's the ~ implemented it's just 2 extra and unnecessary operations, so I'll go with (0 &- upperBound) % upperBound for a better unoptimized performance (I understand they all boil down to same instructions in -O).

@lemire

This comment has been minimized.

Copy link

commented Jun 6, 2019

@palimondo Sorry, I missed the first comment.

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 6, 2019

Hmm.. reading through Melissa O'Neil's post, I see there is still a "Bitmask with Rejection (Unbiased) — Apple's Method" that performed even slightly better…
@stephentyrone is there a way to express mask >>= __builtin_clz(range|1); in Swift?

@Azoy

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2019

@palimondo mask >>= (range | 1).leadingZeroBitCount?

@lemire

This comment has been minimized.

Copy link

commented Jun 6, 2019

@palimondo

I thought that Melissa's conclusion was that "The fastest (unbiased) method is Lemire's with an extra tweak from me [Melissa]."

Let us ping her : @imneme

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 6, 2019

I'm sure you're right, I did not finish reading before posting here… 😇

All this tweaking and performance fine-tuning still hinges on the important question of:

How do we evaluate performance of algorithms like this across architectures? (@swift-ci does not perform ARM benchmarking AFAIK, does Apple do this internally?)

…and I'm guessing that when considering the ARM performance, Swift would decide based on specific performance of these algorithms on A family of processors, not the reference implementations.

@imneme

This comment has been minimized.

Copy link

commented Jun 6, 2019

@lemire, That is what I said. Not sure I can add much beyond that…

Benchmarks are always something of a rabbit hole with more data the deeper you go but typically additional data reveals complexity rather than simple clarity. I really like @lemire's method in general. Frankly, I'd like it if there was a specific mod instruction itself was smart enough to realize when it has no work to do and short-circuit itself rather than needing any tricks. As it is (at least on x86), mod is really just a side effect of div and so it needs to go to the full amount of work, so my optimization is helpful.

(Possibly there is a general compiler optimization for mod, perhaps triggered by profiling!)

@lemire

This comment has been minimized.

Copy link

commented Jun 6, 2019

New blog post where I include ARM benchmarks: Nearly Divisionless Random Integer Generation On Various Systems. I include benchmarks on an ARM box. (I had the numbers inverted in an early version of the blog post: short story is that avoiding division is always good in my tests, using floats is slower.)

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 7, 2019

@swift-ci please benchmark

@swift-ci

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2019

Performance: -O

Regression OLD NEW DELTA RATIO
Breadcrumbs.MutatedIdxToUTF16.Mixed 258 281 +8.9% 0.92x (?)
ObjectiveCBridgeFromNSString 935 1010 +8.0% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
RandomShuffleLCG2 2272 704 -69.0% 3.23x
RandomIntegersLCG 178 145 -18.5% 1.23x
RandomDoubleLCG 270 220 -18.5% 1.23x
RemoveWhereMoveInts 37 34 -8.1% 1.09x
Array2D 7520 6928 -7.9% 1.09x
RemoveWhereSwapInts 65 60 -7.7% 1.08x
MapReduce 397 368 -7.3% 1.08x
MapReduceAnyCollection 398 369 -7.3% 1.08x (?)

Code size: -O

Improvement OLD NEW DELTA RATIO
RandomValues.o 3006 2894 -3.7% 1.04x

Performance: -Osize

Regression OLD NEW DELTA RATIO
Breadcrumbs.MutatedIdxToUTF16.Mixed 259 280 +8.1% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
RandomIntegersLCG 757 152 -79.9% 4.98x
RandomDoubleLCG 886 222 -74.9% 3.99x
RandomShuffleLCG2 2272 768 -66.2% 2.96x
RemoveWhereMoveInts 37 34 -8.1% 1.09x (?)
Array2D 7520 6912 -8.1% 1.09x
RemoveWhereSwapInts 67 62 -7.5% 1.08x (?)
FlattenListLoop 5280 4910 -7.0% 1.08x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
RandomValues.o 2679 2711 +1.2% 0.99x

Performance: -Onone

Regression OLD NEW DELTA RATIO
RandomIntegersLCG 34992 47766 +36.5% 0.73x
RandomDoubleLCG 45246 56050 +23.9% 0.81x
RandomIntegersDef 44200 53700 +21.5% 0.82x
RandomDoubleDef 69300 82200 +18.6% 0.84x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 284 308 +8.5% 0.92x (?)
ArrayOfGenericPOD2 1322 1433 +8.4% 0.92x (?)
ArrayOfPOD 1065 1146 +7.6% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
StringMatch 60100 53200 -11.5% 1.13x (?)
LessSubstringSubstringGenericComparable 48 44 -8.3% 1.09x (?)
DataCreateEmptyArray 174500 160600 -8.0% 1.09x (?)

Code size: -swiftlibs

How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 7, 2019

@swift-ci please benchmark

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 7, 2019

The optimized results seem quite nice. It even looks like the -Osize likes the new code a lot better than original. I dunno what weight to assign to the -Onone regressions…

I'm re-running benchmarks with @imneme's modulo optimization, to see how it fares in Swift.

@swift-ci

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2019

Performance: -O

Regression OLD NEW DELTA RATIO
FlattenListFlatMap 7287 10078 +38.3% 0.72x (?)
FlattenListLoop 4332 4830 +11.5% 0.90x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 258 283 +9.7% 0.91x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed 261 281 +7.7% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
RandomShuffleLCG2 2272 736 -67.6% 3.09x
RandomIntegersLCG 178 145 -18.5% 1.23x
RandomDoubleLCG 270 220 -18.5% 1.23x
RemoveWhereMoveInts 37 34 -8.1% 1.09x (?)
Array2D 7520 6928 -7.9% 1.09x
RemoveWhereSwapInts 65 60 -7.7% 1.08x (?)
MapReduce 397 368 -7.3% 1.08x (?)
MapReduceAnyCollection 398 369 -7.3% 1.08x (?)

Code size: -O

Improvement OLD NEW DELTA RATIO
RandomValues.o 3006 2894 -3.7% 1.04x

Performance: -Osize

Regression OLD NEW DELTA RATIO
FlattenListLoop 4426 4909 +10.9% 0.90x (?)
StringBuilderWithLongSubstring 1400 1550 +10.7% 0.90x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 258 282 +9.3% 0.91x (?)
 
Improvement OLD NEW DELTA RATIO
RandomIntegersLCG 757 152 -79.9% 4.98x
RandomDoubleLCG 886 222 -74.9% 3.99x
RandomShuffleLCG2 2272 768 -66.2% 2.96x
ObjectiveCBridgeStubFromNSDateRef 4170 3770 -9.6% 1.11x (?)
RemoveWhereMoveInts 37 34 -8.1% 1.09x (?)
Array2D 7520 6912 -8.1% 1.09x
RemoveWhereSwapInts 67 62 -7.5% 1.08x (?)
MapReduce 433 404 -6.7% 1.07x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
RandomShuffle.o 3757 3805 +1.3% 0.99x
RandomValues.o 2679 2711 +1.2% 0.99x

Performance: -Onone

Regression OLD NEW DELTA RATIO
RandomIntegersLCG 34896 46916 +34.4% 0.74x
RandomDoubleLCG 45108 55980 +24.1% 0.81x
RandomIntegersDef 44000 54000 +22.7% 0.81x
RandomDoubleDef 69200 80900 +16.9% 0.86x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 280 307 +9.6% 0.91x (?)
ArrayOfGenericPOD2 1323 1434 +8.4% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
StringMatch 60000 53200 -11.3% 1.13x (?)
LessSubstringSubstringGenericComparable 48 44 -8.3% 1.09x

Code size: -swiftlibs

How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 7, 2019

All results were identical with the exception of regressed -O RandomShuffleLCG2, but the difference comes to effectively just 2 μs (that test has legacyFactor of 16, meaning the real measured times were 44 μs and 46 μs), I'd say that was probably within the margin of error.

Looking at the generated assembly for the modulo optimization on Compiler Explorer, it looks like the Swift version is not as tight as the C version for some reason… Thoughts?

@lemire

This comment has been minimized.

Copy link

commented Jun 7, 2019

@palimondo One difference for sure is that Swift is careful regarding the case where the bound is zero (which leads to a division by zero) whereas the C version assumes the programmer checked.

Show resolved Hide resolved stdlib/public/core/Random.swift Outdated
[stdlib] Remove modulo optimization
Swift, compared to C, seems unable to generate tightly fused instructions here for some reason (probably the division by zero check?)… removing.
@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 9, 2019

I realize that I have forgotten to use the masking subtraction assignment operator (&-=), and had just t -= upperBound, in the modulo optimization, but correcting that did not fix the generated code. I have tried bracketing the whole optimization in an explicit upperBound !=0 if statement, but nothing I have tried made a difference…

So I'm removing the modulo optimization until someone more clever than me figures out how to make it work in Swift.

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 9, 2019

@swift-ci benchmark

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 9, 2019

@swift-ci test

@swift-ci

This comment was marked as resolved.

Copy link
Contributor

commented Jun 9, 2019

Build failed
Swift Test OS X Platform
Git Sha - 65098b2

@swift-ci

This comment was marked as resolved.

Copy link
Contributor

commented Jun 9, 2019

Build failed
Swift Test Linux Platform
Git Sha - 65098b2

@swift-ci

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2019

Performance: -O

Regression OLD NEW DELTA RATIO
ObjectiveCBridgeStubFromNSDateRef 3940 4340 +10.2% 0.91x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 259 282 +8.9% 0.92x
ObjectiveCBridgeFromNSString 935 1010 +8.0% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
RandomShuffleLCG2 2272 704 -69.0% 3.23x
RandomIntegersLCG 178 145 -18.5% 1.23x
RandomDoubleLCG 270 220 -18.5% 1.23x
RemoveWhereMoveInts 37 34 -8.1% 1.09x (?)
Array2D 7520 6928 -7.9% 1.09x
RemoveWhereSwapInts 65 60 -7.7% 1.08x
MapReduce 397 368 -7.3% 1.08x
MapReduceAnyCollection 398 369 -7.3% 1.08x (?)
FlattenListFlatMap 10289 9566 -7.0% 1.08x (?)
FlattenListLoop 5184 4823 -7.0% 1.07x (?)

Code size: -O

Improvement OLD NEW DELTA RATIO
RandomValues.o 3006 2894 -3.7% 1.04x

Performance: -Osize

Regression OLD NEW DELTA RATIO
FlattenListLoop 4437 4911 +10.7% 0.90x (?)
ObjectiveCBridgeStubFromNSDateRef 3770 4170 +10.6% 0.90x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 259 281 +8.5% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
RandomIntegersLCG 757 152 -79.9% 4.98x
RandomDoubleLCG 910 222 -75.6% 4.10x
RandomShuffleLCG2 2272 768 -66.2% 2.96x
RemoveWhereMoveInts 37 34 -8.1% 1.09x (?)
Array2D 7520 6912 -8.1% 1.09x (?)
RemoveWhereSwapInts 67 62 -7.5% 1.08x (?)
MapReduce 433 404 -6.7% 1.07x

Code size: -Osize

Regression OLD NEW DELTA RATIO
RandomValues.o 2679 2711 +1.2% 0.99x

Performance: -Onone

Regression OLD NEW DELTA RATIO
RandomIntegersLCG 34944 47620 +36.3% 0.73x
RandomDoubleLCG 45154 55904 +23.8% 0.81x
RandomIntegersDef 44100 53800 +22.0% 0.82x
RandomDoubleDef 69400 82100 +18.3% 0.85x (?)
StringUTF16SubstringBuilder 19380 21860 +12.8% 0.89x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 283 308 +8.8% 0.92x (?)
ArrayOfGenericPOD2 1322 1433 +8.4% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
StringMatch 60100 53000 -11.8% 1.13x (?)
LessSubstringSubstringGenericComparable 48 44 -8.3% 1.09x (?)

Code size: -swiftlibs

How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
@stephentyrone

This comment has been minimized.

Copy link
Member

commented Jun 10, 2019

So I'm removing the modulo optimization until someone more clever than me figures out how to make it work in Swift.

Please create two swift bugs if you haven't already:

  • the missed optimization
  • re-enabling the modulo optimization once we can
@stephentyrone
Copy link
Member

left a comment

Two things:

a. This method works well for small (sub-word) integer types, but is bad for larger types--we want to switch to rejection sampling (what Melissa calls "Apple's method", though it's so blindingly obvious that Ian and I can't have invented it) for types bigger than a word (this takes us from O(M(n)) down to O(n) complexity). That improvement doesn't need to be part of this PR, but is worth a bug report.

b. multipliedFullWidth currently fatalErrors for Int64 on 32b systems (https://github.com/apple/swift/blob/master/stdlib/public/core/IntegerTypes.swift.gyb#L1521). That's a bug, but it either needs to be fixed before this can go in (I can do that sometime this week), or you need to work around it here for now, probably by conditionally using the old version on 32b systems.

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 10, 2019

@stephentyrone I'm confused by what you mean in point a). Isn't the word size 64 bits on a 64bit system and Swift does not yet support any bigger types?

I'm also unsure about the bug for missing optimization. If the reason for why O'Neill's optimization didn't work in Swift is the division by zero check as Lemire hypothesized, how would this be expressible without compromising the safety? Introduction of a new operator?

Anyway… I've filed SR-10909 Missed optimization for modulo and will add the architecture fork in next commit to keep the old implementation around for 32-bit platforms.

@stephentyrone

This comment has been minimized.

Copy link
Member

commented Jun 10, 2019

@stephentyrone I'm confused by what you mean in point a). Isn't the word size 64 bits on a 64bit system and Swift does not yet support any bigger types?

In the standard library today, yes. In third party code or in the standard library in the future, there may be arbitrary types conforming to FixedWidthInteger. Now that we have binary and module stability, future third-party types may even be back-deployed against this version of the standard library, so it is good not to let clear inefficiencies be introduced. I'm fine with this one in the short-term, because it's no worse than the current situation, but we should resolve it when we can.

I'm also unsure about the bug for missing optimization. If the reason for why O'Neill's optimization didn't work in Swift is the division by zero check as Lemire hypothesized, how would this be expressible without compromising the safety? Introduction of a new operator?

Given _precondition(upperBound != 0, "upperBound cannot be zero."), the compiler should be able to elide the check; it would have already been detected. I think this may be as simple as changing _precondition to something else--I don't think _precondition gives the needed hint to the compiler, and I recall there being a reason for it--but I'll need to dig in my notes.

Even without the precondition, m.low < upperBound together with the UnsignedInteger constraint should tell the compiler all it needs to know to prove that upperBound != 0. So that's a missed optimization that we should be tracking.

Anyway… I've filed SR-10909 Missed optimization for modulo and will add the architecture fork in next commit to keep the old implementation around for 32-bit platforms.

Thanks!

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 10, 2019

@stephentyrone I hope you don't mind that I've created a reminder to add the Support multipliedFullWidth on 32-bit systems (SR-10910) for you… 😇

@palimondo palimondo requested a review from stephentyrone Jun 10, 2019

[stdlib] TODO FIXME SR-10912
Remove the old OpenBSD generation method, once 32-bit systems support multipliedFullWidth on UInt64.

@palimondo palimondo requested a review from stephentyrone Jun 10, 2019

@stephentyrone

This comment has been minimized.

Copy link
Member

commented Jun 10, 2019

@swift-ci please test

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 10, 2019

Do we have a way to benchmark this on 32-bit systems once SR-10910 gets resolved?
(Could it be necessary to keep the OpenBSD method for performance reasons on 32-bit platforms?)

@stephentyrone

This comment has been minimized.

Copy link
Member

commented Jun 11, 2019

I'm going to review this once more tomorrow morning for numerical correctness specifically and then merge it assuming it looks good. @airspeedswift and I are discussing cherry-picking for 5.1 as well.

@stephentyrone

This comment has been minimized.

Copy link
Member

commented Jun 11, 2019

Do we have a way to benchmark this on 32-bit systems once SR-10910 gets resolved?
(Could it be necessary to keep the OpenBSD method for performance reasons on 32-bit platforms?)

No, because we'll want to have a rejection sampling path in the long run anyway for bigger-than-word types, which would also cover this case, and the existing modulus-based system is already pretty painful for 64b int on 32b.

@stephentyrone

This comment has been minimized.

Copy link
Member

commented Jun 11, 2019

Ended up having some time tonight. This is fine to merge.

@stephentyrone stephentyrone merged commit 2bc648c into apple:master Jun 11, 2019

4 checks passed

Swift Test Linux Platform No test results found.
Details
Swift Test Linux Platform (smoke test)
Details
Swift Test OS X Platform No test results found.
Details
Swift Test OS X Platform (smoke test)
Details
@stephentyrone

This comment has been minimized.

Copy link
Member

commented Jun 11, 2019

n.b. @lemire @imneme thinking about this, I believe that we can actually do somewhat better still by using straight-forward rejection sampling for large ranges and this algorithm for small ranges. This would be an extra branch, but it will be perfectly predicted in most uses (and could be hoisted or specialized for algorithms that produce values in bulk). I'll write up a demo sometime soon.

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 11, 2019

@stephentyrone Didn't @imneme try that here already? That version was not in the charts of the corresponding article for some reason...

@stephentyrone

This comment has been minimized.

Copy link
Member

commented Jun 11, 2019

@stephentyrone Didn't @imneme try that here already? That version was not in the charts of the corresponding article for some reason...

I would expect the perf tradeoff to be somewhat workload- (and hence benchmark-) dependent. It may be that it wasn't a substantial win for the use case that Melissa was testing, or there may be some issue we still need to understand, but for the top half of upperBound values, the actual selection looks the same with both approaches, and a comparison is certainly cheaper than a full-width multiply followed by a comparison. It likely doesn't matter for int64 due to the fact that the rejection branch is necessarily "unpredictable" with a good RNG, but for custom types where multiply can be much more costly than a branch mispredict (as this algorithm supports in Swift), it will matter.

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 11, 2019

I meant it more like… "Is this the algorithm you had in mind? Maybe I could transcribe also that one…" 😜

… but I have overlooked my clue:

I'll write up a demo sometime soon.

@palimondo

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 11, 2019

FWIW, I suspect I have misinterpreted the situation with modulo optimization. The Swift vs. C assembly comparison wasn't fair as I took a too narrow code snippet I suspected to be problematic and took out code that included important hints for the Swift compiler to perform effective optimization. Furthermore, the original code was C++!

I re-did the Godbolt Compiler Explorer comparison again:

I believe the generated instructions are largely equivalent.

Looking again at the original article by @imneme, the modulo optimization provides visible improvement on Large-Shuffle Benchmark, but on Small-Shuffle Benchmark the version with modulo optimization, Debiased Int Mult (t-opt, m-opt) in the chart, performs the same as Lemire's original version Debiased Int Mult (t-opt).

Large-Shuffle Small-Shuffle
Large-Shuffle Results Small-Shuffle Results

Our benchmarks in SBS most closely correspond to small shuffle, which would explain why I saw no improvement from modulo optimization on our benchmarks.

@stephentyrone I guess I should come up with a synthetic benchmark to mimic O'Neill's Large-Shuffle and we could re-introduce the modulo optimization together with your rejection sampling (when you get around to it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.