Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Lemire’s Nearly Divisionless Random Integer Generation #25286

Merged
merged 6 commits into from Jun 11, 2019

Conversation

@palimondo
Copy link
Collaborator

@palimondo palimondo commented Jun 6, 2019

Implementation of Daniel Lemire’s “Fast Random Integer Generation in Interval”
See https://arxiv.org/pdf/1805.10941.pdf

Implementation of Daniel Lemire’s “Fast Random Integer Generation in Interval”
See https://arxiv.org/pdf/1805.10941.pdf
@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 6, 2019

I have attempted a Swift port of @lemire's algorithm:

uint64_t nearlydivisionless ( uint64_t s, uint64_t (* random64 ) ( void ) ) {
  uint64_t x = random64 () ;
  __uint128_t m = ( __uint128_t ) x * ( __uint128_t ) s;
  uint64_t l = ( uint64_t ) m;
  if (l < s) {
    uint64_t t = -s % s;
    while (l < t) {
      x = random64 () ;
      m = ( __uint128_t ) x * ( __uint128_t ) s;
      l = ( uint64_t ) m;
    }
  }
  return m >> 64;
}

I'm not sure if I have correctly interpreted the negation of unsigned integer, I had to rely on https://stackoverflow.com/questions/8026694/c-unary-minus-operator-behavior-with-unsigned-operands
@stephentyrone can you please verify (since I've seen you answer that SO question...)?

@inlinable
public mutating func next<T: FixedWidthInteger & UnsignedInteger>(
  upperBound: T
) -> T {
  _precondition(upperBound != 0, "upperBound cannot be zero.")
  var random: T = next()
  var m = random.multipliedFullWidth(by: upperBound)
  if (m.low < upperBound) {
    let t = (T.max - upperBound + 1) % upperBound
    while (m.low < t) {
      random = next()
      m = random.multipliedFullWidth(by: upperBound)
    }
  }
  return m.high
}

Is there better way to express this in Swift than (T.max - upperBound + 1)

In my local test with WyRand on the worst case(?) bound UInt64.max / 2 + 1 I'm seeing ~4x speed improvement in random number generation.

Open Questions

  • Is it even possible to change the implementation of next(upperBound:) this way given ABI stability constraints?
  • How do we evaluate performance of algorithms like this across architectures? (@swift-ci does not perform ARM benchmarking AFAIK, does Apple do this internally?)

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 6, 2019

@swift-ci please benchmark

I'm expecting to see improvement in RandomShuffleLCG2 benchmark. I don't think RandomShuffleDef2 will pop up as the slow generator should hide the improvement. We'll see...

@Azoy
Copy link
Member

@Azoy Azoy commented Jun 6, 2019

We are able to change the implementation of next(upperBound:) within ABI constraints. There's currently a notice on the documentation that states:

The algorithm used to create random values may change in a future version of Swift. If you’re passing a generator that results in the same sequence of integer values each time you run your program, that sequence may change when your program is compiled using a different version of Swift.

cc: @stephentyrone

@lemire
Copy link

@lemire lemire commented Jun 6, 2019

Out of instinct, I would replace

 (T.max -upperBound + 1) % upperBound

by

((~upperBound) &+ 1) % upperBound

or

(0 &- upperBound) % upperBound

depending on how clever you feel.

It might be useful to point to the reference as otherwise this function might look like mysterious magic to some: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019

@Azoy
Copy link
Member

@Azoy Azoy commented Jun 6, 2019

I expect to see some performance wins with the floating point random API as well.

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 6, 2019

Something's messed up. I need to investigate locally why has the benchmark crashed… but it's late over here, so I'll come back to this tomorrow.

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 6, 2019

@lemire I have, of course, linked to your paper in the PR's description right at the top 😉

I liked the ((~upperBound) &+ 1) % upperBound as it resembles your original C code, but looking at how's the ~ implemented it's just 2 extra and unnecessary operations, so I'll go with (0 &- upperBound) % upperBound for a better unoptimized performance (I understand they all boil down to same instructions in -O).

@lemire
Copy link

@lemire lemire commented Jun 6, 2019

@palimondo Sorry, I missed the first comment.

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 6, 2019

Hmm.. reading through Melissa O'Neil's post, I see there is still a "Bitmask with Rejection (Unbiased) — Apple's Method" that performed even slightly better…
@stephentyrone is there a way to express mask >>= __builtin_clz(range|1); in Swift?

@Azoy
Copy link
Member

@Azoy Azoy commented Jun 6, 2019

@palimondo mask >>= (range | 1).leadingZeroBitCount?

@lemire
Copy link

@lemire lemire commented Jun 6, 2019

@palimondo

I thought that Melissa's conclusion was that "The fastest (unbiased) method is Lemire's with an extra tweak from me [Melissa]."

Let us ping her : @imneme

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 6, 2019

I'm sure you're right, I did not finish reading before posting here… 😇

All this tweaking and performance fine-tuning still hinges on the important question of:

How do we evaluate performance of algorithms like this across architectures? (@swift-ci does not perform ARM benchmarking AFAIK, does Apple do this internally?)

…and I'm guessing that when considering the ARM performance, Swift would decide based on specific performance of these algorithms on A family of processors, not the reference implementations.

@imneme
Copy link

@imneme imneme commented Jun 6, 2019

@lemire, That is what I said. Not sure I can add much beyond that…

Benchmarks are always something of a rabbit hole with more data the deeper you go but typically additional data reveals complexity rather than simple clarity. I really like @lemire's method in general. Frankly, I'd like it if there was a specific mod instruction itself was smart enough to realize when it has no work to do and short-circuit itself rather than needing any tricks. As it is (at least on x86), mod is really just a side effect of div and so it needs to go to the full amount of work, so my optimization is helpful.

(Possibly there is a general compiler optimization for mod, perhaps triggered by profiling!)

@lemire
Copy link

@lemire lemire commented Jun 6, 2019

New blog post where I include ARM benchmarks: Nearly Divisionless Random Integer Generation On Various Systems. I include benchmarks on an ARM box. (I had the numbers inverted in an early version of the blog post: short story is that avoiding division is always good in my tests, using floats is slower.)

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 7, 2019

@swift-ci please benchmark

@swift-ci
Copy link
Contributor

@swift-ci swift-ci commented Jun 7, 2019

Performance: -O

Regression OLD NEW DELTA RATIO
Breadcrumbs.MutatedIdxToUTF16.Mixed 258 281 +8.9% 0.92x (?)
ObjectiveCBridgeFromNSString 935 1010 +8.0% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
RandomShuffleLCG2 2272 704 -69.0% 3.23x
RandomIntegersLCG 178 145 -18.5% 1.23x
RandomDoubleLCG 270 220 -18.5% 1.23x
RemoveWhereMoveInts 37 34 -8.1% 1.09x
Array2D 7520 6928 -7.9% 1.09x
RemoveWhereSwapInts 65 60 -7.7% 1.08x
MapReduce 397 368 -7.3% 1.08x
MapReduceAnyCollection 398 369 -7.3% 1.08x (?)

Code size: -O

Improvement OLD NEW DELTA RATIO
RandomValues.o 3006 2894 -3.7% 1.04x

Performance: -Osize

Regression OLD NEW DELTA RATIO
Breadcrumbs.MutatedIdxToUTF16.Mixed 259 280 +8.1% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
RandomIntegersLCG 757 152 -79.9% 4.98x
RandomDoubleLCG 886 222 -74.9% 3.99x
RandomShuffleLCG2 2272 768 -66.2% 2.96x
RemoveWhereMoveInts 37 34 -8.1% 1.09x (?)
Array2D 7520 6912 -8.1% 1.09x
RemoveWhereSwapInts 67 62 -7.5% 1.08x (?)
FlattenListLoop 5280 4910 -7.0% 1.08x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
RandomValues.o 2679 2711 +1.2% 0.99x

Performance: -Onone

Regression OLD NEW DELTA RATIO
RandomIntegersLCG 34992 47766 +36.5% 0.73x
RandomDoubleLCG 45246 56050 +23.9% 0.81x
RandomIntegersDef 44200 53700 +21.5% 0.82x
RandomDoubleDef 69300 82200 +18.6% 0.84x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 284 308 +8.5% 0.92x (?)
ArrayOfGenericPOD2 1322 1433 +8.4% 0.92x (?)
ArrayOfPOD 1065 1146 +7.6% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
StringMatch 60100 53200 -11.5% 1.13x (?)
LessSubstringSubstringGenericComparable 48 44 -8.3% 1.09x (?)
DataCreateEmptyArray 174500 160600 -8.0% 1.09x (?)

Code size: -swiftlibs

How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 7, 2019

@swift-ci please benchmark

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 7, 2019

The optimized results seem quite nice. It even looks like the -Osize likes the new code a lot better than original. I dunno what weight to assign to the -Onone regressions…

I'm re-running benchmarks with @imneme's modulo optimization, to see how it fares in Swift.

@swift-ci
Copy link
Contributor

@swift-ci swift-ci commented Jun 7, 2019

Performance: -O

Regression OLD NEW DELTA RATIO
FlattenListFlatMap 7287 10078 +38.3% 0.72x (?)
FlattenListLoop 4332 4830 +11.5% 0.90x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 258 283 +9.7% 0.91x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed 261 281 +7.7% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
RandomShuffleLCG2 2272 736 -67.6% 3.09x
RandomIntegersLCG 178 145 -18.5% 1.23x
RandomDoubleLCG 270 220 -18.5% 1.23x
RemoveWhereMoveInts 37 34 -8.1% 1.09x (?)
Array2D 7520 6928 -7.9% 1.09x
RemoveWhereSwapInts 65 60 -7.7% 1.08x (?)
MapReduce 397 368 -7.3% 1.08x (?)
MapReduceAnyCollection 398 369 -7.3% 1.08x (?)

Code size: -O

Improvement OLD NEW DELTA RATIO
RandomValues.o 3006 2894 -3.7% 1.04x

Performance: -Osize

Regression OLD NEW DELTA RATIO
FlattenListLoop 4426 4909 +10.9% 0.90x (?)
StringBuilderWithLongSubstring 1400 1550 +10.7% 0.90x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 258 282 +9.3% 0.91x (?)
 
Improvement OLD NEW DELTA RATIO
RandomIntegersLCG 757 152 -79.9% 4.98x
RandomDoubleLCG 886 222 -74.9% 3.99x
RandomShuffleLCG2 2272 768 -66.2% 2.96x
ObjectiveCBridgeStubFromNSDateRef 4170 3770 -9.6% 1.11x (?)
RemoveWhereMoveInts 37 34 -8.1% 1.09x (?)
Array2D 7520 6912 -8.1% 1.09x
RemoveWhereSwapInts 67 62 -7.5% 1.08x (?)
MapReduce 433 404 -6.7% 1.07x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
RandomShuffle.o 3757 3805 +1.3% 0.99x
RandomValues.o 2679 2711 +1.2% 0.99x

Performance: -Onone

Regression OLD NEW DELTA RATIO
RandomIntegersLCG 34896 46916 +34.4% 0.74x
RandomDoubleLCG 45108 55980 +24.1% 0.81x
RandomIntegersDef 44000 54000 +22.7% 0.81x
RandomDoubleDef 69200 80900 +16.9% 0.86x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 280 307 +9.6% 0.91x (?)
ArrayOfGenericPOD2 1323 1434 +8.4% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
StringMatch 60000 53200 -11.3% 1.13x (?)
LessSubstringSubstringGenericComparable 48 44 -8.3% 1.09x

Code size: -swiftlibs

How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 7, 2019

All results were identical with the exception of regressed -O RandomShuffleLCG2, but the difference comes to effectively just 2 μs (that test has legacyFactor of 16, meaning the real measured times were 44 μs and 46 μs), I'd say that was probably within the margin of error.

Looking at the generated assembly for the modulo optimization on Compiler Explorer, it looks like the Swift version is not as tight as the C version for some reason… Thoughts?

@lemire
Copy link

@lemire lemire commented Jun 7, 2019

@palimondo One difference for sure is that Swift is careful regarding the case where the bound is zero (which leads to a division by zero) whereas the C version assumes the programmer checked.

stdlib/public/core/Random.swift Outdated Show resolved Hide resolved
Swift, compared to C, seems unable to generate tightly fused instructions here for some reason (probably the division by zero check?)… removing.
@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 9, 2019

I realize that I have forgotten to use the masking subtraction assignment operator (&-=), and had just t -= upperBound, in the modulo optimization, but correcting that did not fix the generated code. I have tried bracketing the whole optimization in an explicit upperBound !=0 if statement, but nothing I have tried made a difference…

So I'm removing the modulo optimization until someone more clever than me figures out how to make it work in Swift.

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 9, 2019

@swift-ci benchmark

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 9, 2019

@swift-ci test

@swift-ci
Copy link
Contributor

@swift-ci swift-ci commented Jun 9, 2019

Performance: -O

Regression OLD NEW DELTA RATIO
ObjectiveCBridgeStubFromNSDateRef 3940 4340 +10.2% 0.91x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 259 282 +8.9% 0.92x
ObjectiveCBridgeFromNSString 935 1010 +8.0% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
RandomShuffleLCG2 2272 704 -69.0% 3.23x
RandomIntegersLCG 178 145 -18.5% 1.23x
RandomDoubleLCG 270 220 -18.5% 1.23x
RemoveWhereMoveInts 37 34 -8.1% 1.09x (?)
Array2D 7520 6928 -7.9% 1.09x
RemoveWhereSwapInts 65 60 -7.7% 1.08x
MapReduce 397 368 -7.3% 1.08x
MapReduceAnyCollection 398 369 -7.3% 1.08x (?)
FlattenListFlatMap 10289 9566 -7.0% 1.08x (?)
FlattenListLoop 5184 4823 -7.0% 1.07x (?)

Code size: -O

Improvement OLD NEW DELTA RATIO
RandomValues.o 3006 2894 -3.7% 1.04x

Performance: -Osize

Regression OLD NEW DELTA RATIO
FlattenListLoop 4437 4911 +10.7% 0.90x (?)
ObjectiveCBridgeStubFromNSDateRef 3770 4170 +10.6% 0.90x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 259 281 +8.5% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
RandomIntegersLCG 757 152 -79.9% 4.98x
RandomDoubleLCG 910 222 -75.6% 4.10x
RandomShuffleLCG2 2272 768 -66.2% 2.96x
RemoveWhereMoveInts 37 34 -8.1% 1.09x (?)
Array2D 7520 6912 -8.1% 1.09x (?)
RemoveWhereSwapInts 67 62 -7.5% 1.08x (?)
MapReduce 433 404 -6.7% 1.07x

Code size: -Osize

Regression OLD NEW DELTA RATIO
RandomValues.o 2679 2711 +1.2% 0.99x

Performance: -Onone

Regression OLD NEW DELTA RATIO
RandomIntegersLCG 34944 47620 +36.3% 0.73x
RandomDoubleLCG 45154 55904 +23.8% 0.81x
RandomIntegersDef 44100 53800 +22.0% 0.82x
RandomDoubleDef 69400 82100 +18.3% 0.85x (?)
StringUTF16SubstringBuilder 19380 21860 +12.8% 0.89x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 283 308 +8.8% 0.92x (?)
ArrayOfGenericPOD2 1322 1433 +8.4% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
StringMatch 60100 53000 -11.8% 1.13x (?)
LessSubstringSubstringGenericComparable 48 44 -8.3% 1.09x (?)

Code size: -swiftlibs

How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@stephentyrone
Copy link
Member

@stephentyrone stephentyrone commented Jun 10, 2019

So I'm removing the modulo optimization until someone more clever than me figures out how to make it work in Swift.

Please create two swift bugs if you haven't already:

  • the missed optimization
  • re-enabling the modulo optimization once we can

Copy link
Member

@stephentyrone stephentyrone left a comment

Two things:

a. This method works well for small (sub-word) integer types, but is bad for larger types--we want to switch to rejection sampling (what Melissa calls "Apple's method", though it's so blindingly obvious that Ian and I can't have invented it) for types bigger than a word (this takes us from O(M(n)) down to O(n) complexity). That improvement doesn't need to be part of this PR, but is worth a bug report.

b. multipliedFullWidth currently fatalErrors for Int64 on 32b systems (https://github.com/apple/swift/blob/master/stdlib/public/core/IntegerTypes.swift.gyb#L1521). That's a bug, but it either needs to be fixed before this can go in (I can do that sometime this week), or you need to work around it here for now, probably by conditionally using the old version on 32b systems.

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 10, 2019

@stephentyrone I'm confused by what you mean in point a). Isn't the word size 64 bits on a 64bit system and Swift does not yet support any bigger types?

I'm also unsure about the bug for missing optimization. If the reason for why O'Neill's optimization didn't work in Swift is the division by zero check as Lemire hypothesized, how would this be expressible without compromising the safety? Introduction of a new operator?

Anyway… I've filed SR-10909 Missed optimization for modulo and will add the architecture fork in next commit to keep the old implementation around for 32-bit platforms.

@stephentyrone
Copy link
Member

@stephentyrone stephentyrone commented Jun 10, 2019

@stephentyrone I'm confused by what you mean in point a). Isn't the word size 64 bits on a 64bit system and Swift does not yet support any bigger types?

In the standard library today, yes. In third party code or in the standard library in the future, there may be arbitrary types conforming to FixedWidthInteger. Now that we have binary and module stability, future third-party types may even be back-deployed against this version of the standard library, so it is good not to let clear inefficiencies be introduced. I'm fine with this one in the short-term, because it's no worse than the current situation, but we should resolve it when we can.

I'm also unsure about the bug for missing optimization. If the reason for why O'Neill's optimization didn't work in Swift is the division by zero check as Lemire hypothesized, how would this be expressible without compromising the safety? Introduction of a new operator?

Given _precondition(upperBound != 0, "upperBound cannot be zero."), the compiler should be able to elide the check; it would have already been detected. I think this may be as simple as changing _precondition to something else--I don't think _precondition gives the needed hint to the compiler, and I recall there being a reason for it--but I'll need to dig in my notes.

Even without the precondition, m.low < upperBound together with the UnsignedInteger constraint should tell the compiler all it needs to know to prove that upperBound != 0. So that's a missed optimization that we should be tracking.

Anyway… I've filed SR-10909 Missed optimization for modulo and will add the architecture fork in next commit to keep the old implementation around for 32-bit platforms.

Thanks!

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 10, 2019

@stephentyrone I hope you don't mind that I've created a reminder to add the Support multipliedFullWidth on 32-bit systems (SR-10910) for you… 😇

@palimondo palimondo requested a review from stephentyrone Jun 10, 2019
Remove the old OpenBSD generation method, once 32-bit systems support multipliedFullWidth on UInt64.
@palimondo palimondo requested a review from stephentyrone Jun 10, 2019
@stephentyrone
Copy link
Member

@stephentyrone stephentyrone commented Jun 10, 2019

@swift-ci please test

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 10, 2019

Do we have a way to benchmark this on 32-bit systems once SR-10910 gets resolved?
(Could it be necessary to keep the OpenBSD method for performance reasons on 32-bit platforms?)

@stephentyrone
Copy link
Member

@stephentyrone stephentyrone commented Jun 11, 2019

I'm going to review this once more tomorrow morning for numerical correctness specifically and then merge it assuming it looks good. @airspeedswift and I are discussing cherry-picking for 5.1 as well.

@stephentyrone
Copy link
Member

@stephentyrone stephentyrone commented Jun 11, 2019

Do we have a way to benchmark this on 32-bit systems once SR-10910 gets resolved?
(Could it be necessary to keep the OpenBSD method for performance reasons on 32-bit platforms?)

No, because we'll want to have a rejection sampling path in the long run anyway for bigger-than-word types, which would also cover this case, and the existing modulus-based system is already pretty painful for 64b int on 32b.

@stephentyrone
Copy link
Member

@stephentyrone stephentyrone commented Jun 11, 2019

Ended up having some time tonight. This is fine to merge.

@stephentyrone stephentyrone merged commit 2bc648c into apple:master Jun 11, 2019
4 checks passed
@stephentyrone
Copy link
Member

@stephentyrone stephentyrone commented Jun 11, 2019

n.b. @lemire @imneme thinking about this, I believe that we can actually do somewhat better still by using straight-forward rejection sampling for large ranges and this algorithm for small ranges. This would be an extra branch, but it will be perfectly predicted in most uses (and could be hoisted or specialized for algorithms that produce values in bulk). I'll write up a demo sometime soon.

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 11, 2019

@stephentyrone Didn't @imneme try that here already? That version was not in the charts of the corresponding article for some reason...

@stephentyrone
Copy link
Member

@stephentyrone stephentyrone commented Jun 11, 2019

@stephentyrone Didn't @imneme try that here already? That version was not in the charts of the corresponding article for some reason...

I would expect the perf tradeoff to be somewhat workload- (and hence benchmark-) dependent. It may be that it wasn't a substantial win for the use case that Melissa was testing, or there may be some issue we still need to understand, but for the top half of upperBound values, the actual selection looks the same with both approaches, and a comparison is certainly cheaper than a full-width multiply followed by a comparison. It likely doesn't matter for int64 due to the fact that the rejection branch is necessarily "unpredictable" with a good RNG, but for custom types where multiply can be much more costly than a branch mispredict (as this algorithm supports in Swift), it will matter.

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 11, 2019

I meant it more like… "Is this the algorithm you had in mind? Maybe I could transcribe also that one…" 😜

… but I have overlooked my clue:

I'll write up a demo sometime soon.

@palimondo
Copy link
Collaborator Author

@palimondo palimondo commented Jun 11, 2019

FWIW, I suspect I have misinterpreted the situation with modulo optimization. The Swift vs. C assembly comparison wasn't fair as I took a too narrow code snippet I suspected to be problematic and took out code that included important hints for the Swift compiler to perform effective optimization. Furthermore, the original code was C++!

I re-did the Godbolt Compiler Explorer comparison again:

I believe the generated instructions are largely equivalent.

Looking again at the original article by @imneme, the modulo optimization provides visible improvement on Large-Shuffle Benchmark, but on Small-Shuffle Benchmark the version with modulo optimization, Debiased Int Mult (t-opt, m-opt) in the chart, performs the same as Lemire's original version Debiased Int Mult (t-opt).

Large-Shuffle Small-Shuffle
Large-Shuffle Results Small-Shuffle Results

Our benchmarks in SBS most closely correspond to small shuffle, which would explain why I saw no improvement from modulo optimization on our benchmarks.

@stephentyrone I guess I should come up with a synthetic benchmark to mimic O'Neill's Large-Shuffle and we could re-introduce the modulo optimization together with your rejection sampling (when you get around to it).

@tempelmann
Copy link

@tempelmann tempelmann commented May 17, 2020

Here's a little critique, and a suggestion at the end. I'm not actively involved in Swift development, not even using it, so this is the best place I could find to add comments.

I am worried that a mistake has been made when designing this class. Sadly it's too late now to correct it.

The mistake is that the doc missed out on explaining how to create a custom next() function, especially if one wants to create a PRNG with a reproducible output (using a seed value). The problem is that the original impl of upperBounds: was apparently using the lower bits of the UInt64, whereas the new one uses the higher bits. That leads to the ugly effect that when someone implemented a PRNG that only delivered a smaller range, e.g. in the lower 32 bits, this used to work before this change when he only pulled small ranges from it. But with this change, the PRNG now fails, always returning zero for randomized subranges.

The issue is shown here in a code example with sample output: https://gist.github.com/nicklockwood/cf1fde3eb24602883c17f94e63e490bc

This potential problem should have been known about - it's clearly documented in any theory of PRNGs. For instance, this WP article conveniently lists examples and shows that many of them do not return good randomized values in their lower bits.

As one comment above says, the API docs only warned that PRNGs might return different values after an implementation change, but they did not consider to break it entirely - yet, this change did. Agreed?

To prevent this, there could have been an extra function that returns the range of significantly randomized values. This could be done by returning a divisor and max value, which in the gist example above would have been: divisor=1 and max= 233280, whereas for the system RNG it's possibly 2^32 and 2^32 (assuming only the upper 32 bit contain good random values). With that, any subrange function, like the upperBounds: one, would use this information to first reduce the value by this range before applying the upperBounds reduction.

In fact, adding such a function might still be valuable as it would (a) make future implementations of custom PRNGs via a custom next() function easier, and (b) would also help other code that uses the original next() result and create valid values from it.

@xwu
Copy link
Collaborator

@xwu xwu commented May 17, 2020

That leads to the ugly effect that when someone implemented a PRNG that only delivered a smaller range, e.g. in the lower 32 bits

To prevent this, there could have been an extra function that returns the range of significantly randomized values.

The method next() -> UInt64 is expected to produce 64 bits of usable randomness. If your source of randomness only gives you 32 usable bits at a time, then you need to get data from that source more than once before returning from next(). We can certainly document that requirement more clearly.

@tempelmann
Copy link

@tempelmann tempelmann commented May 17, 2020

Yes, that would be better than the current affair. Still, people might struggle with coming up with 64 bit random gens, and the need for reproducible RNGs is not uncommon. So please provide some assistance with that if you can.

Also, has this been proven (and postulated) that every bit of the next()'s result properly appearing to be random and not repetitive? So, even the lowest bit has a wide randomness and is not providing a repeating pattern in a short cycle? I'm thinking of random gens that keep giving something like 0, 1, 0, 1 etc. (cycle length 2) or 0, 0, 1, 1 etc. (cycle length 4) on the lowest bit often. Because, the fact that the next upperBounds has been rewritten to prefer the upper bits suggest that the lower bits are, in fact, not as well randomized. Otherwise, I wonder if that code can't be made faster by staying with the lower bits. Might save a few cycles because you can save the down-shift, right?

@lemire
Copy link

@lemire lemire commented May 17, 2020

I agree with @xwu's answer and I think it is the correct one though I would point out that the documentation already states: "Returns a value from a uniform, independent distribution of binary data." If you have non-trivial cycles or you are offering a limited amount of randomness, it seems that you are clearly violating the "uniform + independent" ideal.

This potential problem should have been known about - it's clearly documented in any theory of PRNGs.

It is true that, historically, people (and standard libraries) have relied on random number generators that would produce, for example, 32-bit words with only 31 bits of randomness (or less). In fact, the history of pseudo-random number generators is filled with really bad generators.

However, today, we have many really fast pseudo-random generators that produce high quality random words: it is quite easy to have a really fast generator that will produce 64-bit words that will pass hard statistical tests (e.g., consider the PCG family). If you really need it, you can rather easily go for cryptographically strong generator. Sky is the limit! Twenty or thirty years ago, there were difficult to find and difficult to write, but it is now much easier. There is no good reason to default to whatever bad generator your standard C library offers if you care about quality.

I'm thinking of random gens that keep giving something like 0, 1, 0, 1 etc. (cycle length 2) or 0, 0, 1, 1 etc. (cycle length 4) on the lowest bit often.

I think that trying to accomodate obviously flawed pseudo-random number generators, like the ones you describe, is unnecessary because we now have (in 2020) a large collection of high-quality generators.

To prevent this, there could have been an extra function that returns the range of significantly randomized values.

It seems that you are asking for support for legacy generators.

Still, people might struggle with coming up with 64 bit random gens, and the need for reproducible RNGs is not uncommon.

That's a fair and probably correct point.

Some of us have offered some implementations...
https://github.com/lemire/SwiftWyhash

I think that the standard library could be extended with a few options. I, for one, would be willing to provide a couple if the core team called for it.

@tempelmann
Copy link

@tempelmann tempelmann commented May 17, 2020

Thanks for your clarifications. I must admit I'm a bit behind on this topic, having working on and with PRNGs in the 80s, mostly. Also, thanks for taking this seriously. I only came into this because I had been interested why someone's attempt at subclassing Swift's RNG did work in older Xcode versions but not in recent ones. I tried to figure this out by reading the docs and found that they didn't provide answers to this. So, I think adding some clarifying docs is the right way to resolve this for others in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants