An optimal algorithm for bounded random integers #39143

Draft
stephentyrone wants to merge 1 commit into main

Conversation

stephentyrone (Contributor)

Everyone knows that generating an unbiased random integer in a range 0 ..< upperBound, where upperBound is not a power of two, requires rejection sampling. What if I told you that Big Random Number has lied to us for decades, and we have been played for absolute fools?

Previously Swift used Lemire's "nearly divisionless" method (https://arxiv.org/abs/1805.10941) for this operation. This PR introduces a novel algorithm that:

  • never divides
  • avoids rejection sampling entirely
  • achieves a theoretically optimal bound on the amount of randomness consumed to generate a sample
  • delivers actual performance improvements for most real cases

Lemire interprets each word from the random source as a fixed-point number in [0, 1), multiplies by upperBound, and takes the floor. Up to this point, this is the algorithm suggested by Knuth in TAoCP vol 2, and as observed by Knuth, it is slightly biased. Lemire cleverly corrects this bias via rejection sampling, which requires one division in the general case (hence, "nearly divisionless").
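
For a word-sized upperBound, Knuth's biased step is a single full-width multiply: the random word is the numerator of the fixed-point fraction, and the high half of the product is the floor. A minimal Swift sketch (the function name and shape are illustrative, not this PR's code):

    // Knuth's biased method for 64-bit bounds: interpret a random word w
    // as the fixed-point fraction w / 2^64, so that
    // floor((w / 2^64) * upperBound) is the high half of the 128-bit
    // product w * upperBound.
    func knuthBiasedSample<G: RandomNumberGenerator>(
      _ upperBound: UInt64, using rng: inout G
    ) -> UInt64 {
      upperBound.multipliedFullWidth(by: rng.next()).high
    }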

Our new algorithm takes a different approach. Rather than using rejection sampling, we observe that the bias decreases exponentially in the number of bits used for the computation. In the limit, we are interpreting the bitstream from the random source as a uniform real number r in [0, 1), and ⌊r * upperBound⌋ provides an unbiased sample in 0 ..< upperBound. The only challenge, then, is to determine when we have computed enough bits of the product to know what the result is.

Observe that we can split the random stream at any bit position i, yielding r = r₀ + r₁, where r₀ is an i-bit fixed-point number in [0, 1) and 0 ≤ r₁ < 2⁻ⁱ. Further observe that:

result = ⌊r * upperBound⌋
       = ⌊r₀ * upperBound⌋ + ⌊frac(r₀*upperBound) + r₁*upperBound⌋

The first term of this expression is just Knuth's biased sample, which is computed with a single full-width multiply.

If i > log₂(upperBound), both summands in the second term are smaller than 1, so the second term is either 0 or 1. Applying the bound on r₁, we see that if frac(r₀ * upperBound) ≤ 1 - upperBound * 2⁻ⁱ, the second term is necessarily zero, and the first term is the unbiased result. Happily, this is also a trivial computation on the low-order part of the full-width multiply.
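
Concretely, with i = 64 everything scales by 2⁶⁴: frac(r₀ * upperBound) is the low half of the full-width product, and the test is one comparison. A hedged sketch (names are illustrative; note that in wrapping arithmetic 0 &- upperBound computes 2⁶⁴ - upperBound for upperBound > 0):

    // Early-out test for i = 64: the second term is guaranteed zero
    // whenever low <= 2^64 - upperBound.
    func sampleIfDecided(_ upperBound: UInt64, _ randomWord: UInt64) -> UInt64? {
      let (high, low) = upperBound.multipliedFullWidth(by: randomWord)
      return low <= (0 &- upperBound) ? high : nil
    }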

If the test fails, we do not reject the sample, throwing away the bits we have already consumed from the random source; instead we increase i by a convenient amount, computing more bits of the product. This is the critical improvement: whereas Lemire's method rejects with probability 1/2 per word consumed in the worst case, we terminate with probability 1/2 per bit consumed. This reduces the worst-case expected number of random bits required from O(log₂(upperBound)) to log₂(upperBound) + O(1), which is optimal[1].
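
A sketch of this formally-uniform loop for 64-bit bounds, extending the fraction one word at a time (an illustrative reading of the idea above, not this PR's exact code): a carry out of the running fraction word increments the result, and once the fraction word can no longer overflow, the result is final.

    // Extend the fraction word by word until the carry into the integer
    // part is decided.
    func uniformBoundedSample<G: RandomNumberGenerator>(
      _ upperBound: UInt64, using rng: inout G
    ) -> UInt64 {
      precondition(upperBound > 0)
      var (result, fraction) = upperBound.multipliedFullWidth(by: rng.next())
      // Undecided while fraction + upperBound could still carry past 2^64.
      while fraction > (0 &- upperBound) {
        let (high, low) = upperBound.multipliedFullWidth(by: rng.next())
        let (sum, carry) = fraction.addingReportingOverflow(high)
        if carry { return result &+ 1 }   // the carry reached the result
        if sum != .max { return result }  // no future carry can reach it
        fraction = low                    // all-ones word: descend a level
      }
      return result
    }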

Of more practical interest, this new algorithm opens an intriguing possibility: we can compute just 64 extra bits and have a probability of 1 - 2⁻⁶⁴ of terminating. This is so close to certainty that we can simply stop unconditionally without introducing any measurable bias (detecting any difference would require about 2¹²⁸ samples, which is prohibitive). This is a significant performance improvement for slow random generators, since it asymptotically halves the number of bits required for bignums while matching the expected number of bits required for smaller numbers. This is the algorithm implemented by this PR (the formally-uniform method is not much more complex to implement and is only a little slower, but there is no reason to use it).
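
For word-sized bounds this variant collapses to straight-line code: one early-out compare, then at most one extra word whose carry is applied unconditionally. A sketch under the same illustrative assumptions as above:

    // "64 extra bits" variant: if the early out fails, one more word
    // decides the carry with probability 1 - 2^-64, and we stop
    // unconditionally.
    func boundedRandom<G: RandomNumberGenerator>(
      _ upperBound: UInt64, using rng: inout G
    ) -> UInt64 {
      precondition(upperBound > 0)
      let (result, fraction) = upperBound.multipliedFullWidth(by: rng.next())
      if fraction <= (0 &- upperBound) { return result }  // carry already decided
      let high = upperBound.multipliedFullWidth(by: rng.next()).high
      let carry = fraction.addingReportingOverflow(high).overflow
      return carry ? result &+ 1 : result
    }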

More intriguing still, this algorithm can be made unconditional by removing the early out, so that every value computed consumes exactly word size + 64 bits from the stream. This breaks the loop-carried dependency for fast generators, unlocking vectorization and parallelization where they were previously impossible. This is an especially powerful advantage when paired with bitstream generators that allow skip-ahead, such as counter-based generators or PCG.
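
Removing the early out yields, for example, a branch-free kernel mapping two prefetched words to one sample, which is what makes batching possible; a sketch (again illustrative):

    // Unconditional variant: exactly two words per sample and no branches
    // on the random data, so independent iterations can be vectorized.
    func boundedRandomBranchless(
      _ upperBound: UInt64, _ w0: UInt64, _ w1: UInt64
    ) -> UInt64 {
      let (result, fraction) = upperBound.multipliedFullWidth(by: w0)
      let high = upperBound.multipliedFullWidth(by: w1).high
      return fraction.addingReportingOverflow(high).overflow ? result &+ 1 : result
    }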

Note that it is possible to employ Lemire's tighter early-out check (which involves a division) with this algorithm as well; this is beneficial in some cases when upperBound is a constant and the generator is slow, but we do not think it necessary given the new algorithm and other planned improvements.

[1] We can actually achieve log₂(upperBound) + ε for any ε > 0 by generating multiple random samples at once, but that is only of theoretical interest. It is still interesting, however, since I don't think anyone has described how to attain it previously.

@lemire commented Sep 2, 2021

Related idea at: https://github.com/KWillets/range_generator

The approach is an application of the concept of short product. I recently made use of it for a fast number parser... see paper at https://arxiv.org/pdf/2101.11408.pdf (section 7)

/// Returns the high word of the full-width product of a and b.
/// Requires T.bitWidth >= 64.
@_transparent @usableFromInline
internal func multiplyHigh<T: FixedWidthInteger & UnsignedInteger>(
  _ a: T, _ b: T
) -> T {
  a.multipliedFullWidth(by: b).high
}
@xwu (Collaborator) commented Sep 2, 2021

Could delete the ABI footprint of this by making it @_alwaysEmitIntoClient, I think. (Would also underscore the name.)

@stephentyrone (Contributor, Author) commented Sep 3, 2021

Since this is "doing numbers", it's worth pointing out here that the versions of this algorithm with early-outs can leak timing information, even if implemented with constant-time primitives, because the distribution of values that require an additional word to be computed is not uniform over the set of outputs. For Swift's standard library that's not a major issue, but it would be problematic in some other places where such algorithms might be deployed. (This does not apply to the unconditional "just use 64 extra bits no matter what" version.)

@JamesTheAwesomeDude commented Feb 26, 2022

In the limit we are interpreting the bitstream from the random source as a uniform real number r in [0, 1) and ⌊r * upperBound⌋ provides an unbiased sample in 0 ..< upperBound.

nb: isn't this the "Fast Dice Roller" presented in Lumbroso (2013) "Optimal discrete uniform generation from coin flips, and applications"?

In any case, congrats on beating CPython, PHP, C stdlib, OpenSSL, OpenJDK, Node, Rust, and Go to the punch. This thread helped me find the existence of this algorithm far faster than I would have otherwise.

paulidale mentioned this pull request Oct 25, 2023
paulidale added a commit to paulidale/openssl that referenced this pull request Oct 25, 2023
Refer: swiftlang/swift#39143 for a description
of the algorithm.

It is optimal in the sense of having:

* no divisions
* minimal number of blocks of random bits from the generator

paulidale added a commit to paulidale/openssl that referenced this pull request Oct 25, 2023
paulidale added a commit to paulidale/openssl that referenced this pull request Oct 26, 2023
paulidale added a commit to paulidale/openssl that referenced this pull request Oct 31, 2023
openssl-machine pushed a commit to openssl/openssl that referenced this pull request Nov 1, 2023
Refer: swiftlang/swift#39143 for a description
of the algorithm.

It is optimal in the sense of having:

* no divisions
* minimal number of blocks of random bits from the generator

Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com>
Reviewed-by: Matthias St. Pierre <Matthias.St.Pierre@ncp-e.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #22499)
openssl-machine pushed a commit to openssl/openssl that referenced this pull request Nov 1, 2023
(cherry picked from commit 55755fb)

wanghao75 pushed a commit to openeuler-mirror/openssl that referenced this pull request Nov 4, 2023