{{ message }}

# shuffle() modulo bias and extra iteration #57

Open
opened this issue Aug 13, 2018 · 3 comments
Open

# shuffle() modulo bias and extra iteration#57

opened this issue Aug 13, 2018 · 3 comments

### paulhauner commented Aug 13, 2018 • edited

TLDR; I think there are two bugs in the shuffling function. I suggest how to fix them. I also built a "sandbox" for testing shuffling functions.

I recently went on a deep-dive into shuffling functions to figure out why the beacon_chain one gets stuck in a loop. I have some findings which I will report here.

I believe there are two errors in the present implementation:

1. Modulo-bias in when pulling a "random" list index from the seed.
2. There is one unnecessary extra iteration in the swap.

I'll give a quick intro to the the Fisher-Yates shuffle and modulo bias for some background. I found it interesting, so you might too.

## Background

### Durstenfeld-Knuth-Fisher-Yates Shuffle

This is a method of shuffling a list "introduced" by Richard Durstenfeld in 1964 and "popularized" by Donald E. Knuth in The Art of Computer Programming. Apparently it's similar to work done by Ronald Fisher and Frank Yates in 1938. More @ wikipedia.

History aside, it looks like this:

``````-- To shuffle an array a of n elements (indices 0..n-1):
for i from 0 to n−2 do
j ← random integer such that i ≤ j < n
exchange a[i] and a[j]
``````

It seems that this is the primitive for the shuffling algorithm used in the spec, so lets assume this is what we're going to use. Seems optimal.

### Modulo-Bias

For background, modulo bias is introduced when using modulo to "fit" an random integer of a large range into a smaller range when the smaller range is not a clean multiple of the larger range. In other words, `large_range % small_range > 0`.

As an example, consider a function `rng()` which generates an integer between `0..3` (inclusive) and an application which requires a integer of `0..2` (inclusive). A developer might simply call `rng() % 3`, however this introduces a bias because both `0 % 3 = 0` and `3 % 3 = 0`. This skews the post-modulo result in favor of `0`. This is a Bad Thing™.

A method to fix this is to simply filter out all "biased" results, that is any results which are greater than `RAND_MAX - (RAND_MAX % n)`, where `RAND_MAX` defines the upper bound of an rng (`0..RAND_MAX`) and `n` is the upper bound of the desired range (`0..n`) and `RAND_MAX > n`. Lets apply this knowledge to the previous example and define the following function:

```def filtered_rng(n):
while True:
x = rng()
if x < rand_max - rand_max % n:
break
return x```

The developer can now call `filtered_rng(3)`. This will give them results without a modulo bias. The downside is that it's now possible for the function to run indeterminately, because it's relying on `rng` not to generate certain numbers. We can just assume that this won't actually run forever, instead it'll just slow us down a every now and then.

## Proposed changes

I propose the following code to fix these issues (the unused `config` var stays for simplicity):

```def shuffle(lst,
seed,
config=DEFAULT_CONFIG):
lst_count = len(lst)
assert lst_count <= 16777216
o = [x for x in lst]
source = seed
i = 0
while i < (len(lst) - 1):
source = blake(source)
for pos in range(0, 30, 3):
remaining = lst_count - i
if remaining == 1:
break
m = int.from_bytes(source[pos:pos+3], 'big')
rand_max = 16777216 - 16777216 % remaining
if m < rand_max:
replacement_pos = (m % remaining) + i
o[i], o[replacement_pos] = o[replacement_pos], o[i]
i += 1
return o```

You can find this code here in a "shuffling sandbox" I made to test the conclusions I have come to.

First, to address error 1 (modulo-bias), the `rand_max` calculation has been moved to where it can dynamically filter the "rng" results on each call to the "rng". If you look at the shuffling psuedo-code (above) you can see that the rng is required to be within dynamic bounds `i ≤ j < n`, therefore the "modulo-biased" rng results need to be filtered specific to these bounds, not as per the overall list result (as in the current implementation).

Secondly, to address error 2 (unnecessary iteration), the `while` loop has been changed to do one less iteration. The reason this last iteration is unnecessary because we're attempting to swap the last element in the list, `n-1`, with any other integer greater or equal to `n - 1` but less than `n`. The only possible outcome here is for the last element to be swapped with itself which is meaningless.

During the process of this, I made another shuffling implementation here. It is slower than the other implementations (due to abstraction overhead in Python), however it separates the logic into "shuffler" and "rng" components. This is useful in gaining an understanding of the fundamental structures involved.

## Optimizations

There are a couple of potential speed-ups I thought of whilst doing this, will test them at some point.

1. Only use the minimum required amount of bits from the seed. E.g., for each swap consume `ceil(log2(lst_count - i))` bits. This should reduce the amount of hashes at the cost of doing a little extra math. <-- After some brief testing I no longer think this is a sensible optimization.
2. Don't copy the list before performing the shuffle, use a shuffle method that naturally creates a new list instead of swapping an existing one.

Point (2) is not very important, it would not change the outcome of the shuffle. Worth noting for client devs. However, point (1) would change the results of the shuffling function and would therefore need to be included in the spec. I'll come back to this once I have some benchmarks on speed increases.

As always, happy for feedback. Sorry about the essay.

We are unable to convert the task to an issue at this time. Please try again.

### paulhauner commented Aug 13, 2018

 This is relevant to this: #18

### paulhauner commented Aug 14, 2018

 FYI, I played around with optimization 1 (only use the minimum amount of bits) and it doesn't seem to be feasible/sensible. The "optimization" ran orders of magnitude slower than the original version in Python. The `blake2` hashes are impressively quick, so their overhead isn't significant.

### paulhauner commented Aug 16, 2018

 FYI, this was discussed in the last Eth 2.0 Implementers Call (approx. 2 mins ago) and the spec has been changed. The spec still does an extra shuffle, however that doesn't affect the outcome of the shuffle, so it's not particularly important.