New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shuffle() modulo bias and extra iteration #57

Open
paulhauner opened this Issue Aug 13, 2018 · 3 comments

Comments

Projects
None yet
1 participant
@paulhauner
Copy link

paulhauner commented Aug 13, 2018

TLDR; I think there are two bugs in the shuffling function. I suggest how to fix them. I also built a "sandbox" for testing shuffling functions.


I recently went on a deep-dive into shuffling functions to figure out why the beacon_chain one gets stuck in a loop. I have some findings which I will report here.

I believe there are two errors in the present implementation:

  1. Modulo-bias in when pulling a "random" list index from the seed.
  2. There is one unnecessary extra iteration in the swap.

I'll give a quick intro to the the Fisher-Yates shuffle and modulo bias for some background. I found it interesting, so you might too.

Background

Durstenfeld-Knuth-Fisher-Yates Shuffle

This is a method of shuffling a list "introduced" by Richard Durstenfeld in 1964 and "popularized" by Donald E. Knuth in The Art of Computer Programming. Apparently it's similar to work done by Ronald Fisher and Frank Yates in 1938. More @ wikipedia.

History aside, it looks like this:

-- To shuffle an array a of n elements (indices 0..n-1):
for i from 0 to n−2 do
     j ← random integer such that i ≤ j < n
     exchange a[i] and a[j]

It seems that this is the primitive for the shuffling algorithm used in the spec, so lets assume this is what we're going to use. Seems optimal.

Modulo-Bias

For background, modulo bias is introduced when using modulo to "fit" an random integer of a large range into a smaller range when the smaller range is not a clean multiple of the larger range. In other words, large_range % small_range > 0.

As an example, consider a function rng() which generates an integer between 0..3 (inclusive) and an application which requires a integer of 0..2 (inclusive). A developer might simply call rng() % 3, however this introduces a bias because both 0 % 3 = 0 and 3 % 3 = 0. This skews the post-modulo result in favor of 0. This is a Bad Thing™.

A method to fix this is to simply filter out all "biased" results, that is any results which are greater than RAND_MAX - (RAND_MAX % n), where RAND_MAX defines the upper bound of an rng (0..RAND_MAX) and n is the upper bound of the desired range (0..n) and RAND_MAX > n. Lets apply this knowledge to the previous example and define the following function:

def filtered_rng(n):
    while True:
            x = rng()
            if x < rand_max - rand_max % n:
                break
    return x

The developer can now call filtered_rng(3). This will give them results without a modulo bias. The downside is that it's now possible for the function to run indeterminately, because it's relying on rng not to generate certain numbers. We can just assume that this won't actually run forever, instead it'll just slow us down a every now and then.

Proposed changes

I propose the following code to fix these issues (the unused config var stays for simplicity):

def shuffle(lst,
            seed,
            config=DEFAULT_CONFIG):
    lst_count = len(lst)
    assert lst_count <= 16777216
    o = [x for x in lst]
    source = seed
    i = 0
    while i < (len(lst) - 1):
        source = blake(source)
        for pos in range(0, 30, 3):
            remaining = lst_count - i
            if remaining == 1:
                break
            m = int.from_bytes(source[pos:pos+3], 'big')
            rand_max = 16777216 - 16777216 % remaining
            if m < rand_max:
                replacement_pos = (m % remaining) + i
                o[i], o[replacement_pos] = o[replacement_pos], o[i]
                i += 1
    return o

You can find this code here in a "shuffling sandbox" I made to test the conclusions I have come to.

First, to address error 1 (modulo-bias), the rand_max calculation has been moved to where it can dynamically filter the "rng" results on each call to the "rng". If you look at the shuffling psuedo-code (above) you can see that the rng is required to be within dynamic bounds i ≤ j < n, therefore the "modulo-biased" rng results need to be filtered specific to these bounds, not as per the overall list result (as in the current implementation).

Secondly, to address error 2 (unnecessary iteration), the while loop has been changed to do one less iteration. The reason this last iteration is unnecessary because we're attempting to swap the last element in the list, n-1, with any other integer greater or equal to n - 1 but less than n. The only possible outcome here is for the last element to be swapped with itself which is meaningless.

Extra Reading

During the process of this, I made another shuffling implementation here. It is slower than the other implementations (due to abstraction overhead in Python), however it separates the logic into "shuffler" and "rng" components. This is useful in gaining an understanding of the fundamental structures involved.

Optimizations

There are a couple of potential speed-ups I thought of whilst doing this, will test them at some point.

  1. Only use the minimum required amount of bits from the seed. E.g., for each swap consume ceil(log2(lst_count - i)) bits. This should reduce the amount of hashes at the cost of doing a little extra math. <-- After some brief testing I no longer think this is a sensible optimization.
  2. Don't copy the list before performing the shuffle, use a shuffle method that naturally creates a new list instead of swapping an existing one.

Point (2) is not very important, it would not change the outcome of the shuffle. Worth noting for client devs. However, point (1) would change the results of the shuffling function and would therefore need to be included in the spec. I'll come back to this once I have some benchmarks on speed increases.

As always, happy for feedback. Sorry about the essay.

@paulhauner

This comment has been minimized.

Copy link
Author

paulhauner commented Aug 13, 2018

This is relevant to this: #18

@paulhauner

This comment has been minimized.

Copy link
Author

paulhauner commented Aug 14, 2018

FYI, I played around with optimization 1 (only use the minimum amount of bits) and it doesn't seem to be feasible/sensible. The "optimization" ran orders of magnitude slower than the original version in Python.

The blake2 hashes are impressively quick, so their overhead isn't significant.

@paulhauner

This comment has been minimized.

Copy link
Author

paulhauner commented Aug 16, 2018

FYI, this was discussed in the last Eth 2.0 Implementers Call (approx. 2 mins ago) and the spec has been changed. The spec still does an extra shuffle, however that doesn't affect the outcome of the shuffle, so it's not particularly important.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment