Choose randomly n in k elements, in linear time and constant memory complexity
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
results
.gitignore
Makefile
README.mkd
linear_choosens.py
plot.py
test.py

README.mkd

n choose k problem, and related implementations

Comparison of four Python implementations of the n choose k problem: Choose randomly n elements in a list of k elements. The distribution should be uniform: equiprobability is a necessary condition.

This works takes its roots from this blog post,that explain both interests and history of the problem and its solutions.

Notes on the linear implementation

The complexity in time is linear, and constant in memory (if you do not save the choosen items).

The linear implementation provides not only a performance boost compared to stdlib when the number of item is greater than 20 in the tests, but also a greater flexibility: you don't have to provides the full items in order to walk them, just their number.

You can therefore choose N elements into a generator of K elements, without having to store the full generator or even the N elements if you treat them immediatly.

The current implementation of linear_choosen implements these uses, and is a standalone ; you can therefore copy-paste it whereever you need it.

Repository

-plot.py: plotting with matplotlib
-test.py: launch benchmarks
-linear_choosen.py: methods implementation
-results/: images outputs

Implemented methods

The four following methods are implemented. The third one is the main interest of this repository. Further formal analysis of the method needs to be done.

stdlib

The random.sample method do exactly the job. It automatically choose between two internal implementations, function to n. Link to the doc.

Source.

dumb

The very obvious mix it, then take the n firsts. Very costly.

Source.

linear

This algorithm is detailed in Vitter paper An Efficient Algorithm for Sequential Random Sampling, published in 1987, with the mathematical proof. You can also find it in Knutt's Seminumerical Algorithms. (source: Jon Bentley's Programming Pearls)

This is an implementation proposal, that could probably be improved. Its less efficient than stdlib when n is small, but notably quicker when n comes near k.

The first element have a n/k likelihood to be in the output subset. If this element is choosen, then find the next choosens is like n-1 choose k-1. This treatment is recursively applied on the k items.

This method allows an O(1) complexity in memory, and a O(k) complexity in time (worst case), because elements are walked only once, and deciding whether an element is choosen or not is a comparison of a random number against a likelihood. The subset is consequently constructed during the walk of all elements. All the k elements don't need to be walked, when the search is 0 choose k-i.

Source.

(linear) recursive

Purely recursive implementation of the linear algorithm. Slower, don't work on big dataset without modification of python stack size.

Source.

Runtime comparison

runtimes The stdlib implementation is better when n is small, but the linear implementation is quicker in other cases.

linear method distribution

Following graphics don't show any obvious distribution bias between firsts and lasts elements.

runtimes

runtimes

These results are comparable to stdlib:

runtimes

runtimes

And to the dumb method:

runtimes

runtimes

Further analysis of the data could highlight a bias induced by the linear method. If so, it is probably a bias of implementation (bad source code) or a random bias (bad random generator). The method itself is proven.