# Home

WebDrake edited this page Apr 12, 2012 · 3 revisions
##### Clone this wiki locally

SampleD is a library for the D programming language that implements some clever algorithms for random sampling.

The generic problem is this: suppose we have a collection of N records and want to take a random sample of size n < N, such that any individual record has an equal chance of being picked.

A "simple" way of doing this is just to randomly select records, adding them to the sample if they have not already been selected. This may be acceptable if, for example, we want a tiny sample while having a large number of records, so duplicates are not likely to arise. If the sample is larger, however, this becomes very inefficient as we have to compare each new selection to every one of the already-selected records. Moreover, we have to store the list of already-selected records, which may also be prohibitive for large sample size.

A more tractable method is provided in Knuth (1998), pp. 142--143. Algorithm S sequentially considers each of the N records, randomly deciding for each whether or not it should be added to the sample. However, this algorithm clearly scales with N, which makes for a very inefficient case when we wish to pick a small sample from a very large number of records.

An alternative, more efficient approach is not to consider each individual record but to "skip" across the sequence of records -- in other words, at each step of the algorithm we randomly generate a number of records to ignore and jump across that number of records, selecting the final record we land on. To do this effectively we need an algorithm that will (i) correctly calculate the skip size, given the number of records remaining, the number of samples still to take, etc.; and (ii) make a minimal number of calls to the random number generator in order to do so.

The challenge was left as an open problem by Knuth and was solved in a pair of papers by Jeffrey Scott Vitter (1984, 1987), whose Algorithms A and D are implemented in the present library. Algorithm A offers an o(N) algorithm that requires only n random variates to be generated, while Algorithm D (which defaults to A in some special cases) reduces the order of the algorithm to n.

### References

• Knuth, D.E. (1998) The Art of Computer Programming [3rd ed.], vol. 2: Seminumerical Algorithms.

• Vitter, J.S. (1984) Faster methods for random sampling. Commun. ACM 27(7): 703-718.

• Vitter, J.S. (1987) An efficient algorithm for sequential random sampling. ACM Trans. Math. Sci. 13(1): 58--67.