Sampling with replacement; sampling with weights; and reservoir-sampling with vectors #11023

soerenwolfers · 2024-03-06T18:42:31Z


EDIT: the "weight" aspect of this has previously been asked for at #7506

Sampling is invaluable when working with big data and it's great that duckdb offers USING SAMPLE for this.

Below, I am writing down some thoughts about how this could be made even more powerful.

The set of possible sampling choices encompasses at least:

{Vectorized, NonVectorized} 
x {Bernoulli, Reservoir} 
x {SpecifyRows, SpecifyFraction} 
x {WithReplacement, WithoutReplacement} 
x {Weighted, Uniform} 
x {Repeatable, NotRepeatable}

Currently, duckdb supports

(
    {Vectorized, NonVectorized} x {Bernoulli} x {SpecifyFraction}
    \cup {NonVectorized} x {Reservoir} x {SpecifyRows, SpecifyFraction}
)
x {WithoutReplacement} x {Uniform} x {Repeatable, NotRepeatable}

Some possible extensions (using n for the number of input rows, and either k for the requested sample size or fraction for the requested sample size as a fraction of n):

WithReplacement: This would be useful, for example, for statistical bootstrapping and for generating large test data from a small initial data collection. Implementation notes:

For Bernoulli, this just means drawing each row's number of occurrences from Binomial[1/n, fraction * n] (success probability 1/n, fraction * n trials) instead of Bernoulli[fraction]. The result of that is a sample where "every single input row has the same distribution of occurrences in the result set as it would have if fraction*n independent random samples with replacement had been drawn from the input set; but the occurrences of the different input rows are independent." Note that this is perfectly analogous to the existing Bernoulli without replacement, where "every single input row [...] random samples without replacement [...]" (The only thing I'm unsure about here is whether it's safe to assume that n is known at the outset. The fact that reservoir sampling currently works when I specify a fraction indicates that it is.)
For Reservoir, this can be done in O(max(k, n)log(min(k, n))) as described at https://epubs.siam.org/doi/pdf/10.1137/1.9781611972740.53 (the sampling from the Binomial distribution described there is a bit odd; that can be done in O(1) using, e.g., acceptance-rejection with a suitable Poisson, Normal, or Bernoulli candidate depending on the indices)
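To make the Bernoulli variant concrete, here is a minimal Python sketch. It assumes n is known up front (the open question above), and the function names are made up for illustration; nothing here is DuckDB's API. The Binomial draw uses plain CDF inversion rather than acceptance-rejection, which is good enough here because the expected count per row is only about `fraction`.

```python
import random

def binomial_inversion(trials, p, rng):
    """Draw from Binomial(trials, p) by CDF inversion (expected O(1 + trials * p) steps)."""
    if p >= 1.0:
        return trials  # degenerate case, e.g. a single input row
    u = rng.random()
    pmf = (1.0 - p) ** trials  # P(X = 0)
    cdf = pmf
    count = 0
    while u > cdf and count < trials:
        # Recurrence: P(X = c + 1) = P(X = c) * (trials - c) / (c + 1) * p / (1 - p)
        pmf *= (trials - count) / (count + 1) * p / (1.0 - p)
        count += 1
        cdf += pmf
    return count

def bernoulli_with_replacement(rows, fraction, seed=None):
    """Emit each row Binomial(round(fraction * n), 1 / n) times, independently per row."""
    rng = random.Random(seed)
    n = len(rows)  # assumption: the input size is known before sampling starts
    if n == 0:
        return []
    trials = round(fraction * n)
    out = []
    for row in rows:
        out.extend([row] * binomial_inversion(trials, 1.0 / n, rng))
    return out
```

As with the existing Bernoulli sampling, the output size is only fraction * n in expectation, not exactly.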

Weighted: this only makes sense with replacement (without replacement, sufficiently large samples must contain all input rows, so they cannot reflect the weights).

This can help keep small samples statistically meaningful (e.g. by giving more weight to big customers). Weighted sampling followed by uniform aggregates is statistically equivalent to, but cheaper than, weighted aggregates.
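A small Python check of that equivalence, taking the mean as the example aggregate (my choice, purely for illustration). The helper names are hypothetical; `random.choices` from the standard library does the weighted draw with replacement.

```python
import random

def weighted_mean(values, weights):
    """Exact weighted aggregate over the full input."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def mean_of_weighted_sample(values, weights, k, seed=None):
    """Uniform (unweighted) mean over a weighted sample drawn with replacement."""
    rng = random.Random(seed)
    sample = rng.choices(values, weights=weights, k=k)
    return sum(sample) / k
```

For values [10, 20, 30] with weights [1, 2, 7], the weighted mean is 26, and the plain mean of a large weighted sample should land close to that.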

For Reservoir, this just requires an additional running sum of the weights. It would prevent the "skipping" logic that saves random number generator invocations (you can't do better than O(n) random number calls here, as becomes clear from the sequence w_n = 2^n), but I feel that's an acceptable cost.
For Bernoulli, I don't think you can do anything without a prior run through to compute the sum of all weights.
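A rough Python sketch of the running-sum idea for Reservoir. This is a deliberately naive version that treats the k output slots as k independent size-1 reservoirs, so it costs O(n * k) random numbers; it is not the algorithm from the paper linked above, and the function name is made up.

```python
import random

def weighted_reservoir_with_replacement(rows_with_weights, k, seed=None):
    """One pass over (row, weight) pairs; each slot ends up as an independent weighted draw."""
    rng = random.Random(seed)
    slots = [None] * k
    total = 0.0  # running sum of the weights seen so far
    for row, weight in rows_with_weights:
        if weight <= 0:
            continue  # assumption: non-positive weights are simply skipped
        total += weight
        p = weight / total  # equals 1.0 for the first positive-weight row
        for i in range(k):
            # Replace slot i with probability weight / running_total.
            if rng.random() < p:
                slots[i] = row
    return slots
```

Row i survives in a given slot with probability (w_i / W_i) * prod_{j > i} (1 - w_j / W_j) = w_i / W_n, where W_j is the running sum after row j, so each slot is a draw proportional to the weights.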

Vectorized x Reservoir: Not sure if this is needed, but it does have the advantage of much lower variability in the final sample size compared to vectorized Bernoulli.

Could just do Reservoir(RandomlyRounded(k/VectorSize)) on the set of input vectors.
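Sketched in Python, under the assumption that the input arrives as a sequence of equally sized vectors (lists of rows). The reservoir step is the textbook Algorithm R applied to whole vectors; `randomly_rounded` and `vectorized_reservoir` are names invented for this sketch.

```python
import math
import random

def randomly_rounded(x, rng):
    """Round x up with probability equal to its fractional part, so E[result] = x."""
    lo = math.floor(x)
    return lo + (1 if rng.random() < x - lo else 0)

def vectorized_reservoir(vectors, k, vector_size, seed=None):
    """Reservoir-sample whole vectors so that about k rows come out in total."""
    rng = random.Random(seed)
    m = randomly_rounded(k / vector_size, rng)  # number of vectors to keep
    reservoir = []
    for i, vec in enumerate(vectors):
        if i < m:
            reservoir.append(vec)  # fill the reservoir with the first m vectors
        else:
            j = rng.randrange(i + 1)  # uniform in [0, i]
            if j < m:
                reservoir[j] = vec
    # Flatten the kept vectors back into rows.
    return [row for vec in reservoir for row in vec]
```

With full vectors the output size is always within one vector of k, which is where the reduced variability compared to vectorized Bernoulli comes from.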

Finally, I'd find things more complete if SpecifyRows was supported on Bernoulli as well, but I understand there may be reasoning that specifying a number of rows implies exactness whereas specifying a percentage does not necessarily imply exactness.
