Sampling with replacement; sampling with weights; and reservoir-sampling with vectors #11023
soerenwolfers
started this conversation in
Ideas
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
EDIT: the "weight" aspect of this has previously been asked for at #7506
Sampling is invaluable when working with big data and it's great that duckdb offers
USING SAMPLE
for this.Below, I am writing down some thoughts about how this could be made even more powerful.
The set of possible sampling choices encompasses at least:
Currently, duckdb supports
Some possible extensions (using
n
for the input sample size and eitherk
for the requested sample size orfraction
for the requested sample size as a fraction ofn
):WithReplacement
: This would be useful, for example, for statistical bootstrapping and for generating large test data from a small initial data collection. Implementation notes:For Bernoulli, this just means drawing from
Binomial[1/n, fraction * n]
instead ofBernoulli[fraction]
. The result of that is a sample where "every single input row has the same distribution of occurrences in the results set as it would have if fraction*n independent random samples with replacement had been drawn from the input set; but the occurrences of the different input rows are independent." Note that this is perfectly analogous to the existing Bernoulli without replacement, where "every single input row [...] random samples without replacement [...]" (The only thing I'm unsure about here is whether it's safe to assume thatn
is known at the outset. The fact that reservoir sampling currently does work when I specify a fraction, indicates it is.)For Reservoir, this can be done in
O(max(k, n)log(min(k, n)))
as described at https://epubs.siam.org/doi/pdf/10.1137/1.9781611972740.53 (the sampling from the Binomial distribution described there is a bit odd; that can be done inO(1)
using, e.g., acceptance rejection with a suitable Poisson or Normal or Bernoulli candidate depending on the indices)Weighted
: this only makes sense with replacement (without replacement, sufficiently large samples must contain all input rows, so cannot be weighted).This can be useful to ensure statistically meaningful samples in small samples (e.g. give more weight to big customers ). Weighted sampling followed by uniform aggregates is statistically equivalent to but cheaper than weighted aggregates.
Vectorized x Reservoir
: Not sure if this is needed, but does have advantage of much reduced variability of final sample size compared to vectorized Bernoulli.Finally, I'd find things more complete if
SpecifyRows
was supported on Bernoulli as well, but I understand there may be reasoning that specifying a number of rows implies exactness whereas specifying a percentage does not necessarily imply exactness.Beta Was this translation helpful? Give feedback.
All reactions