Initial sampling implementation #1195
Conversation
Related: the SQL standard defines the `TABLESAMPLE` clause. We might want to implement this syntax as well (it is already in our parser, actually).
So why don't we just support `TABLESAMPLE`?
`TABLESAMPLE` is more limited because it can only be used on base table scans. I believe Postgres' implementation goes directly to the base table to decide which pages it reads from disk (i.e. the sample is pushed down into the scan). The sample syntax I proposed here can be used on arbitrary queries, which is much more flexible.

I'm also not a huge fan of the `TABLESAMPLE` syntax:

```sql
SELECT avg(salary)
FROM emp TABLESAMPLE SYSTEM (50)
```

Here `50` is a percentage, which does not seem very intuitive to me (I would expect either `0.5` to be a percentage, or `50` to be a discrete number of tuples). I also don't like that the user needs to specify the sampling method, with no possibility for a default. In Postgres the clause is also only usable on base tables, not on table-producing functions or subqueries, but that part we can fix by allowing this syntax after those as well.

Looking around a bit, SQL Server is a bit more sane. There you specify either `ROWS` or `PERCENT`, and the sampling method is optional:

```sql
-- 50%
SELECT avg(salary)
FROM emp TABLESAMPLE (50 PERCENT)

-- 50 rows
SELECT avg(salary)
FROM emp TABLESAMPLE (50 ROWS)
```

I still think it is unnecessarily restrictive to only support this after a single entry in the FROM clause, because it forces the user to manually push the sample down into the FROM clause, which is problematic when joins are involved.
Having played around with this some more, I propose the following syntax (which does not require reserving any new keywords):

```sql
-- sample 100 elements of the table
SELECT * FROM tbl USING SAMPLE (100);
-- sample 10% of the table
SELECT * FROM tbl USING SAMPLE (10%);
-- sample 10% using bernoulli sampling
SELECT * FROM tbl USING SAMPLE (10%, bernoulli);
-- sample 10% using bernoulli sampling, fixed seed 200
SELECT * FROM tbl USING SAMPLE (10%, bernoulli, seed=200);
-- sample 10% using reservoir sampling, fixed seed 200
SELECT * FROM tbl USING SAMPLE (10%, method=reservoir, seed=200);
```

I propose we support the following three sampling methods: reservoir sampling, bernoulli sampling and system sampling.
Reservoir sampling is the only one of these that allows a fixed sample size (i.e. not a percentage); bernoulli and system sampling do not have a fixed sample size. Bernoulli sampling takes the given percentage (e.g. 10%) and gives every tuple a 10% chance of passing. System sampling is similar, but gives every chunk a 10% chance of passing. This is more efficient because we don't need to pass partial chunks through the pipeline; it is relatively similar to Postgres' system sampling, which does the same but at the page level.

Bernoulli and system sampling are streaming sampling methods in the sense that they are non-blocking operators (i.e. not sinks). Reservoir sampling is a blocking operator (a sink).

My suggestion for defaults: if an exact sample size is specified (i.e. not a percentage), we default to reservoir sampling. If a percentage is specified, we default to system sampling, i.e.:

```sql
-- system sampling, returns approximately 10% of the data
SELECT * FROM tbl USING SAMPLE (10%);
-- reservoir sampling, returns 30 tuples
SELECT * FROM tbl USING SAMPLE (30);
```
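The difference between the two streaming methods above can be sketched in Python (a hypothetical illustration only, not DuckDB's actual implementation; the function names and the list-of-lists chunk representation are assumptions):

```python
import random

def bernoulli_sample(rows, fraction, rng=None):
    """Each row independently passes with probability `fraction`."""
    rng = rng or random.Random()
    return [row for row in rows if rng.random() < fraction]

def system_sample(chunks, fraction, rng=None):
    """Each chunk (vector of rows) passes or is dropped as a whole,
    so no partial chunks flow through the pipeline."""
    rng = rng or random.Random()
    out = []
    for chunk in chunks:
        if rng.random() < fraction:
            out.extend(chunk)
    return out
```

Both are non-blocking filters: they can emit output as input arrives, whereas a reservoir sampler must consume the entire input before producing its sample.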
This PR implements the sampling operator as described in #1188. The sample operator can be used as follows:

Read the docs for more information.

In a query, `SAMPLE` should occur before the ORDER BY/LIMIT clauses, but after everything else. In the query plan, the sample operator is placed directly after the scan of the FROM clause. For example, here is Q01 with a sample operator:

The sample operator is implemented using reservoir sampling without replacement with exponential jumps, in a streaming manner (following the algorithm from the paper "Weighted random sampling with a reservoir" by Pavlos S. Efraimidis et al.). Currently only uniform random sampling is supported, but weighted random sampling should not be a difficult extension.

Some limitations:
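As a rough illustration of the algorithm from that paper (a Python sketch under my own naming, not the actual implementation in this PR), weighted reservoir sampling with exponential jumps keeps a min-heap of keyed items and skips ahead in the stream by an exponentially distributed amount of weight between replacements:

```python
import heapq
import math
import random

def weighted_reservoir_sample(stream, k, rng=None):
    """Sketch of A-ExpJ (Efraimidis & Spirakis): weighted reservoir
    sampling without replacement, with exponential jumps. `stream`
    yields (item, weight) pairs; uniform sampling is the special case
    where every weight is 1.0."""
    rng = rng or random.Random()
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    it = iter(stream)
    # Fill the reservoir with the first k items, keyed by u^(1/w).
    for item, w in it:
        heapq.heappush(heap, (rng.random() ** (1.0 / w), item))
        if len(heap) == k:
            break
    if len(heap) < k:
        return [item for _, item in heap]
    # x is the total stream weight we may skip before the next replacement.
    x = math.log(rng.random()) / math.log(heap[0][0])
    for item, w in it:
        x -= w
        if x <= 0.0:
            # This item enters the reservoir, replacing the minimum key.
            t = heap[0][0] ** w
            key = rng.uniform(t, 1.0) ** (1.0 / w)
            heapq.heapreplace(heap, (key, item))
            x = math.log(rng.random()) / math.log(heap[0][0])
    return [item for _, item in heap]
```

The exponential jump is what makes this cheap in a streaming engine: most tuples are skipped with a single subtraction instead of drawing a random key per tuple.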
- Currently the sample size must be specified exactly; there is no option of specifying the sample size as a percentage of the total data set (this has since been fixed). More work would need to be done to properly support that: I'm not familiar with a streaming algorithm that supports unbiased sampling where the sample size depends on the stream size. Materializing the table would be the easiest option, but that is undesirable since sampling is primarily intended to make queries run faster, not slower.
- `SAMPLE` is defined as a reserved keyword right now. Otherwise shift/reduce conflicts are introduced, because in a case like `WHERE x ++ SAMPLE 10` the parser cannot know whether what is meant is `WHERE (x ++) SAMPLE 10` or `WHERE (x ++ sample) 10` (i.e. is `sample` a column used as input to a binary `++` operator, or is `++` a unary operator and `sample` a keyword). An alternative would be to use the `USING SAMPLE` syntax instead. This avoids needing to add a reserved keyword, but looks a bit uglier imo. What do you think @hannesmuehleisen ? (This has been fixed: we have switched to the `USING SAMPLE` syntax.)