
Initial sampling implementation #1195

Merged · 14 commits merged into duckdb:master on Dec 7, 2020

Conversation

@Mytherin (Collaborator) commented Dec 5, 2020

This PR implements the sampling operator as described in #1188. The sample operator can be used as follows:

-- select a sample of 5 rows from "tbl" using reservoir sampling
SELECT * FROM tbl USING SAMPLE 5;
-- select a sample of 10% of the table using system sampling (cluster sampling)
SELECT * FROM tbl USING SAMPLE 10%;
-- select a sample of 10% of the table using bernoulli sampling
SELECT * FROM tbl USING SAMPLE 10 PERCENT (bernoulli);
-- select a sample of 50 rows of the table using reservoir sampling with a fixed seed (100)
SELECT * FROM tbl USING SAMPLE reservoir(50 ROWS) REPEATABLE (100);
-- select a sample of 10% of the table using system sampling with a fixed seed (377)
SELECT * FROM tbl USING SAMPLE 10% (system, 377);
-- select a sample of 20% of "tbl" BEFORE the join with tbl2
SELECT * FROM tbl TABLESAMPLE RESERVOIR(20%), tbl2 WHERE tbl.i=tbl2.i;
-- select a sample of 20% of "tbl" AFTER the join with tbl2
SELECT * FROM tbl, tbl2 WHERE tbl.i=tbl2.i USING SAMPLE RESERVOIR(20%);

Read the docs for more information.

Syntactically, USING SAMPLE should occur before the ORDER BY/LIMIT clauses, but after everything else. In the query plan, the sample operator is placed directly above the scan produced by the FROM clause. For example, here is TPC-H Q01 with a sample operator:

SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    lineitem
WHERE
    l_shipdate <= CAST('1998-09-02' AS date)
GROUP BY
    l_returnflag,
    l_linestatus
USING SAMPLE 100
ORDER BY
    l_returnflag,
    l_linestatus;


┌───────────────────────────┐
│          ORDER_BY         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           #0 ASC          │
│           #1 ASC          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       HASH_GROUP_BY       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             #0            │
│             #1            │
│          sum(#2)          │
│          sum(#3)          │
│          sum(#4)          │
│          sum(#5)          │
│          avg(#6)          │
│          avg(#7)          │
│          avg(#8)          │
│        count_star()       │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        l_returnflag       │
│        l_linestatus       │
│         l_quantity        │
│      l_extendedprice      │
│             #4            │
│   *(#4, +(1.00, l_tax))   │
│         l_quantity        │
│      l_extendedprice      │
│         l_discount        │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        l_returnflag       │
│        l_linestatus       │
│         l_quantity        │
│      l_extendedprice      │
│ *(l_extendedprice, -(1.00,│
│        l_discount))       │
│           l_tax           │
│         l_discount        │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             #0            │
│             #1            │
│             #2            │
│             #3            │
│             #4            │
│             #5            │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   l_shipdate<=1998-09-02  │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      PHYSICAL_SAMPLE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            100            │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│          SEQ_SCAN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          lineitem         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        l_returnflag       │
│        l_linestatus       │
│         l_quantity        │
│      l_extendedprice      │
│         l_discount        │
│           l_tax           │
│         l_shipdate        │
└───────────────────────────┘ 

The sample operator is implemented as streaming reservoir sampling without replacement with exponential jumps (following the algorithm from the paper Weighted random sampling with a reservoir by Efraimidis and Spirakis). Currently only uniform random sampling is supported, but weighted random sampling should not be a difficult extension.
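To make the idea concrete, here is a minimal standalone sketch of uniform reservoir sampling with exponential jumps, roughly following the Efraimidis/Spirakis scheme specialised to unit weights. It is illustrative only and is not the DuckDB operator code; the row-at-a-time Add interface and the uint64_t row identifier are simplifications chosen for the example.

// Sketch of uniform reservoir sampling with exponential jumps
// (Efraimidis & Spirakis, specialised to unit weights). Illustrative only.
#include <cmath>
#include <cstdint>
#include <queue>
#include <random>
#include <vector>

struct Entry {
    double key;    // random key; the reservoir keeps the m largest keys
    uint64_t row;  // stand-in for a row identifier / payload
    bool operator>(const Entry &o) const { return key > o.key; }
};

class ReservoirSampler {
public:
    ReservoirSampler(size_t m, uint64_t seed) : m(m), gen(seed) {}

    // Feed one row at a time.
    void Add(uint64_t row) {
        if (reservoir.size() < m) {
            reservoir.push({uniform(gen), row});
            if (reservoir.size() == m) NextJump();
            return;
        }
        if (rows_to_skip > 0) {          // exponential jump: skipping a row is O(1)
            rows_to_skip--;
            return;
        }
        // This row replaces the minimum-key entry in the reservoir.
        double t = reservoir.top().key;  // current threshold
        std::uniform_real_distribution<double> in_range(t, 1.0);
        reservoir.pop();
        reservoir.push({in_range(gen), row});
        NextJump();
    }

    std::vector<uint64_t> Result() {
        std::vector<uint64_t> out;
        auto copy = reservoir;
        while (!copy.empty()) { out.push_back(copy.top().row); copy.pop(); }
        return out;
    }

private:
    void NextJump() {
        // Rows to skip before the next replacement: with unit weights this is
        // floor(log(u) / log(threshold)) for u ~ U(0, 1). The measure-zero
        // edge case u == 0 is ignored for brevity.
        double t = reservoir.top().key;
        rows_to_skip = (uint64_t) std::floor(std::log(uniform(gen)) / std::log(t));
    }

    size_t m;
    std::mt19937_64 gen;
    std::uniform_real_distribution<double> uniform{0.0, 1.0};
    // Min-heap on key, so top() is the current threshold.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> reservoir;
    uint64_t rows_to_skip = 0;
};

A production version would consume data chunk-at-a-time and store the actual row payloads, but the control flow is the same: keep the m largest random keys and jump over a random number of rows between replacements instead of drawing a key per row.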

Some limitations:

  • We don't support lazy vectors yet, so the entire scan is always performed (i.e. no performance is saved by scanning less data); potential efficiency gains come only from computing aggregates over fewer rows. Related issue: Abstract and Lazy Vector representation #396.
  • Currently the sample size must be specified exactly; there is no option to specify the sample size as a percentage of the total data set. More work would be needed to properly support that: I'm not familiar with a streaming algorithm that supports unbiased sampling where the sample size m depends on the stream size. Materializing the table would be the easiest option, but that is undesirable, since sampling is primarily intended to make queries run faster, not slower. (This has since been fixed.)
  • SAMPLE is defined as a reserved keyword right now. Otherwise shift/reduce conflicts are introduced, because the parser cannot know in this case:
SELECT * FROM tbl WHERE x ++ sample 10;

Whether what is meant is WHERE (x ++) SAMPLE 10 or WHERE (x ++ sample) 10 (i.e. whether sample is a column operand of the binary operator ++, or ++ is a postfix unary operator and SAMPLE is a keyword).

An alternative would be to use the USING SAMPLE syntax instead, i.e.:

-- select a sample of size 10 
SELECT * FROM test USING SAMPLE 10

This avoids the need to add a reserved keyword, but looks a bit uglier imo. What do you think @hannesmuehleisen? (This has since been fixed: we have switched to the USING SAMPLE syntax.)

  • Currently samples are only tested for simple functionality; it would be nice to also have statistical tests on samples (e.g. testing that samples are actually uniform) using standard statistical testing techniques. One possible approach is sketched below.
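For example (a hedged sketch, not part of this PR), one could repeatedly sample a column with a small number of distinct, equally frequent values, count how often each value appears across the samples, and run a chi-square goodness-of-fit test against the uniform expectation:

// Illustrative chi-square goodness-of-fit check for sample uniformity.
// Given how often each of the equally likely values was drawn, compare
// the observed counts against a uniform expectation.
#include <vector>

// Returns the chi-square statistic; with (observed.size() - 1) degrees of
// freedom, e.g. for 10 buckets the 5% critical value is roughly 16.92.
double ChiSquareStatistic(const std::vector<size_t> &observed, size_t total_draws) {
    double expected = (double) total_draws / observed.size();
    double stat = 0.0;
    for (size_t count : observed) {
        double diff = (double) count - expected;
        stat += diff * diff / expected;
    }
    return stat;
}

A statistic far above the critical value for the chosen significance level would flag a biased sampler.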

@Mytherin (Collaborator, Author) commented Dec 5, 2020

Related, the SQL standard TABLESAMPLE method: https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation

We might want to implement this syntax as well (it is already in our parser actually).

@hannes (Member) commented Dec 6, 2020

So why don't we just support TABLESAMPLE like in Postgres and call it a day?

@Mytherin (Collaborator, Author) commented Dec 6, 2020

TABLESAMPLE is more limited because it can only be used on base table scans. I believe Postgres' implementation goes directly to the base table to filter which pages it reads from disk (i.e. the sample is pushed down into the scan). The sample syntax I proposed here can be used for arbitrary queries, which is much more flexible. I'm also not a huge fan of the TABLESAMPLE syntax:

SELECT avg(salary)
FROM emp TABLESAMPLE SYSTEM (50)

50 is a percentage, which does not seem very intuitive to me (I would expect either 0.5 to be a percentage, or 50 to be a discrete number of tuples). I also don't like the fact that the user needs to specify the sample method with no possibility for a default. In Postgres the method is also only usable for base tables, not for table-producing functions or subqueries, but this part we can fix by allowing this syntax after those as well.

Looking around a bit, SQL Server is a bit saner. There you specify either ROWS or PERCENT, and the sampling method is optional:

-- 50%
SELECT avg(salary)
FROM emp TABLESAMPLE (50 PERCENT)
-- 50 rows
SELECT avg(salary)
FROM emp TABLESAMPLE (50 ROWS)

I still think it is unnecessarily restrictive to only support this after a single entry in the FROM clause, because it forces the user to manually push down the sample into the FROM clause, which is problematic when joins are involved.

@Mytherin (Collaborator, Author) commented Dec 6, 2020

Having played around with this some more, I propose the following syntax (which does not require reserving any new keywords):

-- sample 100 elements of the table
SELECT * FROM tbl USING SAMPLE (100);
-- sample 10% of the table
SELECT * FROM tbl USING SAMPLE (10%);
-- sample 10% using bernoulli sampling
SELECT * FROM tbl USING SAMPLE (10%, bernoulli);
-- sample 10% using bernoulli sampling, fixed seed 200
SELECT * FROM tbl USING SAMPLE (10%, bernoulli, seed=200);
-- sample 10% using reservoir sampling, fixed seed 200
SELECT * FROM tbl USING SAMPLE (10%, method=reservoir, seed=200);

I propose we support the following three sampling methods:

  • System
  • Bernoulli
  • Reservoir

Reservoir sampling is the only sampling method that allows a fixed sample size (i.e. not a percentage). Bernoulli and system sampling do not have a fixed sample size. Bernoulli sampling simply takes the given percentage (e.g. 10%) and gives every tuple a 10% chance of passing. System sampling is similar to Bernoulli sampling, but gives every chunk a 10% chance of passing. This is more efficient because we don't need to pass partial chunks through the pipeline. It is relatively similar to Postgres' system sampling, which does the same but on the page level.

Bernoulli and system sampling are streaming sampling methods in the sense that they are non-blocking operators (i.e. not sinks). Reservoir sampling is a blocking operator (a sink).
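To illustrate the difference between the two streaming methods (a simplified sketch, not the actual operator code; DataChunk here is a stand-in for the real chunk type): Bernoulli sampling draws one random number per row and emits a filtered, possibly partial chunk, while system sampling draws a single random number per chunk and either forwards or drops the chunk wholesale.

// Sketch of Bernoulli vs. system sampling on a stream of chunks.
#include <random>
#include <vector>

struct DataChunk {
    std::vector<std::vector<double>> columns; // column-major values (placeholder)
    size_t size() const { return columns.empty() ? 0 : columns[0].size(); }
};

// Bernoulli sampling: every row independently passes with probability p.
// The chunk is filtered, so partial chunks flow through the pipeline.
DataChunk BernoulliSample(const DataChunk &chunk, double p, std::mt19937_64 &gen) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    DataChunk result;
    result.columns.resize(chunk.columns.size());
    for (size_t row = 0; row < chunk.size(); row++) {
        if (u(gen) < p) {
            for (size_t col = 0; col < chunk.columns.size(); col++) {
                result.columns[col].push_back(chunk.columns[col][row]);
            }
        }
    }
    return result;
}

// System sampling: every chunk passes or is dropped as a whole with
// probability p; no partial chunks are ever produced.
bool SystemSamplePassesChunk(double p, std::mt19937_64 &gen) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(gen) < p;
}

This is also why system sampling is cheaper: it never materialises partial chunks, at the cost of coarser granularity (and therefore higher variance on small inputs).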

My suggestion for default is: if an exact sample size is specified (i.e. not a percentage) we default to reservoir sampling. If a percentage is specified, we default to system sampling, i.e.:

-- system sampling, returns approximately 10% of the data
SELECT * FROM tbl USING SAMPLE (10%);
-- reservoir sampling, returns exactly 30 tuples
SELECT * FROM tbl USING SAMPLE (30);

Mytherin merged commit 1cb03a8 into duckdb:master on Dec 7, 2020
Mytherin deleted the sampling branch on April 20, 2021