
Initial sampling implementation #1195

Merged · 14 commits merged into duckdb:master on Dec 7, 2020

Conversation

@Mytherin (Collaborator) commented Dec 5, 2020

This PR implements the sampling operator as described in #1188. The sample operator can be used as follows:

-- select a sample of 5 rows from "tbl" using reservoir sampling
SELECT * FROM tbl USING SAMPLE 5;
-- select a sample of 10% of the table using system sampling (cluster sampling)
SELECT * FROM tbl USING SAMPLE 10%;
-- select a sample of 10% of the table using bernoulli sampling
SELECT * FROM tbl USING SAMPLE 10 PERCENT (bernoulli);
-- select a sample of 50 rows of the table using reservoir sampling with a fixed seed (100)
SELECT * FROM tbl USING SAMPLE reservoir(50 ROWS) REPEATABLE (100);
-- select a sample of 10% of the table using system sampling with a fixed seed (377)
SELECT * FROM tbl USING SAMPLE 10% (system, 377);
-- select a sample of 20% of "tbl" BEFORE the join with tbl2
SELECT * FROM tbl TABLESAMPLE RESERVOIR(20%), tbl2 WHERE tbl.i=tbl2.i;
-- select a sample of 20% of "tbl" AFTER the join with tbl2
SELECT * FROM tbl, tbl2 WHERE tbl.i=tbl2.i USING SAMPLE RESERVOIR(20%);

Read the docs for more information.

Syntactically, USING SAMPLE should occur before the ORDER BY/LIMIT clauses, but after everything else. In the query plan, the sample operator is placed directly above the scan produced by the FROM clause. For example, here is TPC-H Q01 with a sample operator:

SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    lineitem
WHERE
    l_shipdate <= CAST('1998-09-02' AS date)
GROUP BY
    l_returnflag,
    l_linestatus
USING SAMPLE 100
ORDER BY
    l_returnflag,
    l_linestatus;


┌───────────────────────────┐
│          ORDER_BY         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           #0 ASC          │
│           #1 ASC          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       HASH_GROUP_BY       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             #0            │
│             #1            │
│          sum(#2)          │
│          sum(#3)          │
│          sum(#4)          │
│          sum(#5)          │
│          avg(#6)          │
│          avg(#7)          │
│          avg(#8)          │
│        count_star()       │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        l_returnflag       │
│        l_linestatus       │
│         l_quantity        │
│      l_extendedprice      │
│             #4            │
│   *(#4, +(1.00, l_tax))   │
│         l_quantity        │
│      l_extendedprice      │
│         l_discount        │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        l_returnflag       │
│        l_linestatus       │
│         l_quantity        │
│      l_extendedprice      │
│ *(l_extendedprice, -(1.00,│
│        l_discount))       │
│           l_tax           │
│         l_discount        │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             #0            │
│             #1            │
│             #2            │
│             #3            │
│             #4            │
│             #5            │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   l_shipdate<=1998-09-02  │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      PHYSICAL_SAMPLE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            100            │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│          SEQ_SCAN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          lineitem         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        l_returnflag       │
│        l_linestatus       │
│         l_quantity        │
│      l_extendedprice      │
│         l_discount        │
│           l_tax           │
│         l_shipdate        │
└───────────────────────────┘ 

The sample operator is implemented as streaming reservoir sampling without replacement with exponential jumps (following the algorithm from the paper Weighted random sampling with a reservoir by Efraimidis and Spirakis). Currently only uniform random sampling is supported, but weighted random sampling should not be a difficult extension.
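To make the idea concrete, here is a minimal standalone sketch of uniform reservoir sampling with exponential jumps, roughly following the Efraimidis/Spirakis scheme specialised to unit weights. It is illustrative only and is not the DuckDB operator code; the row-at-a-time Add interface and the uint64_t row identifier are simplifications chosen for the example.

// Sketch of uniform reservoir sampling with exponential jumps
// (Efraimidis & Spirakis, specialised to unit weights). Illustrative only.
#include <cmath>
#include <cstdint>
#include <queue>
#include <random>
#include <vector>

struct Entry {
    double key;    // random key; the reservoir keeps the m largest keys
    uint64_t row;  // stand-in for a row identifier / payload
    bool operator>(const Entry &o) const { return key > o.key; }
};

class ReservoirSampler {
public:
    ReservoirSampler(size_t m, uint64_t seed) : m(m), gen(seed) {}

    // Feed one row at a time.
    void Add(uint64_t row) {
        if (reservoir.size() < m) {
            reservoir.push({uniform(gen), row});
            if (reservoir.size() == m) NextJump();
            return;
        }
        if (rows_to_skip > 0) {          // exponential jump: skipping a row is O(1)
            rows_to_skip--;
            return;
        }
        // This row replaces the minimum-key entry in the reservoir.
        double t = reservoir.top().key;  // current threshold
        std::uniform_real_distribution<double> in_range(t, 1.0);
        reservoir.pop();
        reservoir.push({in_range(gen), row});
        NextJump();
    }

    std::vector<uint64_t> Result() {
        std::vector<uint64_t> out;
        auto copy = reservoir;
        while (!copy.empty()) { out.push_back(copy.top().row); copy.pop(); }
        return out;
    }

private:
    void NextJump() {
        // Rows to skip before the next replacement: with unit weights this is
        // floor(log(u) / log(threshold)) for u ~ U(0, 1). The measure-zero
        // edge case u == 0 is ignored for brevity.
        double t = reservoir.top().key;
        rows_to_skip = (uint64_t) std::floor(std::log(uniform(gen)) / std::log(t));
    }

    size_t m;
    std::mt19937_64 gen;
    std::uniform_real_distribution<double> uniform{0.0, 1.0};
    // Min-heap on key, so top() is the current threshold.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> reservoir;
    uint64_t rows_to_skip = 0;
};

A production version would consume data chunk-at-a-time and store the actual row payloads, but the control flow is the same: keep the m largest random keys and jump over a random number of rows between replacements instead of drawing a key per row.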

Some limitations:

  • We don't support lazy vectors yet, so the entire scan is always performed (i.e. no performance is saved by scanning less data); potential efficiency gains come only from computing aggregates over fewer rows. Related issue: Abstract and Lazy Vector representation #396.
  • Currently the sample size must be specified exactly; there is no option to specify the sample size as a percentage of the total data set. More work would be needed to properly support that: I'm not familiar with a streaming algorithm that supports unbiased sampling where the sample size m depends on the stream size. Materializing the table would be the easiest option, but that is undesirable, since sampling is primarily intended to make queries run faster, not slower. (This has since been fixed.)
  • SAMPLE is defined as a reserved keyword right now. Otherwise shift/reduce conflicts are introduced, because the parser cannot know in this case:
SELECT * FROM tbl WHERE x ++ sample 10;

Whether what is meant is WHERE (x ++) SAMPLE 10 or WHERE (x ++ sample) 10 (i.e. whether sample is a column operand of the binary operator ++, or ++ is a postfix unary operator and SAMPLE is a keyword).

An alternative would be to use the USING SAMPLE syntax instead, i.e.:

-- select a sample of size 10 
SELECT * FROM test USING SAMPLE 10

This avoids the need to add a reserved keyword, but looks a bit uglier imo. What do you think @hannesmuehleisen? (This has since been fixed: we have switched to the USING SAMPLE syntax.)

  • Currently samples are only tested for simple functionality; it would be nice to also have statistical tests on samples (e.g. testing that samples are actually uniform) using standard statistical testing techniques. One possible approach is sketched below.
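For example (a hedged sketch, not part of this PR), one could repeatedly sample a column with a small number of distinct, equally frequent values, count how often each value appears across the samples, and run a chi-square goodness-of-fit test against the uniform expectation:

// Illustrative chi-square goodness-of-fit check for sample uniformity.
// Given how often each of the equally likely values was drawn, compare
// the observed counts against a uniform expectation.
#include <vector>

// Returns the chi-square statistic; with (observed.size() - 1) degrees of
// freedom, e.g. for 10 buckets the 5% critical value is roughly 16.92.
double ChiSquareStatistic(const std::vector<size_t> &observed, size_t total_draws) {
    double expected = (double) total_draws / observed.size();
    double stat = 0.0;
    for (size_t count : observed) {
        double diff = (double) count - expected;
        stat += diff * diff / expected;
    }
    return stat;
}

A statistic far above the critical value for the chosen significance level would flag a biased sampler.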

@Mytherin (Collaborator, Author) commented Dec 5, 2020

Related, the SQL standard TABLESAMPLE method: https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation

We might want to implement this syntax as well (it is already in our parser actually).

@hannes (Member) commented Dec 6, 2020

So why don't we just support TABLESAMPLE like in Postgres and call it a day?

@Mytherin (Collaborator, Author) commented Dec 6, 2020

TABLESAMPLE is more limited because it can only be used on base table scans. I believe Postgres' implementation goes directly to the base table to filter which pages it reads from disk (i.e. the sample is pushed down into the scan). The sample syntax I proposed here can be used for arbitrary queries, which is much more flexible. I'm also not a huge fan of the TABLESAMPLE syntax:

SELECT avg(salary)
FROM emp TABLESAMPLE SYSTEM (50)

50 is a percentage, which does not seem very intuitive to me (I would expect either 0.5 to be a percentage, or 50 to be a discrete number of tuples). I also don't like the fact that the user needs to specify the sample method with no possibility for a default. In Postgres the method is also only usable for base tables, not for table-producing functions or subqueries, but this part we can fix by allowing this syntax after those as well.

Looking around a bit, SQL Server is a bit saner. There you specify either ROWS or PERCENT, and the sampling method is optional:

-- 50%
SELECT avg(salary)
FROM emp TABLESAMPLE (50 PERCENT)
-- 50 rows
SELECT avg(salary)
FROM emp TABLESAMPLE (50 ROWS)

I still think it is unnecessarily restrictive to only support this after a single entry in the FROM clause, because it forces the user to manually push down the sample into the FROM clause, which is problematic when joins are involved.

@Mytherin (Collaborator, Author) commented Dec 6, 2020

Having played around with this some more, I propose the following syntax (which does not require reserving any new keywords):

-- sample 100 elements of the table
SELECT * FROM tbl USING SAMPLE (100);
-- sample 10% of the table
SELECT * FROM tbl USING SAMPLE (10%);
-- sample 10% using bernoulli sampling
SELECT * FROM tbl USING SAMPLE (10%, bernoulli);
-- sample 10% using bernoulli sampling, fixed seed 200
SELECT * FROM tbl USING SAMPLE (10%, bernoulli, seed=200);
-- sample 10% using reservoir sampling, fixed seed 200
SELECT * FROM tbl USING SAMPLE (10%, method=reservoir, seed=200);

I propose we support the following three sampling methods:

  • System
  • Bernoulli
  • Reservoir

Reservoir sampling is the only sampling method that allows a fixed sample size (i.e. not a percentage). Bernoulli and system sampling do not have a fixed sample size. Bernoulli sampling simply takes the given percentage (e.g. 10%) and gives every tuple a 10% chance of passing. System sampling is similar to Bernoulli sampling, but gives every chunk a 10% chance of passing. This is more efficient because we don't need to pass partial chunks through the pipeline. It is relatively similar to Postgres' system sampling, which does the same but on the page level.

Bernoulli and system sampling are streaming sampling methods in the sense that they are non-blocking operators (i.e. not sinks). Reservoir sampling is a blocking operator (a sink).
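To illustrate the difference between the two streaming methods (a simplified sketch, not the actual operator code; DataChunk here is a stand-in for the real chunk type): Bernoulli sampling draws one random number per row and emits a filtered, possibly partial chunk, while system sampling draws a single random number per chunk and either forwards or drops the chunk wholesale.

// Sketch of Bernoulli vs. system sampling on a stream of chunks.
#include <random>
#include <vector>

struct DataChunk {
    std::vector<std::vector<double>> columns; // column-major values (placeholder)
    size_t size() const { return columns.empty() ? 0 : columns[0].size(); }
};

// Bernoulli sampling: every row independently passes with probability p.
// The chunk is filtered, so partial chunks flow through the pipeline.
DataChunk BernoulliSample(const DataChunk &chunk, double p, std::mt19937_64 &gen) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    DataChunk result;
    result.columns.resize(chunk.columns.size());
    for (size_t row = 0; row < chunk.size(); row++) {
        if (u(gen) < p) {
            for (size_t col = 0; col < chunk.columns.size(); col++) {
                result.columns[col].push_back(chunk.columns[col][row]);
            }
        }
    }
    return result;
}

// System sampling: every chunk passes or is dropped as a whole with
// probability p; no partial chunks are ever produced.
bool SystemSamplePassesChunk(double p, std::mt19937_64 &gen) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(gen) < p;
}

This is also why system sampling is cheaper: it never materialises partial chunks, at the cost of coarser granularity (and therefore higher variance on small inputs).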

My suggestion for default is: if an exact sample size is specified (i.e. not a percentage) we default to reservoir sampling. If a percentage is specified, we default to system sampling, i.e.:

-- system sampling, returns approximately 10% of the data
SELECT * FROM tbl USING SAMPLE (10%);
-- reservoir sampling, returns exactly 30 tuples
SELECT * FROM tbl USING SAMPLE (30);

Mytherin merged commit 1cb03a8 into duckdb:master on Dec 7, 2020
Mytherin deleted the sampling branch on April 20, 2021