[dnm] sql: use vectorized scan #31354

Open
wants to merge 4 commits into
base: master
from
2 participants
@jordanlewis
Member

jordanlewis commented Oct 15, 2018

This commit clones the RowFetcher and associated encoding machinery so
that flows can be planned directly on top of columnarized output from
MVCCScan.

This is [dnm] because that cloning is pretty gross. It would probably be
better to teach the RowFetcher about both methods, to avoid having to
clone so much code. Some cloning is probably unavoidable, though: I think
the decode methods will need to be cloned because they'll need to
understand exec.ColBatch and friends.

A new session setting is added, experimental_vectorize, which
automatically replaces TableReaders with a colBatchScan operator
followed by a materialize operator. This automatic replacement is quite
hacky and another reason why this is [dnm].
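
To make the shape of that replacement concrete, here is a minimal sketch of a batch-producing scan feeding a materializer that turns batches back into rows. Everything in it (`batch`, `batchSource`, `sliceScan`, `materialize`) is a hypothetical stand-in invented for illustration; the real colBatchScan and materialize operators work on exec.ColBatch and EncDatumRow, which are not modeled here.

```go
package main

import "fmt"

// batch is a hypothetical columnar batch: one int64 slice per column.
type batch struct {
	cols [][]int64
	n    int // number of rows in this batch
}

// batchSource is a hypothetical operator interface that yields batches
// until it is exhausted.
type batchSource interface {
	next() (batch, bool)
}

// sliceScan stands in for a columnar scan. It serves in-memory columns
// in fixed-size batches; batchSize must be positive.
type sliceScan struct {
	cols      [][]int64
	batchSize int
	pos       int
}

func (s *sliceScan) next() (batch, bool) {
	if len(s.cols) == 0 || s.batchSize <= 0 || s.pos >= len(s.cols[0]) {
		return batch{}, false
	}
	end := s.pos + s.batchSize
	if end > len(s.cols[0]) {
		end = len(s.cols[0])
	}
	b := batch{cols: make([][]int64, len(s.cols)), n: end - s.pos}
	for j, c := range s.cols {
		b.cols[j] = c[s.pos:end]
	}
	s.pos = end
	return b, true
}

// materialize drains a batch source and emits one row per batch entry,
// so that row-oriented consumers can sit on top of a columnar scan.
func materialize(src batchSource) [][]int64 {
	var rows [][]int64
	for {
		b, ok := src.next()
		if !ok {
			return rows
		}
		for i := 0; i < b.n; i++ {
			row := make([]int64, len(b.cols))
			for j := range b.cols {
				row[j] = b.cols[j][i]
			}
			rows = append(rows, row)
		}
	}
}

func main() {
	scan := &sliceScan{
		cols:      [][]int64{{1, 2, 3}, {2, 3, 4}},
		batchSize: 2,
	}
	for _, row := range materialize(scan) {
		fmt.Println(row)
	}
}
```

The point of the sketch is only the layering: the consumer above the materializer still sees rows, so nothing downstream has to know that the scan was columnar.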

This actually demonstrates something like 20% gains on COUNT and
things that use data from the scan, like SELECT SUM, without even
involving any other columnar operators. You can even see this in
cockroach demo if you turn experimental_vectorize on and use only
the currently supported datatypes (int, float, string, bytes) and no nulls.

I first thought that this was because we're batching the allocation
of the underlying memory for the data, but I don't think that's true,
because the materializer still has to create the same number of Datums
to do its job. So perhaps the speedup is due to the columnar
representation instead: there are fewer type switches when deserializing,
since all of the decodes for a single column (which is a single type)
are performed at once.
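
The type-switch hypothesis can be illustrated with a small sketch. It is not the RowFetcher code: the `colType` enum, the fixed-width big-endian encoding, and both decode functions are assumptions made up for this example. The row-wise decoder evaluates the switch once per value, while the column-wise decoder evaluates it once per column and then runs a tight typed loop.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// colType is a hypothetical stand-in for a column's type.
type colType int

const (
	int64Type colType = iota
	float64Type
)

// encodeInt64 and encodeFloat64 use a made-up 8-byte big-endian encoding.
func encodeInt64(v int64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, uint64(v))
	return b
}

func encodeFloat64(f float64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, math.Float64bits(f))
	return b
}

// decodeRowWise dispatches on the column type once per value,
// that is, rows*columns times in total.
func decodeRowWise(types []colType, rows [][][]byte) [][]interface{} {
	out := make([][]interface{}, len(rows))
	for i, row := range rows {
		out[i] = make([]interface{}, len(row))
		for j, raw := range row {
			switch types[j] {
			case int64Type:
				out[i][j] = int64(binary.BigEndian.Uint64(raw))
			case float64Type:
				out[i][j] = math.Float64frombits(binary.BigEndian.Uint64(raw))
			}
		}
	}
	return out
}

// decodeColumnWise dispatches once per column, then decodes every value
// of that column in a loop that never re-checks the type.
func decodeColumnWise(types []colType, rows [][][]byte) []interface{} {
	cols := make([]interface{}, len(types))
	for j, t := range types {
		switch t {
		case int64Type:
			col := make([]int64, len(rows))
			for i := range rows {
				col[i] = int64(binary.BigEndian.Uint64(rows[i][j]))
			}
			cols[j] = col
		case float64Type:
			col := make([]float64, len(rows))
			for i := range rows {
				col[i] = math.Float64frombits(binary.BigEndian.Uint64(rows[i][j]))
			}
			cols[j] = col
		}
	}
	return cols
}

func main() {
	types := []colType{int64Type, float64Type}
	rows := [][][]byte{
		{encodeInt64(1), encodeFloat64(1.5)},
		{encodeInt64(2), encodeFloat64(2.5)},
	}
	fmt.Println(decodeRowWise(types, rows))
	cols := decodeColumnWise(types, rows)
	fmt.Println(cols[0], cols[1])
}
```

Whether this difference accounts for the measured gain is exactly the open question above; the sketch only shows where the per-value dispatch disappears.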

First 2 commits are from #31353.

Release note: None

exec: add physical types enum
An exec physical type corresponds to a particular type's bytes
representation, from the perspective of exec.

Release note: None
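
As a rough illustration of what a physical-type enum could look like, the sketch below maps the four datatypes mentioned above (int, float, string, bytes) to an in-memory representation. The names `physType`, `physInt64`, `physFloat64`, `physBytes`, and `physTypeFor` are hypothetical, and the choice to give string and bytes the same physical type is an assumption, not something taken from the commit.

```go
package main

import "fmt"

// physType is a hypothetical enum of in-memory representations.
type physType int

const (
	unhandled physType = iota
	physInt64
	physFloat64
	physBytes
)

func (t physType) String() string {
	switch t {
	case physInt64:
		return "int64"
	case physFloat64:
		return "float64"
	case physBytes:
		return "bytes"
	default:
		return "unhandled"
	}
}

// physTypeFor maps a SQL type name to its physical type. Mapping both
// STRING and BYTES to physBytes is an assumption made for illustration.
func physTypeFor(sqlType string) physType {
	switch sqlType {
	case "INT":
		return physInt64
	case "FLOAT":
		return physFloat64
	case "STRING", "BYTES":
		return physBytes
	default:
		return unhandled
	}
}

func main() {
	for _, name := range []string{"INT", "FLOAT", "STRING", "BYTES", "DECIMAL"} {
		fmt.Printf("%s -> %s\n", name, physTypeFor(name))
	}
}
```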

@jordanlewis jordanlewis requested review from cockroachdb/distsql-prs as code owners Oct 15, 2018

Member

cockroach-teamcity commented Oct 15, 2018

This change is Reviewable

jordanlewis added some commits Oct 15, 2018

exec: add columnarizer and materializer
These are the components that map data between the EncDatumRow
representation and the exec.ColBatch representation.

Release note: None
sql: add vectorize settings
Release note: None
[dnm] sql: use vectorized scan
This commit clones the RowFetcher and associated encoding machinery so
that flows can be planned directly on top of columnarized output from
MVCCScan.

A new session setting is added, experimental_vectorize, which
automatically replaces TableReaders with a colBatchScan operator
followed by a materialize operator.

This actually demonstrates something like 20% gains on COUNT and
things that use data from the scan, like SELECT SUM, without even
involving any other columnar operators.

I thought that this was probably because we're batching the allocation
of the underlying memory for the data, but I don't think that's true,
because the materializer still has to create the same number of Datums
to do its job. So perhaps the speedup is due to the columnar
representation instead - fewer type switches when deserializing, since
it performs all of the decodes for a single column (which is a single
type) at once.

Release note: None
Member

jordanlewis commented Oct 17, 2018

To put my money where my mouth is:

root@:26257/defaultdb> show create table a;
+------------+-----------------------------------------------+
| table_name |               create_statement                |
+------------+-----------------------------------------------+
| a          | CREATE TABLE a (                              |
|            |     a INT NOT NULL,                           |
|            |     b INT NULL,                               |
|            |     CONSTRAINT "primary" PRIMARY KEY (a ASC), |
|            |     FAMILY "primary" (a, b)                   |
|            | )                                             |
+------------+-----------------------------------------------+
(1 row)

Time: 7.036ms


root@:26257/defaultdb> insert into a select g,g+1 from generate_series(1,100000) g(g);

root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000050000 | 5000150000 |
+------------+------------+
(1 row)

Time: 112.224ms

root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000050000 | 5000150000 |
+------------+------------+
(1 row)

Time: 104.331ms

root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000050000 | 5000150000 |
+------------+------------+
(1 row)

Time: 107.582ms

root@:26257/defaultdb> set experimental_vectorize=true;
SET

Time: 1.941ms

root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000148976 | 5000150000 |
+------------+------------+
(1 row)

Time: 83.114ms

root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000148976 | 5000150000 |
+------------+------------+
(1 row)

Time: 81.388ms

root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000148976 | 5000150000 |
+------------+------------+
(1 row)

Time: 80.507ms
Member

jordanlewis commented Oct 17, 2018

Hmm, just noticed that the results are wrong by "a little bit" - clearly there's a bug somewhere :) but I'm still pretty sure that the 20ms gain there isn't caused by whatever off-by-one error is in this idea.
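
For reference, the expected sums and the timing difference can be checked from the numbers in the transcript above. The sketch below is only arithmetic on those figures: the helper names `sumTo` and `mean` are made up for this example, and it says nothing about where the bug is.

```go
package main

import "fmt"

// sumTo returns 1 + 2 + ... + n.
func sumTo(n int64) int64 {
	return n * (n + 1) / 2
}

// mean returns the arithmetic mean of xs; xs must be non-empty.
func mean(xs []float64) float64 {
	total := 0.0
	for _, x := range xs {
		total += x
	}
	return total / float64(len(xs))
}

func main() {
	const n = 100000
	wantA := sumTo(n)  // column a holds 1..n
	wantB := wantA + n // column b holds g+1 for every g in 1..n

	// sum(a) as reported with experimental_vectorize turned on.
	const gotA = 5000148976

	fmt.Println("expected sum(a):", wantA)
	fmt.Println("expected sum(b):", wantB)
	fmt.Println("vectorized sum(a) is off by:", gotA-wantA)

	// Query times in milliseconds, copied from the transcript.
	rowTimes := []float64{112.224, 104.331, 107.582}
	vecTimes := []float64{83.114, 81.388, 80.507}
	r, v := mean(rowTimes), mean(vecTimes)
	fmt.Printf("mean time: %.1fms -> %.1fms (%.0f%% less)\n", r, v, 100*(r-v)/r)
}
```

The row-based runs match the closed-form sums (5000050000 and 5000150000), the vectorized sum(a) is 98976 too high while sum(b) is correct, and the three-run means work out to roughly 108 ms versus 82 ms, about 24% less time.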
