[dnm] sql: use vectorized scan #31354
+2,319 −15
Conversation
jordanlewis requested review from solongordon, asubiotto and changangela on Oct 15, 2018
jordanlewis requested review from cockroachdb/distsql-prs as code owners on Oct 15, 2018
jordanlewis added some commits on Oct 15, 2018
jordanlewis (Member) commented Oct 17, 2018
To put my money where my mouth is:
root@:26257/defaultdb> show create table a;
+------------+-----------------------------------------------+
| table_name |                create_statement               |
+------------+-----------------------------------------------+
| a          | CREATE TABLE a (                              |
|            |     a INT NOT NULL,                           |
|            |     b INT NULL,                               |
|            |     CONSTRAINT "primary" PRIMARY KEY (a ASC), |
|            |     FAMILY "primary" (a, b)                   |
|            | )                                             |
+------------+-----------------------------------------------+
(1 row)
Time: 7.036ms
root@:26257/defaultdb> insert into a select g,g+1 from generate_series(1,100000) g(g);
root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000050000 | 5000150000 |
+------------+------------+
(1 row)
Time: 112.224ms
root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000050000 | 5000150000 |
+------------+------------+
(1 row)
Time: 104.331ms
root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000050000 | 5000150000 |
+------------+------------+
(1 row)
Time: 107.582ms
root@:26257/defaultdb> set experimental_vectorize=true;
SET
Time: 1.941ms
root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000148976 | 5000150000 |
+------------+------------+
(1 row)
Time: 83.114ms
root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000148976 | 5000150000 |
+------------+------------+
(1 row)
Time: 81.388ms
root@:26257/defaultdb> select sum(a), sum(b) from a;
+------------+------------+
|    sum     |    sum     |
+------------+------------+
| 5000148976 | 5000150000 |
+------------+------------+
(1 row)
Time: 80.507ms
jordanlewis (Member) commented Oct 17, 2018
Hmm, just noticed that the results are wrong by "a little bit" - clearly there's a bug somewhere :) but I'm still pretty sure that the 20ms gain there isn't caused by whatever off-by-one error is in this idea.
jordanlewis commented Oct 15, 2018 (edited)
This commit clones the RowFetcher and associated encoding machinery so
that flows can be planned directly on top of columnarized output from
MVCCScan.
This is [dnm] because that cloning is pretty gross. It would be better to
teach the RowFetcher about both methods, potentially, to avoid having
to clone so much code. Some of it is probably unavoidable, though: the
decode methods will likely need to be cloned because they'll need to
understand exec.ColBatch and friends.

A new session setting is added, experimental_vectorize, which
automatically replaces TableReaders with a colBatchScan operator
followed by a materialize operator. This automatic replacement is quite
hacky, which is another reason why this is [dnm].
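To make the shape of that replacement concrete, here is a minimal,
self-contained Go sketch. The names colBatch, colBatchScan, and
materialize mirror the operators mentioned above, but this is not the
PR's actual exec/distsqlrun code, just an illustration of a columnar
scan feeding a materializer that hands rows back to the row-based flow:

package main

import "fmt"

// colBatch is a toy stand-in for a columnar batch: one typed slice per
// column plus a row count. The real batch type carries much more
// (multiple types, null bitmaps, selection vectors); this models just
// two int64 columns to keep the shape visible.
type colBatch struct {
	length int
	cols   [][]int64
}

// colBatchScan is a toy stand-in for the operator that sits on top of
// the columnarized scan output; here it just replays pre-built batches.
type colBatchScan struct {
	batches []colBatch
	idx     int
}

// next returns the following batch, or false when the scan is exhausted.
func (s *colBatchScan) next() (colBatch, bool) {
	if s.idx >= len(s.batches) {
		return colBatch{}, false
	}
	b := s.batches[s.idx]
	s.idx++
	return b, true
}

// materialize converts each columnar batch back into row-at-a-time
// tuples, which is what lets the rest of the row-based flow run
// unchanged on top of the new scan.
func materialize(s *colBatchScan, emit func(row []int64)) {
	for b, ok := s.next(); ok; b, ok = s.next() {
		for i := 0; i < b.length; i++ {
			row := make([]int64, len(b.cols))
			for c := range b.cols {
				row[c] = b.cols[c][i]
			}
			emit(row)
		}
	}
}

func main() {
	scan := &colBatchScan{batches: []colBatch{
		{length: 3, cols: [][]int64{{1, 2, 3}, {2, 3, 4}}}, // columns a, b
	}}
	var sumA, sumB int64
	materialize(scan, func(row []int64) {
		sumA += row[0]
		sumB += row[1]
	})
	fmt.Println(sumA, sumB) // 6 9
}

The point of the materialize step is that everything downstream keeps
consuming rows, so only the scan side of the flow has to change.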
This actually demonstrates something like 20% gains on COUNT and on
queries that use data from the scan, like SELECT SUM, without even
involving any other columnar operators. You can even see it in
cockroach demo if you turn experimental_vectorize on and use only the
currently supported datatypes (int, float, string, bytes) and no nulls.
I thought that this was probably because we're batching the allocation
of the underlying memory for the data, but I don't think that's true,
because the materializer still has to create the same number of Datums
to do its job. So perhaps the speedup is due to the columnar
representation instead - fewer type switches when deserializing, since
it performs all of the decodes for a single column (which is a single
type) at once.
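To make the "fewer type switches" hypothesis concrete, here is a toy Go
comparison. This is not the PR's decoding code, just an illustration
under the assumption that decoding is driven by a type dispatch: the
row-at-a-time version pays one dispatch per value, while the
column-at-a-time version resolves the type once per column and then
runs a tight loop over a single typed slice.

package main

import "fmt"

// sumRowAtATime mimics row-oriented decoding: the type of every value
// is re-discovered with a type switch on each iteration.
func sumRowAtATime(rows [][]interface{}) []int64 {
	if len(rows) == 0 {
		return nil
	}
	sums := make([]int64, len(rows[0]))
	for _, row := range rows {
		for c, v := range row {
			switch val := v.(type) { // one dispatch per value
			case int64:
				sums[c] += val
			}
		}
	}
	return sums
}

// sumColumnAtATime mimics columnar decoding: the type switch happens
// once per column, and the inner loop runs over a plain []int64.
func sumColumnAtATime(cols []interface{}) []int64 {
	sums := make([]int64, len(cols))
	for c, col := range cols {
		switch vals := col.(type) { // one dispatch per column
		case []int64:
			for _, v := range vals {
				sums[c] += v
			}
		}
	}
	return sums
}

func main() {
	rows := [][]interface{}{{int64(1), int64(2)}, {int64(2), int64(3)}, {int64(3), int64(4)}}
	cols := []interface{}{[]int64{1, 2, 3}, []int64{2, 3, 4}}
	fmt.Println(sumRowAtATime(rows))    // [6 9]
	fmt.Println(sumColumnAtATime(cols)) // [6 9]
}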
First 2 commits are from #31353.
Release note: None