Add support for projection pushdown into struct fields #14750
Merged
…_ids, which can refer to sub-columns
…artially use this in the row group scan
Mytherin added a commit that referenced this pull request on Nov 15, 2024
Follow-up from #14750 - this PR extends the Parquet reader and the MultiFileReader to work with `ColumnIndex` - allowing pushdown into struct fields.

Internally the way this works in the reader is that the `StructColumnReader` can now leave certain child readers on `NULL`. Child readers that are left on `NULL` are not scanned - they instead emit a constant `NULL` value.

### Benchmarks

Below is a benchmark running TPC-H Q01 over SF10. For the struct case, we store the entire rows in a struct field, and then extract it using a view, e.g.:

```sql
CALL dbgen(sf=10, suffix='_normalized');
COPY (SELECT lineitem_normalized AS row_val FROM lineitem_normalized) TO 'lineitem_struct.parquet';
CREATE VIEW lineitem AS SELECT UNNEST(row_val) FROM 'lineitem_struct.parquet';
```

We can see even more significant speed-ups than when using our native storage - primarily because reading the (unnecessary) string columns in Parquet format is significantly slower than when using our native format.

| Query | Struct (v1.1) | Struct (new) | Regular |
|-------|---------------|--------------|---------|
| Q01   | 1.23s         | 0.38s        | 0.35s   |

### Explain

This PR also adds support to the `EXPLAIN` output for showing which sub-columns are selected in the projection list of a scan, for example:

```sql
explain select struct_val.l_orderkey, struct_val.l_partkey from lineitem_struct.parquet;

┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│ struct_extract(struct_val,│
│       'l_orderkey')       │
│ struct_extract(struct_val,│
│        'l_partkey')       │
│                           │
│        ~60175 Rows        │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│        PARQUET_SCAN       │
│    ────────────────────   │
│         Function:         │
│        PARQUET_SCAN       │
│                           │
│        Projections:       │
│   struct_val.l_orderkey   │
│    struct_val.l_partkey   │
│                           │
│        ~60175 Rows        │
└───────────────────────────┘
```
Flogex added a commit to Flogex/duckdb that referenced this pull request on Nov 25, 2024
Flogex added a commit to motherduckdb/duckdb-delta that referenced this pull request on Nov 25, 2024
Incorporates changes from duckdb/duckdb#14750. There is no official patch for this in the DuckDB repo yet.
Mytherin added a commit that referenced this pull request on Dec 4, 2024
After @Mytherin did most of the heavy lifting in #14750, I was able to implement struct projection pushdown for JSON reads. Although we still need to parse the entire JSON, we do not need to convert it to DuckDB Vectors, which saves a lot of time.

Here are the results (not very scientific - some noise from my laptop doing other stuff) when reading TPC-H SF1 directly from JSON files as a struct, without struct projection pushdown (Before) and with (This PR). As we can see, all queries are now faster. The difference is more pronounced for some queries, e.g., Q3, than it is for others.

| Query | Before | This PR |
|:-|-:|-:|
| 1 | 0.665 | 0.478 |
| 2 | 0.234 | 0.183 |
| 3 | 0.813 | 0.209 |
| 4 | 0.774 | 0.601 |
| 5 | 0.890 | 0.582 |
| 6 | 0.617 | 0.425 |
| 7 | 0.807 | 0.433 |
| 8 | 0.898 | 0.588 |
| 9 | 0.970 | 0.616 |
| 10 | 0.808 | 0.551 |
| 11 | 0.179 | 0.150 |
| 12 | 0.785 | 0.603 |
| 13 | 0.258 | 0.208 |
| 14 | 0.702 | 0.478 |
| 15 | 0.655 | 0.449 |
| 16 | 0.200 | 0.170 |
| 17 | 1.372 | 0.848 |
| 18 | 1.457 | 0.903 |
| 19 | 0.703 | 0.482 |
| 20 | 0.767 | 0.525 |
| 21 | 2.143 | 1.495 |
| 22 | 0.319 | 0.240 |
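As a rough sketch of what such struct-style JSON reads look like (the file name and layout here are hypothetical; `read_json_auto` is DuckDB's schema-inferring JSON reader):

```sql
-- Hypothetical file: each line holds a lineitem row nested
-- under a single struct key 'row_val'.
SELECT row_val.l_orderkey, row_val.l_extendedprice
FROM read_json_auto('lineitem_struct.json');
-- With struct projection pushdown, only the referenced fields are
-- converted to DuckDB vectors; the remaining fields are parsed
-- but skipped.
```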
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request on Dec 21, 2024
Add support for projection pushdown into struct fields (duckdb/duckdb#14750)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request on Dec 21, 2024
Add support for projection pushdown into struct fields (duckdb/duckdb#14750) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
This PR adds support for projection pushdown into struct fields, allowing scans to read only the parts of structs that are required for the query. The way this is supported is that there is a new class - `ColumnIndex` - that supports recursively qualifying which columns are needed for the query. For example, if we are querying a struct `STRUCT(a INT, b INT)` and we only need to access the second struct element (`b`), the child index would have `index = 1`. The child `ColumnIndex`es are set by the `RemoveUnusedColumns` optimizer, which looks for calls to `struct_extract` or `array_extract` and figures out which child indexes are required to complete the query.

### Backwards Compatibility
The vectors that are returned by the scans are still in the same form - i.e. the scan still returns a `STRUCT(a INT, b INT)`. The only difference is that we don't scan the sub-columns we don't need - i.e. the column `a INT` is filled with a constant `NULL` value. In that way, scanning the sub-columns we don't need is actually optional. This allows the struct projection pushdown to be fully backwards compatible.

The table function receives the column indexes it needs as input in the `TableFunctionInitInput`. The old `column_ids` are still present as well, and can still be used by existing table functions.

### Benchmarks
Below are some benchmarks, running various TPC-H queries over SF10. For the `struct` case, we store the entire rows in a struct field, and then extract them using a view. As we can see, with projection pushdown the performance is mostly the same between a regular table scan and scanning unnested struct columns. There are still some differences in the generated query plans (a different join order in Q09), most likely caused by some missing statistics or by statistics not being propagated correctly; this is perhaps to be investigated in the future.
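The struct-wrapping view setup can be sketched as follows; this mirrors the setup shown in the follow-up Parquet commit, adapted to a native table since this PR only covers native scans (the `_normalized` names are taken from that commit):

```sql
CALL dbgen(sf=10, suffix='_normalized');
CREATE TABLE lineitem_struct AS
    SELECT lineitem_normalized AS row_val FROM lineitem_normalized;
CREATE VIEW lineitem AS
    SELECT UNNEST(row_val) FROM lineitem_struct;
```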
### Future Work

This PR only implements struct projection pushdown for DuckDB's native table scans - it still needs to be implemented for the Parquet reader.
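To make the `RemoveUnusedColumns` behavior described above concrete, here is a minimal sketch (the table and data are hypothetical, chosen only to illustrate the rewrite):

```sql
-- Hypothetical table: a single struct column with two fields.
CREATE TABLE t AS SELECT {'a': 1, 'b': 2} AS s;

-- Only field 'b' is referenced. The binder turns s.b into
-- struct_extract(s, 'b'), and the RemoveUnusedColumns optimizer
-- derives a child ColumnIndex with index = 1 for column s, so
-- field 'a' is never scanned (it is filled with a constant NULL
-- internally, invisible in the query result).
SELECT s.b FROM t;
```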