Add support for projection pushdown into struct fields #14750

Merged: 23 commits merged into duckdb:main on Nov 8, 2024

Conversation

@Mytherin (Collaborator) commented Nov 7, 2024

This PR adds support for projection pushdown into struct fields, allowing scans to read only the parts of a struct that are required for the query. This is supported through a new class, `ColumnIndex`, which recursively qualifies which (sub-)columns are needed for the query:

```cpp
struct ColumnIndex {
	idx_t index;
	vector<ColumnIndex> child_indexes;
};
```

For example, if we are querying a struct `STRUCT(a INT, b INT)` and only need to access the second struct element (`b`), the child index would have `index = 1`.

The child ColumnIndexes are set by the RemoveUnusedColumns optimizer - which looks for calls to struct_extract or array_extract and figures out which child indexes are required to complete the query.

Backwards Compatibility

The vectors that are returned by the scans are still in the same form - i.e. the scan still returns a STRUCT(a INT, b INT). The only difference is that we don't scan the sub-columns we don't need - i.e. the column a INT is filled with a constant NULL value. In that way, scanning the sub-columns we don't need is actually optional. This allows for the struct projection pushdown to be fully backwards compatible.

The table function receives the column indexes it needs as input in the TableFunctionInitInput. The old column_ids are still present as well, which can still be used by existing table functions.

```cpp
struct TableFunctionInitInput {
	vector<column_t> column_ids;
	vector<ColumnIndex> column_indexes;
};
```

Benchmarks

Below are some benchmarks, running various TPC-H queries over SF10. For the struct case, we store the entire rows in a struct field, and then extract it using a view, e.g.:

```sql
CALL dbgen(sf=10, suffix='_normalized');
CREATE TABLE lineitem_struct AS SELECT lineitem_normalized AS row_val FROM lineitem_normalized;
CREATE VIEW lineitem AS SELECT UNNEST(row_val) FROM lineitem_struct;
```

| Query | Struct (v1.1) | Struct (new) | Regular |
|-------|---------------|--------------|---------|
| Q01   | 0.33s         | 0.15s        | 0.14s   |
| Q06   | 0.26s         | 0.07s        | 0.04s   |
| Q09   | 1.17s         | 0.62s        | 0.2s    |

As we can see, with the projection pushdown the performance of scanning unnested struct columns is mostly on par with a regular table scan. There are still some differences in the generated query plans (e.g. a different join order in Q09), most likely caused by missing statistics or statistics not being propagated correctly; this is to be investigated in the future.

Future Work

This PR only implements struct projection pushdown for DuckDB's native table scans - it still needs to be implemented for the parquet reader.

@Mytherin Mytherin merged commit da7bc76 into duckdb:main Nov 8, 2024
42 checks passed
Mytherin added a commit that referenced this pull request Nov 15, 2024
Follow-up from #14750 - this PR
extends the Parquet reader and the MultiFileReader to work with
`ColumnIndex` - allowing pushdown into struct fields. Internally the way
this works in the reader is that the `StructColumnReader` can now leave
certain child readers on `NULL`. Child readers that are left on `NULL`
are not scanned - they instead emit a constant `NULL` value.


### Benchmarks

Below is a benchmark running TPC-H Q01 over SF10. For the struct case,
we store the entire rows in a struct field, and then extract it using a
view, e.g.:

```sql
CALL dbgen(sf=10, suffix='_normalized');
COPY (SELECT lineitem_normalized AS row_val FROM lineitem_normalized) TO 'lineitem_struct.parquet';
CREATE VIEW lineitem AS SELECT UNNEST(row_val) FROM 'lineitem_struct.parquet';
```

We can see even more significant speed-ups than when using our native
storage - primarily because reading the (unnecessary) string columns in
Parquet format is significantly slower than when using our native
format.

| Query | Struct (v1.1) | Struct (new) | Regular |
|-------|---------------|--------------|---------|
| Q01   | 1.23s         | 0.38s        | 0.35s   |


### Explain

This PR also adds support to the `EXPLAIN` for showing which sub-columns
are selected in the projection list of a scan, for example:

```sql
EXPLAIN SELECT struct_val.l_orderkey, struct_val.l_partkey FROM 'lineitem_struct.parquet';

┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│ struct_extract(struct_val,│
│        'l_orderkey')      │
│ struct_extract(struct_val,│
│        'l_partkey')       │
│                           │
│        ~60175 Rows        │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       PARQUET_SCAN        │
│    ────────────────────   │
│         Function:         │
│        PARQUET_SCAN       │
│                           │
│        Projections:       │
│   struct_val.l_orderkey   │
│    struct_val.l_partkey   │
│                           │
│        ~60175 Rows        │
└───────────────────────────┘

```
Flogex added a commit to Flogex/duckdb that referenced this pull request Nov 25, 2024
Flogex added a commit to motherduckdb/duckdb-delta that referenced this pull request Nov 25, 2024
Incorporates changes from duckdb/duckdb/pull/14750. There is no official patch for this in the DuckDB repo yet.
Mytherin added a commit that referenced this pull request Dec 4, 2024
After @Mytherin did most of the heavy lifting in
#14750, I was able to implement
struct projection pushdown for JSON reads.

Although we still need to parse the entire JSON, we do not need to
convert it to DuckDB Vectors, which saves a lot of time. Here are the
results (not very scientific - some noise from my laptop doing other
stuff) when reading TPC-H SF1 directly from JSON files as a struct,
without struct projection pushdown (Before) and with (This PR).

As we can see, all queries are now faster. The difference is more
pronounced for some queries, e.g., Q3, than it is for others.

| Query | Before | This PR |
|:-|-:|-:|
| 1 | 0.665 | 0.478 |
| 2 | 0.234 | 0.183 |
| 3 | 0.813 | 0.209 |
| 4 | 0.774 | 0.601 |
| 5 | 0.890 | 0.582 |
| 6 | 0.617 | 0.425 |
| 7 | 0.807 | 0.433 |
| 8 | 0.898 | 0.588 |
| 9 | 0.970 | 0.616 |
| 10 | 0.808 | 0.551 |
| 11 | 0.179 | 0.150 |
| 12 | 0.785 | 0.603 |
| 13 | 0.258 | 0.208 |
| 14 | 0.702 | 0.478 |
| 15 | 0.655 | 0.449 |
| 16 | 0.200 | 0.170 |
| 17 | 1.372 | 0.848 |
| 18 | 1.457 | 0.903 |
| 19 | 0.703 | 0.482 |
| 20 | 0.767 | 0.525 |
| 21 | 2.143 | 1.495 |
| 22 | 0.319 | 0.240 |
@Mytherin Mytherin deleted the structpushdown branch December 8, 2024 06:51
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Dec 21, 2024
Add support for projection pushdown into struct fields (duckdb/duckdb#14750)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Dec 21, 2024
Add support for projection pushdown into struct fields (duckdb/duckdb#14750)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>