Currently ClickBench Q23 is a major outliner due to the fact that datafusion doesn't do late materialization of projected columns.
SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
A query like this needs to scan all the columns only after all the redundant rows are prunned from output.
Some numbers that I was able to get locally showing this regression in play.
| Engine |
Query time |
DataFusion 54 (SELECT *) |
~4.0s avg |
DataFusion 54 (SELECT "EventTime", "URL") |
~0.9s |
SortPreservingMergeExec (limit 10)
└─ SortExec TopK (EventTime), DynamicFilter [ empty ]
└─ ProjectionExec (105 columns)
└─ FilterExec (URL LIKE '%google%')
└─ DataSourceExec (projection=[105 cols], 111 files)
predicate: URL LIKE %google% AND DynamicFilter [ empty ]
EXPLAIN ANALYZE metrics (SELECT *)
DataSourceExec
| Metric |
Value |
projection |
105 columns |
output_rows |
97.17 M |
output_bytes |
137.8 GB |
row_groups_pruned_statistics |
325 → 325 matched |
row_groups_pruned_bloom_filter |
325 → 325 matched |
page_index_pages_pruned |
0 |
What duckdb does in this scenario is that it materialises the projection rows that are not needed for any computation or maybe needed in later steps as late as possilble.
But this late materialization comes at the cost of requiring a self join later on to get the other columns back, and got self join's we need row tagging.
In datafusion we can tag a row uniquely with file_row_index(), input_file_name().
But then if rows are not dynamically filtered on the basis of above two columns it will again lead to a full table scan for probe side of the join.
Currently ClickBench Q23 is a major outliner due to the fact that datafusion doesn't do late materialization of projected columns.
A query like this needs to scan all the columns only after all the redundant rows are prunned from output.
Some numbers that I was able to get locally showing this regression in play.
SELECT *)SELECT "EventTime", "URL")EXPLAIN ANALYZEmetrics (SELECT *)DataSourceExec
projectionoutput_rowsoutput_bytesrow_groups_pruned_statisticsrow_groups_pruned_bloom_filterpage_index_pages_prunedWhat duckdb does in this scenario is that it materialises the projection rows that are not needed for any computation or maybe needed in later steps as late as possilble.
But this late materialization comes at the cost of requiring a self join later on to get the other columns back, and got self join's we need row tagging.
In datafusion we can tag a row uniquely with
file_row_index(),input_file_name().But then if rows are not dynamically filtered on the basis of above two columns it will again lead to a full table scan for probe side of the join.