Late Materialization Optimizer #15692
…sformed plan is maintained
…n running the optimizer multiple times over the same path
…for complex expressions in the Top-N)
@Mytherin I tried your PR against the following and am not seeing a speed-up. I then ran this query: and it takes 1.477 s, but when I run the transformed version manually: I see it takes 155 ms. I think I have your PR compiled correctly, as you can see the
That's because of the limit size. The late materialization optimizer only triggers when the limit is
Minor, and this can also be done in a subsequent PR: there are 3 warnings turned errors in the amalgamation run (OSX Debug):
…e the threshold at which late materialization is triggered (#15697)

Follow-up from #15692

This PR adds the `late_materialization_max_rows` setting that allows you to configure the threshold at which late materialization is triggered. The default value is `50`.

Example usage:

```sql
SET late_materialization_max_rows=1000;
explain SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 1000;
```

The exact best setting is hard to determine. Essentially, the row-id pushdown has two components: (1) the OR filter pushdown, which is done for up to `dynamic_or_filter_threshold` rows (defaults to 50), and (2) the min-max filter pushdown. The row-id rewrite generally provides performance improvements for up to `dynamic_or_filter_threshold` rows, which is why we select that as the default. Beyond that, it depends on the locality of the rows. If the min-max filter on row-ids is selective (i.e. the rows we select are close together physically in the table), the row-id rewrite is effective. If the rows are spread out, the rewrite can worsen performance.

CC @abramk
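To make the locality point above concrete, here is a rough sketch in plain Python. It is not DuckDB's implementation: the helper `min_max_range_fraction` is a hypothetical name, and the model (the scan still has to cover every row between the smallest and largest selected row-id) is a simplifying assumption.

```python
def min_max_range_fraction(row_ids, table_rows):
    """Fraction of the table that a [min, max] row-id filter still covers.

    Hypothetical helper for illustration only; it assumes a scan must read
    every row between the smallest and largest selected row-id.
    """
    if not row_ids:
        return 0.0
    span = max(row_ids) - min(row_ids) + 1
    return span / table_rows


TABLE_ROWS = 1_000_000

# 1000 selected rows that sit next to each other in the table.
clustered = list(range(5_000, 6_000))
# 1000 selected rows spread evenly over the whole table.
scattered = list(range(0, TABLE_ROWS, 1_000))

clustered_fraction = min_max_range_fraction(clustered, TABLE_ROWS)
scattered_fraction = min_max_range_fraction(scattered, TABLE_ROWS)

print(f"clustered: {clustered_fraction:.4%} of the table in range")
print(f"scattered: {scattered_fraction:.4%} of the table in range")
```

Under this model, the clustered selection leaves only a tiny slice of the table inside the min-max range, while the scattered selection leaves almost the entire table in range, which is the case where the rewrite adds a second scan without pruning much.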
Late Materialization Optimizer (duckdb/duckdb#15692)
Clean up temporary test directory in `run_tests_one_by_one.py` even if test segfaults (duckdb/duckdb#15688)
Reduce test size so CI is less likely to fail (duckdb/duckdb#15689)
Hey! Wondering if this optimization applies to non-DuckDB tables, e.g., a Delta Lake table. Thanks!
…(*)` directly in the multi file reader (#17325)

This PR generalizes the late materialization optimizer introduced in #15692 - allowing it to be used for the Parquet reader. In particular, the `TableFunction` is extended with an extra callback that allows specifying the relevant row-id columns:

```cpp
typedef vector<column_t> (*table_function_get_row_id_columns)(ClientContext &context,
                                                              optional_ptr<FunctionData> bind_data);
```

This is then used by the Parquet reader to specify the two row-id columns: `file_index` (#17144) and `file_row_number` (#16979). Top-N, sample and limit/offset queries are then transformed into a join on the relevant row-id columns. For example:

```sql
SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;
-- becomes
SELECT * FROM lineitem.parquet WHERE (file_index, file_row_number) IN (
    SELECT file_index, file_row_number FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5)
ORDER BY l_extendedprice DESC;
```

### Performance

```sql
SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;
```

| v1.2.1 | main   | new    |
|--------|--------|--------|
| 0.19s  | 0.14s  | 0.06s  |

```sql
SELECT * FROM lineitem.parquet ORDER BY l_orderkey DESC LIMIT 5;
```

| v1.2.1 | main  | new   |
|--------|-------|-------|
| 0.73s  | 0.53s | 0.06s |

```sql
SELECT * FROM lineitem.parquet LIMIT 1000000 OFFSET 10000000;
```

| v1.2.1 | main  | new   |
|--------|-------|-------|
| 1.6s   | 1.2s  | 0.14s |

### Refactor

I've also moved the `ParquetMultiFileInfo` to a separate file as part of this PR - which is most of the changes here.
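The shape of the row-id rewrite can be checked on a small scale. The sketch below is an illustration under stated assumptions, not DuckDB code: it uses Python's built-in `sqlite3` module, with SQLite's implicit `rowid` standing in for the `(file_index, file_row_number)` pair. The table contents are made up, and an `l_orderkey` tie-breaker is added to the `ORDER BY` so both forms return a deterministic result.

```python
import sqlite3

# In-memory SQLite table standing in for lineitem (assumption: SQLite's
# implicit rowid plays the role of the row-id columns).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL, l_comment TEXT)"
)
rows = [(i, float((i * 37) % 101), f"comment {i}") for i in range(1, 201)]
con.executemany("INSERT INTO lineitem VALUES (?, ?, ?)", rows)

# Original Top-N query: every column is carried through the sort.
direct = con.execute(
    """
    SELECT * FROM lineitem
    ORDER BY l_extendedprice DESC, l_orderkey
    LIMIT 5
    """
).fetchall()

# Late-materialized form: find the winning row-ids using only the sort
# columns, then fetch the full rows for just those row-ids.
rewritten = con.execute(
    """
    SELECT * FROM lineitem
    WHERE rowid IN (
        SELECT rowid FROM lineitem
        ORDER BY l_extendedprice DESC, l_orderkey
        LIMIT 5)
    ORDER BY l_extendedprice DESC, l_orderkey
    """
).fetchall()

print(direct == rewritten)
```

Both queries return the same five rows. The rewritten form only needs the sort columns and the row-id while selecting the winners, and touches the wide columns (here `l_comment`) for the final five rows only.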
This PR adds the Late Materialization optimizer that enables late materialization for certain queries - in particular top-n (`ORDER BY .. LIMIT ..`), limit + offset, and sample queries. The optimization piggy-backs off of the row-id filter pushdown introduced in #15020 - and does the row-id rewrites mentioned there automatically.

Rewrites
Here are some examples of rewrites:
Top-N
Limit + Offset
Performance