Automatically materialize CTEs #12290

Merged on Jul 11, 2024 (34 commits)

Conversation

lnkuiper
Contributor

This PR adds functionality to inspect query plans before binding, to find out whether CTEs should be materialized. Currently, DuckDB never materializes CTEs automatically. If a CTE is queried multiple times, it is recomputed each time, which can be very expensive for large, complex CTEs.

The heuristic introduced in this PR is pretty simple: if the CTE performs a (grouped) aggregation and is queried more than once, it should be materialized. This heuristic can be improved in a later PR to also materialize when, e.g., the CTE contains a window function.
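For readers who want the rule spelled out, here is a minimal sketch of the heuristic as described above. It is not the code from this PR: `CteInfo`, `should_materialize`, and the example CTE names are hypothetical, and the sketch assumes the aggregation flag and the reference count have already been extracted from the query.

```python
from dataclasses import dataclass


@dataclass
class CteInfo:
    # Hypothetical summary of one CTE; not a DuckDB type.
    name: str
    has_aggregate: bool   # the CTE body performs a (grouped) aggregation
    reference_count: int  # how many times the rest of the query reads the CTE


def should_materialize(cte: CteInfo) -> bool:
    """Materialize only aggregating CTEs that are read more than once."""
    return cte.has_aggregate and cte.reference_count > 1


ctes = [
    CteInfo("revenue", has_aggregate=True, reference_count=2),
    CteInfo("recent_orders", has_aggregate=False, reference_count=3),
    CteInfo("totals", has_aggregate=True, reference_count=1),
]
decisions = {cte.name: should_materialize(cte) for cte in ctes}
print(decisions)
```

Under these assumptions only `revenue` qualifies: `recent_orders` has no aggregation, and `totals` is read only once, so recomputation is not a concern for it.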

Pipeline sharing, which is on our roadmap, serves a similar purpose and is more generally useful, but it is much more difficult to implement, so it will have to wait.

To make sure query plans do not regress, I've improved CTE statistics in the join order optimizer.

Some performance improvement highlights below.

TPC-H

benchmark/tpch/sf1/q15.benchmark
Old timing: 0.046487
New timing: 0.024154

TPC-DS

benchmark/tpcds/sf1/q23.benchmark
Old timing: 0.655913
New timing: 0.288761

benchmark/tpcds/sf1/q24.benchmark
Old timing: 0.053736
New timing: 0.031949

benchmark/tpcds/sf1/q36.benchmark
Old timing: 0.141676
New timing: 0.061156

benchmark/tpcds/sf1/q57.benchmark
Old timing: 0.173906
New timing: 0.089459

benchmark/tpcds/sf1/q59.benchmark
Old timing: 0.187678
New timing: 0.097933

benchmark/tpcds/sf1/q64.benchmark
Old timing: 0.240837
New timing: 0.158492

@lnkuiper lnkuiper requested a review from Tmonster May 28, 2024 13:52
@kryonix
Contributor

kryonix commented May 28, 2024

Hi @lnkuiper, cool idea! I briefly discussed a similar (not yet existing) PR with @Mytherin a while back.

I've only glanced over the code, so I'm not sure whether this is actually an issue at the moment. But I think this optimization should not be applied when a CTE explicitly requests to be NOT MATERIALIZED, i.e., when CTEMaterialize::CTE_MATERIALIZE_NEVER is set.
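The precedence suggested here could look roughly like the sketch below. It is an illustration only: the `Hint` enum is a hypothetical stand-in for `CTEMaterialize`, of which only the `CTE_MATERIALIZE_NEVER` value is named in this thread; the `DEFAULT` and `ALWAYS` members are assumptions, as is the `decide` function.

```python
from enum import Enum, auto


class Hint(Enum):
    # Hypothetical stand-in for the materialization hint on a CTE.
    DEFAULT = auto()  # no hint given; the heuristic may decide (assumed)
    ALWAYS = auto()   # user asked for materialization (assumed)
    NEVER = auto()    # user asked for NOT MATERIALIZED


def decide(hint: Hint, has_aggregate: bool, reference_count: int) -> bool:
    """Explicit hints win; the heuristic only applies when no hint is given."""
    if hint is Hint.NEVER:
        return False
    if hint is Hint.ALWAYS:
        return True
    return has_aggregate and reference_count > 1


# An aggregating CTE that is read twice, under each possible hint.
results = {hint.name: decide(hint, True, 2) for hint in Hint}
print(results)
```

The point of the ordering is that an explicit NOT MATERIALIZED request is checked before the heuristic runs, so automatic materialization can never override it.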

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 28, 2024 14:42
@lnkuiper lnkuiper marked this pull request as ready for review May 28, 2024 14:43
Contributor

@Tmonster Tmonster left a comment

LGTM w.r.t. the join order optimizer. I had a couple of comments about some of the other logic.

I also don't quite understand the reference counting stuff, but I think that's just to get a count of the CTEs in a query plan, right?
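One possible reading of the reference counting, sketched on a toy plan: walk the tree and tally how often each CTE is read, so that "queried more than once" can be checked per CTE. The dictionary-based plan format, the operator names, and `count_cte_refs` are all hypothetical and say nothing about how the PR implements it.

```python
from collections import Counter


def count_cte_refs(node, counts=None):
    """Walk a nested plan and tally how often each CTE name is read."""
    if counts is None:
        counts = Counter()
    if node["op"] == "cte_ref":
        counts[node["name"]] += 1
    for child in node.get("children", []):
        count_cte_refs(child, counts)
    return counts


# Toy plan: "revenue" is read once directly and once below an aggregate.
plan = {"op": "join", "children": [
    {"op": "cte_ref", "name": "revenue"},
    {"op": "filter", "children": [
        {"op": "aggregate", "children": [
            {"op": "cte_ref", "name": "revenue"},
        ]},
    ]},
    {"op": "cte_ref", "name": "supplier_info"},
]}

ref_counts = count_cte_refs(plan)
print(dict(ref_counts))
```

In this toy plan `revenue` is counted twice and `supplier_info` once, so only `revenue` would pass the "queried more than once" half of the heuristic.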

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 29, 2024 10:39
@lnkuiper lnkuiper marked this pull request as ready for review May 29, 2024 13:47
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 29, 2024 14:39
@lnkuiper lnkuiper marked this pull request as ready for review May 29, 2024 14:39
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 30, 2024 09:04
@lnkuiper lnkuiper marked this pull request as ready for review May 30, 2024 09:14
@lnkuiper
Contributor Author

lnkuiper commented May 30, 2024

I've also added statistics to LOGICAL_DELIM_GET in the join order optimizer, so we can now reorder join plans with delim gets in them. This slightly improves TPC-H Q2.

EDIT: I've also implemented the feedback by Tom/Denis - thanks a lot!

@lnkuiper
Contributor Author

lnkuiper commented Jun 7, 2024

I've added benchmark/tpch/cte/auto_cte_materialization.benchmark, which is 1.5-2x faster than on the feature branch.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 7, 2024 12:20
@lnkuiper lnkuiper marked this pull request as ready for review June 7, 2024 12:33
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 10, 2024 12:21
@lnkuiper lnkuiper marked this pull request as ready for review June 10, 2024 12:21
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 11, 2024 13:25
@lnkuiper lnkuiper marked this pull request as ready for review June 11, 2024 13:26
@Mytherin Mytherin changed the base branch from feature to main June 21, 2024 12:37
@duckdb-draftbot duckdb-draftbot marked this pull request as draft July 5, 2024 06:31
@lnkuiper lnkuiper marked this pull request as ready for review July 5, 2024 06:31
@Mytherin
Collaborator

Thanks! LGTM - could you just solve the merge conflict?

@duckdb-draftbot duckdb-draftbot marked this pull request as draft July 10, 2024 11:15
@lnkuiper lnkuiper marked this pull request as ready for review July 10, 2024 11:27
@Mytherin Mytherin merged commit 6e0fc96 into duckdb:main Jul 11, 2024
42 checks passed
@Mytherin
Collaborator

Thanks!

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Jul 17, 2024
Merge pull request duckdb/duckdb#12290 from lnkuiper/auto_cte_materialize
@lnkuiper lnkuiper deleted the auto_cte_materialize branch August 14, 2024 07:22
4 participants