Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incremental models on postgres are only fast if all CTEs are ephemeral #335

Closed
1 of 5 tasks
TjrGithub opened this issue Aug 7, 2020 · 9 comments
Closed
1 of 5 tasks
Labels
wontfix This will not be worked on

Comments

@TjrGithub
Copy link

Describe the bug

When making a postgres-backed model incremental, you would expect the incremental loads of raw data to be faster. Instead, they get slower. However, once you set all the contribution CTEs (separate DBT files) to ephemeral, and make sure the CTEs are consumed in a row with one descendant only (i.e.: with a from b, b from c, c from d, d from source -- so that it can be inlined), you do get the performance benefits.
That means: Views are performance barriers that prevent the optimizer from pushing down the {{is_incremental}} stuff to the underlying queries from which the views are selected.

Maybe you could mention that in the documentation?

Steps To Reproduce

In as much detail as possible, please provide steps to reproduce the issue. Sample data that triggers the issue, example model code, etc is all very helpful here.

Expected behavior

A clear and concise description of what you expected to happen.

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.17.1
   latest version: 0.17.2

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - postgres: 0.17.1
  - redshift: 0.17.1
  - snowflake: 0.17.1
  - bigquery: 0.17.1

The operating system you're using:
$ uname -a
Linux 21d117632cc9 5.3.0-55-generic dbt-labs/dbt-core#49-Ubuntu SMP Thu May 21 12:47:19 UTC 2020 x86_64 GNU/Linux
i.e. the official docker container for python:latest

The output of python --version:
Python 3.8.4

Additional context

Add any other context about the problem here.

@jtcohen6
Copy link
Collaborator

jtcohen6 commented Aug 7, 2020

Hey @TjrGithub, you've got a good point. Postgres is, structurally, a very different databases from the analytics-forward data warehouses with which we most often work. I'm not surprised to hear that Postgres' optimizer fails to push down filters into the underlying views; older versions of Postgres (<12) treated CTEs as optimization fences, too. This was one of the biggest innovations of Redshift's optimizer, in the early days of analytic databases.

Maybe you could mention that in the documentation?

I think this is a great idea! I'm going to transfer this issue to our docs.getdbt.com repo, since I don't foresee a code change in this one. Would you be willing to contribute the additional context? I think a comment in this section may be appropriate.

@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Aug 7, 2020
@TjrGithub
Copy link
Author

I agree, that's where I was expecting this to be documented.

As an aside, what's also missing from that section is how to handle views. Do I put the is_incremental stuff in there, too? At least, that might be a way around the optimization fence problem. But it just doesn't feel right, from a functional programming point of view.

What kind of missing additional context to you have in mind?

@jtcohen6
Copy link
Collaborator

jtcohen6 commented Aug 9, 2020

The tricky thing about is_incremental() is that it checks four conditions, one of which is that the current model is materialized as incremental. For that reason, if you include is_incremental() in a view or ephemeral model, the condition will never resolve to true.

I think the larger context here is: Incremental modeling is an attempt to optimize database performance. As such it depends significantly on the database you're using, and it relies heavily on your understanding of what that database's optimizer will and won't be able to do.

@TjrGithub
Copy link
Author

TjrGithub commented Aug 21, 2020

Does this boil down to "if Postgres and incremental, then all models should be tables"?
Making all views ephemeral does not work in my testing - it completely unhinges the query optimizer, and what previously took 30 seconds now takes 5 hours.

@jtcohen6
Copy link
Collaborator

jtcohen6 commented Aug 21, 2020

I think that's right—if you're creating an incremental model on postgres, it should be selecting from (and filtering against) a table or a view, not an ephemeral model.

Out of curiosity, which version of Postgres are you running?

@TjrGithub
Copy link
Author

No, a view does not work. That is the point of this bug report.

Postgres 12.4 from https://hub.docker.com/_/postgres

@jtcohen6
Copy link
Collaborator

My mistake in writing that above, I think this is what I meant:

"If you're creating an incremental model on postgres, it should be selecting from (and filtering against) a table, not a view or an ephemeral model."

@runleonarun runleonarun added icebox For issues we're closing but will revisit at a future date if possible! dbt Core The changes proposed in this issue relate to dbt Core wontfix This will not be worked on and removed dbt Core The changes proposed in this issue relate to dbt Core icebox For issues we're closing but will revisit at a future date if possible! labels Mar 15, 2023
@runleonarun
Copy link
Collaborator

Closing this issue as won't fix. If you believe I have closed this in error, then let me know and I willput it on the icebox instead of closing completely.

@kyletl
Copy link

kyletl commented Mar 21, 2023

@runleonarun I ran into this issue a few weeks ago and had no guidance from the dbt docs on why using incremental models wasn't improving my model runs in postgres at all. The docs should be edited with this information, it feels critical to new dbt users who are working in postgres

nghi-ly pushed a commit that referenced this issue Feb 13, 2024
REPO SYNC - Public to Private
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

4 participants