Additional materialisation: Subquery #1248
Add "Subquery" materialisation
Bit of a spanner here for me, using Greenplum 4.x. There are two query planners in Greenplum: the legacy planner, which is very clunky, and the super (not so super) “optimizer”. The optimizer reverts to the legacy planner when things get too complicated. Unfortunately, neither of these can manage to rewrite CTEs sensibly. Take a query with some base models in CTEs and run it through the optimizer, and it will “revert” to the legacy planner; the result is very expensive and doesn’t run. If you copy and paste the CTEs into subqueries, the optimizer doesn’t revert and can do some amazing tricks!
So how difficult would it be to add another materialisation strategy “subquery”, which is like ephemeral except writes the dependent models into subqueries instead of CTEs?
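To make the difference concrete, here is a minimal Python sketch of the two renderings for a single dependency. It is not dbt's compiler code: the `{base}` placeholder, the helper names `render_as_cte` and `render_as_subquery`, and the table `raw.orders` are all assumptions made up for illustration.

```python
# Hypothetical illustration only; none of these names come from dbt itself.
BASE_SQL = "select id, amount from raw.orders"
MODEL_SQL = "select id from {base} where amount > 100"


def render_as_cte(model_sql: str, name: str, dep_sql: str) -> str:
    """Ephemeral-style rendering: the dependency becomes a CTE at the top."""
    body = model_sql.format(base=name)
    return f"with {name} as (\n  {dep_sql}\n)\n{body}"


def render_as_subquery(model_sql: str, name: str, dep_sql: str) -> str:
    """Proposed rendering: the dependency is inlined as a named subquery."""
    return model_sql.format(base=f"(\n  {dep_sql}\n) as {name}")


cte_sql = render_as_cte(MODEL_SQL, "base_orders", BASE_SQL)
subquery_sql = render_as_subquery(MODEL_SQL, "base_orders", BASE_SQL)

print(cte_sql)
print("---")
print(subquery_sql)
```

Both strings describe the same result set; the only difference is whether the planner sees a `with` clause or an inline subquery.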
(All of the above may or may not be an issue in GP5.x, but I won’t be able to test that until later in the year.)
Who will this benefit?
Users of databases whose query planners have mixed results with CTEs, e.g. Greenplum
Hey @jamesrguy - thanks for the feature request! As I understand it, Greenplum is based on Postgres, so this limitation makes sense to me. In Postgres, CTEs are "optimization fences": the optimizer can't push filters down past the CTE boundary. Sounds like something similar is happening in Greenplum here.
Ephemeral models are pretty tricky to handle in dbt. They're a special case of a materialization, and they're defined and implemented deep inside of the dbt compiler. I don't think it would be a good idea to add a second
If we did this, there would be two places where code would need to change:
One of the big challenges with CTEs is that they need to be applied recursively. I think dbt's compiler will just handle this, but I'm not totally certain. It's definitely worth looking into.
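The recursion is the part that is easy to get wrong, so here is a rough Python sketch of what "applied recursively" would mean for subqueries. It is only a toy: the `MODELS` dictionary, the regular expression that spots `{{ ref('...') }}`, and the `inline` function are assumptions for illustration and do not reflect how dbt's compiler actually resolves refs. It also does no cycle detection.

```python
import re

# Hypothetical ephemeral models; "{{ ref('x') }}" marks a dependency.
MODELS = {
    "raw_orders": "select * from warehouse.orders",
    "big_orders": "select * from {{ ref('raw_orders') }} where amount > 100",
    "report": "select count(*) from {{ ref('big_orders') }}",
}

# Matches {{ ref('model_name') }} and captures model_name.
REF = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")


def inline(name):
    """Return the SQL for `name` with every ref replaced by a named subquery."""

    def expand(match):
        dep = match.group(1)
        # Recurse so that the dependency's own refs are inlined first.
        return f"({inline(dep)}) as {dep}"

    return REF.sub(expand, MODELS[name])


nested_sql = inline("report")
print(nested_sql)
```

Each level of dependency adds one level of nesting, so a deep chain of ephemeral models would produce deeply nested SQL.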
Let me know if you buy this general approach. If you folks (or anyone else visiting this issue!) have the resources to try making a PR, I'd be super happy to help out!
One other consideration here upon thinking about this some more. When dbt injects ephemeral models, it does so once at the top of the file, then refers to that CTE by name throughout the query. If we're using subqueries here, we could run into name collisions if the same ephemeral model is referenced twice in a given select statement.
Since subqueries need to be named on Postgres (and I assume Greenplum), this would render out to
Ideally we'd number these subqueries, but that might be a little tricky depending on where the subquery is constructed.
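As a rough idea of what numbering could look like, here is a small Python sketch that gives each occurrence of the same ephemeral model its own alias. Everything in it is hypothetical: the `EPHEMERAL` dictionary, the `_1`, `_2` suffix scheme, and the `inline_numbered` function are assumptions for illustration, not anything that exists in dbt.

```python
import re
from collections import defaultdict

# Matches {{ ref('model_name') }} and captures model_name.
REF = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")

# Hypothetical ephemeral model that is referenced twice in one select.
EPHEMERAL = {"users": "select id from app.users"}
QUERY = "select count(*) from {{ ref('users') }} cross join {{ ref('users') }}"


def inline_numbered(sql, models):
    """Inline each ref as a subquery with a per-model running number."""
    counts = defaultdict(int)

    def expand(match):
        name = match.group(1)
        counts[name] += 1
        # The alias is the model name plus its occurrence count.
        return f"({models[name]}) as {name}_{counts[name]}"

    return REF.sub(expand, sql)


rendered = inline_numbered(QUERY, EPHEMERAL)
print(rendered)
```

The catch is that the model author cannot know the generated aliases in advance, so any column reference elsewhere in the select would have no stable name to point at.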
In the last instance there, we would always have to name the subqueries for them to be useful in the top part of the select, as they would need to be explicitly referenced...
I added some code to runtime.py and will test it out in the next few days:
and added get_ephemeral_type to utils. It seems to work!
Happy for more input if anyone has style tips, etc., in keeping with how things are done in dbt, and I will report back with some experience/testing.