New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional materialisation: Subquery #1248

Open
jamesrguy opened this Issue Jan 17, 2019 · 5 comments

Comments

Projects
None yet
3 participants
@jamesrguy
Copy link

jamesrguy commented Jan 17, 2019

Feature

Add "Subquery" materialisation

Feature description

Bit of a spanner here for me, using Greenplum 4.x. There are two query planners in Greenplum, one called the legacy planner which is very clunky, and the super (not so super) “optimizer”. The optimizer reverts to the legacy planner when things get to complicated. Unfortunately neither of these can manage to rewrite CTEs sensibly. Take a query with some base models in CTEs and run them through the optimizer and it will “revert” to the legacy planner and the result is very expensive and doesn’t run. If you copy and paste the CTEs into subqueries, the optimizer doesnt revert and can do some amazing tricks!

So how difficult would it be to add another materialisation strategy “subquery”, which is like ephemeral except writes the dependent models into subqueries instead of CTEs?

(All of the above may or may not be an issue in GP5.x, but I won’t be able to test that until later in the year.)

Who will this benefit?

Users of databases which have query planners that have mixed results using CTEs, eg Greenplum

@drewbanin

This comment has been minimized.

Copy link
Contributor

drewbanin commented Jan 17, 2019

Hey @jamesrguy - thanks for the feature request! As I understand it, Greenplum is based on Postgres, so this limitation makes sense to me. In Postgres, CTEs are "optimization fences" in which the optimizer can't push down filters past the CTE boundary. Sounds like something similar is happening in Greenplum here.

Ephemeral models are pretty tricky to handle in dbt. They're a special case of a materialization, and they're defined and implemented deep inside of the dbt compiler. I don't think it would be a good idea to add a second subquery materialization, but I do think it could be reasonable to make the implementation details of ephemeral models configurable. I'm not exactly sure how this config should be defined, but imagine something like ephemeral_type="subquery".

If we did this, there would be two places where code would need to change:

  1. When a ref is made to an ephemeral model, dbt should return the contents of the model in a subquery (rather than the name of a CTE)
    https://github.com/fishtown-analytics/dbt/blob/dev/stephen-girard/core/dbt/context/runtime.py#L51-L56

  2. When the ephemeral type is set to subquery, CTEs should not be injected into models
    https://github.com/fishtown-analytics/dbt/blob/dev/stephen-girard/core/dbt/contracts/graph/compiled.py#L178

One of the big challenges with CTEs is that they need to be applied recursively. I think dbt's compiler will just handle this, but I'm not totally certain. It's definitely worth looking into.

Let me know if you buy this general approach. If you folks (or anyone else visiting this issue!) has the resources to try making a PR, I'd be super happy to help out!

@jamesrguy

This comment has been minimized.

Copy link
Author

jamesrguy commented Jan 17, 2019

Thanks drew. That approach seems reasonable, but I'll take a look into the code to see what it might take to make it happen!

@drewbanin

This comment has been minimized.

Copy link
Contributor

drewbanin commented Jan 18, 2019

One other consideration here upon thinking about this some more. When dbt injects ephemeral models, it does so once at the top of the file, then refers to that CTE by name throughout the query. If we're using subqueries here, we could run into name collisions if the same ephemeral model is referenced twice in a given select statement.

eg:

select *
from {{ ref('ephemeral_model') }}
left join {{ ref('ephemeral_model') }} using (id)

Since subqueries need to be named on postgres (and i assume greenplum) this would render out to

select *
from (select * from ...) as sbq_ephemeral_model
left join (select * from ...) as sbq_ephemeral_model using (id)

Ideally we'd number these subqueries, but that might be a little tricky depending on where the subquery is constructed.

@cmcarthur cmcarthur added this to the Wilt Chamberlain milestone Jan 23, 2019

@jamesrguy

This comment has been minimized.

Copy link
Author

jamesrguy commented Feb 6, 2019

In the last instance there, we would always have to name the the subqueries for them to be useful in the top part of the select, as they would need to be explicitly referenced...

I added some code to runtime.py and will test it out in the next few days:

        is_ephemeral = (get_materialization(target_model) == 'ephemeral')
        if is_ephemeral:
            is_subquery = (get_ephemeral_type(target_model) == 'subquery')
            if is_subquery:
                return '(\n{}\n)'.format(target_model.compiled_sql)
            else:
                model.set_cte(target_model_id, None)
                return adapter.Relation.create(
                   type=adapter.Relation.CTE,
                   identifier=add_ephemeral_model_prefix(
                        target_model_name)).quote(identifier=False)

and added get_ephemeral_type to utils. It seems to work!

Happy for more input if anyone has style tips, etc, in keeping with how things are done in DBT, and will report back some experience/testing.

@drewbanin

This comment has been minimized.

Copy link
Contributor

drewbanin commented Feb 6, 2019

cool @jamesrguy! Sounds like the next step is to make a Pull Request for your change :)

We can talk about the specifics of the code there!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment