Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2068] [Feature] Let's add Materialized View as a materialization, finally #6911

Closed
Fleid opened this issue Feb 8, 2023 · 17 comments · Fixed by #7211, dbt-labs/dbt-redshift#386, #7239, dbt-labs/dbt-redshift#387 or #7244
Assignees
Labels
enhancement New feature or request materialized_views Team:Adapters Issues designated for the adapter area of the code
Milestone

Comments

@Fleid
Copy link
Contributor

Fleid commented Feb 8, 2023

materialized: materialized_view

Context : The time has come, let's do this.

I'll elaborate on why we're doing this now and not before in a public post in the near future. The short version is that a new generation of Materialized Views (MVs) is emerging, powered by streaming technologies, and we want to support them. This also allows dbt to retroactively support some of the current implementations in a cleaner way. Win, win, win.

This specific issue is about building the foundation. It is about re-purposing the work already done in the experimental package for a default materialization, Postgres, Redshift and BigQuery implementations, and make it official.

Please note that Snowflake is not in the list above. For dbt-snowflake, we'll skip their current MVs and support directly Dynamic Tables when they're ready (we already have an issue for it!).

A new materialization

So yes, we need a new materialization with a default implementation in dbt core, and 3 dispatch implementations for Postgres, Redshift and BigQuery.

I like the ergonomics designed in the experimental package...

  • [edit 1]I see no reason for us to change them for now (well... see the alternative option below).
  • [edit 2]I re-used them but added on_definition_change (now split into on_configuration_change and on_schema_change for easier onboarding from incremental models) to offer a similar behavior to what can be found with on_schema_change for incremental models. Here it's not only the schema that we may need to change, but most importantly the configuration options like auto_refresh mode, or the refresh_interval_minutes. These options can be modified via an ALTER statement that doesn't require the full state of the MV to be recomputed (not that it's limited to that, but it allows those too).
  • ... I re-used them but added on_configuration_change to mimic the behavior of on_schema_change from incremental models, focused on configs. A normal run of dbt on MVs should only be about "moving data", and not deploying code changes. So either refresh, or do nothing. We can still try to pass along configuration changes (indexes, labels, warehouses...) because there's a chance we can do that without triggering a full refresh of the view (which is what we are trying to avoid).

We also want to provide that capability for the logic of the view (SQL statement). But at this point, we can't do it cleanly. We can't compare the logic that is deployed to the one we have in the model (I dread comparing SQL statements as strings), and no platform in scope supports CREATE OR REPLACE. So we can't be clever, and we can't be brutal expecting the platform to be clever.

We also can't limit the scope to on_schema_change because in MVs there is only room for 1 logic to be applied at a point in time, contrary to incremental models where the previous logic can live with the new one. We will follow up on that later.


Default implementation

Note how the current expectation is that base MVs require a refresh.

  • Configurations : none
  • Lifecycle (dbt run) :
    • If necessary backup (exists with different materialization OR exists and full-refresh)
    • If necessary create (doesn't exist) Else
      • If the options of the object in the database are different from the current model, depending on on_configuration_change
        • ignore : we can't use ignore here, as on_schema_change=ignore for incremental models has set weird expectations on the word. Let's use skip instead for now.
        • skip : Do nothing, aka stage changes until the next full refresh. (Documentation follow-up: this can create uncertainty where there is a delta between the model definition and what's in the database - same as on_schema_change=ignore for incremental models)
        • fail : Fail the run
        • apply : [default] Apply the changes (implementation detail: let's try to favor commands that do not trigger full recomputes (ALTER, CREATE OR REPLACE))
      • Refresh when necessary (the object was not re-created in a step above)
  • Lifecycle (dbt run --full-refresh) :
    • If necessary backup (exists with different materialization OR exists and full-refresh)
    • If necessary drop
    • Create

Described more simply, for a MV already created:

  • dbt run --full-refresh : we drop and create
    • Here, in order we should try to:
      1. CREATE OR REPLACE - if CREATE OR REPLACE is supported
      2. CREATE, DROP, ALTER NAME - if rename is supported (Postgres, Redshift)
      3. DROP, CREATE - if none of the above (BQ)
  • dbt run : we refresh (or do nothing if auto_refresh)
    • Except if you set on_configuration_change to something other that skip, which is the default, then we check the config in database to the one in the model, and if there are diffs, we apply or fail
    • For apply, in order we should try to - this choice need to be made for each configuration (some will support ALTER, some not)
      1. ALTER in place
      2. else apply the full refresh order

Postgres

Postgres : Materialized Views

⚠️ Schema management requires some attention


Redshift

Redshift Materialized Views

⚠️ Schema management requires some attention

  • Configurations:
    • auto_refresh : YES/NO, optional (database level default no) (see lifecycle for effect)
    • From models : sort / sort_type / dist
  • Lifecycle : default except Refresh is bypassed if auto_refresh
    • on_configuration_change.apply : auto_refresh via ALTER, rest via DROP/CREATE

BigQuery

BigQuery Materialized Views

  • Configurations:
    • enable_refresh : BOOLEAN, optional (database level default true) (see lifecycle for effect)
    • refresh_interval_minutes : integer, optional (database level default at 30), only valid for enable_refresh true or default
    • max_staleness : time before bypassing the MV and running the statement instead (in preview but I like it!)
    • From models: Partition / Cluster, TTL, Labels
  • Lifecycle : default except Refresh is bypassed if auto_refresh
    • on_configuration_change.apply : enable_refresh/refresh_interval_minutes/max_staleness via ALTER, rest via DROP/CREATE

Backward compatibility

People using materialized: materialized_view from the experimental package in dbt core 1.4 and before, should not be broken by the addition of a default and adapter level materialized_view materializations. The chain of precedence in macros will preserve existing behavior. The package will need to be uninstalled to benefit from the new implementation.




Additional details, not required for implementation

Risks

  1. Most users interested by this topic are already using the experimental package. We need to make sure that this implementation won't break them when we roll it out. I'm guessing the chain of inheritance should work in our favor here, but we need to be sure.

  2. Same thing for community/vendor supported adapters already implementing a version of MVs (Materialize for example).

  3. Information collection will be tricky across the board. MVs are usually not accessible via the default information schemas / catalogues

  4. Cascading drops in full-refresh are tricky. If an upstream table used in a MV is dropped/altered, the MV may end up in a funky state. See comments in the experimental packages. We may need to revisit some of the behaviors or re-calibrate expectations.

Alternative/Options

default auto_refresh = true

I'm seriously wondering if the default ergonomics should not be more aggressively geared towards auto refreshing:
[Edit following comments below] Not a good idea

  • Default
    • Configurations
      • auto_refresh : optional, default true
    • Lifecycle :
      • DDL :
        • If necessary backup (exists with different materialization OR exists and full-refresh)
        • If necessary create (doesn't exist)
  • Postgres
    • Configurations:
      • auto_refresh : mandatory no
    • Lifecycle : default + else refresh
  • Redshift
    • Configurations:
      • auto_refresh : ⚠️ default option is reversed (still true, contrary to the engine default)
  • BigQuery
    • Configurations:
      • auto_refresh/enable_refresh : default
      • refresh_interval_minutes
      • max_staleness

Who will this benefit?

A lot of people:

Anything else?

Historical perspective on the topic : https://www.youtube.com/watch?v=awj-5aYXZnc

@Fleid Fleid added enhancement New feature or request triage labels Feb 8, 2023
@Fleid Fleid added this to the v1.5 milestone Feb 8, 2023
@Fleid Fleid self-assigned this Feb 8, 2023
@github-actions github-actions bot changed the title [Feature] Let's add Materialized View as a materialization, finally [CT-2068] [Feature] Let's add Materialized View as a materialization, finally Feb 8, 2023
@Fleid Fleid added Team:Adapters Issues designated for the adapter area of the code and removed triage labels Feb 8, 2023
@Fleid
Copy link
Contributor Author

Fleid commented Feb 8, 2023

@jtcohen6, @Kayrnt and everyone else, I welcome your feedback on the alternative ergonomics ;)

@amychen1776
Copy link

amychen1776 commented Feb 9, 2023

I agree with this approach of not making auto-refresh the default. It will help users be more mindful of managing to compute costs (i.e making sure their development environments are running with intent). While I don't find using MVs for analytics compelling without auto-refresh in prod, a user interacting with an MV in dbt will be primarily doing it during development where having clear levers will be more helpful.

@morsapaes
Copy link

Materialized views in Materialize work under different assumptions to other databases (i.e. native incremental updates, no need for scheduled refreshes), so at first glance it doesn't look like we could inherit from dbt-postgres once the new materialization lands. As a maintainer, I'd like to make sure that we're consistent with the materialization name you land on, though! As is, we're using materializedview in our custom materialization, so deviating from that would mean a breaking change for existing users.

On our end, the materialization looks like:

Configurations

  • (optional) cluster: the cluster that maintains the materialized view (i.e. compute)
  • (optional) indexes: any indexes defined on top of the materialized view

Lifecycle
The default behaviour in dbt-materialize is to drop and recreate the materialized view, which is less problematic than it might be in other adapters since you're not expected to hit dbt run on a cadence (but rather once to bootstrap your setup, and then to add new models or evolve existing ones (using partial runs when possible)), and the engine is purpose-built to, well, be very efficient at computing and incrementally maintaining materialized views! At the risk of going too deep into Materialize specifics, one thing we're eyeing next is allowing an operation that waits until a new view with the new SQL definition is caught up before dropping an existing one, and then swaps them.

@Fleid
Copy link
Contributor Author

Fleid commented Feb 13, 2023

We should talk ;)

At the risk of going too deep into Materialize specifics, one thing we're eyeing next is allowing an operation that waits until a new view with the new SQL definition is caught up before dropping an existing one, and then swaps them.

I like that.

@Fleid
Copy link
Contributor Author

Fleid commented Mar 9, 2023

Updated with a new on_definition_change parameter for dbt run.

@mikealfare mikealfare self-assigned this Mar 14, 2023
@Fleid
Copy link
Contributor Author

Fleid commented Mar 15, 2023

Updated following internal review:

  • on_definition_change is split between on_schema_change (tracking column list) and on_configuration_change (tracking model configurations), to ease onboarding from incremental models
  • reduced the option for these configs to ignore/fail/apply - we try a little bit of cleverness, favoring ALTER / CREATE OR REPLACE statements where applicable, but mostly rely on the databases to optimize and avoid full recomputes

@Fleid
Copy link
Contributor Author

Fleid commented Mar 23, 2023

Updated to remove on_schema_change for now - see reasons inline

@Nipsuli
Copy link

Nipsuli commented Jul 17, 2023

Wondering how this can be closed as completed when the dbt-bigquery part is still in draft pr? dbt-labs/dbt-bigquery#639

@mikealfare
Copy link
Contributor

Incorrectly closed due to automation.

@mikealfare
Copy link
Contributor

Yay for automation!

@mikealfare mikealfare reopened this Sep 13, 2023
@benc-db
Copy link

benc-db commented Oct 10, 2023

on auto_refresh and backup: is backup taken care of in core or do we need any new adapter work? Is auto_refresh a signal from core that the adapters should implement as appropriate?

@mikealfare
Copy link
Contributor

on auto_refresh and backup: is backup taken care of in core or do we need any new adapter work? Is auto_refresh a signal from core that the adapters should implement as appropriate?

auto_refresh and backup are specific to Redshift, hence were implemented in dbt-redshift. I do think we'll need to have some form of generic auto_refresh at the dbt-core level so that the materialization can inspect whether the MV is set to auto refresh. But we have not done that yet. There's a very small chance I knock that out tomorrow before the 1.7.0rc1 release.

I imagine Databricks has its own platform specific settings, and those would need to be added into dbt-databricks as additional AdapterConfig arguments. These should then get routed through into the NodeConfig/ModelConfig/etc. so that you can see them in the global jinja context (e.g. config, config.model).

I'm working with @dataders to put something together for you that should be much more helpful than this response, but hopefully that answers your immediate question.

@mikealfare
Copy link
Contributor

The only outstanding work for this feature addresses the nuances around automatic refreshes on materialized views that are already scheduled to refresh. This functionality already has associated issues in each adapter's respective repository. With that context, I'm closing this as complete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment