[CT-2068] [Feature] Let's add Materialized View as a materialization, finally #6911

Fleid · 2023-02-08T23:09:10Z

`materialized: materialized_view`

Context : The time has come, let's do this.

I'll elaborate on why we're doing this now and not before in a public post in the near future. The short version is that a new generation of Materialized Views (MVs) is emerging, powered by streaming technologies, and we want to support them. This also allows dbt to retroactively support some of the current implementations in a cleaner way. Win, win, win.

This specific issue is about building the foundation. It is about re-purposing the work already done in the experimental package for a default materialization, Postgres, Redshift and BigQuery implementations, and make it official.

Please note that Snowflake is not in the list above. For dbt-snowflake, we'll skip their current MVs and support directly Dynamic Tables when they're ready (we already have an issue for it!).

A new materialization

So yes, we need a new materialization with a default implementation in dbt core, and 3 dispatch implementations for Postgres, Redshift and BigQuery.

I like the ergonomics designed in the experimental package...

[edit 1]~~I see no reason for us to change them for now (well... see the alternative option below).~~
[edit 2]I re-used them but added on_definition_change (now split into on_configuration_change and on_schema_change for easier onboarding from incremental models) to offer a similar behavior to what can be found with on_schema_change for incremental models. Here it's not only the schema that we may need to change, but most importantly the configuration options like auto_refresh mode, or the refresh_interval_minutes. These options can be modified via an ALTER statement that doesn't require the full state of the MV to be recomputed (not that it's limited to that, but it allows those too).
... I re-used them but added on_configuration_change to mimic the behavior of on_schema_change from incremental models, focused on configs. A normal run of dbt on MVs should only be about "moving data", and not deploying code changes. So either refresh, or do nothing. We can still try to pass along configuration changes (indexes, labels, warehouses...) because there's a chance we can do that without triggering a full refresh of the view (which is what we are trying to avoid).

We also want to provide that capability for the logic of the view (SQL statement). But at this point, we can't do it cleanly. We can't compare the logic that is deployed to the one we have in the model (I dread comparing SQL statements as strings), and no platform in scope supports CREATE OR REPLACE. So we can't be clever, and we can't be brutal expecting the platform to be clever.

We also can't limit the scope to on_schema_change because in MVs there is only room for 1 logic to be applied at a point in time, contrary to incremental models where the previous logic can live with the new one. We will follow up on that later.

Default implementation

Note how the current expectation is that base MVs require a refresh.

Configurations : none
Lifecycle (dbt run) :
- If necessary backup (exists with different materialization ~~OR exists and full-refresh~~)
- If necessary create (doesn't exist) Else
  - If the options of the object in the database are different from the current model, depending on on_configuration_change
    - ~~ignore~~ : we can't use ignore here, as on_schema_change=ignore for incremental models has set weird expectations on the word. Let's use skip instead for now.
    - skip : Do nothing, aka stage changes until the next full refresh. (Documentation follow-up: this can create uncertainty where there is a delta between the model definition and what's in the database - same as on_schema_change=ignore for incremental models)
    - fail : Fail the run
    - apply : [default] Apply the changes (implementation detail: let's try to favor commands that do not trigger full recomputes (ALTER, CREATE OR REPLACE))
  - Refresh when necessary (the object was not re-created in a step above)
Lifecycle (dbt run --full-refresh) :
- If necessary backup (exists with different materialization ~~OR exists and full-refresh~~)
- If necessary drop
- Create

Described more simply, for a MV already created:

dbt run --full-refresh : we drop and create
- Here, in order we should try to:
  1. CREATE OR REPLACE - if CREATE OR REPLACE is supported
  2. CREATE, DROP, ALTER NAME - if rename is supported (Postgres, Redshift)
  3. DROP, CREATE - if none of the above (BQ)
dbt run : we refresh (or do nothing if auto_refresh)
- Except if you set on_configuration_change to something other that skip, which is the default, then we check the config in database to the one in the model, and if there are diffs, we apply or fail
- For apply, in order we should try to - this choice need to be made for each configuration (some will support ALTER, some not)
  1. ALTER in place
  2. else apply the full refresh order

Postgres

Postgres : Materialized Views

⚠️ Schema management requires some attention

Configurations:
- From models : indexes
Lifecycle : default
- on_configuration_change.apply : indexes via DROP/CREATE INDEXES

Redshift

Redshift Materialized Views

⚠️ Schema management requires some attention

Configurations:
- auto_refresh : YES/NO, optional (database level default no) (see lifecycle for effect)
- From models : sort / sort_type / dist
Lifecycle : default except Refresh is bypassed if auto_refresh
- on_configuration_change.apply : auto_refresh via ALTER, rest via DROP/CREATE

BigQuery

BigQuery Materialized Views

Configurations:
- enable_refresh : BOOLEAN, optional (database level default true) (see lifecycle for effect)
- refresh_interval_minutes : integer, optional (database level default at 30), only valid for enable_refresh true or default
- max_staleness : time before bypassing the MV and running the statement instead (in preview but I like it!)
- From models: Partition / Cluster, TTL, Labels
Lifecycle : default except Refresh is bypassed if auto_refresh
- on_configuration_change.apply : enable_refresh/refresh_interval_minutes/max_staleness via ALTER, rest via DROP/CREATE

Backward compatibility

People using materialized: materialized_view from the experimental package in dbt core 1.4 and before, should not be broken by the addition of a default and adapter level materialized_view materializations. The chain of precedence in macros will preserve existing behavior. The package will need to be uninstalled to benefit from the new implementation.

Additional details, not required for implementation

Risks

Most users interested by this topic are already using the experimental package. We need to make sure that this implementation won't break them when we roll it out. I'm guessing the chain of inheritance should work in our favor here, but we need to be sure.
Same thing for community/vendor supported adapters already implementing a version of MVs (Materialize for example).
Information collection will be tricky across the board. MVs are usually not accessible via the default information schemas / catalogues
Cascading drops in full-refresh are tricky. If an upstream table used in a MV is dropped/altered, the MV may end up in a funky state. See comments in the experimental packages. We may need to revisit some of the behaviors or re-calibrate expectations.

Alternative/Options

default auto_refresh = true

~~I'm seriously wondering if the default ergonomics should not be more aggressively geared towards auto refreshing:~~
[Edit following comments below] Not a good idea

Default
- Configurations
  - auto_refresh : optional, default true
- Lifecycle :
  - DDL :
    - If necessary backup (exists with different materialization OR exists and full-refresh)
    - If necessary create (doesn't exist)
Postgres
- Configurations:
  - auto_refresh : mandatory no
- Lifecycle : default + else refresh
Redshift
- Configurations:
  - auto_refresh : ⚠️ default option is reversed (still true, contrary to the engine default)
BigQuery
- Configurations:
  - auto_refresh/enable_refresh : default
  - refresh_interval_minutes
  - max_staleness

Who will this benefit?

A lot of people:

Core / Postgres : Support Materialized Views #1162
BigQuery : Support BigQuery Materialized Views dbt-bigquery#18
Snowflake : Ft Add materialization for Snowflake materialized view object type dbt-snowflake#181
Databricks : Add "materialized_view" materialization. databricks/dbt-databricks#242
Issues in the experimental package

Anything else?

Historical perspective on the topic : https://www.youtube.com/watch?v=awj-5aYXZnc

The text was updated successfully, but these errors were encountered:

Fleid · 2023-02-08T23:16:17Z

@jtcohen6, @Kayrnt and everyone else, I welcome your feedback on the alternative ergonomics ;)

amychen1776 · 2023-02-09T21:02:59Z

I agree with this approach of not making auto-refresh the default. It will help users be more mindful of managing to compute costs (i.e making sure their development environments are running with intent). While I don't find using MVs for analytics compelling without auto-refresh in prod, a user interacting with an MV in dbt will be primarily doing it during development where having clear levers will be more helpful.

morsapaes · 2023-02-13T10:14:52Z

Materialized views in Materialize work under different assumptions to other databases (i.e. native incremental updates, no need for scheduled refreshes), so at first glance it doesn't look like we could inherit from dbt-postgres once the new materialization lands. As a maintainer, I'd like to make sure that we're consistent with the materialization name you land on, though! As is, we're using materializedview in our custom materialization, so deviating from that would mean a breaking change for existing users.

On our end, the materialization looks like:

Configurations

(optional) cluster: the cluster that maintains the materialized view (i.e. compute)
(optional) indexes: any indexes defined on top of the materialized view

Lifecycle
The default behaviour in dbt-materialize is to drop and recreate the materialized view, which is less problematic than it might be in other adapters since you're not expected to hit dbt run on a cadence (but rather once to bootstrap your setup, and then to add new models or evolve existing ones (using partial runs when possible)), and the engine is purpose-built to, well, be very efficient at computing and incrementally maintaining materialized views! At the risk of going too deep into Materialize specifics, one thing we're eyeing next is allowing an operation that waits until a new view with the new SQL definition is caught up before dropping an existing one, and then swaps them.

Fleid · 2023-02-13T21:26:45Z

We should talk ;)

At the risk of going too deep into Materialize specifics, one thing we're eyeing next is allowing an operation that waits until a new view with the new SQL definition is caught up before dropping an existing one, and then swaps them.

I like that.

Fleid · 2023-03-09T05:35:26Z

Updated with a new on_definition_change parameter for dbt run.

Fleid · 2023-03-15T16:57:20Z

Updated following internal review:

on_definition_change is split between on_schema_change (tracking column list) and on_configuration_change (tracking model configurations), to ease onboarding from incremental models
reduced the option for these configs to ignore/fail/apply - we try a little bit of cleverness, favoring ALTER / CREATE OR REPLACE statements where applicable, but mostly rely on the databases to optimize and avoid full recomputes

Fleid · 2023-03-23T19:09:18Z

Updated to remove on_schema_change for now - see reasons inline

Nipsuli · 2023-07-17T18:53:30Z

Wondering how this can be closed as completed when the dbt-bigquery part is still in draft pr? dbt-labs/dbt-bigquery#639

mikealfare · 2023-07-17T19:00:29Z

Incorrectly closed due to automation.

mikealfare · 2023-09-13T17:55:52Z

Yay for automation!

benc-db · 2023-10-10T20:10:24Z

on auto_refresh and backup: is backup taken care of in core or do we need any new adapter work? Is auto_refresh a signal from core that the adapters should implement as appropriate?

mikealfare · 2023-10-11T05:37:15Z

on auto_refresh and backup: is backup taken care of in core or do we need any new adapter work? Is auto_refresh a signal from core that the adapters should implement as appropriate?

auto_refresh and backup are specific to Redshift, hence were implemented in dbt-redshift. I do think we'll need to have some form of generic auto_refresh at the dbt-core level so that the materialization can inspect whether the MV is set to auto refresh. But we have not done that yet. There's a very small chance I knock that out tomorrow before the 1.7.0rc1 release.

I imagine Databricks has its own platform specific settings, and those would need to be added into dbt-databricks as additional AdapterConfig arguments. These should then get routed through into the NodeConfig/ModelConfig/etc. so that you can see them in the global jinja context (e.g. config, config.model).

I'm working with @dataders to put something together for you that should be much more helpful than this response, but hopefully that answers your immediate question.

mikealfare · 2023-12-02T18:57:30Z

The only outstanding work for this feature addresses the nuances around automatic refreshes on materialized views that are already scheduled to refresh. This functionality already has associated issues in each adapter's respective repository. With that context, I'm closing this as complete.

Fleid added enhancement New feature or request triage labels Feb 8, 2023

Fleid added this to the v1.5 milestone Feb 8, 2023

Fleid self-assigned this Feb 8, 2023

github-actions bot changed the title ~~[Feature] Let's add Materialized View as a materialization, finally~~ [CT-2068] [Feature] Let's add Materialized View as a materialization, finally Feb 8, 2023

Fleid added Team:Adapters Issues designated for the adapter area of the code and removed triage labels Feb 8, 2023

Fleid mentioned this issue Feb 9, 2023

[CT-2070] [Feature] Materialized tests #6914

Closed

Fleid mentioned this issue Feb 11, 2023

[CT-1576] [CT-1572] [Bug] Snowflake cluster_by config does not update Snowflake table for incremental models dbt-labs/dbt-snowflake#335

Closed

2 tasks

mikealfare self-assigned this Mar 14, 2023

This was referenced Mar 22, 2023

ADAP-38: Add materialized views as a new materialization #7210

Closed

ADAP-387: Stub materialized view as a materialization #7211

Merged

mikealfare linked a pull request Mar 22, 2023 that will close this issue

ADAP-387: Stub materialized view as a materialization #7211

Merged

6 tasks

mikealfare assigned McKnight-42 Mar 22, 2023

This was referenced Mar 28, 2023

ADAP-2: Materialized Views dbt-labs/dbt-redshift#386

Merged

ADAP-2: Materialized Views dbt-labs/dbt-bigquery#635

Closed

ADAP-2: Materialized Views #7239

Merged

ADAP-389: Stub materialized view as a materialization dbt-labs/dbt-redshift#387

Merged

mikealfare linked a pull request Mar 29, 2023 that will close this issue

ADAP-389: Stub materialized view as a materialization dbt-labs/dbt-redshift#387

Merged

6 tasks

This was referenced Mar 30, 2023

ADAP-400: Support materialized views with create/drop functionality dbt-labs/dbt-bigquery#639

Merged

ADAP-388: Stub materialized view as a materialization - postgres #7244

Merged

mikealfare linked a pull request Mar 30, 2023 that will close this issue

ADAP-388: Stub materialized view as a materialization - postgres #7244

Merged

6 tasks

McKnight-42 mentioned this issue Apr 4, 2023

ADAP-391: Add configuration change option #7272

Merged

6 tasks

Fleid mentioned this issue May 10, 2023

[ADAP-544] [Feature] Support Dynamic Tables as a materialization (following base Materialized View ergonomics) dbt-labs/dbt-snowflake#603

Closed

mikealfare mentioned this issue May 19, 2023

Materialized Views dbt-labs/docs.getdbt.com#3400

Closed

3 tasks

mikealfare closed this as completed in #7239 Jun 7, 2023

dbeatty10 mentioned this issue Jun 14, 2023

[CT-2705] dbt deps behaves different between windows and linux when adding a package #7874

Closed

2 tasks

This was referenced Jul 11, 2023

Feature/materialized views/adap 608 #7985

Closed

ADAP-608: Finalize Materialized Views dbt-labs/dbt-redshift#527

Closed

Apply new integration tests to existing framework to identify supported features #8099

Merged

mikealfare reopened this Jul 17, 2023

dataders modified the milestones: v1.6, v1.7 Sep 8, 2023

McKnight-42 closed this as completed in dbt-labs/dbt-bigquery#639 Sep 13, 2023

mikealfare reopened this Sep 13, 2023

dbeatty10 added the materialized_views label Oct 16, 2023

mikealfare mentioned this issue Oct 19, 2023

ADAP-835: Optimize manual refresh on auto-refreshed materialized views dbt-labs/dbt-redshift#637

Closed

4 tasks

Fleid mentioned this issue Oct 25, 2023

[CT-3229] Materialized views are not recreated after configuration change #8862

Open

2 tasks

mikealfare mentioned this issue Oct 25, 2023

ADAP-835: Optimize manual refresh on auto-refreshed materialized views dbt-labs/dbt-bigquery#981

Closed

4 tasks

mikealfare closed this as completed Dec 2, 2023

chenzl25 mentioned this issue Dec 22, 2023

Change dbt run to not drop models by default unless full refresh option is provided risingwavelabs/dbt-risingwave#20

Closed

nszoni mentioned this issue Feb 20, 2024

v1.6.0rc1 microsoft/dbt-synapse#202

Merged

This was referenced Apr 2, 2024

[Feature] Support on_config_change for incremental #9842

Closed

Support Materialized Views #1162

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-2068] [Feature] Let's add Materialized View as a materialization, finally #6911

[CT-2068] [Feature] Let's add Materialized View as a materialization, finally #6911

Fleid commented Feb 8, 2023 •

edited

Loading

Fleid commented Feb 8, 2023 •

edited

Loading

amychen1776 commented Feb 9, 2023 •

edited

Loading

morsapaes commented Feb 13, 2023

Fleid commented Feb 13, 2023

Fleid commented Mar 9, 2023

Fleid commented Mar 15, 2023

Fleid commented Mar 23, 2023

Nipsuli commented Jul 17, 2023

mikealfare commented Jul 17, 2023

mikealfare commented Sep 13, 2023

benc-db commented Oct 10, 2023

mikealfare commented Oct 11, 2023

mikealfare commented Dec 2, 2023

[CT-2068] [Feature] Let's add Materialized View as a materialization, finally #6911

[CT-2068] [Feature] Let's add Materialized View as a materialization, finally #6911

Comments

Fleid commented Feb 8, 2023 • edited Loading

materialized: materialized_view

Context : The time has come, let's do this.

A new materialization

Default implementation

Postgres

Redshift

BigQuery

Backward compatibility

Additional details, not required for implementation

Risks

Alternative/Options

default auto_refresh = true

Who will this benefit?

Anything else?

Fleid commented Feb 8, 2023 • edited Loading

amychen1776 commented Feb 9, 2023 • edited Loading

morsapaes commented Feb 13, 2023

Fleid commented Feb 13, 2023

Fleid commented Mar 9, 2023

Fleid commented Mar 15, 2023

Fleid commented Mar 23, 2023

Nipsuli commented Jul 17, 2023

mikealfare commented Jul 17, 2023

mikealfare commented Sep 13, 2023

benc-db commented Oct 10, 2023

mikealfare commented Oct 11, 2023

mikealfare commented Dec 2, 2023

Fleid commented Feb 8, 2023 •

edited

Loading

`materialized: materialized_view`

Fleid commented Feb 8, 2023 •

edited

Loading

amychen1776 commented Feb 9, 2023 •

edited

Loading