Materialized View Spec #11041
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -42,12 +42,28 @@ An atomic swap of one view metadata file for another provides the basis for maki | |
|
|
||
| Writers create view metadata files optimistically, assuming that the current metadata location will not be changed before the writer's commit. Once a writer has created an update, it commits by swapping the view's metadata file pointer from the base location to the new location. | ||
|
|
||
| ### Materialized Views | ||
|
|
||
| Materialized views are a type of view with precomputed results from the view query stored as a table. | ||
| When queried, engines may return the precomputed data for the materialized views, shifting the cost of query execution to the precomputation step. | ||
|
|
||
| Iceberg materialized views are implemented as a combination of an Iceberg view and an underlying Iceberg table, the "storage-table", which stores the precomputed data. | ||
| Materialized View metadata is a superset of View metadata with an additional pointer to the storage table. The storage table is an Iceberg table with additional materialized view refresh state metadata. | ||
| Refresh metadata contains information about the "source tables" and/or "source views", which are the tables/views referenced in the query definition of the materialized view. | ||
| At read time, a materialized view (storage table) can be interpreted as "fresh", "stale" or "invalid", depending on the following situations: | ||
| * **fresh** -- The `snapshot-id`s recorded by the last refresh operation match the current `snapshot-id`s of all the source tables. | ||
| * **stale** -- The `snapshot-id`s do not match for at least one source table, indicating that a refresh operation needs to be performed to capture the latest source table changes. | ||
| * **invalid** -- The current `version-id` of the materialized view does not match the `view-version-id` of the refresh state. | ||
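The three states above can be sketched as a small classification function. This is an illustrative Python sketch, not part of the spec: the name `classify_freshness` and the plain-dict inputs are assumptions, checking "invalid" before "stale" is a design choice made here, and source views and `max-staleness-ms` are ignored for brevity.

```python
from typing import Dict


def classify_freshness(
    current_version_id: int,
    refresh_view_version_id: int,
    refreshed_snapshot_ids: Dict[str, int],
    current_snapshot_ids: Dict[str, int],
) -> str:
    """Return "invalid", "stale", or "fresh" for a materialized view.

    Hypothetical helper: both dicts map a source table uuid to a snapshot id.
    """
    # The storage table was computed for a different view version.
    if current_version_id != refresh_view_version_id:
        return "invalid"
    # A source table whose current snapshot differs from the snapshot
    # recorded at refresh time makes the precomputed data stale.
    for table_uuid, refreshed_id in refreshed_snapshot_ids.items():
        if current_snapshot_ids.get(table_uuid) != refreshed_id:
            return "stale"
    return "fresh"
```

An engine that sees "stale" or "invalid" could fall back to evaluating the view query directly; that choice is left to the engine.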
|
|
||
| ## Specification | ||
|
|
||
| ### Terms | ||
|
|
||
| * **Schema** -- Names and types of fields in a view. | ||
| * **Version** -- The state of a view at some point in time. | ||
| * **Storage table** -- Iceberg table that stores the precomputed data of the materialized view. | ||
| * **Source table** -- A table reference that occurs in the query definition of the materialized view. The materialized view depends on the data from the source tables. | ||
|
||
| * **Source view** -- A view reference that occurs in the query definition of the materialized view. The materialized view depends on the definitions from the source views. | ||
|
||
|
|
||
| ### View Metadata | ||
|
|
||
|
|
@@ -63,11 +79,13 @@ The view version metadata file has the following fields: | |
| | _required_ | `versions` | A list of known [versions](#versions) of the view [1] | | ||
| | _required_ | `version-log` | A list of [version log](#version-log) entries with the timestamp and `version-id` for every change to `current-version-id` | | ||
| | _optional_ | `properties` | A string to string map of view properties [2] | | ||
| | _optional_ | `max-staleness-ms` | The maximum time interval in milliseconds after a refresh operation during which the materialized view's data is considered fresh [3] | | ||
|
|
||
| Notes: | ||
|
|
||
| 1. The number of versions to retain is controlled by the view property: `version.history.num-entries`. | ||
| 2. Properties are used for metadata such as `comment` and for settings that affect view maintenance. This is not intended to be used for arbitrary metadata. | ||
| 3. The `max-staleness-ms` field only applies to materialized views and must be set to `null` for common views. If `max-staleness-ms` is not `null` and the time elapsed since the last refresh operation is less than `max-staleness-ms`, the query engine may return data directly from the `storage-table` without evaluating freshness based on the source tables and views. If `max-staleness-ms` is `null` for a materialized view, the data in the `storage-table` is always considered fresh. | ||
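Note 3 can be read as a simple time-window check. The sketch below is a hedged Python illustration for materialized views only; the function name and parameters are hypothetical, and measuring the elapsed time from `refresh-start-timestamp-ms` is an assumption, since the note does not say which timestamp marks "the last refresh operation".

```python
from typing import Optional


def may_skip_freshness_check(
    max_staleness_ms: Optional[int],
    refresh_start_timestamp_ms: int,
    now_ms: int,
) -> bool:
    """Return True if an engine may read the storage table without
    comparing the refresh state against the source tables and views."""
    # A null max-staleness-ms means the storage table data is always
    # considered fresh (note 3).
    if max_staleness_ms is None:
        return True
    # Assumption: elapsed time is measured from the refresh start timestamp.
    elapsed_ms = now_ms - refresh_start_timestamp_ms
    return elapsed_ms < max_staleness_ms
```

When this returns False, the engine would evaluate freshness from the refresh state as usual.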
|
|
||
| #### Versions | ||
|
|
||
|
|
@@ -82,9 +100,12 @@ Each version in `versions` is a struct with the following fields: | |
| | _required_ | `representations` | A list of [representations](#representations) for the view definition | | ||
| | _optional_ | `default-catalog` | Catalog name to use when a reference in the SELECT does not contain a catalog | | ||
| | _required_ | `default-namespace` | Namespace to use when a reference in the SELECT is a single identifier | | ||
| | _optional_ | `storage-table` | A [storage table identifier](#storage-table-identifier) of the storage table | | ||
|
|
||
|
Author
I initially thought of Anyway, I like your description and I can add it either as a property or a metadata field, if there is consensus. |
||
| When `default-catalog` is `null` or not set, the catalog in which the view is stored must be used as the default catalog. | ||
|
|
||
| When `storage-table` is `null` or not set, the entity is a common view; otherwise it is a materialized view. | ||
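As an illustration of the rule above, a reader could distinguish the two kinds of view by inspecting the version struct. This is a minimal Python sketch that assumes the version has already been parsed into a dict; the helper name and the example values (`db`, `mv_storage`) are hypothetical.

```python
from typing import Any, Dict


def is_materialized_view(version: Dict[str, Any]) -> bool:
    """A version describes a materialized view when `storage-table`
    is present and not null; otherwise it is a common view."""
    return version.get("storage-table") is not None


# Hypothetical version structs, trimmed to the fields relevant here.
common_view_version = {"version-id": 1, "default-namespace": ["db"]}
materialized_view_version = {
    "version-id": 1,
    "default-namespace": ["db"],
    # Storage table identifier: namespace levels plus a table name.
    "storage-table": {"namespace": ["db"], "name": "mv_storage"},
}
```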
|
|
||
|
||
| #### Summary | ||
|
|
||
| Summary is a string to string map of metadata about a view version. Common metadata keys are documented here. | ||
|
|
@@ -160,6 +181,57 @@ Each entry in `version-log` is a struct with the following fields: | |
| | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | ||
| | _required_ | `version-id` | ID that `current-version-id` was set to | | ||
|
|
||
| #### Storage Table Identifier | ||
|
|
||
| The table identifier for the storage table that stores the precomputed results. | ||
|
|
||
| | Requirement | Field name | Description | | ||
|
||
| |-------------|----------------|-------------| | ||
| | _required_ | `namespace` | A list of strings for namespace levels | | ||
| | _required_ | `name` | A string specifying the name of the table/view | | ||
|
||
|
|
||
| ### Storage table metadata | ||
|
||
|
|
||
| This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views. | ||
| The `refresh-state` property is set in the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) of every storage table snapshot and is used to determine the freshness of the precomputed data of the storage table. | ||
|
|
||
| | Requirement | Field name | Description | | ||
| |-------------|-----------------|-------------| | ||
| | _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string | | ||
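Because the snapshot summary is a string to string map, the refresh state has to be serialized before it is stored. The following Python sketch shows one way to do the round trip with the standard `json` module; the helper names are hypothetical and the summary is modeled as a plain dict.

```python
import json
from typing import Any, Dict


def write_refresh_state(summary: Dict[str, str], state: Dict[str, Any]) -> Dict[str, str]:
    """Return a copy of the snapshot summary with `refresh-state` set
    to the JSON-encoded refresh state record."""
    updated = dict(summary)
    updated["refresh-state"] = json.dumps(state)
    return updated


def read_refresh_state(summary: Dict[str, str]) -> Dict[str, Any]:
    """Decode the refresh state record from a snapshot summary."""
    return json.loads(summary["refresh-state"])
```

Copying the summary instead of mutating it keeps the helper free of side effects, which is a choice made for this sketch rather than a requirement.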
|
|
||
| #### Refresh state | ||
|
|
||
| The refresh state record captures the state of all source tables, views, and materialized views in the materialized view's fully expanded query tree at refresh time. Source table states are stored in `source-table-states` and source view states in `source-view-states`. For source views, `source-view-states` includes indirect references, that is, tables or views nested within other views (excluding MVs) but not directly referenced in the query. | ||
| For source materialized views, both the source view and its storage table are included in the refresh state. Indirect references are excluded for materialized view sources; during read time, query engines may recursively expand the query tree to determine freshness. The refresh state has the following fields: | ||
|
|
||
| | Requirement | Field name | Description | | ||
| |-------------|----------------|-------------| | ||
| | _required_ | `view-version-id` | The `version-id` of the materialized view when the refresh operation was performed | | ||
| | _required_ | `source-table-states` | A list of [source table](#source-table) records for all tables that are directly or indirectly referenced in the materialized view query | | ||
| | _required_ | `source-view-states` | A list of [source view](#source-view) records for all views that are directly or indirectly referenced in the materialized view query | | ||
|
||
| | _required_ | `refresh-start-timestamp-ms` | Timestamp when the refresh operation was started (ms from epoch) | | ||
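To make the field list concrete, here is a hedged Python sketch of a refresh state record and a check for its required fields. The uuids, ids, and timestamp are made-up placeholder values, and `missing_refresh_state_fields` is a hypothetical helper, not part of any Iceberg API.

```python
from typing import Any, Dict, List

# The four required fields of a refresh state record.
REQUIRED_REFRESH_STATE_FIELDS = (
    "view-version-id",
    "source-table-states",
    "source-view-states",
    "refresh-start-timestamp-ms",
)


def missing_refresh_state_fields(state: Dict[str, Any]) -> List[str]:
    """List the required refresh state fields absent from `state`."""
    return [name for name in REQUIRED_REFRESH_STATE_FIELDS if name not in state]


# Example record with placeholder values (not taken from a real table).
example_state = {
    "view-version-id": 1,
    "source-table-states": [
        {"uuid": "00000000-0000-0000-0000-000000000001", "snapshot-id": 100, "ref": "main"},
    ],
    "source-view-states": [
        {"uuid": "00000000-0000-0000-0000-000000000002", "version-id": 3},
    ],
    "refresh-start-timestamp-ms": 1700000000000,
}
```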
|
|
||
| #### Source table | ||
|
|
||
| A source table record captures the state of a source table at the time of the last refresh operation. | ||
|
Contributor
the state of a source table (including source MV's storage table)? |
||
|
|
||
| | Requirement | Field name | Description | | ||
|
||
| |-------------|----------------|-------------| | ||
| | _required_ | `uuid` | The uuid of the source table | | ||
| | _required_ | `snapshot-id` | The `snapshot-id` of the source table when the last refresh operation was performed | | ||
| | _optional_ | `ref` | Branch name of the source table being referenced in the view query | | ||
|
|
||
| When `ref` is `null` or not set, it defaults to "main". | ||
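The default can be applied when reading a source table record, as in this small Python sketch. The helper name `effective_ref` is hypothetical, and the record is assumed to be an already parsed dict.

```python
from typing import Any, Dict


def effective_ref(source_table_state: Dict[str, Any]) -> str:
    """Return the branch name for a source table record, treating a
    missing or null `ref` as "main"."""
    ref = source_table_state.get("ref")
    return "main" if ref is None else ref
```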
|
|
||
| #### Source view | ||
|
|
||
| A source view record captures the state of a source view at the time of the last refresh operation. | ||
|
|
||
| | Requirement | Field name | Description | | ||
| |-------------|----------------|-------------| | ||
| | _required_ | `uuid` | The uuid of the source view | | ||
| | _required_ | `version-id` | The `version-id` of the source view when the last refresh operation was performed | | ||
|
|
||
| ## Appendix A: An Example | ||
|
|
||
| The JSON metadata file format is described using an example below. | ||
|
|
||
Since there are many use cases for allowing an engine to use a stale materialization, can we add a "warm" situation with the description:
warm - The snapshot_ids do not match for at least one source table and the snapshot was committed after the current time minus materialization.max-staleness-ms