[SIP-68] A better model for Datasets #14909

betodealmeida · 2021-05-28T22:14:46Z

[SIP] A better model for Datasets

This document proposes the creation of new set of SQLAlchemy models to better represent datasets in Superset and build the foundation for a powerful semantic layer:

A Table model, representing physical tables or views in a given database.
A Dataset model, representing the current concept of physical and virtual datasets;
A Column model, for table/view columns, dataset metrics, and dataset derived columns.

These would replace the following existing 9 models:

BaseDatasource, DruidDatasource, and SqlaTable. These models currently are used to represent Druid datasources (to be deprecated), as well as physical and virtual datasets built on SQLAlchemy, with common functionality in a base class.
BaseColumn, DruidColumn, TableColumn, representing columns in Druid or SQLAlchemy.
BaseMetric, DruidMetric, SqlMetric, representing metrics in Druid or SQLAlchemy.

(Note that Superset currently doesn't have a distinction between a table in a database and a dataset enriched with metadata. Because of this, it's hard to infer relationships between objects such as a virtual dataset and the tables it references, even though Superset is already able to extract that relationship from the SQL.)

There's an additional 10th model that is out-of-scope for this SIP: AnnotationDatasource. Eventually it would be nice to allow any dataset to be used as an annotation, removing the need for a custom model.

Motivation

The current semantic model in Superset is very simple. Originally, Superset was built as a UI to explore Druid datasources (tables). At the time Druid didn't support SQL, and queries had to be done using a native JSON-based interface. Because of the lack of SQL support, when adding a datasource to Superset the column types had to be inferred from their names, and users could manually override them. Also, because of the nature of Druid's storage, users also had to indicate a temporal column, which columns were filterable, and which ones were groupable, in order to prevent expensive queries from running.

Users could also add new metrics and derived columns to a datasource, which was an important feature because they couldn't simply be defined on-the-fly through SQL. The metrics, columns, and metadata describing columns were (and still are) stored in a model called DruidDatasource. This description of the the underlying table and the additional metadata became Superset's semantic layer.

Once support for SQL was added, a new model was introduced, called SqlaTable. The new model represents a table in a database supported by SQLAlchemy, and contains additional metadata similar to the Druid counterpart: column types, labels, properties (groupable/filterable/temporal), as well as additional metrics and columns defined using SQL. In this document we'll focus on the SQLAlchemy tables and datasets, since Druid is proposed to be deprecated in Superset 2.0 (SIP-11).

One of the main problems with the current implementation of datasets is that they are not distinct enough from tables, with datasets being just a very thin layer around tables. Datasets can't be named, and instead inherit the name of the underlying table. Because of that, it's not possible to have two datasets pointing to the same table. Because they are so tightly coupled users don't understand the difference between a table and a dataset, and a common source of confusion is why a user can query a table in SQL Lab but can't visualize it in the Explore view — unless they create a dataset first.

Some of these shortcomings were solved by the introduction of virtual datasets, datasets that instead of pointing to a table are defined by a SQL expression. Virtual datasets can be named, and multiple virtual datasets can point to the same table, allowing different teams to curate datasets that tailor their needs. While clearly a step in the right direction, virtual datasets are also not a separate entity in the code base (they still use the SqlaTable model!), and there's still a lot of confusion between them and physical datasets, and them and database VIEWs.

Additionally, the process of creating a virtual dataset is unclear: users need to go to SQL Lab, write a query, execute it, and click "Explore" to create a dataset. During that process, they have very little control over the dataset that is being created, being allowed only to choose a name, though they can edit other attributes later. Other than this flow, clicking "+ Dataset" only allows the creation of physical datasets, but not virtual.

To improve the semantic layer of Superset we want to make datasets an entity of their own, distinct from tables. Datasets are the foundation of the semantic layer, and should be treated as a first class citizen in Superset.

Proposed Change

This diagram depicts the current SQLAlchemy dataset model (SqlaTable), as well as related entities. Models are represented by bold boxes:

A few things of note:

Physical datasets have an 1:1 relationship with views and tables in a given database (in blue).
Virtual datasets, like queries, have implicit n:n relationships with views and tables (represented as dotted lines). Because there's no Table model the relationship exists only in the SQL query present in these models. Superset is already able to parse the SQL and extract the tables referenced for security reasons, so it should be straightforward to represent this relationship explicitly.
Columns and metrics, while not depicted here, are very similar to datasets. A column object can point directly to a column in a given table, similar to how physical datasets work; and a derived column has an implicit n:n relationship to columns via an expression.
Most charts have an n:1 relationship with datasets. The "Multiple Line Charts" and "deck.gl Multiple Layers" have an n:n relationship to charts, but they are a special case so the relationship is not represented in the diagram.

These are the proposed new models:

The new implementation adds explicit models for Dataset, Table and Column:

A Table has a 1:1 relationship with a physical table or view, and is a direct representation of a database construct without any extra metadata.
A Dataset has an n:n relationship to Table, representing both physical datasets (n:1) and virtual datasets (n:n).
A Column (not shown) is defined essentially by an expression, and can point to a table column or a SQL snippet defining a metric or derived column. Columns also have extra metadata that support data exploration.

New or Changed Public Interfaces

This solution involves implementing new models, specially Table and Dataset. The former will represent tables (or views) in databases, while the latter will represent datasets. In this proposal datasets are virtual-first, and physical datasets are a particular case of virtual datasets. The data currently stored in the SqlaTable model will be migrated and split between the two new models, in addition to a Column model for metrics and columns.

The Table model represents the physical table stored in the database, and is pretty straightforward:

class Table(Model):
    id: int
    uuid: UUID

    database_id: int
    catalog: Optional[str]  # this will allow better support for Presto et al.
    schema: Optional[str]
    name: str

    columns: List[Column]  # 1:n relationship to Column model

While the dataset stores additional metadata:

class Dataset(Model):
    id: int
    uuid: UUID

    name: str
    owners: List[User]

    expression: str
    tables: List[Table]  # n:n relationship to Table model
    columns: List[Column]  # inherited columns, metrics, and computed dimensions

Here, expression represents a SQL source. For physical datasets this corresponds to a table name, while for a virtual dataset this maps to a more complex SQL query.

As an example, a physical dataset could look like this:

expression = "dim_some_table"

While a virtual dataset would have:

expression = "SELECT * FROM dim_some_table"

Both can be used directly in a SQL query:

SELECT COUNT(*) AS metric, dim1, dim2
FROM ($expression) source
GROUP BY 2, 3

Datasets (and tables) have columns, with rich metadata that helps exploring them:

class Column(Model):
    id: int
    uuid: UUID

    name: str
    type: str  # ideally not-DB specific

    # this points to a physical column (`created at`), or is a SQL expression (`revenue - expenses`)
    expression: str

    description: str
    warning_text: str

    # for timeseries
    is_temporal: bool
    # for visualizing data on a map
    is_spatial: bool
    # for scheduling queries and more
    is_partition: bool
    # indicates if this is a metric
    is_aggregation: bool
    # indicates which aggregations can be performed
    is_additive: bool

    # this can be computed offline
    cardinality: int

    # for chart labels
    description: str
    units: str

    # for AB testing, alerts, and more
    increase_good: bool

    # for auto-joins we'll propose the creation of a Dimension model in another SIP
    dimension: Dimension
    dimension_relationship: Union[Relationship.ManyToOne, ...]

Note that the columns are defined by a single expression, which can point to:

A physical column, eg, country;
An aggregation, eg, COUNT(*), in which case the column represents a metric;
A non-aggregating expression, in which case it represents a derived column.

This simplifies the current column model where TableColumn has both a column_name as well as an expression.

The extra metadata can be used to orient the user when exploring the dataset. For example, if a given column is non-additive (COUNT(DISTINCT user_id)) we know that it can't be used in a metric using a SUM(). If a given column has the is_spatial attribute we know it can be used in Deck.gl visualizations.

The dimension and dimension relationship can potentially be leveraged in the future, allowing additional dimension attributes to be automatically included when slicing and dicing the dataset. For example, if a user annotates a column user_id as referencing the User dimension, they should now be able to filter or group by country when exploring the dataset, and Superset will automatically perform the joins needed for the query.

To accentuate the difference between tables and datasets in this option we will leverage a new dataset editor, as well as allowing tables to be explored without having to create a dataset beforehand. These are described in detail in the last section ("Future work").

Examples

Here's a couple examples in pseudo-code. First, a physical dataset with a simple metric:

fact_sales = Table(
    id=1,
    database_id=1,
    catalog=None,
    schema=None,
    name="fact_sales",
    columns=[
        Column(name="id", type="int", expression="id"),
        Column(name="ts", type="datetime", expression="ts"),
        Column(name="item_id", type="int", expression="item_id"),
    ],
)

physical_dataset = Dataset(
    expression="fact_sales",
    tables=[fact_sales],
    columns=[
        # inherited
        Column(name="id", type="int", expression="id"),
        Column(name="ts", type="datetime", expression="ts"),
        Column(name="item_id", type="int", expression="item_id"),
        # metrics
        Column(name="items_sold", type="decimal", expression="COUNT(*)", is_aggregation=True),
    ],
)

And a virtual dataset:

dim_items = Table(
    id=2,
    database_id=1,
    name="dim_items",
    columns=[
        Column(name="id", type="int", expression="id"),
        Column(name="price", type="decimal", expression="price"),
        Column(name="description", type="str", expression="description"),
    ],
)

virtual_dataset = Dataset(
    expression="SELECT ts, price FROM fact_sales JOIN dim_items ON fact_sales.item_id = dim_items.id",
    tables=[fact_sales, dim_items],
    columns=[
        # inherited
        Column(name="ts", type="datetime", expression="ts"),
        Column(name="price", type="decimal", expression="price"),
        # metrics
        Column(name="revenue", type="decimal", expression="SUM(price)", is_aggregation=True),
    ],
)

New dependencies

No new dependencies are anticipated.

Migration Plan and Compatibility

The implementation is planned to be done in 4 steps:

Implement the new models but leave them empty.
Create migration, splitting SqlaTable into Dataset and Table, and moving SqlMetric and TableColumn to Column. Implement double-write, so that the new models are kept in sync.
Update code to read from the new models.
Remove Druid models.

Rejected Alternatives

Single model

We discussed implementing the proposal in a single model called Dataset containing all the metadata. Columns and metrics would be stored as JSON. This would allow for faster iteration on the semantic layer, as well as simplify the eventual implementation of versioning in Superset. Schemas would be enforced by marshmallow, and it would minimize the need for DB migrations.

After a lot of consideration we decided that implementing separate models would be better. While this proposal potentially requires more DB migrations the migrations would be simpler, while the single model approach would require migrations that read/deserialize/modify/serialize/write the metadata. Also, enforcing the schema at the DB level helps simplify the application logic.

`Query` reference in `Dataset`

We discussed having source_type and source_id in the Dataset model, to identify the source of a dataset, pointing either to a Table or a Query. This would greatly simplify the work for tools that index data, like Amundsen or Datahub, but make the database relationships more complicated, as pointed out by @john-bodley.

Instead, we opted for defining datasets by their expression, and store explicit relationships to Table objects to help with indexing.

Future work

Dimensions

After the work on datasets is complete we'd like to propose a new semantic construct, the dimension. This would allow users to define table relationships in Superset itself, outside of the database, recommending joins when exploring data.

As an example, a user could point the user_id column in their dataset to the User dimension. Then, when exploring data, they would be able to filter or group by columns in the user dimensions (eg, user.country). Superset would join the two tables automatically for the user.

Automatically performing joins can be very expensive, so this still requires considerable planning and discussion.

Editing datasets

To encourage users to create well curated datasets we want to offer a dataset editor that is easy to use, while still targeting power users. This should be available in the main call-to-action in the Superset UI for content creation — the "+" button that currently allow users to create queries, charts and dashboards. In addition to those items, we should also allow users to create datasets.

The new dataset creation flow should allow users to create a new dataset from an existing table, like the current existing flow. But the new flow should also encourage users to select only the columns that they're interested in, annotating columns with additional metadata and creating metrics, reinforcing the fact that the dataset is more than just a table.

For power users we want to leverage the fact that the new export capabilities of Superset produce readable YAML, and provide a YAML editor so that users can create and update datasets quickly, with the possibility of using source control for versioning and storage. Instead of implementing the editor from scratch it's possible to embed the Monaco editor in Superset, which provides support for editing YAML.

For both cases, ideally the editor would allow us to integrate tightly with SQL Lab, providing:

syntax validation;
rich autocomplete when writing SQL definitions for metrics, columns and sources;
the ability to preview sample rows from the dataset.

Exploring tables

One common source of confusion in Superset is that many users (including developers!) don't understand the need to add a table as a dataset before being able to explore it. One way to solve this problem is creating a dataset on-the-fly if none exists, so that the user can visualize a table they know to exist. The dataset can then be discarded if the chart is not saved, or created when the chart is saved.

The text was updated successfully, but these errors were encountered:

john-bodley · 2021-06-04T01:49:31Z

@betodealmeida thanks for the excellent worded SIP. I'm in agreement with the direction. A few questions:

Tables

When/how are these added?
Is there any plan for these to be regularly synced to ensure they accurately represent the underlying data warehouse (see SIP-7 for specifics)?
If (2) How are differences resolved between columns in the Table and Dataset models?

Datasets

The source_type/source_id polymorphic association is somewhat akin to the current datasource_type/datasource_id association in the Datasource model, used primarily to differentiate between a Druid NoSQL datasource and SQL table where the datasource_type determines which table to join with. Having this necessary explicit (weak) relationship rather than an implicit (strong) relationship (courtesy of a foreign-key) has never seemed desirable from a data integrity perspective, and with the deprecation of Druid NoSQL I was hoping we could remove said pattern.

I realize there is no was to define a polymorphic associations using SQL constraints given foreign key constraint always references one target table, but I was wondering if one of the alternative options outlined here would be more desirable, i.e., a super-table or using two columns (preferred for simplicity reasons).

cccs-rc · 2021-07-14T20:17:37Z

I'm eager to see this progress! Just wanted to mention an open PR my team has in progress (#15154) regarding business types for columns. Hoping we can incorporate business types into these new models as well.

betodealmeida · 2021-10-27T22:46:02Z

When/how are these added?

I'm going to create the model (step 1), and later have a migration using SqlaTable to populate Table and Dataset (step 2). I'm planning to start working on it as soon as we have consensus.

Is there any plan for these to be regularly synced to ensure they accurately represent the underlying data warehouse (see SIP-7 for specifics)?

No plan, but we definitely should do it.

If (2) How are differences resolved between columns in the Table and Dataset models?

That's a hard question, I need to think about it. But I think it's out of scope for now.

betodealmeida · 2021-10-27T22:59:31Z

@john-bodley can you take another look? I updated the SIP, and addressed the relationship issue you brought up.

@eschutho do you want to take another look as well? I removed the relationship between Dataset and Query to simplify the implementation.

@mistercrunch can you take a look as well? Thanks!

john-bodley · 2021-10-29T18:56:56Z

@betodealmeida overall this LGTM. I have one question regarding the Dataset.expression attribute, wouldn't it be simpler to make it optional and thus would be NULL for datasets which represent physical database entities? Granted it makes your SQL substitution logic more complex but it could provide a cleaner mechanism of differentiating between physical and virtual datasets.

betodealmeida · 2021-11-02T22:57:05Z

@betodealmeida overall this LGTM. I have one question regarding the Dataset.expression attribute, wouldn't it be simpler to make it optional and thus would be NULL for datasets which represent physical database entities? Granted it makes your SQL substitution logic more complex but it could provide a cleaner mechanism of differentiating between physical and virtual datasets.

@john-bodley I really like the symmetry of having just an expression for both virtual and physical datasets, since it simplifies building the SQL for a given dataset — though I agree it makes it harder for differentiating between them.

What are your thoughts on having a boolean attribute, is_physical?

mayurnewase · 2021-11-02T23:44:43Z

To create a physical dataset we first need to add table?
or we choose table directly from database and schemas.

betodealmeida · 2021-11-03T00:01:28Z

To create a physicle dataset we first need to add table right? or we choose table directly from database and schemas.

In the UI just choose, and we'll create the Table instance behind the scenes. Users will never create Table objects directly.

rumbin · 2022-01-22T19:38:22Z

I wonder to what extent it makes sense to lend some of the modeling abstraction of related tools like dbt or of competitors like Looker (LookML).
I suppose that many users would love to see some convergence with dbt: The model names, the newly introduced metrics layer, the ref() syntax. Much of that could be taken advantage of.
(BTW, do we have any statistics of the number of Superset users who als use dbt?).
In this context #18098 could be used as a starting point for synchronization of the models.

stale · 2022-04-17T19:56:09Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

rusackas · 2023-09-06T16:56:08Z

@betodealmeida what's the situation on this? We're closing this Issue since it's been approved, and the Issue is no longer needed/actionable, but leaving it the issue in the SIP board as "Approved" rather than "Implemented". I assume there's still intent to implement anything here that hasn't been implemented already?

junlincc added the sip Superset Improvement Proposal label Jun 1, 2021

betodealmeida changed the title ~~[WIP] A better model for Datasets~~ [SIP-68] A better model for Datasets Jun 2, 2021

EBoisseauSierra mentioned this issue Jun 9, 2021

Impossible to use “Calculated columns” in “Custom metrics”, as they are not expanded. #15060

Closed

3 tasks

rusackas added this to DISCUSS in SIPs Jun 24, 2021

betodealmeida mentioned this issue Nov 11, 2021

feat: new models for SIP-68 #17395

Closed

9 tasks

junlincc added viz:charts:deck.gl Related to deck.gl charts and removed viz:charts:deck.gl Related to deck.gl charts labels Nov 17, 2021

junlincc mentioned this issue Nov 18, 2021

[Community Announcement]Nov 17 Community growth and successful first-time contribution! #17473

Closed

This was referenced Nov 24, 2021

update tables unique key #15833

Closed

feat: new dataset/table/column models #17543

Merged

betodealmeida moved this from DISCUSS to APPROVED in SIPs Dec 8, 2021

This was referenced Dec 22, 2021

[SIP-78] Proposal for Advanced Types #17852

Closed

Multiple tables join queries to produce a single chart #12203

Closed

betodealmeida mentioned this issue Jan 22, 2022

Move to monaco-editor #13838

Closed

This was referenced Mar 29, 2022

perf(alembic): paginize db migration for new dataset models #19406

Closed

perf: refactor SIP-68 db migrations with INSERT SELECT FROM #19421

Merged

stale bot added the inactive Inactive for >= 30 days label Apr 17, 2022

rusackas assigned betodealmeida Dec 8, 2022

stale bot removed the inactive Inactive for >= 30 days label Dec 8, 2022

michael-s-molina closed this as completed Sep 6, 2023

betodealmeida mentioned this issue May 30, 2024

chore: remove sl_ tables #28704

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SIP-68] A better model for Datasets #14909

[SIP-68] A better model for Datasets #14909

betodealmeida commented May 28, 2021 •

edited

Loading

john-bodley commented Jun 4, 2021 •

edited

Loading

cccs-rc commented Jul 14, 2021

betodealmeida commented Oct 27, 2021

betodealmeida commented Oct 27, 2021

john-bodley commented Oct 29, 2021

betodealmeida commented Nov 2, 2021

mayurnewase commented Nov 2, 2021 •

edited

Loading

betodealmeida commented Nov 3, 2021

rumbin commented Jan 22, 2022

stale bot commented Apr 17, 2022

rusackas commented Sep 6, 2023

[SIP-68] A better model for Datasets #14909

[SIP-68] A better model for Datasets #14909

Comments

betodealmeida commented May 28, 2021 • edited Loading

[SIP] A better model for Datasets

Motivation

Proposed Change

New or Changed Public Interfaces

Examples

New dependencies

Migration Plan and Compatibility

Rejected Alternatives

Single model

Query reference in Dataset

Future work

Dimensions

Editing datasets

Exploring tables

john-bodley commented Jun 4, 2021 • edited Loading

cccs-rc commented Jul 14, 2021

betodealmeida commented Oct 27, 2021

betodealmeida commented Oct 27, 2021

john-bodley commented Oct 29, 2021

betodealmeida commented Nov 2, 2021

mayurnewase commented Nov 2, 2021 • edited Loading

betodealmeida commented Nov 3, 2021

rumbin commented Jan 22, 2022

stale bot commented Apr 17, 2022

rusackas commented Sep 6, 2023

betodealmeida commented May 28, 2021 •

edited

Loading

`Query` reference in `Dataset`

john-bodley commented Jun 4, 2021 •

edited

Loading

mayurnewase commented Nov 2, 2021 •

edited

Loading