feat: a native SQLAlchemy dialect for Superset #14225

Merged on Aug 18, 2023 (10 commits)

betodealmeida
Member

@betodealmeida betodealmeida commented Apr 18, 2021

SUMMARY

This PR introduces a new SQLAlchemy dialect, superset://, together with a corresponding DB engine spec. With this, users can create a new database using the superset:// SQLAlchemy URI, and use it to write queries like this:

SELECT * FROM "examples.birth_names";

Queries can also join data from multiple databases, or even move data from one to another:

INSERT INTO "db1.table"
SELECT * FROM "db2.table";

The database can even query itself (not that that's useful):

-- these two are identical
SELECT * FROM "examples.bart_lines";
SELECT * FROM "Superset meta database.examples%2Ebart_lines";
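The second query above percent-encodes the period inside the table name as `%2E` so that it is not read as a separator. A minimal sketch of that naming convention follows. `encode_part` and `meta_table_name` are hypothetical helpers written for illustration, not functions from this PR, and escaping `%` as `%25` is an assumption made so the encoding stays reversible.

```python
def encode_part(part: str) -> str:
    """Escape one name component (hypothetical helper, not from the PR)."""
    # Assumption: escape "%" first so that a literal "%2E" in a name
    # cannot be confused with an encoded period.
    return part.replace("%", "%25").replace(".", "%2E")


def meta_table_name(database: str, *path: str) -> str:
    """Join a database name and the rest of the path with periods."""
    return ".".join(encode_part(p) for p in (database, *path))


def quoted(identifier: str) -> str:
    """Wrap an identifier in double quotes, doubling embedded quotes."""
    return '"' + identifier.replace('"', '""') + '"'


plain = meta_table_name("examples", "bart_lines")
nested = meta_table_name("Superset meta database", "examples.bart_lines")
query = f"SELECT * FROM {quoted(nested)};"
```

Here `query` reproduces the second statement shown above, with the inner period encoded.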

The dialect can only be used if the ENABLE_SUPERSET_META_DB feature flag is enabled; otherwise it is blocked and won't even show up in the list of available databases.
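As a sketch, the flag could be turned on through the `FEATURE_FLAGS` dictionary in `superset_config.py`. This assumes a standard deployment where that file overrides the default configuration.

```python
# superset_config.py (fragment)
# Assumption: the deployment loads this file as its config override.
FEATURE_FLAGS = {
    # Flag named in this PR; the dialect is blocked unless it is enabled.
    "ENABLE_SUPERSET_META_DB": True,
}
```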

While the dialect can be used to join across databases, users should be careful with big joins and expensive queries. Filtering, sorting, limiting and offsetting are pushed to the corresponding databases, but aggregations and joins happen in memory. For this reason it's recommended to enable asynchronous queries in the superset:// database, so that computations are executed in Celery workers instead of the web workers.

In addition, it's also possible to limit how much data is read from each database via the SUPERSET_META_DB_LIMIT configuration value, set initially to 1000.
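To make the cost model concrete, here is an illustrative sketch, not the dialect's actual implementation. It uses two in-memory SQLite databases as stand-ins for two Superset databases: the filter and row limit run inside each source, and the join runs in Python memory. `fetch`, `hash_join`, `ROW_LIMIT` and the table contents are all made up for the example.

```python
import sqlite3

ROW_LIMIT = 1000  # stands in for SUPERSET_META_DB_LIMIT


def fetch(conn, table, where, params):
    """Run the filter and LIMIT inside the source database (pushed down)."""
    # table and where are trusted literals in this sketch; values are bound.
    sql = f"SELECT * FROM {table} WHERE {where} LIMIT ?"
    return conn.execute(sql, (*params, ROW_LIMIT)).fetchall()


def hash_join(left, right, left_key, right_key):
    """Join two row lists in memory on the given column positions."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]


db1 = sqlite3.connect(":memory:")
db1.execute("CREATE TABLE users (id INTEGER, name TEXT)")
db1.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "ana"), (2, "bo"), (3, "cy")])

db2 = sqlite3.connect(":memory:")
db2.execute("CREATE TABLE orders (user_id INTEGER, total REAL)")
db2.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 9.5), (1, 3.0), (3, 7.25), (4, 1.0)])

users = fetch(db1, "users", "id >= ?", (1,))        # filtered in db1
orders = fetch(db2, "orders", "total > ?", (2.0,))  # filtered in db2
joined = hash_join(users, orders, 0, 0)             # joined in memory
```

Every row that survives the pushed-down filter and limit is held in the worker's memory before the join, which is why large joins are better run asynchronously.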

The dialect uses Superset's security manager to prevent users from accessing unauthorized databases. DML is supported as long as DML is enabled in the superset:// database and all the related databases. It's also possible to limit the databases that the dialect has access to by setting allowed_dbs in the engine parameters. For example:

{"allowed_dbs":["Google Sheets","examples"]}
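The effect of such an allow-list can be sketched as a simple filter. `allowed_databases` is a hypothetical helper, not code from this PR, and treating a missing `allowed_dbs` key as "no restriction" is an assumption.

```python
import json


def allowed_databases(engine_params_json, visible_dbs):
    """Return the databases reachable under an optional allow-list."""
    params = json.loads(engine_params_json)
    allowed = params.get("allowed_dbs")
    if allowed is None:
        # Assumption: no allow-list means every visible database is usable.
        return list(visible_dbs)
    return [db for db in visible_dbs if db in allowed]


visible = ["examples", "Google Sheets", "warehouse"]
restricted = allowed_databases(
    '{"allowed_dbs":["Google Sheets","examples"]}', visible
)
unrestricted = allowed_databases("{}", visible)
```

The security manager checks described above would still apply on top of this list.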

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Screenshot 2023-08-10 at 11-38-17 Superset

TEST PLAN

Added unit tests.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@rumbin
Contributor

rumbin commented Apr 19, 2021

Sounds interesting! Could you provide some possible Use Cases? Is this a step towards paving the path to a LookML-like modeling layer?

@betodealmeida
Member Author

Sounds interesting! Could you provide some possible Use Cases? Is this a step towards paving the path to a LookML-like modeling layer?

Yeah, that's my vision. Improving the semantic layer in Superset so we can do things like:

  1. Define dimension tables, declare the relationship ("foreign keys") in datasets, and have auto-joins in the Explore view, regardless of where the tables live. There are some performance concerns here, so we'd probably want to limit the cardinality of the dimensions.
  2. Move data between DBs. This dialect supports DML, so you can SELECT from a table in one DB and INSERT into a table in another. This could also help materialize datasets/dimensions in different databases to improve JOINs.
  3. Build connectors to APIs. This dialect is built on top of shillelagh, which helps build SQL connectors to APIs. We can selectively enable in Superset some of the adapters supported by shillelagh. Imagine using a Google Calendar to annotate events on a time series, for example.

@rumbin
Copy link
Contributor

rumbin commented Apr 20, 2021

Thanks for the explanation.
I wonder: if all tables of a query built by Explore are accessible by the same DB connection -- e.g. different DBs, but the same Snowflake account -- will this method be smart enough to push down the whole query? Our vision is to integrate all relevant sources into Snowflake, in order to have them available for arbitrary client software. So, having a modeling layer would be highly welcome, but we would not want to query and merge from different connections.

BTW, is this the right place for such a discussion, or would Slack or the mailing list be more appropriate?

@betodealmeida
Member Author

I wonder, if all tables of a query built by explore are accessible by the same DB connection -- e.g. different DBs, but the same Snowflake account -- will this method be smart enough to push down the whole query?

It would not, but that's an interesting use case. Presto has a similar use case: because we don't support multiple catalogs in a single DB, you need to create a DB per catalog.

@betodealmeida
Member Author

OK, I think this is ready for its annual review! 😆

@betodealmeida
Member Author

@danilomo I think this shillelagh could be the basis of creating connectors to non-relational databases. This is what I need at the moment. I wanted to plot data from a non-relational source, but Superset only understands SQLAlchemy dialects.

This might interest you: https://preset.io/blog/accessing-apis-with-superset/

@betodealmeida betodealmeida removed the request for review from mistercrunch August 11, 2023 15:14
@mdeshmu
Contributor

mdeshmu commented Aug 14, 2023

@michael-s-molina do we plan to include this in 3.0?

@michael-s-molina
Member

@michael-s-molina do we plan to include this in 3.0?

Given the experimental nature of this feature, I think it's a good idea to give it some time to mature. It will likely be included in a minor release.

@betodealmeida betodealmeida requested review from john-bodley and removed request for dpgaspar August 18, 2023 18:46
Member

@john-bodley john-bodley left a comment

@betodealmeida I left one small comment, but otherwise this LGTM.

@@ -67,6 +67,10 @@ def get_git_sha() -> str:
"sqlalchemy.dialects": [
"postgres.psycopg2 = sqlalchemy.dialects.postgresql:dialect",
"postgres = sqlalchemy.dialects.postgresql:dialect",
"superset = superset.extensions.metadb:SupersetAPSWDialect",
Member

Would dialects be a more appropriate name than metadb?

Member Author

Hmmm, good point. I'm not sure if we're ever going to have multiple SQLAlchemy dialects defined in Superset, which is why I named it this way.

@betodealmeida betodealmeida merged commit 6b660c8 into apache:master Aug 18, 2023
33 checks passed
@betodealmeida
Member Author

@mdeshmu mdeshmu mentioned this pull request Aug 19, 2023
9 tasks
@danilomo

danilomo commented Aug 20, 2023

@danilomo I think this shillelagh could be the basis of creating connectors to non-relational databases. This is what I need at the moment. I wanted to plot data from a non-relational source, but Superset only understands SQLAlchemy dialects.

This might interest you: https://preset.io/blog/accessing-apis-with-superset/

Thanks a lot, man!

Edit

"merged"

Wait, what? This is fantastic, lol

Congratulations!

@ozbillwang

It appears that the latest release, v3.0.1 (which was released two weeks ago), still doesn't include this feature.

However, this feature has been mentioned in https://superset.apache.org/docs/databases/meta-database/

image

When will we have access to this feature?

If I can't wait, how can I build the application with the latest code? Could you please provide instructions?

@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 3.1.0 labels Feb 19, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels .pinned Draws attention size/XL 🚢 3.1.0