feat(SIP-85): OAuth2 for databases #27631

Merged: 10 commits merged into master, Apr 3, 2024
Conversation

@betodealmeida (Member) commented Mar 23, 2024

SUMMARY

This PR introduces a new table called database_user_oauth2_tokens. The table is used for storing personal user tokens associated with a given database:

# \d database_user_oauth2_tokens
                                               Table "public.database_user_oauth2_tokens"
         Column          |            Type             | Collation | Nullable |                         Default
-------------------------+-----------------------------+-----------+----------+---------------------------------------------------------
 created_on              | timestamp without time zone |           |          |
 changed_on              | timestamp without time zone |           |          |
 id                      | integer                     |           | not null | nextval('database_user_oauth2_tokens_id_seq'::regclass)
 user_id                 | integer                     |           | not null |
 database_id             | integer                     |           | not null |
 access_token            | bytea                       |           |          |
 access_token_expiration | timestamp without time zone |           |          |
 refresh_token           | bytea                       |           |          |
 created_by_fk           | integer                     |           |          |
 changed_by_fk           | integer                     |           |          |
Indexes:
    "database_user_oauth2_tokens_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "database_user_oauth2_tokens_changed_by_fk_fkey" FOREIGN KEY (changed_by_fk) REFERENCES ab_user(id)
    "database_user_oauth2_tokens_created_by_fk_fkey" FOREIGN KEY (created_by_fk) REFERENCES ab_user(id)
    "database_user_oauth2_tokens_database_id_fkey" FOREIGN KEY (database_id) REFERENCES dbs(id)
    "database_user_oauth2_tokens_user_id_fkey" FOREIGN KEY (user_id) REFERENCES ab_user(id)

Whenever a SQLAlchemy engine is instantiated, the personal user token (or None) will be passed to the get_url_for_impersonation method in the DB engine spec, so that a custom URL can be built for the user. For example, for GSheets:

def get_url_for_impersonation(
    cls,
    url: URL,
    impersonate_user: bool,
    username: str | None,
    access_token: str | None,  # <== here
) -> URL:
    if not impersonate_user:
        return url

    if username is not None:
        user = security_manager.find_user(username=username)
        if user and user.email:
            url = url.update_query_dict({"subject": user.email})

    if access_token:
        url = url.update_query_dict({"access_token": access_token})

    return url
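
To make the effect concrete, here is a standalone illustration (placeholder values, not code from the PR) of how those update_query_dict calls amend the URL:

from sqlalchemy.engine.url import make_url

url = make_url("gsheets://")
url = url.update_query_dict({"subject": "alice@example.com"})
url = url.update_query_dict({"access_token": "ya29.a0Af..."})
# Query values are URL-encoded when rendered, e.g.:
# gsheets://?subject=alice%40example.com&access_token=ya29.a0Af...
print(url)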

The change allows users to log in to databases like BigQuery, Snowflake, Dremio, Databricks, Google Sheets, etc. using their own credentials. This makes it easier to set up databases, since service accounts are no longer required, and provides better isolation of data between users. Only support for Google Sheets is implemented in this PR, and it's considered the reference implementation. Note that a newer version of Shillelagh is required, since a change in the Google Auth API introduced a regression.

In order to populate the table with personal access tokens, the DB engine spec checks for a specific exception that signals that OAuth2 should start:

class BaseEngineSpec:
    def execute(...) -> None:
        try:
            cursor.execute(query)
        except cls.oauth2_exception as ex:  # <== here
            if cls.is_oauth2_enabled() and g.user:
                cls.start_oauth2_dance(database_id)
            raise cls.get_dbapi_mapped_exception(ex) from ex
        except Exception as ex:
            raise cls.get_dbapi_mapped_exception(ex) from ex
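
For GSheets this maps naturally onto the driver's own exception; a sketch of what opting in looks like (based on the mechanism above, not a verbatim excerpt from the PR):

from shillelagh.exceptions import UnauthenticatedError

class GSheetsEngineSpec(BaseEngineSpec):
    # Raised by the shillelagh driver when no valid user credentials exist;
    # execute() catching it is what kicks off the OAuth2 dance.
    oauth2_exception = UnauthenticatedError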

When called, the start_oauth2_dance method will return the error OAUTH2_REDIRECT to the frontend. The error is captured by the ErrorMessageWithStackTrace component, which provides a link to the user so they can start the OAuth2 authentication. Since this is implemented at the DB engine spec level, any query will trigger it — in SQL Lab, Explore, or dashboards — see the screenshots below for the UX.
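
As a rough sketch of the kickoff itself (the helper names and exception class are assumptions, not the PR's exact API; the essential contract is that the raised error carries the provider's authorization URL for the frontend to link to):

@classmethod
def start_oauth2_dance(cls, database_id: int) -> None:
    # Hypothetical helpers: read the client config, build the provider's
    # authorization URL, and raise an error that serializes to OAUTH2_REDIRECT.
    oauth2_config = cls.get_oauth2_config()
    auth_url = cls.get_oauth2_authorization_uri(oauth2_config, database_id)
    raise OAuth2RedirectError(auth_url)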

Note that while the current implementation triggers OAuth2 when a query needs authorization, we could also implement affordances in the database UI to manually trigger OAuth2 to store the personal access tokens. This could be done in the future.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

SQL Lab. Note that the query runs automatically once OAuth is completed:

SIP-85.Sql.Lab.mov

Explore. Note that the chart is automatically updated after OAuth:

SIP-85.Explore.mov

Same thing for dashboards:

SIP-85.Dashboard.mov

TESTING INSTRUCTIONS

  1. Create a Google Sheets database.
  2. Create a Google OAuth2 application of type "Web application" at https://console.cloud.google.com/apis/credentials/oauthclient/.
  3. Edit superset_config.py and add the client ID and secret:

DATABASE_OAUTH2_CREDENTIALS = {
    "Google Sheets": {
        "CLIENT_ID": "XXX.apps.googleusercontent.com",
        "CLIENT_SECRET": "GOCSPX-YYY",
    },
}

  4. In SQL Lab, try to query a sheet that is not shared publicly. It should trigger OAuth2.
  5. Add the sheet as a dataset and create a chart.
  6. Delete the tokens from the database:

DELETE FROM database_user_oauth2_tokens;

  7. Reload the chart. It should trigger OAuth2.
  8. Add the chart to a dashboard, delete the tokens, and reload the dashboard. It should trigger OAuth2.

ADDITIONAL INFORMATION

  • Has associated issue: [SIP-85] OAuth2 for databases #20300
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@github-actions github-actions bot added risk:db-migration PRs that require a DB migration api Related to the REST API labels Mar 23, 2024

codecov bot commented Mar 23, 2024

Codecov Report

Attention: Patch coverage is 89.96416%, with 28 lines in your changes missing coverage. Please review.

Project coverage is 69.96%. Comparing base (883e455) to head (59d4daf).
Report is 28 commits behind head on master.

Files Patch % Lines
superset/db_engine_specs/base.py 68.29% 13 Missing ⚠️
superset/utils/lock.py 86.48% 5 Missing ⚠️
.../components/ErrorMessage/OAuth2RedirectMessage.tsx 92.10% 0 Missing and 3 partials ⚠️
superset/utils/oauth2.py 96.42% 2 Missing ⚠️
superset-frontend/src/setup/setupErrorMessages.ts 0.00% 1 Missing ⚠️
superset/databases/api.py 95.45% 1 Missing ⚠️
superset/db_engine_specs/gsheets.py 97.14% 1 Missing ⚠️
superset/models/core.py 93.75% 1 Missing ⚠️
superset/sql_lab.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #27631      +/-   ##
==========================================
+ Coverage   69.89%   69.96%   +0.07%     
==========================================
  Files        1911     1916       +5     
  Lines       75024    75377     +353     
  Branches     8355     8403      +48     
==========================================
+ Hits        52435    52741     +306     
- Misses      20539    20571      +32     
- Partials     2050     2065      +15     
Flag Coverage Δ
hive 48.98% <57.08%> (+0.04%) ⬆️
javascript 57.54% <89.74%> (+0.06%) ⬆️
mysql 77.76% <60.00%> (-0.15%) ⬇️
postgres 77.90% <59.58%> (-0.13%) ⬇️
presto 53.66% <57.91%> (+0.01%) ⬆️
python 83.20% <90.00%> (+0.05%) ⬆️
sqlite 77.34% <59.58%> (-0.12%) ⬇️
unit 57.18% <87.50%> (+0.42%) ⬆️


@betodealmeida betodealmeida marked this pull request as ready for review March 24, 2024 00:50
@betodealmeida betodealmeida requested a review from a team as a code owner March 24, 2024 00:50
@john-bodley john-bodley self-requested a review March 25, 2024 16:53

@mistercrunch (Member) left a comment

Overall seems solid. My most important point is probably around adding a database index on the new model, the rest are comments/notes.

Resolved review threads on: superset/connectors/sqla/models.py, superset/exceptions.py, superset/utils/oauth2.py, superset/db_engine_specs/hive.py

@john-bodley (Member) commented:

@betodealmeida do we perceive there could/would be other authorization frameworks other than OAuth 2.0? If so I was wondering if there was merit in renaming database_user_oauth2_tokens to be something more generic and adding additional column(s) which define said frameworks.

@craig-rueda (Member) left a comment

Looks good so far! I think there's quite a few cases that could be tested. (Token delete/insert/ etc., the updated engine spec definitions, etc...)

@betodealmeida (Member, Author) commented:

> @betodealmeida do we perceive there could/would be other authorization frameworks other than OAuth 2.0? If so I was wondering if there was merit in renaming database_user_oauth2_tokens to be something more generic and adding additional column(s) which define said frameworks.

@john-bodley I'm not sure, to be honest. The beauty of OAuth2 is that the same flow is shared across multiple providers, all you need is an access token and a refresh token, so the same foundation works for BigQuery/GSheets/Snowflake/Dremio/Databricks.

The one case I can think of is if at some point we'd want users to be able to input their own username/password, but I think that trying to address potential future uses would increase the complexity without clear benefits.
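
(For context, that shared flow boils down to the standard refresh-token grant; a generic sketch, with only the token endpoint and client credentials varying per provider:)

import requests

def refresh_access_token(token_url: str, client_id: str, client_secret: str, refresh_token: str) -> dict:
    # Standard OAuth2 refresh grant, identical across Google, Snowflake, etc.
    response = requests.post(
        token_url,
        data={
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # fresh access_token plus expiry metadata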

@betodealmeida (Member, Author) commented:

> Looks good so far! I think there's quite a few cases that could be tested. (Token delete/insert/ etc., the updated engine spec definitions, etc...)

@craig-rueda I did test inserting/deleting the token in the API test. I'll add tests for the new DB engine spec methods.

Resolved review threads (outdated) on: superset/utils/oauth2.py

@harksin commented Mar 29, 2024

Hey there,
thanks for this very useful PR!
I can confirm that we use Superset with Trino and are interested in OAuth support for Trino, to get rid of user impersonation.

@betodealmeida (Member, Author) commented:

> I'm simply trying to bring up the fact that this solution wouldn't work in a use case that I would need it to work in if I were to use this feature.

The solution I'm proposing is a hybrid superset_config.py (this PR) + new model (future PR), which would work for both of our use cases. We'd have to introduce the new model and UI elements, but most of the logic of this PR would remain unchanged. I'm not against your proposal, on the contrary, I think it's a great idea and an elegant solution.

But what I'm hearing from you is that we should go with only the new model, where the client ID/secret live in the metadata database. That solution is very suboptimal for my use case, since it would require users to create and manage their own applications.

You mentioned:

> This is based on the assumption that the best UX for adding cloud db clients would still be through the UI

But that is definitely not true in my use case. Google Sheets is one of our most popular databases used at Preset, and it would be great if users could connect to private sheets without having to figure out how to create an OAuth2 application first. BigQuery is probably the most popular, and the faster our users can start exploring their data, the better.

You also mentioned:

> then the config based approach will anyway become redundant, and will cause both maintenance burden, deprecation and removal at some point.

I don't see why the config approach would have to be removed, since it complements the approach you're proposing. The default configuration for the feature is just an empty dictionary. We just need to add logic that checks for the client information in two different places, as we already do for many things in Superset — this is IMHO the main feature of Superset, how it's adaptable and configurable to so many different use cases.

And in fairness, I do think that if we have the hybrid approach most people will prefer to use the UI to configure OAuth2, because as you said, it's easier to update the client in a single place and changes don't require a redeploy. But that doesn't mean that it's a solution that works for everyone.

(As for Snowflake, my understanding is that we can derive the OAuth2 URL from the account name, which is part of the SQLAlchemy URL.)

@villebro (Member) commented:

> The solution I'm proposing is a hybrid superset_config.py (this PR) + new model (future PR), which would work for both of our use cases. We'd have to introduce the new model and UI elements, but most of the logic of this PR would remain unchanged. I'm not against your proposal, on the contrary, I think it's a great idea and an elegant solution.
>
> But what I'm hearing from you is that we should go with only the new model, where the client ID/secret live in the metadata database. That solution is very suboptimal for my use case, since it would require users to create and manage their own applications.

I'm totally ok with the hybrid approach if we make sure we're not painting ourselves into a corner with it. The reason I feel uneasy with implementing a "one client for all connections of type x" approach as the initial implementation is that I feel it's inherently atypical for this type of flow: While it may be optimal for this specific use case, I don't think it works generally for most other use cases.

> But that is definitely not true in my use case. Google Sheets is one of our most popular databases used at Preset, and it would be great if users could connect to private sheets without having to figure out how to create an OAuth2 application first. BigQuery is probably the most popular, and the faster our users can start exploring their data, the better.

I understand this. Again, my reservation stems from the fact that this is not a typical OAuth2 flow, i.e. you likely can't use this for the majority of OAuth2 connectivity use cases, but the alternative I'm proposing works for this one, too. But as I said, I'm not against being able to provide static creds in the config where applicable.

> I don't see why the config approach would have to be removed, since it complements the approach you're proposing. The default configuration for the feature is just an empty dictionary. We just need to add logic that checks for the client information in two different places, as we already do for many things in Superset — this is IMHO the main feature of Superset, how it's adaptable and configurable to so many different use cases.
>
> (As for Snowflake, my understanding is that we can derive the OAuth2 URL from the account name, which is part of the SQLAlchemy URL.)

But even if you dynamically generate the URI, you would still need to store the client creds somewhere, right? In other words, the user would still need to pass the client id and secret somehow for the backend to be able to use them. Quoting from https://docs.snowflake.com/en/user-guide/oauth-custom#request-header:

[screenshot: Snowflake custom OAuth documentation, request-header section]

@betodealmeida (Member, Author) commented:

@villebro for Snowflake the admin would add to config.py:

DATABASE_OAUTH2_CREDENTIALS = {
    "Snowflake": {
        "CLIENT_ID": "XXX",
        "CLIENT_SECRET": "YYY",
    },
}

Then once a Snowflake database is added the DB engine spec can determine the OAuth2 URL from the SQLAlchemy URI, and use the information from the config to authorize the user. We could also optionally have the URL in the config, for deployments where that is possible.
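
(A sketch of that derivation, assuming Snowflake's documented custom OAuth endpoint pattern; the helper itself is illustrative, not part of this PR:)

from sqlalchemy.engine.url import make_url

def get_snowflake_authorization_url(sqlalchemy_uri: str) -> str:
    # The host portion of a Snowflake SQLAlchemy URI is the account identifier,
    # and custom OAuth endpoints live under the per-account domain.
    account = make_url(sqlalchemy_uri).host  # e.g. "xy12345.us-east-1"
    return f"https://{account}.snowflakecomputing.com/oauth/authorize"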

I think we have two very different use cases, which is why we need these two workflows. For Preset, most users are using cloud DBs, for the same reason they're using cloud Superset — they don't want to run their own infrastructure. And we want to have them connected to their data as quickly and easily as possible. We want to allow them to use our own application, but the Admin users shouldn't have access to the application details.

In your case, you're running your own very dynamic infrastructure, you have dozens (hundreds?) of databases, and multiple custom OAuth2 providers. Your admins are tech-savvy and have no trouble connecting Superset to complex database deployments, creating OAuth2 applications, and so on.

You mentioned "reducing throwaway work". I don't think there's a lot of work that will be thrown away if later we implement your suggestion. Most of the logic will remain, the only thing that would change would be how we pass the configuration to the DB engine specs — currently they (and by "they" I mean GSheets) read it from the config, but we could pass it explicitly so it come either from the config or from the assigned client.

I'm happy to add that abstraction in the near future, before someone starts working on an n:n relationship between databases and OAuth2 clients.

@villebro (Member) commented:

> for Snowflake the admin would add to config.py:
>
> DATABASE_OAUTH2_CREDENTIALS = {
>     "Snowflake": {
>         "CLIENT_ID": "XXX",
>         "CLIENT_SECRET": "YYY",
>     },
> }
>
> Then once a Snowflake database is added the DB engine spec can determine the OAuth2 URL from the SQLAlchemy URI, and use the information from the config to authorize the user. We could also optionally have the URL in the config, for deployments where that is possible.

This inherently means that you'll only be able to connect to a single Snowflake account using OAuth2. I'm sure that may be fine for many deployments, but as a general pattern that's unnecessarily restrictive.

Again I want to reiterate that my purpose here isn't to block this work. Rather, I'm trying to bring up issues with this approach that I'm expecting to surface after this PR is merged when orgs start rolling this out to their deployments. For that reason, it would be great to have an understanding of when these follow-up tasks are expected to be done to call OAuth2 support complete. Maybe we could discuss this over a meeting to see which orgs can commit to this work, and when?

@betodealmeida (Member, Author) commented:

> This inherently means that you'll only be able to connect to a single Snowflake account using OAuth2. I'm sure that may be fine for many deployments, but as a general pattern that's unnecessarily restrictive.

No, you could add as many different Snowflake databases as you wanted. Each one would connect to a different OAuth2 URL, because each one would have a different SQLAlchemy URI. The client ID and client secret are not tied to a single account, unless I'm missing something.

I'm planning to add support for Snowflake next, so if we need any kind of refactoring to support it, or if we need to implement the full model that you're proposing, I would be the one doing that work.

@villebro (Member) commented:

> No, you could add as many different Snowflake databases as you wanted. Each one would connect to a different OAuth2 URL, because each one would have a different SQLAlchemy URI. The client ID and client secret are not tied to a single account, unless I'm missing something.

I would be surprised if one client works across multiple accounts. I haven't tried this personally, but I understand the process as follows:

  • Login to your Snowflake account
  • Issue a CREATE SECURITY INTEGRATION TYPE = OAUTH as per here
  • Use the creds from SYSTEM$SHOW_OAUTH_CLIENT_SECRETS when executing queries on your account as per here

And if you'd want to integrate with another account, you'd redo those steps on that account, and then use those creds when executing queries against that one.

@betodealmeida (Member, Author) commented Mar 29, 2024

> I would be surprised if one client works across multiple accounts. I haven't tried this personally, but I understand the process as follows [...]

I haven't tried it personally yet either, and if it doesn't work, it's OK — I'll then implement your proposal next.

@mistercrunch (Member) left a comment

Minor comments but overall LGTM. Stamping my approval but we may want another stamp from @villebro since he got deep in here already and seemed likely to push this further in the future.

@@ -542,6 +543,70 @@ The method `get_url_for_impersonation` updates the SQLAlchemy URI before every q

Alternatively, it's also possible to impersonate users by implementing the `update_impersonation_config`. This is a class method which modifies `connect_args` in place. You can use either method, and ideally they [should be consolidated in a single one](https://github.com/apache/superset/issues/24910).

### OAuth2

Review comment (Member):
NIT: feels like this belongs on the documentation website, probably as a new section "Connecting Users to Databases using OAuth2" under the "Installation and Configuration" section. This README is buried, maybe the guideline for using this readme would be for things that speak to developers working on the db_engine_specs package as opposed to admins looking to install/configure Superset. But the content is great! :)

Review comment (Member):
NIT: personally I like to break lines at 80-100 in docs/md, but there's no standard enforced there at the moment, so fine either way.


@betodealmeida (Member, Author) commented:
Right, this README documents all the functionality of the DB engine specs and is targeting developers, explaining the methods that are needed. I'm happy to write additional docs about OAuth2 in the main website, I'll do that.

@villebro (Member) left a comment

LGTM with a request for making it possible to define clients per database in the future, preferably via a dedicated client model.

@betodealmeida betodealmeida merged commit 9022f5c into master Apr 3, 2024
43 checks passed
jzhao62 pushed a commit to jzhao62/superset that referenced this pull request Apr 4, 2024
EandrewJones pushed a commit to UMD-ARLIS/superset that referenced this pull request Apr 5, 2024
EnxDev pushed a commit to EnxDev/superset that referenced this pull request Apr 12, 2024
@rusackas rusackas deleted the sip-85 branch April 16, 2024 16:52
qleroy pushed a commit to qleroy/superset that referenced this pull request Apr 28, 2024
Labels: api (Related to the REST API), preset-io, risk:db-migration (PRs that require a DB migration), size/XXL

10 participants