[SIP-85] OAuth2 for databases #20300

Closed
betodealmeida opened this issue Jun 7, 2022 · 23 comments
@betodealmeida
Member

betodealmeida commented Jun 7, 2022

[SIP-85] OAuth2 for databases

Motivation

Modern cloud databases (e.g., Snowflake, BigQuery, and Google Sheets) support OAuth2 for user authentication and authorization. Connecting Superset to these databases today, especially Google Sheets, is a complicated multi-step process that requires users to create a service account, which can potentially give access to all the sheets inside an organization (if "impersonate user" is not enabled).

Superset could offer a much better experience to users by supporting OAuth2 directly. Users would simply grant the necessary permissions to Superset, and Superset would store and use the access tokens when running queries. This would remove the complexity of setting up these databases and eliminate the risk of granting access to confidential tables and/or sheets.

Proposed Change

We propose supporting OAuth2 directly in Superset. When accessing a resource (a table or a Google sheet) that requires OAuth2, the user would be redirected to the authorization server, where they can grant the necessary permissions. After the OAuth2 flow completes, the access and refresh tokens would be stored in a table with a grain of (database, user) and reused in subsequent connections.

The following video shows the process in action:

oauth2.mov

New or Changed Public Interfaces

This proposal adds a new table and a new endpoint. It also modifies modify_url_for_impersonation to pass a new parameter, the OAuth2 access_token. The SIP will first be implemented for Google Sheets, but the implementation is database-agnostic, with all database-specific logic living in the engine spec.

To store the user tokens a new table database_user_oauth2_tokens would be created, with a corresponding SQLAlchemy model:

user_id | database_id | access_token | access_token_expiration | refresh_token
1       | 1           | XXX          | 2022-06-06 09:00:00Z    | YYY

(The access token and refresh token are stored in encrypted columns, similar to other database credentials.)
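
For illustration, a minimal sketch of what the SQLAlchemy model could look like; the class name and column types are assumptions, and in practice the token columns would use the same encrypted column type as other database credentials:

from flask_appbuilder import Model
from sqlalchemy import Column, DateTime, ForeignKey, Integer, Text

class DatabaseUserOAuth2Token(Model):  # hypothetical name; the final model may differ
    """OAuth2 tokens stored at the (database, user) grain."""

    __tablename__ = "database_user_oauth2_tokens"

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("ab_user.id"), nullable=False)
    database_id = Column(Integer, ForeignKey("dbs.id"), nullable=False)
    # shown as Text here; these two columns would be encrypted, like other credentials
    access_token = Column(Text)
    access_token_expiration = Column(DateTime)
    refresh_token = Column(Text)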

Here the user with ID 1 has an access token XXX for the database with ID 1. When they run a query on that database, Superset will pass the access token to the database engine spec via modify_url_for_impersonation. For Google Sheets, using this token is as simple as:

class GSheetsEngineSpec(SqliteEngineSpec):
    ...
    @classmethod
    def modify_url_for_impersonation(
        cls,
        url: URL,
        impersonate_user: bool,
        username: Optional[str],
        access_token: Optional[str] = None,  # added in this SIP
    ) -> None:
        if impersonate_user and access_token:
            url.query["access_token"] = access_token

Different DB engine specs might have to handle the token in a different way.

When passing the token to the DB engine spec, Superset will check whether it has expired. If it has expired and a refresh token is available, Superset will refresh the token and store the new one by calling a DB engine spec method. For Google Sheets:

class GSheetsEngineSpec(SqliteEngineSpec):
    ...
    @staticmethod
    def get_oauth2_fresh_token(refresh_token: str) -> OAuth2TokenResponse:
        response = http.request(  # type: ignore
            "POST",
            "https://oauth2.googleapis.com/token",
            fields={
                "client_id": current_app.config["GSHEETS_OAUTH2_CLIENT_ID"],
                "client_secret": current_app.config["GSHEETS_OAUTH2_CLIENT_SECRET"],
                "refresh_token": refresh_token,
                "grant_type": "refresh_token",
            },
        )
        return json.loads(response.data.decode("utf-8"))

Refreshing tokens is part of the OAuth2 spec, but the details are left to each DB engine spec. For Google Sheets, we need to POST a payload to https://oauth2.googleapis.com/token, as the example above shows.
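
To make the flow concrete, here is a rough sketch of how the core check-and-refresh logic could look before the token is handed to the DB engine spec; the function name, the token row shape, and the response fields are assumptions, not part of the reference implementation:

from datetime import datetime, timedelta, timezone

def get_valid_access_token(token_row, db_engine_spec):
    """Return a non-expired access token, refreshing it via the engine spec if needed.

    `token_row` is a row from database_user_oauth2_tokens (hypothetical shape).
    """
    expiration = token_row.access_token_expiration
    if expiration is None or expiration > datetime.now(timezone.utc):
        return token_row.access_token

    if not token_row.refresh_token:
        return None  # no refresh token: the OAuth2 flow has to be restarted

    # database-specific refresh, e.g. GSheetsEngineSpec.get_oauth2_fresh_token above
    response = db_engine_spec.get_oauth2_fresh_token(token_row.refresh_token)
    token_row.access_token = response["access_token"]
    token_row.access_token_expiration = datetime.now(timezone.utc) + timedelta(
        seconds=response["expires_in"]
    )
    # the updated row would then be committed back to the metadata database
    return token_row.access_token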

How are these tokens initially obtained and stored? Each DB engine spec should declare whether it supports OAuth2 via an is_oauth2_enabled method, for example:

from shillelagh.exception import UnauthenticatedError

class GSheetsEngineSpec(SqliteEngineSpec):
    ...
    @staticmethod
    def is_oauth2_enabled() -> bool:
        return (
            "GSHEETS_OAUTH2_CLIENT_ID" in current_app.config
            and "GSHEETS_OAUTH2_CLIENT_SECRET" in current_app.config
        )

    oauth2_exception = UnauthenticatedError

(The default method in the base class returns False.)

Each DB engine spec also defines an exception oauth2_exception that is raised when OAuth2 is needed (UnauthenticatedError in the example above). Superset will capture this exception and start the OAuth2 flow by raising a custom error that the frontend understands:

class BaseEngineSpec:
    ...
    @classmethod
    def execute(
        cls,
        cursor: Any,
        query: str,
        database_id: int,
        **kwargs: Any,
    ) -> None:
        try:
            cursor.execute(query)
        except cls.oauth2_exception as ex:
            if cls.is_oauth2_enabled():
                oauth_url = cls.get_oauth2_authorization_uri(database_id)
                raise OAuth2RedirectError(oauth_url) from ex
            raise  # OAuth2 not configured for this database: propagate the original error

Here the base engine spec calls the database-specific method get_oauth2_authorization_uri, which returns the URL of the authorization server.
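
For Google Sheets, that method could look roughly like the sketch below; the Google endpoint is the standard OAuth2 authorization URL, but the scope, the exact signature, and the encode_oauth2_state helper (sketched further down) are assumptions:

from urllib.parse import urlencode

from flask import current_app, g, url_for

class GSheetsEngineSpec(SqliteEngineSpec):
    ...
    @staticmethod
    def get_oauth2_authorization_uri(database_id: int) -> str:
        params = {
            "client_id": current_app.config["GSHEETS_OAUTH2_CLIENT_ID"],
            "redirect_uri": url_for("DatabaseRestApi.oauth2", _external=True),
            "response_type": "code",
            # scope shown for illustration only
            "scope": "https://www.googleapis.com/auth/spreadsheets.readonly",
            "access_type": "offline",  # required to receive a refresh token
            # signed JWT carrying user_id/database_id (see the state discussion below)
            "state": encode_oauth2_state(g.user.id, database_id),
        }
        return "https://accounts.google.com/o/oauth2/v2/auth?" + urlencode(params)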

When authenticating with OAuth2 it's common to pass a state parameter to the authorization server to, as the name suggests, maintain state. That parameter is returned unmodified from the authorization server, and we can use it to verify that the request is valid and came from Superset. In this implementation, our state is a JWT signed with current_app.config["SECRET_KEY"], with the following payload:

{"user_id" 1, "database_id" 1}

The user should be redirected back from the authorization server to a new endpoint, api/v1/databases/oauth2/. That endpoint will receive the state JWT and a code query argument. The backend validates the JWT and extracts the user and database IDs. It then exchanges the code for the access and refresh tokens. Since this is database-specific, we call a method on the DB engine spec; for Google Sheets:

class GSheetsEngineSpec(SqliteEngineSpec):
    ...
    @staticmethod
    def get_oauth2_token(code: str) -> OAuth2TokenResponse:
        redirect_uri = current_app.config.get(
            "GSHEETS_OAUTH2_REDIRECT_URI",
            url_for("DatabaseRestApi.oauth2", _external=True),
        )

        response = http.request(  # type: ignore
            "POST",
            "https://oauth2.googleapis.com/token",
            fields={
                "code": code,
                "client_id": current_app.config["GSHEETS_OAUTH2_CLIENT_ID"],
                "client_secret": current_app.config["GSHEETS_OAUTH2_CLIENT_SECRET"],
                "redirect_uri": redirect_uri,
                "grant_type": "authorization_code",
            },
        )
        return json.loads(response.data.decode("utf-8"))

Note that there is an optional GSHEETS_OAUTH2_REDIRECT_URI configuration key. It is not needed in most situations (but is required at Preset, where we manage multiple Superset instances).

After the code is exchanged for the access and refresh tokens, they are saved to the database_user_oauth2_tokens table.
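
Putting the pieces together, the callback endpoint could look roughly like this; the route matches the proposal, but the API base class, the DatabaseDAO lookup, and the save_oauth2_tokens helper are assumptions for illustration (decode_oauth2_state is the helper sketched above):

from typing import Any

from flask import request
from flask_appbuilder.api import BaseApi, expose

class DatabaseRestApi(BaseApi):
    ...
    @expose("/oauth2/", methods=["GET"])
    def oauth2(self) -> Any:
        # validate the signed state and recover who initiated the flow
        payload = decode_oauth2_state(request.args["state"])
        code = request.args["code"]

        database = DatabaseDAO.find_by_id(payload["database_id"])  # hypothetical lookup
        token_response = database.db_engine_spec.get_oauth2_token(code)

        # upsert into database_user_oauth2_tokens at the (database, user) grain
        save_oauth2_tokens(payload["user_id"], payload["database_id"], token_response)  # hypothetical helper
        return self.response(200)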

New dependencies

No new dependencies, though the reference implementation uses a few dependencies-of-dependencies not listed explicitly in Superset's setup.py (urllib3 and yarl).

Migration Plan and Compatibility

A new migration is needed to create the database_user_oauth2_tokens table.
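
As a sketch, the Alembic migration would be along these lines; the column types are illustrative, and the token columns would use the encrypted type used for other credentials:

import sqlalchemy as sa
from alembic import op

def upgrade():
    op.create_table(
        "database_user_oauth2_tokens",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("user_id", sa.Integer, sa.ForeignKey("ab_user.id"), nullable=False),
        sa.Column("database_id", sa.Integer, sa.ForeignKey("dbs.id"), nullable=False),
        sa.Column("access_token", sa.Text),  # encrypted in practice
        sa.Column("access_token_expiration", sa.DateTime),
        sa.Column("refresh_token", sa.Text),  # encrypted in practice
    )

def downgrade():
    op.drop_table("database_user_oauth2_tokens")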

Rejected Alternatives

There are not many alternative ways to solve the problem of authenticating to a database using OAuth2. The table is needed, since we store personal tokens at the (database, user) grain. We need a new endpoint, to receive the code from the authorization server. Passing the access_token to modify_url_for_impersonation seems like the cleanest way of supporting OAuth2 without having to refactor all the engine specs.

@betodealmeida betodealmeida added the sip Superset Improvement Proposal label Jun 7, 2022
@rusackas rusackas added this to ACTIVE DISCUSSION in SIPs Jun 13, 2022
@betodealmeida betodealmeida changed the title [SIP-83] OAuth2 for databases [SIP-85] OAuth2 for databases Jun 13, 2022
@padbk

padbk commented Jun 15, 2022

Yes please, this would be wonderful ❤️

Our use case for this is Redshift and Aurora Postgres with Microsoft AAD.

@betodealmeida
Member Author

@phildum do you have any docs on how to use OAuth2 in those cases?

@Samira-El
Contributor

We, at Wise, have implemented this for Snowflake in the Superset custom config file; it's a great chunk of the file, and we would love to have native support for it in Superset.

Currently, as shown in this SIP, we're saving the tokens to the metadata DB. In hindsight, I suggest adding the option to save them to the cache instead, since they expire anyway, and this would avoid multiple trips to the DB to fetch them while the user is querying the data source.

On another note, as was pointed out on Slack, a limitation we hit with this implementation is that we cannot enable alerts/reports because Superset uses a machine user which cannot authenticate to the data source to render a dashboard.

There is also data caching that has to be done per user to avoid data leaking to visitors of a dashboard; I have a PR open for this: #20114

@padbk

padbk commented Jun 27, 2022

@phildum do you have any docs on how to use OAuth2 in those cases?

Redshift: https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-iam-access-control-native-idp.html

Postgres RDS is via Kerberos to AWS managed AD which can then be synced with Azure AD via Azure AD Connect: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/postgresql-kerberos.html
https://aws.amazon.com/blogs/security/enable-office-365-with-aws-managed-microsoft-ad-without-user-password-synchronization/
As Postgres RDS is not OAuth2, but Kerberos, you may want to split this out, but the premise of database access via SSO could remain the same, with a different backend implementation of token fetching?

@michael-s-molina
Member

Currently, as shown in this SIP, we're saving the tokens to the metadata DB. In hindsight, I suggest adding the option to save them to the cache instead, since they expire anyway, and this would avoid multiple trips to the DB to fetch them while the user is querying the data source.

@betodealmeida If we choose to use the cache to store this information, our custom flask caching backend can store data in the key_value table in case a client does not support any of the available backends 😉

@wearpants

This would fill a major hole in BI offerings generally; I have several clients in need of this.

@betodealmeida betodealmeida moved this from ACTIVE DISCUSSION to APPROVED (in mailing list) in SIPs Aug 9, 2022
@yzliao-zip

This is a great feature! We spent a lot of time trying to find a good way to enable Snowflake OAuth, and none of the options is perfect. Really looking forward to having this feature available.

@dungdm93
Contributor

Trino like this 👍

@musicmuthu

Will this support Trino OAuth2?

@betodealmeida
Member Author

Will this support Trino OAuth2?

We'll create a base implementation, and then people can add support on a per-database basis. My idea is to start with GSheets and Snowflake, but it should be easy to add support for more databases.

@anilkumar000

Could you please point to the branch where the code for this feature is in progress? @betodealmeida

@dobrych

dobrych commented Oct 18, 2022

@betodealmeida I've tried to make a docker build from your branch https://github.com/betodealmeida/incubator-superset/tree/gsheets_oauth but it failed on the npm stage:

executor failed running [/bin/sh -c cd /app/superset-frontend && npm run ${BUILD_CMD} && rm -rf node_modules]: exit code: 1

is there a better recipe to test your changes?
I'd be happy to contribute as a tester (or maybe even code)

@Samira-El
Contributor

Samira-El commented Dec 5, 2022

Hi,

We'll create a base implementation, and then people can add support on a per-database basis. My idea is to start with GSheets and Snowflake, but it should be easy to add support for more databases.

Wondering if the in-progress implementation takes into consideration how OAuth for the data sources impacts scheduled alerts/reports, and if anything is being done around this?

@Samira-El
Contributor

Any updates on this? There is no recent activity in this branch https://github.com/betodealmeida/incubator-superset/tree/gsheets_oauth

Happy to contribute as a dev

@villebro
Member

villebro commented Feb 26, 2023

Hi,

We'll create a base implementation, and then people can add support on a per-database basis. My idea is to start with GSheets and Snowflake, but it should be easy to add support for more databases.

Wondering if the in-progress implementation takes into consideration how OAuth for the data sources impacts scheduled alerts/reports, and if anything is being done around this?

@Samira-El Alerts & Reports can now be executed without a service account; see #21931 for details. So in terms of this PR, as long as the report executor has a valid access and/or refresh token stored in the db, the report should execute as if the user were logged in.

@yzliao-zip

@michael-s-molina could you give a reference on the PR or code path that implements the OAuth2 for DB? Thanks!

@musicmuthu

@michael-s-molina Can you please share the PR branch name for the oauth DB issue

@michael-s-molina
Member

michael-s-molina commented Sep 7, 2023

@yzliao-zip @musicmuthu A closed SIP ticket only means that the SIP was approved/rejected/discarded. It does not mean it was implemented yet. You can check the implementation status of SIPs here.

@musicmuthu

@michael-s-molina thank you for the explanation

@betodealmeida
Member Author

I finally found some time to work on this.

@mistercrunch
Member

Quick comment here for legacy and capturing a Slack conversation. I wondered if we had considered using the user session as a means to store the token. The idea being that the token would be more ephemeral, not persisted by Superset and expiring with the Superset session, which all seemed good and healthy.

Now, from a short conversation around this, there are clear issues with the session approach: 1) the async Celery backend doesn't have a handle on the web request, forcing us to pass the token around in some capacity and encrypt it on the wire, and 2) it wouldn't support Alerts & Reports, unless we disabled that combination of features. In any case, knowing that Superset needs to have a hold on the token, it does make sense to use the database, which provides the right guarantees around encryption.

@villebro
Member

Hi @betodealmeida @mistercrunch, I can chime in here, as I've implemented e2e OAuth database auth using the tokens from the initial Superset OAuth authentication flow. This is a more custom case, as we're able to use the initial token as the basis for all db OAuths, and don't need to do the separate web-based login dance required for these auths. However, the storage requirements are likely very similar.

The relevant auth steps look something like this:

  1. When the user logs in, the JWT tokens are persisted in the key-value table. We do fairly heavy encryption of the tokens to minimize risk of tokens leaking.
  2. When connecting to internal databases that support OAuth, we do token exchange to exchange the initial token for the downstream one.
  3. When the downstream token expires (they usually have a fairly short TTL), they are re-exchanged for a new one from the initial token.
  4. We persist the refresh tokens also, which means that if the initial token expires, we can refresh it on the fly and then re-exchange when needed.

One important thing that required careful consideration is how to handle this in highly horizontally scaled deployments. Example: let's say you open a big dashboard with many charts referencing the same database. The browser will trigger multiple chart data requests that will likely be picked up by different web workers that don't share state. If the token needs exchanging or refreshing, we only want a single process to handle that. For that reason, we use a distributed lock that again leverages the key-value table. It goes something like this (this happens from the db connection mutator):

  1. Fetch the latest downstream token from the key-value store (each downstream token is persisted in a separate entry based on the OAuth app id).
  2. If it's valid, add the JWT token to the database connection and return the mutated db connection.
  3. If it's expired, try to retrieve a lock for executing the token exchange.
  4. If the lock is already taken, do backoff and go to 1.
  5. Initiate the token exchange process.
  6. If the initial token is also expired, a similar locking/refreshing flow is initiated for it, as explained in 1-4.
  7. Once the new token is ready, persist it in the key-value store and release the lock.

This ensures that only a single web worker thread is actually carrying out the token exchange/refresh, and the others are simply waiting for it to complete, after which they proceed normally with the chart data request. It also works great for alerts & reports, as this all happens under the hood by the web workers. This flow has worked very well for us, and I'm happy to contribute the related code upstream if this aligns with this feature.
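
For readers following along, here is a minimal sketch of the locking flow described above, using hypothetical key-value and lock helpers; it illustrates the approach rather than the actual implementation:

import time

def get_downstream_token(kv_store, lock_store, app_id, initial_token):
    """Return a valid downstream token, letting only one worker do the exchange."""
    while True:
        token = kv_store.get(f"oauth2:{app_id}")  # hypothetical key-value API
        if token and not token.is_expired():
            return token
        # token missing or expired: only one worker should perform the exchange
        if lock_store.acquire(f"oauth2-lock:{app_id}", ttl=30):  # hypothetical lock API
            try:
                # exchange (or refresh-then-exchange) the initial token, per steps 5-6
                token = exchange_token(initial_token, app_id)
                kv_store.set(f"oauth2:{app_id}", token)
                return token
            finally:
                lock_store.release(f"oauth2-lock:{app_id}")
        # another worker holds the lock: back off and retry (step 4)
        time.sleep(0.5)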

@rusackas
Member

Hi @betodealmeida - just checking in to see if this is still WIP or if it's effectively "done". No pressure, just wondering what column to put it in ;)
