feat(SIP-85): OAuth2 for databases #27631

Merged: 10 commits merged into master, Apr 3, 2024
Conversation

@betodealmeida (Member) commented Mar 23, 2024

SUMMARY

This PR introduces a new table called database_user_oauth2_tokens. The table is used for storing personal user tokens associated with a given database:

# \d database_user_oauth2_tokens
                                               Table "public.database_user_oauth2_tokens"
         Column          |            Type             | Collation | Nullable |                         Default
-------------------------+-----------------------------+-----------+----------+---------------------------------------------------------
 created_on              | timestamp without time zone |           |          |
 changed_on              | timestamp without time zone |           |          |
 id                      | integer                     |           | not null | nextval('database_user_oauth2_tokens_id_seq'::regclass)
 user_id                 | integer                     |           | not null |
 database_id             | integer                     |           | not null |
 access_token            | bytea                       |           |          |
 access_token_expiration | timestamp without time zone |           |          |
 refresh_token           | bytea                       |           |          |
 created_by_fk           | integer                     |           |          |
 changed_by_fk           | integer                     |           |          |
Indexes:
    "database_user_oauth2_tokens_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "database_user_oauth2_tokens_changed_by_fk_fkey" FOREIGN KEY (changed_by_fk) REFERENCES ab_user(id)
    "database_user_oauth2_tokens_created_by_fk_fkey" FOREIGN KEY (created_by_fk) REFERENCES ab_user(id)
    "database_user_oauth2_tokens_database_id_fkey" FOREIGN KEY (database_id) REFERENCES dbs(id)
    "database_user_oauth2_tokens_user_id_fkey" FOREIGN KEY (user_id) REFERENCES ab_user(id)

Whenever a SQLAlchemy engine is instantiated, the personal user token (or None) will be passed to the get_url_for_impersonation method in the DB engine spec, so that a custom URL can be built for the user. For example, for GSheets:

def get_url_for_impersonation(
    cls,
    url: URL,
    impersonate_user: bool,
    username: str | None,
    access_token: str | None,  # <== here
) -> URL:
    if not impersonate_user:
        return url

    if username is not None:
        user = security_manager.find_user(username=username)
        if user and user.email:
            url = url.update_query_dict({"subject": user.email})

    if access_token:
        url = url.update_query_dict({"access_token": access_token})

    return url
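
To make the effect concrete, here is a standalone illustration (placeholder values, not code from the PR) of how those update_query_dict calls amend the URL:

from sqlalchemy.engine.url import make_url

url = make_url("gsheets://")
url = url.update_query_dict({"subject": "alice@example.com"})
url = url.update_query_dict({"access_token": "ya29.a0Af..."})
# Query values are URL-encoded when rendered, e.g.:
# gsheets://?subject=alice%40example.com&access_token=ya29.a0Af...
print(url)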

The change allows users to log in to databases like BigQuery, Snowflake, Dremio, Databricks, Google Sheets, etc. using their own credentials. This makes it easier to set up databases, since service accounts are no longer required, and provides better isolation of data between users. Only support for Google Sheets is implemented in this PR, and it's considered the reference implementation. Note that a newer version of Shillelagh is required, since a change in the Google Auth API introduced a regression.

In order to populate the table with personal access tokens, the DB engine spec checks for a specific exception that signals that OAuth2 should start:

class BaseEngineSpec:
    def execute(...) -> None:
        try:
            cursor.execute(query)
        except cls.oauth2_exception as ex:  # <== here
            if cls.is_oauth2_enabled() and g.user:
                cls.start_oauth2_dance(database_id)
            raise cls.get_dbapi_mapped_exception(ex) from ex
        except Exception as ex:
            raise cls.get_dbapi_mapped_exception(ex) from ex
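
For GSheets this maps naturally onto the driver's own exception; a sketch of what opting in looks like (based on the mechanism above, not a verbatim excerpt from the PR):

from shillelagh.exceptions import UnauthenticatedError

class GSheetsEngineSpec(BaseEngineSpec):
    # Raised by the shillelagh driver when no valid user credentials exist;
    # execute() catching it is what kicks off the OAuth2 dance.
    oauth2_exception = UnauthenticatedError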

When called, the start_oauth2_dance method will return the error OAUTH2_REDIRECT to the frontend. The error is captured by the ErrorMessageWithStackTrace component, which provides a link to the user so they can start the OAuth2 authentication. Since this is implemented at the DB engine spec level, any query will trigger it — in SQL Lab, Explore, or dashboards — see the screenshots below for the UX.
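
As a rough sketch of the kickoff itself (the helper names and exception class are assumptions, not the PR's exact API; the essential contract is that the raised error carries the provider's authorization URL for the frontend to link to):

@classmethod
def start_oauth2_dance(cls, database_id: int) -> None:
    # Hypothetical helpers: read the client config, build the provider's
    # authorization URL, and raise an error that serializes to OAUTH2_REDIRECT.
    oauth2_config = cls.get_oauth2_config()
    auth_url = cls.get_oauth2_authorization_uri(oauth2_config, database_id)
    raise OAuth2RedirectError(auth_url)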

Note that while the current implementation triggers OAuth2 when a query needs authorization, we could also implement affordances in the database UI to manually trigger OAuth2 to store the personal access tokens. This could be done in the future.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

SQL Lab. Note that the query runs automatically once OAuth is completed:

SIP-85.Sql.Lab.mov

Explore. Note that the chart is automatically updated after OAuth:

SIP-85.Explore.mov

Same thing for dashboards:

SIP-85.Dashboard.mov

TESTING INSTRUCTIONS

  1. Create a Google Sheets database.
  2. Create a Google OAuth2 application of type "Web application" at https://console.cloud.google.com/apis/credentials/oauthclient/.
  3. Edit superset_config.py and add the client ID and secret:

DATABASE_OAUTH2_CREDENTIALS = {
    "Google Sheets": {
        "CLIENT_ID": "XXX.apps.googleusercontent.com",
        "CLIENT_SECRET": "GOCSPX-YYY",
    },
}

  4. In SQL Lab, try to query a sheet that is not shared publicly. It should trigger OAuth2.
  5. Add the sheet as a dataset and create a chart.
  6. Delete the tokens from the database:

DELETE FROM database_user_oauth2_tokens;

  7. Reload the chart. It should trigger OAuth2.
  8. Add the chart to a dashboard, delete the tokens, and reload the dashboard. It should trigger OAuth2.

ADDITIONAL INFORMATION

  • Has associated issue: [SIP-85] OAuth2 for databases #20300
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@github-actions github-actions bot added risk:db-migration PRs that require a DB migration api Related to the REST API labels Mar 23, 2024

codecov bot commented Mar 23, 2024

Codecov Report

Attention: Patch coverage is 89.96416%, with 28 lines in your changes missing coverage. Please review.

Project coverage is 69.96%. Comparing base (883e455) to head (59d4daf).
Report is 28 commits behind head on master.

Files Patch % Lines
superset/db_engine_specs/base.py 68.29% 13 Missing ⚠️
superset/utils/lock.py 86.48% 5 Missing ⚠️
.../components/ErrorMessage/OAuth2RedirectMessage.tsx 92.10% 0 Missing and 3 partials ⚠️
superset/utils/oauth2.py 96.42% 2 Missing ⚠️
superset-frontend/src/setup/setupErrorMessages.ts 0.00% 1 Missing ⚠️
superset/databases/api.py 95.45% 1 Missing ⚠️
superset/db_engine_specs/gsheets.py 97.14% 1 Missing ⚠️
superset/models/core.py 93.75% 1 Missing ⚠️
superset/sql_lab.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #27631      +/-   ##
==========================================
+ Coverage   69.89%   69.96%   +0.07%     
==========================================
  Files        1911     1916       +5     
  Lines       75024    75377     +353     
  Branches     8355     8403      +48     
==========================================
+ Hits        52435    52741     +306     
- Misses      20539    20571      +32     
- Partials     2050     2065      +15     
Flag Coverage Δ
hive 48.98% <57.08%> (+0.04%) ⬆️
javascript 57.54% <89.74%> (+0.06%) ⬆️
mysql 77.76% <60.00%> (-0.15%) ⬇️
postgres 77.90% <59.58%> (-0.13%) ⬇️
presto 53.66% <57.91%> (+0.01%) ⬆️
python 83.20% <90.00%> (+0.05%) ⬆️
sqlite 77.34% <59.58%> (-0.12%) ⬇️
unit 57.18% <87.50%> (+0.42%) ⬆️


@betodealmeida betodealmeida marked this pull request as ready for review March 24, 2024 00:50
@betodealmeida betodealmeida requested a review from a team as a code owner March 24, 2024 00:50
@john-bodley john-bodley self-requested a review March 25, 2024 16:53

@mistercrunch (Member) left a comment

Overall seems solid. My most important point is probably around adding a database index on the new model, the rest are comments/notes.

Resolved review threads on: superset/connectors/sqla/models.py, superset/exceptions.py, superset/utils/oauth2.py, superset/db_engine_specs/hive.py

@john-bodley (Member) commented:

@betodealmeida do we perceive there could/would be other authorization frameworks other than OAuth 2.0? If so I was wondering if there was merit in renaming database_user_oauth2_tokens to be something more generic and adding additional column(s) which define said frameworks.

@craig-rueda (Member) left a comment

Looks good so far! I think there's quite a few cases that could be tested. (Token delete/insert/ etc., the updated engine spec definitions, etc...)

@betodealmeida (Member, Author) commented:

> @betodealmeida do we perceive there could/would be other authorization frameworks other than OAuth 2.0? If so I was wondering if there was merit in renaming database_user_oauth2_tokens to be something more generic and adding additional column(s) which define said frameworks.

@john-bodley I'm not sure, to be honest. The beauty of OAuth2 is that the same flow is shared across multiple providers, all you need is an access token and a refresh token, so the same foundation works for BigQuery/GSheets/Snowflake/Dremio/Databricks.

The one case I can think of is if at some point we'd want users to be able to input their own username/password, but I think that trying to address potential future uses would increase the complexity without clear benefits.
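
(For context, that shared flow boils down to the standard refresh-token grant; a generic sketch, with only the token endpoint and client credentials varying per provider:)

import requests

def refresh_access_token(token_url: str, client_id: str, client_secret: str, refresh_token: str) -> dict:
    # Standard OAuth2 refresh grant, identical across Google, Snowflake, etc.
    response = requests.post(
        token_url,
        data={
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # fresh access_token plus expiry metadata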

@betodealmeida (Member, Author) commented:

> Looks good so far! I think there's quite a few cases that could be tested. (Token delete/insert/ etc., the updated engine spec definitions, etc...)

@craig-rueda I did test inserting/deleting the token in the API test. I'll add tests for the new DB engine spec methods.

Resolved review threads (outdated) on: superset/utils/oauth2.py

@harksin commented Mar 29, 2024

Hey there,
thanks for this very useful PR!
I can confirm that we use Superset with Trino and are interested in OAuth support for Trino, to get rid of user impersonation.

@betodealmeida (Member, Author) commented:

> I'm simply trying to bring up the fact that this solution wouldn't work in a use case that I would need it to work in if I were to use this feature.

The solution I'm proposing is a hybrid superset_config.py (this PR) + new model (future PR), which would work for both of our use cases. We'd have to introduce the new model and UI elements, but most of the logic of this PR would remain unchanged. I'm not against your proposal, on the contrary, I think it's a great idea and an elegant solution.

But what I'm hearing from you is that we should go with only the new model, where the client ID/secret live in the metadata database. That solution is very suboptimal for my use case, since it would require users to create and manage their own applications.

You mentioned:

> This is based on the assumption that the best UX for adding cloud db clients would still be through the UI

But that is definitely not true in my use case. Google Sheets is one of our most popular databases used at Preset, and it would be great if users could connect to private sheets without having to figure out how to create an OAuth2 application first. BigQuery is probably the most popular, and the faster our users can start exploring their data, the better.

You also mentioned:

> then the config based approach will anyway become redundant, and will cause both maintenance burden, deprecation and removal at some point.

I don't see why the config approach would have to be removed, since it complements the approach you're proposing. The default configuration for the feature is just an empty dictionary. We just need to add logic that checks for the client information in two different places, as we already do for many things in Superset — this is IMHO the main feature of Superset, how it's adaptable and configurable to so many different use cases.

And in fairness, I do think that if we have the hybrid approach most people will prefer to use the UI to configure OAuth2, because as you said, it's easier to update the client in a single place and changes don't require a redeploy. But that doesn't mean that it's a solution that works for everyone.

(As for Snowflake, my understanding is that we can derive the OAuth2 URL from the account name, which is part of the SQLAlchemy URL.)

@villebro (Member) commented:

> The solution I'm proposing is a hybrid superset_config.py (this PR) + new model (future PR), which would work for both of our use cases. We'd have to introduce the new model and UI elements, but most of the logic of this PR would remain unchanged. I'm not against your proposal, on the contrary, I think it's a great idea and an elegant solution.
>
> But what I'm hearing from you is that we should go with only the new model, where the client ID/secret live in the metadata database. That solution is very suboptimal for my use case, since it would require users to create and manage their own applications.

I'm totally ok with the hybrid approach if we make sure we're not painting ourselves into a corner with it. The reason I feel uneasy with implementing a "one client for all connections of type x" approach as the initial implementation is that I feel it's inherently atypical for this type of flow: While it may be optimal for this specific use case, I don't think it works generally for most other use cases.

> But that is definitely not true in my use case. Google Sheets is one of our most popular databases used at Preset, and it would be great if users could connect to private sheets without having to figure out how to create an OAuth2 application first. BigQuery is probably the most popular, and the faster our users can start exploring their data, the better.

I understand this. Again, my reservation stems from the fact that this is not a typical OAuth2 flow, i.e. you likely can't use this for the majority of OAuth2 connectivity use cases, but the alternative I'm proposing works for this one, too. But as I said, I'm not against being able to provide static creds in the config where applicable.

> I don't see why the config approach would have to be removed, since it complements the approach you're proposing. The default configuration for the feature is just an empty dictionary. We just need to add logic that checks for the client information in two different places, as we already do for many things in Superset — this is IMHO the main feature of Superset, how it's adaptable and configurable to so many different use cases.
>
> (As for Snowflake, my understanding is that we can derive the OAuth2 URL from the account name, which is part of the SQLAlchemy URL.)

But even if you dynamically generate the URI, you would still need to store the client creds somewhere, right? In other words, the user would still need to pass the client id and secret somehow for the backend to be able to use them. Quoting from https://docs.snowflake.com/en/user-guide/oauth-custom#request-header:

[screenshot: Snowflake custom OAuth documentation, request-header section]

@betodealmeida (Member, Author) commented:

@villebro for Snowflake the admin would add to config.py:

DATABASE_OAUTH2_CREDENTIALS = {
    "Snowflake": {
        "CLIENT_ID": "XXX",
        "CLIENT_SECRET": "YYY",
    },
}

Then once a Snowflake database is added the DB engine spec can determine the OAuth2 URL from the SQLAlchemy URI, and use the information from the config to authorize the user. We could also optionally have the URL in the config, for deployments where that is possible.
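
(A sketch of that derivation, assuming Snowflake's documented custom OAuth endpoint pattern; the helper itself is illustrative, not part of this PR:)

from sqlalchemy.engine.url import make_url

def get_snowflake_authorization_url(sqlalchemy_uri: str) -> str:
    # The host portion of a Snowflake SQLAlchemy URI is the account identifier,
    # and custom OAuth endpoints live under the per-account domain.
    account = make_url(sqlalchemy_uri).host  # e.g. "xy12345.us-east-1"
    return f"https://{account}.snowflakecomputing.com/oauth/authorize"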

I think we have two very different use cases, which is why we need these two workflows. For Preset, most users are using cloud DBs, for the same reason they're using cloud Superset — they don't want to run their own infrastructure. And we want to have them connected to their data as quickly and easily as possible. We want to allow them to use our own application, but the Admin users shouldn't have access to the application details.

In your case, you're running your own very dynamic infrastructure, you have dozens (hundreds?) of databases, and multiple custom OAuth2 providers. Your admins are tech-savvy and have no trouble connecting Superset to complex database deployments, creating OAuth2 applications, and so on.

You mentioned "reducing throwaway work". I don't think there's a lot of work that will be thrown away if later we implement your suggestion. Most of the logic will remain, the only thing that would change would be how we pass the configuration to the DB engine specs — currently they (and by "they" I mean GSheets) read it from the config, but we could pass it explicitly so it come either from the config or from the assigned client.

I'm happy to add that abstraction in the near future, before someone starts working on an n:n relationship between databases and OAuth2 clients.

@villebro (Member) commented:

> for Snowflake the admin would add to config.py:
>
> DATABASE_OAUTH2_CREDENTIALS = {
>     "Snowflake": {
>         "CLIENT_ID": "XXX",
>         "CLIENT_SECRET": "YYY",
>     },
> }
>
> Then once a Snowflake database is added the DB engine spec can determine the OAuth2 URL from the SQLAlchemy URI, and use the information from the config to authorize the user. We could also optionally have the URL in the config, for deployments where that is possible.

This inherently means that you'll only be able to connect to a single Snowflake account using OAuth2. I'm sure that may be fine for many deployments, but as a general pattern that's unnecessarily restrictive.

Again I want to reiterate that my purpose here isn't to block this work. Rather, I'm trying to bring up issues with this approach that I'm expecting to surface after this PR is merged when orgs start rolling this out to their deployments. For that reason, it would be great to have an understanding of when these follow-up tasks are expected to be done to call OAuth2 support complete. Maybe we could discuss this over a meeting to see which orgs can commit to this work, and when?

@betodealmeida (Member, Author) commented:

> This inherently means that you'll only be able to connect to a single Snowflake account using OAuth2. I'm sure that may be fine for many deployments, but as a general pattern that's unnecessarily restrictive.

No, you could add as many different Snowflake databases as you wanted. Each one would connect to a different OAuth2 URL, because each one would have a different SQLAlchemy URI. The client ID and client secret are not tied to a single account, unless I'm missing something.

I'm planning to add support for Snowflake next, so if we need any kind of refactoring to support it, or if we need to implement the full model that you're proposing, I would be the one doing that work.

@villebro (Member) commented:

> No, you could add as many different Snowflake databases as you wanted. Each one would connect to a different OAuth2 URL, because each one would have a different SQLAlchemy URI. The client ID and client secret are not tied to a single account, unless I'm missing something.

I would be surprised if one client works across multiple accounts. I haven't tried this personally, but I understand the process as follows:

  • Login to your Snowflake account
  • Issue a CREATE SECURITY INTEGRATION TYPE = OAUTH as per here
  • Use the creds from SYSTEM$SHOW_OAUTH_CLIENT_SECRETS when executing queries on your account as per here

And if you'd want to integrate with another account, you'd redo those steps on that account, and then use those creds when executing queries against that one.

@betodealmeida (Member, Author) commented Mar 29, 2024

> I would be surprised if one client works across multiple accounts. I haven't tried this personally, but I understand the process as follows [...]

I haven't tried it personally yet either, and if it doesn't work, it's OK — I'll then implement your proposal next.

@mistercrunch (Member) left a comment

Minor comments but overall LGTM. Stamping my approval but we may want another stamp from @villebro since he got deep in here already and seemed likely to push this further in the future.

@@ -542,6 +543,70 @@ The method `get_url_for_impersonation` updates the SQLAlchemy URI before every q

Alternatively, it's also possible to impersonate users by implementing the `update_impersonation_config`. This is a class method which modifies `connect_args` in place. You can use either method, and ideally they [should be consolidated in a single one](https://github.com/apache/superset/issues/24910).

### OAuth2

Review comment (Member):
NIT: feels like this belongs on the documentation website, probably as a new section "Connecting Users to Databases using OAuth2" under the "Installation and Configuration" section. This README is buried, maybe the guideline for using this readme would be for things that speak to developers working on the db_engine_specs package as opposed to admins looking to install/configure Superset. But the content is great! :)

Review comment (Member):
NIT: personally I like to break lines at 80-100 in docs/md, but there's no standard enforced there at the moment, so fine either way.


@betodealmeida (Member, Author) commented:
Right, this README documents all the functionality of the DB engine specs and is targeting developers, explaining the methods that are needed. I'm happy to write additional docs about OAuth2 in the main website, I'll do that.

@villebro (Member) left a comment

LGTM with a request for making it possible to define clients per database in the future, preferably via a dedicated client model.

@betodealmeida betodealmeida merged commit 9022f5c into master Apr 3, 2024
43 checks passed
jzhao62 pushed a commit to jzhao62/superset that referenced this pull request Apr 4, 2024
EandrewJones pushed a commit to UMD-ARLIS/superset that referenced this pull request Apr 5, 2024
EnxDev pushed a commit to EnxDev/superset that referenced this pull request Apr 12, 2024
@rusackas rusackas deleted the sip-85 branch April 16, 2024 16:52
qleroy pushed a commit to qleroy/superset that referenced this pull request Apr 28, 2024
Labels: api (Related to the REST API), preset-io, risk:db-migration (PRs that require a DB migration), size/XXL

10 participants