feat(SIP-85): OAuth2 for databases #27631
Codecov Report
Attention: Patch coverage is

```
@@            Coverage Diff             @@
##           master   #27631      +/-   ##
==========================================
+ Coverage   69.89%   69.96%   +0.07%
==========================================
  Files        1911     1916       +5
  Lines       75024    75377     +353
  Branches     8355     8403      +48
==========================================
+ Hits        52435    52741     +306
- Misses      20539    20571      +32
- Partials     2050     2065      +15
```

Flags with carried forward coverage won't be shown.
Overall seems solid. My most important point is probably around adding a database index on the new model, the rest are comments/notes.
superset/migrations/versions/2024-03-20_16-02_678eefb4ab44_add_access_token_table.py
@betodealmeida do we foresee there being other authorization frameworks besides OAuth 2.0? If so, I was wondering if there was merit in renaming
Looks good so far! I think there are quite a few cases that could be tested (token delete/insert, the updated engine spec definitions, etc.).
@john-bodley I'm not sure, to be honest. The beauty of OAuth2 is that the same flow is shared across multiple providers: all you need is an access token and a refresh token, so the same foundation works for BigQuery/GSheets/Snowflake/Dremio/Databricks. The one case I can think of is if at some point we'd want users to be able to input their own username/password, but I think that trying to address potential future uses would increase the complexity without clear benefits.
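The provider-agnostic flow described above can be sketched with plain RFC 6749 fields. This is not Superset code; the helper name is hypothetical, and only the standard `refresh_token` grant parameters are used:

```python
import urllib.parse


def build_refresh_request(client_id: str, client_secret: str, refresh_token: str) -> bytes:
    """Form-encoded body for a standard RFC 6749 refresh_token grant.

    The same body shape works against BigQuery, GSheets, Snowflake, etc.,
    since they all implement the same token endpoint contract.
    """
    return urllib.parse.urlencode(
        {
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
            "client_id": client_id,
            "client_secret": client_secret,
        }
    ).encode()
```

Posting this body to a provider's token endpoint returns a fresh access token, which is why a single foundation can serve many providers.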
@craig-rueda I did test inserting/deleting the token in the API test. I'll add tests for the new DB engine spec methods.
Hey there,

The solution I'm proposing is a hybrid. But what I'm hearing from you is that we should go with only the new model, where the client ID/secret live in the metadata database. That solution is very suboptimal for my use case, since it would require users to create and manage their own applications. You mentioned:
But that is definitely not true in my use case. Google Sheets is one of the most popular databases at Preset, and it would be great if users could connect to private sheets without having to figure out how to create an OAuth2 application first. BigQuery is probably the most popular, and the faster our users can start exploring their data, the better. You also mentioned:
I don't see why the config approach would have to be removed, since it complements the approach you're proposing. The default configuration for the feature is just an empty dictionary. We just need to add logic that checks for the client information in two different places, as we already do for many things in Superset; this is IMHO the main feature of Superset, how adaptable and configurable it is to so many different use cases. And in fairness, I do think that with the hybrid approach most people will prefer to use the UI to configure OAuth2, because as you said, it's easier to update the client in a single place and changes don't require a redeploy. But that doesn't mean it's a solution that works for everyone. (As for Snowflake, my understanding is that we can derive the OAuth2 URL from the account name, which is part of the SQLAlchemy URL.)
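The "check in two different places" logic described above can be sketched as follows. This is a hypothetical illustration, not the PR's implementation; the helper name `get_oauth2_client` and the shape of the config dictionary are assumptions based on the `DATABASE_OAUTH2_CREDENTIALS` example later in the thread:

```python
from typing import Optional

# Config default for the feature: an empty dictionary (per the comment above).
DATABASE_OAUTH2_CREDENTIALS: dict[str, dict[str, str]] = {}


def get_oauth2_client(
    database_name: str,
    db_client: Optional[dict[str, str]] = None,
) -> Optional[dict[str, str]]:
    """Hybrid lookup: prefer a client stored in the metadata database,
    fall back to the statically configured credentials."""
    if db_client:  # client assigned via the UI/model (the proposed new model)
        return db_client
    return DATABASE_OAUTH2_CREDENTIALS.get(database_name)  # config fallback
```

With this shape, deployments that configure clients statically and deployments that manage them in the UI both resolve through the same call site.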
I'm totally OK with the hybrid approach if we make sure we're not painting ourselves into a corner with it. The reason I feel uneasy with implementing a "one client for all connections of type X" approach as the initial implementation is that it feels inherently atypical for this type of flow: while it may be optimal for this specific use case, I don't think it works for most other use cases.
I understand this. Again, my reservation stems from the fact that this is not a typical OAuth2 flow, i.e. you likely can't use this for the majority of OAuth2 connectivity use cases, but the alternative I'm proposing works for this one, too. But as I said, I'm not against being able to provide static creds in the config where applicable.
But even if you dynamically generate the URI, you would still need to store the client creds somewhere, right? In other words, the user would still need to pass the client ID and secret somehow for the backend to be able to use them. Quoting from https://docs.snowflake.com/en/user-guide/oauth-custom#request-header:
@villebro for Snowflake the admin would add to the config:

```python
DATABASE_OAUTH2_CREDENTIALS = {
    "Snowflake": {
        "CLIENT_ID": "XXX",
        "CLIENT_SECRET": "YYY",
    },
}
```

Then once a Snowflake database is added, the DB engine spec can determine the OAuth2 URL from the SQLAlchemy URI, and use the information from the config to authorize the user. We could also optionally have the URL in the config, for deployments where that is possible.

I think we have two very different use cases, which is why we need these two workflows. For Preset, most users are using cloud DBs, for the same reason they're using cloud Superset: they don't want to run their own infrastructure. And we want to have them connected to their data as quickly and easily as possible. We want to allow them to use our own application, but the

In your case, you're running your own very dynamic infrastructure, you have dozens (hundreds?) of databases, and multiple custom OAuth2 providers. Your admins are tech-savvy and have no trouble connecting Superset to complex database deployments, creating OAuth2 applications, and so on.

You mentioned "reducing throwaway work". I don't think there's a lot of work that will be thrown away if we later implement your suggestion. Most of the logic will remain; the only thing that would change is how we pass the configuration to the DB engine specs. Currently they (and by "they" I mean GSheets) read it from the config, but we could pass it explicitly so it comes either from the config or from the assigned client. I'm happy to add that abstraction in the near future, before someone starts working on a
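The URL derivation mentioned above (getting the OAuth2 endpoint from the account name in the SQLAlchemy URI) could look roughly like this. The function name is hypothetical; the host pattern follows Snowflake's documented `<account>.snowflakecomputing.com` scheme, linked earlier in the thread:

```python
from urllib.parse import urlparse


def snowflake_oauth2_authorization_url(sqlalchemy_uri: str) -> str:
    """Build the Snowflake authorize endpoint from a snowflake:// URI.

    In a snowflake:// SQLAlchemy URI the host component is the account
    identifier, so each database (each URI) yields its own OAuth2 URL.
    """
    account = urlparse(sqlalchemy_uri).hostname  # e.g. "abc123"
    return f"https://{account}.snowflakecomputing.com/oauth/authorize"
```

Since every Snowflake database in Superset has its own SQLAlchemy URI, each one would resolve to its own account's OAuth2 endpoint.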
This inherently means that you'll only be able to connect to a single Snowflake account using OAuth2. I'm sure that may be fine for many deployments, but as a general pattern it's unnecessarily restrictive. Again, I want to reiterate that my purpose here isn't to block this work; rather, I'm trying to bring up issues with this approach that I expect to surface after this PR is merged, when orgs start rolling this out to their deployments. For that reason, it would be great to have an understanding of when these follow-up tasks are expected to be done, so we can call OAuth2 support complete. Maybe we could discuss this over a meeting to see which orgs can commit to this work, and when?
No, you could add as many different Snowflake databases as you wanted. Each one would connect to a different OAuth2 URL, because each one would have a different SQLAlchemy URI. The client ID and the client secret are not tied to a single account, unless I'm missing something. I'm planning to add support for Snowflake next, so if we need any kind of refactoring to support it, or if we need to implement the full model that you're proposing, I would be the one doing that work.
I would be surprised if one client works across multiple accounts. I haven't tried this personally, but I understand the process as follows:
And if you'd want to integrate with another account, you'd redo those steps on that account, and then use those creds when executing queries against it.
I haven't tried it personally yet either, and if it doesn't work, it's OK: I'll then implement your proposal next.
Minor comments but overall LGTM. Stamping my approval but we may want another stamp from @villebro since he got deep in here already and seemed likely to push this further in the future.
@@ -542,6 +543,70 @@ The method `get_url_for_impersonation` updates the SQLAlchemy URI before every q

Alternatively, it's also possible to impersonate users by implementing the `update_impersonation_config` method. This is a class method which modifies `connect_args` in place. You can use either method, and ideally they [should be consolidated in a single one](https://github.com/apache/superset/issues/24910).

### OAuth2
NIT: feels like this belongs on the documentation website, probably as a new section "Connecting Users to Databases using OAuth2" under the "Installation and Configuration" section. This README is buried; maybe the guideline would be that this README is for things that speak to developers working on the `db_engine_specs` package, as opposed to admins looking to install/configure Superset. But the content is great! :)
NIT: personally I like to break lines at 80-100 characters in docs/Markdown, but there's no standard enforced at the moment, so fine either way.
Right, this README documents all the functionality of the DB engine specs and is targeted at developers, explaining the methods that are needed. I'm happy to write additional docs about OAuth2 on the main website; I'll do that.
LGTM with a request for making it possible to define clients per database in the future, preferably via a dedicated client model.
SUMMARY
This PR introduces a new table called `database_user_oauth2_tokens`. The table is used for storing personal user tokens associated with a given database.

Whenever a SQLAlchemy engine is instantiated, the personal user token (or `None`) will be passed to the `get_url_for_impersonation` method in the DB engine spec, so that a custom URL can be built for the user. For example, for GSheets:

The change allows users to log in to databases like BigQuery, Snowflake, Dremio, Databricks, Google Sheets, etc. using their own credentials. This makes it easier to set up databases, since service accounts are no longer required, and provides better isolation of data between users. Only support for Google Sheets is implemented in this PR, and it's considered the reference implementation. Note that a newer version of Shillelagh is required, since a change in the Google Auth API introduced a regression.
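The GSheets example referenced above is omitted here, but the mechanism could look roughly like this. This is an illustrative sketch only: the class name, method signature, and the `access_token` parameter name are assumptions, not the PR's actual code:

```python
from typing import Optional


class GSheetsEngineSpecSketch:
    """Hypothetical DB engine spec showing token-based impersonation."""

    @classmethod
    def get_url_for_impersonation(
        cls,
        url: str,
        impersonate_user: bool,
        username: Optional[str],
        access_token: Optional[str],
    ) -> str:
        # When a personal OAuth2 token exists for the user, attach it to the
        # SQLAlchemy URL so queries run with the user's own credentials.
        if impersonate_user and access_token:
            sep = "&" if "?" in url else "?"
            return f"{url}{sep}access_token={access_token}"
        return url  # no token stored: fall back to the unmodified URL
```

The key point is that the engine receives a per-user URL, so each user's queries are isolated to their own credentials.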
In order to populate the table with personal access tokens, the DB engine spec checks for a specific exception that signals that OAuth2 should start:

When called, the `start_oauth2_dance` method will return the error `OAUTH2_REDIRECT` to the frontend. The error is captured by the `ErrorMessageWithStackTrace` component, which provides a link to the user so they can start the OAuth2 authentication. Since this is implemented at the DB engine spec level, any query will trigger it, whether in SQL Lab, Explore, or dashboards; see the screenshots below for the UX.

Note that while the current implementation triggers OAuth2 when a query needs authorization, we could also implement affordances in the database UI to manually trigger OAuth2 to store the personal access tokens. This could be done in the future.
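The control flow described above can be sketched as follows. The names mirror the description (`start_oauth2_dance`, `OAUTH2_REDIRECT`), but the classes, the error-detection heuristic, and the authorization URL are simplified assumptions, not the exact Superset implementation:

```python
class OAuth2RedirectError(Exception):
    """Carries the OAUTH2_REDIRECT error payload to the frontend."""

    def __init__(self, authorization_url: str) -> None:
        super().__init__("OAUTH2_REDIRECT")
        self.authorization_url = authorization_url


def needs_oauth2(driver_error: Exception) -> bool:
    # A spec-specific check, e.g. matching the driver's auth-failure message.
    return "access denied" in str(driver_error).lower()


def run_query(sql: str) -> None:
    try:
        # Simulated driver failure standing in for a real engine call.
        raise RuntimeError("Access Denied: user must authenticate")
    except Exception as ex:
        if needs_oauth2(ex):
            # In the PR, start_oauth2_dance returns this error to the
            # frontend, which renders a link to begin authentication.
            raise OAuth2RedirectError("https://example.com/oauth/authorize") from ex
        raise
```

Because the check lives at the engine spec level, the same redirect fires whether the query comes from SQL Lab, Explore, or a dashboard.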
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
SQL Lab. Note that the query runs automatically once OAuth is completed:
SIP-85.Sql.Lab.mov
Explore. Note that the chart is automatically updated after OAuth:
SIP-85.Explore.mov
Same thing for dashboards:
SIP-85.Dashboard.mov
TESTING INSTRUCTIONS
Edit `superset_config.py` and add the client ID and secret:

ADDITIONAL INFORMATION