-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add more integrated support for Re:Dash users against the presto query_runner #3723
Conversation
…ing itself as `pyhive` and will now declare itself as `redash` as the source - Added logic for parsing the user email for the use executing the query to presto. The presto data source configured user will still take precedence and will ultimately default to redash when nothing else is available.
…ault to None which mimics no password being passed at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @Ralnoc! See my comment about the username functionality.
Btw, the password option was added in a previous pull request that was recently merged.
@@ -95,11 +98,14 @@ def get_schema(self, get_stats=False): | |||
return schema.values() | |||
|
|||
def run_query(self, query, user): | |||
user_name = self.configuration.get('username', user.email.split('@')[0] if user is not None else 'redash') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were previous attempts as similar functionality, but it conflicts with how Redash caches data. Redash caches the last query execution result. This means it will cache the last execution of whatever user who executed it last, and hence might leak data to other users while giving the admin a false sense of protection.
I feel that we can't introduce this change as is without addressing the caching issue.
But I think you already use this in your environment, so maybe you can share how it works for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are definitely leveraging this implementation in our environment.
Authentication behavior
The behavior of this code in practice allows for the username configuration for the Presto data source to override the individual user logins. (In the case of a situation where Presto is configured with single user access.)
Aside from that, it will attempt to leverage the email address as the basis for the username to pass to Presto. I would much rather prefer to update the SAML code so that we can pass the username as a unique attribute which updates the username attribute of the user object which is passed to the Query runner. That would allow for a consistent method of finding the username which it supposed to be leveraged against Presto. However that was a much more involved change to get in place. This allows a consistent behavior (In most environments, email matches username.) and gets this initial process in place.
I do acknowledge that using a SAML configured Re:Dash cluster against a authentication enforced Presto endpoints does not work right now. There is no way to obtain the specific user's password. However this implementation does help when auditing and investigating the Presto cluster itself, as you know the specific user running the queries in question.
Cached Results Scenario
The cached results is a known scenario and your concerns are understandable. However they are concerns with a slightly inaccurate assumption.
Presto, in general, does not offer a robust authorization solution internally. There is the ability to do static controls through files on disk, and there is one vendor who has added a much more robust support for Apache Ranger (Starburst).
However it is understood and accepted that until Re:Dash supports the ability to control access to queries based on group membership, that any data queried through Re:Dash is visible to all users within Re:Dash. If there is a need to segment what previously queried data people have access to, for example secure financial data, we would stand up a separate Re:Dash cluster environment which only grants access through SAML for the Finance users who need access to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you use this mostly for auditing/monitoring purposes?
However it is understood and accepted that until Re:Dash supports the ability to control access to queries based on group membership
You can control access to data sources (and the queries resulting from them) using groups. It doesn't cover all the use cases groups to queries model would cover, but it's still easier than setting up a separate Redash cluster :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely! Control to the sources itself is supported internally, but not the cached results of the Query. It is simply recognized that the only way to prevent anyone with access to Re:Dash from viewing the cached results is by that larger scale separation.
The primary purpose of having these users getting passed through to Presto is for the purpose of controlling the user's ability to leverage the server resources within Presto through Presto Resource Groups. Previously any Presto Resource Group tunings apply to anyone on Re:Dash which made it difficult to tune resource utilizations within Presto for different users leveraging Re:Dash.
For users leveraging Starburst Presto, by having usernames getting passed through, we gain the ability to supporting Ranger for authorization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this sort of makes sense. I'm still worried that for people without the full context this will cause more harm than good. How about we put this behind an environment variable?
Similar to how we do with Athena:
redash/redash/query_runner/athena.py
Lines 9 to 11 in b09ae46
ANNOTATE_QUERY = parse_boolean(os.environ.get('ATHENA_ANNOTATE_QUERY', 'true')) | |
SHOW_EXTRA_SETTINGS = parse_boolean(os.environ.get('ATHENA_SHOW_EXTRA_SETTINGS', 'true')) | |
OPTIONAL_CREDENTIALS = parse_boolean(os.environ.get('ATHENA_OPTIONAL_CREDENTIALS', 'true')) |
This way you can easily use this without a fork, while it's only enabled for people who fully understand the implication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is totally reasonable. Let me do a refactor on the code to implement this.
Hi, (This is a template message, but I mean every word of it. Also you're welcome to reply) Thank you for making this contribution. While we couldn't bring it to completion and merge, it's still very much appreciated. 🙇 In the past year the Redash code base gone under massive updates: on the backend we moved to Python 3 & RQ instead of Celery and on the frontend we replaced Angular with React. It's very likely this makes this PR irrelevant without significant changes. :-( I'm closing this PR now. But if you're still interested in making it happen, let me know and I will reopen. Thanks. |
pyhive
and will now declare itself asredash
as the sourceWhat type of PR is this? (check all applicable)
Description
Currently the behavior of the presto query runner is that is passes a default username of
redash
unless a user is specified in the data source configuration. Additionally, it reports the source aspyhive
, making it difficult in PrestoDB to tell the difference between users making use of pyhive themselves and Re:Dash running queries in the infrastructure.This change alters both behaviors.
redash
. This makes it extremely easy to identify queries that come from the Re:Dash application.@
and leverage the the first element (the username) as what gets passed to Presto as the username.None
to the query runner, it will default to the username ofredash
.