Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more integrated support for Re:Dash users against the presto query_runner #3723

Closed

Conversation

Ralnoc
Copy link
Contributor

@Ralnoc Ralnoc commented Apr 18, 2019

  • Changed the source passed from the presto pyhive client from declaring itself as pyhive and will now declare itself as redash as the source
  • Added logic for parsing the user email for the use executing the query to presto. The presto data source configured user will still take precedence and will ultimately default to redash when nothing else is available.

What type of PR is this? (check all applicable)

  • Feature

Description

Currently the behavior of the presto query runner is that is passes a default username of redash unless a user is specified in the data source configuration. Additionally, it reports the source as pyhive, making it difficult in PrestoDB to tell the difference between users making use of pyhive themselves and Re:Dash running queries in the infrastructure.

This change alters both behaviors.

  1. Source: This will now declare itself as redash. This makes it extremely easy to identify queries that come from the Re:Dash application.
  2. Username: The new behavior is as follows:
    1. Data source configuration trumps all. If the username is configured, it will always declare the user as being what is configured in the data source.
    2. If a user is passed to the query runner with an email address, it will take that email address, split on the @ and leverage the the first element (the username) as what gets passed to Presto as the username.
      1. This is important for services like the Starburst version of Presto which supports Apache Ranger based authorization control.
    3. If user is passed as None to the query runner, it will default to the username of redash.

…ing itself as `pyhive` and will now declare itself as `redash` as the source

- Added logic for parsing the user email for the use executing the query to presto. The presto data source configured user will still take precedence and will ultimately default to redash when nothing else is available.
…ault to None which mimics no password being passed at all.
Copy link
Member

@arikfr arikfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @Ralnoc! See my comment about the username functionality.

Btw, the password option was added in a previous pull request that was recently merged.

@@ -95,11 +98,14 @@ def get_schema(self, get_stats=False):
return schema.values()

def run_query(self, query, user):
user_name = self.configuration.get('username', user.email.split('@')[0] if user is not None else 'redash')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were previous attempts as similar functionality, but it conflicts with how Redash caches data. Redash caches the last query execution result. This means it will cache the last execution of whatever user who executed it last, and hence might leak data to other users while giving the admin a false sense of protection.

I feel that we can't introduce this change as is without addressing the caching issue.

But I think you already use this in your environment, so maybe you can share how it works for you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are definitely leveraging this implementation in our environment.

Authentication behavior

The behavior of this code in practice allows for the username configuration for the Presto data source to override the individual user logins. (In the case of a situation where Presto is configured with single user access.)

Aside from that, it will attempt to leverage the email address as the basis for the username to pass to Presto. I would much rather prefer to update the SAML code so that we can pass the username as a unique attribute which updates the username attribute of the user object which is passed to the Query runner. That would allow for a consistent method of finding the username which it supposed to be leveraged against Presto. However that was a much more involved change to get in place. This allows a consistent behavior (In most environments, email matches username.) and gets this initial process in place.

I do acknowledge that using a SAML configured Re:Dash cluster against a authentication enforced Presto endpoints does not work right now. There is no way to obtain the specific user's password. However this implementation does help when auditing and investigating the Presto cluster itself, as you know the specific user running the queries in question.

Cached Results Scenario

The cached results is a known scenario and your concerns are understandable. However they are concerns with a slightly inaccurate assumption.

Presto, in general, does not offer a robust authorization solution internally. There is the ability to do static controls through files on disk, and there is one vendor who has added a much more robust support for Apache Ranger (Starburst).

However it is understood and accepted that until Re:Dash supports the ability to control access to queries based on group membership, that any data queried through Re:Dash is visible to all users within Re:Dash. If there is a need to segment what previously queried data people have access to, for example secure financial data, we would stand up a separate Re:Dash cluster environment which only grants access through SAML for the Finance users who need access to it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you use this mostly for auditing/monitoring purposes?

However it is understood and accepted that until Re:Dash supports the ability to control access to queries based on group membership

You can control access to data sources (and the queries resulting from them) using groups. It doesn't cover all the use cases groups to queries model would cover, but it's still easier than setting up a separate Redash cluster :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely! Control to the sources itself is supported internally, but not the cached results of the Query. It is simply recognized that the only way to prevent anyone with access to Re:Dash from viewing the cached results is by that larger scale separation.

The primary purpose of having these users getting passed through to Presto is for the purpose of controlling the user's ability to leverage the server resources within Presto through Presto Resource Groups. Previously any Presto Resource Group tunings apply to anyone on Re:Dash which made it difficult to tune resource utilizations within Presto for different users leveraging Re:Dash.

For users leveraging Starburst Presto, by having usernames getting passed through, we gain the ability to supporting Ranger for authorization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this sort of makes sense. I'm still worried that for people without the full context this will cause more harm than good. How about we put this behind an environment variable?

Similar to how we do with Athena:

ANNOTATE_QUERY = parse_boolean(os.environ.get('ATHENA_ANNOTATE_QUERY', 'true'))
SHOW_EXTRA_SETTINGS = parse_boolean(os.environ.get('ATHENA_SHOW_EXTRA_SETTINGS', 'true'))
OPTIONAL_CREDENTIALS = parse_boolean(os.environ.get('ATHENA_OPTIONAL_CREDENTIALS', 'true'))

This way you can easily use this without a fork, while it's only enabled for people who fully understand the implication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is totally reasonable. Let me do a refactor on the code to implement this.

@arikfr
Copy link
Member

arikfr commented Jan 21, 2020

Hi,

(This is a template message, but I mean every word of it. Also you're welcome to reply)

Thank you for making this contribution. While we couldn't bring it to completion and merge, it's still very much appreciated. 🙇

In the past year the Redash code base gone under massive updates: on the backend we moved to Python 3 & RQ instead of Celery and on the frontend we replaced Angular with React. It's very likely this makes this PR irrelevant without significant changes. :-(

I'm closing this PR now. But if you're still interested in making it happen, let me know and I will reopen.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants