Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(embedded): Hide sensitive payload data from guest users #25878

Merged

Conversation

jfrag1
Copy link
Member

@jfrag1 jfrag1 commented Nov 6, 2023

SUMMARY

Currently, guest users can get access to more data than they need to when viewing an embedded dashboard. The api/v1/dashboard/{id} and api/v1/dashboard/{id}/datasets endpoints would return owners' first and last names. The latter would return information about the underlying database(s) the datasets belong to, which could be useful for an attacker. None of this data is necessary for the embedded dashboard to render.

I recognize this PR does break typical API patterns by stripping out certain data depending on the type of user requesting, which is pretty ugly. However, this approach avoids making breaking changes for regular API use, while making the embedded experience more secure.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

@@ -342,6 +342,11 @@ def get(
$ref: '#/components/responses/404'
"""
result = self.dashboard_get_response_schema.dump(dash)
if security_manager.is_guest_user():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfrag1 we should adhere to the "shift left" mentality and filter out inaccessible fields at the DAO layer as opposed to the API. This ensures we adhere to the DRY principle avoiding the need to repeat (and/or miss) said logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@john-bodley Thanks for reviewing, but I'm not sure how we're supposed to properly filter out certain fields of a record in the DAO layer. As far as I'm aware the security logic in the DAO layer is only "row-level", protecting entire objects, there's no column-level protections. Are you suggesting in the appropriate DAO method do something like

dashboard = db.session.query(...
if is_guest:
    dashboard.owners = []
    dashboard.changed_by = None

I don't like this since it introduces the possibility of these changes accidentally being committed, but please let me know if you had something else in mind or if this pattern does exist currently but I'm just not aware of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betodealmeida @dpgaspar @michael-s-molina @villebro what are your thoughts on where sanitization of the data response should occur? Note this doesn't seem to be API specific and thus I'm not sure we should explicitly be defining said logic there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementing this at the API layer is problematic because there's no single point of verification and also because you will query data that you don't have access to and filter it out later, wasting resources (imagine a table with many columns and you only have access to one column). This will also be true for the suggested code below:

dashboard = db.session.query(...
if is_guest:
    dashboard.owners = []
    dashboard.changed_by = None

Considering our current architecture, where we offer a RLS (Row Level Security) solution instead of depending on the database layer for these types of controls, the correct way of implementing this is to augment our security model to also support Column Level Security (CLS) or even Object Level Security (OLS).

I don't like this since it introduces the possibility of these changes accidentally being committed

Ideally, you won't query for what you don't have access to, eliminating this concern because the values won't be in the session.

Pinging @villebro for awareness given his current work on an improved security model.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do appreciate the discussion here, but this really feels like premature optimization to me. The changes in this PR only affect two endpoints, and I don't predict that this kind of logic will needed in a widespread manner.

If I'm wrong and there is a larger need/desire for this kind of security measure, I definitely agree we should re-assess, but I personally don't see a need to come up with a more "perfect" solution at this point.

Copy link
Member

@michael-s-molina michael-s-molina Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do appreciate the discussion here, but this really feels like premature optimization to me. The changes in this PR only affect two endpoints, and I don't predict that this kind of logic will needed in a widespread manner.

I thought that by referencing the concept of Object Level Security (OLS) it would be more clear why this is not a premature optimization and why we will have the same need in a widespread manner.

I'd argue the API layer is where that kind of security check belongs

If you research a little bit about CLS you'll see that many of these security checks are implemented at the database layer, even further from the API layer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's risky to implement this fix in a single interface, as it gives no guarantees that this information won't be exposed via some other interface, either pre-existing, or one that's added later. Mind you, this is not the only security concern regarding guest tokens - the issue of exposing granular information about the underlying query has been raised multiple times in the community, and is also something I feel we need to tackle soon.

Could we setup a meeting to discuss viable long-term solutions to these issues? I propose meeting to agree on the the following:

  • A long term vision for fixing this and other similar security concerns. I agree with @michael-s-molina that baking this into the security work I've started sounds like a good idea. I can share what I have so far and my thoughts of how this can be taken forward.
  • Agreeing on a quick fix to unblock heavy users of guest tokens to bridge the gap between now and when we have a proper security mechanism in place for handling this and other similar security issues.

Copy link
Member Author

@jfrag1 jfrag1 Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that by referencing the concept of Object Level Security (OLS) it would be more clear why this is not a premature optimization and why we will have the same need in a widespread manner.

This is referring to a security feature similar to row-level security where Superset could apply security filters to queries against connected databases, while the security issue at hand in this PR has to do with data from the Superset metadata DB.

If you research a little bit about CLS you'll see that many of these security checks are implemented at the database layer, even further from the API layer.

My understanding is that CLS is a security feature commonly implemented in data warehouses/BI tools, not a guide for how a server should implement security checks for data it owns

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a great idea to meet on long term solutions. If we were to move the logic for this endpoint into the schema (I left one more suggestion that would remove this from the api altogether) should we move forward with this for now as @villebro suggested?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a great idea to meet on long term solutions. If we were to move the logic for this endpoint into the schema (I left one more suggestion that would remove this from the api altogether) should we move forward with this for now as @villebro suggested?

@eschutho I also think moving this into the Marshmallow schema is probably the best solution for now.

@pull-request-size pull-request-size bot added size/M and removed size/S labels Nov 13, 2023
Copy link
Member

@eschutho eschutho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more suggestion.

schema = (
self.dashboard_get_response_schema_guest
if security_manager.is_guest_user()
else self.dashboard_get_response_schema
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfrag1 I think we can even move the schema logic into Marshmallow dynamically with either a type function or a @marshmallow.post_dump decorator. That way the logic isn't in the api layer and you won't need two schemas.

@pull-request-size pull-request-size bot added size/S and removed size/M labels Nov 16, 2023
@jfrag1 jfrag1 closed this Nov 16, 2023
@jfrag1 jfrag1 reopened this Nov 16, 2023
Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfrag1 I would prefer it if we remove the owners property all together if guest tokens aren't supposed to know who the owners are. Or is there some other reason we want to keep the property there? Also, I feel we really should add tests for both cases here (hence "Request changes").

@post_dump()
def post_dump(self, serialized: dict[str, Any], **kwargs: Any) -> dict[str, Any]:
if security_manager.is_guest_user():
serialized["owners"] = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to fully remove the property owners, as replacing it with an empty list is IMO conveying incorrect info

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@villebro I agree. The only caveat would be if the owners field is required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@john-bodley ideally I think we should relax that requirement if we want to make this data conditional on the permissions of the logged in user. Based on the proposed changes, I think it should be as follows:

  • regular user: all relevant owner info is conveyed (current behavior)
  • guest token: no owner property is present in the response

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@villebro I initially opted to keep the owners property in the payload because the dashboard frontend expects it to always be there, and fails to render if it's not.

I found a way to exclude it from the response payload and have the client tack the property on after the dashboard is requested, which allows the embedded dashboard to render successfully.

@post_dump()
def post_dump(self, serialized: dict[str, Any], **kwargs: Any) -> dict[str, Any]:
if security_manager.is_guest_user():
serialized["owners"] = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

@jfrag1
Copy link
Member Author

jfrag1 commented Dec 4, 2023

@villebro Are you able to take another look at this?

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for all the iterations @jfrag1 !

@jfrag1 jfrag1 merged commit 386d4e0 into apache:master Dec 4, 2023
29 checks passed
@jfrag1 jfrag1 deleted the jack/embedded-security-strip-out-unneeded-data branch December 4, 2023 22:53
@michael-s-molina michael-s-molina added v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch v3.1 Label added by the release manager to track PRs to be included in the 3.1 branch labels Dec 6, 2023
michael-s-molina pushed a commit that referenced this pull request Dec 8, 2023
michael-s-molina pushed a commit that referenced this pull request Dec 8, 2023
jinghua-qa pushed a commit to preset-io/superset that referenced this pull request Dec 8, 2023
sadpandajoe pushed a commit to preset-io/superset that referenced this pull request Dec 11, 2023
josedev-union pushed a commit to Ortege-xyz/studio that referenced this pull request Jan 22, 2024
@mistercrunch mistercrunch added 🍒 3.0.3 🍒 3.0.4 🍒 3.1.0 🍒 3.1.1 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels labels Mar 8, 2024
sfirke pushed a commit to sfirke/superset that referenced this pull request Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels preset:2023.49 size/M v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch v3.1 Label added by the release manager to track PRs to be included in the 3.1 branch 🍒 3.0.3 🍒 3.0.4 🍒 3.1.0 🍒 3.1.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants