HDDS-3056. Allow users to list volumes they have access to, and optionally allow all users to list all volumes #696

Merged
merged 20 commits into apache:master on Apr 21, 2020

Conversation

smengcl
Contributor

@smengcl smengcl commented Mar 18, 2020

What changes were proposed in this pull request?

Optionally allow non-admin users to list all volumes.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3056

How was this patch tested?

NOTE: The info below might be a bit outdated. Need to double-check.

Tested manually in ozonesecure docker-compose.
testuser is an admin user (listed in ozone.administrators); testuser2 is a non-admin user.

| Command | Before | After (ACL disabled) | After (ACL enabled) |
| --- | --- | --- | --- |
| ozone sh volume list | Lists volumes owned by the user | Lists volumes owned by the user | Lists volumes the user has LIST ACL permission on |
| ozone sh volume list --all | n/a | Lists all volumes if ozone.om.volume.listall.allowed is true; if false, only admins can list all volumes, and non-admins get an OMException with PERMISSION_DENIED | Same as ACL disabled |
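
For reference, here is a minimal, self-contained sketch of the `--all` gating behaviour described in the second table row. The class and method names below are illustrative stand-ins, not the actual OzoneManager code:

```java
import java.util.List;

// Illustrative sketch of the 'ozone sh volume list --all' gate from the table
// above: all volumes are returned if ozone.om.volume.listall.allowed is true
// or the caller is an OM admin; otherwise the request is rejected.
public class ListAllVolumesGate {

  static class PermissionDeniedException extends RuntimeException {
    PermissionDeniedException(String msg) {
      super("PERMISSION_DENIED " + msg);
    }
  }

  static List<String> listAllVolumes(List<String> allVolumes,
      boolean listAllAllowed, boolean callerIsAdmin) {
    if (!listAllAllowed && !callerIsAdmin) {
      throw new PermissionDeniedException(
          "Only admin users are authorized to list all Ozone volumes.");
    }
    return allVolumes;
  }

  public static void main(String[] args) {
    List<String> volumes = List.of("vol1", "vol2");
    System.out.println(listAllVolumes(volumes, true, false));   // non-admin, allowed
    try {
      listAllVolumes(volumes, false, false);                    // non-admin, disallowed
    } catch (PermissionDeniedException e) {
      System.out.println(e.getMessage());
    }
  }
}
```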

After the patch (non-admin can list all volumes)

```
bash-4.2$ kinit -kt /etc/security/keytabs/testuser.keytab testuser/scm@EXAMPLE.COM
bash-4.2$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: testuser/scm@EXAMPLE.COM

Valid starting     Expires            Service principal
03/18/20 21:20:25  03/19/20 21:20:25  krbtgt/EXAMPLE.COM@EXAMPLE.COM
	renew until 03/25/20 21:20:25
bash-4.2$ ozone sh volume create vol1
bash-4.2$ ozone sh volume list
{
  "metadata" : { },
  "name" : "vol1",
  "admin" : "root",
  "owner" : "testuser/scm@EXAMPLE.COM",
...
bash-4.2$ kdestroy
bash-4.2$ kinit -kt /etc/security/keytabs/testuser2.keytab testuser2/scm@EXAMPLE.COM
bash-4.2$ ozone sh volume list --all
{
  "metadata" : { },
  "name" : "vol1",
  "admin" : "root",
  "owner" : "testuser/scm@EXAMPLE.COM",
  "creationTime" : "2020-03-18T21:20:35.370Z",
...
```

For comparison, before the patch (non-admin can't list all volumes)

```
# kinit'ed as testuser2 (non-admin)
bash-4.2$ ozone sh volume list --all
PERMISSION_DENIED org.apache.hadoop.ozone.om.exceptions.OMException: Only admin users are authorized to create or list Ozone volumes.
bash-4.2$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: testuser2/scm@EXAMPLE.COM

Valid starting     Expires            Service principal
03/18/20 21:15:34  03/19/20 21:15:34  krbtgt/EXAMPLE.COM@EXAMPLE.COM
	renew until 03/25/20 21:15:34
```

@smengcl
Contributor Author

smengcl commented Mar 18, 2020

@elek @xiaoyuyao Would you take a look? Thanks.

Essentially what this does is remove the admin check in listAllVolumes.

I believe this shouldn't pose security risks beyond intentionally "leaking" all volume names (for now). What do you security experts think?

Contributor

@dineshchitlangia dineshchitlangia left a comment


Thanks @smengcl for working on this.
The code change looks good.

I do not have details about the conversation with Arpit/Sanjay you mentioned, so I am not merging this PR yet. As far as security is concerned, listing all volumes is functionality suited for system admins. If I am not supposed to have access to read the contents of some volumes, what benefit do I gain from being able to list them? Thus I am not able to see what value this change brings to Ozone or the user experience.

@smengcl
Contributor Author

smengcl commented Mar 19, 2020

> Thanks @smengcl for working on this.
> The code change looks good.
>
> I do not have details about the conversation with Arpit/Sanjay you mentioned, so I am not merging this PR yet. As far as security is concerned, listing all volumes is functionality suited for system admins. If I am not supposed to have access to read the contents of some volumes, what benefit do I gain from being able to list them? Thus I am not able to see what value this change brings to Ozone or the user experience.

Thanks for the comment, Dinesh. Your concern is totally valid.

One motivation/background for this change is that in ofs://, all users should be able to list "root", which at the moment is essentially a listVolumesByUser() client call that returns only the volumes the current user created.

There is also a tmp mount (HDDS-2929) implemented for ofs://. Its current implementation asks an admin to create a volume named tmp (with world ALL ACL) during cluster setup, so every other user accessing the mount point ofs://om/tmp/ is transparently directed to bucket md5(current_username) in volume tmp.

Now there is a visual glitch when listing: since the volume tmp is created by an admin, if another user runs ozone fs -ls ofs://om/, they won't see the volume/mount tmp at all. I admit a better way to fix this might be to adjust the listing results and inject the tmp mount; the point is that this is one example of the motivation.
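
A minimal, self-contained sketch of the bucket-name mapping described above. The md5-of-username scheme is taken from the comment; the hex encoding and the helper name are assumptions, not the actual ofs:// code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: maps a username to the per-user tmp bucket name
// (md5(current_username)), as described in the comment above.
public class TmpMountBucketName {

  static String bucketNameFor(String username) throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(username.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));  // lowercase hex encoding (assumed)
    }
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    // e.g. ofs://om/tmp/ for this user would resolve to volume "tmp",
    // bucket bucketNameFor("testuser2/scm@EXAMPLE.COM")
    System.out.println(bucketNameFor("testuser2/scm@EXAMPLE.COM"));
  }
}
```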

Yes, there is a security implication behind this. I discussed it a bit with @xiaoyuyao today.
We plan to discuss it with @arp7 tomorrow and see if we actually want to loosen the admin check in listAllVolumes().

@smengcl
Contributor Author

smengcl commented Mar 20, 2020

Discussed with @arp7 @xiaoyuyao @dineshchitlangia. Our conclusion is that we want to allow users who have access to some volumes to be able to list them. We will add a new config param in ozone-site.xml to allow the OM admin to control whether they want to (1) show all volumes to all users or (2) only allow users to list volumes they have access to. This config param should default to (1).

I'll repurpose this jira. TODOs:

  1. Modify the existing API(s) listVolumesByUser(?) to allow users who have access to some volumes to list them;
  2. Add the new config param as mentioned above (a sketch of the ozone-site.xml entry follows this list);
  3. Add unit/integration tests.
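
For illustration only, here is roughly what the new ozone-site.xml entry could look like. The key name ozone.om.volume.listall.allowed is the one used in the test table above; the default of true (option (1)) follows the conclusion above, and the description wording is an assumption, not the final committed text:

```xml
<!-- Sketch only: key name taken from this PR; default/description assumed. -->
<property>
  <name>ozone.om.volume.listall.allowed</name>
  <value>true</value>
  <description>
    If true, any user may run 'ozone sh volume list --all'.
    If false, only OM admins may list all volumes; other users get
    PERMISSION_DENIED.
  </description>
</property>
```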

@smengcl smengcl changed the title HDDS-3056. Allow all users to list all volumes HDDS-3056. Optionally allow all users to list all volumes Mar 23, 2020
@smengcl smengcl changed the title HDDS-3056. Optionally allow all users to list all volumes HDDS-3056. Allow users to list volumes they have access to Mar 23, 2020
@smengcl smengcl changed the title HDDS-3056. Allow users to list volumes they have access to HDDS-3056. Allow users to list volumes they have access to, and optionally allow all users to list all volumes Mar 23, 2020
@smengcl
Contributor Author

smengcl commented Mar 24, 2020

4bdeafd should've accomplished the "optionally allow all users to list all volumes" part. I'll add a test for that later.

As for the "Allow users to list volumes they have access to" part, at first I tried to write the code in OmMetadataManagerImpl#listAllVolumes, but then realized it wasn't a good spot to check ACLs, since that class shouldn't really import IAccessAuthorizer (as used in OzoneManager#checkAcls).

So eventually I decided to implement the logic in OzoneManager#listVolumeByUser. Since the deeper-level calls into VolumeManager and OmMetadataManager don't have access to ACLs (for the reason described in the previous paragraph), the implementation lists all volumes internally, and then checks the ACL of each volume against the current UGI when ACLs are enabled.

This might lead to higher memory consumption when there are a lot of volumes in OM, but mitigating it requires a bit more change. I might try to move some of the listing iterator logic into OzoneManager to solve this, if it turns out to be a problem.
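
A minimal, self-contained sketch of the "list everything internally, then filter by ACL" approach described above. All types and the ACL predicate here are illustrative stand-ins, not OmVolumeArgs, VolumeManager, or the actual checkAcls call:

```java
import java.util.List;
import java.util.function.BiPredicate;
import java.util.stream.Collectors;

// Sketch of the list-then-filter idea: when ACLs are enabled, list all
// volumes and keep only the ones the current user may LIST; otherwise fall
// back to the owner-based listing.
public class ListVolumesSketch {

  record Volume(String name, String owner) { }

  static List<Volume> listVolumesForUser(List<Volume> allVolumes, String user,
      boolean aclEnabled, BiPredicate<Volume, String> canList) {
    if (!aclEnabled) {
      // Without ACLs: return only the volumes owned by the user.
      return allVolumes.stream()
          .filter(v -> v.owner().equals(user))
          .collect(Collectors.toList());
    }
    // With ACLs: list all volumes internally, then filter by LIST access.
    return allVolumes.stream()
        .filter(v -> canList.test(v, user))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Volume> all = List.of(
        new Volume("vol1", "testuser"),
        new Volume("vol2", "testuser2"));
    // Toy ACL rule standing in for the real per-volume check: a user may
    // list their own volumes plus "vol1".
    BiPredicate<Volume, String> canList =
        (v, u) -> v.owner().equals(u) || v.name().equals("vol1");
    System.out.println(listVolumesForUser(all, "testuser2", true, canList));
  }
}
```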

@smengcl
Contributor Author

smengcl commented Mar 27, 2020

Rebased onto latest master to include the acceptance test failure fix HDDS-3284.

@smengcl smengcl requested a review from xiaoyuyao March 30, 2020 20:29
```java
List<OmVolumeArgs> listOfAllVolumes = volumeManager.listVolumes(
    null, prefix, prevKey, maxKeys);
// Filter all volumes by ACL LIST permission of UGI
return listOfAllVolumes.stream().filter(v -> v.getAclMap()
```

@smengcl smengcl left a comment on this code

Use accessAuthorizer.checkAccess

@smengcl
Contributor Author

smengcl commented Apr 6, 2020

Will add the integration test soon.
Will resolve minor conflict with the Ozone Shell refactoring.

@xiaoyuyao
Contributor

Looks like this needs a rebase after HDDS-2691.

@smengcl
Contributor Author

smengcl commented Apr 7, 2020

Just added integration tests, but it seems only one cluster can be launched per test class.
The four tests succeed when run separately, but when I run them all at once in IntelliJ, the second one gets stuck in a loop. Any help?

@smengcl
Contributor Author

smengcl commented Apr 7, 2020

> Just added integration tests, but it seems only one cluster can be launched per test class.
> The four tests succeed when run separately, but when I run them all at once in IntelliJ, the second one gets stuck in a loop. Any help?

Rebased.

@smengcl
Contributor Author

smengcl commented Apr 10, 2020

Rebased again because of HDDS-3340 (class ListVolumeHandler moved).

@smengcl
Contributor Author

smengcl commented Apr 10, 2020

I was able to dig a bit into the root cause of the timeout in the new integration test I added.
The symptom is that the tests succeed if I run each test case separately, but fail on the second test when I run all tests together.

It turns out that when a mini ozone cluster launches for a second time in the same test class, in setOwner() the OM adds the same volume to the owner's volume list a second time (the list already contains the volume entry) and succeeds, which is very weird.
The result is a malformed list in UserVolumeInfo for the user; see the prevVolList variable in the screenshot below:
(screenshot ss1: prevVolList contains a duplicated volume entry)

This eventually causes testAclDisabledListAllDisallowed to get stuck in the it.hasNext() infinite loop and time out, because of how VolumeIterator and OmMetadataManagerImpl#listVolumes work.
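
A minimal, self-contained illustration of how a duplicated entry can break prevKey-style pagination of the kind described above. This is a toy model, not the actual VolumeIterator or OmMetadataManagerImpl#listVolumes code:

```java
import java.util.List;

// Toy model: page through a name list by passing the last returned name as
// prevKey ("resume after prevKey"). If the list contains a duplicate, a naive
// "resume after the FIRST occurrence of prevKey" strategy keeps returning the
// same page forever, so a hasNext()-style loop never terminates.
public class DuplicateEntryPagination {

  static List<String> listAfter(List<String> names, String prevKey, int maxKeys) {
    int start = 0;
    if (prevKey != null) {
      start = names.indexOf(prevKey) + 1;   // naive resume point
    }
    return names.subList(start, Math.min(start + maxKeys, names.size()));
  }

  public static void main(String[] args) {
    List<String> malformed = List.of("vol1", "vol1");    // duplicated entry
    String prevKey = null;
    for (int page = 0; page < 5; page++) {               // bounded for the demo
      List<String> batch = listAfter(malformed, prevKey, 1);
      if (batch.isEmpty()) {
        break;                                           // never reached here
      }
      System.out.println("page " + page + ": " + batch); // always [vol1]
      prevKey = batch.get(batch.size() - 1);
    }
  }
}
```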

I was able to confirm this by setting a breakpoint inside addVolumeToOwnerList().

If I run only testAclDisabledListAllDisallowed directly in IntelliJ, the test case just passes, which makes the problem weirder, because I do call the shutdown function in MiniOzoneClusterImpl to do the cleanup at the end of each test case, and it did delete the temp directory for the entire cluster. In theory this should have performed the cleanup.

My questions:

  1. Is there some other in-memory cache (TableCache) that is accidentally persisted across mini clusters (i.e. not fully cleaned up in MiniOzoneClusterImpl)? If so, we just need to fix the test utility.

  2. Or could it be that the userTable is flushed by mistake? In that case this would be a major bug (outside the scope of this jira) that should be fixed.

Pinging for some help @bharatviswa504 @elek

@smengcl
Contributor Author

smengcl commented Apr 10, 2020

I just posted #806, which might be related to the issue.

@smengcl
Contributor Author

smengcl commented Apr 14, 2020

#806 is merged. Will rebase this PR.

@smengcl
Contributor Author

smengcl commented Apr 14, 2020

Just rebased.

Somehow all the integration tests are timing out on my machine, so I can't verify locally whether the problem is fixed by #806. We'll see.

@smengcl
Contributor Author

smengcl commented Apr 20, 2020

Even after reverting the lock change, the acceptance test still fails.
Previously there was a passing run: https://github.com/apache/hadoop-ozone/runs/590172040
The details show some s3 test failures which seem unrelated.
Rebased onto latest master to hopefully resolve this test failure.

@smengcl
Contributor Author

smengcl commented Apr 20, 2020

The acceptance failure is related to HDDS-3456.
So this PR is blocked by the above jira.

Or we can commit since the failures are unrelated. What do you think? @xiaoyuyao

@adoroszlai
Contributor

> Or we can commit since the failures are unrelated. What do you think? @xiaoyuyao

Please don't commit with unrelated failures. These may hide other, related failures, as we've seen in the past.

Please review #844 instead. ;)

@smengcl
Contributor Author

smengcl commented Apr 20, 2020

> > Or we can commit since the failures are unrelated. What do you think? @xiaoyuyao
>
> Please don't commit with unrelated failures. These may hide other, related failures, as we've seen in the past.
>
> Please review #844 instead. ;)

Sure thing. Let's wait for the #844 merge. Then I'll do a rebase. :)

@smengcl
Contributor Author

smengcl commented Apr 20, 2020

Rebased as HDDS-3456 is committed.

@smengcl
Contributor Author

smengcl commented Apr 21, 2020

Got a green run after the rebase. Will commit shortly.

Thanks @xiaoyuyao and @dineshchitlangia for the review.

@smengcl smengcl merged commit 091993b into apache:master Apr 21, 2020