feat(ingest): datamodel to ingest organisation role metadata for a dataset #8267

Merged
merged 2 commits into from
Jul 12, 2023

Contributor

@sheeru sheeru commented Jun 20, 2023

The aim of this feature is to let users view the roles required to access a dataset in an organisation (the roles/policies defined in that organisation's Access Management System). Users should also be able to request the appropriate roles from the DataHub frontend, via URLs that lead to the organisation's system.

Will attach a detailed RFC document.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the product PR or Issue related to the DataHub UI/UX label Jun 20, 2023
@sheeru sheeru changed the title feat(datamodel): datamodel to ingest organisation role metadata for a dataset feat(ingest): datamodel to ingest organisation role metadata for a dataset Jun 20, 2023
* Provisioned users of a role
*/
@Aspect = {
"name": "externalRoleProvisionedUser"
Collaborator

What will happen when the requirement to add groups comes along? How about other nested roles?

I'd rename this aspect to "actors", so that we can support users, groups, and nested roles

Contributor Author

Renamed to actors; we can now add groups under actors when we need to.

@jjoyce0510
Collaborator

Hi @sheeru !

Thanks for the PR.

A few things I want us to consider, in addition to those comments I've already left:

  • Have we considered a "live lookup" approach? Wherein DataHub calls your service to determine

    1. whether a particular user has access to a particular dataset 
    2. request access for a particular user to a particular dataset (or role) 
    

If this does not work, why not? The reason I ask is that the changes required to support this in DataHub are a bit more constrained.

  • How can we limit the blast radius of these changes? A few things I'd request if we do go with this approach:

    1. Let's rename externalRoleMetadata to just "role". A role will be an external role, by default. dataHubRole is already an entity with a name that indicates its usage within the DataHub system.
    2. Let's avoid defining an enum for PrivilegeType and simply use a string to denote the privileges associated with the role. This is because each system we want to integrate with in the future may have a different "native privilege type" concept. For example, in some SQL systems you do not have READ, WRITE, etc, but instead have "SELECT" or "INSERT" or "DELETE".
    3. Is there any chance a group or another role may be associated with a Role? If yes, I'd want to have some plan to support group as something a role can point to.
    4. Let's rename the externalRolesMetadata aspect of the dataset entity to simply "access". This can grow to include additional access-related information relevant for the dataset over time.

Let me know if you have questions!

Cheers
John

@sheeru
Contributor Author

sheeru commented Jun 22, 2023

Hi @jjoyce0510 / @shirshanka ,

Regarding the live lookup: currently our primary requirement is driven by the UI. People can navigate to a dataset and view the list of roles necessary to access that dataset. They can click on a dataset role and request access from that page.
Attaching the RFC for further details:
RFC - Addition of External Access Managemt.pdf

Also, PrivilegeType is defined as an enum because the UI should be able to group the access types and show the roles under each group. Hence an enum is preferred for handling the list of access types.

Also, the current requirement is to map users to datasets, since a user can log in and request access to a dataset. In our organisation as well, we don't map roles to user groups for now.

I think the live lookup and the mapping of groups to roles can be added as part of phase 2 of the RFC.

@anshbansal anshbansal added the community-contribution PR or Issue raised by member(s) of DataHub Community label Jun 23, 2023
@sheeru sheeru force-pushed the accessmgt branch 2 times, most recently from 92d16e5 to d1366ff Compare June 26, 2023 02:47
@sheeru
Contributor Author

sheeru commented Jun 26, 2023

@jjoyce0510 I have addressed the comments; please let me know if anything more is required.
Regarding the 2 points which you highlighted:

1. whether a particular user has access to a particular dataset
2. request access for a particular user to a particular dataset (or role)

Since we are accessing the dataset from the UI, we know who has logged in. We compare whether the currently logged-in user is among the provisioned users of a role, and from that we can determine whether the user has access to the role and the dataset. More details are in the attached RFC.
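
For readers following along, the check described above could look roughly like the sketch below. It assumes the role's provisioned users have already been ingested into DataHub. The `Role` class, its field names and the helper methods are hypothetical and only mirror the fields discussed in this PR; they are not the actual generated classes.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: Role only mirrors the fields discussed in this PR.
final class ProvisionedAccessSketch {

    static final class Role {
        final String name;
        final String type;          // free-form privilege string, e.g. "READ"
        final String requestUrl;    // link into the external access management system
        final Set<String> userUrns; // urns of users already provisioned with this role

        Role(String name, String type, String requestUrl, Set<String> userUrns) {
            this.name = name;
            this.type = type;
            this.requestUrl = requestUrl;
            this.userUrns = userUrns;
        }
    }

    /** True if the logged-in user is among the role's provisioned users. */
    static boolean hasRole(Role role, String loggedInUserUrn) {
        return role.userUrns.contains(loggedInUserUrn);
    }

    /** True if the user holds at least one of the roles attached to the dataset. */
    static boolean hasDatasetAccess(List<Role> datasetRoles, String loggedInUserUrn) {
        return datasetRoles.stream().anyMatch(role -> hasRole(role, loggedInUserUrn));
    }

    public static void main(String[] args) {
        // Example data is made up; iam.example.com is a placeholder host.
        List<Role> roles = List.of(
            new Role("orders_reader", "READ", "https://iam.example.com/roles/orders_reader",
                Set.of("urn:li:corpuser:alice")),
            new Role("orders_admin", "ADMIN", "https://iam.example.com/roles/orders_admin",
                Set.of()));

        String user = "urn:li:corpuser:alice";
        for (Role role : roles) {
            // The UI could show "provisioned" or a request link per role.
            System.out.println(role.name + ": " + (hasRole(role, user) ? "provisioned" : role.requestUrl));
        }
        System.out.println("has dataset access: " + hasDatasetAccess(roles, user));
    }
}
```

The per-role loop in `main` is one way the UI could decide between showing a role as already provisioned and showing its request link.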

@jjoyce0510
Collaborator

Thanks for attaching the RFC, I've taken a look at it.

I am very curious to understand why the listing of roles is so important here, as opposed to showing simpler things: does the user have access or not, and how can the user request access?

I'm imagining a flow where DataHub can simply redirect to an external URL where the role can be finally selected from a list of roles which have access.

We'd likely support a pluggable backend component (in the GraphQL server) that accepts 2 inputs:

  • user urn
  • dataset urn

and returns a simple boolean which represents whether the user urn has access to the dataset urn. We'd also allow for plugging in a "request access url" which would be a custom url that is redirected to when the user clicks to request access.
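
To make the shape of this proposal concrete, here is a minimal sketch of such a pluggable component. Everything in it is an assumption for illustration: `AccessResolver`, `InMemoryAccessResolver`, the method names and the `iam.example.com` URL do not exist in DataHub, and a real implementation would call the external system instead of reading an in-memory map.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the live-lookup idea; none of these types exist in DataHub.
final class LiveLookupSketch {

    /** Assumed plug-in contract: two inputs, a boolean answer, and a request URL. */
    interface AccessResolver {
        /** True if the user urn currently has access to the dataset urn. */
        boolean hasAccess(String userUrn, String datasetUrn);

        /** URL the UI would redirect to when the user clicks to request access. */
        String requestAccessUrl(String userUrn, String datasetUrn);
    }

    /** In-memory stand-in for a real-time call to the external access system. */
    static final class InMemoryAccessResolver implements AccessResolver {
        private final Map<String, Set<String>> userUrnsByDatasetUrn;
        private final String requestBaseUrl;

        InMemoryAccessResolver(Map<String, Set<String>> userUrnsByDatasetUrn, String requestBaseUrl) {
            this.userUrnsByDatasetUrn = userUrnsByDatasetUrn;
            this.requestBaseUrl = requestBaseUrl;
        }

        @Override
        public boolean hasAccess(String userUrn, String datasetUrn) {
            return userUrnsByDatasetUrn.getOrDefault(datasetUrn, Set.of()).contains(userUrn);
        }

        @Override
        public String requestAccessUrl(String userUrn, String datasetUrn) {
            // Urns contain ':' and ',' so they are URL-encoded before use as query parameters.
            return requestBaseUrl
                + "?user=" + URLEncoder.encode(userUrn, StandardCharsets.UTF_8)
                + "&dataset=" + URLEncoder.encode(datasetUrn, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        String dataset = "urn:li:dataset:(urn:li:dataPlatform:hive,sales.orders,PROD)";
        AccessResolver resolver = new InMemoryAccessResolver(
            Map.of(dataset, Set.of("urn:li:corpuser:alice")),
            "https://iam.example.com/request");

        System.out.println(resolver.hasAccess("urn:li:corpuser:alice", dataset));
        System.out.println(resolver.hasAccess("urn:li:corpuser:bob", dataset));
        System.out.println(resolver.requestAccessUrl("urn:li:corpuser:bob", dataset));
    }
}
```

With a contract like this, DataHub would store no role-to-user mapping at all; the trade-off is that the UI can only show a yes/no answer and a link, not the list of roles.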

If we can have an approach like this, we have 2 major benefits:

  1. Simplicity, Separation of Concerns: We do not need to manage synchronizing the external Roles and their users into DataHub and keep this mapping up to date. Instead, that information stays in the external system and is simply queried in real time. This reduces the overhead your team will need to spend on keeping information up to date in DataHub and reduces the possibility of difficult synchronization bugs.

  2. Extensibility, Generalization: It is not clear to me that the proposed approach will generalize well across other organizations, who may not have the same requirements around requesting access for a specific role. By using a live-lookup approach, we can completely separate DataHub from the external system and provide an interface which can be easily extended to support new use cases as they arise.

So given these benefits, please help me to understand why the proposed approach will be better in the long term and applicable across multiple organizations who may use it...

Also, a few other responses on the thread:

Also, PrivilegeType is defined as enum because, UI should be able to group the access types and show the roles under
that group. Hence enum is preferred to handle the list of access types.

We can easily group inside the UI even if we do not use enum. Please change to be a string for now. We can always add back an enum later once the Domain becomes more clear. My goal with this PR is to make it as easy as possible to extend beyond your use cases!
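
As a rough illustration of grouping without an enum, the sketch below groups role names by a free-form type string. The `Role` class and `groupNamesByType` are hypothetical names, and normalising the string by trimming and upper-casing is an assumption rather than something this PR implements.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: grouping roles by a plain string type instead of an enum.
final class GroupRolesByTypeSketch {

    static final class Role {
        final String name;
        final String type; // whatever the source system calls the privilege

        Role(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Groups role names under a normalised type key, with keys in sorted order. */
    static Map<String, List<String>> groupNamesByType(List<Role> roles) {
        return roles.stream().collect(Collectors.groupingBy(
            role -> role.type.trim().toUpperCase(Locale.ROOT),
            TreeMap::new,
            Collectors.mapping(role -> role.name, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Role> roles = List.of(
            new Role("basic_reader", "READ"),
            new Role("pii_reader", "read"),
            new Role("table_admin", "ADMIN"),
            new Role("sql_select", "SELECT")); // a native SQL privilege that no fixed enum anticipated

        groupNamesByType(roles).forEach((type, names) -> System.out.println(type + " -> " + names));
    }
}
```

A type the model has never seen, such as SELECT here, simply becomes its own group, which is the extensibility argument for using a string.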

Cheers
John

}

type Actor {
provisionedUsers: [CorpUser!]
Collaborator

comments please!

Contributor Author

Done

import java.net.URISyntaxException;


public final class RoleUrn extends Urn {
Collaborator

This class is not necessary, feel free to remove!

Contributor Author

Done

}
record Access {

roles: array[RoleAssociation]
Collaborator

comments please!

Contributor Author

Done

}
record Access {

roles: array[RoleAssociation]
Collaborator

let's also make this field optional

Contributor Author

Done

"name": "AssociatedWith",
"entityTypes": [ "role" ]
}
urn: RoleUrn
Collaborator

This can just be "Urn", we recommend against using strongly-typed urns now. (they are legacy)

Contributor Author

Done

/**
* Provisioned users of a role
*/
@Aspect = {
Collaborator

This does not need to be an aspect itself, just the parent record

Contributor Author

Done

"name": "Has",
"entityTypes": [ "corpuser" ]
}
provisionedUser: Urn
Collaborator

minor nitpick: rename to "user"

Contributor Author

Done

/**
* Link to access external access management
*/
requesturl: optional string
Collaborator

Camel case: requestUrl

Contributor Author

Done

/**
* Can be READ, ADMIN, WRITE
*/
type: string
Collaborator

Thanks for making this a string

/**
* List of provisioned users of a role
*/
provisionedUsers: array[RoleProvisionedUser]
Collaborator

let's just name this more simply: users

Contributor Author

Done

@Aspect = {
"name": "roleProvisionedUser"
}
record RoleProvisionedUser {
Collaborator

Let's rename more simply: RoleUser

Contributor Author

Done

Collaborator

@jjoyce0510 jjoyce0510 left a comment

For the current approach, I left some naming comments. Let's consider the above comment about the requirements and pros / cons first ^

@@ -85,7 +85,8 @@ public class DatasetType implements SearchableEntityType<Dataset, String>, Brows
 SIBLINGS_ASPECT_NAME,
 EMBED_ASPECT_NAME,
 DATA_PRODUCTS_ASPECT_NAME,
-BROWSE_PATHS_V2_ASPECT_NAME
+BROWSE_PATHS_V2_ASPECT_NAME,
+ACCESS_DATASET_ASPECT_NAME
Collaborator

indenting is inconsistent it seems!

Contributor Author

Done

roles: [RoleAssociation!]
}

type Role implements Entity{
Collaborator

nit: Entity {

add a space

Contributor Author

Done

}

type Actor {
provisionedUsers: [CorpUser!]
Collaborator

let's name this users

Contributor Author

Done

}

type Actor {
provisionedUsers: [CorpUser!]
Collaborator

Question: why do we have a RoleProvisionedUser object in the GMS backend layer, but a list of resolved users here?

My expectation is to see:

users: [RoleUser!]!

...

type RoleUser {
  # The user attached to the role
  user: CorpUser!
}

here.

Contributor Author

Got it. Done

@sheeru
Contributor Author

sheeru commented Jun 29, 2023

Hi @jjoyce0510
By the way, we are proposing to introduce an AccessManagement tab on the dataset page.
Thanks for the review. Here is my take on the live lookup approach:

  1. From the user and usability perspective, let's assume that a user has READ access to a dataset. If there is just a REQUEST button, the user might not be aware that he already has READ access, and he would need to click every time to find out what access he has. Also, if he wants to upgrade to WRITE or ADMIN access, two steps are involved. First, he clicks the REQUEST button to find out which role he has and which other roles he can apply for. Second, he needs a different URL (which again has to be managed in the external system) to be redirected to the actual IAM page where he can apply. Ideally we would display which roles he has already been provisioned and which roles he can apply for to get other access.

    In our proposed approach, the user can see clearly which role he has and which elevated roles he can apply for. Clicking the request redirects directly to the IAM page for requesting the required role.

  2. As you said, there is a sync overhead. But even in the external system this mapping has to be maintained, and it can't be a straightforward mapping. All the datasets are managed by DBAs, every datasource (Oracle, MySQL, BigQuery, Hadoop, etc.) has its own DBAs, and they each have their own way of maintaining the roles for their datasource. Some datasources need not have roles associated at all, as is the case with Oracle for us. For those datasets, we shouldn't display the REQUEST button. Since the role information is not available in a single place even in the external system, we would need to build the role mapping feature in the external system anyway. The complexity of building it in the external system and of maintaining it in DataHub is the same. In DataHub, it would be a one-stop place where users can look it up along with other metadata. Even in other organisations, I assume there would still be the complexity of maintaining the roles in their own systems. It is better to have it in one place, in DataHub.

  3. Also, since roles are not available for all datasets, we can disable the AccessManagement tab for those datasets, similar to the Query and Validation tabs. We can extend this approach generically as well: if another organisation wants to maintain the lookup in its own system, we can keep just the REQUEST button in the AccessManagement tab, and that button can take users to the external system for any operation. In any case, requestUrl is still just a property; it is up to us how to show it in the UI.

@@ -16,6 +16,7 @@ public class EntityTypeMapper {
 static final Map<EntityType, String> ENTITY_TYPE_TO_NAME =
 ImmutableMap.<EntityType, String>builder()
 .put(EntityType.DATASET, "dataset")
+.put(EntityType.ROLE, "externalRoleMetadata")
Collaborator

This should be "role"

Contributor Author

Yup. Done

@sheeru
Contributor Author

sheeru commented Jun 30, 2023

@jjoyce0510
Also, every table in our organisation has more than one role available for accessing the underlying table: a basic role, a role that gives access to PII data, a role that gives access to PCI data, an admin role, and an operations role. The user has to request the one that matches his/her requirement. Since there is no 1-1 mapping between role and table, the feature was designed this way.

The flow was designed specifically around user needs, so that they can request the appropriate role from DataHub itself and save time. The search doesn't end at DataHub; it just begins there, and users go on to the IAM role or to notebooks depending on their access. Please note that this is one of the steps in data discovery that we are trying to solve with DataHub.

@jjoyce0510
Collaborator

Once this is green, we will merge. I will take over and fix some of the naming issues that now exist.

@jjoyce0510
Collaborator

(Approved for run - hoping to merge this tomorrow morning PST)

Collaborator

@jjoyce0510 jjoyce0510 left a comment

LGTM! Thanks for the hard work here.

Cheers
John

@jjoyce0510 jjoyce0510 merged commit e53d220 into datahub-project:master Jul 12, 2023
46 checks passed
@lix-mms
Contributor

lix-mms commented Sep 4, 2023

Although this change is great and solves important problems, we see some limitations:

  • For organisations like ours, where auditing is very significant, we need to be able to view the history of changes to individual role assignments. Managing the life-cycle of individual role assignments is not convenient with the model from this PR.
  • The assignment of more than one role (in our case, instead of roles we use platform principals like groups, service accounts, etc.) might be batched into the same access request. Modelling such batches would be quite difficult based on this PR.

Our organisation is working on a data model with more potential feature support, and fortunately it does not seem to conflict with the above change so far 🍀. If anyone wishes we are keen to align with more details 😊
