Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingestion) ldap: make ldap attrs keys configurable #4682

Merged

Conversation

atulsaurav
Copy link
Contributor

@atulsaurav atulsaurav commented Apr 17, 2022

  • Code changes to consume attrs_mapping from ldap recipe
  • Doc changes outlining how to use different configurations in recipe

closes #4599

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

…ject#4599)

* Code changes to consume attrs_mapping from ldap recipe

* Doc changes outlining how to use different configurations in recipe
@atulsaurav atulsaurav marked this pull request as draft April 17, 2022 15:33
@atulsaurav atulsaurav changed the title feat(ingestion) ldap: make ldap atttrs keys configurable (#4599) feat(ingestion) ldap: make ldap atttrs keys configurable Apr 17, 2022
@github-actions
Copy link

github-actions bot commented Apr 17, 2022

Unit Test Results (build & test)

385 tests  ±0   385 ✔️ +6   10m 7s ⏱️ -33s
  91 suites ±0       0 💤 ±0 
  91 files   ±0       0  - 6 

Results for commit ec943a4. ± Comparison against base commit 0949613.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Apr 17, 2022

Unit Test Results (metadata ingestion)

       8 files  ±0         8 suites  ±0   1h 17m 40s ⏱️ - 2m 24s
   565 tests ±0     562 ✔️ ±0    3 💤 ±0  0 ±0 
1 068 runs  ±0  1 025 ✔️ ±0  43 💤 ±0  0 ±0 

Results for commit ec943a4. ± Comparison against base commit 0949613.

♻️ This comment has been updated with latest results.

@atulsaurav atulsaurav marked this pull request as ready for review April 17, 2022 22:03
@jjoyce0510
Copy link
Collaborator

As discussed at office hours.. Let's move forward by duplicating the LDAP source and taking the caching approach you suggested:

  1. Fetch all users, extract the attribute that should be used in urn (from config)
  2. Create map from DN -> attribute for urn
  3. Fetch all groups, link to members, build GroupMembership aspect for each user
  4. Write back to DataHub
  • John

@@ -34,6 +34,24 @@ source:
# Options
base_dn: "dc=example,dc=org"

# Optional attribute mapping to allow ldap config differences across orgs
attrs_mapping:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulsaurav im wondering if this should really map "DataHub" concepts to specific attributes?

Like this:

urn_id:
email
firstName
lastName
displayName

Instead of mapping LDAP -> LDAP concepts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea! Let me make changes and send you an update later in the evening. Thanks!


The `drop_missing_first_last_name` should be set to true if you've got many "headless" user LDAP accounts
for devices or services should be excluded when they do not contain a first and last name. This will only
impact the ingestion of LDAP users, while LDAP groups will be unaffected by this config option.

### Configurable LDAP

Every organization may implement LDAP slightly differently based on their needs. The makes a standard LDAP recipe ineffecvtive due to missing data during LDAP ingestion. For instance, LDAP recipe assumes department information for a CorpUser would be present in the `departmentNumber` attribute. If an organization chose not to implement that attribute or rather capture similar imformation in the `department` attribute, that information can be missed during LDAP ingestion (even though the information may be present in LDAP in a slightly different form). LDAP source provides flexibility to provide optional mapping for such variations to be reperesented under attrs_mapping. So if an organization represented `departmentNumber` as `department` and `mail` as `email`, the recipe can be adapted to customiza that mapping based on need. An example is show below. If `attrs_mapping` section is not provided, the default mapping will apply.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor typo:

ineffecvtive -> ineffective

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo:

customiza -> customize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching these!

members = parse_from_attrs(attrs, "uniqueMember")
email = attrs["mail"][0].decode() if "mail" in attrs else full_name
owners = parse_from_attrs(attrs, self.config.attrs_mapping["owner"])
members = parse_from_attrs(attrs, self.config.attrs_mapping["uniqueMember"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember -> this members field is deprecated. instead, we should be populating the "GroupMembership" aspect of the user object.. Were you intending to do this as a followup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, as a part of #3335

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful! Thanks for the update

@jjoyce0510
Copy link
Collaborator

Looking great so far. Once we do the "datahub concept" mapping piece we are good to go!

@atulsaurav atulsaurav marked this pull request as draft June 6, 2022 21:40
@atulsaurav atulsaurav marked this pull request as ready for review June 9, 2022 03:27
# default mapping for attrs
# general attrs
attrs_mapping: Dict[str, Any] = {}
attrs_mapping["urn"] = "sAMAccountName"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the attribute used to construct the urn?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thanks for clarifying that in the docs!

Copy link
Collaborator

@jjoyce0510 jjoyce0510 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. As a followup we definitely should fix the LDAP group extraction to use the GroupMembership MCEs!

@atulsaurav atulsaurav changed the title feat(ingestion) ldap: make ldap atttrs keys configurable feat(ingestion) ldap: make ldap attrs keys configurable Jun 21, 2022
group_urn: cn
admins: owner
members: uniqueMember
displayName: name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this overwrite the display name attribute defined above?

This makes me think we do need 2 distinct mappings:

userMappings
groupMappings

I should have called this out earlier - my apologies for that. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, done!

attrs_mapping["group_urn"] = "cn"
attrs_mapping["admins"] = "owner"
attrs_mapping["members"] = "uniqueMember"
attrs_mapping["displayName"] = "name"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overwrites like 36

Copy link
Collaborator

@jjoyce0510 jjoyce0510 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for improving this connector @atulsaurav !

@jjoyce0510 jjoyce0510 merged commit 648ebfc into datahub-project:master Jun 21, 2022
shirshanka pushed a commit to shirshanka/datahub that referenced this pull request Jun 22, 2022
Copy link
Contributor

@bda618 bda618 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int(attrs[self.config.attrs_mapping["departmentId"]][0].decode())

The line above is actually failing for me with the following error:

line 320, in build_corp_user_mce
int(attrs[self.config.user_attrs_map["departmentId"]][0].decode())

ValueError: invalid literal for int() with base 10: 'USE_NEW_GL_FIELDS'

It appears that "departmentId" has been mapped to "departmentNumber" which we don't have at our corp ldap. Thus, it is null. I think you can reproduce this error by rolling back changes at the test case to null.

I have temporali applied a fix as below:

attrs[self.config.attrs_mapping["departmentId"]][0]).decode()

It fixed a problem partly, but some users were not ingested and end up with error like " Failed to extract some records due to: source produced an invalid metadata work unit"

@atulsaurav
Copy link
Contributor Author

atulsaurav commented Jun 22, 2022

Yes this is an oddity with the way the LDAP connector was written. DH has 2 fields for department - departmentId and departmentName. DH makes a strong assumption that departmentId field should be numeric. I have opened up a discussion on slack in metadata-modeling channel on this topic. If we relax the metadata model to allow string types in departmentId field, this would be the way to go. For now instead of code change, you could define the mapping for departmentId field in your recipe to read from LDAP attr (e.g.) departmentId. Since that will not be found in your ldap, the departmentId field in DH will be left as null as it used to. Please let me know if this helps/makes sense.

@bda618
Copy link
Contributor

bda618 commented Jun 23, 2022

I have just done what you suggested and that fixed a problem, here is remapping from my receipt:

source:
    type: "ldap"
    config:
       # Map from a default mapping as departmentNumber to departmentId
       user_attrs_map: {"departmentId":"departmentId"}

The original source of the problem with our departmentNumber is that it is not a number, but a string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make LDAP ingestion configurable for different attribute keys
3 participants