
feat(ingest/unity): Ingest notebooks and their lineage #8940

Merged

Conversation

asikowitz (Collaborator)

We don't get notebook upstreams directly; instead, we get tables' upstreams and downstreams, which can include notebooks. I'm open to suggestions on how to solve this more cleanly. A sketch of the inversion this implies is below.
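To make the constraint concrete, here is a minimal sketch of that inversion; the Table shape and attribute names (urn, downstream_notebooks) are hypothetical stand-ins, not the connector's actual model:

    from collections import defaultdict
    from dataclasses import dataclass, field
    from typing import Dict, List, Set


    @dataclass
    class Table:  # hypothetical stand-in for the connector's table model
        urn: str
        downstream_notebooks: List[str] = field(default_factory=list)


    def notebook_upstreams(tables: List[Table]) -> Dict[str, Set[str]]:
        """Invert table-centric lineage into notebook -> upstream tables.

        The lineage API never lists a notebook's upstreams directly; notebooks
        only appear among each table's upstreams/downstreams, so we walk the
        tables and accumulate edges per notebook.
        """
        upstreams: Dict[str, Set[str]] = defaultdict(set)
        for table in tables:
            for notebook_id in table.downstream_notebooks:
                # A notebook downstream of a table reads from it, so the
                # table is an upstream of that notebook.
                upstreams[notebook_id].add(table.urn)
        return upstreams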

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a usage guide has been added for it.
  • For any breaking change, potential downtime, deprecation, or other big change, an entry has been made in Updating DataHub.

@@ -258,7 +258,7 @@ def get_long_description():

 databricks = {
     # 0.1.11 appears to have authentication issues with azure databricks
-    "databricks-sdk>=0.1.1, != 0.1.11",
+    "databricks-sdk>=0.9.0",
asikowitz (Collaborator, Author)

There are some minor API changes that I don't want to maintain backwards compatibility with. They're updating this library pretty consistently, and I'd like to keep up with it.

Comment on lines +134 to +137
    def as_urn(self) -> str:
        return make_dataset_urn_with_platform_instance(
            platform=self.platform, platform_instance=self.instance, name=self.guid()
        )
asikowitz (Collaborator, Author)

This is a bit awkward, but notebooks don't have a container structure (they have a folder path, but that can change).
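For context, this keys the notebook's dataset URN on a stable ID rather than the mutable path. A rough illustration of what the call produces; the platform name, the missing platform instance, and the guid format here are assumptions:

    from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance

    # Assumed values for illustration: platform "databricks", no platform
    # instance, and a guid of "notebook.1234" (the real format may differ).
    urn = make_dataset_urn_with_platform_instance(
        platform="databricks", platform_instance=None, name="notebook.1234"
    )
    print(urn)  # urn:li:dataset:(urn:li:dataPlatform:databricks,notebook.1234,PROD)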

@@ -153,7 +170,7 @@ def query_history(
         "start_time_ms": start_time.timestamp() * 1000,
         "end_time_ms": end_time.timestamp() * 1000,
     },
-    "statuses": [QueryStatus.FINISHED.value],
+    "statuses": [QueryStatus.FINISHED],
asikowitz (Collaborator, Author)

This is the API change from upgrading to 0.9.0.

"""List table lineage by table name."""
return self._workspace_client.api_client.do(
method="GET",
path="/api/2.0/lineage-tracking/table-lineage/get",
body={"table_name": table_name},
path="/api/2.0/lineage-tracking/table-lineage",
asikowitz (Collaborator, Author)

For some reason, the /get endpoint returns the key "upstream_tables" while the base endpoint returns the key "upstreams". To stay consistent with the docs, I've switched to the base endpoint.
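To illustrate the discrepancy (key names as described above; the response shapes are otherwise abbreviated and assumed):

    # .../table-lineage/get -> {"upstream_tables": [...], ...}
    # .../table-lineage     -> {"upstreams": [...], ...}

    def upstreams_of(response: dict) -> list:
        # With the base endpoint, the documented "upstreams" key applies.
        return response.get("upstreams", [])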

Comment on lines 266 to 268
    table_lineage = self.table_lineage(
        table, include_notebooks=include_notebooks
    )
asikowitz (Collaborator, Author)

Actually set upstreams in place so we can pick up notebooks, which are not present in column lineage.
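Roughly, "in place" here means mutating the shared table object rather than returning a fresh lineage mapping; the attribute names below (upstreams, upstream_notebooks) are hypothetical:

    def _populate_lineage(self, table, include_notebooks: bool) -> None:
        table_lineage = self.table_lineage(table, include_notebooks=include_notebooks)
        # Mutate the table directly so notebook edges are retained, since the
        # column-lineage pass never sees notebooks and would otherwise lose them.
        table.upstreams.update(table_lineage.upstreams)
        if include_notebooks:
            table.upstream_notebooks.update(table_lineage.upstream_notebooks)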

Comment on lines +23 to +24
    num_queries_missing_table: int = 0  # Can be due to pattern filter
    num_queries_duplicate_table: int = 0
asikowitz (Collaborator, Author)

We don't actually drop these queries; these fields just count them.
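A hypothetical illustration of how such a counter gets bumped while the query is still processed; the report class shape and helper are assumptions:

    from dataclasses import dataclass

    @dataclass
    class Report:  # abbreviated stand-in for the connector's report
        num_queries_missing_table: int = 0

    def record_missing_table(report: Report, table) -> None:
        if table is None:
            # e.g. the table was excluded by a pattern filter; we count the
            # occurrence but do not drop the query itself.
            report.num_queries_missing_table += 1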

     # Lineage endpoint doesn't exist on API version 2.1
     try:
         response: dict = self.list_lineages_by_table(
-            table_name=f"{table.schema.catalog.name}.{table.schema.name}.{table.name}"
+            table_name=table.ref.qualified_table_name,
+            include_entity_lineage=include_notebooks,
Contributor

Is it correct to call this include_notebooks rather than something like include_entity_lineage?
Can this only be a notebook?

asikowitz (Collaborator, Author)

You're right, this makes more sense as "include_entity_lineage".

yield from self.process_metastores()

if self.config.include_notebooks:
Contributor

Can't we move line 198 up here? Then we wouldn't need to separate the notebook processing logic.

asikowitz (Collaborator, Author)

No, I need to set the notebook's upstreams before I can generate the workunits for it. I suppose I could generate the lineage MCP afterwards... I'll do that.
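A sketch of the deferred ordering described here; the method and attribute names (notebooks, _gen_notebook_lineage) are hypothetical:

    def get_workunits(self):
        # Processing metastores (and their tables) fills in each
        # notebook's upstreams as a side effect of table lineage.
        yield from self.process_metastores()

        if self.config.include_notebooks:
            for notebook in self.notebooks.values():
                # Upstreams are known by now, so the lineage MCP for each
                # notebook can be generated after the fact.
                yield from self._gen_notebook_lineage(notebook)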

The github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Oct 4, 2023.
treff7es (Contributor) left a comment

lgtm

asikowitz (Collaborator, Author)

Merging despite test failures, as I think they failed due to slowness / the node image not being available.

asikowitz merged commit d3346a0 into datahub-project:master on Oct 4, 2023 (53 of 58 checks passed).
asikowitz deleted the databricks-notebook-ingestion branch on October 4, 2023 at 14:22.
maggiehays added the hacktoberfest-accepted label (acceptance for Hacktoberfest, https://hacktoberfest.com/participation/) on Oct 31, 2023.