feat(ingest/unity): Ingest notebooks and their lineage #8940
Conversation
```diff
@@ -258,7 +258,7 @@ def get_long_description():
     databricks = {
         # 0.1.11 appears to have authentication issues with azure databricks
-        "databricks-sdk>=0.1.1, != 0.1.11",
+        "databricks-sdk>=0.9.0",
```
There are some minor API changes that I don't want to stay backwards-compatible with. They're pretty consistently updating this library, and I'd like to keep up with it.
```python
def as_urn(self) -> str:
    return make_dataset_urn_with_platform_instance(
        platform=self.platform, platform_instance=self.instance, name=self.guid()
    )
```
This is kinda awkward but notebooks don't have a container structure (they have a folder path, but that can change)
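For context, a minimal sketch of the idea; the NotebookKey name and fields are illustrative, not the PR's exact class. The notebook is keyed on a stable id rather than its folder path and emitted as a dataset URN:

```python
from dataclasses import dataclass

from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance


@dataclass
class NotebookKey:  # illustrative name, not the PR's exact class
    notebook_id: int  # stable Databricks notebook id
    platform: str  # e.g. "databricks"
    instance: str  # platform instance from the source config

    def guid(self) -> str:
        # The folder path can change, so key the URN on the immutable id.
        return str(self.notebook_id)

    def as_urn(self) -> str:
        return make_dataset_urn_with_platform_instance(
            platform=self.platform,
            platform_instance=self.instance,
            name=self.guid(),
        )
```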
```diff
@@ -153,7 +170,7 @@ def query_history(
             "start_time_ms": start_time.timestamp() * 1000,
             "end_time_ms": end_time.timestamp() * 1000,
         },
-        "statuses": [QueryStatus.FINISHED.value],
+        "statuses": [QueryStatus.FINISHED],
```
This is an API change from upgrading to 0.9.0.
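A hedged sketch of the difference; QueryStatus comes from databricks.sdk.service.sql, and the variable names are illustrative:

```python
from databricks.sdk.service.sql import QueryStatus

# databricks-sdk>=0.9.0 accepts the enum member and serializes it itself:
statuses = [QueryStatus.FINISHED]
# older versions of this code passed the raw string value instead:
legacy_statuses = [QueryStatus.FINISHED.value]  # "FINISHED"
```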
"""List table lineage by table name.""" | ||
return self._workspace_client.api_client.do( | ||
method="GET", | ||
path="/api/2.0/lineage-tracking/table-lineage/get", | ||
body={"table_name": table_name}, | ||
path="/api/2.0/lineage-tracking/table-lineage", |
For some reason, the /get endpoint returns the key "upstream_tables" while the base endpoint returns the key "upstreams". To stay consistent with the docs, I switched to the base endpoint.
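A hedged sketch of why the swap matters when reading the response; only the key names come from the discussion above, and the helper itself is illustrative:

```python
from typing import Any, Dict, List


def get_upstreams(proxy: Any, table_name: str) -> List[Dict]:
    # /api/2.0/lineage-tracking/table-lineage returns "upstreams" (as documented),
    # while the older /table-lineage/get variant returned "upstream_tables".
    response: dict = proxy.list_lineages_by_table(table_name=table_name)
    return response.get("upstreams", [])
```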
```python
table_lineage = self.table_lineage(
    table, include_notebooks=include_notebooks
)
```
Actually set upstreams in place so we can pick up notebooks, which are not present in column lineage.
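A minimal sketch of that in-place approach, assuming illustrative field names (upstreams, upstream_notebooks); only table_lineage() and include_notebooks come from the PR:

```python
from typing import Any


def apply_table_lineage(proxy: Any, table: Any, include_notebooks: bool) -> None:
    lineage = proxy.table_lineage(table, include_notebooks=include_notebooks)
    # Mutate the table object directly so later steps still see these edges;
    # column-level lineage never mentions notebooks, so they must be set here.
    table.upstreams.update(lineage.upstream_tables)
    if include_notebooks:
        table.upstream_notebooks.update(lineage.upstream_notebooks)
```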
```python
num_queries_missing_table: int = 0  # Can be due to pattern filter
num_queries_duplicate_table: int = 0
```
We don't actually drop these queries
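A hedged sketch of how counters like these are typically used: the condition is recorded in the source report, but the query is still processed. The names below are illustrative:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryReport:  # illustrative report container
    num_queries_missing_table: int = 0  # can be due to pattern filters
    num_queries_duplicate_table: int = 0


def record_query_tables(report: QueryReport, tables: List[str]) -> None:
    if not tables:
        report.num_queries_missing_table += 1  # counted, not dropped
    elif len(tables) != len(set(tables)):
        report.num_queries_duplicate_table += 1
    # ...the query continues through lineage/usage processing either way
```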
```diff
     # Lineage endpoint doesn't exist on API version 2.1
     try:
         response: dict = self.list_lineages_by_table(
-            table_name=f"{table.schema.catalog.name}.{table.schema.name}.{table.name}"
+            table_name=table.ref.qualified_table_name,
+            include_entity_lineage=include_notebooks,
```
Is it correct to call this include_notebooks rather than something like include_entity_lineage?
Can this only be a Notebook?
You're right, this makes more sense as "include entity lineage"
```python
yield from self.process_metastores()

if self.config.include_notebooks:
```
Can't we move line 198 here? Then we wouldn't need to separate the notebook processing logic.
No, I need to set the notebook's upstreams before I can generate the workunits for it. I suppose I can generate the lineage MCP afterwards... I'll do that.
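A hedged sketch of what "generate the lineage MCP afterwards" could look like, assuming the notebook is emitted as a dataset URN as above; the helper name and arguments are illustrative, not the PR's exact code:

```python
from typing import List

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)


def notebook_lineage_workunit(notebook_urn: str, upstream_urns: List[str]):
    aspect = UpstreamLineageClass(
        upstreams=[
            UpstreamClass(dataset=urn, type=DatasetLineageTypeClass.TRANSFORMED)
            for urn in upstream_urns
        ]
    )
    # Emitted after the notebook's own workunits, once its upstreams are known.
    return MetadataChangeProposalWrapper(
        entityUrn=notebook_urn, aspect=aspect
    ).as_workunit()
```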
lgtm
Merging through test failures, as I think they failed due to slowness / the node image not being available?
We don't get notebook upstreams directly; instead, we get tables' upstreams and downstreams, which can include notebooks. I'm open to suggestions on how to solve this in a cleaner way.
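A minimal sketch of that inversion, with illustrative names: each table's downstream-notebook edges are flipped into per-notebook upstream sets.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def notebook_upstreams(
    table_to_notebook_edges: Iterable[Tuple[str, str]],  # (table_name, notebook_id)
) -> Dict[str, Set[str]]:
    upstreams: Dict[str, Set[str]] = defaultdict(set)
    for table_name, notebook_id in table_to_notebook_edges:
        # A table that lists a notebook as a downstream is an upstream of that notebook.
        upstreams[notebook_id].add(table_name)
    return upstreams
```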
Checklist