
feat(ingest/unity): Ingest notebooks and their lineage #8940

Merged

Conversation

asikowitz (Collaborator)

We don't get notebook upstreams directly; instead, we get tables' upstreams and downstreams, which can include notebooks. I'm open to suggestions on how to solve this more cleanly. A sketch of the inversion this implies is below.
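To make the constraint concrete, here is a minimal sketch of that inversion; the Table shape and attribute names (urn, downstream_notebooks) are hypothetical stand-ins, not the connector's actual model:

    from collections import defaultdict
    from dataclasses import dataclass, field
    from typing import Dict, List, Set


    @dataclass
    class Table:  # hypothetical stand-in for the connector's table model
        urn: str
        downstream_notebooks: List[str] = field(default_factory=list)


    def notebook_upstreams(tables: List[Table]) -> Dict[str, Set[str]]:
        """Invert table-centric lineage into notebook -> upstream tables.

        The lineage API never lists a notebook's upstreams directly; notebooks
        only appear among each table's upstreams/downstreams, so we walk the
        tables and accumulate edges per notebook.
        """
        upstreams: Dict[str, Set[str]] = defaultdict(set)
        for table in tables:
            for notebook_id in table.downstream_notebooks:
                # A notebook downstream of a table reads from it, so the
                # table is an upstream of that notebook.
                upstreams[notebook_id].add(table.urn)
        return upstreams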

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a usage guide has been added for it.
  • For any breaking change, potential downtime, deprecation, or other big change, an entry has been made in Updating DataHub.

@@ -258,7 +258,7 @@ def get_long_description():

 databricks = {
     # 0.1.11 appears to have authentication issues with azure databricks
-    "databricks-sdk>=0.1.1, != 0.1.11",
+    "databricks-sdk>=0.9.0",
asikowitz (Collaborator, Author)

There are some minor API changes that I don't want to maintain backwards compatibility with. They're updating this library pretty consistently, and I'd like to keep up with it.

Comment on lines +134 to +137
    def as_urn(self) -> str:
        return make_dataset_urn_with_platform_instance(
            platform=self.platform, platform_instance=self.instance, name=self.guid()
        )
asikowitz (Collaborator, Author)

This is a bit awkward, but notebooks don't have a container structure (they have a folder path, but that can change).
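For context, this keys the notebook's dataset URN on a stable ID rather than the mutable path. A rough illustration of what the call produces; the platform name, the missing platform instance, and the guid format here are assumptions:

    from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance

    # Assumed values for illustration: platform "databricks", no platform
    # instance, and a guid of "notebook.1234" (the real format may differ).
    urn = make_dataset_urn_with_platform_instance(
        platform="databricks", platform_instance=None, name="notebook.1234"
    )
    print(urn)  # urn:li:dataset:(urn:li:dataPlatform:databricks,notebook.1234,PROD)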

@@ -153,7 +170,7 @@ def query_history(
         "start_time_ms": start_time.timestamp() * 1000,
         "end_time_ms": end_time.timestamp() * 1000,
     },
-    "statuses": [QueryStatus.FINISHED.value],
+    "statuses": [QueryStatus.FINISHED],
asikowitz (Collaborator, Author)

This is the API change from upgrading to 0.9.0.

"""List table lineage by table name."""
return self._workspace_client.api_client.do(
method="GET",
path="/api/2.0/lineage-tracking/table-lineage/get",
body={"table_name": table_name},
path="/api/2.0/lineage-tracking/table-lineage",
asikowitz (Collaborator, Author)

For some reason, the /get endpoint returns the key "upstream_tables" while the base endpoint returns the key "upstreams". To stay consistent with the docs, I've switched to the base endpoint.
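To illustrate the discrepancy (key names as described above; the response shapes are otherwise abbreviated and assumed):

    # .../table-lineage/get -> {"upstream_tables": [...], ...}
    # .../table-lineage     -> {"upstreams": [...], ...}

    def upstreams_of(response: dict) -> list:
        # With the base endpoint, the documented "upstreams" key applies.
        return response.get("upstreams", [])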

Comment on lines 266 to 268
    table_lineage = self.table_lineage(
        table, include_notebooks=include_notebooks
    )
asikowitz (Collaborator, Author)

Actually set upstreams in place so we can pick up notebooks, which are not present in column lineage.
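Roughly, "in place" here means mutating the shared table object rather than returning a fresh lineage mapping; the attribute names below (upstreams, upstream_notebooks) are hypothetical:

    def _populate_lineage(self, table, include_notebooks: bool) -> None:
        table_lineage = self.table_lineage(table, include_notebooks=include_notebooks)
        # Mutate the table directly so notebook edges are retained, since the
        # column-lineage pass never sees notebooks and would otherwise lose them.
        table.upstreams.update(table_lineage.upstreams)
        if include_notebooks:
            table.upstream_notebooks.update(table_lineage.upstream_notebooks)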

Comment on lines +23 to +24
    num_queries_missing_table: int = 0  # Can be due to pattern filter
    num_queries_duplicate_table: int = 0
asikowitz (Collaborator, Author)

We don't actually drop these queries; these fields just count them.
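A hypothetical illustration of how such a counter gets bumped while the query is still processed; the report class shape and helper are assumptions:

    from dataclasses import dataclass

    @dataclass
    class Report:  # abbreviated stand-in for the connector's report
        num_queries_missing_table: int = 0

    def record_missing_table(report: Report, table) -> None:
        if table is None:
            # e.g. the table was excluded by a pattern filter; we count the
            # occurrence but do not drop the query itself.
            report.num_queries_missing_table += 1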

     # Lineage endpoint doesn't exist on API version 2.1
     try:
         response: dict = self.list_lineages_by_table(
-            table_name=f"{table.schema.catalog.name}.{table.schema.name}.{table.name}"
+            table_name=table.ref.qualified_table_name,
+            include_entity_lineage=include_notebooks,
Contributor

Is it correct to call this include_notebooks rather than something like include_entity_lineage?
Can this only be a notebook?

asikowitz (Collaborator, Author)

You're right, this makes more sense as "include_entity_lineage".

yield from self.process_metastores()

if self.config.include_notebooks:
Contributor

Can't we move line 198 up here? Then we wouldn't need to separate the notebook processing logic.

asikowitz (Collaborator, Author)

No, I need to set the notebook's upstreams before I can generate the workunits for it. I suppose I could generate the lineage MCP afterwards... I'll do that.
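A sketch of the deferred ordering described here; the method and attribute names (notebooks, _gen_notebook_lineage) are hypothetical:

    def get_workunits(self):
        # Processing metastores (and their tables) fills in each
        # notebook's upstreams as a side effect of table lineage.
        yield from self.process_metastores()

        if self.config.include_notebooks:
            for notebook in self.notebooks.values():
                # Upstreams are known by now, so the lineage MCP for each
                # notebook can be generated after the fact.
                yield from self._gen_notebook_lineage(notebook)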

The github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Oct 4, 2023.
treff7es (Contributor) left a comment

lgtm

asikowitz (Collaborator, Author)

Merging despite test failures, as I think they failed due to slowness / the node image not being available.

asikowitz merged commit d3346a0 into datahub-project:master on Oct 4, 2023 (53 of 58 checks passed).
asikowitz deleted the databricks-notebook-ingestion branch on October 4, 2023 at 14:22.
maggiehays added the hacktoberfest-accepted label (acceptance for Hacktoberfest, https://hacktoberfest.com/participation/) on Oct 31, 2023.