fix(ingest): bigquery-beta - turning sql parsing off in lineage extraction #6163

treff7es · 2022-10-10T14:48:39Z

Turning SQL parsing off by default in bigquery-beta, as it significantly slowed down lineage generation and was not precise enough.
Collecting report metrics for lineage per project rather than globally.
Fixing the table pattern in lineage extraction.

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub


github-actions · 2022-10-10T15:34:05Z

Unit Test Results (build & test)

597 tests ±0: 593 ✔️ ±0, 4 💤 ±0, 0 ❌ ±0
147 suites ±0, 147 files ±0
Duration: 12m 3s ⏱️ +24s

Results for commit d98216c. ± Comparison against base commit e9f6154.

♻️ This comment has been updated with latest results.

github-actions · 2022-10-10T15:34:41Z

Unit Test Results (metadata ingestion)

8 files +1, 8 suites +1, duration 58m 33s ⏱️ +9m 54s
731 tests -1: 729 ✔️ +10, 2 💤 ±0, 0 ❌ -8
1 464 runs +678: 1 460 ✔️ +689, 4 💤 ±0, 0 ❌ -8

Results for commit d98216c. ± Comparison against base commit e9f6154.

This pull request removes 1 test.

tests.integration.tableau.test_tableau_ingest ‑ test_tableau_usage_stat

♻️ This comment has been updated with latest results.

hsheth2 · 2022-10-11T00:33:28Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py

    # The inheritance hierarchy is wonky here, but these options need modifications.
    project_id: Optional[str] = Field(
        default=None,
        description="[deprecated] Use project_id_pattern instead.",
    )
    storage_project_id: None = Field(default=None, exclude=True)

+    lineage_use_sql_parser: bool = Field(
+        default=False,
+        description="Experimental. Use sql parser to resolve view/table lineage. If there is a view being referenced then bigquery sends both the view as well as underlying tablein the references. There is no distinction between direct/base objects accessed. So doing sql parsing to ensure we only use direct objects accessed for lineage.",


Suggested change

-description="Experimental. Use sql parser to resolve view/table lineage. If there is a view being referenced then bigquery sends both the view as well as underlying tablein the references. There is no distinction between direct/base objects accessed. So doing sql parsing to ensure we only use direct objects accessed for lineage.",
+description="Experimental. Use sql parser to resolve view/table lineage. If there is a view being referenced then BigQuery records both the view and the underlying tables as lineage. Enabling SQL parsing allows us to only include the views/tables that were actually accessed by a query.",
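
A minimal sketch of the filtering that the description refers to, assuming the audit log supplies the set of referenced objects and a SQL parser supplies the set of names that appear in the query text. The function name, its arguments, and the fallback when nothing overlaps are hypothetical and are not taken from the PR.

```python
from typing import Set


def direct_upstreams(referenced: Set[str], parsed_from_sql: Set[str]) -> Set[str]:
    """Keep only the referenced objects that the query text names directly."""
    direct = referenced & parsed_from_sql
    # Assumption: if parsing yields no overlap, fall back to everything
    # that BigQuery reported rather than emitting no lineage at all.
    return direct or set(referenced)


# BigQuery reports both the view and the table underneath it.
referenced = {"proj.ds.orders_view", "proj.ds.orders"}
# The query itself only selects from the view.
parsed = {"proj.ds.orders_view"}

print(direct_upstreams(referenced, parsed))  # {'proj.ds.orders_view'}
```

With the option left at its new default of False, this step would be skipped and all referenced objects would be kept.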

hsheth2 · 2022-10-11T00:36:46Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py

+                self.report.num_total_audit_entries[event.project_id] = (
+                    self.report.num_total_audit_entries.get(event.project_id, 0) + 1
+                )


Ideally the TopKDict would be a subclass of defaultdict. That'd let this code be simplified.

Suggested change

-self.report.num_total_audit_entries[event.project_id] = (
-    self.report.num_total_audit_entries.get(event.project_id, 0) + 1
-)
+self.report.num_total_audit_entries[event.project_id] += 1
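
A minimal sketch of why the defaultdict suggestion simplifies the counter update, assuming a hypothetical `TopKDict` whose default factory is `int`. The class in the PR may be structured differently; only the name and the `top_k` argument come from the diff.

```python
from collections import defaultdict


class TopKDict(defaultdict):
    """Hypothetical counter dict: missing keys start at 0."""

    def __init__(self, top_k: int = 10) -> None:
        # int() returns 0, so unseen keys can be incremented directly.
        super().__init__(int)
        self.top_k = top_k


num_total_audit_entries = TopKDict()
for project_id in ["project-a", "project-b", "project-a"]:
    # No .get(project_id, 0) needed: the default factory supplies the 0.
    num_total_audit_entries[project_id] += 1

print(dict(num_total_audit_entries))  # {'project-a': 2, 'project-b': 1}
```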

hsheth2 · 2022-10-11T00:37:04Z

metadata-ingestion/src/datahub/utilities/stats_collections.py

+            dict_as_tuples = sorted_tuples[:10]
+            trimmed_dict = {k: v for k, v in dict_as_tuples}
+            trimmed_dict[f"... top(10) of total {len(big_dict)} entries"] = ""
+            print(f"Dropping entries {sorted_tuples[11:]}")


nit: use logger
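
A minimal sketch of the trimming step with a module-level logger in place of `print`. The function name, the descending sort by value, and the `debug` log level are assumptions, since the diff excerpt does not show them. The sketch also slices the dropped entries from `top_k`, because a slice starting at 11 after keeping the first 10 would leave the entry at index 10 unreported.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def trim_to_top_k(big_dict: Dict[str, int], top_k: int = 10) -> Dict[str, Any]:
    """Keep the top_k largest entries and log the rest instead of printing."""
    if len(big_dict) <= top_k:
        return dict(big_dict)
    # Assumption: entries are ranked by value, largest first.
    sorted_tuples = sorted(big_dict.items(), key=lambda kv: kv[1], reverse=True)
    trimmed_dict: Dict[str, Any] = dict(sorted_tuples[:top_k])
    trimmed_dict[f"... top({top_k}) of total {len(big_dict)} entries"] = ""
    # Slice from top_k so that every entry not kept is reported.
    logger.debug("Dropping entries %s", sorted_tuples[top_k:])
    return trimmed_dict
```

Passing the dropped entries as a logging argument rather than formatting them with an f-string means the message is only rendered when the debug level is enabled.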

hsheth2 · 2022-10-11T00:37:44Z

metadata-ingestion/src/datahub/utilities/stats_collections.py

+
+    def __init__(self, top_k: int = 10) -> None:
+        super().__init__()
+        self.top_k = 10


Suggested change

self.top_k = 10

self.top_k = top_k

hsheth2 · 2022-10-11T00:37:56Z

metadata-ingestion/src/datahub/utilities/stats_collections.py

@@ -0,0 +1,36 @@
+from typing import Any, Dict, TypeVar, Union
+
+T = TypeVar("T")


It doesn't look like T is used anywhere?

shirshanka

LGTM

treff7es added 2 commits October 10, 2022 16:45

Turning sql parsing in bigquery beta off by default as it slowed down…

91a1e80

… significantly the lineage generation and it was not precise enough Collecting report metrics for lineage per project and not globally Fixing table pattern in lineage extraction

Merge branch 'master' into slow_ingestion_fix

841cadb

github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) Oct 10, 2022

treff7es and others added 5 commits October 10, 2022 18:13

Additional small fixes

02ead22

better reporting

927f3f0

Optimizing column query for tables with lots of tables

1a13a5e

Making mypy happy

371de0d

fix lint issues

d98216c

hsheth2 reviewed Oct 11, 2022


shirshanka approved these changes Oct 11, 2022


shirshanka merged commit 128e3a8 into datahub-project:master Oct 11, 2022

treff7es deleted the slow_ingestion_fix branch February 8, 2023 11:57
