fix(ingest): bigquery-beta - turning sql parsing off in lineage extraction #6163

Merged
merged 7 commits into from Oct 11, 2022


treff7es
Contributor

  • Turn SQL parsing off by default in bigquery-beta: it significantly slowed down lineage generation and was not precise enough.
  • Collect lineage report metrics per project rather than globally.
  • Fix the table pattern in lineage extraction.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Oct 10, 2022

github-actions bot commented Oct 10, 2022

Unit Test Results (build & test)

597 tests  ±0   593 ✔️ ±0   12m 3s ⏱️ +24s
147 suites ±0       4 💤 ±0
147 files  ±0       0 ❌ ±0

Results for commit d98216c. ± Comparison against base commit e9f6154.

♻️ This comment has been updated with latest results.


github-actions bot commented Oct 10, 2022

Unit Test Results (metadata ingestion)

    8 files  +  1       8 suites +  1   58m 33s ⏱️ +9m 54s
  731 tests  -  1     729 ✔️ + 10   2 💤 ±0   0 ❌ -8
1 464 runs   +678   1 460 ✔️ +689   4 💤 ±0   0 ❌ -8

Results for commit d98216c. ± Comparison against base commit e9f6154.

This pull request removes 1 test.
tests.integration.tableau.test_tableau_ingest ‑ test_tableau_usage_stat

♻️ This comment has been updated with latest results.

# The inheritance hierarchy is wonky here, but these options need modifications.
project_id: Optional[str] = Field(
    default=None,
    description="[deprecated] Use project_id_pattern instead.",
)
storage_project_id: None = Field(default=None, exclude=True)

lineage_use_sql_parser: bool = Field(
    default=False,
    description="Experimental. Use sql parser to resolve view/table lineage. If there is a view being referenced then bigquery sends both the view as well as underlying tablein the references. There is no distinction between direct/base objects accessed. So doing sql parsing to ensure we only use direct objects accessed for lineage.",
Collaborator

Suggested change
- description="Experimental. Use sql parser to resolve view/table lineage. If there is a view being referenced then bigquery sends both the view as well as underlying tablein the references. There is no distinction between direct/base objects accessed. So doing sql parsing to ensure we only use direct objects accessed for lineage.",
+ description="Experimental. Use sql parser to resolve view/table lineage. If there is a view being referenced then BigQuery records both the view and the underlying tables as lineage. Enabling SQL parsing allows us to only include the views/tables that were actually accessed by a query.",
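
To make the distinction in this description concrete, here is a rough sketch of the idea, not the parser the source actually uses. It assumes the audit log hands back a set of referenced objects (the view plus its underlying tables) and keeps only the ones named directly in the query text. The helper `direct_references`, the sample table names, and the naive regex are all hypothetical stand-ins for a real SQL parser.

```python
import re
from typing import Set


def direct_references(query: str, referenced: Set[str]) -> Set[str]:
    """Keep only the referenced objects that are named directly in the query.

    Hypothetical helper: a naive regex stands in for a real SQL parser here.
    """
    # Collect project.dataset.table identifiers that follow FROM or JOIN,
    # with or without surrounding backticks.
    found = set(
        re.findall(
            r"(?:from|join)\s+`?([\w\-]+\.\w+\.\w+)`?",
            query,
            flags=re.IGNORECASE,
        )
    )
    return referenced & found


# Assumed sample data: the query reads a view, and the audit log reports
# both the view and the table underneath it.
query = "SELECT * FROM `proj.dataset.orders_view` WHERE amount > 0"
referenced = {"proj.dataset.orders_view", "proj.dataset.orders"}
direct = direct_references(query, referenced)
```

With parsing off (the new default), all of `referenced` would be treated as upstream; with it on, only the objects in `direct` would be.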

Comment on lines +363 to +365
self.report.num_total_audit_entries[event.project_id] = (
    self.report.num_total_audit_entries.get(event.project_id, 0) + 1
)
Collaborator

Ideally, TopKDict would be a subclass of defaultdict. That would let this code be simplified:

Suggested change
- self.report.num_total_audit_entries[event.project_id] = (
-     self.report.num_total_audit_entries.get(event.project_id, 0) + 1
- )
+ self.report.num_total_audit_entries[event.project_id] += 1
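
A minimal sketch of what that could look like, assuming the counters are plain integers keyed by project ID. The class body below is illustrative only and is not the code in this PR; the project IDs are made up.

```python
from collections import defaultdict


class TopKDict(defaultdict):
    """Illustrative sketch: a dict whose missing keys start at 0."""

    def __init__(self, top_k: int = 10) -> None:
        # int() returns 0, so unseen keys can be incremented directly.
        super().__init__(int)
        self.top_k = top_k


# Per-project counting without the .get(key, 0) + 1 dance.
num_total_audit_entries = TopKDict()
for project_id in ["project-a", "project-b", "project-a"]:
    num_total_audit_entries[project_id] += 1
```

One trade-off to keep in mind: with a defaultdict, merely reading a missing key inserts it, so report code that only inspects counters should use `.get()` or `in`.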

dict_as_tuples = sorted_tuples[:10]
trimmed_dict = {k: v for k, v in dict_as_tuples}
trimmed_dict[f"... top(10) of total {len(big_dict)} entries"] = ""
print(f"Dropping entries {sorted_tuples[11:]}")
Collaborator

nit: use logger
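
For illustration, a hedged sketch of the trimming step with a module-level logger instead of `print`. The function name `trim_to_top_k` and the sort order (by value, descending) are assumptions, since the sorting code is not part of the quoted snippet. The sketch logs `sorted_tuples[top_k:]`, so the logged remainder starts right after the last kept entry.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def trim_to_top_k(big_dict: Dict[str, int], top_k: int = 10) -> Dict[str, Any]:
    """Return the top_k largest entries plus a marker key when trimming happens."""
    # Assumed ordering: largest values first.
    sorted_tuples = sorted(big_dict.items(), key=lambda kv: kv[1], reverse=True)
    if len(sorted_tuples) <= top_k:
        return dict(sorted_tuples)

    trimmed: Dict[str, Any] = dict(sorted_tuples[:top_k])
    trimmed[f"... top({top_k}) of total {len(big_dict)} entries"] = ""
    # Lazy %-formatting: the message is only built if DEBUG is enabled.
    logger.debug("Dropping entries %s", sorted_tuples[top_k:])
    return trimmed


sample = {f"k{i}": i for i in range(15)}
trimmed_sample = trim_to_top_k(sample)
```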


def __init__(self, top_k: int = 10) -> None:
    super().__init__()
    self.top_k = 10
Collaborator

Suggested change
- self.top_k = 10
+ self.top_k = top_k

@@ -0,0 +1,36 @@
from typing import Any, Dict, TypeVar, Union

T = TypeVar("T")
Collaborator

It doesn't look like this is used anywhere?

Contributor

shirshanka left a comment

LGTM

shirshanka merged commit 128e3a8 into datahub-project:master on Oct 11, 2022
treff7es deleted the slow_ingestion_fix branch on February 8, 2023 11:57

3 participants