fix(ingest): bigquery-beta - turning sql parsing off in lineage extraction #6163
… significantly the lineage generation and it was not precise enough
Collecting report metrics for lineage per project and not globally
Fixing table pattern in lineage extraction
Unit Test Results (metadata ingestion): 8 files +1, 8 suites +1, 58m 33s ⏱️ +9m 54s. Results for commit d98216c. ± Comparison against base commit e9f6154. This pull request removes 1 test.
# The inheritance hierarchy is wonky here, but these options need modifications.
project_id: Optional[str] = Field(
    default=None,
    description="[deprecated] Use project_id_pattern instead.",
)
storage_project_id: None = Field(default=None, exclude=True)

lineage_use_sql_parser: bool = Field(
    default=False,
    description="Experimental. Use sql parser to resolve view/table lineage. If there is a view being referenced then bigquery sends both the view as well as underlying tablein the references. There is no distinction between direct/base objects accessed. So doing sql parsing to ensure we only use direct objects accessed for lineage.",
- description="Experimental. Use sql parser to resolve view/table lineage. If there is a view being referenced then bigquery sends both the view as well as underlying tablein the references. There is no distinction between direct/base objects accessed. So doing sql parsing to ensure we only use direct objects accessed for lineage.",
+ description="Experimental. Use sql parser to resolve view/table lineage. If there is a view being referenced then BigQuery records both the view and the underlying tables as lineage. Enabling SQL parsing allows us to only include the views/tables that were actually accessed by a query.",
self.report.num_total_audit_entries[event.project_id] = (
    self.report.num_total_audit_entries.get(event.project_id, 0) + 1
)
Ideally the TopKDict would be a subclass of defaultdict. That'd let this code be simplified:
- self.report.num_total_audit_entries[event.project_id] = (
-     self.report.num_total_audit_entries.get(event.project_id, 0) + 1
- )
+ self.report.num_total_audit_entries[event.project_id] += 1
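A minimal sketch of what the defaultdict suggestion could look like, assuming the per-project counters are plain integers. `CountingTopKDict` and its `trimmed` method are hypothetical names used for illustration only; they are not the class added in this PR.

```python
from collections import defaultdict
from typing import Dict


class CountingTopKDict(defaultdict):
    """Hypothetical counter dict: a missing key starts at 0."""

    def __init__(self, top_k: int = 10) -> None:
        # int() returns 0, so `d[key] += 1` works for unseen keys.
        super().__init__(int)
        self.top_k = top_k

    def trimmed(self) -> Dict[str, int]:
        """Return only the top_k entries with the largest counts."""
        ranked = sorted(self.items(), key=lambda kv: kv[1], reverse=True)
        return dict(ranked[: self.top_k])


# Illustrative project ids, not taken from the PR.
num_total_audit_entries = CountingTopKDict(top_k=2)
for project_id in ["proj-a", "proj-b", "proj-a", "proj-c", "proj-a", "proj-b"]:
    num_total_audit_entries[project_id] += 1
```

One trade-off of this approach: merely reading a missing key on a defaultdict inserts it, so lookups that should not create entries would still need `.get()`.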
dict_as_tuples = sorted_tuples[:10]
trimmed_dict = {k: v for k, v in dict_as_tuples}
trimmed_dict[f"... top(10) of total {len(big_dict)} entries"] = ""
print(f"Dropping entries {sorted_tuples[11:]}")
nit: use logger
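A rough sketch of the trimming step with a module-level logger in place of `print`. The function name `trim_to_top_k` is hypothetical, and sorting by value in descending order is an assumption, since the hunk does not show how `sorted_tuples` is built. Note that the hunk keeps `[:10]` but reports `[11:]`, which leaves the entry at index 10 out of both; the sketch slices from `top_k` for the dropped entries.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def trim_to_top_k(big_dict: Dict[str, int], top_k: int = 10) -> Dict[str, Any]:
    """Keep the top_k largest entries and log whatever is dropped."""
    # Assumption: entries are ranked by value, largest first.
    sorted_tuples = sorted(big_dict.items(), key=lambda kv: kv[1], reverse=True)
    if len(sorted_tuples) <= top_k:
        return dict(sorted_tuples)

    trimmed_dict: Dict[str, Any] = dict(sorted_tuples[:top_k])
    trimmed_dict[f"... top({top_k}) of total {len(big_dict)} entries"] = ""
    # The slice starts at top_k so no dropped entry goes unreported.
    logger.debug("Dropping entries %s", sorted_tuples[top_k:])
    return trimmed_dict
```

Passing the dropped entries as a logging argument instead of an f-string defers the string formatting until a handler actually emits the record.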

def __init__(self, top_k: int = 10) -> None:
    super().__init__()
    self.top_k = 10
- self.top_k = 10
+ self.top_k = top_k
@@ -0,0 +1,36 @@
from typing import Any, Dict, TypeVar, Union

T = TypeVar("T")
it doesn't look like this is used anywhere?
LGTM