fix(ingest/bigquery): use correct row count in null count profiling c… #9123

Merged
@@ -629,7 +629,16 @@ def generate_dataset_profile( # noqa: C901 (complexity)
self.query_combiner.flush()

assert profile.rowCount is not None
- row_count: int = profile.rowCount
+ row_count: int # used for null counts calculation
+ if profile.partitionSpec and "SAMPLE" in profile.partitionSpec.partition:
+ # We can alternatively use `self._get_dataset_rows(profile)` to get
+ # exact count of rows in sample, as actual rows involved in sample
+ # may be slightly different (more or less) than configured `sample_size`.
+ # However not doing so to start with, as that adds another query overhead
+ # plus approximate metrics should work for sampling based profiling.
+ row_count = self.config.sample_size
+ else:
+ row_count = profile.rowCount

for column_spec in columns_profiling_queue:
column = column_spec.column
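
A minimal sketch of why the row count matters here, assuming the null count is derived as the row count minus the non-null count (the helper `estimate_null_count` and all numbers are hypothetical, not taken from the profiler):

```python
def estimate_null_count(row_count: int, non_null_count: int) -> int:
    """Hypothetical helper: nulls = rows considered minus non-null values."""
    return max(0, row_count - non_null_count)


# Illustrative numbers only.
full_table_rows = 1_000_000   # stands in for profile.rowCount
sample_size = 1_000           # stands in for self.config.sample_size
non_null_in_sample = 950      # non-null values counted in the sampled rows

# Old behaviour: the non-null count comes from the sample, but the
# subtraction uses the full table's row count, inflating the null count.
null_count_before = estimate_null_count(full_table_rows, non_null_in_sample)

# New behaviour: both numbers refer to the sample.
null_count_after = estimate_null_count(sample_size, non_null_in_sample)

print(null_count_before)  # 999050
print(null_count_after)   # 50
```

As the code comment in the hunk notes, `sample_size` is only an approximation of the rows actually sampled, so the resulting null count is approximate as well.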
@@ -781,7 +790,7 @@ def update_dataset_batch_use_sampling(self, profile: DatasetProfileClass) -> Non
sample_pc = 100 * self.config.sample_size / profile.rowCount
sql = (
f"SELECT * FROM {str(self.dataset._table)} "
- + f"TABLESAMPLE SYSTEM ({sample_pc:.3f} percent)"
+ + f"TABLESAMPLE SYSTEM ({sample_pc:.8f} percent)"
)
temp_table_name = create_bigquery_temp_table(
self,
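
The effect of the precision change can be sketched with plain string formatting. The row count and sample size below are hypothetical values chosen to make the rounding visible; only the `sample_pc` formula and the two format specifiers come from the diff:

```python
# Hypothetical values: a very large table and a small sample.
sample_size = 1_000
row_count = 10_000_000_000

# Same formula as in update_dataset_batch_use_sampling.
sample_pc = 100 * sample_size / row_count

# Three decimal places round the percentage down to zero.
before = f"TABLESAMPLE SYSTEM ({sample_pc:.3f} percent)"

# Eight decimal places keep the non-zero percentage.
after = f"TABLESAMPLE SYSTEM ({sample_pc:.8f} percent)"

print(before)  # TABLESAMPLE SYSTEM (0.000 percent)
print(after)   # TABLESAMPLE SYSTEM (0.00001000 percent)
```

With three decimals, any table where the sample is below 0.0005 percent of the rows would produce a clause asking for a 0.000 percent sample.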