
fix(ingest/bigquery): Increase batch size in metadata extraction if no partitioned table involved #7252

Merged: 9 commits into datahub-project:master on Feb 17, 2023

Conversation

@treff7es (Contributor) commented on Feb 3, 2023

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 3, 2023
@@ -78,9 +78,15 @@ class BigQueryV2Config(
)

number_of_datasets_process_in_batch: int = Field(
default=2000,
Collaborator
Can we try 500?

# Conflicts:
#	metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py
- The new query returns at most max_column_size + 1 columns per table
- Only returns the columns for the latest shard, not for the rest
@@ -171,6 +179,12 @@ class BigQueryV2Config(
description="Useful for debugging lineage information. Set to True to see the raw lineage created internally.",
)

run_optimized_column_query: bool = Field(
hidden_from_schema=True,
default=True,
Collaborator
Can we default this to False?

@jjoyce0510 (Collaborator) left a comment

This LGTM

@asikowitz (Collaborator) left a comment

Didn't have time to finish this review, just a quick comment before you merge

Comment on lines 1194 to 1196
partitioned_table_count_in_this_batch = (
partitioned_table_count_in_this_batch + 1
)
@asikowitz (Collaborator) commented on Feb 15, 2023
Can this be replaced with a += 1? Also, do we want to increment this even if we skip the table based on prefix?

@asikowitz (Collaborator) left a comment

I may be misunderstanding the logic flow here, but I have one major concern about how we drop ingestion of tables based on the table_identifier prefix. I also think the counting logic could be simplified.

description as comment,
c.is_hidden as is_hidden,
c.is_partitioning_column as is_partitioning_column,
-- We count the colums to be able limit it later
Collaborator
Typo colums -> columns

-- We count the colums to be able limit it later
row_number() over (partition by c.table_catalog, c.table_schema, c.table_name order by c.ordinal_position asc, c.data_type DESC) as column_num,
-- Getting the maximum shard for each table
row_number() over (partition by c.table_catalog, c.table_schema, ifnull(REGEXP_EXTRACT(c.table_name, r'(.*)_\\d{{8}}$'), c.table_name), cfp.field_path order by c.table_catalog, c.table_schema asc, c.table_name desc) as shard_num
Collaborator
Just checking my knowledge: the sort by c.table_name desc is what makes it so that shard_num = 1 is the most recent shard?

@treff7es (Contributor, Author)

Yes, a sharded table has a suffix like table_name_YYYYMMDD, so if we order by table name in descending order, the latest shard comes first.
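The shard-selection idea can be sketched in plain Python (the regex mirrors the REGEXP_EXTRACT in the query above; the function names here are illustrative, not DataHub's API):

```python
import re

# Regex mirroring the REGEXP_EXTRACT in the query: a sharded table name
# ends in an 8-digit date suffix, e.g. events_20230215.
SHARD_SUFFIX = re.compile(r"(.*)_\d{8}$")

def base_name(table_name: str) -> str:
    """Strip the _YYYYMMDD shard suffix if present (the ifnull(...) in SQL)."""
    match = SHARD_SUFFIX.match(table_name)
    return match.group(1) if match else table_name

def latest_shards(table_names: list) -> dict:
    """Group tables by base name and keep only the newest shard, the way
    shard_num = 1 selects the first row when sorted by table_name desc."""
    latest = {}
    for name in sorted(table_names, reverse=True):  # desc, like the query
        latest.setdefault(base_name(name), name)    # first seen = newest
    return latest

print(latest_shards(["events_20230214", "events_20230215", "users"]))
# {'users': 'users', 'events': 'events_20230215'}
```

Non-sharded tables pass through unchanged because the regex does not match them and each one is its own group.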

Comment on lines 1190 to 1196
if (
table.time_partitioning
or "range_partitioning" in table._properties
):
partitioned_table_count_in_this_batch += 1

table_items[table.table_id] = table
Collaborator
Should we be doing this after checking table_identifier? I feel like the if statement below is not actually dropping tables, because they have already been added to table_items. If that is the case, I think it makes sense to put this in an elif attached to the if shard block.
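A hypothetical, runnable toy of the control flow being suggested, where the prefix check runs before any counting or collecting (the skip_prefix filter and all names here are illustrative, not the actual DataHub source):

```python
# Filter first, so a skipped table is neither counted as partitioned
# nor added to table_items.
def collect_tables(tables, skip_prefix="tmp_"):
    table_items = {}
    partitioned_count = 0
    for name, is_partitioned in tables:
        if name.startswith(skip_prefix):
            continue  # dropped before counting or collecting
        if is_partitioned:
            partitioned_count += 1
        table_items[name] = (name, is_partitioned)
    return table_items, partitioned_count

items, count = collect_tables([("tmp_x", True), ("sales", True), ("users", False)])
print(sorted(items), count)  # ['sales', 'users'] 1
```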

Comment on lines 1207 to 1216
if (
table_count_in_this_batch
% self.config.number_of_datasets_process_in_batch
== 0
) or (
partitioned_table_count_in_this_batch > 0
and partitioned_table_count_in_this_batch
% self.config.number_of_partitioned_datasets_process_in_batch
== 0
):
Collaborator
Could we replace the counting with table_count_in_this_batch by just checking len(table_items)? They should match each other, unless we're getting duplicate table.table_id keys, and in any case that seems like the number we want to track when making the query.

Also, since we're resetting partitioned_table_count_in_this_batch = 0 each query, I think the second conditional can be simplified to partitioned_table_count_in_this_batch == self.config.number_of_partitioned_datasets_process_in_batch
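A minimal, self-contained sketch of the batching logic under discussion (batch_size and partitioned_batch_size stand in for the config values; the real source layout may differ). Because both counters reset on every flush, the modulo checks reduce to plain equality:

```python
def batch_tables(tables, batch_size=3, partitioned_batch_size=2):
    """Flush a batch when either the total-table or the partitioned-table
    threshold is hit; counters reset on each flush, so == suffices."""
    batches, current, partitioned = [], [], 0
    for name, is_partitioned in tables:
        current.append(name)
        if is_partitioned:
            partitioned += 1
        if len(current) == batch_size or partitioned == partitioned_batch_size:
            batches.append(current)
            current, partitioned = [], 0
    if current:  # flush the trailing partial batch
        batches.append(current)
    return batches

tables = [("a", True), ("b", True), ("c", False), ("d", False), ("e", False)]
print(batch_tables(tables))  # [['a', 'b'], ['c', 'd', 'e']]
```

Here the first flush is triggered by the partitioned-table threshold, the second by the total-table threshold.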

@asikowitz (Collaborator) left a comment

One comment but looks good to me!

if (
-    table_count_in_this_batch
-    % self.config.number_of_datasets_process_in_batch
+    len(table_items) % self.config.number_of_datasets_process_in_batch
Collaborator
I think this can also become a == because we clear table_items on each query.

@treff7es treff7es merged commit aa388f0 into datahub-project:master Feb 17, 2023
oleg-ruban pushed a commit to RChygir/datahub that referenced this pull request Feb 28, 2023
yoonhyejin pushed a commit that referenced this pull request Mar 3, 2023