
fix(ingest/bigquery): Querying table metadata details in batch properly #7429

Conversation

treff7es (Contributor) commented:

Even though we already queried table details in batches, there were multiple problems with that approach:
Unfortunately, the INFORMATION_SCHEMA.PARTITIONS view (which we query to get table stats for the table profile and to find the latest partition to profile) contains non-partitioned tables as well, so even if there are no partitioned tables at all, filtering on tables in that view can still raise the "too many tables were queried" exception.

In this fix I changed the logic to use a higher batch size (200) if profiling is not enabled and a lower batch size (80) if it is enabled.
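In other words, the batch size is now picked based on whether profiling is enabled, roughly like this (a minimal sketch; the function name and flag are illustrative, not the actual config fields):

def pick_metadata_batch_size(profiling_enabled: bool) -> int:
    # Profiling queries INFORMATION_SCHEMA.PARTITIONS, which is stricter about
    # how many tables a single query may reference, so use a smaller batch.
    if profiling_enabled:
        return 80
    # Without profiling only the less restrictive metadata views are hit, so a
    # larger batch is fine (per the discussion below, the configured default in
    # the diff is 500, not 200).
    return 200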

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Feb 24, 2023
asikowitz (Collaborator) left a comment:

I like the cleaner logic but am not yet understanding how this reduces the number of tables hit in the query

@@ -80,10 +80,10 @@ class BigQueryV2Config(
    number_of_datasets_process_in_batch: int = Field(
        hidden_from_schema=True,
        default=500,
asikowitz (Collaborator):

In the description you said 200 without profiling, did you mean 500?

treff7es (Contributor, Author):

ahh, yes, sorry

stored_table_identifier.raw_table_name()
)
# When table is None, we use dataset_name as the table_name
assert stored_shard
asikowitz (Collaborator):

Do we have exception handling for this assert? What does this statement do to the execution flow of ingestion?

treff7es (Contributor, Author):

This should not happen; the assert is only there for mypy.

asikowitz (Collaborator):

I see. I guess if we wanted to be more type-safe we could make sharded_tables store a mapping of table_name -> (table, shard), but that doesn't seem necessary.

asikowitz (Collaborator):

Perhaps we should wrap this in a try/except and log an error, just in case a later code change adds a non-sharded table to this dict.
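That would look roughly like this (a minimal sketch of the suggestion; the logger and the surrounding loop are assumed, not the actual code):

if not stored_shard:
    # Should be unreachable: only sharded tables are added to sharded_tables.
    # Log instead of crashing if a future change ever breaks that invariant.
    logger.error(
        "Table %s is in sharded_tables but has no shard suffix; skipping it",
        table_name,
    )
    continue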

) -> Dict[str, TableListItem]:
table_items: Dict[str, TableListItem] = {}
# Dict to store sharded table and the last seen max shard id
sharded_tables: Dict[str, TableListItem] = defaultdict()
asikowitz (Collaborator):

What does making this a defaultdict give us? The only default usage I see is sharded_tables[table_name].table_id, but I don't see how that would work because the default_factory is None, right? It just seems to complicate the logic, because we have to check
if not sharded_tables.get(table_name): rather than if table_name not in sharded_tables

treff7es (Contributor, Author):

makes sense, I can change it
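For reference, a defaultdict created without a default_factory behaves exactly like a plain dict for missing keys, so it buys nothing here (a quick illustration; table_name and table come from the surrounding loop):

from collections import defaultdict

d = defaultdict()   # default_factory is None
"missing" in d      # False; membership tests behave the same as for a plain dict
# d["missing"]      # would raise KeyError, exactly like a plain dict

# So the simpler, clearer form is:
sharded_tables = {}
if table_name not in sharded_tables:
    sharded_tables[table_name] = table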

table_items = self.get_core_table_details(conn, dataset_name, project_id)

items_to_get: Dict[str, TableListItem] = {}
for table_item in table_items.keys():
asikowitz (Collaborator):

So ultimately we're querying INFORMATION_SCHEMA.TABLES (perhaps joined with INFORMATION_SCHEMA.PARTITIONS) with some filter and t.table_name in .... Is there any way to order the tables so we're fetching data most efficiently, or not really?

I guess I'm a bit confused how accessing the PARTITIONS table causes us to query more tables, unless each partition counts as its own table (or maybe a % of a table? lol)

treff7es (Contributor, Author):

Only the PARTITIONS table has this limitation; the other INFORMATION_SCHEMA views don't, which is why only this table is affected. We don't query more tables here than in any other INFORMATION_SCHEMA view.
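For context, the profiling path runs a query along these lines against the PARTITIONS view, filtered to the current batch of table names (an illustrative sketch, not the connector's actual SQL; batch_of_table_names, project_id and dataset_name are assumed to be in scope):

table_filter = ", ".join(f"'{name}'" for name in batch_of_table_names)
query = f"""
    SELECT table_name, SUM(total_rows) AS row_count, MAX(partition_id) AS max_partition_id
    FROM `{project_id}.{dataset_name}.INFORMATION_SCHEMA.PARTITIONS`
    WHERE table_name IN ({table_filter})
    GROUP BY table_name
"""
# Because BigQuery limits how many tables one PARTITIONS query may reference,
# the batch above has to stay small enough, which is what the batch-size change addresses.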

asikowitz (Collaborator):

Is this documented on https://cloud.google.com/bigquery/docs/information-schema-partitions? Or just something you have to find out lol

Comment on lines -1240 to -1242
if (
    len(table_items) % self.config.number_of_datasets_process_in_batch
    == 0
asikowitz (Collaborator):

We were already batching this even if there was no partitioning right? What is this PR changing that will fix the issue?

treff7es (Contributor, Author):

Earlier, we added the list of sharded tables to the last batch, and because of this the last batch could end up much larger than the max batch size. That is now fixed.
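Conceptually, the batching now works like this (a sketch; only number_of_datasets_process_in_batch and table_items appear in the diff, the rest is illustrative):

max_batch = self.config.number_of_datasets_process_in_batch
table_names = list(table_items.keys())

# Every slice, including the one that contains the sharded tables, is capped
# at max_batch, so the final batch can no longer exceed the limit.
for start in range(0, len(table_names), max_batch):
    batch = table_names[start : start + max_batch]
    # ... fetch metadata / partition info for exactly this batch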

asikowitz (Collaborator) left a comment:

LGTM

# For example some_dataset.20220110 will be turned into some_dataset.some_dataset
# It seems some BigQuery users use this non-standard way of sharding tables.
if shard:
if table_name in sharded_tables:
asikowitz (Collaborator):

Think you need to change this to not in !!
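That is, the intended logic is roughly (a sketch; the shard comparison is an assumed detail, the real code may track the latest shard differently):

if shard:
    if table_name not in sharded_tables:
        # first shard seen for this logical table
        sharded_tables[table_name] = table
    elif shard > stored_shard:
        # a newer shard was found, keep only the latest one
        sharded_tables[table_name] = table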

@treff7es treff7es merged commit 14a6604 into datahub-project:master Feb 27, 2023
oleg-ruban pushed a commit to RChygir/datahub that referenced this pull request Feb 28, 2023