
fix(ingest/bigquery): Fix for table cache was not cleared #7323

Merged: 5 commits into datahub-project:master on Feb 13, 2023

Conversation

treff7es (Contributor):

  • Fix for not clearing the BigQuery table cache, which caused tables to carry over from one project to another (a minimal sketch of the bug pattern follows this list)
  • Fixing range partition profiling
  • Fixing peak memory report
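
For reference, a minimal sketch of the cache bug pattern, using hypothetical names (db_tables/db_views are the containers mentioned in the review below; the loop body is illustrative, not the actual DataHub code):

```python
from typing import Dict, List

# Per-project caches; reusing them across projects without a reset is the bug.
db_tables: Dict[str, List[str]] = {}
db_views: Dict[str, List[str]] = {}

def process_project(project_id: str) -> None:
    # Without these resets, entries from the previous project are still
    # present and get emitted again under the wrong project.
    db_tables.clear()
    db_views.clear()
    # ... populate db_tables / db_views for this project ...

for project_id in ["project-a", "project-b"]:
    process_project(project_id)
```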

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Fix for not clearing BigQuery table cache, which caused tables to carry over from one project to the other

- Fixing range partition profiling
- Fixing peak memory report
github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Feb 12, 2023
@jjoyce0510 (Collaborator) left a comment:

Overall this looks okay to me, but I am definitely nervous about the time partitioning changes, given that we've had to change them a few times now.

@@ -123,15 +123,15 @@ class CliReport(Report):
py_version: str = sys.version
py_exec_path: str = sys.executable
os_details: str = platform.platform()
-    _peek_memory_usage: int = 0
+    _peak_memory_usage: int = 0
jjoyce0510 (Collaborator):

Nice :)
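
For context, one common way to populate a field like _peak_memory_usage is to sample the process RSS from a background thread and keep the maximum. A minimal sketch, assuming psutil is available (the actual DataHub mechanism may differ):

```python
import threading
import time

import psutil  # assumed dependency; DataHub's actual approach may differ

class PeakMemoryTracker:
    """Samples the process RSS in a background thread and records the peak."""

    def __init__(self, interval: float = 0.5) -> None:
        self._process = psutil.Process()  # the current process
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self.peak_memory_usage: int = 0  # bytes

    def _poll(self) -> None:
        while not self._stop.is_set():
            rss = self._process.memory_info().rss
            self.peak_memory_usage = max(self.peak_memory_usage, rss)
            time.sleep(self._interval)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```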

@@ -572,7 +568,10 @@ def _process_project(
self.report.report_dropped(f"{bigquery_dataset.name}.*")
continue
try:
yield from self._process_schema(conn, project_id, bigquery_dataset)
# db_tables and db_views are populated in this method
jjoyce0510 (Collaborator):

Thanks for adding this comment

@@ -768,12 +789,12 @@ def _process_table(
)

# If table has time partitioning, set the data type of the partitioning field
- if table.time_partitioning:
-     table.time_partitioning.column = next(
+ if table.partition_info:
jjoyce0510 (Collaborator):

Curious -- Why was the previous code wrong??

treff7es (Contributor, Author):

It only supported time-partitioned columns, not range-partitioned ones.

# When table is none, we use dataset_name as table_name
table_name = table_identifier.get_table_name().split(".")[-1]
assert stored_shard
if stored_shard < shard:
jjoyce0510 (Collaborator):

This condition makes very little sense to me as an unfamiliar reader. Consider adding a comment to explain
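
A plausible reading of the condition, as a self-contained sketch with hypothetical names: date-sharded BigQuery tables share a base name plus a YYYYMMDD suffix, so comparing shard suffixes as strings orders them chronologically, and the code keeps only the newest shard per base name.

```python
# Hypothetical sketch of the shard bookkeeping, not the actual source:
# map each base table name to the newest shard suffix seen so far.
sharded_tables: dict = {}

def record_shard(table_name: str, shard: str) -> None:
    stored_shard = sharded_tables.get(table_name)
    # Shard suffixes are YYYYMMDD strings, so lexicographic order matches
    # chronological order: "20230213" > "20230210".
    if stored_shard is None or stored_shard < shard:
        sharded_tables[table_name] = shard

record_shard("events", "20230210")
record_shard("events", "20230213")
assert sharded_tables["events"] == "20230213"
```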

if stored_shard < shard:
sharded_tables[table_name] = table
continue
else:
jjoyce0510 (Collaborator):

Hard to follow all of these nested if/else conditions. It would maybe be easier to read if we broke this into small, well-named functions.

continue
else:
table_count = table_count + 1
table_items[table.table_id] = table
jjoyce0510 (Collaborator):

I have no clue from reading the name of this data structure what is inside of it. 'Table items' is a super generic name. Wondering if we can make it clearer

table_items,
with_data_read_permission=self.config.profiling.enabled,
)
)
jjoyce0510 (Collaborator):

So, having read through this section of the code, I can say I have little idea what the implications of making these changes will be.

- tables[table.table_name].time_partitioning
- )
- if tables and tables[table.table_name].time_partitioning
+ partition_info=PartitionInfo.from_table_info(tables[table.table_name])
jjoyce0510 (Collaborator):

So we previously assumed all Partitioned Tables were time partitioned? And now we know there's another default case of range partitioning that we are handling?

treff7es (Contributor, Author):

In the BigQuery API, if you call the list_tables API, it returns TableInfo objects with a TimePartitioning property that is filled in if the table is time-partitioned.
For a range-partitioned table this is not filled in, and the only way to detect it is to check whether there is a rangePartitioning key in its _properties.
It is odd that this is not exposed as a first-class property, but it is what it is.
Earlier I used only the TimePartitioning field, and that way we skipped the range-partitioned tables.
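
To illustrate the asymmetry described above, a minimal sketch assuming the google-cloud-bigquery Python client, where list_tables yields TableListItem objects (project and dataset names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

for item in client.list_tables("my-project.my_dataset"):
    if item.time_partitioning:
        # Time partitioning is exposed as a first-class property.
        print(item.table_id, "time-partitioned on", item.time_partitioning.field)
    elif "rangePartitioning" in item._properties:
        # Range partitioning only shows up in the raw API payload.
        range_info = item._properties["rangePartitioning"]
        print(item.table_id, "range-partitioned on", range_info["field"])
```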

jjoyce0510 (Collaborator):

got it - makes sense

from datahub.ingestion.source.bigquery_v2.profiler import BigqueryProfiler


def test_not_generate_partition_profiler_query_if_not_partitioned_sharded_table():
jjoyce0510 (Collaborator):

Nice test!

conn=client_mock, project_id="test-project", dataset_name="test-dataset"
)

assert data_dictionary_mock.call_count == 1
jjoyce0510 (Collaborator):

Thank you for adding these!
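
The assertion pattern used here, reproduced as a self-contained sketch with illustrative names (not the actual DataHub test): mock the client, exercise the caching path twice, and verify the underlying API is hit exactly once.

```python
from unittest.mock import MagicMock

_cache: dict = {}

def get_tables_cached(conn, project_id: str, dataset_name: str):
    # Hypothetical helper standing in for the data-dictionary lookup.
    key = (project_id, dataset_name)
    if key not in _cache:
        _cache[key] = conn.list_tables(project_id, dataset_name)
    return _cache[key]

client_mock = MagicMock()
get_tables_cached(client_mock, "test-project", "test-dataset")
get_tables_cached(client_mock, "test-project", "test-dataset")

# The second lookup is served from the cache, so the API is called once.
assert client_mock.list_tables.call_count == 1
```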

@jjoyce0510 (Collaborator) left a comment:

Thanks for addressing comments!

treff7es merged commit b34e4fe into datahub-project:master on Feb 13, 2023