fix(ingest/bigquery) Lowering significantly the memory usage of the BigQuery connector #7315
Conversation
Adding blackhole sink for testing
@@ -576,6 +576,7 @@ def get_long_description():
    "datahub.ingestion.sink.plugins": [
        "file = datahub.ingestion.sink.file:FileSink",
        "console = datahub.ingestion.sink.console:ConsoleSink",
        "blackhole = datahub.ingestion.sink.blackhole:BlackHoleSink",
Pretty fun :) DevNull!
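The "DevNull" comparison fits: a black-hole sink accepts every record and discards it, which is useful for measuring ingestion-side memory without the cost of a real sink. A minimal sketch of the idea — the class and method names here are illustrative, not DataHub's actual `Sink` interface:

```python
class BlackHoleSink:
    """Hypothetical sketch: accept everything, keep nothing (like /dev/null)."""

    def __init__(self) -> None:
        self.records_written = 0

    def write_record(self, record) -> None:
        # Intentionally discard the record; only count it for reporting.
        self.records_written += 1

    def close(self) -> None:
        # Nothing was buffered, so there is nothing to flush.
        pass
```

In a pipeline config this would be selected like any other sink plugin (`console`, `file`), just with no output side effects.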
mem_usage = psutil.Process(os.getpid()).memory_info().rss
if self._peek_memory_usage < mem_usage:
    self._peek_memory_usage = mem_usage
self.peek_memory_usage = humanfriendly.format_size(self._peek_memory_usage)
Thanks - this is SUPER useful since logs get cut off
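The diff samples the process RSS via psutil and keeps the high-water mark. A sketch of that pattern, with the sampling function separated out so the max-tracking logic stands on its own (names here are illustrative; the real code uses `psutil.Process(os.getpid()).memory_info().rss` directly, and psutil is assumed optional in this sketch):

```python
import os


def current_rss_bytes() -> int:
    """Current resident set size in bytes; 0 if psutil is unavailable."""
    try:
        import psutil
        return psutil.Process(os.getpid()).memory_info().rss
    except ImportError:
        return 0


class PeakMemoryTracker:
    """Keep the highest RSS seen across repeated samples."""

    def __init__(self) -> None:
        self._peak = 0

    def sample(self, rss: int) -> None:
        # Only ever move the high-water mark upward.
        if rss > self._peak:
            self._peak = rss

    @property
    def peak(self) -> int:
        return self._peak
```

Recording the peak in the ingestion report means the number survives even when the tail of the log is truncated.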
]
for dataset in self.db_views[project_id]:
for dataset in self.db_views.keys():
    tables[dataset].extend(
qq -- is there a reason we are adding VIEWS into a structure called TABLES?
It is just for simplicity to pass in all the table/view names and validate usage with it. Maybe I should add some more expressive names.
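What the author describes — merging table and view names into one per-dataset structure purely to validate usage against every known name — could be sketched like this (structure and function name are illustrative, not the PR's actual helper):

```python
from collections import defaultdict
from typing import Dict, List


def merge_table_and_view_names(
    db_tables: Dict[str, List[str]],
    db_views: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Combine table and view names per dataset for usage validation."""
    names: Dict[str, List[str]] = defaultdict(list)
    for dataset, tables in db_tables.items():
        names[dataset].extend(tables)
    for dataset, views in db_views.items():
        names[dataset].extend(views)
    return dict(names)
```

A more expressive name like `known_dataset_entities` would address the reviewer's point without changing behavior.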
BigqueryTableIdentifier(
    project_id, dataset, table.name
).get_table_name()
for table in self.db_views[dataset]
for view in self.db_views?
@@ -692,25 +696,50 @@ def _process_schema(
    project_id,
)

columns = BigQueryDataDictionary.get_columns_for_dataset(
This is a dictionary? Of table name to columns?
Oh this fetches ALL columns for ALL tables in the dataset. very interesting.
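The reviewer's observation is the core of the memory/latency trade in this PR: one bulk metadata query per dataset, whose rows are then grouped into a mapping of table name to columns, instead of one query per table. A sketch of the grouping step, assuming each row carries a table name and a column name (the real query presumably hits BigQuery's `INFORMATION_SCHEMA`; the row shape here is illustrative):

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_columns_by_table(
    rows: List[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group (table_name, column_name) rows from one dataset-wide
    metadata query into a table -> columns mapping."""
    columns: Dict[str, List[str]] = defaultdict(list)
    for table_name, column_name in rows:
        columns[table_name].append(column_name)
    return dict(columns)
```

One round trip per dataset amortizes query latency across all of its tables, at the cost of holding that dataset's column metadata in memory at once.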
view, view_columns, project_id, dataset_name
)

# This methode is used to generate the ignore list for datatypes the profiler doesn't support we have to do it here
typo: method
logger.warning(
    f"Table doesn't have any column or unable to get columns for table: {table_identifier}"
)

yield from self.gen_table_dataset_workunits(table, project_id, schema_name)
# If table has time partitioning, set the data type of the partitioning field
if table.time_partitioning:
Why do we now need this??
BigQuery's table metadata returns the partition column's name but not its data type, so we have to look it up in the list of columns.
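The author's explanation can be sketched as a simple lookup over the already-fetched column list; `Column` here is a stand-in for the PR's typed column object, and the names are illustrative:

```python
from typing import List, Optional


class Column:
    """Stand-in for the connector's typed column metadata object."""

    def __init__(self, name: str, data_type: str) -> None:
        self.name = name
        self.data_type = data_type


def partition_column_type(
    columns: List[Column], partition_field: str
) -> Optional[str]:
    """Resolve the data type of the time-partitioning field, which
    BigQuery's table metadata reports by name only."""
    for column in columns:
        if column.name == partition_field:
            return column.data_type
    return None
```

The resolved type is what later lets the profiler build a correct partition WHERE clause.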
yield from self._process_table(conn, table, project_id, dataset_name)
tables = self.get_tables_for_dataset(conn, project_id, dataset_name)
for table in tables:
    table_columns = columns.get(table.name, []) if columns else []
Nice handling!
table_identifier.project_id,
table_identifier.dataset,
) not in self.schema_columns.keys():
    columns = BigQueryDataDictionary.get_columns_for_dataset(
Was this making a full call for EVERY table before?
dataset_name=table_identifier.dataset,
column_limit=column_limit,
)
self.schema_columns[
Ah I see we were caching this. Wow! All columns cached here. :(
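The caching pattern the reviewer is reacting to — the first table in a dataset triggers one bulk fetch, every later table in that dataset hits the cache keyed by `(project, dataset)` — could be sketched like this (`fetch` stands in for `BigQueryDataDictionary.get_columns_for_dataset`; names are illustrative):

```python
from typing import Callable, Dict, List, Tuple

ColumnMap = Dict[str, List[str]]  # table name -> column names


class SchemaColumnCache:
    """Cache dataset-wide column metadata keyed by (project, dataset)."""

    def __init__(self, fetch: Callable[[str, str], ColumnMap]) -> None:
        self._fetch = fetch
        self._cache: Dict[Tuple[str, str], ColumnMap] = {}

    def columns_for(self, project: str, dataset: str) -> ColumnMap:
        key = (project, dataset)
        if key not in self._cache:
            # One bulk fetch per dataset; subsequent tables reuse it.
            self._cache[key] = self._fetch(project, dataset)
        return self._cache[key]
```

The memory cost the reviewer laments is exactly this: all columns of a dataset stay resident for the cache's lifetime, so evicting a dataset's entry once it is fully processed would bound the footprint.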
if table.time_partitioning.type_ in ("HOUR", "DAY", "MONTH", "YEAR"):
partition_where_clause = f"{partition_column_type}(`{table.time_partitioning.field}`) BETWEEN {partition_column_type}('{partition_datetime}') AND {partition_column_type}('{upper_bound_partition_datetime}')"
if table.time_partitioning.type in ("HOUR", "DAY", "MONTH", "YEAR"):
Please ensure we have sufficient try: except around this method. There are a LOT of new changes and it makes me quite nervous.
I wouldn't worry about this now: here we have typed objects with non-optional properties, whereas earlier we used BigQuery's object, which was backed by a dict (that is where the KeyError happened, because its to-string method failed).
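The author's point — that a dict-backed object fails late (a KeyError at format time) while a typed object with non-optional fields fails at construction, so downstream code can rely on the fields existing — can be illustrated with a dataclass (names here are illustrative, not the PR's actual types):

```python
from dataclasses import dataclass


@dataclass
class TimePartitioning:
    """Typed stand-in for the dict-backed BigQuery metadata object."""
    type: str   # e.g. "DAY"
    field: str  # name of the partition column


def describe(tp: TimePartitioning) -> str:
    # No dict lookups: the attributes are guaranteed by the type,
    # so there is no KeyError path at formatting time.
    return f"{tp.type} partitioning on {tp.field}"
```

With the dict-backed object, `metadata["type"]` on a malformed payload raises only when the string is built; with the dataclass, a missing field is a TypeError at the construction site, which is far easier to locate.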
normalized_table_name = BigqueryTableIdentifier(
    project_id=project, dataset=dataset, table=table.name
).get_table_name()
for column in table.columns:
I'm glad we are better organizing this.
@@ -257,7 +247,7 @@ def get_bigquery_profile_request(
    + 1
)

if not table.columns:
if not table.column_count:
nice...
metadata-ingestion/src/datahub/ingestion/source/sql/sql_generic_profiler.py
LGTM
self._peek_memory_usage = mem_usage
self.peek_memory_usage = humanfriendly.format_size(self._peek_memory_usage)

self.mem_info = humanfriendly.format_size(self._peek_memory_usage)
Shouldn't be on peek memory usage
# These sinks are always enabled
assert sink_registry.get("console")
assert sink_registry.get("file")
assert sink_registry.get("blackhole")
Looks like a bad merge - we don't need these lines