
fix(ingest/bigquery) Lowering significantly the memory usage of the BigQuery connector #7315

Merged

jjoyce0510 merged 4 commits into datahub-project:master on Feb 10, 2023

Conversation

treff7es
Contributor

Adding blackhole sink for testing

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@@ -576,6 +576,7 @@ def get_long_description():
"datahub.ingestion.sink.plugins": [
"file = datahub.ingestion.sink.file:FileSink",
"console = datahub.ingestion.sink.console:ConsoleSink",
"blackhole = datahub.ingestion.sink.blackhole:BlackHoleSink",
Collaborator

Pretty fun :) DevNull!
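
For readers skimming the diff: a black-hole sink accepts every record and throws it away, which makes it useful for profiling ingestion without any sink overhead. A minimal sketch of the idea (method names here are illustrative, not DataHub's actual Sink interface):

```python
class BlackHoleSink:
    """Discard everything, like redirecting output to /dev/null."""

    def write_record(self, record: object) -> None:
        # Accept the record and drop it; no I/O, no buffering.
        pass

    def close(self) -> None:
        # Nothing to flush or release.
        pass
```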

mem_usage = psutil.Process(os.getpid()).memory_info().rss
if self._peek_memory_usage < mem_usage:
self._peek_memory_usage = mem_usage
self.peek_memory_usage = humanfriendly.format_size(self._peek_memory_usage)
Collaborator

Thanks - this is SUPER useful since logs get cut off
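
For context, the pattern in the snippet above, written out as a self-contained sketch (assuming psutil and humanfriendly are installed; names are simplified from the PR's report fields):

```python
import os

import humanfriendly
import psutil

_peak_rss = 0  # highest resident set size observed so far, in bytes


def report_peak_memory() -> str:
    """Sample the current process RSS and return the formatted peak."""
    global _peak_rss
    rss = psutil.Process(os.getpid()).memory_info().rss
    _peak_rss = max(_peak_rss, rss)
    return humanfriendly.format_size(_peak_rss)
```

Calling this periodically from the ingestion loop keeps the peak figure available in the report even when logs are truncated.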

]
-for dataset in self.db_views[project_id]:
+for dataset in self.db_views.keys():
tables[dataset].extend(
Collaborator

qq -- is there a reason we are adding VIEWS into a structure called TABLES?

Contributor Author

It is just for simplicity: we pass in all the table/view names and validate usage against them. Maybe I should use some more expressive names.
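
To make the naming question concrete, the pattern under discussion, merging table and view names into one per-dataset lookup used for usage validation, might look roughly like this (a sketch with hypothetical names, not the PR's exact code):

```python
from collections import defaultdict
from typing import Dict, List


def collect_table_and_view_names(
    db_tables: Dict[str, List[str]],
    db_views: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    # One structure holds both kinds of names; only the variable name says
    # "tables", which is the naming nit raised above.
    names: Dict[str, List[str]] = defaultdict(list)
    for dataset, table_names in db_tables.items():
        names[dataset].extend(table_names)
    for dataset, view_names in db_views.items():
        names[dataset].extend(view_names)
    return names
```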

BigqueryTableIdentifier(
project_id, dataset, table.name
).get_table_name()
for table in self.db_views[dataset]
Collaborator

for view in self.db_views?

@@ -692,25 +696,50 @@ def _process_schema(
project_id,
)

columns = BigQueryDataDictionary.get_columns_for_dataset(
Collaborator

This is a dictionary? Of table name to columns?

Collaborator

Oh this fetches ALL columns for ALL tables in the dataset. very interesting.
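
A hedged sketch of what fetching all columns for a dataset in one round trip can look like against BigQuery's INFORMATION_SCHEMA (the query shape and helper name are assumptions, not the PR's exact implementation):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

from google.cloud import bigquery

COLUMNS_QUERY = """
SELECT table_name, column_name, data_type, ordinal_position
FROM `{project}.{dataset}`.INFORMATION_SCHEMA.COLUMNS
ORDER BY table_name, ordinal_position
"""


def get_columns_for_dataset(
    client: bigquery.Client, project: str, dataset: str
) -> Dict[str, List[Tuple[str, str]]]:
    # One query returns the columns of every table in the dataset,
    # replacing a round trip per table.
    columns: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    query = COLUMNS_QUERY.format(project=project, dataset=dataset)
    for row in client.query(query).result():
        columns[row["table_name"]].append((row["column_name"], row["data_type"]))
    return columns
```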

view, view_columns, project_id, dataset_name
)

# This methode is used to generate the ignore list for datatypes the profiler doesn't support we have to do it here
Collaborator

typo: method

logger.warning(
f"Table doesn't have any column or unable to get columns for table: {table_identifier}"
)

yield from self.gen_table_dataset_workunits(table, project_id, schema_name)
# If table has time partitioning, set the data type of the partitioning field
if table.time_partitioning:
Collaborator

Why do we now need this??

Contributor Author

BigQuery returns the partition column's name in the table metadata, but not its data type, so we have to look it up in the list of columns.
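
A sketch of that lookup (attribute names are hypothetical):

```python
from typing import List, Optional


def resolve_partition_column_type(
    partition_field: str, table_columns: List
) -> Optional[str]:
    # The table metadata names the partition column but not its type, so
    # find the matching entry in the columns we already fetched.
    for column in table_columns:
        if column.name == partition_field:
            return column.data_type
    return None
```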

yield from self._process_table(conn, table, project_id, dataset_name)
tables = self.get_tables_for_dataset(conn, project_id, dataset_name)
for table in tables:
table_columns = columns.get(table.name, []) if columns else []
Collaborator

Nice handling!

table_identifier.project_id,
table_identifier.dataset,
) not in self.schema_columns.keys():
columns = BigQueryDataDictionary.get_columns_for_dataset(
Collaborator

Was this making a full call for EVERY table before?

dataset_name=table_identifier.dataset,
column_limit=column_limit,
)
self.schema_columns[
Collaborator

Ah I see we were caching this. Wow! All columns cached here. :(
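
The caching pattern being reviewed, reduced to a sketch (names are assumptions): fetch a dataset's columns once, key the result by (project, dataset), and reuse it for every table in that dataset. The trade-off flagged above is that all of the dataset's columns stay in memory while it is being processed.

```python
from typing import Callable, Dict, List, Tuple

ColumnMap = Dict[str, List]  # table name -> list of columns

_schema_columns: Dict[Tuple[str, str], ColumnMap] = {}


def get_columns_cached(
    project: str, dataset: str, fetch: Callable[[str, str], ColumnMap]
) -> ColumnMap:
    key = (project, dataset)
    if key not in _schema_columns:
        # One bulk call per dataset instead of one call per table.
        _schema_columns[key] = fetch(project, dataset)
    return _schema_columns[key]
```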


-if table.time_partitioning.type_ in ("HOUR", "DAY", "MONTH", "YEAR"):
-    partition_where_clause = f"{partition_column_type}(`{table.time_partitioning.field}`) BETWEEN {partition_column_type}('{partition_datetime}') AND {partition_column_type}('{upper_bound_partition_datetime}')"
+if table.time_partitioning.type in ("HOUR", "DAY", "MONTH", "YEAR"):
Collaborator

Please ensure we have sufficient try/except around this method. There are a LOT of new changes, and it makes me quite nervous.

Contributor Author

I wouldn't worry about this now: here we have typed objects with non-optional properties, whereas earlier we used BigQuery's object, which was backed by a dict. (That is where the KeyError happened, because its to-string method failed.)
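
To illustrate the distinction the author is drawing, here is a generic example (not the PR's actual class): a typed object with non-optional fields fails loudly at construction time, instead of raising a KeyError later when a dict-backed object is read or stringified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimePartitioning:
    # Omitting field or type raises a TypeError when the object is built,
    # unlike a dict-backed object that fails later with a KeyError.
    field: str
    type: str  # e.g. "HOUR", "DAY", "MONTH", "YEAR"
    expiration_ms: Optional[int] = None
```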

normalized_table_name = BigqueryTableIdentifier(
project_id=project, dataset=dataset, table=table.name
).get_table_name()
for column in table.columns:
Collaborator

I'm glad we are better organizing this.

@@ -257,7 +247,7 @@ def get_bigquery_profile_request(
+ 1
)

-if not table.columns:
+if not table.column_count:
Collaborator

nice...

github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Feb 10, 2023
Collaborator

@jjoyce0510 jjoyce0510 left a comment

LGTM

@jjoyce0510 jjoyce0510 merged commit 793f303 into datahub-project:master Feb 10, 2023
self._peek_memory_usage = mem_usage
self.peek_memory_usage = humanfriendly.format_size(self._peek_memory_usage)

self.mem_info = humanfriendly.format_size(self._peek_memory_usage)
Collaborator

This shouldn't be based on peek memory usage.

# These sinks are always enabled
assert sink_registry.get("console")
assert sink_registry.get("file")
assert sink_registry.get("blackhole")
Collaborator

Looks like a bad merge - we don't need these lines
