
fix(ingest/snowflake): Improve memory usage of metadata extraction #7349

Merged: 10 commits into datahub-project:master on Feb 20, 2023

Conversation

@asikowitz (Collaborator) commented Feb 15, 2023

Similar to #7315, this attempts to reduce the memory usage of the Snowflake connector by removing global data stores.

This differs from the BigQuery changes in that I attempt to remove the global stores entirely, instead using the lru_cache decorator to cache the results of repeated function calls. We order our metadata extraction queries such that we should not issue any additional queries while caching the results of only a single query at a time (vs. caching all of the results). I have to keep a single store, db_tables, to support profiling, but it is now per-database rather than fully global.
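To make the approach concrete, here is a minimal sketch with hypothetical names (not the actual connector code): lru_cache(maxsize=1) on a per-database lookup keeps only the most recently queried database's results in memory, instead of a global store holding results for every database.

from functools import lru_cache
from typing import List


class SnowflakeMetadataCacheSketch:
    # Hypothetical example: repeated calls for the same database hit the
    # cache instead of re-querying; moving on to the next database evicts
    # the previous results because maxsize=1.
    @lru_cache(maxsize=1)
    def get_tables_for_database(self, db_name: str) -> List[str]:
        # The real connector runs an information_schema query here; a
        # placeholder result stands in for it.
        return [f"{db_name}.placeholder_schema.placeholder_table"]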

This does not include any lineage memory optimization -- I'll do that in a followup PR.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Feb 15, 2023
Comment on lines +682 to +683
key=lambda x: (x[1]["name"].casefold(), x[1]["name"])
if "name" in x[1]
asikowitz (Collaborator, Author) commented:

Result of running black

self.state_handler.add_to_state(
dataset_urn, int(datetime.now().timestamp() * 1000)
)
for request, profile in self.generate_profiles(
asikowitz (Collaborator, Author) commented:

No changes below, just changing indentation


from datahub.ingestion.source.snowflake.constants import SnowflakeEdition
from datahub.ingestion.source.sql.sql_generic_profiler import ProfilingSqlReport
from datahub.ingestion.source_report.sql.snowflake import SnowflakeReport
from datahub.ingestion.source_report.usage.snowflake_usage import SnowflakeUsageReport


@dataclass
asikowitz (Collaborator, Author) commented:

Pretty sure we want this on every class, even though it inherits from a dataclass?
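For context on the question above, a small standalone example (illustrative names only): fields declared on a subclass are only picked up by the generated __init__/__repr__ if the subclass itself is decorated with @dataclass, even when its base class already is one.

from dataclasses import dataclass, field
from typing import List


@dataclass
class BaseReport:
    warnings: List[str] = field(default_factory=list)


@dataclass  # without this decorator, tags_scanned would not become an __init__ parameter
class ChildReport(BaseReport):
    tags_scanned: List[str] = field(default_factory=list)


print(ChildReport())  # ChildReport(warnings=[], tags_scanned=[])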

Comment on lines +50 to +51
_processed_tags: MutableSet[str] = field(default_factory=set)
_scanned_tags: MutableSet[str] = field(default_factory=set)
asikowitz (Collaborator, Author) commented:

As a result of adding @dataclass
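Relatedly, a quick illustration (hypothetical class name) of why default_factory is needed once the class is a dataclass: a plain mutable default such as = set() is rejected by the dataclass machinery, while field(default_factory=set) gives each instance its own set.

from dataclasses import dataclass, field
from typing import MutableSet


@dataclass
class TagStateSketch:
    # Writing `_processed_tags: MutableSet[str] = set()` here would raise
    # ValueError: mutable default <class 'set'> for field _processed_tags is not allowed
    _processed_tags: MutableSet[str] = field(default_factory=set)
    _scanned_tags: MutableSet[str] = field(default_factory=set)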


if self.config.profiling.enabled and len(databases) != 0:
yield from self.profiler.get_workunits(databases)
# TODO: The checkpoint state for stale entity detection can be committed here.
asikowitz (Collaborator, Author) commented:

Just fixing a typo

@@ -504,29 +492,36 @@ def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
yield from self._process_database(snowflake_db)

except SnowflakePermissionError as e:
# FIXME - This may break satetful ingestion if new tables than previous run are emitted above
# FIXME - This may break stateful ingestion if new tables than previous run are emitted above
A collaborator commented:

Thank you!

@treff7es (Contributor) left a comment:

nice, lgtm

@treff7es merged commit 8fd2cc5 into datahub-project:master on Feb 20, 2023
@asikowitz deleted the snowflake-memory-reduction branch on February 21, 2023
@mayurinehate (Collaborator) left a comment:

Although this PR is already merged, I'm submitting these comments since they are still valid. They can probably be discussed offline and handled in a follow-up PR, if that makes sense.

(I had added these comments earlier, but they did not get published because I forgot to click the "Submit Review" button at the end.)

Tuple[str, str], Dict[str, List[SnowflakeFK]]
] = {}
# Caches tables for a single database. Consider moving to disk or S3 when possible.
self.db_tables: Dict[str, List[SnowflakeTable]] = {}
A collaborator commented:

Okay, so db_views is not here since we don't profile views, right? Just confirming my understanding.

asikowitz (Collaborator, Author) replied:

Yup! I initially had neither, but had to add this one back for profiling
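To make the trade-off concrete, a rough sketch with hypothetical names (not the connector's actual API): db_tables is filled while one database's schemas are processed, used for that database's profiling, then discarded before the next database; views are never stored because they are not profiled.

from typing import Dict, List


class PerDatabaseTableStoreSketch:
    def __init__(self) -> None:
        # schema name -> table names, for the database currently being processed
        self.db_tables: Dict[str, List[str]] = {}

    def process_database(self, db_name: str, schemas: Dict[str, List[str]]) -> None:
        self.db_tables.clear()  # drop the previous database's tables
        for schema_name, tables in schemas.items():
            self.db_tables[schema_name] = tables
        self.profile_database(db_name)

    def profile_database(self, db_name: str) -> None:
        for schema_name, tables in self.db_tables.items():
            print(f"profiling {len(tables)} tables in {db_name}.{schema_name}")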

for snowflake_schema in snowflake_db.schemas:
yield from self._process_schema(snowflake_schema, db_name)

if self.config.profiling.enabled and self.db_tables:
A collaborator commented:

Okay, so earlier, profiling started after ingesting all basic workunits (schemaMetadata, subTypes, datasetProperties) for tables from all databases.

This change alters that behavior.

A collaborator commented:

Since profiling usually takes longer, is it possible to refactor in such a way that profiling happens after ingesting all tables from all databases?

A collaborator commented:

We need two things out of the technical schema phase (subtypes, schema, etc.):

  1. A list of all tables/views for lineage, usage, operational history, etc.
  2. A per-database table metadata listing (size, last updated, number of rows) for the profiler.

Probably we can create profiler requests in advance and start profiling later?
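As a hypothetical sketch of that suggestion (illustrative names, not the actual connector API): lightweight profile requests could be collected while each database is processed, and the expensive profiling could run only after all technical-schema workunits have been emitted.

from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ProfileRequestSketch:
    dataset: str
    row_count: int
    size_bytes: int


class TwoPhaseIngestionSketch:
    def __init__(self) -> None:
        self.pending_profiles: List[ProfileRequestSketch] = []

    def process_database(self, db_name: str, tables: List[Dict]) -> Iterable[str]:
        for table in tables:
            yield f"schemaMetadata workunit for {db_name}.{table['name']}"
            # Keep only the small per-table stats the profiler needs,
            # not the full table metadata.
            self.pending_profiles.append(
                ProfileRequestSketch(
                    dataset=f"{db_name}.{table['name']}",
                    row_count=table["rows"],
                    size_bytes=table["bytes"],
                )
            )

    def run_profiling(self) -> Iterable[str]:
        # Runs only after every database has emitted its technical-schema workunits.
        for request in self.pending_profiles:
            yield f"profile workunit for {request.dataset}"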

# Should not be more than the number of databases / schemas scanned.
# Maps (function name) -> (stat_name) -> (stat_value)
lru_cache_info: Dict[str, Dict[str, int]] = field(default_factory=dict)

A collaborator commented:

nice!
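For reference, a minimal sketch (an assumed helper, not the connector's actual reporting code) of how a field shaped like (function name) -> (stat name) -> (stat value) can be filled from functools' built-in counters:

from functools import lru_cache
from typing import Dict, List


@lru_cache(maxsize=1)
def get_tables_for_database(db_name: str) -> List[str]:
    return []  # placeholder for the real per-database query


def collect_lru_cache_info(*cached_funcs) -> Dict[str, Dict[str, int]]:
    # cache_info() returns a named tuple of hits, misses, maxsize, currsize.
    return {f.__name__: dict(f.cache_info()._asdict()) for f in cached_funcs}


get_tables_for_database("DB1")
get_tables_for_database("DB1")
print(collect_lru_cache_info(get_tables_for_database))
# {'get_tables_for_database': {'hits': 1, 'misses': 1, 'maxsize': 1, 'currsize': 1}}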

Labels: ingestion (PR or Issue related to the ingestion of metadata)
3 participants