
feat(ingest/sql-queries): Add sql queries source, SqlParsingBuilder, sqlglot_lineage performance optimizations #8494

Merged: 8 commits from sql-queries-source into datahub-project:master, Aug 24, 2023

Conversation

asikowitz (Author) commented on Jul 24, 2023:

Creates a source for generating lineage and usage (stats + operations) from a file containing a list of queries. Uses SqlParsingBuilder, which hopefully can be reused by other sources. Adds two sqlglot_lineage performance optimizations.
cc @mayurinehate for awareness
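
For illustration, a query file entry might look like the sketch below. This is an assumption about the format, not a documented contract (the file format is still flagged as a TODO later in this thread), so every field name here should be treated as hypothetical:

    import json

    # Hypothetical JSON-lines query file for the new sql-queries source.
    # Field names are illustrative assumptions, not a documented schema.
    entries = [
        {
            "query": "INSERT INTO analytics.orders_summary SELECT * FROM analytics.orders",
            "timestamp": 1690200000.0,  # unix seconds
            "user": "urn:li:corpuser:etl_service",
            "downstream_tables": ["analytics.orders_summary"],
            "upstream_tables": ["analytics.orders"],
        },
    ]

    with open("queries.jsonl", "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")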

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions bot added the "ingestion" label (PR or Issue related to the ingestion of metadata) on Jul 24, 2023.
    if i % 1000 == 0:
        logger.debug(f"Loaded {i} schema metadata")
    try:
        schema_metadata = self.get_aspect(urn, SchemaMetadataClass)
asikowitz (Author) commented:

We really need a bulk endpoint here. Loading 45k aspects took over an hour.

A collaborator replied:

I think you can use the GraphQL endpoints to bulk fetch schema metadata. We also do have bulk endpoints somewhere, but I'm not sure of the exact syntax.
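
Until a bulk endpoint is wired up, one possible stopgap (a sketch, not code from this PR) is to overlap the per-urn round trips with a thread pool, since the hour-long runtime is dominated by sequential network calls:

    import logging
    from concurrent.futures import ThreadPoolExecutor

    from datahub.metadata.schema_classes import SchemaMetadataClass

    logger = logging.getLogger(__name__)

    def fetch_schema_metadata(graph, urns, max_workers=20):
        """Fetch SchemaMetadata aspects concurrently; `graph` is a DataHubGraph."""

        def fetch(urn):
            # Same per-urn call as the loop above, just issued in parallel.
            return urn, graph.get_aspect(urn, SchemaMetadataClass)

        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            for i, result in enumerate(executor.map(fetch, urns)):
                if i % 1000 == 0:
                    logger.debug(f"Loaded {i} schema metadata")
                yield result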

    @config_class(SqlQueriesSourceConfig)
    @support_status(SupportStatus.TESTING)
    class SqlQueriesSource(Source):
        # TODO: Documentation
asikowitz (Author) commented:

This has to get done before I merge. I want to specify what this does and what format the query file is expected to be in.

    @@ -238,6 +253,7 @@ def __init__(
             self.platform = platform
             self.platform_instance = platform_instance
             self.env = env
    +        self.urns: Optional[Set[str]] = None
asikowitz (Author) commented:

Optimization 1: don't fetch schema metadata for urns that don't exist, which I expect could be a lot of them, e.g. when we filter out urns. This doesn't apply if we're filtering out temporary tables, as with BigQuery.

        schema_info = self._resolve_schema_info(urn)
        if schema_info:
            return urn, schema_info
    if not (self.urns and urn not in self.urns):
asikowitz (Author) commented:

I don't really like this logic. I was thinking about creating a "top" set that overrides __contains__ to always return True, and setting that as the default instead of None, but I could see how that could be confusing. Thoughts? This type of logic comes up in the SQL parsing builder too.

A collaborator replied:

See my comment above; we should be able to remove this.
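
For reference, the "top" set idea floated above could be as small as this sketch (my illustration, not code from the PR):

    class TopSet(frozenset):
        """A set that reports containing everything ("top" in the lattice sense)."""

        def __contains__(self, item: object) -> bool:
            return True

    # With self.urns defaulting to TopSet() instead of None, the double negative
    #     if not (self.urns and urn not in self.urns):
    # collapses to a plain membership test:
    #     if urn in self.urns: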

    @@ -755,6 +778,7 @@ def _sqlglot_lineage_inner(
         )


    +@functools.lru_cache(maxsize=1000)
asikowitz (Author) commented:

Optimization 2: I saw lots of duplicate queries, so I think this is an easy performance gain for what shouldn't be too much memory. We can lower the number if we think SqlParsingResults can be very large, or use a file-backed dict.

A collaborator replied:

Yup, this makes sense. Let's make the 1000 a constant, but otherwise it's fine.
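
The suggested change would look roughly like this sketch (the constant name is my placeholder, and the argument list is abbreviated from the hunk above):

    import functools

    SQL_PARSE_CACHE_SIZE = 1000  # placeholder name for the extracted constant

    @functools.lru_cache(maxsize=SQL_PARSE_CACHE_SIZE)
    def sqlglot_lineage(sql, schema_resolver, default_db=None, default_schema=None):
        # lru_cache keys on the argument tuple, so repeated identical queries
        # against the same resolver skip re-parsing entirely.
        return _sqlglot_lineage_inner(sql, schema_resolver, default_db, default_schema)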

    )

    schema_resolver = self._make_schema_resolver(platform, platform_instance, env)
    schema_resolver.set_include_urns(urns)
A collaborator commented:

Since you're doing all the schema resolution here, you don't need to pass a DataHubGraph instance into the SchemaResolver

Once you do that, we can remove this set_include_urns thing
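
Concretely, that suggestion might look like this sketch (constructor and method names are my assumptions about the resolver's API, and schema_metadata_by_urn stands in for whatever pre-fetched mapping is available):

    # Build the resolver without a graph client, then pre-populate it with the
    # schemas fetched up front, so it never needs a live fallback or an include-set.
    schema_resolver = SchemaResolver(
        platform=platform,
        platform_instance=platform_instance,
        env=env,
        graph=None,  # no DataHubGraph: all resolution is local
    )
    for urn, schema_metadata in schema_metadata_by_urn.items():
        schema_resolver.add_schema_metadata(urn, schema_metadata)  # assumed API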


    if schema_resolver.platform == "presto-on-hive":
        dialect = "hive"
    else:
        dialect = schema_resolver.platform
A collaborator commented:

Let's extract this into a helper method; I suspect it'll grow or need to be monkeypatched in the future.
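
A minimal version of the extracted helper (the function name is mine; the mapping is exactly the one in the snippet above):

    def _get_dialect(platform: str) -> str:
        # Centralized platform -> sqlglot dialect mapping, so future special
        # cases (or test monkeypatches) only touch one place.
        if platform == "presto-on-hive":
            return "hive"
        return platform

    dialect = _get_dialect(schema_resolver.platform)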



    platform: str = Field(
        description="The platform for which to generate data, e.g. snowflake"
    )
    dialect: str = Field(description="The SQL dialect of the queries, e.g. snowflake")
A collaborator commented:

Do we need both of these as required fields?

asikowitz (Author) replied:

Remove dialect

vercel bot commented on Aug 23, 2023:

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

    urns = set(
        self.get_urns_by_filter(
            entity_types=[DatasetUrn.ENTITY_TYPE],
            platform=platform,
A collaborator commented:

There's a bug here: it doesn't respect platform_instance.

asikowitz (Author) replied:

So it won't actually break anything, right? It will just add more items to the schema resolver than necessary. Do we have the ability to filter on platform instance?

The collaborator replied:

Yep, I'm adding that in #8709.
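
Once #8709 lands, the prefetch above could presumably be scoped directly; a sketch under the assumption that get_urns_by_filter grows a platform_instance parameter:

    # Assumes get_urns_by_filter gains a platform_instance parameter (per #8709).
    urns = set(
        self.get_urns_by_filter(
            entity_types=[DatasetUrn.ENTITY_TYPE],
            platform=platform,
            platform_instance=platform_instance,  # assumed new parameter
            env=env,
        )
    )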

asikowitz (Author) commented:

Merging, as I don't think the failures are related. Sorry if that's not the case.

@asikowitz merged commit 6659ff2 into datahub-project:master on Aug 24, 2023; 49 of 51 checks passed.
@asikowitz deleted the sql-queries-source branch on Aug 24, 2023 at 14:35.