
feat(ingest/snowflake): tables from snowflake shares as siblings #8531

Merged

Conversation

mayurinehate
Collaborator

@mayurinehate mayurinehate commented Jul 31, 2023

  1. Introduce new configurations inbound_shares_map and outbound_shares_map to declare databases included in and created from shares, along with the corresponding Snowflake platform instances.
  2. Emit upstreamLineage and siblings aspects for linked tables/views in such shared databases.
  3. Move allow/deny pattern filtering in the Snowflake source so that it is applied in only one place.
  4. Add documentation around Snowflake shares with a corresponding configuration example.
  5. Add unit tests for the Snowflake shares code.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

1. Add config maps to accept inbound and outbound share details from the user.
2. Emit mirrored database tables as siblings, with tables from the share owner (producer) account as the primary sibling.
3. Push down allow/deny patterns into snowflake_schema.
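Point 2 could look roughly like the sketch below. Both the `Siblings` dataclass (a stand-in for DataHub's siblings aspect) and the urn builder are illustrative simplifications, not the actual DataHub API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Siblings:
    # Stand-in for DataHub's siblings aspect: whether this side is the
    # primary sibling, plus the urns of the other siblings.
    primary: bool
    siblings: List[str]

def make_dataset_urn(platform_instance: str, db: str, schema: str, table: str) -> str:
    # Simplified urn builder, for illustration only.
    return (
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,"
        f"{platform_instance}.{db}.{schema}.{table},PROD)"
    )

# Producer account owns db1; a consumer account sees it mirrored as db1_from_x.
producer_urn = make_dataset_urn("instance1", "db1", "public", "orders")
consumer_urn = make_dataset_urn("instance2", "db1_from_x", "public", "orders")

# The producer table is the primary sibling; the mirrored table points back at it.
producer_aspect = Siblings(primary=True, siblings=[consumer_urn])
consumer_aspect = Siblings(primary=False, siblings=[producer_urn])
```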
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 31, 2023
# 1. Attempt to list shares using `show shares` to identify the name of the share associated with this database (cache the query result).
# 2. If a corresponding share is listed, run `show grants to share <share_name>` to identify the exact tables and views included in the share.
# 3. Emit siblings only for the objects listed above.
# This works only if the configured role has accountadmin access OR owns the share.
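For illustration only, the three steps above might be sketched as follows. `ShareInspector`, `run_query`, and the assumed row shapes are all hypothetical, and as noted this approach needs accountadmin access or share ownership, so it is not what the PR implements:

```python
from functools import lru_cache
from typing import Callable, List, Sequence, Tuple

class ShareInspector:
    """Sketch of the three steps above; not what the PR ships."""

    def __init__(self, run_query: Callable[[str], Sequence[tuple]]):
        # run_query is a hypothetical callable that executes SQL and
        # returns rows as tuples (e.g. a thin wrapper over a cursor).
        self.run_query = run_query

    @lru_cache(maxsize=1)
    def _list_shares(self) -> Tuple[Tuple[str, str], ...]:
        # Step 1: run `SHOW SHARES` once and cache the result;
        # rows assumed to be (share_name, database_name) pairs.
        return tuple(self.run_query("SHOW SHARES"))

    def objects_in_share_for(self, database: str) -> List[str]:
        # Step 2: find the share backing this database, then list the
        # exact tables/views granted to it.
        for share_name, db_name in self._list_shares():
            if db_name == database:
                rows = self.run_query(f"SHOW GRANTS TO SHARE {share_name}")
                # Step 3: siblings would be emitted only for these objects.
                return [obj for (obj,) in rows]
        return []
```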
Collaborator Author

It is not advisable to use a role with "accountadmin" access, hence this is not done. Also, this PR takes care to hide ghost nodes in the siblings relation, so this is not required.

Collaborator

@asikowitz asikowitz left a comment

Overall LGTM, thanks for cranking this out. I have some minor naming/refactoring suggestions, but my main question is around how we should format the config. I'm not very familiar with shares, but the format seems a little unintuitive/cumbersome to me. Changing the format will require some code changes, though.

Also, I know there are some quirks with siblings, and I don't know how lineage will show up between siblings (which it seems like we're generating?), but if you've run this by Gabe or Harshal then that's probably sufficient.

@@ -42,6 +43,12 @@ class TagOption(str, Enum):
skip = "skip"


@dataclass(frozen=True)
class SnowflakeDatabaseDataHubId:
Collaborator

This name is pretty confusing to me... not sure on alternatives though

Collaborator Author

How about DatabaseId?

- platform_instance: instance2 # this is a list, as db1 can be shared with multiple snowflake accounts using X
database_name: db1_from_X
```
- In snowflake recipe of `account2` :
Collaborator

If you have a lot of snowflake recipes, I could see how this could get tiresome to set up for every ingestion pipeline. Thoughts on having a config that could be the same for every recipe, e.g.

shares:
    X:
        database: db1
        platform_instance: instance1
        outbounds:
            - database: db1_from_X
              platform_instance: instance2

Collaborator Author

This has crossed my mind too, and I'd really like "having a config that could be the same for every recipe". The only downside is that we require additional unused information, e.g. the share name "X", and it is possible that some of the shares config will not be relevant for some account recipes, making validation harder and errors harder to find. I feel the ease of using the same shares config outweighs the downsides, so let me think more on this and update.

Collaborator Author

I've updated the config to use a similar structure to your example, except using the term consumers instead of outbounds. "Outbound" is relative and can be a confusing term, whereas "consumer" has a specific meaning for Snowflake shares and hence is unambiguous.

Collaborator

If we don't want to have them specify the share name, we can also do:

  shares:
    - platform_instance: instance1
      database_name: db1
      consumers:
        - platform_instance: instance2 # this is a list, as db1 can be shared with multiple snowflake accounts using X
          database_name: db1_from_X
    - platform_instance: ...

but maybe that's more confusing

Collaborator Author

Yeah, I like the one with the share name better: more precise and readable. Also, shares do have unique names across accounts. This change primarily makes the shares configuration absolute and exhaustive for all accounts, so the configuration need not be thought of with respect to a particular account/recipe.
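The settled-on shape, keyed by share name, might be modeled roughly like this. Plain dataclasses are used as a stand-in for the actual pydantic ConfigModel, and all values are illustrative:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass(frozen=True)
class DatabaseId:
    # A database qualified by the DataHub platform instance it lives in.
    database: str
    platform_instance: str

@dataclass
class SnowflakeShareConfig:
    # Source side of the share (the producer account's database).
    database: str
    platform_instance: str
    # One entry per consumer account the database is shared with.
    consumers: List[DatabaseId] = field(default_factory=list)

# Keyed by share name, which is unique across accounts.
shares: Dict[str, SnowflakeShareConfig] = {
    "X": SnowflakeShareConfig(
        database="db1",
        platform_instance="instance1",
        consumers=[DatabaseId("db1_from_X", "instance2")],
    )
}
```

Because the map is keyed by share name and lists every consumer, the same block can be pasted into every account's recipe unchanged.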

Collaborator

@asikowitz asikowitz left a comment

Overall looks good to me. I have a refactor proposal to hopefully make snowflake_shares.py easier to understand; let me know what you think.

metadata-ingestion/docs/sources/snowflake/snowflake_pre.md (outdated)
databases_included_in_share: List[DatabaseId] = []
databases_created_from_share: List[DatabaseId] = []

for _, share_details in shares.items():
Collaborator

Suggested change
for _, share_details in shares.items():
for share_details in shares.values():

@@ -197,3 +226,41 @@ def get_sql_alchemy_url(
@property
def parse_view_ddl(self) -> bool:
return self.include_view_column_lineage

@validator("shares")
def validate_shares(
Collaborator

Generally we raise ValueError in validators rather than using assertions. Do you want to change that convention? For now at least, can you change this to stay consistent?

Collaborator Author

I believe we allow ValueError, AssertionError, TypeError as a convention, as also mentioned here - https://datahubproject.io/docs/metadata-ingestion/developing/#coding

Sometimes asserts are more readable/briefer, so I'd prefer them. In this case, I'm okay to change.

Collaborator

Oh nice. I like the assert syntax more; I think we're just hesitant because you can disable assertions with a certain interpreter flag. I don't feel strongly here, up to you.
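The flag in question is Python's `-O` switch, which strips `assert` statements entirely. A tiny illustration of why a ValueError keeps validation enforced (these helper names are hypothetical, not the PR's code):

```python
from typing import Dict

def validate_shares_assert(shares: Dict[str, object]) -> Dict[str, object]:
    # Silently skipped when Python runs with -O (assertions disabled).
    assert shares, "shares must not be empty"
    return shares

def validate_shares_raise(shares: Dict[str, object]) -> Dict[str, object]:
    # Always enforced, regardless of interpreter flags.
    if not shares:
        raise ValueError("shares must not be empty")
    return shares
```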

self, databases: List[SnowflakeDatabase]
) -> Iterable[MetadataWorkUnit]:
shared_databases = self._get_shared_databases(
self.config.shares or {}, self.config.platform_instance
Collaborator

self.config.shares can't be null here, right (otherwise the assert in _get_shared_databases could fail)? Perhaps instead of passing self.config into SnowflakeSharesHandler, we can pass the non-optional self.config.shares and self.config.platform_instance so we don't have to put any assertions in this class.

Collaborator Author

self.config is also used in other places in SnowflakeCommonMixin, primarily in deciding whether to lowercase the urn, hence keeping self.config and refactoring a bit to avoid asserts.

Comment on lines 34 to 37
created_from_share: bool

# This will have exactly one entry if created_from_share = True
shares: List[str]
Collaborator

The name created_from_share is confusing to me, because I feel like all of these are created from "shares" lol. Could we just call it primary or inbound or something similar? Although inbound also doesn't really make sense to me... I think of it more as is_share_source.

Collaborator Author

Well, technically some databases are created from a share while others are used to create a share, i.e. are included in a share. I am okay to use "primary/is_share_source" = not (created_from_share/secondary).

class SharedDatabase:
"""
Represents shared database from current platform instance
This is either created from an inbound share or included in an outbound share.
Collaborator

Can you mention that this relies on the invariant that a Snowflake database can't both be in a share and be the consumer of a share?

Collaborator Author

sure

Comment on lines 69 to 70
else:
shared_databases[share_details.database].shares.append(share_name)
Collaborator

Can the same platform instance and database really appear as inbound in shares multiple times?

Collaborator Author

A corner case, but yes.

self.logger = logger
self.dataset_urn_builder = dataset_urn_builder

def _get_shared_databases(
Collaborator

Overall, I thought the logic in this method, and in the users of the shared_databases dict it returns, was confusing: we basically have 2 different cases for inbound and outbound share info. I think this could be clearer if we separated the two.

I think this logic could be also simplified if we did some preprocessing first. What do you think about, in the config file:

class SnowflakeShareConfig(ConfigModel):
    # Add to this class
    @property
    def inbound(self) -> DatabaseId:
        return DatabaseId(self.database, self.platform_instance)

# For below, add to SnowflakeV2Config
@lru_cache(maxsize=1)  # would love to use @cached_property
def inbound_to_consumers(self) -> Dict[DatabaseId, Set[DatabaseId]]:
    d = defaultdict(set)
    for share in self.shares:
        d[share.inbound].update(share.consumers)
    return d

@lru_cache(maxsize=1)
def outbound_to_inbound(self) -> Dict[DatabaseId, DatabaseId]:
    d = {}
    for share in self.shares:
        for outbound in share.consumers:
            d[outbound] = share.inbound
    return d
Could def have better naming, but once you have these, then I think you can get rid of _get_shared_databases and do something like:

key = DatabaseId(db, self.platform_instance)
is_inbound = key in self.inbound_to_consumers()
is_outbound = key in self.outbound_to_inbound()
if not is_inbound and not is_outbound:
    continue
sibling_databases = (
    self.inbound_to_consumers()[key] if is_inbound else [self.outbound_to_inbound()[key]]
)

report_missing_databases will be a bit more complicated, but I don't think that's a dealbreaker.
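To sanity-check the proposal above, here is a self-contained, runnable version using free functions and a plain `Share` dataclass as a stand-in for SnowflakeShareConfig (all names are illustrative):

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass(frozen=True)
class DatabaseId:
    database: str
    platform_instance: str

@dataclass
class Share:
    # Stand-in for SnowflakeShareConfig: one source database, many consumers.
    database: str
    platform_instance: str
    consumers: List[DatabaseId]

    @property
    def source(self) -> DatabaseId:
        return DatabaseId(self.database, self.platform_instance)

def inbound_to_consumers(shares: List[Share]) -> Dict[DatabaseId, Set[DatabaseId]]:
    # Share source -> all databases it is shared out to.
    d: Dict[DatabaseId, Set[DatabaseId]] = defaultdict(set)
    for share in shares:
        d[share.source].update(share.consumers)
    return d

def outbound_to_inbound(shares: List[Share]) -> Dict[DatabaseId, DatabaseId]:
    # Mirrored (consumer-side) database -> the source it was created from.
    d: Dict[DatabaseId, DatabaseId] = {}
    for share in shares:
        for outbound in share.consumers:
            d[outbound] = share.source
    return d
```

With these two maps, a database is inbound iff its DatabaseId is a key of the first map and outbound iff it is a key of the second, which replaces the two-case logic inside _get_shared_databases with two lookups.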

Collaborator Author

I see, let me check.

Collaborator

@asikowitz asikowitz left a comment

LGTM!

) -> None:
db_names = [db.name for db in databases]
missing_dbs = [db for db in shared_databases.keys() if db not in db_names]
missing_dbs = [db for db in inbounds + outbounds if db not in db_names]
Collaborator

Could alternatively not cast to list and do inbounds | outbounds
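The suggestion above in a tiny runnable form: with sets, `|` is union, so the missing-database check becomes a set difference (values are illustrative):

```python
# Databases actually present in this Snowflake account.
db_names = {"db1", "db2"}
# Share sources and share consumers configured for this platform instance.
inbounds = {"db1"}
outbounds = {"db3"}

# Union the configured databases, then keep those not found in the account.
missing_dbs = (inbounds | outbounds) - db_names
```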

return

logger.debug("Checking databases for inbound or outbound shares.")
for db in databases:
if db.name not in shared_databases:
db.name = db.name
Collaborator

Looks like a no-op

logger.debug(f"database {db.name} is not shared.")
continue

sibling_dbs = self.get_sibling_databases(shared_databases[db.name])
sibling_dbs = (
Collaborator

If this is typed as Collection or Iterable then you don't have to cast to a list. Doesn't matter, though.

@asikowitz asikowitz added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Aug 23, 2023
@asikowitz asikowitz merged commit e285da3 into datahub-project:master Aug 24, 2023
50 of 51 checks passed