feat(ingest): allow extracting snowflake tags #6500

frsann · 2022-11-21T11:47:33Z

Adds support for extracting object tags from Snowflake. The scope is limited to database, schema, table/view, and column tags. It also (optionally) creates the extracted tags. Only the tags that are applied to objects are created; it does not pull all existing tags in Snowflake. Caters to https://feature-requests.datahubproject.io/p/ingest-snowflake-object-tags

There are some peculiarities about Snowflake tags that should be considered.

1. Snowflake tags are key-value pairs, not simple strings (as in DataHub).

This PR handles this by converting the key-value tag into a simple string with the following format: <db_name>.<schema_name>.<tag_name>=<tag_value>, where db_name and schema_name are the DB and schema in which the tag is defined, respectively. While not the prettiest, this format should let users quite easily define a formatter to customize the look of the tag, if they choose to. Implementing this formatter is outside the scope of this PR.

2. Snowflake allows tagging objects directly, but also through propagation (or lineage as they call it).

This means that a tag applied to a database propagates down to its schemas, tables, and columns as well. If the same tag is applied directly to an object with a different value, the directly applied value takes precedence over the propagated one. Snowflake offers different methods for retrieving only the directly applied tags and for retrieving the propagated ones as well.

As the propagated tags are "effective" (for example, tag-based masking policies applied to tables are evaluated when querying columns), a DataHub user might want to extract them in addition to the directly applied tags. To accommodate both use cases, a configuration field extract_tags is introduced that allows selecting which type of tag extraction is wanted.

Note that extracting the propagated tags is done per object (except for columns) and therefore requires more queries. Extracting only the directly applied tags can be done in a single query per DB, and the result is cached.
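The precedence rule described in point 2 can be sketched as follows. This is only an illustration under stated assumptions, not the PR's implementation: the function name `effective_tags` and the dictionary representation are hypothetical, and the PR obtains these tag sets through Snowflake queries rather than merging them by hand.

```python
from typing import Dict


def effective_tags(
    propagated: Dict[str, str], direct: Dict[str, str]
) -> Dict[str, str]:
    """Merge propagated and directly applied tags for a single object.

    Keys are fully qualified tag names (<db>.<schema>.<tag>), values are tag
    values. A directly applied tag overrides a propagated tag with the same
    qualified name.
    """
    merged = dict(propagated)
    merged.update(direct)  # direct values win on key collisions
    return merged


# Hypothetical example: the database carries cost_center=finance, while the
# table overrides it with cost_center=sales.
database_tags = {"my_db.public.cost_center": "finance", "my_db.public.pii": "no"}
table_tags = {"my_db.public.cost_center": "sales"}
table_effective = effective_tags(database_tags, table_tags)
```

Extracting only directly applied tags corresponds to using `table_tags` alone; extracting propagated tags as well corresponds to the merged result.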

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

frsann · 2022-11-21T11:48:07Z

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py

+    value: str
+
+    def __str__(self):
+        return f"{self.database}.{self.schema}.{self.name}={self.value}".lower()


The format of the stringification can be discussed.

Right. We use key:value in some sources; however, let me think about what should be done here.

Hey @frsann, let's change this string conversion to use the key:value form. Also, is there any explicit need to convert this to lowercase? I would recommend keeping these in their original case.

Fixed. I removed the lower() call and replaced it with a call to the self.snowflake_identifier function, which lowercases based on the config.

Okay, that seems fair. We should keep the display name [tagProperties->name] unlowercased, in exactly the same case as in Snowflake, as we do for Snowflake tables, schemas, etc.

Also, I noticed that snowflake tags can have descriptions (for key part)

create tag cost_center comment = 'cost_center tag';

Would it make sense to add it to the tag description?

We could get the comment, but it would of course require an extra show tags or select * from snowflake.account_usage.tags query and complicate the tag workunit creation (very) slightly. It would also give us the allowed_value field, but that might not be very interesting in DataHub. Let me know if you want it added.

I have now fixed the tag name so that it is no longer lowercased; only the urn is optionally lowercased.
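A minimal sketch of the naming behavior settled on in this thread: the display name keeps Snowflake's original case, and only the identifier used in the urn is optionally lowercased. The names `TagSketch` and `lowercase_urns` are hypothetical, the `urn:li:tag:` prefix is DataHub's usual tag urn form, and the `=` separator follows the PR description. The thread asks for a key:value form whose final shape is not shown here, so the separator is left as a parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagSketch:
    database: str
    schema: str
    name: str
    value: str

    def display_name(self, separator: str = "=") -> str:
        # Shown to users: kept in exactly the same case as in Snowflake.
        return f"{self.database}.{self.schema}.{self.name}{separator}{self.value}"

    def urn(self, lowercase_urns: bool, separator: str = "=") -> str:
        # Only the identifier inside the urn is optionally lowercased.
        identifier = self.display_name(separator)
        if lowercase_urns:
            identifier = identifier.lower()
        return f"urn:li:tag:{identifier}"


tag = TagSketch("MY_DB", "PUBLIC", "COST_CENTER", "Sales")
```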

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py

github-actions · 2022-11-21T12:07:23Z

Unit Test Results (metadata ingestion)

      8 files       8 suites 52m 36s ⏱️
  765 tests   756 ✔️ 2 💤 2 ❌ 5 🔥
1 532 runs 1 520 ✔️ 5 💤 2 ❌ 5 🔥

For more details on these failures and errors, see this check.

Results for commit 8df82d6.

♻️ This comment has been updated with latest results.

github-actions · 2022-11-21T12:21:34Z

Unit Test Results (build & test)

621 tests ±0 617 ✔️ ±0 15m 39s ⏱️ ±0s
157 suites ±0     4 💤 ±0
157 files ±0     0 ❌ ±0

Results for commit 8df82d6. ± Comparison against base commit 6eb63c2.

♻️ This comment has been updated with latest results.

mayurinehate

Thank you for working on this!
I have completed the first pass of review. It would be great if you could address the comments. Also, can you please resolve the merge conflicts?

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_query.py

mayurinehate · 2022-11-28T12:55:46Z


metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py

frsann · 2022-11-28T19:50:09Z

@mayurinehate Thanks for the review. I fixed the issues, but will double check tomorrow that the queries still work with the changed quoting.

frsann · 2022-11-29T07:38:54Z

@mayurinehate Turns out the single quotes are required in the function calls, so I fixed that.

For example, this works:

SELECT tag_database as "TAG_DATABASE",
        tag_schema AS "TAG_SCHEMA",
        tag_name AS "TAG_NAME",
        tag_value AS "TAG_VALUE"
        FROM table(information_schema.tag_references('"my-speci@l_schema"', 'schema'));

Without the single quotes I get SQL compilation error: error line 5 at position 53 invalid identifier '"my-speci@l_schema"'
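The quoting rule above can be sketched as a small query builder. This is an illustration only: the helper names `quote_identifier` and `tag_references_query` are hypothetical and are not taken from the PR's snowflake_query.py. The point it shows is that the function argument is a string literal (single quotes) whose content is the double-quoted identifier.

```python
def quote_identifier(name: str) -> str:
    """Wrap a Snowflake identifier in double quotes, doubling embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def tag_references_query(object_name: str, domain: str) -> str:
    """Build a tag_references query for one object.

    The first argument of tag_references is a single-quoted string literal
    that contains the double-quoted identifier.
    """
    # Escape single quotes so the identifier is safe inside a string literal.
    literal = quote_identifier(object_name).replace("'", "''")
    return (
        'SELECT tag_database AS "TAG_DATABASE", '
        'tag_schema AS "TAG_SCHEMA", '
        'tag_name AS "TAG_NAME", '
        'tag_value AS "TAG_VALUE" '
        f"FROM table(information_schema.tag_references('{literal}', '{domain}'))"
    )


query = tag_references_query("my-speci@l_schema", "schema")
```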

jjoyce0510 · 2022-11-29T16:39:43Z

One thing I want to make sure to call out on this PR: we've found in the past that if we simply ingest references to tags (e.g. tag urns for a dataset) without actually creating the tags within DataHub (e.g. by producing the TagProperties or TagKey aspect), then when a user goes to make changes to the tag via the UI, the DataHub app will treat the tag as an entity that does not exist.

We should always consider minting a TagKey or TagProperties (less desirable since it can overwrite a UI-authored description) aspect when ingesting these foreign-authored tags.

frsann · 2022-11-29T19:14:22Z

Absolutely! The minting is now behind the same config flag (include_technical_schema, defaults to true) as the minting of e.g. tables and the db and schema containers, but I can remove that criterion if you think it works better.

frsann · 2022-12-13T07:01:39Z

@mayurinehate any chance to get this moving forward?

mayurinehate · 2022-12-13T11:45:23Z

@frsann I'll get to this early next week. Thank you for your patience.

hsheth2 · 2022-12-29T05:41:13Z

@frsann two high-level things before I get into the review

My understanding from reading the Snowflake docs is that tags in Snowflake are global within an account. If that's correct, I'm not confident that the <db_name>.<schema_name>.<tag_name>=<tag_value> format makes sense. For example, what do we put in the schema_name if a tag is applied to a Snowflake database? I'm thinking that a simple <optional_configurable_prefix>.<tag_name>=<tag_value> may make more sense, but I definitely want to understand this better.
It looks like there are a number of merge conflicts between this PR and master. Do you want me to review before or after you've resolved those?

frsann · 2022-12-29T06:32:26Z

The tags are defined in a <database>.<schema> "namespace". It's true that the tags are global in the sense that you can apply tag my_db.public.my_tag on a table your_db.staging.some_table, but you can also create a tag your_db.public.my_tag and apply that to the same table without conflicts.
This module seems to be a moving target 😄 I don't mind fixing the conflicts if the review and merge are prompt after that, but I have already made some conflict-fixing commits earlier and I'd prefer not to have to do it many more times.
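To illustrate the namespace point: two tags with the same bare name, defined in different databases, stay distinct when keyed by their qualified name but collide when keyed by the bare name. This is a hypothetical sketch; `TagRef` and the tag values "a" and "b" are made up for illustration.

```python
from typing import NamedTuple


class TagRef(NamedTuple):
    database: str
    schema: str
    name: str
    value: str

    @property
    def qualified_name(self) -> str:
        # The <database>.<schema> pair acts as the tag's namespace.
        return f"{self.database}.{self.schema}.{self.name}"


# Both tags are applied to the same table, your_db.staging.some_table.
tags_on_some_table = [
    TagRef("my_db", "public", "my_tag", "a"),
    TagRef("your_db", "public", "my_tag", "b"),
]

# Keyed by qualified name, both tags survive.
qualified = {t.qualified_name: t.value for t in tags_on_some_table}

# Keyed by bare name, the second tag silently overwrites the first.
bare = {t.name: t.value for t in tags_on_some_table}
```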

frsann · 2023-01-02T12:00:29Z

@mayurinehate @hsheth2 the merge conflicts are now fixed. Please have a look.

hsheth2

A few minor comments around code cleanup, but overall should be good to merge soon

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_query.py

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_tag.py

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py

hsheth2 · 2023-01-04T19:21:10Z

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py

                tags.add_table_tag(
                    object_name, object_schema, object_database, snowflake_tag
                )
-            elif domain == "COLUMN":
+            elif domain == SnowflakeObjectDomain.COLUMN:


nice thanks for refactoring this :)

frsann · 2023-01-04T19:24:43Z

@hsheth2 @mayurinehate all comments have been fixed, so this PR should now be mergeable

(upstream) uses --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Aseem Bansal <asmbansal2@gmail.com> Co-authored-by: John Joyce <john@acryl.io> Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com> Co-authored-by: orlandine <46893386+orlandine@users.noreply.github.com> Co-authored-by: Tamas Nemeth <treff7es@gmail.com> Co-authored-by: Harshal Sheth <hsheth2@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Felix Lüdin <13187726+Masterchen09@users.noreply.github.com> Co-authored-by: Aditya Radhakrishnan <aditya.radhakrish@gmail.com> Co-authored-by: Maggie Hays <maggiem.hays@gmail.com> Co-authored-by: Pedro Silva <pedro@acryl.io> Co-authored-by: jx2lee <63435794+jx2lee@users.noreply.github.com> Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com> Co-authored-by: Jan Hicken <janhicken@users.noreply.github.com> Co-authored-by: Dmitry Bryazgin <58312247+bda618@users.noreply.github.com> Co-authored-by: mohdsiddique <mohdsiddiquebagwan@gmail.com> Co-authored-by: MohdSiddique Bagwan <mohdsiddique.bagwan@gslab.com> Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com> Co-authored-by: cccs-seb <64980897+cccs-seb@users.noreply.github.com> Co-authored-by: fully <ssilb4@gmail.com> Co-authored-by: Patrick Franco Braz <patrickfbraz@poli.ufrj.br> Co-authored-by: jakobhanna <55106217+jakobhanna@users.noreply.github.com> Co-authored-by: Shirshanka Das <shirshanka@apache.org> Co-authored-by: danielli-ziprecruiter <91145628+danielli-ziprecruiter@users.noreply.github.com> Co-authored-by: Monica Senapati <89276149+senapatim@users.noreply.github.com> Co-authored-by: Navin Sharma <103643430+NavinSharma13@users.noreply.github.com> Co-authored-by: Peter Szalai <szalaipeti.vagyok@gmail.com> Co-authored-by: raysaka <ray.sakanoue@gmail.com> Co-authored-by: Chris Collins <chriscollins3456@gmail.com> Co-authored-by: djordje-mijatovic 
<97875950+djordje-mijatovic@users.noreply.github.com> Co-authored-by: Dago Romer <dagoromer85@gmail.com> Co-authored-by: Mirko R <118171912+mirac-cisco@users.noreply.github.com> Co-authored-by: Teppo Naakka <teppo.naakka@gmail.com> Co-authored-by: wangsaisai <wangsaisai@users.noreply.github.com> Co-authored-by: leifker <dleifker@gmail.com> Co-authored-by: cccs-eric <eric.ladouceur@cyber.gc.ca> Co-authored-by: Meenakshi Kamalaseshan Radha <62914384+mkamalas@users.noreply.github.com> Co-authored-by: Kamalaseshan Radha <mkamalas@LAMU02DN212MD6R.uhc.com> Co-authored-by: Marvin Rösch <marvinroesch99@gmail.com> Co-authored-by: Stijn De Haes <stijn.de.haes@gmail.com> Co-authored-by: cc <50856789+ccpypy@users.noreply.github.com> Co-authored-by: 陈城 <cheng.chen@tenclass.com> Co-authored-by: Fredrik Sannholm <fredrik.sannholm@wolt.com> Co-authored-by: Gabe Lyons <itsgabelyons@gmail.com> Co-authored-by: Lucas Roesler <roesler.lucas@gmail.com> Co-authored-by: VISHAL KUMAR <110387730+vishalkSimplify@users.noreply.github.com> Co-authored-by: mraman2512 <MY_mramaan2512@gmail.com> Co-authored-by: Aman.Kumar <64635307+mraman2512@users.noreply.github.com> Co-authored-by: Paul Logan <101486603+laulpogan@users.noreply.github.com> Co-authored-by: seoju <wngus606@gmail.com> Co-authored-by: 서주현[G플레이스데이터개발] <juhyun.seo@navercorp.com> Co-authored-by: Matt Matravers <mattmatravers@hotmail.com> Co-authored-by: 서재권(Data Platform) <90180644+jaegwonseo@users.noreply.github.com> Co-authored-by: Thosan Girisona <thosan.girisona@gmail.com> Co-authored-by: Jeff Merrick <jeff@wireform.io> Co-authored-by: Yang Jiandan <succeedin2010@163.com> Co-authored-by: yangjd33 <yangjd33@chinaunicom.cn> Co-authored-by: Rajasekhar-Vuppala <122261609+Rajasekhar-Vuppala@users.noreply.github.com> Co-authored-by: Vishal <vishal.k@simplify3x.com>

github-actions bot added the "ingestion" label (PR or Issue related to the ingestion of metadata) Nov 21, 2022

frsann commented Nov 21, 2022

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py

jjoyce0510 requested a review from mayurinehate November 23, 2022 22:04

mayurinehate suggested changes Nov 28, 2022

frsann force-pushed the snowflake-tags branch from 60177d4 to 3a5aa43 Compare November 28, 2022 19:44

anshbansal added the "community-contribution" label (PR or Issue raised by member(s) of DataHub Community) Dec 6, 2022

frsann requested a review from mayurinehate December 13, 2022 07:01

mayurinehate requested a review from hsheth2 December 16, 2022 12:50

frsann added 11 commits January 2, 2023 07:24

feat(ingest): allow extracting snowflake tags

796a242

Fix capabilities check

2e339a5

Linting

1af8a7a

Put tag pattern in the correct place

35b997f

Add error log message

96388a3

Clarify schema and db tag logic

1346510

Work with quoted identifiers

7bda0a5

Clean up description generation

2a1bc95

Clarify docs

a9edf68

Fix golden

da47d58

Correct quotes in function calls

e270f4c

Split tag logic to it's own module

b952d74

hsheth2 requested changes Jan 3, 2023

frsann added 4 commits January 4, 2023 06:39

Fix review comments

7789472

Fix more review comments

16c25d8

Fix tests

9bbc546

Restructure cache

20e3735

frsann requested review from hsheth2 and mayurinehate and removed request for mayurinehate and hsheth2 January 4, 2023 06:42

Correct term in function name

2900e67