
feat(ingest): Create zero usage aspects #8205

Merged
merged 16 commits into datahub-project:master from zero-usage-aspects
Jun 22, 2023

Conversation

asikowitz
Collaborator

Contains a lot more refactoring than I would like. Tried to standardize each usage source:

  • Add a dataset_urn_builder method to unify dataset URN creation logic between the source and the usage extractor (see the sketch below). I don't love this, especially with tests, but I think it's preferable to the old approach, which just kept the logic matched between the two.
  • Standardize method names: entrypoint get_usage_workunits and helper _get_workunits_internal
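
A minimal sketch of the dataset_urn_builder idea, with illustrative names rather than the exact code in this PR (SomeUsageExtractor is a hypothetical stand-in):

```python
from typing import Callable, Optional

from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance


def build_dataset_urn_builder(
    platform: str, platform_instance: Optional[str], env: str
) -> Callable[[str], str]:
    # The source constructs one builder and passes it to the usage extractor,
    # so both emit identical dataset URNs instead of duplicating the logic.
    def builder(dataset_name: str) -> str:
        return make_dataset_urn_with_platform_instance(
            platform=platform,
            name=dataset_name,
            platform_instance=platform_instance,
            env=env,
        )

    return builder


# e.g. urn_builder = build_dataset_urn_builder("snowflake", None, "PROD")
#      usage_extractor = SomeUsageExtractor(config, report, dataset_urn_builder=urn_builder)
```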

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions bot added the ingestion (PR or Issue related to the ingestion of metadata) label Jun 9, 2023

resource: str = f"{event.database}.{event.schema_}.{event.table}".lower()
Collaborator Author

lower() was previously called in _make_usage_stat

@@ -23,7 +23,6 @@ class UnityCatalogReport(StaleEntityRemovalSourceReport):
num_queries_parsed_by_spark_plan: int = 0

num_operational_stats_workunits_emitted: int = 0
num_usage_workunits_emitted: int = 0
Collaborator Author

Redundant with workunit reporter

@@ -294,17 +283,6 @@
},
"nativeDataType": "VARCHAR(255)",
"recursive": false,
"glossaryTerms": {
Collaborator Author

Not sure why these are coming up on my local machine, but not failing on CI... @mayurinehate any clue here? This started failing locally since ac06cf3

Collaborator

Can you try creating a fresh venv (or use acryl-datahub-classify==0.0.8)? If that doesn't work, this needs to be fixed in code, maybe by increasing the number of sample values generated in the mock_sample_values.return_value function.
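
If the version pin doesn't resolve it, the mock tweak could look roughly like this; the mock object and sample values below are hypothetical stand-ins for the real test helper:

```python
from unittest import mock

# Hypothetical stand-in for the test's sample-value mock: widening
# return_value gives the classifier enough rows to predict terms
# consistently across acryl-datahub-classify versions.
mock_sample_values = mock.MagicMock()
mock_sample_values.return_value = [f"user_{i}@example.com" for i in range(50)]

assert len(mock_sample_values()) == 50
```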

Collaborator Author

Nice catch. Didn't realize classify code was in a separate package

*,
dataset_urns: Set[str],
config: BaseTimeWindowConfig,
all_buckets: bool = False, # TODO: Enable when CREATE changeType is supported for timeseries aspects
Collaborator

Whenever we enable this, we should check the start_time and end_time defaults in the config more carefully. I'm especially concerned that we may miss some usage for a bucket if the UPSERT is disabled.

With the current defaults (without stateful ingestion), today's run emits all usage up to the ingestion time for today, and that gets overwritten when usage runs again tomorrow with the full window for yesterday.
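
A rough illustration of the bucket-alignment fix discussed here, using plain datetime arithmetic rather than the connector's actual config plumbing:

```python
from datetime import datetime, timedelta, timezone


def align_to_day_bucket(ts: datetime) -> datetime:
    # Truncate to the start of the DAY bucket so re-runs regenerate whole
    # buckets instead of emitting a partial bucket that a later run overwrites.
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


end_time = datetime(2023, 6, 22, 15, 30, tzinfo=timezone.utc)
start_time = align_to_day_bucket(end_time - timedelta(days=1))
print(start_time, end_time)  # 2023-06-21 00:00:00+00:00 2023-06-22 15:30:00+00:00
```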

Collaborator Author

Yeah, this seems like a pretty serious issue. I guess the best solution here is to set the start and end times correctly, to the start and end of the nearest bucket.

Collaborator

@mayurinehate left a comment

Looks good overall. Some minor suggestions.

@@ -368,12 +373,6 @@ def _get_operation_aspect_work_unit(
)
continue

dataset_urn = make_dataset_urn_with_platform_instance(
Collaborator

Thank you for cleaning this up.

):
self.report.num_usage_workunits_emitted += 1
yield wu
yield from auto_empty_dataset_usage_statistics(
Collaborator

Noticed that the call to auto_empty_dataset_usage_statistics is in get_usage_workunits for some sources but in _get_workunits_internal for others. Would be good to keep it consistent.

Collaborator Author

This is due to inconsistency in how the include_usage_stats param is interpreted. For snowflake, if include_usage_stats is false you can still ingest operational stats, while for bigquery, redshift, and unity, if include_usage_statistics is false we don't ingest usage at all. We should standardize here, but I'm not sure which behavior is best.
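
Sketch of the two interpretations side by side, with simplified stand-in names for the real sources' config fields and helpers:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class UsageConfig:
    include_usage_stats: bool = True
    include_operational_stats: bool = True


def snowflake_style(config: UsageConfig) -> Iterable[str]:
    # include_usage_stats only gates the statistics; operations still flow.
    if config.include_operational_stats:
        yield "operation workunit"
    if config.include_usage_stats:
        yield "usage statistics workunit"


def bigquery_redshift_unity_style(config: UsageConfig) -> Iterable[str]:
    # One flag turns off the whole usage extractor, operations included.
    if not config.include_usage_stats:
        return
    yield "operation workunit"
    yield "usage statistics workunit"
```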


assert list(workunits) == [
workunits = usage_extractor._get_workunits_internal(
    events, [TABLE_REFS[TABLE_1.name]]
)
Collaborator

Curious why this changed from TABLE_REFS.values() to [TABLE_REFS[TABLE_1.name]] ?

Collaborator Author

The refs are used to determine which entities need zero usage, and I wanted to keep this test simple (i.e. a smaller expected output).


urn = wu.get_urn()
if guess_entity_type(urn) == DatasetUrn.ENTITY_TYPE:
    dataset_urns.add(urn)
Collaborator

Looks like we are already passing a well-populated set of dataset_urns to this helper function from our sources. This line adds all the dataset URNs again - kind of redundant, but a safety net against anything the connector may have missed?

Collaborator Author

Ah, this was left in from back when I was not passing them in (before I realized this was not guaranteed, e.g. when include_technical_schema is False for snowflake). I think it's pretty harmless to keep, and the safety-net behavior may be nice.

asikowitz and others added 3 commits June 15, 2023 16:23
…gquery.py

Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>
…ift.py

Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>
Collaborator

@hsheth2 left a comment

Overall looks good

"""Returns list of timestamps for each DatasetUsageStatistics bucket.

Includes all buckets in the time window, including partially contained buckets.
The timestamps are in milliseconds since the epoch.
Collaborator

a more general thing: I much prefer using/passing around proper datetime objects instead of timestamps - they're just easier to use and operate on

Collaborator Author

Agreed, I'll change
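
A sketch of the datetime-first style being agreed on here: keep buckets as datetime objects internally and convert to epoch millis only where timestampMillis is populated (generic code, not this PR's helper):

```python
from datetime import datetime, timedelta, timezone
from typing import List


def day_buckets(start: datetime, end: datetime) -> List[datetime]:
    """Start of each DAY bucket overlapping [start, end), as datetimes."""
    bucket = start.replace(hour=0, minute=0, second=0, microsecond=0)
    buckets = []
    while bucket < end:
        buckets.append(bucket)
        bucket += timedelta(days=1)
    return buckets


def to_millis(dt: datetime) -> int:
    # Convert at the edge, right before filling timestampMillis on the aspect.
    return int(dt.timestamp() * 1000)


window = day_buckets(
    datetime(2023, 6, 20, 8, tzinfo=timezone.utc),
    datetime(2023, 6, 22, 12, tzinfo=timezone.utc),
)
print([to_millis(b) for b in window])  # three day buckets: Jun 20, 21, 22
```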

def majority_buckets(self) -> List[int]:
"""Returns list of timestamps for each DatasetUsageStatistics bucket.

Includes only buckets in the time window for which a majority of the bucket is ingested.
Collaborator

just to confirm: majority = half or more?
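
One possible reading of "majority" (at least half of the bucket's duration falls inside the ingestion window); a guess at the intent, not the PR's exact logic:

```python
from datetime import datetime, timedelta, timezone


def majority_covered(
    bucket_start: datetime,
    bucket_duration: timedelta,
    window_start: datetime,
    window_end: datetime,
) -> bool:
    # True if half or more of the bucket overlaps [window_start, window_end).
    bucket_end = bucket_start + bucket_duration
    overlap = min(bucket_end, window_end) - max(bucket_start, window_start)
    return overlap >= bucket_duration / 2


day = timedelta(days=1)
bucket = datetime(2023, 6, 21, tzinfo=timezone.utc)
# Window covers 14 of the bucket's 24 hours -> counts as a majority bucket.
print(majority_covered(bucket, day, bucket, bucket + timedelta(hours=14)))  # True
```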

if usage_aspect.timestampMillis in buckets:
    usage_statistics_urns[usage_aspect.timestampMillis].add(urn)
elif all_buckets:
    logger.warning(
Collaborator

I doubt this ever gets hit, but if it does, it'll be a log line per aspect, which will probably be super spammy.
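
If it ever did fire at volume, one common mitigation is to aggregate and log a single summary instead of one warning per aspect (illustrative only, not this PR's code):

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)
skipped_buckets: Counter = Counter()


def record_skipped(timestamp_millis: int) -> None:
    # Count per-bucket instead of warning once per usage aspect.
    skipped_buckets[timestamp_millis] += 1


def log_skipped_summary() -> None:
    if skipped_buckets:
        logger.warning(
            "Usage aspects outside expected buckets: %s", dict(skipped_buckets)
        )
```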

fieldCounts=[],
),
changeType=ChangeTypeClass.CREATE
if all_buckets
Collaborator

so all_buckets doesn't work, since we don't support create?

Collaborator Author

Yeah. I normally don't like leaving dead code in, but I would really like to get CREATE support in and think it's the better way to implement this feature, so hopefully I'm able to land that and switch this over soon.
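
For reference, the emit path would only need the changeType flipped once CREATE lands; a hedged sketch using MetadataChangeProposalWrapper and DatasetUsageStatisticsClass with illustrative values (assuming a recent acryl-datahub where entityType and aspectName are inferred from the aspect):

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    DatasetUsageStatisticsClass,
)

all_buckets = False  # stays False until CREATE is supported for timeseries aspects

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)",
    aspect=DatasetUsageStatisticsClass(
        timestampMillis=1687392000000,  # illustrative bucket start
        uniqueUserCount=0,
        totalSqlQueries=0,
        fieldCounts=[],
    ),
    # Dead branch today; flips to CREATE once the backend supports it for
    # timeseries aspects, so existing buckets are never overwritten.
    changeType=ChangeTypeClass.CREATE if all_buckets else ChangeTypeClass.UPSERT,
)
```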

@asikowitz merged commit aa5e02d into datahub-project:master Jun 22, 2023
44 checks passed
@asikowitz deleted the zero-usage-aspects branch June 22, 2023 21:07