test(ingest): Aspect level golden file comparison #8310

asikowitz · 2023-06-26T23:11:02Z

The basic idea: since deepdiff's iterable_compare_func appears to be broken (and doesn't work when ignore_order=True), we instead diff the list of aspects associated with each (urn, aspect name) pair. These diffs are then parsed to find which aspects were added, removed, or changed, so that we can (i) print the diff in a more readable way and (ii) attempt to produce a cleaner diff when overwriting the golden file.

This only works on MCPs: handling MCEs was too complicated, and we want to move away from them anyway.
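
The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes each serialized MCP is a dict with entityUrn, aspectName and aspect keys, and the helper names group_aspects and diff_aspects are made up for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Mcp = Dict[str, Any]         # one serialized MCP (assumed shape)
AspectKey = Tuple[str, str]  # (entityUrn, aspectName)


def group_aspects(mcps: List[Mcp]) -> Dict[AspectKey, List[Any]]:
    """Collect every aspect payload under its (urn, aspect name) key."""
    grouped: Dict[AspectKey, List[Any]] = defaultdict(list)
    for mcp in mcps:
        grouped[(mcp["entityUrn"], mcp["aspectName"])].append(mcp.get("aspect"))
    return grouped


def diff_aspects(golden: List[Mcp], output: List[Mcp]) -> Dict[str, List[AspectKey]]:
    """Classify each (urn, aspect name) key as added, removed, or changed."""
    old, new = group_aspects(golden), group_aspects(output)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Hypothetical sample data, only to exercise the sketch.
golden = [
    {"entityUrn": "urn:li:dataset:a", "aspectName": "status", "aspect": {"removed": True}},
    {"entityUrn": "urn:li:dataset:b", "aspectName": "status", "aspect": {"removed": False}},
]
output = [
    {"entityUrn": "urn:li:dataset:a", "aspectName": "status", "aspect": {"removed": False}},
    {"entityUrn": "urn:li:dataset:c", "aspectName": "status", "aspect": {"removed": False}},
]
result = diff_aspects(golden, output)
```

Because the comparison is keyed on (urn, aspect name) rather than on list position, reordering MCPs in the output does not register as a change, which is the property ignore_order was supposed to provide.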

Here's an example pytest error message now:

Urn removed, urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.lineage_from_tmp_table_2,PROD)

Urn changed, urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.lineage_from_base,PROD):
<upstreamLineage> added

Urn changed, urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.lineage_from_tmp_table,PROD):
<status> added
<upstreamLineage> changed:
	Value of aspect[0]['path'] changed from "/upstreams/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Abigquery%2Cacryl-staging.smoke_test_db.base_table8%2CPROD%29" to "/upstreams/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Abigquery%2Cacryl-staging.smoke_test_db.base_table%2CPROD%29".
	Value of aspect[0]['value']['dataset'] changed from "urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.base_table8,PROD)" to "urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.base_table,PROD)".

Urn changed, urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging-2.smoke_test_db_4.table_with_nested_fields,PROD):
<datasetProperties> changed:
	Value of aspect['description'] changed from ""Example name and addresses table yeah"" to ""Example name and addresses table"".

Urn changed, urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.partition_test,PROD):
<status> changed:
	Value of aspect['removed'] changed from True to False.
<schemaMetadata> changed:
	Item aspect['fields'][2] removed from iterable.
	Value of aspect['fields'][0]['fieldPath'] changed from "date_utce" to "date_utc".

Urn changed, urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.table_from_another_project,PROD):
<schemaMetadata> changed:
	Item aspect['fields'][0] added to iterable.

And then in stdout, I print a more verbose version with the relevant aspects, e.g.:

Urn changed, urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.lineage_from_tmp_table,PROD):
<status> added
	removed: false
<upstreamLineage> changed:
	Value of aspect[0]['path'] changed from "/upstreams/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Abigquery%2Cacryl-staging.smoke_test_db.base_table8%2CPROD%29" to "/upstreams/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Abigquery%2Cacryl-staging.smoke_test_db.base_table%2CPROD%29".
	Value of aspect[0]['value']['dataset'] changed from "urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.base_table8,PROD)" to "urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.base_table,PROD)".
Old aspect:
	- op: add
	  path: /upstreams/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Abigquery%2Cacryl-staging.smoke_test_db.base_table8%2CPROD%29
	  value:
	    auditStamp:
	      time: 1687075985648
	      actor: urn:li:corpuser:datahub
	    dataset: urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.base_table8,PROD)
	    type: TRANSFORMED
New aspect:
	- op: add
	  path: /upstreams/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Abigquery%2Cacryl-staging.smoke_test_db.base_table%2CPROD%29
	  value:
	    auditStamp:
	      time: 1687075985648
	      actor: urn:li:corpuser:datahub
	    dataset: urn:li:dataset:(urn:li:dataPlatform:bigquery,acryl-staging.smoke_test_db.base_table,PROD)
	    type: TRANSFORMED

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

asikowitz · 2023-06-26T23:40:53Z

metadata-ingestion/src/datahub/testing/compare_goldens.py

+            pytest.fail(diff.pretty(), pytrace=False)
+        else:
+            pytest.fail(pprint.pformat(diff), pytrace=False)


Doing this rather than raising an exception removes the stack trace, which I think makes the error messages nicer to read.

should add these as comments in the code directly

asikowitz · 2023-06-26T23:41:51Z

metadata-ingestion/src/datahub/testing/compare_goldens.py

+
+    if diff:
+        if isinstance(diff, GoldenDiff):
+            print(diff.pretty(verbose=True))


This keeps the actual error message shorter but makes stdout pretty long, since it prints the full aspects associated with any differences. I still think this is the better default, so that it's easy to debug failing connector tests.

hsheth2

some quick comments from an initial review pass

hsheth2 · 2023-06-27T19:05:04Z

metadata-ingestion/src/datahub/ingestion/sink/file.py

+        serialized = obj.to_obj()
+        if serialized.get("aspect") and serialized["aspect"].get("contentType") in [
+            "application/json",
+            "application/json-patch+json",


can we extract these into constants?

we already have _ASPECT_CONTENT_TYPE somewhere

Yeah, I was using that previously but I didn't really get what it meant. I can extract them into some basic constants like JSON_CONTENT_TYPE and JSON_PATCH_CONTENT_TYPE though.
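
A rough sketch of what the extracted constants could look like in use. The two content-type strings come from the diff above; everything else is an assumption for illustration: the helper name unpack_aspect, the idea that the packed payload is a JSON string under a value key, and the json key used for the unpacked form.

```python
import json
from typing import Any, Dict

JSON_CONTENT_TYPE = "application/json"
JSON_PATCH_CONTENT_TYPE = "application/json-patch+json"
_UNPACKABLE_CONTENT_TYPES = frozenset({JSON_CONTENT_TYPE, JSON_PATCH_CONTENT_TYPE})


def unpack_aspect(serialized: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with the aspect's JSON string parsed (hypothetical layout)."""
    aspect = serialized.get("aspect")
    if aspect and aspect.get("contentType") in _UNPACKABLE_CONTENT_TYPES:
        # Assumption: the packed payload is a JSON string stored under "value".
        return {**serialized, "aspect": {"json": json.loads(aspect["value"])}}
    return serialized


packed = {
    "entityUrn": "urn:li:dataset:a",
    "aspect": {"contentType": JSON_CONTENT_TYPE, "value": '{"removed": false}'},
}
unpacked = unpack_aspect(packed)
```

Objects without an aspect, or with some other content type, pass through unchanged, and the input dict is never mutated.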

hsheth2 · 2023-06-27T19:07:36Z

metadata-ingestion/src/datahub/testing/compare_goldens.py

+    except Exception as e:
+        logger.warning(f"Reverting to old diff method: {e}")
+        logger.debug("Error with new diff method", exc_info=True)
+        return diff_mces(output, golden, ignore_paths)


Why do we need this fallback? Is the new GoldenDiff stuff risky?

It doesn't work on MCEs, and in general I'm just trying to be safe here. Deepdiff works on pretty much any object, while GoldenDiff only works on a specifically formatted object (a list of serialized MCPs).
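
The fallback pattern being discussed, reduced to a self-contained sketch. The two differ functions here are hypothetical stand-ins, not the real GoldenDiff or diff_mces; only the try/except shape and the log messages mirror the diff above.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)
Records = List[Dict[str, Any]]
Differ = Callable[[Records, Records], List[str]]


def diff_with_fallback(output: Records, golden: Records, strict: Differ, generic: Differ) -> List[str]:
    """Try the format-specific differ first; on any error use the generic one."""
    try:
        return strict(output, golden)
    except Exception as e:
        logger.warning("Reverting to old diff method: %s", e)
        logger.debug("Error with new diff method", exc_info=True)
        return generic(output, golden)


def strict_diff(output: Records, golden: Records) -> List[str]:
    """Hypothetical strict differ: rejects anything that is not MCP-shaped."""
    for item in output + golden:
        if "aspectName" not in item:
            raise ValueError("not a serialized MCP")
    return [] if output == golden else ["mcp-level diff"]


def generic_diff(output: Records, golden: Records) -> List[str]:
    """Hypothetical generic differ: accepts any records."""
    return [] if output == golden else ["generic diff"]
```

The trade-off is that a bug in the strict differ is downgraded to a warning instead of failing loudly, so the warning message is the only signal that the older path ran.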

hsheth2 · 2023-06-27T19:08:46Z

metadata-ingestion/src/datahub/testing/golden_diff.py

+        Tuple[int, GoldenAspect, GoldenAspect], List[DiffLevel]
+    ] = field(init=False, default_factory=lambda: defaultdict(list))
+
+    def __post_init__(self):


make this a classmethod named create that computes everything first and then constructs the AspectDiff obj at the end

I had this originally and swapped to post init. Do you not like post init, or are there any style recs we're following here?

more a personal style thing

If we're just initializing one small field, using post_init is fine. In this case we're basically initializing the entire thing in post init, which feels like an abuse of the post init mechanism
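
For reference, the classmethod-factory style being suggested looks roughly like this. AspectChanges is a hypothetical, much smaller stand-in for the class under review; the point is only that the derived fields are computed in create and then passed to the ordinary dataclass constructor.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class AspectChanges:
    """Hypothetical stand-in: every field is derived from golden and output."""

    added: List[Any]
    removed: List[Any]

    @classmethod
    def create(cls, golden: List[Any], output: List[Any]) -> "AspectChanges":
        # All derivation happens here, so __init__ stays a plain field assignment.
        added = [a for a in output if a not in golden]
        removed = [a for a in golden if a not in output]
        return cls(added=added, removed=removed)


changes = AspectChanges.create(golden=[1, 2], output=[2, 3])
```

With this layout the fields need no init=False or default_factory workarounds, and tests can construct an instance directly from known field values.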

hsheth2 · 2023-06-27T19:12:05Z

metadata-ingestion/setup.py

@@ -685,6 +689,7 @@ def get_long_description():
            )
        ),
        "dev": list(dev_requirements),
+        "test": list(test_api_requirements),  # To import `datahub.testing`


maybe call this testing-utils?

hsheth2 · 2023-06-27T19:12:26Z

metadata-ingestion/setup.py

@@ -430,7 +434,7 @@ def get_long_description():
    # pydantic 1.8.2 is incompatible with mypy 0.910.
    # See https://github.com/samuelcolvin/pydantic/pull/3175#issuecomment-995382910.
    "pydantic>=1.9.0",
-    "pytest>=6.2.2",
+    *pytest_common,


this should be test_api_requirements right?

This is what I wasn't sure about. Technically, if we changed test_api_requirements to no longer require deepdiff, we'd still want deepdiff in the dev requirements because we use it elsewhere (not just for the test api). So I'm not sure whether I should nest the whole thing or declare the dependency separately.

So then make a separate deepdiff_dep var, which gets included both here and by test_api_requirements

That way we can remove it from test_api_requirements and still have it be in dev mode

(it's totally fine to have the same thing listed twice - we're using sets to dedup, and setup tools handles it too)
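
A small sketch of the suggested layout, using plain sets. The variable names deepdiff_dep and test_api_requirements come from the thread; the unpinned "deepdiff" string, the pytest_dep name, and the reduced dependency lists are placeholders, not the real contents of setup.py.

```python
# Shared single-dependency variables (version pins omitted on purpose).
deepdiff_dep = "deepdiff"
pytest_dep = "pytest>=6.2.2"

# Requirements needed to import the testing helpers.
test_api_requirements = {pytest_dep, deepdiff_dep}

# Dev requirements list deepdiff explicitly AND pull in the test API set.
dev_requirements = {
    "pydantic>=1.9.0",
    deepdiff_dep,
    *test_api_requirements,
}

extras_require = {
    "dev": sorted(dev_requirements),
    "testing-utils": sorted(test_api_requirements),
}
```

Since dev_requirements is a set, listing deepdiff_dep twice is harmless, and removing it from test_api_requirements later would still leave it in the dev extra.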

hsheth2 · 2023-06-29T21:07:25Z

metadata-ingestion/src/datahub/ingestion/sink/file.py

+        if serialized.get("aspect") and serialized["aspect"].get("contentType") in [
+            JSON_CONTENT_TYPE,
+            JSON_PATCH_CONTENT_TYPE,
+        ]:


can we also add a test to tests/unit/serde/test_serde.py

we already have some stuff for patches in there

we'll specifically need a test for reading the old serialized format and producing the new unpacked format, if we don't have that already

hsheth2 · 2023-06-29T21:08:28Z

metadata-ingestion/src/datahub/testing/compare_metadata_json.py

+MetadataJson = List[Dict[str, Any]]
+
+default_exclude_paths = [
+    r"root\[\d+]\['systemMetadata']\['lastObserved']",


why do we have \[ but not \] ?

It doesn't seem to be necessary. PyCharm initially bugged me about it and I just kept it up for consistency. That being said, I think it does matter if you have an open, unescaped [, so I should probably escape ] too just in case.
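
A quick check of the escaping question with Python's re module. The first pattern is the one from the diff above; the sample path and the other two patterns are made up for the comparison.

```python
import re

path = "root[3]['systemMetadata']['lastObserved']"

# Pattern from the diff: `[` escaped, `]` left bare.
closing_unescaped = r"root\[\d+]\['systemMetadata']\['lastObserved']"
# Same pattern with every bracket escaped.
closing_escaped = r"root\[\d+\]\['systemMetadata'\]\['lastObserved'\]"
# An unescaped `[` opens a character class: `[\d+]` means "one digit or '+'".
opening_unescaped = r"root[\d+]"

matches_unescaped = re.fullmatch(closing_unescaped, path) is not None
matches_escaped = re.fullmatch(closing_escaped, path) is not None
opening_matches_bracketed = re.fullmatch(opening_unescaped, "root[3]") is not None
opening_matches_bare = re.fullmatch(opening_unescaped, "root3") is not None
```

Outside a character class a bare ] is treated as a literal, so both of the first two patterns match the same path. Only the opening bracket changes meaning when left unescaped, which is why escaping both is mostly a readability and safety choice.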

hsheth2 · 2023-06-29T21:13:33Z

metadata-ingestion/src/datahub/testing/mcp_diff.py

+        """The pretty human-readable string output of the diff between golden and output."""
+        s = []
+        for urn in self.urns_added:
+            s.append(f"Urn added, {urn}{' with aspects:' if verbose else ''}")


i wonder if we could add color / bolding to make the output even more readable?

might be cool if it's not too hard - click.style does some of this stuff, which we use in pipeline.py

In interest of time, will leave this as a potential followup

test(ingest): Aspect level golden file comparison

a99cfbc

asikowitz requested review from treff7es and hsheth2 June 26, 2023 23:11

minor adjustments

71091cd

github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jun 26, 2023

asikowitz commented Jun 26, 2023

vercel bot deployed to Preview June 26, 2023 23:49 View deployment

asikowitz added 2 commits June 27, 2023 14:11

remove expand_mcp in favor of always expanding PATCH MCPCs

69ab181

remove inflection dependency; add 'test' extra

56d93d9

hsheth2 reviewed Jun 27, 2023

vercel bot deployed to Preview June 27, 2023 19:54 View deployment

asikowitz added 3 commits June 28, 2023 17:25

pr feedback; naming changes; add DeltaInfoOperator over ignore_path

5a471f4

rename

8b6bd37

lint

c6a5679

vercel bot deployed to Preview June 28, 2023 21:58 View deployment

asikowitz added 2 commits June 28, 2023 18:40

bug fixes

8561493

fix test

15d99ac

vercel bot deployed to Preview June 28, 2023 23:26 View deployment

hsheth2 reviewed Jun 29, 2023

asikowitz added 2 commits July 7, 2023 19:17

pr feedback

d54fe1d

Merge branch 'master' into update-testing-framework

c831928

asikowitz requested a review from hsheth2 July 7, 2023 23:18

vercel bot deployed to Preview July 7, 2023 23:49 View deployment

Merge branch 'master' into update-testing-framework

98b44a3

vercel bot deployed to Preview July 10, 2023 16:57 View deployment

Merge branch 'master' into update-testing-framework

a79451b

hsheth2 approved these changes Jul 10, 2023

hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Jul 10, 2023

vercel bot deployed to Preview July 10, 2023 19:06 View deployment

asikowitz merged commit 2261531 into datahub-project:master Jul 11, 2023
40 of 44 checks passed

asikowitz deleted the update-testing-framework branch July 11, 2023 14:39
