-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test(ingest): Aspect level golden file comparison #8310
test(ingest): Aspect level golden file comparison #8310
Conversation
pytest.fail(diff.pretty(), pytrace=False) | ||
else: | ||
pytest.fail(pprint.pformat(diff), pytrace=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing this rather than raising an exception removes the stack trace, which I think will be nice for reading error messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should add these as comments in the code directly
|
||
if diff: | ||
if isinstance(diff, GoldenDiff): | ||
print(diff.pretty(verbose=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This leaves the actual error message shorter, but stdout pretty long, as it'll print out all all the full aspects associated with any differences. I still think this is preferred by default so that it's easy to debug failing connector tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some quick comments from an initial review pass
serialized = obj.to_obj() | ||
if serialized.get("aspect") and serialized["aspect"].get("contentType") in [ | ||
"application/json", | ||
"application/json-patch+json", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we extract these into constants?
we already have _ASPECT_CONTENT_TYPE
somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was using that previously but I didn't really get what it meant. I can abstract them into some basic constants like JSON_CONTENT_TYPE
and JSON_PATCH_CONTENT_TYPE
though
except Exception as e: | ||
logger.warning(f"Reverting to old diff method: {e}") | ||
logger.debug("Error with new diff method", exc_info=True) | ||
return diff_mces(output, golden, ignore_paths) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this fallback? Is the new GoldenDiff stuff risky?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work on MCEs and in general I'm just trying to be safe here. Deepdiff works on pretty much any object while GoldenDiff only works on a specifically formatted object (list of serialized MCPs)
Tuple[int, GoldenAspect, GoldenAspect], List[DiffLevel] | ||
] = field(init=False, default_factory=lambda: defaultdict(list)) | ||
|
||
def __post_init__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a classmethod create
that then creates the AspectDiff
obj at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had this originally and swapped to post init. Do you not like post init, or is there any style recs we're following here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more a personal style thing
If we're just initializing one small field, using post_init is fine. In this case we're basically initializing the entire thing in post init, which feels like an abuse of the post init mechanism
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some quick comments from an initial review pass
metadata-ingestion/setup.py
Outdated
@@ -685,6 +689,7 @@ def get_long_description(): | |||
) | |||
), | |||
"dev": list(dev_requirements), | |||
"test": list(test_api_requirements), # To import `datahub.testing` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call this testing-utils
?
metadata-ingestion/setup.py
Outdated
@@ -430,7 +434,7 @@ def get_long_description(): | |||
# pydantic 1.8.2 is incompatible with mypy 0.910. | |||
# See https://github.com/samuelcolvin/pydantic/pull/3175#issuecomment-995382910. | |||
"pydantic>=1.9.0", | |||
"pytest>=6.2.2", | |||
*pytest_common, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be test_api_requirements
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I wasn't sure about. Technically if we changed our test_api_requirements
to not require deepdiff
, we'd still want it in the dev requirements because we use deepdiff elsewhere (not just for the test api), so I'm not sure if I should nest the whole thing or separately declare the dependency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then make a separate deepdiff_dep var, which gets included both here and by test_api_requirements
That way we can remove it from test_api_requirements
and still have it be in dev mode
(it's totally fine to have the same thing listed twice - we're using sets to dedup, and setup tools handles it too)
if serialized.get("aspect") and serialized["aspect"].get("contentType") in [ | ||
JSON_CONTENT_TYPE, | ||
JSON_PATCH_CONTENT_TYPE, | ||
]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also add a test to tests/unit/serde/test_serde.py
we already have some stuff for patches in there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll specifically need a test for reading the old serialized format and producing the new unpacked format, if we don't have that already
MetadataJson = List[Dict[str, Any]] | ||
|
||
default_exclude_paths = [ | ||
r"root\[\d+]\['systemMetadata']\['lastObserved']", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we have \[
but not \]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem to be necessary -- pycharm initially bugged me about it and I just kept it up for consistency. That being said, I think it does matter if you have an open, unescaped [
so I should prob escape ]
too just in case
"""The pretty human-readable string output of the diff between golden and output.""" | ||
s = [] | ||
for urn in self.urns_added: | ||
s.append(f"Urn added, {urn}{' with aspects:' if verbose else ''}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder if we could add color / bolding to make the output even more readable?
might be cool if it's not too hard - click.style
does some of this stuff, which we use in pipeline.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In interest of time, will leave this as a potential followup
@mayurinehate for awareness
Basic idea is that, since deepdiff's
iterable_compare_func
appears to be broken (and doesn't work ifignore_order=True
), we perform diffs against the list of aspects associated with a given urn and aspect name. Then these diffs are parsed to find which aspects were added / removed / changed and we (i) print out the diff in a prettier way and (ii) attempt to create a cleaner diff when overwriting the golden.Only works on MCPs because handling MCEs was too complicated, and anyway we want to move away from them.
Here's an example pytest error message now:
And then in stdout, I print a more verbose version with the relevant aspects, e.g.:
Checklist