fix(transformers): pattern add domain transformer - enable replace_existing #7317

asikowitz · 2023-02-10T23:05:43Z

Propagates the replace_existing config flag and adds tests for the pattern add domain transformer.

Note, as demonstrated by the tests, that if the pattern does not match and replace_existing = True, it still replaces the domains for that entity, just with []. This is the same behavior as other transformers with the replace_existing flag and patterns, like the pattern add dataset ownership transformer.
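To make the non-matching case concrete, here is a minimal plain-Python sketch of the behavior described above. It is not the transformer's actual code: `domains_for_urn`, `resolve_domains`, the rule format, and the `acryl` domain URN are hypothetical stand-ins chosen for illustration.

```python
import re
from typing import Dict, List


def domains_for_urn(urn: str, rules: Dict[str, List[str]]) -> List[str]:
    """Return the domains of every rule whose regex matches the URN (hypothetical helper)."""
    matched: List[str] = []
    for pattern, domains in rules.items():
        if re.match(pattern, urn):
            for domain in domains:
                if domain not in matched:
                    matched.append(domain)
    return matched


def resolve_domains(
    existing: List[str], to_add: List[str], replace_existing: bool
) -> List[str]:
    """Replace the existing list outright, or append while skipping duplicates."""
    if replace_existing:
        # An empty to_add still replaces: the entity ends up with no domains.
        return list(to_add)
    merged = list(existing)
    for domain in to_add:
        if domain not in merged:
            merged.append(domain)
    return merged


# Illustrative inputs; the rule format and URNs are assumptions.
rules = {".*example1.*": ["urn:li:domain:gslab"]}
existing = ["urn:li:domain:acryl"]
unmatched_urn = "urn:li:dataset:(urn:li:dataPlatform:bigquery,example2,PROD)"

to_add = domains_for_urn(unmatched_urn, rules)  # no rule matches this URN
replaced = resolve_domains(existing, to_add, replace_existing=True)
appended = resolve_domains(existing, to_add, replace_existing=False)
```

With `replace_existing=True` the unmatched entity loses its existing domain, while with `replace_existing=False` the existing list is left as it was.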

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

jjoyce0510 · 2023-02-10T23:36:30Z

metadata-ingestion/tests/unit/test_transform_dataset.py

+    assert isinstance(output[0].record.aspect, models.DomainsClass)
+    transformed_aspect = cast(models.DomainsClass, output[0].record.aspect)
+    assert len(transformed_aspect.domains) == 2
+    assert gslab_domain in transformed_aspect.domains


So PATCH is the default? (if semantics are not specified)

No, OVERWRITE is the default. But if you don't set replace_existing = True (it defaults to false), the transformer appends new domains to the existing domains for the entity (in a different metadata_aspect_v2 row, because of OVERWRITE). https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer#relationship-between-replace_existing-and-semantics explains all the possible situations, although I don't like that this is confusing enough to warrant an article.

Ah this is quite confusing...

So IF the aspect is a DomainsClass aspect, then an OVERWRITE will mean "we merge into this aspect".

If the aspect is not a DomainsClass aspect, then we'll just generate a new DomainsClass aspect altogether.

I think this makes sense but the documentation is a bit hard to follow for sure.

Maybe another useful set of cases to include in the tests is what happens when a non-domains aspect (e.g. DatasetPropertiesClass) is sent through the pipeline, to show how the transformer behaves then.

The way I see it, the semantics just says whether you edit the existing row (via an AddDatasetDomain aspect) or create a new one (via a DomainsClass aspect). replace_existing determines whether you append to the existing domains list or truncate it. And, confusingly, pattern matching does absolutely nothing to change this logic -- transformers are applied to all change events / proposals with a matching aspect_name; the pattern matching just lets you say which entities get which domains, and if there's no match then they get no domains.
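One way to lay out the four combinations is the sketch below. It follows one reading of the linked documentation page, not the DataHub source code, so treat the rules as assumptions: `replace_existing` decides whether domains emitted by the ingestion source survive, and `semantics` decides whether domains already stored on the server survive. `final_domains` and `merge_unique` are hypothetical names.

```python
from typing import List


def merge_unique(*lists: List[str]) -> List[str]:
    """Concatenate lists in order, dropping duplicates."""
    out: List[str] = []
    for items in lists:
        for item in items:
            if item not in out:
                out.append(item)
    return out


def final_domains(
    source: List[str],
    server: List[str],
    transformer: List[str],
    replace_existing: bool,
    semantics: str,
) -> List[str]:
    """Hypothetical model of which domains an entity ends up with."""
    # Assumption: replace_existing drops whatever the ingestion source emitted.
    if replace_existing:
        from_pipeline = list(transformer)
    else:
        from_pipeline = merge_unique(source, transformer)

    # Assumption: PATCH keeps server-side domains, OVERWRITE discards them.
    if semantics == "PATCH":
        return merge_unique(server, from_pipeline)
    if semantics == "OVERWRITE":
        return from_pipeline
    raise ValueError(f"unknown semantics: {semantics}")


# Placeholder values standing in for domain URNs.
combos = {
    (replace, sem): final_domains(["src"], ["srv"], ["new"], replace, sem)
    for replace in (True, False)
    for sem in ("OVERWRITE", "PATCH")
}
```

Under these assumptions only `replace_existing=True` with `OVERWRITE` leaves the entity with exactly the transformer's domains; every other combination keeps something from the source, the server, or both.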

Shouldn't the infrastructure around transformers (i.e. base_transformer.py::BaseTransformer.transform and _transform_or_record_mcp in that same class) handle the filtering of non-domains aspects? I would expect testing around that to be on the base transformer, rather than on every transformer that inherits from the base transformer.

EDIT: Eh, I guess it's testing that aspect_name is set correctly. ~~I'll add a test~~ actually not really sure how to test this. The transformers aren't written to handle unexpected aspects. I could add a test that checks PatternAddDatasetDomain.aspect_name == "domains" I guess?

Hmm, my understanding of this transformer is that it can handle ANY aspects, not just incoming domain aspects produced by the Ingestion Source... Maybe that's a bad interpretation.

It seems that way. So for example, if I configure a Transformer with an Ingestion Source that does not itself send DomainsClass aspects (some do not), then this transformer will intercept whatever aspect it does produce (e.g. datasetProperties) and will inject the Domains aspect (this would be my expectation, at least).

Got it - so yeah the parent class adds filtering on the Domains aspect... Very odd because the source should not need to produce a Domains aspect for this to apply... Let me see if my understanding of the docs aligns
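The filtering discussed in this thread can be illustrated with a small stand-in. This is a sketch only: `Record`, `SketchBaseTransformer`, and `SketchAddDomain` are hypothetical classes, not DataHub's `BaseTransformer` API, and the sketch does not model what happens for entities whose source never emits a domains aspect.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Record:
    """Hypothetical stand-in for a metadata change proposal."""
    entity_urn: str
    aspect_name: str
    aspect: object


class SketchBaseTransformer:
    """Base class that only hands matching aspects to the subclass."""
    aspect_name: str = ""

    def transform_aspect(self, entity_urn: str, aspect: object) -> object:
        raise NotImplementedError

    def transform(self, records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            if record.aspect_name == self.aspect_name:
                new_aspect = self.transform_aspect(record.entity_urn, record.aspect)
                yield Record(record.entity_urn, record.aspect_name, new_aspect)
            else:
                # Aspects with any other name pass through untouched.
                yield record


class SketchAddDomain(SketchBaseTransformer):
    """Appends one domain to every incoming domains aspect."""
    aspect_name = "domains"

    def __init__(self, domain: str) -> None:
        self.domain = domain

    def transform_aspect(self, entity_urn: str, aspect: object) -> object:
        domains: List[str] = list(aspect)  # the aspect is modeled as a list of URNs
        if self.domain not in domains:
            domains.append(self.domain)
        return domains


transformer = SketchAddDomain("urn:li:domain:gslab")
output = list(
    transformer.transform(
        [
            Record("urn:li:dataset:a", "datasetProperties", {"description": "x"}),
            Record("urn:li:dataset:a", "domains", []),
        ]
    )
)
```

In this model a check such as `SketchAddDomain.aspect_name == "domains"` is what guarantees the subclass only ever sees domains aspects, which is the kind of test proposed earlier in the thread.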

metadata-ingestion/src/datahub/ingestion/transformer/dataset_domain.py

fix(transformers): pattern add domain transformer - enable replace_ex…

f982348

…isting

github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) Feb 10, 2023

jjoyce0510 reviewed Feb 10, 2023

metadata-ingestion/src/datahub/ingestion/transformer/dataset_domain.py

asikowitz added 2 commits February 10, 2023 19:19

add aspect name tests

fc7a878

Merge branch 'master' into domain-transformer-duplicates

255bdd2

asikowitz requested a review from jjoyce0510 February 13, 2023 05:43

asikowitz added 2 commits February 13, 2023 13:45

Merge branch 'master' into domain-transformer-duplicates

e8a1981

Merge branch 'master' into domain-transformer-duplicates

95272a4

jjoyce0510 approved these changes Feb 13, 2023

jjoyce0510 merged commit 8901498 into datahub-project:master Feb 13, 2023

asikowitz deleted the domain-transformer-duplicates branch February 13, 2023 21:01

oleg-ruban pushed a commit to RChygir/datahub that referenced this pull request Feb 28, 2023

fix(transformers): pattern add domain transformer - enable replace_ex…

01cb6eb

…isting (datahub-project#7317)

yoonhyejin pushed a commit that referenced this pull request Mar 3, 2023

fix(transformers): pattern add domain transformer - enable replace_ex…

9ecb303

…isting (#7317)
