New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(transformers): pattern add domain transformer - enable replace_existing #7317
fix(transformers): pattern add domain transformer - enable replace_existing #7317
Conversation
assert isinstance(output[0].record.aspect, models.DomainsClass) | ||
transformed_aspect = cast(models.DomainsClass, output[0].record.aspect) | ||
assert len(transformed_aspect.domains) == 2 | ||
assert gslab_domain in transformed_aspect.domains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So PATCH is the default? (if semantics are not specified)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, OVERWRITE is the default, but if you don't have replace_existing = True
(which defaults to false
) it appends new domains to the existing domains for the entity (but in a different metadata_aspect_v2
row because of OVERWRITE). https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer#relationship-between-replace_existing-and-semantics explains all the possible situations, although I don't like how this is confusing enough it warrants an article
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah this is quite confusing...
So IF the aspect is a DomainsClass aspect, then an OVERWRITE will mean "we merge into this aspect".
If the aspect is not a DomainsClass aspect, then we'll just generate a new DomainsClass aspect altogether.
I think this makes sense but the documentation is a bit hard to follow for sure.
Maybe another set of cases that would be useful to include in the tests are what happens when a non-domains aspect (e.g. DatasetPropertiesClass) is sent through the pipeline (to show how it behaves then)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I see it, the semantics just says whether you edit the existing row (via a AddDatasetDomain
aspect) or create a new one (via a DomainsClass
aspect). replace_existing
determines whether you append to the existing domains list, or truncate. And, confusingly, pattern matching does absolutely nothing to change this logic -- transformers are applied on all change event / proposals with matching aspect_name
, the pattern matching just allows you to say which entities get which domain, and if there's no match then they get no domains.
Shouldn't the infrastructure around transformers (i.e. base_transformer.py::BaseTransformer.transform
and _transform_or_record_mcp
in that same class) handle the filtering of non-domains aspects? I would expect testing around that to be on the base transformer, rather than every transformer than inherits from the base transformer.
EDIT: Eh, I guess it's testing that aspect_name
is set correctly. I'll add a test actually not really sure how to test this. The transformers aren't written to handle unexpected aspects. I could add a test that checks PatternAddDatasetDomain.aspect_name == "domains"
I guess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm my understanding of this transformer is that it can handle ANY aspects not just incoming domain aspects produced by the Ingestion Source... Maybe that's a bad interpretation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that way. So for example, if I configure a Transformer with an Ingestion Source that does not itself sent DomainsClass aspects (some do not), then this transformer will intercept whatever aspect it does produce (e.g. datasetProperties) and will inject the Domains aspect (this would be my expectation, at least)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it - so yeah the parent class adds filtering on the Domains aspect... Very odd because the source should not need to produce a Domains aspect for this to apply... Let me see if my understanding of the docs aligns
Propagates the
replace_existing
config flag and adds tests for the pattern add domain transformer.Note, as demonstrated by the tests, that if the pattern does not match and
replace_existing = True
, it still replaces the domains for that entity, just with[]
. This is the same behavior as other transformers with thereplace_existing
flag and patterns, like the pattern add dataset ownership transformer.Checklist