Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(transformers): pattern add domain transformer - enable replace_existing #7317

Merged

Conversation

asikowitz
Copy link
Collaborator

@asikowitz asikowitz commented Feb 10, 2023

Propagates the replace_existing config flag and adds tests for the pattern add domain transformer.

Note, as demonstrated by the tests, that if the pattern does not match and replace_existing = True, it still replaces the domains for that entity, just with []. This is the same behavior as other transformers with the replace_existing flag and patterns, like the pattern add dataset ownership transformer.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 10, 2023
assert isinstance(output[0].record.aspect, models.DomainsClass)
transformed_aspect = cast(models.DomainsClass, output[0].record.aspect)
assert len(transformed_aspect.domains) == 2
assert gslab_domain in transformed_aspect.domains
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So PATCH is the default? (if semantics are not specified)

Copy link
Collaborator Author

@asikowitz asikowitz Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, OVERWRITE is the default, but if you don't have replace_existing = True (which defaults to false) it appends new domains to the existing domains for the entity (but in a different metadata_aspect_v2 row because of OVERWRITE). https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer#relationship-between-replace_existing-and-semantics explains all the possible situations, although I don't like how this is confusing enough it warrants an article

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is quite confusing...

So IF the aspect is a DomainsClass aspect, then an OVERWRITE will mean "we merge into this aspect".

If the aspect is not a DomainsClass aspect, then we'll just generate a new DomainsClass aspect altogether.

I think this makes sense but the documentation is a bit hard to follow for sure.

Maybe another set of cases that would be useful to include in the tests are what happens when a non-domains aspect (e.g. DatasetPropertiesClass) is sent through the pipeline (to show how it behaves then)

Copy link
Collaborator Author

@asikowitz asikowitz Feb 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I see it, the semantics just says whether you edit the existing row (via a AddDatasetDomain aspect) or create a new one (via a DomainsClass aspect). replace_existing determines whether you append to the existing domains list, or truncate. And, confusingly, pattern matching does absolutely nothing to change this logic -- transformers are applied on all change event / proposals with matching aspect_name, the pattern matching just allows you to say which entities get which domain, and if there's no match then they get no domains.

Shouldn't the infrastructure around transformers (i.e. base_transformer.py::BaseTransformer.transform and _transform_or_record_mcp in that same class) handle the filtering of non-domains aspects? I would expect testing around that to be on the base transformer, rather than every transformer than inherits from the base transformer.

EDIT: Eh, I guess it's testing that aspect_name is set correctly. I'll add a test actually not really sure how to test this. The transformers aren't written to handle unexpected aspects. I could add a test that checks PatternAddDatasetDomain.aspect_name == "domains" I guess?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm my understanding of this transformer is that it can handle ANY aspects not just incoming domain aspects produced by the Ingestion Source... Maybe that's a bad interpretation

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that way. So for example, if I configure a Transformer with an Ingestion Source that does not itself sent DomainsClass aspects (some do not), then this transformer will intercept whatever aspect it does produce (e.g. datasetProperties) and will inject the Domains aspect (this would be my expectation, at least)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it - so yeah the parent class adds filtering on the Domains aspect... Very odd because the source should not need to produce a Domains aspect for this to apply... Let me see if my understanding of the docs aligns

@jjoyce0510 jjoyce0510 merged commit 8901498 into datahub-project:master Feb 13, 2023
@asikowitz asikowitz deleted the domain-transformer-duplicates branch February 13, 2023 21:01
oleg-ruban pushed a commit to RChygir/datahub that referenced this pull request Feb 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants