feat(ingestion/bigquery): BigQuery Owner Label to Datahub Ownership #10047

shubhamjagtap639
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

github-actions bot added the ingestion and community-contribution labels on Mar 14, 2024
Comment on lines 299 to 309
owner_lable_character_mapping: Dict[str, str] = Field(
    default={},
    description="A mapping of bigquery owner label character to datahub owner character."
    "Provided mapping will get added to default mapping.",
)

owner_key_pattern: str = Field(
    default="_owner_email",
    description="A pattern which defines what identifies an owner label.",
)

Contributor

I'm looking at the BigQuery datahub source docs and seeing capture_table_labels_as_tags and think we should model this to be similar. So something like:

capture_table_owner_label_as_owner: 
    enabled: true
    # we want them to be able to define a mapping, so our BQ label -> datahub owner ingestion is generic
    label_character_mapping:
        - "_": "."
        - "-": "@"
        - "__": "_"
        - "--": "-"
        - "_-": "#"
        - "-_": " "

    # we want them to be able to define what identifies an owner label, so it is also generic
    owner_key_pattern: "_owner_email"
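A minimal sketch of how the proposed mapping could be applied to a label value, for illustration only. The helper names `decode_owner_label` and `owners_from_labels` are hypothetical, and both the longest-key-first matching and the substring check on the label key are assumptions of this sketch, not behavior confirmed by the PR.

```python
from typing import Dict, List

# Mapping copied from the suggestion above.
LABEL_CHARACTER_MAPPING: Dict[str, str] = {
    "_": ".",
    "-": "@",
    "__": "_",
    "--": "-",
    "_-": "#",
    "-_": " ",
}


def decode_owner_label(value: str, mapping: Dict[str, str]) -> str:
    """Translate a BigQuery label value into an owner string (hypothetical helper)."""
    # Assumption: try longer keys first, so "__" is not read as two "_" characters.
    # Empty keys are skipped because they would never advance the scan.
    keys = sorted((k for k in mapping if k), key=len, reverse=True)
    out: List[str] = []
    i = 0
    while i < len(value):
        for key in keys:
            if value.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No mapping key starts at this position; keep the character unchanged.
            out.append(value[i])
            i += 1
    return "".join(out)


def owners_from_labels(
    labels: Dict[str, str], owner_key_pattern: str, mapping: Dict[str, str]
) -> List[str]:
    """Decode the values of labels whose key contains owner_key_pattern.

    Assumption: the pattern is treated as a plain substring of the label key.
    """
    return [
        decode_owner_label(value, mapping)
        for key, value in labels.items()
        if owner_key_pattern in key
    ]
```

With this sketch, a label such as `data_owner_email: john_doe-example_com` would decode to `john.doe@example.com`, while a label whose key does not contain `_owner_email` would be ignored.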

metadata-ingestion/docs/transformer/dataset_transformer.md Outdated Show resolved Hide resolved
@@ -111,5 +156,7 @@ def transform_aspect(
),
)
)

return None
if not self.config.replace_existing:
Collaborator

this doesn't make sense - we should always return aspect

Contributor Author

I thought so too, but the previous behavior removed the tag metadata from ingestion after extracting ownership from it.

metadata-ingestion/tests/unit/test_bigquery_source.py Outdated Show resolved Hide resolved
@shubhamjagtap639 shubhamjagtap639 marked this pull request as ready for review March 26, 2024 08:11
Collaborator

@hsheth2 hsheth2 left a comment

functionality looks good, just some code cleanup things remaining

So if we have input dataset tag like
- `urn:li:tag:dataset_owner_email:abc@email.com`
- `urn:li:tag:dataset_owner_email:xyz@email.com`
Collaborator

these have dataset_owner_email but the example is tag_pattern: "owner_email:" - why the mismatch?

Contributor Author

Because the config is now tag_pattern, not tag_prefix, so the pattern can be provided this way as well.
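To make the prefix-versus-pattern distinction concrete, here is a small sketch. It assumes the pattern is applied as a regular-expression search over the tag name and that everything after the match is the owner; the function names are hypothetical and this is not taken from the transformer's actual code.

```python
import re
from typing import Optional

TAG_URN_PREFIX = "urn:li:tag:"


def owner_from_tag_pattern(tag_urn: str, tag_pattern: str) -> Optional[str]:
    """Return the text after the first pattern match in the tag name, if any."""
    if not tag_urn.startswith(TAG_URN_PREFIX):
        return None
    tag_name = tag_urn[len(TAG_URN_PREFIX):]
    # Assumption: a search, so the pattern may match anywhere in the tag name.
    match = re.search(tag_pattern, tag_name)
    if match is None:
        return None
    return tag_name[match.end():]


def owner_from_tag_prefix(tag_urn: str, tag_prefix: str) -> Optional[str]:
    """Return the text after tag_prefix only when the tag name starts with it."""
    if not tag_urn.startswith(TAG_URN_PREFIX):
        return None
    tag_name = tag_urn[len(TAG_URN_PREFIX):]
    if not tag_name.startswith(tag_prefix):
        return None
    return tag_name[len(tag_prefix):]
```

Under these assumptions, `owner_email:` used as a pattern would match inside `dataset_owner_email:abc@email.com` and yield `abc@email.com`, whereas the same string used as a strict prefix would not match at all.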

metadata-ingestion/docs/transformer/dataset_transformer.md Outdated Show resolved Hide resolved
Collaborator

@hsheth2 hsheth2 left a comment

Going to approve and merge this for now, but there are multiple issues that need to be addressed in a follow-up.

@hsheth2 hsheth2 merged commit 9f2c5d3 into datahub-project:master Mar 28, 2024
56 of 57 checks passed