Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(ingestion) bug fix: emit platform instance aspect for dataset in Databricks ingestion #8671

Merged

Conversation

jinlintt
Copy link
Contributor

Golden file auto generated using the command below.

pytest tests/integration/unity/test_unity_catalog_ingest.py --update-golden-files

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Aug 18, 2023
@maggiehays maggiehays added the community-contribution PR or Issue raised by member(s) of DataHub Community label Aug 21, 2023
@jinlintt
Copy link
Contributor Author

@hsheth2 @asikowitz
Could you or someone take a look? It is small change and similar to #8585

@asikowitz
Copy link
Collaborator

We're aware of this bug, but unfortunately this is going to change all container urns which will not be backwards compatible. I'll talk over with the team the best way to handle this

@jinlintt
Copy link
Contributor Author

We're aware of this bug, but unfortunately this is going to change all container urns which will not be backwards compatible. I'll talk over with the team the best way to handle this

Thank you @asikowitz , what we really care about is emitting the platform instance aspect for the dataset. I can revert the changes to the container. Let me know.

@asikowitz
Copy link
Collaborator

Can you revert the container changes then and regenerate the golden? Please also add a comment on config.platform_instance that it's not consistent with the data platform instance aspect emitted

@jinlintt
Copy link
Contributor Author

Can you revert the container changes then and regenerate the golden? Please also add a comment on config.platform_instance that it's not consistent with the data platform instance aspect emitted

Done. Regenerated the golden file and no idea why it still has so many churns. @asikowitz

@asikowitz asikowitz self-assigned this Aug 22, 2023
@asikowitz asikowitz self-requested a review August 22, 2023 15:50
@jinlintt jinlintt changed the title (ingestion) bug fix: emit platform instance aspect for dataset. Fix the wrong references of platform instance (ingestion) bug fix: emit platform instance aspect for dataset Aug 22, 2023
@jinlintt jinlintt changed the title (ingestion) bug fix: emit platform instance aspect for dataset (ingestion) bug fix: emit platform instance aspect for dataset in Databricks ingestion Aug 23, 2023
@jinlintt
Copy link
Contributor Author

@asikowitz and team, could we get this reviewed and merged soon?
We are blocked by this.

@asikowitz
Copy link
Collaborator

Hi Jinlin, sorry for the delay. I believe adding this aspect will cause the data platform instance (i.e. workspace name) to appear in the UI as part of the browse path, which we'd generally like to avoid. Just curious, do you want to add this aspect because you're building a feature around it, or just for consistency with the dataset urns.

In any case, for the near future, can you put a flag around creating this aspect, default False? This should also allow you to revert the golden file changes. We may want to produce this in the future, but it'll require some discussion on our side, so I think this is the easiest way to unblock you.

@jinlintt
Copy link
Contributor Author

Hi Jinlin, sorry for the delay. I believe adding this aspect will cause the data platform instance (i.e. workspace name) to appear in the UI as part of the browse path, which we'd generally like to avoid. Just curious, do you want to add this aspect because you're building a feature around it, or just for consistency with the dataset urns.

In any case, for the near future, can you put a flag around creating this aspect, default False? This should also allow you to revert the golden file changes. We may want to produce this in the future, but it'll require some discussion on our side, so I think this is the easiest way to unblock you.

Hi @asikowitz
Why would you want to hide data platform instance from the browse path? That is always part of the data model. So if it is not a problem for the other data sources such as MySQL, can you explain why this is a problem for Databricks?

I could add a flag to disable this by default, but that seems in consistent with how platform_instance works with other data sources. For the other data sources, setting the platform_instance means the data platform instance aspect will be ingested. But for Databricks, the user has to set one more flag to get it work.

CC @jjoyce0510 @hsheth2

@asikowitz
Copy link
Collaborator

For other sources like MySQL, we don't have a default platform instance. Thus, if a user is manually specifying a platform instance, they likely (i) intend to have multiple instances of the same platform and want to distinguish between them and (ii) won't be surprised seeing this platform instance name in the UI. This is different for platform instances that are automatically determined, like here. If a user only has one databricks instance ingested into datahub, they might not want the clutter of seeing their workspace id attached to every databricks dataset.

This is made worse by the fact that we are using a different platform instance name for databricks containers vs datasets, and that we should only be automatically setting platform instance if that value is globally unique. It's not immediately clear to me that our current way of calculating platform_instance_name will be globally unique, e.g. as per https://docs.databricks.com/en/workspace/workspace-details.html, so I don't want to commit further to this value if we have to change this in the future.

We also in general need to clean up how the DataPlatformInstance aspect is generated and how it's represented in the UI. For generation, I am almost positive that databricks is not the only source that's failing to produce DataPlatformInstance aspects; we need to standardize this across the board. However, doing this is blocked by the fact that we don't have a consistent UI for displaying a dataset's "browse path". Some places, we're using the BrowsePaths aspect, another BrowsePathsV2 (which can include platform instance data, but currently will not include platform instance info for databricks as it's determined by config.platform_instance), and another we recursively look up a dataset's container structure and prepend the DataPlatformInstance aspect instance value, if it exists. We likely want to use the BrowsePathsV2 aspect as the source of truth for all UI elements, but until then, we don't want to be making changes by producing more DataPlatformInstance aspects if that's going to produce UI inconsistency, as it will for databricks.

Hopefully that gives you some insight on why, to unblock you, it's simplest to put this behind a flag.

@jinlintt
Copy link
Contributor Author

@asikowitz thank you for the detailed explanation and it makes sense.
I have added a new flag to guard the ingestion of the DPI aspect and it defaults to false.

@asikowitz asikowitz merged commit 437b787 into datahub-project:master Aug 28, 2023
51 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants