(ingestion) bug fix: emit platform instance aspect for dataset in Databricks ingestion #8671
…atform_instance references
@hsheth2 @asikowitz
We're aware of this bug, but unfortunately fixing it will change all container urns, which is not backwards compatible. I'll talk with the team about the best way to handle this.
Thank you @asikowitz, what we really care about is emitting the platform instance aspect for the dataset. I can revert the changes to the container. Let me know.
Can you revert the container changes then and regenerate the golden? Please also add a comment on
Done. I regenerated the golden file, but I'm not sure why it still shows so much churn. @asikowitz
@asikowitz and team, could we get this reviewed and merged soon?
Hi Jinlin, sorry for the delay. I believe adding this aspect will cause the data platform instance (i.e. workspace name) to appear in the UI as part of the browse path, which we'd generally like to avoid. Just curious, do you want to add this aspect because you're building a feature around it, or just for consistency with the dataset urns? In any case, for the near future, can you put a flag around creating this aspect, defaulting to False? This should also allow you to revert the golden file changes. We may want to produce this aspect in the future, but it'll require some discussion on our side, so I think this is the easiest way to unblock you.
Hi @asikowitz, I could add a flag to disable this by default, but that seems inconsistent with how
For other sources like MySQL, we don't have a default platform instance. Thus, if a user is manually specifying a platform instance, they likely (i) intend to have multiple instances of the same platform and want to distinguish between them and (ii) won't be surprised to see this platform instance name in the UI.

This is different for platform instances that are automatically determined, like here. If a user only has one Databricks instance ingested into DataHub, they might not want the clutter of seeing their workspace id attached to every Databricks dataset. This is made worse by the fact that we are using a different platform instance name for Databricks containers vs datasets, and that we should only be automatically setting platform instance if that value is globally unique. It's not immediately clear to me that our current way of calculating

We also in general need to clean up how the DataPlatformInstance aspect is generated and how it's represented in the UI. For generation, I am almost positive that Databricks is not the only source that's failing to produce DataPlatformInstance aspects; we need to standardize this across the board. However, doing this is blocked by the fact that we don't have a consistent UI for displaying a dataset's "browse path". In some places we're using the BrowsePaths aspect, in others BrowsePathsV2 (which can include platform instance data, but currently will not include platform instance info for Databricks as it's determined by

Hopefully that gives you some insight into why, to unblock you, it's simplest to put this behind a flag.
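For illustration, the flag discussed above could be sketched roughly as follows. This is a minimal, self-contained sketch and not the code in this PR: the `SourceConfig` class, the `emit_platform_instance_aspect` flag name, the `dataset_aspects` helper, and the dict-shaped aspects are all hypothetical stand-ins, and the urn layout is assumed from DataHub's usual `dataPlatformInstance` convention.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class SourceConfig:
    """Hypothetical stand-in for the source config; not the real DataHub class."""

    platform_instance: Optional[str] = None
    # Hypothetical flag name. Off by default, so existing output is unchanged.
    emit_platform_instance_aspect: bool = False


def make_instance_urn(platform: str, instance: str) -> str:
    # Assumed urn layout for a data platform instance.
    return f"urn:li:dataPlatformInstance:(urn:li:dataPlatform:{platform},{instance})"


def dataset_aspects(dataset_urn: str, config: SourceConfig) -> Iterator[dict]:
    """Yield simplified aspect records for one dataset."""
    # Always emitted, regardless of the flag.
    yield {"entityUrn": dataset_urn, "aspectName": "status", "aspect": {"removed": False}}

    # Only emitted when the user opts in and an instance name is available.
    if config.emit_platform_instance_aspect and config.platform_instance:
        yield {
            "entityUrn": dataset_urn,
            "aspectName": "dataPlatformInstance",
            "aspect": {
                "platform": "urn:li:dataPlatform:databricks",
                "instance": make_instance_urn("databricks", config.platform_instance),
            },
        }
```

With the default configuration only the unconditional aspect is produced, which is why a default of False would let the golden file changes be reverted; opting in adds one extra aspect per dataset.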
@asikowitz thank you for the detailed explanation; it makes sense.
Golden file auto generated using the command below.