add DataProcess Urn #1673
public static DataJobUrn deserialize(String rawUrn) throws URISyntaxException {
  return createFromString(rawUrn);
}
You'll most likely need to add the coercer as well like this: https://github.com/linkedin/datahub/blob/f9324377424209d74ba756d85a2dbc0839ca306d/li-utils/src/main/java/com/linkedin/common/urn/DatasetUrn.java#L53
Hasn't that coercer been removed? I looked at the code base, and all the coercer files have been removed.
Oh, I see. It was not removed, but combined.
Yes, there was an issue loading the non-embedded coercer correctly, so we have now combined them all.
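For readers following this thread, here is a minimal, self-contained sketch of the pattern being discussed: a urn class whose `deserialize` delegates to `createFromString`, with the coercer embedded in the same class rather than kept in a separate file. Everything beyond `deserialize` and `createFromString` is an assumption made for illustration: the `urn:li:dataJob:` prefix, the two key fields (`orchestrator`, `name`), the class name `DataJobUrnSketch`, and the `StringCoercer` interface, which is a hypothetical stand-in and not the framework's real coercer API.

```java
import java.net.URISyntaxException;

// Hypothetical stand-in for a framework coercer interface (NOT the real API).
interface StringCoercer<T> {
    String coerceInput(T value);   // typed value -> raw string
    T coerceOutput(String raw);    // raw string -> typed value
}

final class DataJobUrnSketch {
    // Assumed entity prefix; the PR under review may use a different one.
    private static final String PREFIX = "urn:li:dataJob:";

    private final String orchestrator; // hypothetical key field
    private final String name;         // hypothetical key field

    DataJobUrnSketch(String orchestrator, String name) {
        this.orchestrator = orchestrator;
        this.name = name;
    }

    String getOrchestrator() { return orchestrator; }
    String getName() { return name; }

    // Parses "urn:li:dataJob:(orchestrator,name)" into its key fields.
    static DataJobUrnSketch createFromString(String rawUrn) throws URISyntaxException {
        if (rawUrn == null || !rawUrn.startsWith(PREFIX + "(") || !rawUrn.endsWith(")")) {
            throw new URISyntaxException(String.valueOf(rawUrn), "Not a dataJob urn");
        }
        String body = rawUrn.substring(PREFIX.length() + 1, rawUrn.length() - 1);
        String[] parts = body.split(",", -1);
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new URISyntaxException(rawUrn, "Expected exactly two non-empty key fields");
        }
        return new DataJobUrnSketch(parts[0], parts[1]);
    }

    // Same shape as the method in the diff: a thin wrapper over createFromString.
    static DataJobUrnSketch deserialize(String rawUrn) throws URISyntaxException {
        return createFromString(rawUrn);
    }

    @Override
    public String toString() {
        return PREFIX + "(" + orchestrator + "," + name + ")";
    }

    // "Embedded" coercer: defined inside the urn class instead of a separate file.
    static final StringCoercer<DataJobUrnSketch> COERCER = new StringCoercer<DataJobUrnSketch>() {
        @Override
        public String coerceInput(DataJobUrnSketch value) {
            return value.toString();
        }

        @Override
        public DataJobUrnSketch coerceOutput(String raw) {
            try {
                return createFromString(raw);
            } catch (URISyntaxException e) {
                // Surface parse failures as an unchecked exception for callers.
                throw new IllegalArgumentException("Invalid urn: " + raw, e);
            }
        }
    };
}
```

In the real code base the coercer would implement the framework's own interface and be registered in a static initializer, as in the linked `DatasetUrn.java`; the sketch only shows why keeping it in the same class avoids a separate class-loading step.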
@loftyet - Thanks for proposing the PR. To clarify, this DataJob urn represents the definition of a data processing job and not an executing instance. We will need to define a separate urn for an instance of a DataJob. I looked into Azkaban and Airflow (popular workflow managers) and the way they represent a datajob includes the following:
So we have two options for defining this urn:
My preference is to go with Option 2, since we can define specific urns for each workflow manager/scheduler according to their internal workings and not have to modify this later if the way these systems represent a job changes. I can help define the urns for Azkaban and Airflow. What do you think? @mars-lan and @keremsahin1, let me know your thoughts on the above as well.
@hshahoss Yes. The DataJob defined here is for data processing only; that is all I am thinking about for right now. To me, the value of onboarding a data processing job is that the ETL scripts and the data transformation tool can be reused. Dataset A is transformed into Dataset B by processing job AB. Do we really need to log how job AB is triggered? @clojurians-org do you have any thoughts about the job urn's key fields? I understand you are working on Airflow job extraction.
I think we should first clarify the terminology here and make sure we are talking about the same concepts. Current Airflow/Azkaban terminology is like this:
I see all four of the above as entities in DataHub. @loftyet Let me share an example of what I am talking about, and then we can take it further.
@hshahoss I updated my previous comment a little bit. It was written in a rush. I agree with your definition of the four entities.
You also suggested the following
I think it makes sense to have optional aspects such as those.
@loftyet To unblock you for this PR, I suggest we rename this URN to be specific to the job manager/orchestrator that you are targeting instead of making it generic. We plan to define separate urns for different orchestrators (like separate AzkabanJobUrn and AirflowTaskUrn) in the future.
After talking to @hshahoss, I renamed this entity to
You'll most likely also need to add the typeref in a follow-up PR, e.g. https://github.com/linkedin/datahub/blob/master/li-utils/src/main/pegasus/com/linkedin/common/DatasetUrn.pdsc
Checklist
refer to this issue for the details