
add DataProcess Urn #1673

Merged
merged 3 commits into master on May 20, 2020

Conversation

liangjun-jiang
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)
    Refer to this issue for the details.


public static DataJobUrn deserialize(String rawUrn) throws URISyntaxException {
  return createFromString(rawUrn);
}
Contributor Author

Hasn't that coercer been removed? I looked at the code base; all the coercer files have been removed.

Contributor Author

Oh, I see. It is not removed, but combined.

Contributor

Yes, there was an issue loading the non-embedded coercers correctly, so we combined them all now.

keremsahin1 requested a review from hshahoss on May 18, 2020 16:50
@hshahoss
Contributor

@loftyet - Thanks for proposing the PR.

To clarify, this DataJob urn represents the definition of a data processing job and not an executing instance. We will need to define a separate urn for an instance of a DataJob.

I looked into Azkaban and Airflow (popular workflow managers) and the way they represent a datajob includes the following:

  1. Namespace/Project in which the workflow and datajob exists
  2. Workflow Id - Workflow in which the datajob exists
  3. Job Id - Id of the data job

So we have two options for defining this urn:

  1. Define a urn which includes the three parts above (namespace, workflow_id and job_id)
  2. Define individual workflow manager urns and make the DataJob urn a union of those individual urns.

My preference is to go with Option 2, since we can define specific urns for each workflow manager/scheduler according to their internal workings and not have to modify this later if the way these systems represent a job changes. I can help define the urns for Azkaban and Airflow.
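
For concreteness, here is a minimal, self-contained sketch of what one orchestrator-specific urn under Option 2 could look like. The class name `AzkabanJobUrn`, the entity name `azkabanJob`, and the string layout are assumptions for illustration only; an actual DataHub urn class would extend `com.linkedin.common.urn.Urn` and follow its conventions, and the DataJob urn would then be defined as a union over such orchestrator-specific urns.

```java
import java.net.URISyntaxException;

// Rough sketch only: a self-contained, orchestrator-specific urn for Option 2.
// Class name, entity name and string layout are illustrative assumptions.
public final class AzkabanJobUrn {
  private static final String PREFIX = "urn:li:azkabanJob:";

  private final String namespace; // Azkaban project the workflow lives in
  private final String flowId;    // workflow that contains the job
  private final String jobId;     // id of the job itself

  public AzkabanJobUrn(String namespace, String flowId, String jobId) {
    this.namespace = namespace;
    this.flowId = flowId;
    this.jobId = jobId;
  }

  @Override
  public String toString() {
    return PREFIX + "(" + namespace + "," + flowId + "," + jobId + ")";
  }

  // Parses e.g. "urn:li:azkabanJob:(myProject,myFlow,myJob)"
  public static AzkabanJobUrn deserialize(String rawUrn) throws URISyntaxException {
    if (!rawUrn.startsWith(PREFIX + "(") || !rawUrn.endsWith(")")) {
      throw new URISyntaxException(rawUrn, "not an azkabanJob urn");
    }
    String[] parts = rawUrn.substring(PREFIX.length() + 1, rawUrn.length() - 1).split(",");
    if (parts.length != 3) {
      throw new URISyntaxException(rawUrn, "expected (namespace,flowId,jobId)");
    }
    return new AzkabanJobUrn(parts[0], parts[1], parts[2]);
  }
}
```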

What do you think?

@mars-lan and @keremsahin1 let me know your thoughts on the above as well.

@liangjun-jiang
Contributor Author

liangjun-jiang commented May 19, 2020

@hshahoss yes. The DataJob defined here is for data processing only.

I am only thinking about the data processing job right now. As for the instance of a job (the flow), I am still thinking about it.

To me, the value of onboarding a data processing job is that the ETL scripts and the data transformation tool can be reused. The flow could be started, stopped, and destroyed.

Dataset A is transformed into Dataset B by processing job AB. Do we really need to log how job AB is triggered?

@clojurians-org do you have some thoughts about the job urn's key fields? I understand you are working on Airflow job extraction.

@hshahoss
Contributor

I think we should first clarify the terminology here and make sure we are talking about the same concepts. Current Airflow/Azkaban terminology is like this:

  1. DataFlow = Workflow = DAG = a graph of tasks/jobs. This includes the static metadata/definition of a flow.
  2. DataJob = Job = Task = a single unit of execution which is part of a workflow. This includes the static metadata/definition of a job.
  3. DataFlowInstance = a running/executing instance of a DataFlow. This includes runtime information about an execution of a flow, like status, execution time, etc.
  4. DataJobInstance = a running/executing instance of a DataJob. This includes runtime information about an execution of a job, like status, execution time, etc.

I see all the four above as entities in DataHub.
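
Purely as a reading aid, here is a hypothetical Java sketch of how these four entities relate; the class and field names below are assumptions, not DataHub's actual model.

```java
import java.time.Instant;
import java.util.List;

// Purely illustrative: static definitions vs. runtime instances.
record DataFlow(String namespace, String flowId, List<String> jobIds) {}                  // DAG definition
record DataJob(String namespace, String flowId, String jobId) {}                          // single task definition
record DataFlowInstance(DataFlow flow, String runId, String status, Instant startedAt) {} // one execution of a flow
record DataJobInstance(DataJob job, String runId, String status, Instant startedAt) {}    // one execution of a job
```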

@loftyet Let me create and share an example of what I am talking about, and then we can take it further.

@liangjun-jiang
Contributor Author

liangjun-jiang commented May 20, 2020

@hshahoss I updated my previous comment a little bit. It was written in a rush. I agree with your definition of the 4 entities.
However, of the 4 entities you listed, I can only see that 1 & 2 are useful. 3 & 4 are transient, and in my opinion tracking them doesn't really bring that much value.
For example, if I repeatedly schedule a DataFlowInstance or DataJobInstance on a daily basis, why do I need to keep track of each run's status & execution time?
For the sake of understanding, I am looking forward to seeing the 4 examples.
For this particular PR, what should a DataJob's basic fields and aspects be? We agreed that

  1. name, 2. orchestrator, and 3. fabric are the three basic fields to define a datajob urn.
    I also proposed the inputs & outputs of datasets as its basic aspects.

You also suggested the following:

  1. Namespace/Project in which the workflow and datajob exists
  2. Workflow Id - Workflow in which the datajob exists
  3. Job Id - Id of the data job

I think it makes sense to have those as optional aspects.
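
As a rough illustration (not the merged schema), here is how the three agreed key fields could compose into a urn string; the entity name `dataJob` and the field order are assumptions for discussion only.

```java
// Hypothetical composition of the three agreed key fields into a urn string.
// The entity name "dataJob" and the field order are assumptions, not the final schema.
public class DataJobUrnExample {
  public static void main(String[] args) {
    String name = "sqoop_nightly_import"; // job name
    String orchestrator = "airflow";      // workflow manager / scheduler
    String fabric = "PROD";               // environment / fabric
    String rawUrn = String.format("urn:li:dataJob:(%s,%s,%s)", name, orchestrator, fabric);
    System.out.println(rawUrn);           // urn:li:dataJob:(sqoop_nightly_import,airflow,PROD)
  }
}
```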

@hshahoss
Contributor

hshahoss commented May 20, 2020

@loftyet To unblock you for this PR, I suggest we rename this URN to be specific to the job manager/orchestrator that you are targeting instead of making it generic. We plan to define separate urns for different orchestrators (like separate AzkabanJobUrn and AirflowTaskUrn) in the future.

liangjun-jiang changed the title from "add DataJob Urn" to "add DataProcess Urn" on May 20, 2020
@liangjun-jiang
Contributor Author

After talking to @hshahoss, I renamed this entity to dataprocess. Apart from the job & flow we have discussed, dataprocess is also meant to address use cases such as:

  1. Azure Data Factory example: in this ADF example, the copy activity is the smallest unit of a pipeline. It comes with source, sink, and transform information, and we want to use dataprocess to track it.
  2. Sqoop example:

```bash
#!/bin/bash
TABLENAME=${1^^}   # first argument, upper-cased table name
HDFSPATH=${2}      # second argument: target HDFS directory
NOW=$(date +"%m-%d-%Y-%H-%M-%S")
sqoop import --connect jdbc:db2://mystsrem:60000/SCHEMA \
 --username username \
 --password-file password \
 --query "select * from ${TABLENAME} \$CONDITIONS" \
 -m 1 \
 --delete-target-dir \
 --target-dir ${HDFSPATH} \
 --fetch-size 30000 \
 --class-name ${TABLENAME} \
 --fields-terminated-by '\01' \
 --lines-terminated-by '\n' \
 --escaped-by '\' \
 --verbose &> logonly/${TABLENAME}_import_${NOW}.log
```
This is a `bash` script that moves data with `Sqoop`. We want to track this script with the `dataprocess` entity.

hshahoss merged commit 18ce1e1 into datahub-project:master on May 20, 2020
@mars-lan
Contributor

You'll most likely also need to add the typeref in a follow-up PR, e.g. https://github.com/linkedin/datahub/blob/master/li-utils/src/main/pegasus/com/linkedin/common/DatasetUrn.pdsc
