
add DataProcess Urn #1673

Merged
merged 3 commits into master on May 20, 2020

Conversation

liangjun-jiang
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)
    Refer to this issue for the details.


public static DataJobUrn deserialize(String rawUrn) throws URISyntaxException {
  return createFromString(rawUrn);
}
Contributor Author

Hasn't that coercer been removed? I looked at the code base; all the coercer files have been removed.

Contributor Author

Oh, I see. It is not removed, but combined.

Contributor

Yes, there was an issue loading the non-embedded coercers correctly, so we combined them all now.

keremsahin1 requested a review from hshahoss on May 18, 2020 16:50
@hshahoss
Contributor

@loftyet - Thanks for proposing the PR.

To clarify, this DataJob urn represents the definition of a data processing job and not an executing instance. We will need to define a separate urn for an instance of a DataJob.

I looked into Azkaban and Airflow (popular workflow managers) and the way they represent a datajob includes the following:

  1. Namespace/Project in which the workflow and datajob exists
  2. Workflow Id - Workflow in which the datajob exists
  3. Job Id - Id of the data job

So we have two options for defining this urn:

  1. Define a urn which includes the three parts above (namespace, workflow_id and job_id)
  2. Define individual workflow manager urns and make the DataJob urn a union of those individual urns.

My preference is to go with Option 2, since we can define specific urns for each workflow manager/scheduler according to their internal workings and not have to modify this later if the way these systems represent a job changes. I can help define the urns for Azkaban and Airflow.
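
For concreteness, here is a minimal, self-contained sketch of what one orchestrator-specific urn under Option 2 could look like. The class name `AzkabanJobUrn`, the entity name `azkabanJob`, and the string layout are assumptions for illustration only; an actual DataHub urn class would extend `com.linkedin.common.urn.Urn` and follow its conventions, and the DataJob urn would then be defined as a union over such orchestrator-specific urns.

```java
import java.net.URISyntaxException;

// Rough sketch only: a self-contained, orchestrator-specific urn for Option 2.
// Class name, entity name and string layout are illustrative assumptions.
public final class AzkabanJobUrn {
  private static final String PREFIX = "urn:li:azkabanJob:";

  private final String namespace; // Azkaban project the workflow lives in
  private final String flowId;    // workflow that contains the job
  private final String jobId;     // id of the job itself

  public AzkabanJobUrn(String namespace, String flowId, String jobId) {
    this.namespace = namespace;
    this.flowId = flowId;
    this.jobId = jobId;
  }

  @Override
  public String toString() {
    return PREFIX + "(" + namespace + "," + flowId + "," + jobId + ")";
  }

  // Parses e.g. "urn:li:azkabanJob:(myProject,myFlow,myJob)"
  public static AzkabanJobUrn deserialize(String rawUrn) throws URISyntaxException {
    if (!rawUrn.startsWith(PREFIX + "(") || !rawUrn.endsWith(")")) {
      throw new URISyntaxException(rawUrn, "not an azkabanJob urn");
    }
    String[] parts = rawUrn.substring(PREFIX.length() + 1, rawUrn.length() - 1).split(",");
    if (parts.length != 3) {
      throw new URISyntaxException(rawUrn, "expected (namespace,flowId,jobId)");
    }
    return new AzkabanJobUrn(parts[0], parts[1], parts[2]);
  }
}
```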

What do you think?

@mars-lan and @keremsahin1 let me know your thoughts on the above as well.

@liangjun-jiang
Contributor Author

liangjun-jiang commented May 19, 2020

@hshahoss yes. The DataJob defined here is for data processing only.

I am only thinking about the data processing job right now. As for the instance of a job (the flow), I am still thinking about it.

To me, the value of onboarding a data processing job is that the ETL scripts and the data transformation tool can be reused. The flow could be started, stopped, and destroyed.

Dataset A is transformed into Dataset B by processing job AB. Do we really need to log how job AB is triggered?

@clojurians-org do you have some thoughts about the job urn's key fields? I understand you are working on Airflow job extraction.

@hshahoss
Contributor

I think we should first clarify the terminology here and make sure we are talking about the same concepts. Current Airflow/Azkaban terminology is like this:

  1. DataFlow = Workflow = DAG = a graph of tasks/jobs. This includes the static metadata/definition of a flow.
  2. DataJob = Job = Task = a single unit of execution which is part of a workflow. This includes the static metadata/definition of a job.
  3. DataFlowInstance = a running/executing instance of a DataFlow. This includes runtime information about an execution of a flow, like status, execution time, etc.
  4. DataJobInstance = a running/executing instance of a DataJob. This includes runtime information about an execution of a job, like status, execution time, etc.

I see all the four above as entities in DataHub.
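
Purely as a reading aid, here is a hypothetical Java sketch of how these four entities relate; the class and field names below are assumptions, not DataHub's actual model.

```java
import java.time.Instant;
import java.util.List;

// Purely illustrative: static definitions vs. runtime instances.
record DataFlow(String namespace, String flowId, List<String> jobIds) {}                  // DAG definition
record DataJob(String namespace, String flowId, String jobId) {}                          // single task definition
record DataFlowInstance(DataFlow flow, String runId, String status, Instant startedAt) {} // one execution of a flow
record DataJobInstance(DataJob job, String runId, String status, Instant startedAt) {}    // one execution of a job
```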

@loftyet Let me create and share an example of what I am talking about, and then we can take it further.

@liangjun-jiang
Contributor Author

liangjun-jiang commented May 20, 2020

@hshahoss I updated my previous comment a little bit. It was written in a rush. I agree with your definition of the 4 entities.
However, of the 4 entities you listed, I can only see that 1 & 2 are useful. 3 & 4 are transient, and in my opinion tracking them doesn't really bring that much value.
For example, if I repeatedly schedule a DataFlowInstance or DataJobInstance on a daily basis, why do I need to keep track of each run's status & execution time?
For the sake of understanding, I am looking forward to seeing the 4 examples.
For this particular PR, what should a DataJob's basic fields and aspects be? We agreed that

  1. name, 2. orchestrator, and 3. fabric are the three basic fields to define a datajob urn.
    I also proposed the inputs & outputs of datasets as its basic aspects.

You also suggested the following:

  1. Namespace/Project in which the workflow and datajob exists
  2. Workflow Id - Workflow in which the datajob exists
  3. Job Id - Id of the data job

I think it makes sense to have those as optional aspects.
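
As a rough illustration (not the merged schema), here is how the three agreed key fields could compose into a urn string; the entity name `dataJob` and the field order are assumptions for discussion only.

```java
// Hypothetical composition of the three agreed key fields into a urn string.
// The entity name "dataJob" and the field order are assumptions, not the final schema.
public class DataJobUrnExample {
  public static void main(String[] args) {
    String name = "sqoop_nightly_import"; // job name
    String orchestrator = "airflow";      // workflow manager / scheduler
    String fabric = "PROD";               // environment / fabric
    String rawUrn = String.format("urn:li:dataJob:(%s,%s,%s)", name, orchestrator, fabric);
    System.out.println(rawUrn);           // urn:li:dataJob:(sqoop_nightly_import,airflow,PROD)
  }
}
```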

@hshahoss
Contributor

hshahoss commented May 20, 2020

@loftyet To unblock you for this PR, I suggest we rename this URN to be specific to the job manager/orchestrator that you are targeting instead of making it generic. We plan to define separate urns for different orchestrators (like separate AzkabanJobUrn and AirflowTaskUrn) in the future.

liangjun-jiang changed the title from "add DataJob Urn" to "add DataProcess Urn" on May 20, 2020
@liangjun-jiang
Contributor Author

After talking to @hshahoss, I renamed this entity to dataprocess. Apart from the job & flow we have discussed, dataprocess is also meant to address use cases such as:

  1. Azure Data Factory example: in this ADF example, the copy activity is the smallest unit of a pipeline. It comes with source, sink, and transform information, and we want to use dataprocess to track it.
  2. Sqoop example:

```bash
#!/bin/bash
TABLENAME=${1^^}   # first argument, upper-cased table name
HDFSPATH=${2}      # second argument: target HDFS directory
NOW=$(date +"%m-%d-%Y-%H-%M-%S")
sqoop import --connect jdbc:db2://mystsrem:60000/SCHEMA \
 --username username \
 --password-file password \
 --query "select * from ${TABLENAME} \$CONDITIONS" \
 -m 1 \
 --delete-target-dir \
 --target-dir ${HDFSPATH} \
 --fetch-size 30000 \
 --class-name ${TABLENAME} \
 --fields-terminated-by '\01' \
 --lines-terminated-by '\n' \
 --escaped-by '\' \
 --verbose &> logonly/${TABLENAME}_import_${NOW}.log
```
This is a `bash` script that moves data with `Sqoop`. We want to track this script with the `dataprocess` entity.

hshahoss merged commit 18ce1e1 into datahub-project:master on May 20, 2020
@mars-lan
Contributor

You'll most likely also need to add the typeref in a follow-up PR, e.g. https://github.com/linkedin/datahub/blob/master/li-utils/src/main/pegasus/com/linkedin/common/DatasetUrn.pdsc
