[#1165] improvement(tez): Unregister shuffle data after completing the execution of a DAG. #1166

Merged 6 commits on Aug 25, 2023

zhuyaogai
Contributor

What changes were proposed in this pull request?

An application typically executes multiple DAGs, and the DAGs are independent of one another. The shuffle data belonging to a DAG can therefore be unregistered as soon as that DAG finishes. Otherwise, an application that runs many DAGs occupies a large amount of resources for a long time.

Why are the changes needed?

Fix: #1165

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing test cases.

@jerqi jerqi changed the title from "[Improvement] (tez): Unregister shuffle data after completing the execution of a DAG." to "[#1165] improvement(tez): Unregister shuffle data after completing the execution of a DAG." on Aug 22, 2023
@jerqi jerqi requested a review from zhengchenyu August 22, 2023 09:03
Contributor

jerqi commented Aug 22, 2023

@zhengchenyu Could you help review this PR?

// and there is no correlation between the DAGs.
// Therefore, after completing the execution of a DAG,
// you can unregister the relevant shuffle data.
tezRemoteShuffleManager.unregisterShuffle();
Contributor

Is it possible that we run two DAGs concurrently?

Contributor Author

@zhuyaogai zhuyaogai Aug 22, 2023

From reading the source code, this situation does not occur. Please refer to the following links:

  1. https://github.com/apache/tez/blob/5beab4ced9d07bc813a8d79ded111b72af5a2f02/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1315
  2. https://github.com/apache/tez/blob/5beab4ced9d07bc813a8d79ded111b72af5a2f02/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1337

Its mechanism seems to ensure that the DAGs are executed in sequence.

Contributor

Should we register the app again after we unregister the app?

Collaborator

Even though only one DAG can run at a time today, we cannot be sure that Tez will keep this behavior in future versions.
The shuffleId is composed of the dagId, the source vertexId, and the vertexId, so when a DAG finishes we should unregister only the shuffle IDs owned by that DAG, not every shuffle ID owned by the app. I think this is more reasonable. Tez may support parallel DAGs in the future.
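
The per-DAG cleanup suggested here could be sketched roughly as follows. This is only an illustration, not the code in this PR or in Uniffle: the DagShuffleRegistry class, its method names, and the bit layout used to pack the three IDs into one int are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: records which shuffle IDs belong to which DAG so that
// finishing one DAG releases only that DAG's shuffle data.
class DagShuffleRegistry {
  private final Map<Integer, Set<Integer>> shuffleIdsByDag = new HashMap<>();

  // Assumed layout for this sketch only: 11 bits for the DAG ID and
  // 10 bits for each vertex ID, which keeps the result non-negative.
  static int computeShuffleId(int dagId, int sourceVertexId, int vertexId) {
    if (dagId < 0 || dagId >= (1 << 11)
        || sourceVertexId < 0 || sourceVertexId >= (1 << 10)
        || vertexId < 0 || vertexId >= (1 << 10)) {
      throw new IllegalArgumentException("ID out of range for this sketch's layout");
    }
    return (dagId << 20) | (sourceVertexId << 10) | vertexId;
  }

  // Registers one shuffle (an edge between two vertices) under its DAG.
  synchronized int register(int dagId, int sourceVertexId, int vertexId) {
    int shuffleId = computeShuffleId(dagId, sourceVertexId, vertexId);
    shuffleIdsByDag.computeIfAbsent(dagId, k -> new HashSet<>()).add(shuffleId);
    return shuffleId;
  }

  // Removes and returns the shuffle IDs owned by the finished DAG.
  // Shuffle IDs that belong to other DAGs are left untouched.
  synchronized Set<Integer> unregisterDag(int dagId) {
    Set<Integer> removed = shuffleIdsByDag.remove(dagId);
    return removed == null ? new HashSet<>() : removed;
  }

  // Total number of shuffle IDs still registered across all DAGs.
  synchronized int registeredCount() {
    int count = 0;
    for (Set<Integer> ids : shuffleIdsByDag.values()) {
      count += ids.size();
    }
    return count;
  }
}

class DagShuffleRegistryDemo {
  public static void main(String[] args) {
    DagShuffleRegistry registry = new DagShuffleRegistry();
    registry.register(1, 0, 1);
    registry.register(1, 1, 2);
    registry.register(2, 0, 1);
    // DAG 1 finishes: only its shuffle IDs are released.
    Set<Integer> released = registry.unregisterDag(1);
    System.out.println("released=" + released.size()
        + " remaining=" + registry.registeredCount());
  }
}
```

Keying the bookkeeping by DAG ID means the cleanup stays correct even if two DAGs were ever to run concurrently, which is the concern raised in this comment.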

Contributor Author

@zhengchenyu Hi, thank you for your reply; I see what you mean. I chose a relatively simple implementation for now, but the same result can also be achieved by listening for the DAG_FINISHED event. If you think the latter is better, I can make the change.

jerqi previously approved these changes Aug 22, 2023
Contributor

@jerqi jerqi left a comment

LGTM. Let @zhengchenyu take another look.


codecov-commenter commented Aug 22, 2023

Codecov Report

Merging #1166 (5ed95e0) into master (680b4ea) will increase coverage by 1.08%.
Report is 4 commits behind head on master.
The diff coverage is 33.33%.

@@             Coverage Diff              @@
##             master    #1166      +/-   ##
============================================
+ Coverage     53.63%   54.71%   +1.08%     
- Complexity     2576     2581       +5     
============================================
  Files           391      371      -20     
  Lines         22359    20043    -2316     
  Branches       1875     1878       +3     
============================================
- Hits          11992    10967    -1025     
+ Misses         9660     8439    -1221     
+ Partials        707      637      -70     
Files Changed                                            Coverage Δ
...rg/apache/tez/dag/app/TezRemoteShuffleManager.java    62.14% <0.00%> (-4.79%) ⬇️
...n/java/org/apache/tez/dag/app/RssDAGAppMaster.java    47.73% <40.90%> (-0.57%) ⬇️
...c/main/java/org/apache/tez/common/RssTezUtils.java    63.96% <100.00%> (+0.83%) ⬆️

... and 25 files with indirect coverage changes

@jerqi jerqi dismissed their stale review August 22, 2023 10:59

Another suggestion.

@zhuyaogai zhuyaogai requested a review from jerqi August 23, 2023 03:19
@zhuyaogai
Contributor Author

@jerqi @zhengchenyu Hi, could you review my PR? Thanks!

// 9 send INTERNAL_ERROR to dispatcher
dispatcher.getEventHandler().handle(new DAGEvent(dagImpl.getID(), DAGEventType.INTERNAL_ERROR));

// 10 wait DAGImpl transient to INITED state
Collaborator

There may be a mistake in this comment. Should it be "wait DAGImpl transient to ERROR state"?

appMaster, (OnStateChangedCallback) callbackMap.get(finalState))));
}

static class DagFinalStateCallback implements OnStateChangedCallback<DAGState, DAGImpl> {
Collaborator

I think DagFinalStateCallback should inherit from DAGImpl::DagStateChangedCallback.
It is a pity that DAGImpl::DagStateChangedCallback is private, but we cannot ignore the logic in DAGImpl::DagStateChangedCallback.

https://github.com/apache/tez/blob/23b58b2b996eee255aab1a045412de00677ca2f1/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java#L576

Contributor Author

new DagFinalStateCallback(appMaster, (OnStateChangedCallback) callbackMap.get(finalState))
DagFinalStateCallback has a field named callback, which is obtained from callbackMap. That callback is invoked when the DAG state changes: callback.onStateChanged(dag, dagState);
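
The delegation described here can be illustrated with a small wrapper. This is a sketch under assumptions, not the code from this PR: StateCallback is a hypothetical stand-in loosely modeled on the OnStateChangedCallback type quoted above, and CleanupAfterCallback and its cleanup hook are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for a state-change callback interface.
interface StateCallback<S, T> {
  void onStateChanged(T operand, S state);
}

// Wraps an existing callback so that its logic is preserved:
// the original callback runs first, then the extra cleanup step runs.
class CleanupAfterCallback<S, T> implements StateCallback<S, T> {
  private final StateCallback<S, T> delegate; // may be null if nothing was registered
  private final Runnable cleanup;

  CleanupAfterCallback(StateCallback<S, T> delegate, Runnable cleanup) {
    this.delegate = delegate;
    this.cleanup = Objects.requireNonNull(cleanup);
  }

  @Override
  public void onStateChanged(T operand, S state) {
    try {
      if (delegate != null) {
        delegate.onStateChanged(operand, state);
      }
    } finally {
      // Design choice in this sketch: run cleanup even if the delegate throws.
      cleanup.run();
    }
  }
}

class CleanupAfterCallbackDemo {
  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    StateCallback<String, String> original =
        (dag, state) -> log.add("original callback saw " + state);
    StateCallback<String, String> wrapped =
        new CleanupAfterCallback<>(original, () -> log.add("shuffle data unregistered"));
    wrapped.onStateChanged("dag_1", "SUCCEEDED");
    log.forEach(System.out::println);
  }
}
```

Wrapping rather than subclassing sidesteps the fact that the original callback class is private, while still running its logic.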

Collaborator

OK, I missed it.

Contributor

@jerqi jerqi left a comment

LGTM, thanks @zhuyaogai @zhengchenyu, merged to master.

@jerqi jerqi merged commit d26da9c into apache:master Aug 25, 2023
32 checks passed
Development

Successfully merging this pull request may close these issues.

[Improvement] (tez): Unregister shuffle data after completing the execution of a DAG.
4 participants