[#1165] improvement(tez): Unregister shuffle data after completing the execution of a DAG. #1166

zhuyaogai · 2023-08-22T08:59:57Z

What changes were proposed in this pull request?

An application typically executes multiple DAGs, and the DAGs are independent of one another. Once a DAG has finished executing, its shuffle data can therefore be unregistered. Otherwise, an application that runs many DAGs will occupy a large amount of resources for a long period of time.

Why are the changes needed?

Fix: #1165

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing test cases.

jerqi · 2023-08-22T09:03:51Z

@zhengchenyu Could you help review this PR?

jerqi · 2023-08-22T09:04:32Z

client-tez/src/main/java/org/apache/tez/dag/app/RssDAGAppMaster.java

+    // and there is no correlation between the DAGs.
+    // Therefore, after completing the execution of a DAG,
+    // you can unregister the relevant shuffle data.
+    tezRemoteShuffleManager.unregisterShuffle();


Is it possible that we run two DAGs concurrently?

From reading the source code, this situation cannot occur. You can refer to the following links.

https://github.com/apache/tez/blob/5beab4ced9d07bc813a8d79ded111b72af5a2f02/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1315

https://github.com/apache/tez/blob/5beab4ced9d07bc813a8d79ded111b72af5a2f02/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1337

Its mechanism seems to ensure that the DAGs are executed in sequence.

Should we register the app again after we unregister the app?

Even though only one DAG can run at a time today, we cannot be sure that Tez will not change this in a future version.
The shuffleId is composed of dagId, source vertexId, and vertexId, so when a DAG finishes we should unregister only the shuffle IDs owned by that DAG, not every shuffle ID owned by the app. I think this is more reasonable, since Tez may support parallel DAGs in the future.

@zhengchenyu Hi, thank you for your reply; I see what you mean. I chose a relatively simple implementation for now, but the same thing can also be achieved by listening for the DAG_FINISHED event. If you think the latter is better, I can make the change.
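
The per-DAG cleanup discussed above could look roughly like the sketch below. It is only an illustration and not the code in this PR: `DagShuffleRegistry`, `register`, `unregisterDag`, and `registeredDagCount` are hypothetical names, and plain integers stand in for the real dagId and shuffleId values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Uniffle or Tez): remembers which shuffle ids
// belong to which DAG, so finishing one DAG releases only that DAG's shuffles.
class DagShuffleRegistry {
  private final Map<Integer, List<Integer>> shuffleIdsByDag = new HashMap<>();

  // Record that shuffleId was registered on behalf of dagId.
  synchronized void register(int dagId, int shuffleId) {
    shuffleIdsByDag.computeIfAbsent(dagId, k -> new ArrayList<>()).add(shuffleId);
  }

  // Remove and return the shuffle ids owned by dagId. The caller would then
  // unregister each returned id; other DAGs' entries are left untouched.
  synchronized List<Integer> unregisterDag(int dagId) {
    List<Integer> removed = shuffleIdsByDag.remove(dagId);
    return removed == null ? Collections.<Integer>emptyList() : removed;
  }

  // Number of DAGs that still have registered shuffles.
  synchronized int registeredDagCount() {
    return shuffleIdsByDag.size();
  }
}
```

In this sketch, a handler for the DAG-finished notification would call `unregisterDag(dagId)` and release only the returned ids, which stays correct even if DAGs were ever run in parallel.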

jerqi

LGTM. Let @zhengchenyu take another look.

codecov-commenter · 2023-08-22T09:27:13Z

Codecov Report

Merging #1166 (5ed95e0) into master (680b4ea) will increase coverage by 1.08%.
Report is 4 commits behind head on master.
The diff coverage is 33.33%.

@@             Coverage Diff              @@
##             master    #1166      +/-   ##
============================================
+ Coverage     53.63%   54.71%   +1.08%     
- Complexity     2576     2581       +5     
============================================
  Files           391      371      -20     
  Lines         22359    20043    -2316     
  Branches       1875     1878       +3     
============================================
- Hits          11992    10967    -1025     
+ Misses         9660     8439    -1221     
+ Partials        707      637      -70

Files Changed	Coverage Δ
...rg/apache/tez/dag/app/TezRemoteShuffleManager.java	`62.14% <0.00%> (-4.79%)`	⬇️
...n/java/org/apache/tez/dag/app/RssDAGAppMaster.java	`47.73% <40.90%> (-0.57%)`	⬇️
...c/main/java/org/apache/tez/common/RssTezUtils.java	`63.96% <100.00%> (+0.83%)`	⬆️

... and 25 files with indirect coverage changes

Another suggestion.

zhuyaogai · 2023-08-24T02:04:35Z

@jerqi @zhengchenyu Hi, could you review my PR? Thanks!

zhengchenyu · 2023-08-24T02:20:04Z

client-tez/src/test/java/org/apache/tez/dag/app/RssDAGAppMasterTest.java

+    // 9 send INTERNAL_ERROR to dispatcher
+    dispatcher.getEventHandler().handle(new DAGEvent(dagImpl.getID(), DAGEventType.INTERNAL_ERROR));
+
+    // 10 wait DAGImpl transient to INITED state


There may be a typo in this comment. Should it be "wait DAGImpl transient to ERROR state"?

zhengchenyu · 2023-08-24T02:30:40Z

client-tez/src/main/java/org/apache/tez/dag/app/RssDAGAppMaster.java

+                    appMaster, (OnStateChangedCallback) callbackMap.get(finalState))));
+  }
+
+  static class DagFinalStateCallback implements OnStateChangedCallback<DAGState, DAGImpl> {


I think DagFinalStateCallback should inherit from DAGImpl::DagStateChangedCallback.
It is a pity that DAGImpl::DagStateChangedCallback is private, but we cannot ignore the logic in DAGImpl::DagStateChangedCallback.

https://github.com/apache/tez/blob/23b58b2b996eee255aab1a045412de00677ca2f1/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java#L576

new DagFinalStateCallback(appMaster, (OnStateChangedCallback) callbackMap.get(finalState))
DagFinalStateCallback has a field named callback, which is obtained from callbackMap, and that callback is invoked when the DAG state changes: callback.onStateChanged(dag, dagState);

OK, I missed it.
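
The delegation pattern described in this thread can be illustrated with the self-contained sketch below. It is not the PR's code: `StateChangedCallback` is a local stand-in for Tez's `OnStateChangedCallback`, `SketchDagState` stands in for `DAGState`, and `DelegatingFinalStateCallback` is a hypothetical name. Running the original callback before the cleanup step is an assumption of this sketch.

```java
// Local stand-in for the callback interface, so the sketch compiles on its own.
interface StateChangedCallback<S extends Enum<S>, O> {
  void onStateChanged(O operand, S state);
}

// Stand-in for the final DAG states; the names are illustrative only.
enum SketchDagState { SUCCEEDED, FAILED, KILLED, ERROR }

// Wraps the callback that was already registered for a final state, so the
// original logic is preserved and an extra cleanup step (for example,
// unregistering shuffle data) runs afterwards.
class DelegatingFinalStateCallback<O> implements StateChangedCallback<SketchDagState, O> {
  private final StateChangedCallback<SketchDagState, O> delegate;
  private final Runnable cleanup;

  DelegatingFinalStateCallback(
      StateChangedCallback<SketchDagState, O> delegate, Runnable cleanup) {
    this.delegate = delegate;
    this.cleanup = cleanup;
  }

  @Override
  public void onStateChanged(O operand, SketchDagState state) {
    // Keep the behavior of the original callback, if one was registered.
    if (delegate != null) {
      delegate.onStateChanged(operand, state);
    }
    // Then perform the additional work tied to the final state.
    cleanup.run();
  }
}
```

Because the wrapper holds the original callback as a field, it does not need to inherit from a private class in order to keep that class's logic.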

jerqi

LGTM, thanks @zhuyaogai @zhengchenyu. Merged to master.

[Improvement] (tez): Unregister shuffle data after completing the execution of a DAG. (c670049)

jerqi changed the title ~~[Improvement] (tez): Unregister shuffle data after completing the execution of a DAG.~~ [#1165] improvement(tez): Unregister shuffle data after completing the execution of a DAG. Aug 22, 2023

jerqi requested a review from zhengchenyu August 22, 2023 09:03

jerqi reviewed Aug 22, 2023

jerqi previously approved these changes Aug 22, 2023

[Improvement] (tez): Unregister shuffle data after completing the execution of a DAG. (a5dd9f7)

zhuyaogai requested a review from jerqi August 23, 2023 03:19

zhuyaogai added 4 commits August 23, 2023 11:43

[Improvement] (tez): Unregister shuffle data after completing the execution of a DAG. (5ed95e0)

[Improvement] (tez): Unregister shuffle data after completing the execution of a DAG. (57e8b5b)

[Improvement] (tez): Unregister shuffle data after completing the execution of a DAG. (c7555f6)

[Improvement] (tez): Unregister shuffle data after completing the execution of a DAG. (d89b386)

zhengchenyu requested changes Aug 24, 2023

zhuyaogai requested a review from zhengchenyu August 24, 2023 16:02

zhengchenyu approved these changes Aug 25, 2023

jerqi approved these changes Aug 25, 2023

jerqi merged commit d26da9c into apache:master Aug 25, 2023
32 checks passed
