Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#1011] feat(tez): Avoid recompute succeeded task. #1033

Merged
merged 2 commits into from
Jul 31, 2023

Conversation

zhengchenyu
Copy link
Collaborator

What changes were proposed in this pull request?

Avoid recompute succeeded task. Detailed information see #1011
Here only 2.a, 2.b, 2.c is solved. 2.d will not be solved in this PR.

Why are the changes needed?

Fix: #1011

Does this PR introduce any user-facing change?

New config 'rss.avoid.recompute.succeeded.task' was introduced, default value is false. If set to true, we won't recompute the succeed task when the reason of recompute is about node failed.

Note: I suggest change default value to true after running on production cluster some months.

How was this patch tested?

integration test and test in yarn cluster.

@jerqi jerqi requested a review from lifeSo July 24, 2023 10:58
@jerqi
Copy link
Contributor

jerqi commented Jul 24, 2023

@lifeSo @bin41215 Could you help me review this pr?

@jerqi
Copy link
Contributor

jerqi commented Jul 24, 2023

Could we add some documents if we add new config?

@bin41215
Copy link
Contributor

@lifeSo @bin41215 Could you help me review this pr?

ok

@zhengchenyu
Copy link
Collaborator Author

Could we add some documents if we add new config?

@jerqi The config is just a switch. True is enable this function, False is disable this function. It seems no document which is used to describe tez configuration.

@jerqi
Copy link
Contributor

jerqi commented Jul 25, 2023

Could we add some documents if we add new config?

@jerqi The config is just a switch. True is enable this function, False is disable this function. It seems no document which is used to describe tez configuration.

We can add it to this document. If the document is too big, we can also separate it.
https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md

@zhengchenyu
Copy link
Collaborator Author

Could we add some documents if we add new config?

@jerqi The config is just a switch. True is enable this function, False is disable this function. It seems no document which is used to describe tez configuration.

We can add it to this document. If the document is too big, we can also separate it. https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md

I doubt that whether this config should be introduced to Common Setting in client_guide.md or not.
Because I think avoid recompute succeeded task attempt should be a default function for remote shuffle service.
Why do I introduce this config? The reason is that this function has not run in production for enough months. So I prefer disable in default. After several month, I prefer enable in default or delete this config.

@jerqi
Copy link
Contributor

jerqi commented Jul 25, 2023

Could we add some documents if we add new config?

@jerqi The config is just a switch. True is enable this function, False is disable this function. It seems no document which is used to describe tez configuration.

We can add it to this document. If the document is too big, we can also separate it. https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md

I doubt that whether this config should be introduced to Common Setting in client_guide.md or not. Because I think avoid recompute succeeded task attempt should be a default function for remote shuffle service. Why do I introduce this config? The reason is that this function has not run in production for enough months. So I prefer disable in default. After several month, I prefer enable in default or delete this config.

This document describe all the config options which we used. You can see there are many
dedicated Spark‘s config options. If we have time, we can separate this document into common_client.md, spark_client.md, mr_client.md, tez_client.md.

@codecov-commenter
Copy link

codecov-commenter commented Jul 26, 2023

Codecov Report

Merging #1033 (5fdda17) into master (ecfed5e) will increase coverage by 1.37%.
Report is 2 commits behind head on master.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##             master    #1033      +/-   ##
============================================
+ Coverage     54.10%   55.47%   +1.37%     
- Complexity     2547     2550       +3     
============================================
  Files           382      362      -20     
  Lines         21728    19405    -2323     
  Branches       1802     1809       +7     
============================================
- Hits          11755    10765     -990     
+ Misses         9269     8011    -1258     
+ Partials        704      629      -75     
Files Changed Coverage Δ
.../main/java/org/apache/tez/common/RssTezConfig.java 92.85% <ø> (ø)
...n/java/org/apache/tez/dag/app/RssDAGAppMaster.java 26.80% <0.00%> (-3.63%) ⬇️

... and 32 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@jerqi jerqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, merged to master. Thanks @zhengchenyu @bin41215

@jerqi jerqi merged commit 1701d06 into apache:master Jul 31, 2023
30 checks passed
@zhengchenyu zhengchenyu deleted the 1011 branch August 8, 2023 02:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] [tez] Avoid recompute succeeded task.
4 participants