Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#706] Implement spill method to avoid memory deadlock #714

Merged
merged 3 commits into from
Mar 28, 2023

Conversation

zuston
Copy link
Member

@zuston zuston commented Mar 13, 2023

What changes were proposed in this pull request?

  1. Introduce the DataPusher to replace the eventLoop, this could be as general part for spark2 and spark3.
  2. Implement the spill method in WriterBufferManager to avoid memory deadlock.

Why are the changes needed?

In current codebase, if having several WriterBufferManagers, when each other is acquiring memory, the deadlock will happen. To solve this, we should implement spill function to break this deadlock condition.

Fix: #706

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  1. Existing UTs
  2. Newly added UTs

@zuston zuston marked this pull request as draft March 13, 2023 09:19
@codecov-commenter
Copy link

codecov-commenter commented Mar 13, 2023

Codecov Report

Merging #714 (d959e35) into master (3f9ba81) will increase coverage by 2.32%.
The diff coverage is 79.66%.

@@             Coverage Diff              @@
##             master     #714      +/-   ##
============================================
+ Coverage     60.95%   63.27%   +2.32%     
- Complexity     1956     1984      +28     
============================================
  Files           244      231      -13     
  Lines         13308    11456    -1852     
  Branches       1119     1125       +6     
============================================
- Hits           8112     7249     -863     
+ Misses         4740     3804     -936     
+ Partials        456      403      -53     
Impacted Files Coverage Δ
...va/org/apache/uniffle/common/util/ThreadUtils.java 50.00% <44.44%> (-5.56%) ⬇️
...va/org/apache/spark/shuffle/writer/DataPusher.java 62.50% <62.50%> (ø)
...pache/spark/shuffle/writer/WriteBufferManager.java 85.78% <91.83%> (+2.69%) ⬆️
.../java/org/apache/spark/shuffle/RssSparkConfig.java 98.01% <100.00%> (+0.09%) ⬆️
...org/apache/spark/shuffle/writer/AddBlockEvent.java 94.11% <100.00%> (+94.11%) ⬆️

... and 15 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@zuston zuston marked this pull request as ready for review March 14, 2023 06:32
@zuston
Copy link
Member Author

zuston commented Mar 14, 2023

@jerqi PTAL. I have updated the spark3 code, if the design is OK, I will change the spark2 related code.

@zuston
Copy link
Member Author

zuston commented Mar 14, 2023

cc @advancedxy @smallzhongfeng

@advancedxy
Copy link
Contributor

I did a quick through of the code, I didn't see any serious problem about this problem. I will look it in details later tonight.

However, do you think it's worth a design doc to address problems in #706?

@zuston
Copy link
Member Author

zuston commented Mar 14, 2023

However, do you think it's worth a design doc to address problems in #706?

It looks unnecessary, the optimization is not a big change for current codebase.

@zuston
Copy link
Member Author

zuston commented Mar 14, 2023

I did a quick through of the code, I didn't see any serious problem about this problem. I will look it in details later tonight.

Looking forward this.

@advancedxy
Copy link
Contributor

However, do you think it's worth a design doc to address problems in #706?

It looks unnecessary, the optimization is not a big change for current codebase.

OK. Let me do some code review first. Let's decide that later. Even if a design doc is not necessary, some docs might be necessary.

Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some review, this PR may require more work, especially with UTs and some integration tests....

@zuston
Copy link
Member Author

zuston commented Mar 16, 2023

PTAL again @smallzhongfeng @advancedxy . Thanks

@advancedxy
Copy link
Contributor

the check style looks is still failed, please help address that.
I will take a look later.

Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version looks in a good shape. I will also review it later this weekend

import org.apache.uniffle.common.ShuffleBlockInfo;
import org.apache.uniffle.common.util.ThreadUtils;

public class DataPusher implements Closeable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could DataPusher be used for MapReduce?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen this part code.

@zuston zuston requested a review from advancedxy March 20, 2023 11:25
@zuston
Copy link
Member Author

zuston commented Mar 21, 2023

PTAL again @advancedxy

Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally lgtm, except some minor comments

Copy link
Contributor

@smallzhongfeng smallzhongfeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for some minor comment.

@zuston
Copy link
Member Author

zuston commented Mar 23, 2023

PTAL again @smallzhongfeng @advancedxy

BTW, this PR has been applied in our internal uniffle, it works well.

advancedxy
advancedxy previously approved these changes Mar 27, 2023
@zuston
Copy link
Member Author

zuston commented Mar 28, 2023

Just rebase the latest master. cc @advancedxy @smallzhongfeng

@zuston zuston requested a review from advancedxy March 28, 2023 06:52
advancedxy
advancedxy previously approved these changes Mar 28, 2023
Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But it seems that you need to rebase again.

@zuston
Copy link
Member Author

zuston commented Mar 28, 2023

LGTM. But it seems that you need to rebase again.

Pity. Add done @advancedxy

@zuston zuston merged commit 1b48c12 into apache:master Mar 28, 2023
xianjingfeng pushed a commit to xianjingfeng/incubator-uniffle that referenced this pull request Apr 5, 2023
…che#714)

### What changes were proposed in this pull request?
1.  Introduce the `DataPusher` to replace the `eventLoop`, this could be as general part for spark2 and spark3.
2. Implement the `spill` method in `WriterBufferManager` to avoid memory deadlock.

### Why are the changes needed?
In current codebase, if having several `WriterBufferManagers`, when each other is acquiring memory, the deadlock will happen. To solve this, we should implement spill function to break this deadlock condition.

Fix: apache#706 

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
1. Existing UTs
2. Newly added UTs
jerqi added a commit to jerqi/incubator-uniffle that referenced this pull request Apr 11, 2023
jerqi added a commit to jerqi/incubator-uniffle that referenced this pull request Apr 11, 2023
jerqi pushed a commit that referenced this pull request Apr 26, 2023
### What changes were proposed in this pull request?

Disable the memory spill operation.

### Why are the changes needed?

In #714 , the memory spill is introduced to solve the dead lock. For a pity,
these part code should be handled carefully, including concurrency and data
consistency, like the fix PR #811 . And this part has bugs and I will fix these in
the next days. 

Currently, I want to revert the PR #714. But the partial refactor of #714 is still meaningful. So
I submit this PR to disable the memory spill.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Don't need
zuston pushed a commit that referenced this pull request Mar 7, 2024
…sure data correctness (#1558)

### What changes were proposed in this pull request?

Verify the number of written records to enhance data accuracy.
Make sure all data records are sent by clients.
Make sure bugs like #714 will never be introduced into the code.

### Why are the changes needed?

A follow-up PR for #848.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Improvement] Fine-grained memory control for spark client
5 participants