
Introduce data cleanup mechanism on stage level #249

Merged
merged 8 commits into apache:master from unregisterShuffle
Oct 12, 2022

Conversation

@zuston (Member) commented Sep 28, 2022

What changes were proposed in this pull request?

Introduce data cleanup mechanism on stage level

Why are the changes needed?

This PR optimizes disk capacity usage. For example:

  1. Some Spark ML jobs run multiple stages and leave large amounts of unused shuffle data on the shuffle servers.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UTs

@codecov-commenter commented Sep 28, 2022

Codecov Report

Merging #249 (e24fbbd) into master (fc6d49b) will decrease coverage by 0.94%.
The diff coverage is 55.02%.

@@             Coverage Diff              @@
##             master     #249      +/-   ##
============================================
- Coverage     60.11%   59.17%   -0.95%     
- Complexity     1217     1369     +152     
============================================
  Files           150      166      +16     
  Lines          7602     8974    +1372     
  Branches        718      853     +135     
============================================
+ Hits           4570     5310     +740     
- Misses         2791     3390     +599     
- Partials        241      274      +33     
Impacted Files Coverage Δ
...e/uniffle/client/factory/ShuffleClientFactory.java 0.00% <0.00%> (ø)
...pache/uniffle/server/ShuffleServerGrpcService.java 0.87% <0.00%> (-0.03%) ⬇️
...he/uniffle/server/storage/MultiStorageManager.java 37.50% <0.00%> (ø)
...storage/handler/impl/HdfsShuffleDeleteHandler.java 0.00% <0.00%> (ø)
...e/storage/handler/impl/LocalFileDeleteHandler.java 0.00% <0.00%> (ø)
...he/uniffle/client/impl/ShuffleWriteClientImpl.java 21.23% <9.67%> (-1.75%) ⬇️
.../org/apache/uniffle/server/ShuffleTaskManager.java 76.80% <78.94%> (-0.16%) ⬇️
...he/uniffle/server/buffer/ShuffleBufferManager.java 83.33% <86.36%> (+1.17%) ⬆️
...va/org/apache/uniffle/server/event/PurgeEvent.java 88.88% <88.88%> (ø)
.../java/org/apache/spark/shuffle/RssSparkConfig.java 96.77% <100.00%> (+0.10%) ⬆️
... and 23 more


@zuston zuston requested a review from jerqi September 28, 2022 08:57
@zuston (Member, Author) commented Sep 28, 2022

PTAL @jerqi. The unregister is currently scoped to Spark; MR will be supported in follow-up commits.

@jerqi (Contributor) commented Sep 28, 2022

> PTAL @jerqi. The unregister is currently scoped to Spark; MR will be supported in follow-up commits.

Why does MR need to clean up shuffle data at the stage level? MR doesn't have multiple stages.

@zuston (Member, Author) commented Sep 28, 2022

> > PTAL @jerqi. The unregister is currently scoped to Spark; MR will be supported in follow-up commits.
>
> Why does MR need to clean up shuffle data at the stage level? MR doesn't have multiple stages.

Oh yes. My fault.

@zuston zuston requested a review from jerqi September 29, 2022 01:52
@zuston (Member, Author) commented Sep 30, 2022

Gentle ping @jerqi. If you have time, could you help review this? I should have some time for quick fixes during the National Day holiday.

@jerqi (Contributor) commented Oct 8, 2022

I will take a look at this PR ASAP.

try {
ExecutorService executorService =
Executors.newFixedThreadPool(
Math.min(10, shuffleServerInfoSet.size()),
@jerqi (Contributor) commented on the snippet above:

Why do we use 10?

@zuston (Member, Author):

Emm, maybe we should use Runtime.getRuntime().availableProcessors(). I have no strong opinion on this.

@zuston (Member, Author):

10 is a value based on experience. The request is lightweight, so I think 10 threads are enough. There is no need to introduce an extra config option, which would increase the burden of understanding for users.

@jerqi (Contributor):

Em... we had better use a configuration option.

@zuston (Member, Author):

OK. I will extract a config entry.

@jerqi (Contributor) commented Oct 8, 2022

One question: why do we add the concept of PurgeEvent? What's the advantage? Could we just add a shuffle-level deletion method to StorageManager?

@zuston (Member, Author) commented Oct 8, 2022

> One question: why do we add the concept of PurgeEvent? What's the advantage? Could we just add a shuffle-level deletion method to StorageManager?

Two reasons:

  1. It lets clearResourceThread clean up both kinds of events: stage-level and app-level data.
  2. I don't want to introduce an extra delete method in StorageManager, because most of the deletion logic is the same.
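The event-based design described above can be sketched roughly as follows. This is an illustrative sketch, not the actual Uniffle implementation: the class shapes, constructors, and the `describe` helper are my assumptions about how a single cleanup thread could dispatch on event granularity.

```java
import java.util.List;

// Sketch of the PurgeEvent idea: one event type carries enough context for the
// single cleanup thread to purge either a whole app or individual shuffles.
abstract class PurgeEvent {
    private final String appId;

    PurgeEvent(String appId) {
        this.appId = appId;
    }

    String getAppId() {
        return appId;
    }
}

// App-level purge: remove every piece of shuffle data the application owns.
class AppPurgeEvent extends PurgeEvent {
    AppPurgeEvent(String appId) {
        super(appId);
    }
}

// Shuffle-level (stage-level) purge: remove only the listed shuffle ids.
class ShufflePurgeEvent extends PurgeEvent {
    private final List<Integer> shuffleIds;

    ShufflePurgeEvent(String appId, List<Integer> shuffleIds) {
        super(appId);
        this.shuffleIds = shuffleIds;
    }

    List<Integer> getShuffleIds() {
        return shuffleIds;
    }
}

public class PurgeEventSketch {
    // A single cleanup thread can dispatch on the event type instead of the
    // StorageManager growing one delete method per granularity.
    static String describe(PurgeEvent event) {
        if (event instanceof ShufflePurgeEvent) {
            return "purge shuffles " + ((ShufflePurgeEvent) event).getShuffleIds()
                + " of app " + event.getAppId();
        }
        return "purge all data of app " + event.getAppId();
    }

    public static void main(String[] args) {
        System.out.println(describe(new AppPurgeEvent("app-1")));
        System.out.println(describe(new ShufflePurgeEvent("app-1", List.of(0, 1))));
    }
}
```

This mirrors the rationale given above: both event kinds flow through one queue and one consumer, and only the final deletion step needs to look at the granularity.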

@zuston zuston requested a review from jerqi October 9, 2022 02:02
Math.min(10, shuffleServerInfoSet.size()),
ThreadUtils.getThreadFactory("unregister-shuffle-%d")
);
List<Future<Void>> futures = executorService.invokeAll(callableList, 10, TimeUnit.SECONDS);
@jerqi (Contributor) commented on the snippet above:

10 seconds? Should it be a configuration option?

@zuston (Member, Author):

Yes.

@zuston (Member, Author):

This is the default timeout. Considering the request is lightweight, I think 10s is enough. There is no need to introduce an extra config option, which would increase the burden of understanding for users.

@jerqi (Contributor) commented Oct 11, 2022:

Sometimes the timeout may not be enough; we should give a default value according to our experience. Usually we can't imagine all the situations users will run into.

@zuston (Member, Author):

ditto
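The pattern debated in these two review threads — a small bounded pool plus a bounded overall wait via `invokeAll` — can be sketched as below. The task body is a stand-in for the real unregister RPC, the `Math.max(1, ...)` guard against an empty server set is my addition, and the values 10 threads / 10 seconds are the defaults discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class UnregisterSketch {

    // One lightweight "unregister" request per server, with a bounded pool and
    // a bounded overall wait, mirroring the snippet under review. Returns the
    // number of requests that completed within the timeout.
    static int unregisterAll(List<String> servers, int maxThreads, long timeoutSec)
            throws InterruptedException {
        // Guard against an empty server set: newFixedThreadPool requires >= 1 thread.
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, Math.min(maxThreads, servers.size())));
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (String server : servers) {
                tasks.add(() -> {
                    // Stand-in for the real unregister RPC sent to one server.
                    return null;
                });
            }
            // invokeAll cancels any task still unfinished when the timeout expires.
            List<Future<Void>> futures = pool.invokeAll(tasks, timeoutSec, TimeUnit.SECONDS);
            int completed = 0;
            for (Future<Void> f : futures) {
                if (!f.isCancelled()) {
                    completed++;
                }
            }
            return completed;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(unregisterAll(List.of("server-a", "server-b", "server-c"), 10, 10)
            + " of 3 unregister requests completed within the timeout");
    }
}
```

Note the trade-off the reviewers landed on: the bound keeps a slow or dead shuffle server from blocking job teardown indefinitely, while making both knobs configurable covers deployments where the defaults are too tight.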

@zuston zuston requested a review from jerqi October 9, 2022 08:49
@zuston (Member, Author) commented Oct 10, 2022

Updated.

@jerqi (Contributor) commented Oct 11, 2022

> > One question: why do we add the concept of PurgeEvent? What's the advantage? Could we just add a shuffle-level deletion method to StorageManager?
>
> Two reasons:
>
> 1. It lets clearResourceThread clean up both kinds of events: stage-level and app-level data.
> 2. I don't want to introduce an extra delete method in StorageManager, because most of the deletion logic is the same.

Actually we still add a method, removeResourcesByShuffleIds. Let me think it over.

@zuston (Member, Author) commented Oct 11, 2022

Changelog of the latest commit:

  1. Extracted config entries for the unregister thread pool size and request timeout (seconds):

spark.rss.client.unregister.request.timeout.sec and spark.rss.client.unregister.thread.pool.size
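A minimal sketch of how such entries might be read with fallback defaults. The lookup helper below is hypothetical (not the actual RssSparkConfig API), and the default values of 10 come from the review discussion above.

```java
import java.util.Map;

public class UnregisterConfigSketch {
    // The two entries extracted in this PR.
    static final String TIMEOUT_KEY = "spark.rss.client.unregister.request.timeout.sec";
    static final String POOL_SIZE_KEY = "spark.rss.client.unregister.thread.pool.size";

    // Hypothetical lookup helper: fall back to a default when the key is unset.
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        // Only the pool size is overridden; the timeout falls back to its default.
        Map<String, String> conf = Map.of(POOL_SIZE_KEY, "16");
        System.out.println("timeoutSec=" + getInt(conf, TIMEOUT_KEY, 10)
            + ", poolSize=" + getInt(conf, POOL_SIZE_KEY, 10));
    }
}
```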

@zuston (Member, Author) commented Oct 12, 2022

Gentle ping @jerqi

@jerqi (Contributor) left a review comment:

LGTM, thanks @zuston

@jerqi jerqi merged commit 1ee6820 into apache:master Oct 12, 2022
@zuston zuston deleted the unregisterShuffle branch October 12, 2022 08:59
@zuston (Member, Author) commented Oct 12, 2022

Thanks for your review @jerqi

kaijchen added a commit to kaijchen/incubator-uniffle that referenced this pull request Feb 7, 2023
kaijchen added a commit that referenced this pull request Feb 8, 2023
### What changes were proposed in this pull request?

Followup #249. Cleanup code and unify interfaces.

### Why are the changes needed?

Cleanup code and unify interfaces.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.
@jerqi jerqi mentioned this pull request Feb 28, 2023