Support writing multi files of single partition to improve speed in HDFS storage #396

Merged
merged 6 commits into from
Dec 13, 2022

zuston
Member

@zuston zuston commented Dec 9, 2022

What changes were proposed in this pull request?

  1. Introduce the PooledHdfsShuffleWriteHandler to support writing a single partition to multiple HDFS files concurrently.
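
To illustrate the pooling idea, here is a minimal sketch, not the PR's actual code. It assumes a fixed number of per-file writers that callers borrow and return, so concurrent flushes of one partition land in different files. `PooledWriterSketch`, `FileStub`, and the `_<index>` file naming are hypothetical, and each "file" is an in-memory list so the example runs without HDFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a pooled write handler for one partition.
class PooledWriterSketch {
    // Stand-in for one HDFS file; blocks are kept in memory for the example.
    static final class FileStub {
        final String name;
        final List<byte[]> blocks = new ArrayList<>();
        FileStub(String name) { this.name = name; }
    }

    private final List<FileStub> files = new ArrayList<>();
    private final BlockingQueue<FileStub> idle = new LinkedBlockingQueue<>();

    PooledWriterSketch(String fileNamePrefix, int concurrency) {
        for (int i = 0; i < concurrency; i++) {
            // Assumed naming scheme: prefix plus an index, one file per pool slot.
            FileStub f = new FileStub(fileNamePrefix + "_" + i);
            files.add(f);
            idle.add(f);
        }
    }

    // Borrow an idle file, append the block, then return the file to the pool.
    // A caller blocks only when every file is currently being written.
    void write(byte[] block) {
        FileStub f;
        try {
            f = idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for an idle file", e);
        }
        try {
            f.blocks.add(block);
        } finally {
            idle.add(f);
        }
    }

    int fileCount() { return files.size(); }

    // Call only after all writers have finished.
    int totalBlocks() {
        int n = 0;
        for (FileStub f : files) {
            n += f.blocks.size();
        }
        return n;
    }
}
```

Because a file is held by at most one caller at a time, no per-file locking is needed in this sketch; the queue alone serializes access to each file.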

Why are the changes needed?

As mentioned in #378 (comment), writing to HDFS is slow, and a single partition's data cannot be written concurrently. This is especially harmful when a huge partition exists: it slows down other apps, because little memory is left for them.

Improving the write speed is therefore important for flushing a huge partition to HDFS quickly.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

  1. UTs

@zuston zuston changed the title Support writing multi files of single partition to improve speed in HDFS storage [WIP] Support writing multi files of single partition to improve speed in HDFS storage Dec 9, 2022
int endPartition,
String storageBasePath,
String fileNamePrefix,
Configuration hadoopConf,
Contributor

Maybe we can get the concurrency parameter from hadoopConf. We can pass the hadoopConf from our client, which would be more flexible.
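
A minimal sketch of that suggestion, under stated assumptions: Hadoop's `Configuration` provides `getInt(name, defaultValue)`, but to keep the example runnable without Hadoop on the classpath, `java.util.Properties` stands in for it here. The key name `rss.example.hdfs.write.concurrency` and the class `ConcurrencyFromConf` are hypothetical, not taken from the PR.

```java
import java.util.Properties;

// Hypothetical sketch: resolve the write concurrency from a key-value conf.
class ConcurrencyFromConf {
    // Assumed key name for illustration only.
    static final String KEY = "rss.example.hdfs.write.concurrency";
    static final int DEFAULT_CONCURRENCY = 1;

    // Returns the configured concurrency, or the default when the key is
    // missing, not a number, or not positive.
    static int resolve(Properties conf) {
        String raw = conf.getProperty(KEY);
        if (raw == null) {
            return DEFAULT_CONCURRENCY;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_CONCURRENCY;
        } catch (NumberFormatException e) {
            return DEFAULT_CONCURRENCY;
        }
    }
}
```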

Member Author

Yes, although it's a little bit strange.

Member Author

I have added a TODO comment to support this in the future.

Contributor

Could you create issues for these TODOs?

Member Author

@zuston zuston Dec 12, 2022

Yes. I will do that after this PR is merged.

@zuston zuston changed the title [WIP] Support writing multi files of single partition to improve speed in HDFS storage Support writing multi files of single partition to improve speed in HDFS storage Dec 12, 2022
@codecov-commenter

codecov-commenter commented Dec 12, 2022

Codecov Report

Merging #396 (1b47add) into master (3ec3f41) will decrease coverage by 0.22%.
The diff coverage is 25.00%.

@@             Coverage Diff              @@
##             master     #396      +/-   ##
============================================
- Coverage     58.77%   58.54%   -0.23%     
- Complexity     1602     1607       +5     
============================================
  Files           193      195       +2     
  Lines         10939    11021      +82     
  Branches        955      963       +8     
============================================
+ Hits           6429     6452      +23     
- Misses         4132     4193      +61     
+ Partials        378      376       -2     
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...org/apache/uniffle/storage/common/HdfsStorage.java | 0.00% <0.00%> (ø) | |
| .../storage/handler/impl/HdfsShuffleWriteHandler.java | 87.09% <ø> (ø) | |
| ...le/storage/handler/impl/LocalFileWriteHandler.java | 75.51% <ø> (ø) | |
| ...ge/handler/impl/PooledHdfsShuffleWriteHandler.java | 0.00% <0.00%> (ø) | |
| ...rage/request/CreateShuffleWriteHandlerRequest.java | 71.42% <80.00%> (+0.59%) | ⬆️ |
| ...org/apache/uniffle/server/ShuffleFlushManager.java | 82.90% <100.00%> (+4.89%) | ⬆️ |
| ...a/org/apache/uniffle/server/ShuffleServerConf.java | 99.24% <100.00%> (+0.01%) | ⬆️ |
| ...torage/handler/impl/AbstractClientReadHandler.java | 12.00% <0.00%> (-8.00%) | ⬇️ |
| ...che/uniffle/client/impl/ShuffleReadClientImpl.java | 88.46% <0.00%> (ø) | |

... and 6 more

@advancedxy
Contributor

Are there any read-side changes that should be applied for this PR?

And, if possible, could you add an integration test for spark3 with the concurrent writer enabled?

@zuston
Member Author

zuston commented Dec 12, 2022

> Are there any read-side changes that should be applied for this PR?

No compatibility change is needed in the read client: the original logic already handles multiple files per partition, because file names carry a retry-count prefix.

> And, if possible, could you add an integration test for spark3 with the concurrent writer enabled?

OK. But the spark client test may not be accurate, because it's hard to control the concurrent writes. I think the integration test is enough and accurate.

Contributor

@advancedxy advancedxy left a comment

The overall logic LGTM; I left some minor comments.

Contributor

@advancedxy advancedxy left a comment

LGTM

@zuston
Member Author

zuston commented Dec 13, 2022

Do you have other comments? @jerqi

Contributor

@jerqi jerqi left a comment

LGTM

@jerqi jerqi merged commit d3aa5dc into apache:master Dec 13, 2022
zuston added a commit that referenced this pull request Jan 3, 2023
… distribute pressure (#452)

### What changes were proposed in this pull request?
[Improvement] Read HDFS data files with random sequence to distribute pressure #452

### Why are the changes needed?
PR #396 added support for concurrently writing a single partition's data into multiple HDFS files. With multiple files per partition, it's better for clients to read the HDFS data files in random order, which distributes the read pressure.
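
A minimal sketch of the random read order, assuming the client already holds the list of data files for a partition. `RandomReadOrder` and `shuffled` are hypothetical names, not the PR's code; the shuffle itself uses the standard `Collections.shuffle`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: each reader visits the data files in its own random order,
// so concurrent readers are less likely to hit the same file at the same time.
class RandomReadOrder {
    // Returns a shuffled copy and leaves the input list untouched.
    static List<String> shuffled(List<String> dataFiles, Random random) {
        List<String> copy = new ArrayList<>(dataFiles);
        Collections.shuffle(copy, random);
        return copy;
    }
}
```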

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs
zuston added a commit that referenced this pull request Apr 27, 2023
…cy to write in client side (#815)

### What changes were proposed in this pull request?

1. Support specifying the per-partition max write concurrency on the client side

### Why are the changes needed?

PR #396 introduced concurrent HDFS writing for one partition, 
but the concurrency is determined on the server side. To increase flexibility, 
this PR supports specifying the per-partition max write concurrency on the client side.

### Does this PR introduce _any_ user-facing change?

Yes. The confs `<client_type>.rss.client.max.concurrency.per-partition.write` and `rss.server.client.max.concurrency.limit.per-partition.write` are introduced.
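
One way the two confs could interact is sketched below. This is an assumption based only on their names: the client requests a concurrency and the server caps it at its limit. The commit message does not spell out the exact rule, and `EffectiveConcurrency` is a hypothetical helper.

```java
// Hypothetical sketch: cap the client-requested concurrency at the server limit.
class EffectiveConcurrency {
    // Assumed rule: never below 1, never above the server-side limit.
    static int effective(int clientRequested, int serverLimit) {
        return Math.max(1, Math.min(clientRequested, serverLimit));
    }
}
```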

### How was this patch tested?
1. UTs