[Experimental Feature] MR Supports Remote Spill #55

Merged
merged 4 commits into from
Jul 15, 2022


frankliee
Contributor

@frankliee frankliee commented Jul 14, 2022

What changes were proposed in this pull request?

Rewrite MapReduce's MergerManager to spill sorted segments to HDFS.
It returns a merge-sorted iterator that reads these HDFS segments.
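
To illustrate what a merge-sorted iterator over several sorted segments does, here is a minimal, self-contained sketch. It is not the code from this PR: `MergeSortedIterator`, `Head`, and `mergeAll` are hypothetical names, and in-memory lists stand in for the sorted segments that would be read back from HDFS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

/** Hypothetical sketch: lazily merges already-sorted segments into one sorted stream. */
final class MergeSortedIterator<T extends Comparable<T>> implements Iterator<T> {

    /** Pairs the current head record of a segment with the rest of that segment. */
    private static final class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    // Min-heap ordered by each segment's current head record.
    private final PriorityQueue<Head<T>> heap =
        new PriorityQueue<>((a, b) -> a.value.compareTo(b.value));

    MergeSortedIterator(List<Iterator<T>> sortedSegments) {
        // Seed the heap with the first record of every non-empty segment.
        for (Iterator<T> segment : sortedSegments) {
            if (segment.hasNext()) {
                heap.add(new Head<>(segment.next(), segment));
            }
        }
    }

    @Override
    public boolean hasNext() {
        return !heap.isEmpty();
    }

    @Override
    public T next() {
        Head<T> smallest = heap.poll();
        if (smallest == null) {
            throw new NoSuchElementException();
        }
        // Refill the heap from the segment that just produced the smallest record.
        if (smallest.rest.hasNext()) {
            heap.add(new Head<>(smallest.rest.next(), smallest.rest));
        }
        return smallest.value;
    }

    /** Convenience helper for the demo: drains the merged stream into a list. */
    static <E extends Comparable<E>> List<E> mergeAll(List<List<E>> segments) {
        List<Iterator<E>> iterators = new ArrayList<>();
        for (List<E> segment : segments) {
            iterators.add(segment.iterator());
        }
        List<E> merged = new ArrayList<>();
        new MergeSortedIterator<>(iterators).forEachRemaining(merged::add);
        return merged;
    }

    public static void main(String[] args) {
        // Each inner list stands in for one sorted segment spilled to remote storage.
        List<List<Integer>> segments = Arrays.asList(
            Arrays.asList(1, 4, 9),
            Arrays.asList(2, 3, 10),
            new ArrayList<Integer>());
        System.out.println(mergeAll(segments)); // prints [1, 2, 3, 4, 9, 10]
    }
}
```

Only one record per segment is held in the heap at a time, so memory use grows with the number of segments rather than with the total amount of spilled data.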

Why are the changes needed?

In cloud environments, machines may have very limited disk space and disk performance.
This PR allows data to be spilled to remote storage (e.g., HDFS).

Does this PR introduce any user-facing change?

Yes.

| Property Name | Default | Description |
| --- | --- | --- |
| mapreduce.rss.reduce.remote.spill.enable | false | Whether to use remote spill |
| mapreduce.rss.reduce.remote.spill.attempt.inc | 1 | Number by which to increase the reduce attempts, since HDFS spills fail more easily than local-disk spills |
| mapreduce.rss.reduce.remote.spill.replication | 1 | Replication factor for data spilled to HDFS |
| mapreduce.rss.reduce.remote.spill.retries | 5 | Number of retries when spilling data to HDFS |
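
As a rough illustration of how the properties above and their defaults fit together, the following self-contained sketch parses them from a plain map. It is not the PR's code: `RemoteSpillSettings`, `fromMap`, and `effectiveMaxAttempts` are hypothetical names, a `Map<String, String>` stands in for the job configuration, and the way `attempt.inc` is added to the base attempt count is an assumption based only on the property description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for the four remote-spill properties and their documented defaults. */
final class RemoteSpillSettings {
    static final String ENABLE = "mapreduce.rss.reduce.remote.spill.enable";
    static final String ATTEMPT_INC = "mapreduce.rss.reduce.remote.spill.attempt.inc";
    static final String REPLICATION = "mapreduce.rss.reduce.remote.spill.replication";
    static final String RETRIES = "mapreduce.rss.reduce.remote.spill.retries";

    final boolean enabled;
    final int attemptInc;
    final int replication;
    final int retries;

    private RemoteSpillSettings(boolean enabled, int attemptInc, int replication, int retries) {
        this.enabled = enabled;
        this.attemptInc = attemptInc;
        this.replication = replication;
        this.retries = retries;
    }

    /** Reads the properties from a plain map, falling back to the defaults in the table. */
    static RemoteSpillSettings fromMap(Map<String, String> conf) {
        return new RemoteSpillSettings(
            Boolean.parseBoolean(conf.getOrDefault(ENABLE, "false")),
            Integer.parseInt(conf.getOrDefault(ATTEMPT_INC, "1")),
            Integer.parseInt(conf.getOrDefault(REPLICATION, "1")),
            Integer.parseInt(conf.getOrDefault(RETRIES, "5")));
    }

    /**
     * Assumption: when remote spill is enabled, the increment is added to the
     * job's base maximum number of reduce attempts; otherwise the base is unchanged.
     */
    int effectiveMaxAttempts(int baseMaxAttempts) {
        return enabled ? baseMaxAttempts + attemptInc : baseMaxAttempts;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(ENABLE, "true");      // opt in; the feature is off by default
        conf.put(REPLICATION, "2");    // override one default, keep the others

        RemoteSpillSettings settings = fromMap(conf);
        System.out.println("enabled=" + settings.enabled
            + " attemptInc=" + settings.attemptInc
            + " replication=" + settings.replication
            + " retries=" + settings.retries);
        System.out.println("maxAttempts=" + settings.effectiveMaxAttempts(4));
    }
}
```

In a real job these keys would be set on the job configuration (for example in the job's configuration files or on the command line) rather than in a map; the sketch only shows the names and defaults side by side.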

How was this patch tested?

New unit tests (UT) and integration tests (IT) with remote spill.

Co-authored-by: roryqi roryqi@tencent.com

@codecov-commenter

codecov-commenter commented Jul 14, 2022

Codecov Report

Merging #55 (08077bb) into master (aa02ee6) will increase coverage by 0.67%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master      #55      +/-   ##
============================================
+ Coverage     54.89%   55.56%   +0.67%     
+ Complexity     1092      991     -101     
============================================
  Files           146      135      -11     
  Lines          7775     6736    -1039     
  Branches        749      647     -102     
============================================
- Hits           4268     3743     -525     
+ Misses         3270     2782     -488     
+ Partials        237      211      -26     
Impacted Files Coverage Δ
...storage/handler/impl/DataSkippableReadHandler.java 81.25% <0.00%> (-3.13%) ⬇️
.../java/org/apache/hadoop/mapreduce/RssMRConfig.java
...n/java/org/apache/hadoop/mapreduce/RssMRUtils.java
...pache/hadoop/mapreduce/task/reduce/RssShuffle.java
...apache/hadoop/mapreduce/v2/app/RssMRAppMaster.java
...rg/apache/hadoop/mapred/RssMapOutputCollector.java
.../hadoop/mapreduce/task/reduce/RssEventFetcher.java
.../hadoop/mapreduce/task/reduce/RssBypassWriter.java
...g/apache/hadoop/mapred/SortWriteBufferManager.java
...pache/hadoop/mapreduce/task/reduce/RssFetcher.java
... and 4 more


@jerqi
Contributor

jerqi commented Jul 14, 2022

What changes were proposed in this pull request?

Rewrite MapReduce's MergerManager to spill sorted segments to HDFS. It returns a merge-sorted iterator that reads these HDFS segments.

Why are the changes needed?

In cloud environments, machines may have very limited disk space and disk performance. This PR allows data to be spilled to remote storage (e.g., HDFS).

Does this PR introduce any user-facing change?

Yes. rss.reduce.remote.spill.enable (default false)

How was this patch tested?

New UT and IT with remote spill.

Co-authored-by: roryqi roryqi@tencent.com

Because this PR introduces a user-facing change, we should update the doc.
We should also provide performance test results.

frankliee and others added 2 commits July 15, 2022 16:17
Add RssInMemoryMerger

We need to write in-memory data to HDFS

Yes

UT

Co-authored-by: roryqi <roryqi@tencent.com>
@jerqi
Contributor

jerqi commented Jul 15, 2022

Please update your description and the documentation. This PR introduces another configuration option.

@jerqi
Contributor

jerqi commented Jul 15, 2022

LGTM except for the PR's description and documentation.

@frankliee
Contributor Author

Please update your description and the documentation. This PR introduces another configuration option.

Doc is updated

Contributor

@jerqi jerqi left a comment

LGTM

@frankliee frankliee changed the title [Feature][MR] Support remote spill [Experimental Feature] MR Supports Remote Spill Jul 15, 2022
@frankliee frankliee merged commit f4ce2ed into apache:master Jul 15, 2022
3 participants