[ISSUE-163][FEATURE] Write to hdfs when local disk can't be write #235

Merged: 15 commits into apache:master, Oct 28, 2022

xianjingfeng
Member

@xianjingfeng xianjingfeng commented Sep 21, 2022

What changes were proposed in this pull request?

Write to HDFS when the local disk cannot be written to.

Why are the changes needed?

There should be a fallback mechanism when the local disk cannot be written to. See #163.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Already added

@codecov-commenter

codecov-commenter commented Sep 21, 2022

Codecov Report

Merging #235 (5d5767b) into master (47effb2) will decrease coverage by 0.71%.
The diff coverage is 69.44%.

@@             Coverage Diff              @@
##             master     #235      +/-   ##
============================================
- Coverage     59.71%   58.99%   -0.72%     
+ Complexity     1377     1336      -41     
============================================
  Files           166      166              
  Lines          8918     8570     -348     
  Branches        853      840      -13     
============================================
- Hits           5325     5056     -269     
+ Misses         3318     3233      -85     
- Partials        275      281       +6     
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...he/uniffle/server/storage/MultiStorageManager.java | 49.23% <48.83%> (+11.73%) | ⬆️ |
| ...er/storage/HdfsStorageManagerFallbackStrategy.java | 71.42% <71.42%> (ø) | |
| ...r/storage/LocalStorageManagerFallbackStrategy.java | 71.42% <71.42%> (ø) | |
| ...e/uniffle/server/storage/SingleStorageManager.java | 67.64% <71.42%> (+0.43%) | ⬆️ |
| ...torage/AbstractStorageManagerFallbackStrategy.java | 75.00% <75.00%> (ø) | |
| ...org/apache/uniffle/server/ShuffleFlushManager.java | 78.80% <100.00%> (+0.11%) | ⬆️ |
| ...a/org/apache/uniffle/server/ShuffleServerConf.java | 99.21% <100.00%> (+0.03%) | ⬆️ |
| .../storage/RotateStorageManagerFallbackStrategy.java | 100.00% <100.00%> (ø) | |
| ...ava/org/apache/uniffle/common/RssShuffleUtils.java | 0.00% <0.00%> (-95.66%) | ⬇️ |
| .../java/org/apache/hadoop/mapreduce/RssMRConfig.java | 23.07% <0.00%> (-51.93%) | ⬇️ |

... and 28 more


@xianjingfeng
Member Author

There are some flaky unit tests:

java.lang.ClassCastException: org.apache.spark.shuffle.RssShuffleManager cannot be cast to org.apache.uniffle.test.GetShuffleReportForMultiPartTest$RssShuffleManagerWrapper
    at org.apache.uniffle.test.GetShuffleReportForMultiPartTest.runTest(GetShuffleReportForMultiPartTest.java:180)
    at org.apache.uniffle.test.SparkIntegrationTestBase.runSparkApp(SparkIntegrationTestBase.java:74)
    at org.apache.uniffle.test.SparkIntegrationTestBase.run(SparkIntegrationTestBase.java:52)
    at org.apache.uniffle.test.GetShuffleReportForMultiPartTest.resultCompareTest(GetShuffleReportForMultiPartTest.java:141)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Error: Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 12.256 s <<< FAILURE! - in org.apache.uniffle.coordinator.LowestIOSampleCostSelectStorageStrategyTest
Error: selectStorageTest Time elapsed: 6.083 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <hdfs://p2> but was: <hdfs://p1>
    at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
    at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
    at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182)
    at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177)
    at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1141)
    at org.apache.uniffle.coordinator.LowestIOSampleCostSelectStorageStrategyTest.selectStorageTest(LowestIOSampleCostSelectStorageStrategyTest.java:133)

@zuston
Member

zuston commented Sep 26, 2022

I think this PR is a good improvement! We also need it to avoid the full-local-disk problem, although we don't want to enable writing big blocks directly to HDFS.

@zuston
Member

zuston commented Oct 10, 2022

Do you have time to invest in this PR? I hope it can be introduced into our company's internal version, and I look forward to it being merged ASAP. @xianjingfeng

# Conflicts:
#	server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java
@jerqi jerqi requested a review from zuston October 26, 2022 03:08
int nextIdx = -1;
for (int i = 0; i < candidates.length; i++) {
  if (current == candidates[i]) {
    nextIdx = (i + 1) % candidates.length;
Contributor

Could we merge these two loops into one loop?

Member Author

I think two loops are easier to understand and less error-prone, and there is no performance difference between the two approaches.

Contributor

I think the logic is simple, so one loop is enough. If you insist, I'm also OK with two loops.
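For readers comparing the two options discussed in this thread, here is a minimal sketch of both forms. It is not the PR's code: the second loop is not visible in the excerpt above, so the class name, method names, the `canWrite` predicate, and the "return null when nothing else is writable" behavior are all assumptions made for illustration.

```java
import java.util.function.Predicate;

// Hypothetical sketch: class and method names are illustrative, not from the PR.
final class RotateFallbackSketch {

  // Two-loop form: first find the slot after `current`, then probe the others.
  static <T> T nextWritableTwoLoops(T[] candidates, T current, Predicate<T> canWrite) {
    int n = candidates.length;
    int nextIdx = -1;
    for (int i = 0; i < n; i++) {
      if (current == candidates[i]) {
        nextIdx = (i + 1) % n;
        break;
      }
    }
    if (nextIdx < 0) {
      return null; // current is not one of the candidates
    }
    for (int i = 0; i < n - 1; i++) {
      T candidate = candidates[(nextIdx + i) % n];
      if (canWrite.test(candidate)) {
        return candidate;
      }
    }
    return null; // no other candidate is writable
  }

  // One-loop form: walk the array at most twice, probing only after `current` is seen.
  static <T> T nextWritableOneLoop(T[] candidates, T current, Predicate<T> canWrite) {
    int n = candidates.length;
    boolean seenCurrent = false;
    for (int i = 0; i < 2 * n; i++) {
      T candidate = candidates[i % n];
      if (!seenCurrent) {
        if (i >= n) {
          return null; // full pass without finding current
        }
        seenCurrent = (candidate == current);
      } else if (candidate == current) {
        return null; // wrapped around: no other candidate is writable
      } else if (canWrite.test(candidate)) {
        return candidate;
      }
    }
    return null;
  }
}
```

Both methods return the same result for the same inputs. The one-loop form saves a few lines at the cost of a state flag, which is roughly the trade-off the two reviewers are weighing.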

@jerqi jerqi changed the title Write to hdfs when local disk can't be write [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write Oct 27, 2022
public static final ConfigOption<String> MULTISTORAGE_FALLBACK_STRATEGY_CLASS = ConfigOptions
    .key("rss.server.multistorage.fallback.strategy.class")
    .stringType()
    .noDefaultValue()
Contributor

We should use the original behavior as the default value.
Could we also add some docs for this config option?
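As a rough illustration of the "default to the original behavior" suggestion, the sketch below resolves a strategy class from a plain map and falls back to a default class when the key is unset. Only the key string comes from the excerpt above. The `FallbackStrategy` interface, the `KeepOriginalBehavior` class, and the map-based lookup are hypothetical stand-ins, not Uniffle's `ConfigOptions` API or the default the PR finally chose.

```java
import java.util.Map;

// Hypothetical sketch: everything except the key string is an assumption.
final class FallbackConfigSketch {

  // Key taken from the config option shown in the review excerpt.
  static final String KEY = "rss.server.multistorage.fallback.strategy.class";

  interface FallbackStrategy {
    String name();
  }

  // Stand-in for "the original behavior"; what that is in Uniffle is not shown here.
  static final class KeepOriginalBehavior implements FallbackStrategy {
    public String name() {
      return "keep-original";
    }
  }

  // Default used when the key is absent, so existing deployments see no change.
  static final String DEFAULT_CLASS = KeepOriginalBehavior.class.getName();

  // Loads the configured strategy class by name through its no-arg constructor.
  static FallbackStrategy load(Map<String, String> conf) {
    String className = conf.getOrDefault(KEY, DEFAULT_CLASS);
    try {
      Class<?> clazz = Class.forName(className);
      return (FallbackStrategy) clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalStateException("Cannot load fallback strategy: " + className, e);
    }
  }
}
```

With an empty map, `load` returns the default strategy. An unknown class name fails fast with an `IllegalStateException` rather than silently disabling fallback.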

jerqi
jerqi previously approved these changes Oct 27, 2022
Contributor

@jerqi jerqi left a comment

LGTM. @zuston, do you have any other suggestions?

@jerqi
Contributor

jerqi commented Oct 27, 2022

Wait for CI

@jerqi
Contributor

jerqi commented Oct 28, 2022

@zuston Gentle ping.

Member

@zuston zuston left a comment

LGTM.

@zuston zuston merged commit 7d4428e into apache:master Oct 28, 2022
@zuston
Member

zuston commented Oct 28, 2022

Merged. @xianjingfeng Thanks for your contribution

CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
    event.getAppId(), event.getShuffleId(), event.getStartPartition());
storage = storageManager.selectStorage(event);
handler = storage.getOrCreateWriteHandler(request);
Member

@zuston zuston Nov 30, 2022

After reviewing this part, I find it a bit strange that it changes the external object references of storage and handler. This could confuse external callers such as ShuffleFlushManager.

@jerqi @xianjingfeng

Contributor

> After reviewing this part, I find it a bit strange that it changes the external object references of storage and handler. This could confuse external callers such as ShuffleFlushManager.
>
> @jerqi @xianjingfeng

We encapsulate the behavior. ShuffleFlushManager doesn't need to care.

Contributor

If you have a better solution, you can propose it.

Member

Currently I have no ideas. I just want to use this fallback strategy and review this part.

Contributor

> Currently I have no ideas. I just want to use this fallback strategy and review this part.

The flush process worries me a lot. I'm not satisfied with the code quality, but I have no idea how to improve it. Maybe @LuciferYang @advancedxy can help us.

Member

I have two directions to improve this part of MultiStorageManager:

  1. Remove the storageManagerCache. It looks unused and brings some problems. One possible problem I have seen is that when an event enters the pending queue because the storage cannot be written, it has no chance to get a new fallback strategy because of the cache.
  2. Avoid changing the external reference object in the write method. If we want to fall back to writing to another storage, we could use an invocation sequence like (selectStorage -> write) and, if that fails, (selectStorage -> write) again, instead of selectStorage -> write (failed) -> write (choosing another storage inside this method). That means the fallback strategy should be invoked from the selectStorage method instead of the write method.
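The second direction can be sketched as a caller-side loop in which selection is the only place a storage is chosen and write never swaps the storage itself. This is an illustration under stated assumptions, not the Uniffle API: the `Storage` interface, `FixedStorage`, the `failed` set used in place of a cache, and the string event are all hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of "(selectStorage -> write), and on failure (selectStorage -> write)".
final class SelectThenWriteSketch {

  interface Storage {
    String name();

    boolean write(String event); // returns false when the storage cannot be written
  }

  // Test double whose write outcome is fixed at construction time.
  static final class FixedStorage implements Storage {
    private final String name;
    private final boolean writable;

    FixedStorage(String name, boolean writable) {
      this.name = name;
      this.writable = writable;
    }

    public String name() {
      return name;
    }

    public boolean write(String event) {
      return writable;
    }
  }

  // Fallback lives here: pick the first storage that has not already failed for this event.
  static Storage selectStorage(List<Storage> storages, Set<Storage> failed) {
    for (Storage storage : storages) {
      if (!failed.contains(storage)) {
        return storage;
      }
    }
    return null;
  }

  // The caller always holds the storage it selected, so no reference changes behind its back.
  static String flush(List<Storage> storages, String event) {
    Set<Storage> failed = new HashSet<>();
    Storage storage = selectStorage(storages, failed);
    while (storage != null) {
      if (storage.write(event)) {
        return storage.name();
      }
      failed.add(storage); // remember the failure, then select again
      storage = selectStorage(storages, failed);
    }
    return null; // nothing writable; a real caller might re-queue the event
  }
}
```

Because each retry goes back through `selectStorage`, nothing is cached per event, which also sidesteps the pending-queue problem described in the first point.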

Contributor

> > Currently I have no ideas. I just want to use this fallback strategy and review this part.
>
> The flush process worries me a lot. I'm not satisfied with the code quality, but I have no idea how to improve it. Maybe @LuciferYang @advancedxy can help us.

@jerqi I'm just browsing the code. Let's discuss details later when I get more context.

Contributor

@jerqi I'm busy at the end of the year. I'll think about this on the weekend.

Member Author

> I have two directions to improve this part of MultiStorageManager:
>
>   1. Remove the storageManagerCache. It looks unused and brings some problems. One possible problem I have seen is that when an event enters the pending queue because the storage cannot be written, it has no chance to get a new fallback strategy because of the cache.
>   2. Avoid changing the external reference object in the write method. If we want to fall back to writing to another storage, we could use an invocation sequence like (selectStorage -> write) and, if that fails, (selectStorage -> write) again, instead of selectStorage -> write (failed) -> write (choosing another storage inside this method). That means the fallback strategy should be invoked from the selectStorage method instead of the write method.

Agree


6 participants