[ISSUE-163][FEATURE] Write to hdfs when local disk can't be write #235

xianjingfeng · 2022-09-21T10:24:21Z

What changes were proposed in this pull request?

Write to HDFS when the local disk can't be written to.

Why are the changes needed?

There should be a fallback mechanism for when the local disk can't be written to. See #163.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tests have already been added.

codecov-commenter · 2022-09-21T10:50:21Z

Codecov Report

Merging #235 (5d5767b) into master (47effb2) will decrease coverage by 0.71%.
The diff coverage is 69.44%.

@@             Coverage Diff              @@
##             master     #235      +/-   ##
============================================
- Coverage     59.71%   58.99%   -0.72%     
+ Complexity     1377     1336      -41     
============================================
  Files           166      166              
  Lines          8918     8570     -348     
  Branches        853      840      -13     
============================================
- Hits           5325     5056     -269     
+ Misses         3318     3233      -85     
- Partials        275      281       +6

Impacted Files	Coverage Δ
...he/uniffle/server/storage/MultiStorageManager.java	`49.23% <48.83%> (+11.73%)`	⬆️
...er/storage/HdfsStorageManagerFallbackStrategy.java	`71.42% <71.42%> (ø)`
...r/storage/LocalStorageManagerFallbackStrategy.java	`71.42% <71.42%> (ø)`
...e/uniffle/server/storage/SingleStorageManager.java	`67.64% <71.42%> (+0.43%)`	⬆️
...torage/AbstractStorageManagerFallbackStrategy.java	`75.00% <75.00%> (ø)`
...org/apache/uniffle/server/ShuffleFlushManager.java	`78.80% <100.00%> (+0.11%)`	⬆️
...a/org/apache/uniffle/server/ShuffleServerConf.java	`99.21% <100.00%> (+0.03%)`	⬆️
.../storage/RotateStorageManagerFallbackStrategy.java	`100.00% <100.00%> (ø)`
...ava/org/apache/uniffle/common/RssShuffleUtils.java	`0.00% <0.00%> (-95.66%)`	⬇️
.../java/org/apache/hadoop/mapreduce/RssMRConfig.java	`23.07% <0.00%> (-51.93%)`	⬇️
... and 28 more



xianjingfeng · 2022-09-22T08:45:45Z

There are some flaky unit tests:

java.lang.ClassCastException: org.apache.spark.shuffle.RssShuffleManager cannot be cast to org.apache.uniffle.test.GetShuffleReportForMultiPartTest$RssShuffleManagerWrapper
    at org.apache.uniffle.test.GetShuffleReportForMultiPartTest.runTest(GetShuffleReportForMultiPartTest.java:180)
    at org.apache.uniffle.test.SparkIntegrationTestBase.runSparkApp(SparkIntegrationTestBase.java:74)
    at org.apache.uniffle.test.SparkIntegrationTestBase.run(SparkIntegrationTestBase.java:52)
    at org.apache.uniffle.test.GetShuffleReportForMultiPartTest.resultCompareTest(GetShuffleReportForMultiPartTest.java:141)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Error: Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 12.256 s <<< FAILURE! - in org.apache.uniffle.coordinator.LowestIOSampleCostSelectStorageStrategyTest
Error: selectStorageTest Time elapsed: 6.083 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <hdfs://p2> but was: <hdfs://p1>
    at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
    at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
    at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182)
    at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177)
    at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1141)
    at org.apache.uniffle.coordinator.LowestIOSampleCostSelectStorageStrategyTest.selectStorageTest(LowestIOSampleCostSelectStorageStrategyTest.java:133)


zuston · 2022-09-26T02:14:11Z

I think this PR is a good improvement! We also need it to avoid the problem of a full local disk, although we don't plan to enable writing big blocks directly to HDFS.

zuston · 2022-10-10T09:59:02Z

Do you have time to work on this PR? I hope it can be introduced into our company's internal version, and I'm looking forward to it being merged ASAP. @xianjingfeng


jerqi · 2022-10-26T11:57:28Z

...r/src/main/java/org/apache/uniffle/server/storage/DefaultStorageManagerFallbackStrategy.java

+    int nextIdx = -1;
+    for (int i = 0; i < candidates.length; i++) {
+      if (current == candidates[i]) {
+        nextIdx = (i + 1) % candidates.length;


Could we merge these two loops into one loop?

xianjingfeng · 2022-10-26

I think two loops are easier to understand and less error-prone, and there is no performance difference between the two approaches.

jerqi · 2022-10-26

I think the logic is simple, so one loop is enough. If you insist, I'm also OK with two loops.
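For readers following the one-loop versus two-loop discussion, here is a minimal sketch of both variants. It is not the code from this PR: only the first loop (finding `nextIdx` after `current`) appears in the quoted diff. The second loop, the `usable` predicate, and all class and method names are assumptions made for illustration.

```java
import java.util.function.Predicate;

/** Hypothetical helper, not part of Uniffle: picks the next usable candidate after `current`. */
class NextCandidateSketch {

  /** Two loops: first locate `current`, then scan circularly starting from the element after it. */
  static <T> T nextUsableTwoLoops(T current, T[] candidates, Predicate<T> usable) {
    int nextIdx = 0; // assumption: start from the front when `current` is not a candidate
    for (int i = 0; i < candidates.length; i++) {
      if (current == candidates[i]) { // reference comparison, as in the quoted diff
        nextIdx = (i + 1) % candidates.length;
        break;
      }
    }
    for (int i = 0; i < candidates.length; i++) {
      T candidate = candidates[(nextIdx + i) % candidates.length];
      if (candidate != current && usable.test(candidate)) {
        return candidate;
      }
    }
    return null; // nothing usable besides `current`
  }

  /** One loop: remember the first usable candidate seen before `current` as the wrap-around choice. */
  static <T> T nextUsableOneLoop(T current, T[] candidates, Predicate<T> usable) {
    T wrapAround = null;
    boolean seenCurrent = false;
    for (T candidate : candidates) {
      if (candidate == current) {
        seenCurrent = true;
        continue;
      }
      if (!usable.test(candidate)) {
        continue;
      }
      if (seenCurrent) {
        return candidate; // first usable candidate after `current`
      }
      if (wrapAround == null) {
        wrapAround = candidate; // first usable candidate before `current`
      }
    }
    return wrapAround;
  }

  public static void main(String[] args) {
    String[] candidates = {"local", "hdfs", "other"};
    System.out.println(nextUsableTwoLoops("hdfs", candidates, c -> true)); // prints "other"
    System.out.println(nextUsableOneLoop("hdfs", candidates, c -> true));  // prints "other"
  }
}
```

Both variants visit each candidate at most a constant number of times, so the choice is about readability rather than performance, which matches the conclusion of the thread.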


jerqi · 2022-10-27T11:22:14Z

server/src/main/java/org/apache/uniffle/server/ShuffleServerConf.java

+  public static final ConfigOption<String> MULTISTORAGE_FALLBACK_STRATEGY_CLASS = ConfigOptions
+      .key("rss.server.multistorage.fallback.strategy.class")
+      .stringType()
+      .noDefaultValue()


We should make the original behavior the default value.
Could we also add some docs for this config option?
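To illustrate the request for a default, the following sketch shows one way a strategy class name taken from a config value could be resolved with a fallback to a default class. It is an assumption-laden illustration, not Uniffle's loading code: the interface, both strategy classes, and `StrategyLoaderSketch` are hypothetical names, and only standard Java reflection calls are used.

```java
/** Hypothetical strategy interface; not the interface used in Uniffle. */
interface FallbackStrategySketch {
  String describe();
}

/** Hypothetical default that stands in for "keep the original behavior". */
class KeepOriginalStrategySketch implements FallbackStrategySketch {
  public String describe() {
    return "keep-original";
  }
}

/** Hypothetical alternative strategy selected through configuration. */
class PreferHdfsStrategySketch implements FallbackStrategySketch {
  public String describe() {
    return "prefer-hdfs";
  }
}

class StrategyLoaderSketch {
  /** Used when the option is unset, so existing deployments keep their behavior. */
  static final String DEFAULT_CLASS = KeepOriginalStrategySketch.class.getName();

  /** Instantiates the configured class through its no-argument constructor. */
  static FallbackStrategySketch load(String configuredClassName) {
    String className = (configuredClassName == null || configuredClassName.trim().isEmpty())
        ? DEFAULT_CLASS
        : configuredClassName.trim();
    try {
      return Class.forName(className)
          .asSubclass(FallbackStrategySketch.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalArgumentException("Cannot load fallback strategy: " + className, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(load(null).describe()); // prints "keep-original"
    System.out.println(load(PreferHdfsStrategySketch.class.getName()).describe()); // prints "prefer-hdfs"
  }
}
```

Failing fast with a clear message on an unknown class name is a design choice of this sketch; a server could instead log a warning and fall back to the default.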

jerqi

LGTM. @zuston, do you have any other suggestions?

jerqi · 2022-10-27T12:56:36Z

Wait for CI

jerqi · 2022-10-28T07:10:48Z

@zuston Gentle ping.

zuston

LGTM.

zuston · 2022-10-28T07:38:00Z

Merged. @xianjingfeng Thanks for your contribution

zuston · 2022-11-30T06:56:11Z

server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java

+          CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
+              event.getAppId(), event.getShuffleId(), event.getStartPartition());
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);


After reviewing this part, I think it is a bit strange that this code changes the external object references of storage and handler. This could confuse external callers such as ShuffleFlushManager.

@jerqi @xianjingfeng

jerqi · 2022-11-30

We encapsulate the behavior; ShuffleFlushManager doesn't need to care.

jerqi · 2022-11-30

If you have a better solution, you can propose it.

zuston · 2022-11-30

Currently I have no ideas. I just want to use this fallback strategy and review this part.

jerqi · 2022-11-30

The flush process worries me a lot. I'm not satisfied with the code quality, but I don't know how to improve it. Maybe @LuciferYang and @advancedxy can help us.

zuston · 2022-12-01

I see two directions for improving this part of MultiStorageManager:

1. Remove the storageManagerCache. It looks unused and causes some problems. One possible problem I have seen: when an event enters the pending queue because its storage cannot be written to, it never gets a chance to go through the fallback strategy again, because of the cache.

2. Avoid changing the externally referenced objects in the write method. If we want to fall back to another storage, the invocation sequence could be (selectStorage -> write) and, if that fails, (selectStorage -> write) again, instead of selectStorage -> write (failed) -> write (choosing another storage inside this method). That means the fallback strategy should be invoked from the selectStorage method rather than from the write method.
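The second direction can be sketched as a retry loop in the caller, where every attempt goes through selection again and the write method never swaps the storage behind the caller's back. This is only an illustration of the proposed call sequence: `SketchStorage`, `FixedStorage`, and `SelectThenWriteSketch` are hypothetical and do not correspond to Uniffle classes, and the selection rule (skip storages that already failed) is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical storage abstraction; not Uniffle's Storage interface. */
interface SketchStorage {
  String name();

  /** Returns false when the write fails. */
  boolean write(String data);
}

/** Stand-in storage whose writes always succeed or always fail. */
class FixedStorage implements SketchStorage {
  private final String name;
  private final boolean writable;

  FixedStorage(String name, boolean writable) {
    this.name = name;
    this.writable = writable;
  }

  public String name() {
    return name;
  }

  public boolean write(String data) {
    return writable;
  }
}

class SelectThenWriteSketch {
  /** Fallback lives here: pick the first candidate that has not failed yet, or null. */
  static SketchStorage selectStorage(List<SketchStorage> candidates, Set<SketchStorage> failed) {
    for (SketchStorage candidate : candidates) {
      if (!failed.contains(candidate)) {
        return candidate;
      }
    }
    return null;
  }

  /** Repeats (selectStorage -> write) until a write succeeds or no candidate is left. */
  static String flush(String data, List<SketchStorage> candidates) {
    Set<SketchStorage> failed = new HashSet<>();
    SketchStorage storage;
    while ((storage = selectStorage(candidates, failed)) != null) {
      if (storage.write(data)) {
        return storage.name(); // the caller always knows which storage was used
      }
      failed.add(storage);
    }
    return null; // every candidate failed
  }

  public static void main(String[] args) {
    List<SketchStorage> candidates =
        Arrays.asList(new FixedStorage("local", false), new FixedStorage("hdfs", true));
    System.out.println(flush("block-1", candidates)); // prints "hdfs"
  }
}
```

Because the storage reference only changes between attempts, in the caller's own loop, a caller in the position of ShuffleFlushManager would always hold the storage that was actually written to.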

advancedxy · 2022-12-01

@jerqi I'm just browsing the code. Let's discuss the details later, when I have more context.

LuciferYang · 2022-12-01

@jerqi I'm busy at the end of the year. I'll think about this over the weekend.

xianjingfeng · 2022-12-01

Agree

Write to hdfs when local disk can't be write

4dccc6a

jerqi reviewed Sep 21, 2022


xianjingfeng added 3 commits September 22, 2022 10:13

Mark StorageManager in ShuffleDataFlushEvent

30e22e0

fix bug

34ce174

keep origin storageManager if appId not be registered

b42c54a

xianjingfeng closed this Sep 22, 2022

xianjingfeng reopened this Sep 22, 2022

xianjingfeng closed this Sep 22, 2022

xianjingfeng reopened this Sep 22, 2022

xianjingfeng closed this Sep 22, 2022

xianjingfeng reopened this Sep 22, 2022

xianjingfeng closed this Sep 22, 2022

xianjingfeng reopened this Sep 22, 2022

jerqi reviewed Sep 26, 2022


zuston reviewed Sep 26, 2022


xianjingfeng added 3 commits October 23, 2022 12:49

add StorageManager fallback strategy

b6be35e

Merge branch 'master' into discuss_163

a57e2bd

# Conflicts: # server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java

Modify ShuffleServerGrpcTest.registerTest

f267fba

xianjingfeng closed this Oct 23, 2022

xianjingfeng reopened this Oct 23, 2022

fix bug

d149d4a

xianjingfeng closed this Oct 23, 2022

xianjingfeng reopened this Oct 23, 2022

jerqi reviewed Oct 23, 2022


zuston reviewed Oct 24, 2022


Record the mapping information in storageManager

eaf01f3

jerqi reviewed Oct 26, 2022


jerqi reviewed Oct 26, 2022


jerqi requested a review from zuston October 26, 2022 03:08

jerqi reviewed Oct 26, 2022


jerqi reviewed Oct 26, 2022


jerqi reviewed Oct 26, 2022


xianjingfeng added 2 commits October 27, 2022 17:01

Remove argment request in `org.apache.uniffle.server.storage.StorageManager#write`

b176508

Add two more fallback strategy

f9635ee

jerqi changed the title ~~Write to hdfs when local disk can't be write~~ [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write Oct 27, 2022

jerqi reviewed Oct 27, 2022


xianjingfeng added 3 commits October 27, 2022 20:29

Set default strategy to HdfsStorageManagerFallbackStrategy

276b42c

optimize

de870b8

Add docs for rss.server.multistorage.fallback.strategy.class

615d336

jerqi previously approved these changes Oct 27, 2022


fix ut

5d5767b

xianjingfeng dismissed jerqi’s stale review via 5d5767b October 27, 2022 13:27

jerqi approved these changes Oct 27, 2022


zuston approved these changes Oct 28, 2022


zuston merged commit 7d4428e into apache:master Oct 28, 2022

zuston reviewed Nov 30, 2022


zuston mentioned this pull request Dec 1, 2022

[Improvement] Optimize data flushing and memory usage for huge partitions to improve stability #378

Closed

8 tasks

xianjingfeng mentioned this pull request Dec 2, 2022

[ISSUE-380] Refactor the flush process to fix fallback fail #383

Merged
