Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#477] feat(spark): support getShuffleResult throws FetchFailedException. #1004

Merged
merged 3 commits into from
Jul 20, 2023

Conversation

leixm
Copy link
Contributor

@leixm leixm commented Jul 11, 2023

What changes were proposed in this pull request?

feat #477

Why are the changes needed?

support getShuffleResult throws FetchFailedException.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Exiting UTs, Already added getShuffleResult failed, see MockedShuffleServerGrpcService.

@leixm
Copy link
Contributor Author

leixm commented Jul 11, 2023

cc @advancedxy @jerqi PTAL.

@leixm
Copy link
Contributor Author

leixm commented Jul 11, 2023

I try to shutdown a ShuffleServer during app running, it will throws FetchFailed, you can see here Fetch Failed image.

@codecov-commenter
Copy link

codecov-commenter commented Jul 11, 2023

Codecov Report

Merging #1004 (99c6a16) into master (d6b43f0) will increase coverage by 1.09%.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##             master    #1004      +/-   ##
============================================
+ Coverage     53.66%   54.76%   +1.09%     
- Complexity     2523     2526       +3     
============================================
  Files           382      362      -20     
  Lines         21672    19333    -2339     
  Branches       1795     1799       +4     
============================================
- Hits          11630    10587    -1043     
+ Misses         9338     8115    -1223     
+ Partials        704      631      -73     
Impacted Files Coverage Δ
...org/apache/spark/shuffle/RssSparkShuffleUtils.java 51.49% <0.00%> (-7.01%) ⬇️
...he/uniffle/client/impl/ShuffleWriteClientImpl.java 35.98% <0.00%> (-0.08%) ⬇️

... and 27 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jerqi jerqi requested a review from advancedxy July 11, 2023 09:46
@leixm
Copy link
Contributor Author

leixm commented Jul 11, 2023

Seems flaky test.

Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this PR, left some minor comments.

However maybe we need some refactor to:

  1. move getShuffleResult into ShuffleReadClient
  2. the get result part should also in the RssShuffleReader, that might make the whole code a bit clearer.
    But let's do that in follow-up prs.

RssConf rssConf = RssSparkConfig.toRssConf(sparkConf);
if (rssConf.getBoolean(RssClientConfig.RSS_RESUBMIT_STAGE, false)
&& RssSparkShuffleUtils.isStageResubmitSupported()) {
String driver = rssConf.getString("driver.host", "");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this literal driver.host should be extracted as a constant field. This also applies to RssShuffleReader.
image

Comment on lines 602 to 617
int port = rssConf.get(RssClientConf.SHUFFLE_MANAGER_GRPC_PORT);
try (ShuffleManagerClient client = ShuffleManagerClientFactory
.getInstance().createShuffleManagerClient(ClientType.GRPC, driver, port)) {
RssReportShuffleFetchFailureRequest req = new RssReportShuffleFetchFailureRequest(
appId, shuffleId, stageAttemptId, partitionId, e.getMessage());
RssReportShuffleFetchFailureResponse response = client.reportShuffleFetchFailure(req);
if (response.getReSubmitWholeStage()) {
// since we are going to roll out the whole stage, mapIndex shouldn't matter, hence -1 is provided.
FetchFailedException ffe =
RssSparkShuffleUtils.createFetchFailedException(shuffleId, -1, partitionId, e);
throw new RssException(ffe);
}
} catch (IOException ioe) {
LOG.info("Error closing shuffle manager client with error:", ioe);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we extract this into a utility method and reuse code for both spark 2 and spark 3?

Comment on lines 875 to 878
RssReportShuffleFetchFailureRequest req = new RssReportShuffleFetchFailureRequest(
appId, shuffleId, stageAttemptId, partitionId, e.getMessage());
RssReportShuffleFetchFailureResponse response = client.reportShuffleFetchFailure(req);
if (response.getReSubmitWholeStage()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this part, I think we should try to report all the shuffle fetch failures then throw FetchFailedException if any of response indicates the whole stage should be re-submitted.

P.S: one todo is create a new rpc interface to report failures in batch.

@leixm
Copy link
Contributor Author

leixm commented Jul 12, 2023

I will raise sub PRs to address the following problem:

  1. Create a new rpc interface to report failures in batch
  2. move getShuffleResult into ShuffleReadClient
  3. get result part move to RssShuffleReader

@leixm
Copy link
Contributor Author

leixm commented Jul 12, 2023

Flaky test.

@jerqi jerqi changed the title [#477] feat: support getShuffleResult throws FetchFailedException. [#477] feat(spark): support getShuffleResult throws FetchFailedException. Jul 13, 2023
Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, except one minor comment. Thanks for your work.

@leixm
Copy link
Contributor Author

leixm commented Jul 13, 2023

LGTM, except one minor comment. Thanks for your work.

cc @smallzhongfeng , Can you help review plz?

smallzhongfeng
smallzhongfeng previously approved these changes Jul 13, 2023
Copy link
Contributor

@smallzhongfeng smallzhongfeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for ping me. LTGM!

int port = rssConf.get(RssClientConf.SHUFFLE_MANAGER_GRPC_PORT);
try (ShuffleManagerClient client = ShuffleManagerClientFactory
.getInstance().createShuffleManagerClient(ClientType.GRPC, driver, port)) {
// todo: Create a new rpc interface to report failures in batch.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remember create a new issue for this.

@@ -581,4 +582,15 @@ public int getPartitionNum(int shuffleId) {
public int getNumMaps(int shuffleId) {
return shuffleIdToNumMapTasks.getOrDefault(shuffleId, 0);
}

private Roaring64NavigableMap getShuffleResult(String clientType, Set<ShuffleServerInfo> shuffleServerInfoSet,
String appId, int shuffleId, int partitionId, int stageAttemptId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: code indentation

int currentFailedReadRequest = failedReadRequest.getAndIncrement();
if (currentFailedReadRequest < numOfFailedReadRequest) {
LOG.info("This request is failed as mocked failure, current/firstN: {}/{}",
currentFailedReadRequest, numOfFailedReadRequest);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

int currentFailedReadRequest = failedReadRequest.getAndIncrement();
if (currentFailedReadRequest < numOfFailedReadRequest) {
LOG.info("This request is failed as mocked failure, current/firstN: {}/{}",
currentFailedReadRequest, numOfFailedReadRequest);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@jerqi
Copy link
Contributor

jerqi commented Jul 20, 2023

you can run this command to fix spotless issues.

mvn spotless:apply -Pspark3 -Pspark2 -Ptez -Pmr -Phadoop2.8

@leixm
Copy link
Contributor Author

leixm commented Jul 20, 2023

mvn spotless:apply -Pspark3 -Pspark2 -Ptez -Pmr -Phadoop2.8

Thank you. Already fixed.

Copy link
Contributor

@jerqi jerqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @leixm @smallzhongfeng @advancedxy , wait for CI.

@jerqi jerqi merged commit d0f0b57 into apache:master Jul 20, 2023
30 checks passed
@jerqi
Copy link
Contributor

jerqi commented Jul 20, 2023

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants