Skip to content

[SPARK-31093][SHUFFLE] Fast fail while fetching shuffle data unsuccessfully#27855

Closed
turboFei wants to merge 2 commits intoapache:masterfrom
turboFei:SPARK-31093-fast-fail-fetch
Closed

[SPARK-31093][SHUFFLE] Fast fail while fetching shuffle data unsuccessfully#27855
turboFei wants to merge 2 commits intoapache:masterfrom
turboFei:SPARK-31093-fast-fail-fetch

Conversation

@turboFei
Copy link
Member

@turboFei turboFei commented Mar 9, 2020

What changes were proposed in this pull request?

When fetching shuffle data unsuccessfully, we put a FailureFetchResult into results(a linkedBlockingQueue) and wait it to be taken.
Then method throwFetchFailedException() would be invoked.

In fact, we can fast fail the task once fetching shuffle data unsuccessfully.
In this PR, we invoke throwFetchFailedException once fetching shuffle data unsuccessfully.

Why are the changes needed?

It can save the time when fetching shuffle data unsuccessfully.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UT.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@turboFei turboFei force-pushed the SPARK-31093-fast-fail-fetch branch from 722d426 to 3036bcd Compare March 9, 2020 13:24
Comment on lines 269 to 272
override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
results.put(new FailureFetchResult(BlockId(blockId), infoMap(blockId)._2, address, e))
throwFetchFailedException(BlockId(blockId), infoMap(blockId)._2, address, e)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onBlockFetchFailure will be called within Netty's thread. Thus, ShuffleBlockFetcherIterator can not catch the exception.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, have modified the code.

Comment on lines +639 to +641
failureFetchResult.foreach { ffr =>
buf.release()
throwFetchFailedException(ffr.blockId, ffr.mapIndex, ffr.address, ffr.e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errr...I just feel that this makes things be more complex... I actually think that fail-fast here is not such necessary unless we really have high delay in current way.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 19, 2020
@github-actions github-actions bot closed this Jun 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments