Skip to content

[CELEBORN-1152] fix GetShuffleId RPC NPE for empty shuffle#2136

Closed
ErikFang wants to merge 2 commits intoapache:mainfrom
ErikFang:fix-GetShuffleId-RPC-NPE-for-empty-shuffle
Closed

[CELEBORN-1152] fix GetShuffleId RPC NPE for empty shuffle#2136
ErikFang wants to merge 2 commits intoapache:mainfrom
ErikFang:fix-GetShuffleId-RPC-NPE-for-empty-shuffle

Conversation

@ErikFang
Copy link
Contributor

@ErikFang ErikFang commented Dec 7, 2023

What changes were proposed in this pull request?

In celeborn-955, GetShuffleId RPC was introduced to generate a celeborn shuffle id from app shuffle id to support spark stage rerun
GetShuffleId RPC assumes that Shuffle Write operation always happens before Shuffle Read operation, but this is not true for empty shuffle data in celeborn, which causes GetShuffleId RPC to throw NPE and fail the Job
This PR fixes this bug

Why are the changes needed?

to avoid spark job failure with empty shuffle data

Does this PR introduce any user-facing change?

No

How was this patch tested?

a new test case is included for empty shuffle data

@codecov
Copy link

codecov bot commented Dec 7, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (af6fd8a) 46.21% compared to head (ef123bc) 46.36%.
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2136      +/-   ##
==========================================
+ Coverage   46.21%   46.36%   +0.16%     
==========================================
  Files         175      175              
  Lines       11341    11358      +17     
  Branches     1017     1019       +2     
==========================================
+ Hits         5240     5265      +25     
+ Misses       5753     5740      -13     
- Partials      348      353       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ErikFang ErikFang changed the title fix GetShuffleId RPC NPE for empty shuffle [CELEBORN-1152] fix GetShuffleId RPC NPE for empty shuffle Dec 7, 2023
if (shuffleIds == null) {
logWarning(s"unknown appShuffleId $appShuffleId, maybe no shuffle data for this shuffle")
val pbGetShuffleIdResponse =
PbGetShuffleIdResponse.newBuilder().setShuffleId(UNKNOWN_APP_SHUFFLE_ID).build()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can short-circuit ShuffleClientImpl#readPartition if partitionId is -1 and return CelebornInputStream.empty() without calling loadFileGroup(shuffleId, partitionId)

Copy link
Contributor

@waitinfuture waitinfuture left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! Merging to main(v0.4.0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants