Fix NPE in mkAssignments when assignment is deleted during scheduling #8441
Merged
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rzo1
approved these changes
Mar 28, 2026
Summary
Fix a NullPointerException in Nimbus.mkAssignments() caused by a TOCTOU race
against ZooKeeper: state.assignmentInfo() can return null when an assignment is
deleted between the state.assignments() listing (line 2510) and the per-topology
read (line 2517).
Bug
In mkAssignments(), the code iterates over assigned topology IDs and fetches each
assignment from ZooKeeper.
assignmentInfo() returns null when the assignment znode no longer exists. This
happens when a topology is killed or its assignment is cleaned up between the two
ZooKeeper reads — a classic TOCTOU (time-of-check-to-time-of-use) race condition.
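The race can be reproduced in isolation. The sketch below is a minimal simulation, not Storm code: FakeClusterState, ToctouDemo, and findVanished are hypothetical names, and an in-memory map stands in for ZooKeeper. It only illustrates how an ID returned by the listing can resolve to null on the follow-up read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the ZooKeeper-backed cluster state (not Storm's real interface).
class FakeClusterState {
    private final Map<String, String> znodes = new ConcurrentHashMap<>();

    void put(String topoId, String assignment) { znodes.put(topoId, assignment); }

    void delete(String topoId) { znodes.remove(topoId); }

    // First read: snapshot of the topology IDs that currently have an assignment.
    List<String> assignments() { return new ArrayList<>(znodes.keySet()); }

    // Second read: returns null if the entry vanished after the listing was taken.
    String assignmentInfo(String topoId) { return znodes.get(topoId); }
}

class ToctouDemo {
    // Returns the IDs that were listed but whose assignment was gone when read.
    static List<String> findVanished(FakeClusterState state, Runnable betweenReads) {
        List<String> ids = state.assignments();      // time of check
        betweenReads.run();                           // a kill or cleanup lands here
        List<String> vanished = new ArrayList<>();
        for (String id : ids) {
            if (state.assignmentInfo(id) == null) {   // time of use
                vanished.add(id);
            }
        }
        return vanished;
    }

    public static void main(String[] args) {
        FakeClusterState state = new FakeClusterState();
        state.put("topo-a", "assignment-a");
        state.put("topo-b", "assignment-b");
        // Simulate topo-b being killed between the two reads.
        System.out.println(findVanished(state, () -> state.delete("topo-b")));
    }
}
```

Any caller that dereferences the result of the second read without a null check would throw for topo-b in this scenario.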
The same call is already null-checked elsewhere in Nimbus.java (line 3125).
Impact
mkAssignments runs on a recurring timer as part of Nimbus's scheduling loop. When
this NPE fires:
updated assignments for that cycle
state.assignments() listing (e.g., due to a slow ZooKeeper cleanup), every
scheduling round crashes until it disappears
rebalance, failed workers) is blocked, not just the one whose assignment was deleted
This makes Nimbus scheduling fragile under topology churn (rapid submit/kill cycles)
or ZooKeeper latency spikes.
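To illustrate why a single missing assignment affects every topology, here is a simplified sketch. It assumes a round that reads all assignments before scheduling anything; SchedulingRoundSketch, runRound, and scheduledCount are hypothetical names, and plain strings stand in for assignment objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SchedulingRoundSketch {
    // Hypothetical round: inspects every assignment first, then schedules all topologies.
    // An unguarded dereference of one null entry aborts the whole round.
    static List<String> runRound(List<String> ids, List<String> assignments) {
        for (int i = 0; i < ids.size(); i++) {
            assignments.get(i).length(); // throws NullPointerException for a deleted assignment
        }
        return new ArrayList<>(ids);     // reached only if every read succeeded
    }

    // Number of topologies scheduled in this round; 0 if the round crashed.
    static int scheduledCount(List<String> ids, List<String> assignments) {
        try {
            return runRound(ids, assignments).size();
        } catch (NullPointerException e) {
            return 0;
        }
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("topo-a", "topo-b", "topo-c");
        // topo-b's assignment was deleted between the listing and the read.
        System.out.println(scheduledCount(ids, Arrays.asList("a", null, "c")));
    }
}
```

In this sketch topo-a and topo-c are healthy, yet neither is scheduled in the round where topo-b's read returns null.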
Fix
Add a null guard consistent with the existing pattern at line 3125.
When assignmentInfo() returns null, the null flows into the existingAssignments
map. All four downstream consumers in lockingMkAssignments already handle null
values from this map (lines 2566, 2576, 2581, 2663).
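The shape of the guard can be sketched as follows. This is not the actual diff: NullGuardSketch, its simplified Assignment class, the owner default, and the countLive consumer are assumptions made for illustration. The point is that the null is stored rather than dereferenced, and consumers check for it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class NullGuardSketch {
    // Hypothetical, simplified assignment; Storm's real Assignment type is richer.
    static final class Assignment {
        String owner;
        Assignment(String owner) { this.owner = owner; }
    }

    // Collects assignments; a null lookup result is stored as-is, never dereferenced.
    static Map<String, Assignment> collect(List<String> ids,
                                           Function<String, Assignment> lookup) {
        Map<String, Assignment> existing = new HashMap<>(); // HashMap permits null values
        for (String id : ids) {
            Assignment a = lookup.apply(id);
            if (a != null && a.owner == null) {
                a.owner = "unknown"; // touch fields only when the assignment exists
            }
            existing.put(id, a);
        }
        return existing;
    }

    // Example downstream consumer that tolerates null values in the map.
    static int countLive(Map<String, Assignment> existing) {
        int live = 0;
        for (Assignment a : existing.values()) {
            if (a != null) {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        Map<String, Assignment> store = new HashMap<>();
        store.put("topo-a", new Assignment(null));
        // topo-b is listed but has no entry, so its lookup returns null.
        Map<String, Assignment> existing = collect(List.of("topo-a", "topo-b"), store::get);
        System.out.println(existing.containsKey("topo-b") + " " + countLive(existing));
    }
}
```

Keeping the null entry in the map, rather than skipping the ID, preserves the behavior the downstream consumers already expect.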
Test plan
lockingMkAssignments to confirm they handle null (they do — see lines 2566,
2576, 2581, 2663)
🤖 Generated with Claude Code