
Fix NPE in mkAssignments when assignment is deleted during scheduling#8441

Merged
rzo1 merged 1 commit into master from fix/nimbus-mkassignments-npe on Mar 28, 2026

Conversation


@jnioche jnioche commented Mar 28, 2026

Summary

Fix a NullPointerException in Nimbus.mkAssignments() caused by a TOCTOU race
against ZooKeeper: state.assignmentInfo() can return null when an assignment is
deleted between the state.assignments() listing (line 2510) and the per-topology
read (line 2517).

Bug

In mkAssignments(), the code iterates over assigned topology IDs and fetches each
assignment from ZooKeeper:

Assignment currentAssignment = state.assignmentInfo(id, null);  // can return null
if (!currentAssignment.is_set_owner()) {                         // NPE

assignmentInfo() returns null when the assignment znode no longer exists. This
happens when a topology is killed or its assignment is cleaned up between the two
ZooKeeper reads — a classic TOCTOU (time-of-check-to-time-of-use) race condition.

The same method already handles this correctly elsewhere in Nimbus.java (line 3125):

Assignment assignment = state.assignmentInfo(topoId, null);
if (assignment != null && assignment.is_set_executor_node_port()) { ... }

Impact

mkAssignments runs on a recurring timer as part of Nimbus's scheduling loop. When
this NPE fires:

  1. The entire scheduling round fails — no topology in the cluster gets new or
    updated assignments for that cycle
  2. The error is persistent — if the deleted topology ID remains in the
    state.assignments() listing (e.g., due to a slow ZooKeeper cleanup), every
    scheduling round crashes until it disappears
  3. All topologies are starved — any topology needing re-assignment (new workers,
    rebalance, failed workers) is blocked, not just the one whose assignment was deleted

This makes Nimbus scheduling fragile under topology churn (rapid submit/kill cycles)
or ZooKeeper latency spikes.
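The blast radius follows from the loop structure: every topology is processed in one pass, so an uncaught exception on one entry abandons the rest of the round. A toy sketch (simplified names, not Storm's actual scheduling code) shows the effect:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of why one deleted assignment starves every topology: the round
// iterates over all topologies in a single try scope, so an NPE on one
// entry means later entries are never reached.
public class SchedulingRoundSketch {
    public static void main(String[] args) {
        Map<String, String> assignments = new LinkedHashMap<>();
        assignments.put("topo-1", "owner=alice");
        assignments.put("topo-2", null);          // deleted during scheduling
        assignments.put("topo-3", "owner=carol");

        try {
            for (Map.Entry<String, String> e : assignments.entrySet()) {
                // Mirrors calling is_set_owner() on a null assignment.
                if (!e.getValue().startsWith("owner=")) { /* repair ownership */ }
                System.out.println("scheduled " + e.getKey());
            }
        } catch (NullPointerException npe) {
            System.out.println("round aborted; topo-3 never scheduled");
        }
    }
}
```

On the next timer tick the same listing still contains topo-2, so the round fails again until the stale entry disappears.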

Fix

Add a null guard consistent with the existing pattern at line 3125:

       Assignment currentAssignment = state.assignmentInfo(id, null);
  -    if (!currentAssignment.is_set_owner()) {
  +    if (currentAssignment != null && !currentAssignment.is_set_owner()) {

When assignmentInfo() returns null, the null flows into the existingAssignments
map. All four downstream consumers in lockingMkAssignments already handle null
values from this map (lines 2566, 2576, 2581, 2663).
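The fixed control flow can be illustrated with a self-contained sketch (stand-in types and names, not the real Nimbus code): the guard skips the owner-repair branch for a null read, the null is still stored in the map, and the downstream consumer pattern tolerates it.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fixed flow: a null assignment bypasses the owner check,
// lands in the map as a null value, and null-aware consumers handle it.
public class NullGuardSketch {
    public static void main(String[] args) {
        Map<String, String> existingAssignments = new HashMap<>();
        String[] ids   = {"topo-1", "topo-2"};
        String[] reads = {"owner=alice", null};   // topo-2 deleted mid-round

        for (int i = 0; i < ids.length; i++) {
            String current = reads[i];
            // The guard: only touch the assignment when it actually exists.
            if (current != null && !current.startsWith("owner=")) {
                current = "owner=unknown";        // stand-in for the repair branch
            }
            existingAssignments.put(ids[i], current); // null flows into the map
        }

        // Downstream consumer pattern: check the map value before using it.
        for (String id : ids) {
            String a = existingAssignments.get(id);
            System.out.println(id + (a != null ? " kept" : " has no assignment"));
        }
    }
}
```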

Test plan

  • Verify the fix compiles: mvn compile -pl storm-server
  • Review the four existingAssignments.get() call sites in
    lockingMkAssignments to confirm they handle null (they do — see lines 2566,
    2576, 2581, 2663)
  • Confirm the fix matches the existing pattern at line 3125

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jnioche jnioche added this to the 2.8.6 milestone Mar 28, 2026
@jnioche jnioche added the bug label Mar 28, 2026
@rzo1 rzo1 merged commit 9961d32 into master Mar 28, 2026
12 checks passed
@jnioche jnioche deleted the fix/nimbus-mkassignments-npe branch March 30, 2026 08:57
