Skip to content

fix AssertError for newer but lower message idx#194

Merged
bakwc merged 2 commits intobakwc:masterfrom
snubba:fix_assert_error_in_apply_command
Mar 16, 2026
Merged

fix AssertError for newer but lower message idx#194
bakwc merged 2 commits intobakwc:masterfrom
snubba:fix_assert_error_in_apply_command

Conversation

@snubba
Copy link
Contributor

@snubba snubba commented Feb 23, 2026

line 975-978

idx = message['log_idx']
term = message['log_term']
assert idx > self.__raftLastApplied          # ← crashes here
self.__commandsWaitingCommit[idx].append((term, callback))

The Race Condition

Key triggering scenario — leader reconnects after restart with stale socket data:

  1. Follower A sends apply_command to Leader B → commandsWaitingReply[N] = callback
  2. Leader B logs the command at log_idx=X, sends apply_command_response (message in OS socket buffer, not yet read by A)
  3. Leader B restarts → new TCP connection established from B → A
  4. In _onIncomingMessageReceived (transport.py:368), the new connection replaces the old one: self._connections[node] = new_conn, but the old socket stays registered with the poller
  5. The new connection delivers fresh append_entries with a snapshot → __loadDumpFile(clearJournal=True) → __raftLastApplied jumps past X
  6. Crucially: because the reconnecting node has the same address (same Node.id), line 886 evaluates self.__raftLeader != node as False__onLeaderChanged() is never called, so commandsWaitingReply[N] is not cleared
  7. The old socket then delivers its buffered apply_command_response with log_idx=X
  8. Callback is found in commandsWaitingReplyX > __raftLastAppliedFalseAssertionError

Secondary scenario — tick ordering:

The tick order is:
__applyLogEntries()__sendAppendEntries()_checkCommandsToApply()_poller.poll()
If __applyLogEntries() applies entry at idx=X (advancing __raftLastApplied to X) in the same tick that _poller.poll() delivers an apply_command_response with log_idx=X, you get X > X → False.

Why it "doesn't go away"

AssertionError is a subclass of Exception. The _autoTickThread has a blanket except Exception at line 539 that catches it, logs it, and continues running. The popped callback is silently dropped (never called, never put in commandsWaitingCommit). If the application retries the command under the same cluster instability, it can reproduce the same race.

@bakwc Can you please confirm this fix? It is hard to reproduce reliable.

@snubba
Copy link
Contributor Author

snubba commented Mar 10, 2026

@bakwc can you validate the fix?

@bakwc
Copy link
Owner

bakwc commented Mar 10, 2026

Hey! Thanks for the PR. Will take a look, need to double check logic.

@bakwc
Copy link
Owner

bakwc commented Mar 15, 2026

The fix is good! But there is a small issue - the fail reason stops matching the description. As for this case - command is not DISCARDED anymore - it could be applied or discarded. So we should probably introduce a new fail reason 'UNKNOWN_OUTCOME' - this mean command may be applied or may be discarded.

@snubba
Copy link
Contributor Author

snubba commented Mar 16, 2026

Added UNKNOWN_OUTCOME as new FAIL_REASON and changed the callback.

@bakwc bakwc merged commit bb42edf into bakwc:master Mar 16, 2026
1 check passed
@snubba snubba deleted the fix_assert_error_in_apply_command branch March 16, 2026 08:14
@snubba snubba mentioned this pull request Mar 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants