Remove inFlightEcho entry on ECHO_REQ failure #4865
Open
grom358 wants to merge 1 commit into
In Gossiper, echoHandler only implements onResponse. RequestCallback.onFailure defaults to a no-op, so when the ECHO_REQ times out or the remote node returns an error, inflightEcho.remove(addr) is never called and the stale entry persists.

Any subsequent markAlive(addr, localState) call, where localState is the same in-place-mutated object already stored in inflightEcho, sees localState.equals(prevState) return true (identity equality, same reference) and is skipped. This repeats indefinitely.

In a temporary-partition scenario (the node is briefly unreachable, the echo times out, and the node recovers with the same generation), the node can get stuck permanently dead: the failure detector sees it as alive and keeps triggering markAlive, but every invocation is suppressed by the stale entry. The stale entry is only cleared by removeEndpoint() (explicit removal) or by silentlyMarkDead() via markDead() (failure detector conviction), and neither fires while the failure detector reports the node as healthy.
Fix: override onFailure in echoHandler to call inflightEcho.remove(addr).
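
To illustrate the failure mode and the fix, here is a minimal, self-contained sketch. It is not the Gossiper code: EchoSketch, Callback, State, the removeOnFailure flag, and the string address are all hypothetical stand-ins, and the suppression check is only assumed to behave as described above (same object already in the in-flight map means the call is skipped).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class EchoSketch {
    // Hypothetical stand-in for RequestCallback: onFailure defaults to a no-op.
    interface Callback {
        void onResponse();
        default void onFailure() {}
    }

    // Hypothetical stand-in for the endpoint state; uses identity equality.
    static final class State {
        boolean alive = false;
    }

    final Map<String, State> inflightEcho = new ConcurrentHashMap<>();
    final boolean removeOnFailure; // true models the proposed fix
    Callback pending;              // last callback handed to the (simulated) messaging layer

    EchoSketch(boolean removeOnFailure) {
        this.removeOnFailure = removeOnFailure;
    }

    // Returns true if an echo was sent, false if the call was suppressed.
    boolean markAlive(String addr, State localState) {
        State prevState = inflightEcho.get(addr);
        if (localState.equals(prevState)) {
            return false; // same object is already in flight, so skip
        }
        inflightEcho.put(addr, localState);
        pending = new Callback() {
            @Override
            public void onResponse() {
                inflightEcho.remove(addr);
                localState.alive = true;
            }

            @Override
            public void onFailure() {
                // Without the fix this behaves like the default no-op.
                if (removeOnFailure) {
                    inflightEcho.remove(addr);
                }
            }
        };
        return true;
    }

    // Sends an echo, fails it, then reports whether a retry gets through.
    static boolean retrySucceedsAfterFailure(boolean removeOnFailure) {
        EchoSketch gossiper = new EchoSketch(removeOnFailure);
        State state = new State();
        gossiper.markAlive("10.0.0.1", state);
        gossiper.pending.onFailure(); // simulated timeout or remote error
        return gossiper.markAlive("10.0.0.1", state);
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + retrySucceedsAfterFailure(false)); // prints false
        System.out.println("with fix: " + retrySucceedsAfterFailure(true));     // prints true
    }
}
```

In this model the retry after a failed echo is suppressed unless onFailure removes the in-flight entry, which mirrors the behavior this change is meant to correct.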