-
Notifications
You must be signed in to change notification settings - Fork 3.8k
CASSANDRA-17805: Check that the replacing node is alive during host r… #1773
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…eplacement Add a new check during host replacement. Currently, during a node replacement, we check that the node has not updated gossip for a configured ring_delay amount of time (defaults to 30 seconds). In CASSANDRA-17776, the delay is calculated from the max value between the `BROADCAST_INTERVAL` and 2X the configured `ring_delay`. If we see an update from the node that we are replacing in less than the calculated `sleep delay`, we throw a `UnsupportedOperationException` with message `Cannot replace a live node....`. However, we never check whether the node is reporting as alive or not alive. In this commit, we add the check to ensure that the node is in fact reporting as alive before throwing the exception. Additionally, we add logging information with values for the token, `updateTimestamp`, and `allowedDelay` values for better reporting.
| } | ||
|
|
||
| // check for operator errors... | ||
| long nanoDelay = MILLISECONDS.toNanos(ringTimeoutMillis); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to recalculate this value in every iteration of the loop. So moved outside of the for loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also using MILLISECONDS.toNanos here to avoid conversion errors
| long updateTimestamp = endpointStateForExisting.getUpdateTimestamp(); | ||
| long allowedDelay = nanoTime() - nanoDelay; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need to save to a local variable, can keep in the if statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use it in the if statement as well as the log
| EndpointState endpointStateForExisting = Gossiper.instance.getEndpointStateForEndpoint(existing); | ||
| long updateTimestamp = endpointStateForExisting.getUpdateTimestamp(); | ||
| long allowedDelay = nanoTime() - nanoDelay; | ||
| if (updateTimestamp > allowedDelay && endpointStateForExisting.isAlive()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be ||. If it was updated within the last ring delay or we think its alive, then fail
| // if the node was updated within the ring delay or the node is alive, we should fail | ||
| if (updateTimestamp > allowedDelay || endpointStateForExisting.isAlive()) | ||
| { | ||
| logger.error("Unable to replace node for token={}. The node is reporting as alive with updateTimestamp={} which exceeds the allowedDelay={}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The node is reporting as alive
This log can be confusing in the case that the endpoint is isAlive but the update <= allowedDelay, can you rework this to handle both cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, I have updated the log statement
| // if the node was updated within the ring delay or the node is alive, we should fail | ||
| if (updateTimestamp > allowedDelay || endpointStateForExisting.isAlive()) | ||
| { | ||
| logger.error("Unable to replace node for token={}. The node is reporting as {}alive with updateTimestamp={} which exceeds the allowedDelay={}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which exceeds the allowedDelay={}
again, this may not be true. you can keep it simple and just log the values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spoke in slack Unable to replace node for token={}. The node is reporting as {}alive with updateTimestamp={}, allowedDelay={}" works for me!
### What is the issue Fixes riptano/cndb#14242 ### What does this PR fix and why was it fixed Commits: datastax/jvector@4.0.0-beta.4...4.0.0-beta.5 CNDB test pr riptano/cndb#14243
…eplacement
Add a new check during host replacement. Currently, during a node replacement, we check that the node
has not updated gossip for a configured ring_delay amount of time (defaults to 30 seconds). In CASSANDRA-17776,
the delay is calculated from the max value between the
BROADCAST_INTERVALand 2X the configuredring_delay.If we see an update from the node that we are replacing in less than the calculated
sleep delay, we throw aUnsupportedOperationExceptionwith messageCannot replace a live node..... However, we never check whetherthe node is reporting as alive or not alive. In this commit, we add the check to ensure that the node is in
fact reporting as alive before throwing the exception. Additionally, we add logging information with values for
the token,
updateTimestamp, andallowedDelayvalues for better reporting.