Skip to content

Conversation

@frankgh
Copy link
Contributor

@frankgh frankgh commented Aug 5, 2022

…eplacement

Add a new check during host replacement. Currently, during a node replacement, we check that the node
has not updated gossip for a configured ring_delay amount of time (defaults to 30 seconds). In CASSANDRA-17776,
the delay is calculated from the max value between the BROADCAST_INTERVAL and 2X the configured ring_delay.

If we see an update from the node that we are replacing in less than the calculated sleep delay, we throw a
UnsupportedOperationException with message Cannot replace a live node..... However, we never check whether
the node is reporting as alive or not alive. In this commit, we add the check to ensure that the node is in
fact reporting as alive before throwing the exception. Additionally, we add logging information with values for
the token, updateTimestamp, and allowedDelay values for better reporting.

…eplacement

Add a new check during host replacement. Currently, during a node replacement, we check that the node
has not updated gossip for a configured ring_delay amount of time (defaults to 30 seconds). In CASSANDRA-17776,
the delay is calculated from the max value between the `BROADCAST_INTERVAL` and 2X the configured `ring_delay`.

If we see an update from the node that we are replacing in less than the calculated `sleep delay`, we throw a
`UnsupportedOperationException` with message `Cannot replace a live node....`. However, we never check whether
the node is reporting as alive or not alive. In this commit, we add the check to ensure that the node is in
fact reporting as alive before throwing the exception. Additionally, we add logging information with values for
the token, `updateTimestamp`, and `allowedDelay` values for better reporting.
}

// check for operator errors...
long nanoDelay = MILLISECONDS.toNanos(ringTimeoutMillis);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to recalculate this value in every iteration of the loop. So moved outside of the for loop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also using MILLISECONDS.toNanos here to avoid conversion errors

Comment on lines +1865 to +1866
long updateTimestamp = endpointStateForExisting.getUpdateTimestamp();
long allowedDelay = nanoTime() - nanoDelay;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need to save to a local variable, can keep in the if statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use it in the if statement as well as the log

EndpointState endpointStateForExisting = Gossiper.instance.getEndpointStateForEndpoint(existing);
long updateTimestamp = endpointStateForExisting.getUpdateTimestamp();
long allowedDelay = nanoTime() - nanoDelay;
if (updateTimestamp > allowedDelay && endpointStateForExisting.isAlive())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be ||. If it was updated within the last ring delay or we think its alive, then fail

// if the node was updated within the ring delay or the node is alive, we should fail
if (updateTimestamp > allowedDelay || endpointStateForExisting.isAlive())
{
logger.error("Unable to replace node for token={}. The node is reporting as alive with updateTimestamp={} which exceeds the allowedDelay={}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The node is reporting as alive

This log can be confusing in the case that the endpoint is isAlive but the update <= allowedDelay, can you rework this to handle both cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, I have updated the log statement

// if the node was updated within the ring delay or the node is alive, we should fail
if (updateTimestamp > allowedDelay || endpointStateForExisting.isAlive())
{
logger.error("Unable to replace node for token={}. The node is reporting as {}alive with updateTimestamp={} which exceeds the allowedDelay={}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which exceeds the allowedDelay={}

again, this may not be true. you can keep it simple and just log the values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spoke in slack Unable to replace node for token={}. The node is reporting as {}alive with updateTimestamp={}, allowedDelay={}" works for me!

@frankgh frankgh deleted the CASSANDRA-17805 branch January 31, 2024 01:28
michaeljmarshall added a commit to michaeljmarshall/cassandra that referenced this pull request Sep 17, 2025
### What is the issue

Fixes riptano/cndb#14242

### What does this PR fix and why was it fixed

Commits:
datastax/jvector@4.0.0-beta.4...4.0.0-beta.5

CNDB test pr riptano/cndb#14243
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants