CASSANDRA-17805: Check that the replacing node is alive during host r… #1773

Closed

frankgh wants to merge 5 commits into apache:trunk from frankgh:CASSANDRA-17805

Contributor

frankgh commented Aug 5, 2022

…eplacement

Add a new check during host replacement. Currently, during a node replacement, we check that the node being replaced
has not updated gossip for a configured `ring_delay` amount of time (defaults to 30 seconds). In CASSANDRA-17776,
the delay is calculated as the maximum of `BROADCAST_INTERVAL` and 2X the configured `ring_delay`.

If we see an update from the node that we are replacing in less than the calculated sleep delay, we throw an
`UnsupportedOperationException` with message `Cannot replace a live node....`. However, we never check whether
the node is actually reporting as alive. In this commit, we add a check to ensure that the node is in
fact reporting as alive before throwing the exception. Additionally, we log the token, `updateTimestamp`,
and `allowedDelay` values for better reporting.
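
For readers following along, here is a rough sketch of the delay calculation and the recency comparison described above. It is not the code from `StorageService`; the class name, method names, and the 60-second broadcast interval used in `main` are assumptions made for illustration. Only the 30-second `ring_delay` default and the `updateTimestamp > allowedDelay` comparison come from this PR.

```java
import java.util.concurrent.TimeUnit;

final class ReplaceDelaySketch
{
    // Hypothetical helper: the larger of the broadcast interval and
    // twice the ring delay, both in milliseconds.
    static long sleepDelayMillis(long broadcastIntervalMillis, long ringDelayMillis)
    {
        return Math.max(broadcastIntervalMillis, 2 * ringDelayMillis);
    }

    // True when the last gossip update is newer than (now - delay),
    // i.e. the node was heard from inside the allowed window.
    static boolean updatedWithinDelay(long updateTimestampNanos, long nowNanos, long delayMillis)
    {
        long allowedDelay = nowNanos - TimeUnit.MILLISECONDS.toNanos(delayMillis);
        return updateTimestampNanos > allowedDelay;
    }

    public static void main(String[] args)
    {
        // Assumed values: 60s broadcast interval, 30s ring_delay.
        long delay = sleepDelayMillis(60_000, 30_000);
        long now = System.nanoTime();
        // Pretend the node being replaced last gossiped 10 seconds ago.
        long lastUpdate = now - TimeUnit.SECONDS.toNanos(10);
        System.out.println("delayMillis=" + delay
                           + " updatedWithinDelay=" + updatedWithinDelay(lastUpdate, now, delay));
    }
}
```

With these assumed values both candidates for the delay happen to be 60 seconds, so an update seen 10 seconds ago counts as recent.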


          CASSANDRA-17805: Check that the replacing node is alive during host r…

0c4d595

frankgh commented

src/java/org/apache/cassandra/service/StorageService.java

                               }
                               // check for operator errors...
+                              long nanoDelay = MILLISECONDS.toNanos(ringTimeoutMillis);

Contributor Author

frankgh Aug 5, 2022

no need to recalculate this value in every iteration of the loop, so I moved it outside of the for loop

Contributor Author

frankgh Aug 5, 2022

also using MILLISECONDS.toNanos here to avoid conversion errors
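
As an illustration of the two comments above, the sketch below hoists the conversion out of a loop and contrasts `TimeUnit.MILLISECONDS.toNanos` with a hand-rolled multiplication. The `manualNanos` helper and the 60,000 ms value are hypothetical; the code that was replaced is not shown in this thread, so integer overflow is only one kind of conversion error this change might guard against.

```java
import java.util.concurrent.TimeUnit;

final class NanoDelaySketch
{
    // Hypothetical hand-rolled conversion: the int * int product
    // overflows before it is widened to long.
    static long manualNanos(int millis)
    {
        return millis * 1_000_000;
    }

    // Conversion through TimeUnit, which works on long values.
    static long toNanos(int millis)
    {
        return TimeUnit.MILLISECONDS.toNanos(millis);
    }

    public static void main(String[] args)
    {
        int ringTimeoutMillis = 60_000; // assumed value for illustration
        // Computed once, before the loop, instead of on every iteration.
        long nanoDelay = toNanos(ringTimeoutMillis);
        for (int i = 0; i < 3; i++)
        {
            long allowedDelay = System.nanoTime() - nanoDelay;
            System.out.println("iteration " + i + " allowedDelay=" + allowedDelay);
        }
        System.out.println("manual=" + manualNanos(ringTimeoutMillis)
                           + " toNanos=" + nanoDelay);
    }
}
```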


          higher

f5948ab

dcapwell requested changes

src/java/org/apache/cassandra/service/StorageService.java

Comment on lines +1865 to +1866

		long updateTimestamp = endpointStateForExisting.getUpdateTimestamp();
		long allowedDelay = nanoTime() - nanoDelay;

Contributor

dcapwell Aug 5, 2022

don't need to save to a local variable, can keep in the if statement

Contributor Author

frankgh Aug 5, 2022

we use it in the if statement as well as the log

src/java/org/apache/cassandra/service/StorageService.java Outdated

+                                      EndpointState endpointStateForExisting = Gossiper.instance.getEndpointStateForEndpoint(existing);
+                                      long updateTimestamp = endpointStateForExisting.getUpdateTimestamp();
+                                      long allowedDelay = nanoTime() - nanoDelay;
+                                      if (updateTimestamp > allowedDelay && endpointStateForExisting.isAlive())

Contributor

dcapwell Aug 5, 2022

should be ||. If it was updated within the last ring delay or we think it's alive, then fail
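
To make the difference between the two operators concrete, here is a small sketch that prints all four combinations. The class and method names are made up for illustration; the two boolean inputs stand in for `updateTimestamp > allowedDelay` and `endpointStateForExisting.isAlive()`.

```java
final class ReplaceGuardSketch
{
    // Guard as first submitted: fail only when both conditions hold.
    static boolean failWithAnd(boolean recentlyUpdated, boolean alive)
    {
        return recentlyUpdated && alive;
    }

    // Guard as requested in review: fail when either condition holds.
    static boolean failWithOr(boolean recentlyUpdated, boolean alive)
    {
        return recentlyUpdated || alive;
    }

    public static void main(String[] args)
    {
        boolean[] values = { false, true };
        for (boolean recent : values)
        {
            for (boolean alive : values)
            {
                System.out.println("recentlyUpdated=" + recent + " alive=" + alive
                                   + " and=" + failWithAnd(recent, alive)
                                   + " or=" + failWithOr(recent, alive));
            }
        }
    }
}
```

The two guards disagree exactly when one condition holds without the other, for example a node that is marked alive but whose last update falls outside the window. With `&&` the replacement would proceed in that case; with `||` it fails.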


          Address PR comments

963103f

dcapwell reviewed

src/java/org/apache/cassandra/service/StorageService.java Outdated

+                                      // if the node was updated within the ring delay or the node is alive, we should fail
+                                      if (updateTimestamp > allowedDelay || endpointStateForExisting.isAlive())
+                                      {
+                                          logger.error("Unable to replace node for token={}. The node is reporting as alive with updateTimestamp={} which exceeds the allowedDelay={}",

Contributor

dcapwell Aug 8, 2022

The node is reporting as alive

This log can be confusing in the case where the endpoint isAlive but the update timestamp is <= allowedDelay. Can you rework this to handle both cases?

Contributor Author

frankgh Aug 8, 2022

good catch, I have updated the log statement


          Address PR feedback

7204d5c

dcapwell reviewed

src/java/org/apache/cassandra/service/StorageService.java Outdated

+                                      // if the node was updated within the ring delay or the node is alive, we should fail
+                                      if (updateTimestamp > allowedDelay || endpointStateForExisting.isAlive())
+                                      {
+                                          logger.error("Unable to replace node for token={}. The node is reporting as {}alive with updateTimestamp={} which exceeds the allowedDelay={}",

Contributor

dcapwell Aug 8, 2022

which exceeds the allowedDelay={}

again, this may not be true. you can keep it simple and just log the values

Contributor

dcapwell Aug 9, 2022

spoke in Slack: "Unable to replace node for token={}. The node is reporting as {}alive with updateTimestamp={}, allowedDelay={}" works for me!
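
A sketch of how the agreed message could render in both cases. This uses `String.format` as a stand-in for the logger's `{}` placeholders so it runs without a logging dependency, and it assumes the `{}` before `alive` is filled with either an empty string or `"not "`; that argument is not shown in this thread. The class name, method name, and sample values are hypothetical.

```java
final class ReplaceLogSketch
{
    // Builds the agreed message text; %s and %d stand in for the logger's {} placeholders.
    static String message(String token, boolean alive, long updateTimestamp, long allowedDelay)
    {
        return String.format("Unable to replace node for token=%s. "
                             + "The node is reporting as %salive with updateTimestamp=%d, allowedDelay=%d",
                             token,
                             alive ? "" : "not ", // assumed prefix for the not-alive case
                             updateTimestamp,
                             allowedDelay);
    }

    public static void main(String[] args)
    {
        // Sample values only; real tokens and timestamps will differ.
        System.out.println(message("42", true, 5, 3));
        System.out.println(message("42", false, 5, 3));
    }
}
```

Logging the raw values without a claim such as "which exceeds" keeps the message accurate whichever side of the `||` triggered the failure.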


          Address PR feedback

cbceed2

dcapwell approved these changes

smiklosovic closed this

frankgh deleted the CASSANDRA-17805 branch

January 31, 2024 01:28

michaeljmarshall added a commit to michaeljmarshall/cassandra that referenced this pull request


          CNDB-14242: Upgrade jvector to 4.0.0-beta.5 (apache#1773)

569eb8a

### What is the issue

Fixes riptano/cndb#14242

### What does this PR fix and why was it fixed

Commits:
datastax/jvector@4.0.0-beta.4...4.0.0-beta.5

CNDB test pr riptano/cndb#14243

michaelsembwever pushed a commit to thelastpickle/cassandra that referenced this pull request


          CNDB-14242: Upgrade jvector to 4.0.0-beta.5 (apache#1773)

9eadd7a

Fixes riptano/cndb#14242

Commits:
datastax/jvector@4.0.0-beta.4...4.0.0-beta.5

CNDB test pr riptano/cndb#14243
