New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-15048: Improve handling of unexpected quorum controller errors #13799
Conversation
When the active quorum controller encounters an "unexpected" error, such as a NullPointerException, it currently resigns its leadership. This PR fixes it so that in addition to doing that, it also increments the metadata error count metric. This will allow us to better track down these errors. This PR also fixes a minor bug where performing read operations on a standby controller would result in an unexpected RuntimeException. The bug happened because the standby controller does not take in-memory snapshots, and read operations were attempting to read from the epoch of the latest committed offset. The fix is for the standby controller to simply read the latest value of each data structure. This is always safe, because standby controllers don't contain uncommitted data. Also, fix a bug where listPartitionReassignments was reading the latest data, rather than data from the last committed offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR, this can also fix KAFKA-15036, LGTM.
// Standby controllers never have uncommitted data in memory. Therefore, we return | ||
// Long.MAX_VALUE, a special value which means "always read the latest from every | ||
// data structure." | ||
return Long.MAX_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using SnapshotRegistry.LATEST_EPOCH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed.
* @param latestController The current controller. | ||
* @return The new NotControllerException. | ||
*/ | ||
public static NotControllerException newPreMigrationException(OptionalInt latestController) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing the argument here is the controller ID? Can you rename it to reflect that? (same for other methods in this class)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -2024,13 +2029,8 @@ public CompletableFuture<ElectLeadersResponseData> electLeaders( | |||
public CompletableFuture<FinalizedControllerFeatures> finalizedFeatures( | |||
ControllerRequestContext context | |||
) { | |||
// It's possible that we call ApiVersionRequest before consuming the log since ApiVersionRequest is sent when | |||
// initialize NetworkClient, we should not return an error since it would stop the NetworkClient from working correctly. | |||
if (lastCommittedOffset == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this a bug previously? If another node sent an ApiVersionRequest prior to loading the log, we would have not returned any MetadataVersion (whereas with this patch, we would).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this was previously a bug.
When the active quorum controller encounters an "unexpected" error, such as a NullPointerException, it currently resigns its leadership. This PR fixes it so that in addition to doing that, it also increments the metadata error count metric. This will allow us to better track down these errors.
This PR also fixes a minor bug where performing read operations on a standby controller would result in an unexpected RuntimeException. The bug happened because the standby controller does not take in-memory snapshots, and read operations were attempting to read from the epoch of the latest committed offset. The fix is for the standby controller to simply read the latest value of each data structure. This is always safe, because standby controllers don't contain uncommitted data.
Also, fix a bug where listPartitionReassignments was reading the latest data, rather than data from the last committed offset.