Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-15048: Improve handling of unexpected quorum controller errors #13799

Merged
merged 2 commits into from Jun 2, 2023

Conversation

cmccabe
Copy link
Contributor

@cmccabe cmccabe commented Jun 2, 2023

When the active quorum controller encounters an "unexpected" error, such as a NullPointerException, it currently resigns its leadership. This PR fixes it so that in addition to doing that, it also increments the metadata error count metric. This will allow us to better track down these errors.

This PR also fixes a minor bug where performing read operations on a standby controller would result in an unexpected RuntimeException. The bug happened because the standby controller does not take in-memory snapshots, and read operations were attempting to read from the epoch of the latest committed offset. The fix is for the standby controller to simply read the latest value of each data structure. This is always safe, because standby controllers don't contain uncommitted data.

Also, fix a bug where listPartitionReassignments was reading the latest data, rather than data from the last committed offset.

When the active quorum controller encounters an "unexpected" error, such as a NullPointerException,
it currently resigns its leadership. This PR fixes it so that in addition to doing that, it also
increments the metadata error count metric. This will allow us to better track down these errors.

This PR also fixes a minor bug where performing read operations on a standby controller would
result in an unexpected RuntimeException. The bug happened because the standby controller does not
take in-memory snapshots, and read operations were attempting to read from the epoch of the latest
committed offset. The fix is for the standby controller to simply read the latest value of each
data structure. This is always safe, because standby controllers don't contain uncommitted data.

Also, fix a bug where listPartitionReassignments was reading the latest data, rather than data from
the last committed offset.
@cmccabe cmccabe added the kraft label Jun 2, 2023
Copy link
Member

@dengziming dengziming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR, this can also fix KAFKA-15036, LGTM.

// Standby controllers never have uncommitted data in memory. Therefore, we return
// Long.MAX_VALUE, a special value which means "always read the latest from every
// data structure."
return Long.MAX_VALUE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using SnapshotRegistry.LATEST_EPOCH

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed.

* @param latestController The current controller.
* @return The new NotControllerException.
*/
public static NotControllerException newPreMigrationException(OptionalInt latestController) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing the argument here is the controller ID? Can you rename it to reflect that? (same for other methods in this class)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -2024,13 +2029,8 @@ public CompletableFuture<ElectLeadersResponseData> electLeaders(
public CompletableFuture<FinalizedControllerFeatures> finalizedFeatures(
ControllerRequestContext context
) {
// It's possible that we call ApiVersionRequest before consuming the log since ApiVersionRequest is sent when
// initialize NetworkClient, we should not return an error since it would stop the NetworkClient from working correctly.
if (lastCommittedOffset == -1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a bug previously? If another node sent an ApiVersionRequest prior to loading the log, we would have not returned any MetadataVersion (whereas with this patch, we would).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this was previously a bug.

@cmccabe cmccabe merged commit 146a697 into apache:trunk Jun 2, 2023
1 check was pending
@cmccabe cmccabe deleted the KAFKA-15048 branch June 2, 2023 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants