Fix FileSettingsService hang on error update #89630
Conversation
Pinging @elastic/es-core-infra (Team:Core/Infra)
Hi @grcevski, I've created a changelog YAML for you.
@@ -186,8 +188,7 @@ static boolean isNewError(ReservedStateMetadata existingMetadata, Long newStateVersion) {
            || existingMetadata.errorMetadata().version() < newStateVersion);
    }

-   private void saveErrorState(ErrorState errorState) {
-       ClusterState clusterState = clusterService.state();
+   private void saveErrorState(ClusterState clusterState, ErrorState errorState) {
Passing in the most recent cluster state from the cluster state update task is somewhat tangential to the fix; it just makes the issue reproduce more reliably. ClusterService.state() might not be the latest state, but the state we receive in the update task will be.
@@ -170,6 +170,8 @@ public void onFailure(Exception e) {
        if (isNewError(existingMetadata, reservedStateVersion.version())) {
            logger.debug("Failed to apply reserved cluster state", e);
            errorListener.accept(e);
+       } else {
+           errorListener.accept(null);
This is the actual bug: we never called the callback, so the async watcher was never released.
LGTM
Thanks Chris!
💚 Backport successful
While working on new tests for #89567 I discovered a bug where, on a duplicate error update, we don't unlock the file settings watcher thread properly.
It's a race condition: the settings file can be updated in quicker succession than the error state is updated. We check for a duplicate error state and skip publishing a new cluster state update event, but we didn't call the error listener with null to unlock the watcher.
This is intermittent and should have been caught by our existing error state save integration tests, but it just happened to be hit by the SLM-specific test I was writing.
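To illustrate the failure mode, here is a minimal, self-contained sketch (hypothetical names, not the actual Elasticsearch classes) of the pattern involved: a watcher thread blocks on a latch that is only released when the error listener fires, so the listener must be invoked on every code path, including the duplicate-error no-op path, or the watcher hangs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ListenerRelease {

    // Mimics the error-handling branch: before the fix, the listener was only
    // invoked when the error was new, leaving the watcher blocked otherwise.
    static void applyUpdate(boolean isNewError, Consumer<Exception> errorListener) {
        if (isNewError) {
            errorListener.accept(new IllegalStateException("failed to apply reserved state"));
        } else {
            // The fix: still call back with null so the waiting watcher is released.
            errorListener.accept(null);
        }
    }

    // Simulates the watcher: returns true if the listener released the latch
    // within the timeout, i.e. the watcher thread did not hang.
    static boolean watcherReleased(boolean isNewError) throws InterruptedException {
        CountDownLatch completion = new CountDownLatch(1);
        applyUpdate(isNewError, e -> completion.countDown());
        return completion.await(1, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(watcherReleased(true));   // new error: listener fires
        System.out.println(watcherReleased(false));  // duplicate error: fixed path also fires
    }
}
```

With the `else` branch removed, the second call would time out, which is the hang this PR fixes.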