
Fix FileSettingsService hang on error update #89630

Merged: 2 commits into elastic:main on Aug 29, 2022

Conversation

grcevski (Contributor) commented:

While working on new tests for #89567 I discovered a bug: on a duplicate error update we don't unlock the file settings watcher thread properly.

It's a race condition that shows up when the settings file is updated in quicker succession than the error state. We check for a duplicate error state and skip the new cluster state update event, but on that path we never called the error listener with null to unlock the watcher.

The failure is intermittent and should have been caught by our existing error state save integration tests; it just happened to be hit by the SLM-specific test I was writing.
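For readers outside this code, here is a minimal sketch of the hang pattern, with hypothetical names (`FileWatcherHangSketch`, `applyReservedState`); the real FileSettingsService has its own synchronization, but the shape is the same: the watcher thread blocks until the error listener fires, so any branch that returns without invoking the listener blocks it forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.function.Consumer;

// Hypothetical sketch, not the actual FileSettingsService code.
class FileWatcherHangSketch {

    void onSettingsFileChanged(String contents) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        // The listener must fire on every path, error or not,
        // or the await() below never returns.
        Consumer<Exception> errorListener = e -> done.countDown();
        applyReservedState(contents, errorListener);
        done.await(); // before the fix, the duplicate-error path hung here
    }

    private void applyReservedState(String contents, Consumer<Exception> errorListener) {
        boolean newError = false; // duplicate: this error version is already saved
        if (newError) {
            errorListener.accept(new IllegalStateException("parse failure"));
        }
        // Bug: no else branch -- the fix adds errorListener.accept(null)
        // here so the watcher is released even for duplicate errors.
    }
}
```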

grcevski added the >bug, :Core/Infra/Core (Core issues without another label), Team:Core/Infra (Meta label for core/infra team), auto-backport-and-merge (Automatically create backport pull requests and merge when ready), v8.4.1, and v8.5.0 labels on Aug 25, 2022
elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine (Collaborator) commented:

Hi @grcevski, I've created a changelog YAML for you.

@@ -186,8 +188,7 @@ static boolean isNewError(ReservedStateMetadata existingMetadata, Long newStateVersion)
             || existingMetadata.errorMetadata().version() < newStateVersion);
     }

-    private void saveErrorState(ErrorState errorState) {
-        ClusterState clusterState = clusterService.state();
+    private void saveErrorState(ClusterState clusterState, ErrorState errorState) {
grcevski (Contributor, Author) commented:

This change to pass in the most recent cluster state from the cluster state update task is somewhat tangential; it just makes us hit the issue more reliably. ClusterService.state() might not return the latest state, but the one we get in the update task will be.
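As a hedged illustration of that distinction (the task structure, class name, and source string below are simplified assumptions, not the actual reserved-state code): the execute callback of a ClusterStateUpdateTask receives the state the task computes against, which is guaranteed current for that update, while clusterService.state() read from outside may lag behind in-flight updates.

```java
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateUpdateTask;
import org.elasticsearch.cluster.service.ClusterService;
// import for ErrorState elided; it lives alongside the reserved-state service

// Simplified sketch -- the real reserved-state task is more involved.
class FreshStateSketch {

    void submitErrorUpdate(ClusterService clusterService, ErrorState errorState) {
        clusterService.submitStateUpdateTask("reserved-state-error", new ClusterStateUpdateTask() {
            @Override
            public ClusterState execute(ClusterState currentState) {
                // currentState is the state this task computes against; pass it
                // down instead of re-reading a possibly stale clusterService.state().
                saveErrorState(currentState, errorState);
                return currentState;
            }

            @Override
            public void onFailure(Exception e) {
                // task-level failure handling elided in this sketch
            }
        });
    }

    void saveErrorState(ClusterState clusterState, ErrorState errorState) {
        // as in this PR: builds the error metadata from the passed-in state
    }
}
```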

@@ -170,6 +170,8 @@ public void onFailure(Exception e) {
                 if (isNewError(existingMetadata, reservedStateVersion.version())) {
                     logger.debug("Failed to apply reserved cluster state", e);
                     errorListener.accept(e);
+                } else {
+                    errorListener.accept(null);
grcevski (Contributor, Author) commented:

This is the actual bug: we never called the callback to ensure the async watcher is released.

ChrisHegarty (Contributor) left a comment:

LGTM

grcevski merged commit 94d7a31 into elastic:main on Aug 29, 2022
grcevski deleted the operator/fix_error_save branch on Aug 29, 2022 at 14:23
grcevski (Contributor, Author) commented:

Thanks Chris!

elasticsearchmachine (Collaborator) commented:

💚 Backport successful

Branch: 8.4
