Fix FileSettingsService hang on error update #89630
Conversation
Pinging @elastic/es-core-infra (Team:Core/Infra)
Hi @grcevski, I've created a changelog YAML for you.
@@ -186,8 +188,7 @@ static boolean isNewError(ReservedStateMetadata existingMetadata, Long newStateVersion) {
            || existingMetadata.errorMetadata().version() < newStateVersion);
    }

-   private void saveErrorState(ErrorState errorState) {
-       ClusterState clusterState = clusterService.state();
+   private void saveErrorState(ClusterState clusterState, ErrorState errorState) {
Passing in the most recent cluster state from the cluster state update task is somewhat tangential to the fix; it just makes the issue reproduce more reliably. ClusterService.state() might not be the latest state, but the state we receive in the update task will be.
@@ -170,6 +170,8 @@ public void onFailure(Exception e) {
        if (isNewError(existingMetadata, reservedStateVersion.version())) {
            logger.debug("Failed to apply reserved cluster state", e);
            errorListener.accept(e);
+       } else {
+           errorListener.accept(null);
This is the actual bug: we never called the callback, so the async watcher was never released.
LGTM
Thanks Chris!
💚 Backport successful
While working on new tests for #89567 I discovered a bug where, on a duplicate error update, we don't unlock the file settings watcher thread properly.
It's a race condition: the settings file can be updated in quicker succession than the error state is updated. We check for a duplicate error state and skip publishing a new cluster state update event, but we didn't call the error listener with null to unlock the watcher.
This is intermittent and should have been caught by our existing error state save integration tests, but it just happened to be hit by the SLM-specific test I was writing.
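To illustrate the failure mode, here is a minimal, self-contained sketch (hypothetical names, not the actual Elasticsearch classes) of the pattern involved: a watcher thread blocks on a latch that is only released when the error listener fires, so the listener must be invoked on every code path, including the duplicate-error no-op path, or the watcher hangs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ListenerRelease {

    // Mimics the error-handling branch: before the fix, the listener was only
    // invoked when the error was new, leaving the watcher blocked otherwise.
    static void applyUpdate(boolean isNewError, Consumer<Exception> errorListener) {
        if (isNewError) {
            errorListener.accept(new IllegalStateException("failed to apply reserved state"));
        } else {
            // The fix: still call back with null so the waiting watcher is released.
            errorListener.accept(null);
        }
    }

    // Simulates the watcher: returns true if the listener released the latch
    // within the timeout, i.e. the watcher thread did not hang.
    static boolean watcherReleased(boolean isNewError) throws InterruptedException {
        CountDownLatch completion = new CountDownLatch(1);
        applyUpdate(isNewError, e -> completion.countDown());
        return completion.await(1, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(watcherReleased(true));   // new error: listener fires
        System.out.println(watcherReleased(false));  // duplicate error: fixed path also fires
    }
}
```

With the `else` branch removed, the second call would time out, which is the hang this PR fixes.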