Optimize AsyncSearchErrorTraceIT to avoid failures #137716

drempapis · 2025-11-07T10:17:59Z

In some previous work (#137078), we introduced changes that mitigated the .async-search shard lock issue. While that significantly reduced the frequency of the problem, a few cases still persisted.

The error can occur when HTTP Response objects are not fully closed, or when persisted .async-search results (keep_on_completion=true) are not deleted. Either situation can leave file handles open long enough that the .async-search shard remains locked during cluster teardown, resulting in:

Shard [.async-search][0] is still locked after 5 sec waiting

This change further addresses the issue by preventing unreleased resources that cause .async-search shard lock failures:

- Deterministic HTTP response cleanup: all responses now fully consume the HttpEntity to release connections.
- Async-search cleanup: persisted .async-search results created with keep_on_completion=true are now deleted in a finally block after each test.
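
The finally-block cleanup described above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: `AsyncResultStoreSketch`, `submit`, `deleteIfPresent`, and `runWithCleanup` are hypothetical names, and an in-memory map stands in for the .async-search index.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the .async-search index (not the Elasticsearch API).
final class AsyncResultStoreSketch {
    private final Map<String, String> persisted = new ConcurrentHashMap<>();

    // Mimics a submit with keep_on_completion=true: the result stays stored until deleted.
    String submit(String id, String result) {
        persisted.put(id, result);
        return id;
    }

    // Idempotent delete, so cleanup is safe even when nothing was stored.
    void deleteIfPresent(String id) {
        if (id != null) {
            persisted.remove(id);
        }
    }

    int size() {
        return persisted.size();
    }

    // Runs the test assertions and always deletes the stored result afterwards,
    // including when the assertions throw.
    static void runWithCleanup(AsyncResultStoreSketch store, Runnable assertions) {
        String id = null;
        try {
            id = store.submit("search-1", "stored result");
            assertions.run();
        } finally {
            store.deleteIfPresent(id);
        }
    }
}
```

The point of the pattern is that the stored entry is gone before teardown starts, whether the assertions pass or fail.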

Closes #137150

elasticsearchmachine · 2025-11-07T10:18:25Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

benchaplin

A few questions...

benchaplin · 2025-11-11T18:35:16Z

...rch/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java

+            return;
+        }
+
+        // Make sure the .async-search system index is green before deleting it


Why ensure green?

drempapis · 2025-11-12

By the time we reach the cleanup phase, the .async-search shard may still be relocating or recovering, which is when shard-lock timeouts are most likely to occur during test teardown. To prevent this, we ensure that the .async-search system index is fully ready and stable before deleting the async search result.
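
As a rough illustration of "wait until green, then delete", the sketch below polls a health source until it reports green or a timeout elapses. It is an assumption-laden stand-in, not the test's real code: `WaitForGreenSketch`, `waitForGreen`, and the string-valued health supplier are hypothetical, and the actual test relies on the Elasticsearch test framework for this check.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: polls a health supplier until it reports "green" or time runs out.
final class WaitForGreenSketch {
    static boolean waitForGreen(Supplier<String> health, long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if ("green".equals(health.get())) {
                return true; // index is stable, safe to proceed with the delete
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out while still not green
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

A caller would run the delete only when `waitForGreen` returns true, and otherwise fail or skip the cleanup explicitly.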

benchaplin · 2025-11-11T18:36:17Z

...rch/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java

+            // check that the stack trace was not sent from the data node to the coordinating node
+            ErrorTraceHelper.assertStackTraceCleared(internalCluster());
+        } finally {
+            deleteAsyncSearchIfPresent(createAsyncResponseEntity);


This makes sense to me, I see a similar thing is done in CCSDuelIT for async searches.

drempapis · 2025-11-12

Thank you, Ben, for the review. Yes, this is actually the most important part of this PR: ensuring that the entry (with id) in the index is deleted before the test reaches the “after test cleanup,” where the exception is thrown.

benchaplin · 2025-11-11T18:36:44Z

...rch/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java

    value = "org.elasticsearch.xpack.search.MutableSearchResponse:DEBUG,org.elasticsearch.xpack.search.AsyncSearchTask:DEBUG"
 )
-public class AsyncSearchErrorTraceIT extends ESIntegTestCase {
+public class AsyncSearchErrorTraceIT extends AsyncSearchIntegTestCase {


Is this needed for the fix or is it an unrelated improvement?

drempapis · 2025-11-12

This isn’t necessary for this test; I added it to keep it consistent with the other tests in the same package.

benchaplin · 2025-11-11T18:37:17Z

...rch/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java

-        XContentType entityContentType = XContentType.fromMediaType(response.getEntity().getContentType().getValue());
-        return XContentHelper.convertToMap(entityContentType.xContent(), response.getEntity().getContent(), false);
+
+        HttpEntity entity = response.getEntity();


Can you explain what this entity stuff has to do with the test errors?

drempapis · 2025-11-12

I guess nothing! :) I got the idea from ESRestTestCase, to ensure that the connection is released back to the pool regardless of what happens. Upon re-examining this, the ShardLockObtainFailedException is a server-side issue related to the shard lifecycle. Consuming the entity doesn’t affect shard locks; it only impacts client connection reuse. I’ll revert this part.

However, I’m keeping it in deleteAsyncSearchIfPresent to ensure that the HTTP response is fully consumed and the connection is returned to the pool before teardown.
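
To make "fully consumed" concrete, here is a minimal sketch of draining a response body and closing it, using only `java.io`. `DrainSketch` and `drainAndClose` are hypothetical names, and a plain `InputStream` stands in for the entity content; with Apache HttpClient the same effect is typically obtained with `EntityUtils.consume(entity)`. The PR's own implementation is not shown in this thread, so treat this as an illustration of the idea only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: reads and discards whatever is left of a body, then closes it.
final class DrainSketch {
    static long drainAndClose(InputStream body) {
        if (body == null) {
            return 0L; // nothing to release
        }
        byte[] buffer = new byte[8192];
        long discarded = 0L;
        // try-with-resources closes the stream even if a read fails
        try (InputStream in = body) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                discarded += read;
            }
            return discarded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading to end-of-stream is what lets a pooled HTTP client reuse the connection; as noted in the reply above, this affects client connection reuse only, not server-side shard locks.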


benchaplin

LGTM, thanks for explaining!

elasticsearchmachine · 2025-11-13T15:13:41Z

💚 Backport successful

Status	Branch	Result
✅	9.2


optimize test class

07d3482

drempapis requested a review from benchaplin November 7, 2025 10:18

elasticsearchmachine and others added 3 commits November 7, 2025 10:24

[CI] Auto commit changes from spotless

3eeb382

Merge branch 'main' into fix/AsyncSearchErrorTraceIT_shard_lock

f85e0e0

Merge branch 'main' into fix/AsyncSearchErrorTraceIT_shard_lock

9f43a48

benchaplin reviewed Nov 11, 2025


drempapis and others added 13 commits November 12, 2025 09:04

Merge branch 'main' into fix/AsyncSearchErrorTraceIT_shard_lock

8eb666d

update after review

7a196af

update

9f29d54

Merge branch 'main' into fix/AsyncSearchErrorTraceIT_shard_lock

579a670

[CI] Auto commit changes from spotless

a7d09d4

update after review

470111d

Merge branch 'fix/AsyncSearchErrorTraceIT_shard_lock' of github.com:d…

2564e70

…rempapis/elasticsearch into fix/AsyncSearchErrorTraceIT_shard_lock

Merge branch 'main' into fix/AsyncSearchErrorTraceIT_shard_lock

87b0136

Merge branch 'main' into fix/AsyncSearchErrorTraceIT_shard_lock

24b8090

Merge branch 'main' into fix/AsyncSearchErrorTraceIT_shard_lock

abe679e

Merge branch 'main' into fix/AsyncSearchErrorTraceIT_shard_lock

9436352

Merge branch 'main' into fix/AsyncSearchErrorTraceIT_shard_lock

01a6f6e

Merge branch 'main' into fix/AsyncSearchErrorTraceIT_shard_lock

e7dac60

benchaplin approved these changes Nov 13, 2025


drempapis merged commit 118ce80 into elastic:main Nov 13, 2025
34 checks passed

drempapis mentioned this pull request Nov 13, 2025

[9.2] Optimize AsyncSearchErrorTraceIT to avoid failures (#137716) #138026

Merged

drempapis added a commit to drempapis/elasticsearch that referenced this pull request Nov 13, 2025

Optimize AsyncSearchErrorTraceIT to avoid failures (elastic#137716)

1b14f78

elasticsearchmachine pushed a commit that referenced this pull request Nov 13, 2025

Optimize AsyncSearchErrorTraceIT to avoid failures (#137716) (#138026)

6a3e650
