Skip to content

Conversation

@drempapis
Copy link
Contributor

In some previous work (#137078), we introduced changes that mitigated the .async-search shard lock issue. While that significantly reduced the frequency of the problem, a few cases still persisted.

The error can occur when HTTP Response objects are not fully closed, or when persisted .async-search results (keep_on_completion=true) are not deleted. Either situation can leave file handles open long enough that the .async-search shard remains locked during cluster teardown, resulting in:

Shard [.async-search][0] is still locked after 5 sec waiting

This change further addresses the issue by preventing unreleased resources that cause .async-search shard lock failures:

  • Deterministic HTTP response cleanup: all responses now fully consume the HttpEntity to release connections.
  • Async-search cleanup: persisted .async-search results created with keep_on_completion=true are now deleted in a finally block after each test.

Closes #137150

@drempapis drempapis requested a review from benchaplin November 7, 2025 10:18
@drempapis drempapis added >test Issues or PRs that are addressing/adding tests auto-backport Automatically create backport pull requests when merged Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations v9.2.0 v9.3.0 labels Nov 7, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

Copy link
Contributor

@benchaplin benchaplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions...

return;
}

// Make sure the .async-search system index is green before deleting it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ensure green?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the time we reach the cleanup phase, the .async-search shard may still be relocating or recovering, which is when shard-lock timeouts are most likely to occur during test teardown. To prevent this, we ensure that the .async-search system index is fully ready and stable before deleting the async search result.

// check that the stack trace was not sent from the data node to the coordinating node
ErrorTraceHelper.assertStackTraceCleared(internalCluster());
} finally {
deleteAsyncSearchIfPresent(createAsyncResponseEntity);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, I see a similar thing is done in CCSDuelIT for async searches.

Copy link
Contributor Author

@drempapis drempapis Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, Ben, for the review. Yes, this is actually the most important part of this PR, ensuring that the entry (with id) in the index is deleted before the test reaches the “after test cleanup,” where the exception is thrown.

value = "org.elasticsearch.xpack.search.MutableSearchResponse:DEBUG,org.elasticsearch.xpack.search.AsyncSearchTask:DEBUG"
)
public class AsyncSearchErrorTraceIT extends ESIntegTestCase {
public class AsyncSearchErrorTraceIT extends AsyncSearchIntegTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed for the fix or is it an unrelated improvement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn’t necessary for this test; I added it to keep it consistent with the other tests in the same package.

XContentType entityContentType = XContentType.fromMediaType(response.getEntity().getContentType().getValue());
return XContentHelper.convertToMap(entityContentType.xContent(), response.getEntity().getContent(), false);

HttpEntity entity = response.getEntity();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what this entity stuff has to do with the test errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess nothing! :) I got the idea from ESRestTestCase, to ensure that the connection is released back to the pool regardless of what happens. Upon re-examining this, the ShardLockObtainFailedException is a server-side issue related to the shard lifecycle, and consuming the entity doesn’t affect shard locks, it only impacts client connection reuse. I’ll revert this part.

However, I’m keeping it in the deleteAsyncSearchIfPresent to ensure that the Http connection is fully consumed and returned to the pool before teardown.

Copy link
Contributor

@benchaplin benchaplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for explaining!

@drempapis drempapis merged commit 118ce80 into elastic:main Nov 13, 2025
34 checks passed
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
9.2

drempapis added a commit to drempapis/elasticsearch that referenced this pull request Nov 13, 2025
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 13, 2025
…-json

* upstream/main: (158 commits)
  Cleanup files from repo root folder (elastic#138030)
  Implement OpenShift AI integration for chat completion, embeddings, and reranking (elastic#136624)
  Optimize AsyncSearchErrorTraceIT to avoid failures (elastic#137716)
  Removes support for null TransportService in RemoteClusterService (elastic#137939)
  Mute org.elasticsearch.index.mapper.DateFieldMapperTests testSortShortcuts elastic#138018
  rest-api-spec: fix type of enums (elastic#137521)
  Update Gradle wrapper to 9.2.0 (elastic#136155)
  Add RCS Strong Verification Documentation (elastic#137822)
  Use docvalue skippers on dimension fields (elastic#137029)
  Introduce INDEX_SHARD_COUNT_FORMAT (elastic#137210)
  Mute org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT testCreatesChatCompletion_AndThenCreatesTextEmbedding elastic#138012
  Fix ES|QL search context creation to use correct results type (elastic#137994)
  Improve Snapshot Logging (elastic#137470)
  Support extra output field in TOP function (elastic#135434)
  Remove NumericDoubleValues class (elastic#137884)
  [ML] Fix ML calendar event update scalability issues (elastic#136886)
  Task may be unregistered outside of the trace context in exceptional cases. (elastic#137865)
  Refine workaround for S3 repo analysis known issue (elastic#138000)
  Additional DEBUG logging on authc failures (elastic#137941)
  Cleanup index resolution (elastic#137867)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch >test Issues or PRs that are addressing/adding tests v9.2.0 v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] AsyncSearchErrorTraceIT testAsyncSearchFailingQueryErrorTraceTrue failing

3 participants