-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Optimize AsyncSearchErrorTraceIT to avoid failures #137716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize AsyncSearchErrorTraceIT to avoid failures #137716
Conversation
|
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
benchaplin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions...
| return; | ||
| } | ||
|
|
||
| // Make sure the .async-search system index is green before deleting it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ensure green?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the time we reach the cleanup phase, the .async-search shard may still be relocating or recovering, which is when shard-lock timeouts are most likely to occur during test teardown. To prevent this, we ensure that the .async-search system index is fully ready and stable before deleting the async search result.
| // check that the stack trace was not sent from the data node to the coordinating node | ||
| ErrorTraceHelper.assertStackTraceCleared(internalCluster()); | ||
| } finally { | ||
| deleteAsyncSearchIfPresent(createAsyncResponseEntity); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me, I see a similar thing is done in CCSDuelIT for async searches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, Ben, for the review. Yes, this is actually the most important part of this PR, ensuring that the entry (with id) in the index is deleted before the test reaches the “after test cleanup,” where the exception is thrown.
| value = "org.elasticsearch.xpack.search.MutableSearchResponse:DEBUG,org.elasticsearch.xpack.search.AsyncSearchTask:DEBUG" | ||
| ) | ||
| public class AsyncSearchErrorTraceIT extends ESIntegTestCase { | ||
| public class AsyncSearchErrorTraceIT extends AsyncSearchIntegTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed for the fix or is it an unrelated improvement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn’t necessary for this test; I added it to keep it consistent with the other tests in the same package.
| XContentType entityContentType = XContentType.fromMediaType(response.getEntity().getContentType().getValue()); | ||
| return XContentHelper.convertToMap(entityContentType.xContent(), response.getEntity().getContent(), false); | ||
|
|
||
| HttpEntity entity = response.getEntity(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain what this entity stuff has to do with the test errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess nothing! :) I got the idea from ESRestTestCase, to ensure that the connection is released back to the pool regardless of what happens. Upon re-examining this, the ShardLockObtainFailedException is a server-side issue related to the shard lifecycle, and consuming the entity doesn’t affect shard locks, it only impacts client connection reuse. I’ll revert this part.
However, I’m keeping it in the deleteAsyncSearchIfPresent to ensure that the Http connection is fully consumed and returned to the pool before teardown.
…rempapis/elasticsearch into fix/AsyncSearchErrorTraceIT_shard_lock
benchaplin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for explaining!
💚 Backport successful
|
…-json * upstream/main: (158 commits) Cleanup files from repo root folder (elastic#138030) Implement OpenShift AI integration for chat completion, embeddings, and reranking (elastic#136624) Optimize AsyncSearchErrorTraceIT to avoid failures (elastic#137716) Removes support for null TransportService in RemoteClusterService (elastic#137939) Mute org.elasticsearch.index.mapper.DateFieldMapperTests testSortShortcuts elastic#138018 rest-api-spec: fix type of enums (elastic#137521) Update Gradle wrapper to 9.2.0 (elastic#136155) Add RCS Strong Verification Documentation (elastic#137822) Use docvalue skippers on dimension fields (elastic#137029) Introduce INDEX_SHARD_COUNT_FORMAT (elastic#137210) Mute org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT testCreatesChatCompletion_AndThenCreatesTextEmbedding elastic#138012 Fix ES|QL search context creation to use correct results type (elastic#137994) Improve Snapshot Logging (elastic#137470) Support extra output field in TOP function (elastic#135434) Remove NumericDoubleValues class (elastic#137884) [ML] Fix ML calendar event update scalability issues (elastic#136886) Task may be unregistered outside of the trace context in exceptional cases. (elastic#137865) Refine workaround for S3 repo analysis known issue (elastic#138000) Additional DEBUG logging on authc failures (elastic#137941) Cleanup index resolution (elastic#137867) ...
In some previous work (#137078), we introduced changes that mitigated the
.async-searchshard lock issue. While that significantly reduced the frequency of the problem, a few cases still persisted.The error can occur when
HTTPResponse objects are not fully closed, or when persisted.async-searchresults (keep_on_completion=true) are not deleted. Either situation can leave file handles open long enough that the .async-search shard remains locked during cluster teardown, resulting in:This change further addresses the issue by preventing unreleased resources that cause
.async-searchshard lock failures:HttpEntityto release connections..async-searchresults created withkeep_on_completion=trueare now deleted in a finally block after each test.Closes #137150