
Failed to update expiration time for async-search #63213

Closed
consulthys opened this issue Oct 3, 2020 · 10 comments · Fixed by #63652
Assignees
Labels
>bug · :Search/Search (Search-related issues that do not fall into other categories) · Team:Search (Meta label for search team)

Comments

@consulthys
Contributor

Elasticsearch version (bin/elasticsearch --version):
7.9.1 (Elastic Cloud)

Plugins installed: []
N/A

JVM version (java -version):

        "version" : "14.0.1",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "14.0.1+7",
        "vm_vendor" : "AdoptOpenJDK",
        "bundled_jdk" : true,

OS version (uname -a if on a Unix-like system):

        "name" : "Linux",
        "pretty_name" : "CentOS Linux 7 (Core)",
        "arch" : "amd64",
        "version" : "4.15.0-1027-aws",

Description of the problem including expected versus actual behavior:

In the ES Cloud console logs, I see the following ERROR log one or two times a day:

[instance-0000000043] failed to update expiration time for async-search [FnFMaHdUUmZQVG1tMWVuZy05bUo4aEEfUGY1ZThBaTRTeHlqQWNBcHBVdFlJUTozNDE4NjcxNw==]
org.elasticsearch.transport.RemoteTransportException: [instance-0000000054][172.17.0.15:19354][indices:data/write/update[s]]
Caused by: org.elasticsearch.index.engine.VersionConflictEngineException: [qLhwTRfPTmm1eng-9mJ8hA]: version conflict, required seqNo [144258], primary term [18]. but no document was found
	at org.elasticsearch.index.engine.InternalEngine.planIndexingAsPrimary(InternalEngine.java:1052) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.index.engine.InternalEngine.indexingStrategyForOperation(InternalEngine.java:1022) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:911) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:819) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:791) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:748) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:274) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:164) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:209) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:116) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:82) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:161) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.9.1.jar:7.9.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]

It's hard to know where this comes from (maybe from interactions in Kibana), but I wouldn't expect to see such errors. It seems to be due to a version conflict when updating a document in the `.async-search` index.
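For reference, the expiration time in question is, as far as I can tell, what gets extended when a stored async search is fetched with a `keep_alive` parameter; a minimal sketch of that call (the ID is a placeholder):

```
GET /_async_search/<async_search_id>?keep_alive=5m
```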

@consulthys added the >bug and needs:triage (Requires assignment of a team area label) labels on Oct 3, 2020
@matriv added the :Search/Search (Search-related issues that do not fall into other categories) label on Oct 5, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine added the Team:Search (Meta label for search team) label on Oct 5, 2020
@jimczi
Contributor

jimczi commented Oct 6, 2020

This error means that the search was cancelled or deleted, so the document doesn't exist anymore. Are you seeing any error in Kibana when this is logged in ES? In any case, this logging is misleading since we shouldn't log anything (at least not at the error level) if the document is "just" missing. I'll leave the bug label on so that the logging level gets fixed.
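(For reference, the stored result goes away either when the search is cancelled or when it is deleted explicitly, e.g. with the call below; the ID is a placeholder.)

```
DELETE /_async_search/<async_search_id>
```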

@consulthys
Contributor Author

Thanks for looking into this, @jimczi!
I don't really see any error in Kibana.

@tomsozolins

I have had the same problem since 7.9.1; it's really annoying to see the error every time I click search in Kibana. I see the error message in the Kibana frontend and in the Elasticsearch logs.
Now running 7.9.2 and the error is still happening.

Elastic logs:
failed to update expiration time for async-search [FmN6QVNOSENCUmVPNmV4bnRQYTRUUkEfdWVDUXJwbktSU09TQV80ZWZxUEhVQTo4NjUyNTEwMA==]

@javanna removed the needs:triage (Requires assignment of a team area label) label on Oct 8, 2020
@javanna self-assigned this on Oct 8, 2020
@jimczi
Contributor

jimczi commented Oct 8, 2020

I have had the same problem since 7.9.1; it's really annoying to see the error every time I click search in Kibana

Can you share the error that you see in Kibana and the stack trace that is logged in ES? Does it happen on every search? It could be a bug in Kibana too, so we'll need to understand how this error is triggered.

@tomsozolins

tomsozolins commented Oct 9, 2020

JVM version:

OpenJDK Runtime Environment (build 1.8.0_262-b10)
OpenJDK 64-Bit Server VM (build 25.262-b10, mixed mode)

The error is happening on any index (filebeat, winlogbeat, etc.).
Here is the response from clicking search in Kibana:

{"statusCode":404,"error":"Not Found","message":"[resource_not_found_exception] Fi15TVZfQnBZVElDZUtmQjlubmhyUncfdWVDUXJwbktSU09TQV80ZWZxUEhVQTo5MTU2MzI3Nw==","attributes":{"error":{"root_cause":[{"type":"resource_not_found_exception","reason":"Fi15TVZfQnBZVElDZUtmQjlubmhyUncfdWVDUXJwbktSU09TQV80ZWZxUEhVQTo5MTU2MzI3Nw=="}],"type":"resource_not_found_exception","reason":"Fi15TVZfQnBZVElDZUtmQjlubmhyUncfdWVDUXJwbktSU09TQV80ZWZxUEhVQTo5MTU2MzI3Nw=="}}}

Elasticsearch logs:

elasticsearch.server
[elasticsearch.server][ERROR] failed to update expiration time for async-search [Fnc5QmJBY21nUzVpZk9yUGJNSzJyLVEfWGQzRndnSlFUQ2VJZk14aGM4U0tCQTo4NjE4NjA0NA==]
12:45:16.339
elasticsearch.server
[elasticsearch.server][ERROR] failed to update expiration time for async-search [Fk54WFByNzJqUzlTMXZMNFlTMGRZR3cfWGQzRndnSlFUQ2VJZk14aGM4U0tCQTo4NjE4NjkyMQ==]
org.elasticsearch.transport.RemoteTransportException: [node1][xxx.xxx.xxx.xxx:9300][indices:data/write/update[s]]
Caused by: org.elasticsearch.index.engine.VersionConflictEngineException: [NxXPr72jS9S1vL4YS0dYGw]: version conflict, required seqNo [16838], primary term [19]. current document has seqNo [16839] and primary term [19]
	at org.elasticsearch.index.engine.InternalEngine.planIndexingAsPrimary(InternalEngine.java:1059) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.index.engine.InternalEngine.indexingStrategyForOperation(InternalEngine.java:1022) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:911) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:819) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:791) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:748) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:274) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:164) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:209) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:116) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:82) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:161) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.9.2.jar:7.9.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.9.2.jar:7.9.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
12:45:16.339
elasticsearch.server
[elasticsearch.server][ERROR] failed to update expiration time for async-search [Fk54WFByNzJqUzlTMXZMNFlTMGRZR3cfWGQzRndnSlFUQ2VJZk14aGM4U0tCQTo4NjE4NjkyMQ==]
12:45:32.922
elasticsearch.server
[elasticsearch.server][ERROR] failed to update expiration time for async-search [FnJDSXlUeS1ZUzh1UnF6RVloZVJwd2cfWGQzRndnSlFUQ2VJZk14aGM4U0tCQTo4NjE4NzgxMg==]

@curtdept

curtdept commented Oct 12, 2020

Yeah, we hit this one a lot across instances; it seems to make queries hang while it resolves (7.9.2 / ECK).

I spent more time with this. It would be nice to have some way to control which nodes this index goes on. My issue seems to be directly related to the primary being on a heavily utilized hot node; deleting the index and forcing the system to recreate it on another set of nodes (I had to do this a few times) showed an immediate and lasting benefit.
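For what it's worth, for ordinary indices that kind of control exists as index-level shard allocation filtering; whether it is supported or advisable for the internal async-search index isn't something verified here, so treat this as a generic sketch (index name and node name are placeholders):

```
PUT /my-index/_settings
{
  "index.routing.allocation.require._name": "warm-node-1"
}
```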

Guessing somewhat related to #37867

@tomsozolins

tomsozolins commented Oct 14, 2020


Here is the Kibana stack trace:

    at Fetch._callee3$ (https://kibana/33984/bundles/core/core.entry.js:34:109213)
    at l (https://kibana/33984/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js:368:155323)
    at Generator._invoke (https://kibana/33984/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js:368:155076)
    at Generator.forEach.e.<computed> [as next] (https://kibana/33984/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js:368:155680)
    at fetch_asyncGeneratorStep (https://kibana/33984/bundles/core/core.entry.js:34:102354)
    at _next (https://kibana/33984/bundles/core/core.entry.js:34:102670)

jimczi added a commit to jimczi/elasticsearch that referenced this issue Oct 14, 2020
The _async_search APIs can throw version conflict exception when the internal response
is updated concurrently. That can happen if the final response is written while the user
extends the expiration time. That scenario should be rare but it happened in Kibana for
several users so this change ensures that updates are retried at least 5 times. That
should resolve the transient errors for Kibana. This change also preserves the version
conflict exception in case the retry didn't work instead of returning a confusing 404.
This commit also ensures that we don't delete the response if the search was cancelled
internally and not deleted explicitly by the user.

Closes elastic#63213
@jimczi
Contributor

jimczi commented Oct 14, 2020

I opened #63652 after reading the additional stack traces that you provided. The one in the description is about a deleted search, so it's not really a bug, but the subsequent ones are due to a race condition: the update of the expiration time competes with the update of the response when the search finishes, so a version conflict can happen. Retrying the updates in #63652 should be enough to fix this bug.
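For illustration only (#63652 retries internally rather than going through this API): the same optimistic-concurrency idea is exposed on the public Update API, where `retry_on_conflict` re-fetches the document and re-applies the update when a version conflict occurs. Index, document ID, field, and value below are placeholders:

```
POST /my-index/_update/<doc_id>?retry_on_conflict=5
{
  "doc": {
    "expiration_time": 1602848716339
  }
}
```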

@consulthys
Contributor Author

consulthys commented Oct 14, 2020

Thank you @jimczi 👍
Also thanks to @tomsozolins for sharing stack traces 👍

jimczi added a commit that referenced this issue Oct 16, 2020
* Async search should retry updates on version conflict

The _async_search APIs can throw version conflict exception when the internal response
is updated concurrently. That can happen if the final response is written while the user
extends the expiration time. That scenario should be rare but it happened in Kibana for
several users so this change ensures that updates are retried at least 5 times. That
should resolve the transient errors for Kibana. This change also preserves the version
conflict exception in case the retry didn't work instead of returning a confusing 404.
This commit also ensures that we don't delete the response if the search was cancelled
internally and not deleted explicitly by the user.

Closes #63213