Elasticsearch is fsyncing on transport threads #51904

Closed
Tim-Brooks opened this issue Feb 5, 2020 · 3 comments · Fixed by #51957
Assignees: Tim-Brooks
Labels: >bug, :Distributed/CRUD (a catch-all label for issues around indexing, updating, and getting a doc by id; not search)

Comments

@Tim-Brooks (Contributor) commented Feb 5, 2020

It is currently possible for a cluster state listener to execute an fsync on a transport thread.

  1. Currently, in TransportShardBulkAction, it is possible that a shard operation will trigger a mapping update.
  2. When this happens, Elasticsearch registers a ClusterStateObserver.Listener to continue once the mapping update is complete.
  3. This listener will eventually attempt to reschedule the write operation.
  4. If the write thread pool cannot accept this operation, the onRejection callback will fail the outstanding operations and complete the request (presumably to notify the caller of the operations that did complete).
  5. Completing a TransportShardBulkAction will attempt to fsync or refresh as necessary after initiating replication, so that fsync ends up running on whatever thread delivered the rejection (see the sketch after this list).
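To make the failure mode concrete, here is a minimal, self-contained Java sketch of the pattern. This is not Elasticsearch code: the pool sizes, class names, and the finishRequestAndFsync helper are all made up. The point it illustrates is that a rejection callback is handled on the submitting thread, so if that thread is a transport worker, the "fsync" happens there.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionRunsOnCallerThread {

    // Stand-in for a rejectable task: it knows how to finish the request when
    // the executor refuses to accept it.
    abstract static class RejectableRunnable implements Runnable {
        abstract void onRejection(Exception e);
    }

    static void finishRequestAndFsync() {
        // In the real issue this path reaches IndexShard#sync -> Translog#ensureSynced.
        System.out.println("fsync performed on thread: " + Thread.currentThread().getName());
    }

    public static void main(String[] args) throws Exception {
        // A saturated "write" pool: one thread, a one-slot queue, abort-on-reject.
        ThreadPoolExecutor writePool = new ThreadPoolExecutor(
            1, 1, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));
        writePool.submit(() -> { try { Thread.sleep(1_000); } catch (InterruptedException ignored) {} });
        writePool.submit(() -> {}); // fills the queue

        // The mapping-update listener fires on a transport-like thread and tries
        // to reschedule the write operation onto the saturated write pool.
        Thread transportWorker = new Thread(() -> {
            RejectableRunnable continuation = new RejectableRunnable() {
                @Override public void run() { finishRequestAndFsync(); }
                @Override void onRejection(Exception e) {
                    // Rejection is handled inline, so the fsync happens here,
                    // on the transport worker, not on the write pool.
                    finishRequestAndFsync();
                }
            };
            try {
                writePool.execute(continuation);
            } catch (RejectedExecutionException e) {
                continuation.onRejection(e);
            }
        }, "transport_worker[T#1]");

        transportWorker.start();
        transportWorker.join();
        writePool.shutdownNow();
    }
}
```

Running this prints that the fsync was performed on transport_worker[T#1], which is the shape of the stack trace below.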

Here is a transport_worker stack trace. I also think these listeners might be executed on cluster state threads?

ensureSynced:808, Translog (org.elasticsearch.index.translog)
ensureSynced:824, Translog (org.elasticsearch.index.translog)
ensureTranslogSynced:513, InternalEngine (org.elasticsearch.index.engine)
write:2980, IndexShard$5 (org.elasticsearch.index.shard)
processList:108, AsyncIOProcessor (org.elasticsearch.common.util.concurrent)
drainAndProcessAndRelease:96, AsyncIOProcessor (org.elasticsearch.common.util.concurrent)
put:84, AsyncIOProcessor (org.elasticsearch.common.util.concurrent)
sync:3003, IndexShard (org.elasticsearch.index.shard)
run:320, TransportWriteAction$AsyncAfterWriteAction (org.elasticsearch.action.support.replication)
runPostReplicationActions:163, TransportWriteAction$WritePrimaryResult (org.elasticsearch.action.support.replication)
handlePrimaryResult:136, ReplicationOperation (org.elasticsearch.action.support.replication)
accept:-1, 359201671 (org.elasticsearch.action.support.replication.ReplicationOperation$$Lambda$3596)
onResponse:63, ActionListener$1 (org.elasticsearch.action)
onResponse:163, ActionListener$4 (org.elasticsearch.action)
completeWith:336, ActionListener (org.elasticsearch.action)
finishRequest:186, TransportShardBulkAction$2 (org.elasticsearch.action.bulk)
onRejection:182, TransportShardBulkAction$2 (org.elasticsearch.action.bulk)
onRejection:681, ThreadContext$ContextPreservingAbstractRunnable (org.elasticsearch.common.util.concurrent)
execute:90, EsThreadPoolExecutor (org.elasticsearch.common.util.concurrent)
lambda$doRun$0:160, TransportShardBulkAction$2 (org.elasticsearch.action.bulk)
accept:-1, 95719050 (org.elasticsearch.action.bulk.TransportShardBulkAction$2$$Lambda$3833)
onResponse:63, ActionListener$1 (org.elasticsearch.action)
lambda$onResponse$0:289, TransportShardBulkAction$3 (org.elasticsearch.action.bulk)
run:-1, 1539599321 (org.elasticsearch.action.bulk.TransportShardBulkAction$3$$Lambda$3857)
onResponse:251, ActionListener$5 (org.elasticsearch.action)
onNewClusterState:125, TransportShardBulkAction$1 (org.elasticsearch.action.bulk)
onNewClusterState:311, ClusterStateObserver$ContextPreservingListener (org.elasticsearch.cluster)
waitForNextChange:169, ClusterStateObserver (org.elasticsearch.cluster)
waitForNextChange:120, ClusterStateObserver (org.elasticsearch.cluster)
waitForNextChange:112, ClusterStateObserver (org.elasticsearch.cluster)
lambda$shardOperationOnPrimary$1:122, TransportShardBulkAction (org.elasticsearch.action.bulk)
accept:-1, 1672258490 (org.elasticsearch.action.bulk.TransportShardBulkAction$$Lambda$3831)
onResponse:277, TransportShardBulkAction$3 (org.elasticsearch.action.bulk)
onResponse:273, TransportShardBulkAction$3 (org.elasticsearch.action.bulk)
onResponse:282, ActionListener$6 (org.elasticsearch.action)
onResponse:116, MappingUpdatedAction$1 (org.elasticsearch.cluster.action.index)
onResponse:113, MappingUpdatedAction$1 (org.elasticsearch.cluster.action.index)
lambda$executeLocally$0:97, NodeClient (org.elasticsearch.client.node)
accept:-1, 2099146048 (org.elasticsearch.client.node.NodeClient$$Lambda$2772)
onResponse:144, TaskManager$1 (org.elasticsearch.tasks)
onResponse:138, TaskManager$1 (org.elasticsearch.tasks)
handleResponse:54, ActionListenerResponseHandler (org.elasticsearch.action)
handleResponse:1053, TransportService$ContextRestoreResponseHandler (org.elasticsearch.transport)
doRun:220, InboundHandler$1 (org.elasticsearch.transport)
run:37, AbstractRunnable (org.elasticsearch.common.util.concurrent)
execute:196, EsExecutors$DirectExecutorService (org.elasticsearch.common.util.concurrent)
handleResponse:212, InboundHandler (org.elasticsearch.transport)
messageReceived:138, InboundHandler (org.elasticsearch.transport)
inboundMessage:102, InboundHandler (org.elasticsearch.transport)
inboundMessage:664, TcpTransport (org.elasticsearch.transport)
consumeNetworkReads:688, TcpTransport (org.elasticsearch.transport)
consumeReads:276, MockNioTransport$MockTcpReadWriteHandler (org.elasticsearch.transport.nio)
handleReadBytes:228, SocketChannelContext (org.elasticsearch.nio)
read:40, BytesChannelContext (org.elasticsearch.nio)
handleRead:139, EventHandler (org.elasticsearch.nio)
handleRead:151, TestEventHandler (org.elasticsearch.transport.nio)
handleRead:420, NioSelector (org.elasticsearch.nio)
processKey:246, NioSelector (org.elasticsearch.nio)
singleLoop:174, NioSelector (org.elasticsearch.nio)
runLoop:131, NioSelector (org.elasticsearch.nio)
run:-1, 461835914 (org.elasticsearch.nio.NioSelectorGroup$$Lambda$1709)
run:835, Thread (java.lang)
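As an aside, this is the kind of bug a thread-name assertion before blocking I/O will catch. The following is a hypothetical guard, not Elasticsearch's actual assertion machinery, and the thread-name substrings are assumptions based on the trace above:

```java
// Hypothetical guard: trips (when assertions are enabled with -ea) if blocking
// I/O such as a translog fsync is attempted from a network or cluster-state thread.
final class FsyncThreadGuard {

    private FsyncThreadGuard() {}

    static void assertAllowedToBlock() {
        final String name = Thread.currentThread().getName();
        assert name.contains("transport_worker") == false
            && name.contains("clusterApplierService") == false
            : "blocking fsync on forbidden thread [" + name + "]";
    }
}
```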
Tim-Brooks added the >bug and :Distributed/CRUD labels on Feb 5, 2020
@elasticmachine (Collaborator) commented Feb 5, 2020

Pinging @elastic/es-distributed (:Distributed/CRUD)

Tim-Brooks self-assigned this on Feb 5, 2020
@ywelsch (Contributor) commented Feb 5, 2020

Relates #39793 (comment)

@original-brownbear (Member) commented Feb 5, 2020

It seems to me that this issue would be resolved automatically by #51035 if we simply bounded the number of in-flight bulk requests and thereby made rejections on the write pool impossible?
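For reference, that suggestion amounts to admission control ahead of the write pool. A minimal sketch of the general idea follows; the class and method names are made up and this is not the actual #51035 change:

```java
import java.util.concurrent.Semaphore;

// Made-up sketch: bound in-flight bulk requests so the write pool never has to
// reject work that has already been admitted.
final class BulkAdmissionControl {

    private final Semaphore inflight;

    BulkAdmissionControl(int maxInflightRequests) {
        this.inflight = new Semaphore(maxInflightRequests);
    }

    /** Returns true if the request may proceed; false means push back on the
     *  caller here, before the request can ever reach the write pool. */
    boolean tryAdmit() {
        return inflight.tryAcquire();
    }

    /** Must be called exactly once when an admitted request finishes. */
    void onCompleted() {
        inflight.release();
    }
}
```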

Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Feb 5, 2020
Currently the shard bulk request can be rejected by the write threadpool
after a mapping update. This introduces a scenario where the mapping
listener thread will attempt to finish the request and fsync. This
thread can potentially be a transport thread. This commit fixes this
issue by forcing the finish action to happen on the write threadpool.

Fixes elastic#51904.
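The commit message above describes the shape of the fix: rather than completing the request inline on whatever thread observed the rejection, the finish action is dispatched to the write threadpool. A rough, hypothetical sketch of that pattern follows; it is not the code from #51957, and a real fix would also need force-execution semantics so the resubmitted task cannot itself be rejected:

```java
import java.util.concurrent.ExecutorService;

// Hypothetical sketch: the completion work (which may fsync the translog) is
// always forked to the write pool instead of running on the calling thread.
final class FinishOnWritePool {

    private final ExecutorService writePool;

    FinishOnWritePool(ExecutorService writePool) {
        this.writePool = writePool;
    }

    void finishAfterMappingUpdate(Runnable finishRequestAndSync) {
        // The fsync inside finishRequestAndSync now runs on a write thread,
        // never on the transport or cluster-state thread that called this method.
        writePool.execute(finishRequestAndSync);
    }
}
```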
Tim-Brooks added a commit that referenced this issue Feb 15, 2020 (same commit message as above; Fixes #51904)
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Feb 18, 2020 (same commit message as above; Fixes elastic#51904)
Tim-Brooks added a commit that referenced this issue Feb 25, 2020 (same commit message as above; Fixes #51904)
Tim-Brooks added a commit that referenced this issue Feb 25, 2020 (same commit message as above; Fixes #51904)