Elasticsearch is fsyncing on transport threads #51904

Closed
Tim-Brooks opened this issue Feb 5, 2020 · 3 comments · Fixed by #51957
Assignees: Tim-Brooks
Labels: >bug, :Distributed/CRUD (a catch-all label for issues around indexing, updating, and getting a doc by id; not search)

Comments

@Tim-Brooks (Contributor) commented Feb 5, 2020

It is currently possible for a cluster state listener to execute an fsync on a transport thread.

  1. Currently, in TransportShardBulkAction, it is possible that a shard operation will trigger a mapping update.
  2. When this happens, Elasticsearch registers a ClusterStateObserver.Listener to continue once the mapping update is complete.
  3. This listener will eventually attempt to reschedule the write operation.
  4. If the write thread pool cannot accept this operation, the onRejection callback will fail the outstanding operations and complete the request (presumably to notify the caller of the operations that did complete).
  5. Completing a TransportShardBulkAction will attempt to fsync or refresh as necessary after initiating replication, so that fsync ends up running on whatever thread delivered the rejection (see the sketch after this list).
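To make the failure mode concrete, here is a minimal, self-contained Java sketch of the pattern. This is not Elasticsearch code: the pool sizes, class names, and the finishRequestAndFsync helper are all made up. The point it illustrates is that a rejection callback is handled on the submitting thread, so if that thread is a transport worker, the "fsync" happens there.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionRunsOnCallerThread {

    // Stand-in for a rejectable task: it knows how to finish the request when
    // the executor refuses to accept it.
    abstract static class RejectableRunnable implements Runnable {
        abstract void onRejection(Exception e);
    }

    static void finishRequestAndFsync() {
        // In the real issue this path reaches IndexShard#sync -> Translog#ensureSynced.
        System.out.println("fsync performed on thread: " + Thread.currentThread().getName());
    }

    public static void main(String[] args) throws Exception {
        // A saturated "write" pool: one thread, a one-slot queue, abort-on-reject.
        ThreadPoolExecutor writePool = new ThreadPoolExecutor(
            1, 1, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));
        writePool.submit(() -> { try { Thread.sleep(1_000); } catch (InterruptedException ignored) {} });
        writePool.submit(() -> {}); // fills the queue

        // The mapping-update listener fires on a transport-like thread and tries
        // to reschedule the write operation onto the saturated write pool.
        Thread transportWorker = new Thread(() -> {
            RejectableRunnable continuation = new RejectableRunnable() {
                @Override public void run() { finishRequestAndFsync(); }
                @Override void onRejection(Exception e) {
                    // Rejection is handled inline, so the fsync happens here,
                    // on the transport worker, not on the write pool.
                    finishRequestAndFsync();
                }
            };
            try {
                writePool.execute(continuation);
            } catch (RejectedExecutionException e) {
                continuation.onRejection(e);
            }
        }, "transport_worker[T#1]");

        transportWorker.start();
        transportWorker.join();
        writePool.shutdownNow();
    }
}
```

Running this prints that the fsync was performed on transport_worker[T#1], which is the shape of the stack trace below.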

Here is a transport_worker stack trace. I also think these listeners might be executed on cluster state threads?

ensureSynced:808, Translog (org.elasticsearch.index.translog)
ensureSynced:824, Translog (org.elasticsearch.index.translog)
ensureTranslogSynced:513, InternalEngine (org.elasticsearch.index.engine)
write:2980, IndexShard$5 (org.elasticsearch.index.shard)
processList:108, AsyncIOProcessor (org.elasticsearch.common.util.concurrent)
drainAndProcessAndRelease:96, AsyncIOProcessor (org.elasticsearch.common.util.concurrent)
put:84, AsyncIOProcessor (org.elasticsearch.common.util.concurrent)
sync:3003, IndexShard (org.elasticsearch.index.shard)
run:320, TransportWriteAction$AsyncAfterWriteAction (org.elasticsearch.action.support.replication)
runPostReplicationActions:163, TransportWriteAction$WritePrimaryResult (org.elasticsearch.action.support.replication)
handlePrimaryResult:136, ReplicationOperation (org.elasticsearch.action.support.replication)
accept:-1, 359201671 (org.elasticsearch.action.support.replication.ReplicationOperation$$Lambda$3596)
onResponse:63, ActionListener$1 (org.elasticsearch.action)
onResponse:163, ActionListener$4 (org.elasticsearch.action)
completeWith:336, ActionListener (org.elasticsearch.action)
finishRequest:186, TransportShardBulkAction$2 (org.elasticsearch.action.bulk)
onRejection:182, TransportShardBulkAction$2 (org.elasticsearch.action.bulk)
onRejection:681, ThreadContext$ContextPreservingAbstractRunnable (org.elasticsearch.common.util.concurrent)
execute:90, EsThreadPoolExecutor (org.elasticsearch.common.util.concurrent)
lambda$doRun$0:160, TransportShardBulkAction$2 (org.elasticsearch.action.bulk)
accept:-1, 95719050 (org.elasticsearch.action.bulk.TransportShardBulkAction$2$$Lambda$3833)
onResponse:63, ActionListener$1 (org.elasticsearch.action)
lambda$onResponse$0:289, TransportShardBulkAction$3 (org.elasticsearch.action.bulk)
run:-1, 1539599321 (org.elasticsearch.action.bulk.TransportShardBulkAction$3$$Lambda$3857)
onResponse:251, ActionListener$5 (org.elasticsearch.action)
onNewClusterState:125, TransportShardBulkAction$1 (org.elasticsearch.action.bulk)
onNewClusterState:311, ClusterStateObserver$ContextPreservingListener (org.elasticsearch.cluster)
waitForNextChange:169, ClusterStateObserver (org.elasticsearch.cluster)
waitForNextChange:120, ClusterStateObserver (org.elasticsearch.cluster)
waitForNextChange:112, ClusterStateObserver (org.elasticsearch.cluster)
lambda$shardOperationOnPrimary$1:122, TransportShardBulkAction (org.elasticsearch.action.bulk)
accept:-1, 1672258490 (org.elasticsearch.action.bulk.TransportShardBulkAction$$Lambda$3831)
onResponse:277, TransportShardBulkAction$3 (org.elasticsearch.action.bulk)
onResponse:273, TransportShardBulkAction$3 (org.elasticsearch.action.bulk)
onResponse:282, ActionListener$6 (org.elasticsearch.action)
onResponse:116, MappingUpdatedAction$1 (org.elasticsearch.cluster.action.index)
onResponse:113, MappingUpdatedAction$1 (org.elasticsearch.cluster.action.index)
lambda$executeLocally$0:97, NodeClient (org.elasticsearch.client.node)
accept:-1, 2099146048 (org.elasticsearch.client.node.NodeClient$$Lambda$2772)
onResponse:144, TaskManager$1 (org.elasticsearch.tasks)
onResponse:138, TaskManager$1 (org.elasticsearch.tasks)
handleResponse:54, ActionListenerResponseHandler (org.elasticsearch.action)
handleResponse:1053, TransportService$ContextRestoreResponseHandler (org.elasticsearch.transport)
doRun:220, InboundHandler$1 (org.elasticsearch.transport)
run:37, AbstractRunnable (org.elasticsearch.common.util.concurrent)
execute:196, EsExecutors$DirectExecutorService (org.elasticsearch.common.util.concurrent)
handleResponse:212, InboundHandler (org.elasticsearch.transport)
messageReceived:138, InboundHandler (org.elasticsearch.transport)
inboundMessage:102, InboundHandler (org.elasticsearch.transport)
inboundMessage:664, TcpTransport (org.elasticsearch.transport)
consumeNetworkReads:688, TcpTransport (org.elasticsearch.transport)
consumeReads:276, MockNioTransport$MockTcpReadWriteHandler (org.elasticsearch.transport.nio)
handleReadBytes:228, SocketChannelContext (org.elasticsearch.nio)
read:40, BytesChannelContext (org.elasticsearch.nio)
handleRead:139, EventHandler (org.elasticsearch.nio)
handleRead:151, TestEventHandler (org.elasticsearch.transport.nio)
handleRead:420, NioSelector (org.elasticsearch.nio)
processKey:246, NioSelector (org.elasticsearch.nio)
singleLoop:174, NioSelector (org.elasticsearch.nio)
runLoop:131, NioSelector (org.elasticsearch.nio)
run:-1, 461835914 (org.elasticsearch.nio.NioSelectorGroup$$Lambda$1709)
run:835, Thread (java.lang)
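As an aside, this is the kind of bug a thread-name assertion before blocking I/O will catch. The following is a hypothetical guard, not Elasticsearch's actual assertion machinery, and the thread-name substrings are assumptions based on the trace above:

```java
// Hypothetical guard: trips (when assertions are enabled with -ea) if blocking
// I/O such as a translog fsync is attempted from a network or cluster-state thread.
final class FsyncThreadGuard {

    private FsyncThreadGuard() {}

    static void assertAllowedToBlock() {
        final String name = Thread.currentThread().getName();
        assert name.contains("transport_worker") == false
            && name.contains("clusterApplierService") == false
            : "blocking fsync on forbidden thread [" + name + "]";
    }
}
```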
Tim-Brooks added the >bug and :Distributed/CRUD labels on Feb 5, 2020
@elasticmachine (Collaborator) commented Feb 5, 2020

Pinging @elastic/es-distributed (:Distributed/CRUD)

Tim-Brooks self-assigned this on Feb 5, 2020
@ywelsch (Contributor) commented Feb 5, 2020

Relates #39793 (comment)

@original-brownbear (Member) commented Feb 5, 2020

It seems to me that this issue would be resolved automatically by #51035 if we simply bounded the number of in-flight bulk requests and thereby made rejections on the write pool impossible?
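For reference, that suggestion amounts to admission control ahead of the write pool. A minimal sketch of the general idea follows; the class and method names are made up and this is not the actual #51035 change:

```java
import java.util.concurrent.Semaphore;

// Made-up sketch: bound in-flight bulk requests so the write pool never has to
// reject work that has already been admitted.
final class BulkAdmissionControl {

    private final Semaphore inflight;

    BulkAdmissionControl(int maxInflightRequests) {
        this.inflight = new Semaphore(maxInflightRequests);
    }

    /** Returns true if the request may proceed; false means push back on the
     *  caller here, before the request can ever reach the write pool. */
    boolean tryAdmit() {
        return inflight.tryAcquire();
    }

    /** Must be called exactly once when an admitted request finishes. */
    void onCompleted() {
        inflight.release();
    }
}
```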

Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Feb 5, 2020
Currently the shard bulk request can be rejected by the write threadpool
after a mapping update. This introduces a scenario where the mapping
listener thread will attempt to finish the request and fsync. This
thread can potentially be a transport thread. This commit fixes this
issue by forcing the finish action to happen on the write threadpool.

Fixes elastic#51904.
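The commit message above describes the shape of the fix: rather than completing the request inline on whatever thread observed the rejection, the finish action is dispatched to the write threadpool. A rough, hypothetical sketch of that pattern follows; it is not the code from #51957, and a real fix would also need force-execution semantics so the resubmitted task cannot itself be rejected:

```java
import java.util.concurrent.ExecutorService;

// Hypothetical sketch: the completion work (which may fsync the translog) is
// always forked to the write pool instead of running on the calling thread.
final class FinishOnWritePool {

    private final ExecutorService writePool;

    FinishOnWritePool(ExecutorService writePool) {
        this.writePool = writePool;
    }

    void finishAfterMappingUpdate(Runnable finishRequestAndSync) {
        // The fsync inside finishRequestAndSync now runs on a write thread,
        // never on the transport or cluster-state thread that called this method.
        writePool.execute(finishRequestAndSync);
    }
}
```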
Tim-Brooks added a commit that referenced this issue Feb 15, 2020 (same commit message as above; Fixes #51904)
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Feb 18, 2020 (same commit message as above; Fixes elastic#51904)
Tim-Brooks added a commit that referenced this issue Feb 25, 2020 (same commit message as above; Fixes #51904)
Tim-Brooks added a commit that referenced this issue Feb 25, 2020 (same commit message as above; Fixes #51904)