Add wait_if_ongoing for refresh API increasing refresh reliability #91579
Labels
:Distributed/CRUD
>enhancement
Team:Distributed
Description
Problem Statement
When many indices receive frequent update and refresh requests (for example, one refresh after every 4 docs written, at about 1000 docs written per second per shard), the refresh queue becomes blocked and hundreds of thousands of requests pile up. I know refresh is a resource-intensive API, as the docs say.
This happens even on a cluster of 3 nodes with 16G heap, 10 primary shards, 20 replica shards, 400G storage, and a 96-core CPU, running ES version 8.4.3.
I think this is because:
The REFRESH thread pool type is ThreadPoolType.SCALING.
TransportShardRefreshAction extends TransportReplicationAction, in which forceExecution defaults to true for replication (see TransportReplicationAction.java#L200).
The REFRESH API calls InternalEngine refresh with block = true (see InternalEngine.java#L1795), and hot threads show threads blocked while acquiring the lock.
So a single refresh request expands to indices * shards * replications blocked executions (30 in our test case).
Proposal
Maybe we can add a wait_if_ongoing parameter to the refresh API, like the one in the flush API. It would let a refresh request run in non-blocking mode by just calling InternalEngine#maybeRefresh. When that call cannot acquire the lock, an in-flight refresh task must already be running, so the request can return without waiting.
PR #91578
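To make the difference between the blocking and non-blocking paths concrete, here is a minimal, self-contained sketch of the locking pattern the proposal relies on. It is not Elasticsearch code: the class RefreshGate, its refreshBlocking method, and the doRefresh callback are hypothetical names used only for illustration. Only the idea is taken from the issue, namely that a blocking refresh waits for the lock while maybeRefresh gives up when another refresh already holds it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of blocking vs. non-blocking refresh; not Elasticsearch code.
final class RefreshGate {
    private final ReentrantLock refreshLock = new ReentrantLock();
    private final AtomicInteger refreshCount = new AtomicInteger();
    private final Runnable doRefresh; // stands in for the actual refresh work

    RefreshGate(Runnable doRefresh) {
        this.doRefresh = doRefresh;
    }

    // Blocking path: every caller queues on the lock and runs its own refresh.
    void refreshBlocking() {
        refreshLock.lock();
        try {
            doRefresh.run();
            refreshCount.incrementAndGet();
        } finally {
            refreshLock.unlock();
        }
    }

    // Non-blocking path: if another thread holds the lock, a refresh is
    // already in flight, so return false instead of waiting.
    boolean maybeRefresh() {
        if (!refreshLock.tryLock()) {
            return false;
        }
        try {
            doRefresh.run();
            refreshCount.incrementAndGet();
            return true;
        } finally {
            refreshLock.unlock();
        }
    }

    int refreshCount() {
        return refreshCount.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        // Simulated slow refresh: signals that it started, then waits to be released.
        RefreshGate gate = new RefreshGate(() -> {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread slow = new Thread(gate::refreshBlocking);
        slow.start();
        started.await(); // the slow refresh now holds the lock

        boolean ranWhileBusy = gate.maybeRefresh(); // lock is held, so this is skipped
        release.countDown();
        slow.join();

        System.out.println("ran while another refresh was in flight: " + ranWhileBusy);
        System.out.println("ran after it finished: " + gate.maybeRefresh());
    }
}
```

In this sketch, a request sent with the proposed wait_if_ongoing=false would map to maybeRefresh, so a burst of concurrent refresh requests on one shard collapses into a single running refresh plus immediate returns, instead of a queue of threads waiting on the lock.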