
Add wait_if_ongoing for refresh API increasing refresh reliability #91578

Open · wants to merge 2 commits into base: main
Conversation

@luyuncheng (Contributor) commented Nov 15, 2022

Problem Statement

ISSUE: #91579

When we have many frequent update and refresh requests across many indices (I know refresh is a resource-intensive API, as the docs say), the refresh queue can back up with hundreds of thousands of queued requests, even on a 3-node cluster with 16G heap, 96 CPU cores, 10 primary shards, 20 replica shards, and 400G storage, on ES version 8.4.3.
[screenshot: refresh thread pool queue backlog]

I think this happens because:

  1. The REFRESH thread pool type is ThreadPoolType.SCALING.
  2. TransportShardRefreshAction extends TransportReplicationAction, where forceExecution defaults to true for replication (see TransportReplicationAction.java#L200).
  3. The refresh API calls InternalEngine refresh with block = true (see InternalEngine.java#L1795), and the hot threads output shows it blocked acquiring a lock.
So a single refresh request expands into indices * shards * replications shard-level executions (30 in our test case), each of which can block.
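To make the amplification above concrete, here is a back-of-the-envelope sketch (the class and method names are illustrative, not Elasticsearch code): each client-level refresh fans out into one shard-level task per shard copy, i.e. every primary plus each of its replicas.

```java
// Illustrative fan-out arithmetic only; not Elasticsearch code.
public class RefreshFanOut {

    // One client-level refresh per index becomes one shard-level task
    // per shard copy: every primary plus each of its replicas.
    static int shardLevelTasks(int primaries, int replicasPerPrimary) {
        return primaries * (1 + replicasPerPrimary);
    }

    public static void main(String[] args) {
        // The cluster above: 10 primaries, 20 replica shards (2 per primary).
        System.out.println(shardLevelTasks(10, 2)); // 30 blocking executions
    }
}
```

With a scaling thread pool and force-executed replication tasks, all 30 of these can sit blocked on the same engine-level lock at once, which is how the queue grows unbounded under frequent refreshes.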

Proposal

Maybe we can add a wait_if_ongoing parameter to the refresh API, like the flush API has, which would make refresh requests nonblocking by calling InternalEngine#maybeRefresh instead. When maybeRefresh cannot acquire the lock, an in-flight refresh task must already be running, so the request can return without queueing.
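A minimal sketch of the nonblocking idea, assuming the refresh lock behaves like a plain mutex (this is a simplified illustration, not the actual InternalEngine implementation, which coordinates with Lucene's ReferenceManager): refresh() blocks until it holds the lock, while maybeRefresh() uses tryLock and returns false when another refresh is already in flight.

```java
import java.util.concurrent.locks.ReentrantLock;

// Simplified sketch of blocking vs. nonblocking refresh; not the real
// InternalEngine. Names other than maybeRefresh are illustrative.
public class SketchEngine {
    private final ReentrantLock refreshLock = new ReentrantLock();

    // Blocking variant: every caller queues up behind the in-flight refresh,
    // pinning a thread in the (scaling) REFRESH pool while it waits.
    public void refresh() {
        refreshLock.lock();
        try {
            doRefresh();
        } finally {
            refreshLock.unlock();
        }
    }

    // Nonblocking variant: if the lock is already held, a refresh is in
    // flight and will make recent writes visible anyway, so skip instead
    // of blocking.
    public boolean maybeRefresh() {
        if (refreshLock.tryLock() == false) {
            return false; // in-flight refresh; nothing to do
        }
        try {
            doRefresh();
            return true;
        } finally {
            refreshLock.unlock();
        }
    }

    private void doRefresh() {
        // make recent writes searchable (elided)
    }
}
```

Under the proposed wait_if_ongoing=false semantics, the shard-level refresh action would take the maybeRefresh path, so piled-up requests collapse onto whichever refresh is already running rather than queueing behind it.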

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label external-contributor Pull request authored by a developer outside the Elasticsearch team labels Nov 15, 2022
@nik9000 nik9000 added the :Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. label Nov 15, 2022
@elasticsearchmachine elasticsearchmachine added Team:Distributed Meta label for distributed team and removed needs:triage Requires assignment of a team area label labels Nov 15, 2022
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

Labels
:Distributed/CRUD · external-contributor · Team:Distributed · v8.15.0
10 participants