Add wait_if_ongoing for refresh API increasing refresh reliability #91579

Closed
luyuncheng opened this issue Nov 15, 2022 · 4 comments
Labels
:Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >enhancement Team:Distributed Meta label for distributed team

Comments

@luyuncheng
Contributor

luyuncheng commented Nov 15, 2022

Description

Problem Statement

We send frequent update and refresh requests to many indices (roughly one refresh for every 4 documents written, at about 1,000 documents written per second per shard). I know that refresh is a resource-intensive API, as the docs say.

In this scenario the refresh queue backs up with hundreds of thousands of queued requests, even on a 3-node cluster with 16 GB heap, 10 primary shards, 20 replica shards, 400 GB of storage, and 96 CPU cores, running ES version 8.4.3.

I think this happens because:

  1. The REFRESH thread pool type is ThreadPoolType.SCALING.
  2. TransportShardRefreshAction extends TransportReplicationAction, where forceExecution defaults to true for replication (see TransportReplicationAction.java#L200).
  3. The REFRESH API calls the InternalEngine refresh with block = true (see InternalEngine.java#L1795), and hot threads show the refresh threads blocked while acquiring the lock.
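The combined effect of these three points can be illustrated with a minimal sketch. It uses plain `java.util.concurrent` classes, not Elasticsearch's own executor implementation, and assumes only that the pool never rejects work and that its worker is stuck waiting on a lock. The class name `UnboundedQueueBacklog` and its method are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not Elasticsearch code.
final class UnboundedQueueBacklog {

    /**
     * Submits `requests` no-op tasks to a one-thread pool whose only worker
     * is blocked, and returns how many of them are sitting in the queue.
     */
    static int backlogWhileBlocked(int requests) {
        // Unbounded queue: execute() never rejects, it just enqueues.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch workerBusy = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Stand-in for a refresh that is blocked acquiring a lock.
            pool.execute(() -> {
                workerBusy.countDown();
                awaitQuietly(release);
            });
            awaitQuietly(workerBusy);
            // Every further request is accepted and piles up behind it.
            for (int i = 0; i < requests; i++) {
                pool.execute(() -> { });
            }
            return pool.getQueue().size();
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(backlogWhileBlocked(1000));
    }
}
```

With nothing to push back on the caller, the queue length simply tracks the number of submitted requests for as long as the worker stays blocked.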

So a single refresh request expands to indices * shards * replications shard-level operations (30 in our test case), each of which executes in a blocking fashion.
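As a rough illustration of that fan-out, the sketch below counts shard copies for the test cluster. It assumes a single index and that the 20 replica shards mean 2 replicas for each of the 10 primaries; both the class name `RefreshFanOut` and that reading of the numbers are assumptions, not something stated explicitly above.

```java
// Hypothetical helper, not Elasticsearch code.
final class RefreshFanOut {

    /** Shard-level refresh operations triggered by one refresh request. */
    static long shardLevelOperations(int indices, int primariesPerIndex, int replicasPerPrimary) {
        // Each primary and each of its replica copies executes the refresh.
        return (long) indices * primariesPerIndex * (1 + replicasPerPrimary);
    }

    public static void main(String[] args) {
        // Assumed layout: 1 index, 10 primaries, 2 replicas per primary.
        System.out.println(shardLevelOperations(1, 10, 2)); // prints 30
    }
}
```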

Proposal

Maybe we can add a wait_if_ongoing parameter to the refresh API, like the one in the flush API, so that refresh requests can be made non-blocking by just calling InternalEngine#maybeRefresh. If the call cannot acquire the lock, an in-flight refresh must already be running.
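The difference between the two behaviours can be sketched with a plain `ReentrantLock`. This is not the InternalEngine code or the change in the linked PR; `RefreshGate` and its methods are hypothetical names, and the sketch only assumes that the blocking path waits for the lock while the proposed path uses `tryLock()` and gives up when a refresh is already in flight.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo class, not Elasticsearch code.
final class RefreshGate {
    private final ReentrantLock refreshLock = new ReentrantLock();
    private final AtomicInteger refreshCount = new AtomicInteger();

    /** Blocking variant: waits until any in-flight refresh has finished. */
    void refreshBlocking(Runnable doRefresh) {
        refreshLock.lock();
        try {
            doRefresh.run();
            refreshCount.incrementAndGet();
        } finally {
            refreshLock.unlock();
        }
    }

    /**
     * Non-blocking variant: returns false immediately if another thread
     * holds the lock, i.e. a refresh is already running.
     */
    boolean maybeRefresh(Runnable doRefresh) {
        if (!refreshLock.tryLock()) {
            return false;
        }
        try {
            doRefresh.run();
            refreshCount.incrementAndGet();
            return true;
        } finally {
            refreshLock.unlock();
        }
    }

    int refreshCount() {
        return refreshCount.get();
    }

    public static void main(String[] args) {
        RefreshGate gate = new RefreshGate();
        System.out.println(gate.maybeRefresh(() -> { })); // prints true (lock was free)
    }
}
```

In this sketch a skipped call costs the caller nothing beyond the failed `tryLock()`, so a burst of refresh requests collapses onto the refresh that is already running instead of queueing behind it.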

PR #91578

@nik9000
Member

nik9000 commented Nov 15, 2022

Indexing already has a wait_for_refresh option which is pretty similar. Is that useful for you?

@nik9000 nik9000 added the :Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. label Nov 15, 2022
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Nov 15, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Nov 15, 2022
@luyuncheng
Contributor Author

Indexing already has a wait_for_refresh option which is pretty similar. Is that useful for you?

@nik9000 thanks for replying.
I think this is a good way to wait until docs are refreshed when there are only a few writes per second. But when refresh_interval = '10s+' and there are many writes per second, the requests waiting for a refresh occupy a lot of memory.
I tried this parameter. Because the HTTP client uses a pipeline model in HTTP/1.1, the client must wait for each response, so GC frequency goes up on both the client and the coordinating node, and write performance drops.

Meanwhile, I think this parameter can make the refresh API more robust, so that it would not leave 200,000 requests pending with only 30 shards.

@ywangd
Copy link
Member

ywangd commented May 31, 2024

I am closing this as a duplicate of #87936.

@ywangd ywangd closed this as completed May 31, 2024