
Do not force refresh when write indexing buffer #50769

Merged
dnhatn merged 6 commits into elastic:master from the indexing-memory branch on Jan 9, 2020

Conversation

@dnhatn (Member) commented Jan 8, 2020

Today we check the indexing buffer memory periodically, every 5 seconds or after we have used 1/30 of the configured memory. If the total used memory is over the threshold, we refresh the "largest" shards. If refreshing takes longer than these intervals (i.e., 5s or 1/30 of the buffer), then we continue to enqueue refreshes for these shards. This leads to two issues:

  • The refresh thread pool can be exhausted, so other shards cannot refresh
  • Too many refreshes are executed for the "largest" shards

With this change, we only refresh the largest shards if they are not already refreshing. Here we rely on the periodic check to trigger another refresh if needed. We could harden this by making the ongoing refresh trigger the memory check when it completes, but I left that option out of this PR for simplicity.

See: https://discuss.elastic.co/t/write-queue-continue-to-rise/213652/
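
A minimal sketch of the guard described above, using hypothetical `ShardRefreshState` and `MemoryPressureRefresher` classes rather than the actual `IndexingMemoryController` code: a refresh is enqueued for a shard only if no memory-pressure refresh is already in flight for it; otherwise we fall back to the next periodic check.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;

// Per-shard flag tracking whether a memory-pressure refresh is in flight.
// (Hypothetical names; this is a sketch, not the Elasticsearch implementation.)
class ShardRefreshState {
    private final AtomicBoolean refreshInProgress = new AtomicBoolean(false);

    // Returns true if the caller won the right to start a refresh.
    boolean tryStartRefresh() {
        return refreshInProgress.compareAndSet(false, true);
    }

    void refreshDone() {
        refreshInProgress.set(false);
    }
}

class MemoryPressureRefresher {
    private final ExecutorService refreshPool;

    MemoryPressureRefresher(ExecutorService refreshPool) {
        this.refreshPool = refreshPool;
    }

    // Enqueue a refresh only if none is pending for this shard; otherwise rely
    // on the next periodic check (every 5s, or after 1/30 of the buffer is used)
    // to trigger another refresh if memory is still over the threshold.
    void maybeRefresh(ShardRefreshState shard, Runnable doRefresh) {
        if (shard.tryStartRefresh()) {
            refreshPool.execute(() -> {
                try {
                    doRefresh.run();
                } finally {
                    shard.refreshDone();
                }
            });
        }
    }
}
```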

dnhatn added the >enhancement, :Distributed/Engine (Anything around managing Lucene and the Translog in an open shard.), v8.0.0, and v7.6.0 labels on Jan 8, 2020
@elasticmachine (Collaborator) commented:
Pinging @elastic/es-distributed (:Distributed/Engine)


```java
// TODO: would be cleaner if I could pass this 1kb setting to the single node this test created....
IndexingMemoryController imc = new IndexingMemoryController(settings, null, null) {
    // ... (elided diff context)

public void testSkipRefreshIfShardIsRefreshingAlready() throws Exception {
```
dnhatn (Member Author) commented on the new test:
I only added this new test. Other tests are unchanged.
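
A hedged sketch of how the skip-if-refreshing behavior could be exercised in a test, written against the hypothetical `ShardRefreshState` and `MemoryPressureRefresher` classes sketched above rather than the real `IndexingMemoryControllerTests`: a latch holds the first refresh open while a second request for the same shard is submitted, and only one refresh should run.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SkipRefreshWhileRefreshingTest {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        MemoryPressureRefresher refresher = new MemoryPressureRefresher(pool);
        ShardRefreshState shard = new ShardRefreshState();

        CountDownLatch blockFirstRefresh = new CountDownLatch(1);
        AtomicInteger refreshes = new AtomicInteger();

        Runnable slowRefresh = () -> {
            refreshes.incrementAndGet();
            try {
                blockFirstRefresh.await(); // simulate a long-running refresh
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        refresher.maybeRefresh(shard, slowRefresh); // starts and blocks
        refresher.maybeRefresh(shard, slowRefresh); // skipped: a refresh is already in flight

        blockFirstRefresh.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);

        System.out.println("refreshes executed = " + refreshes.get()); // expect 1
    }
}
```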

@ywelsch (Contributor) left a comment:
LGTM

@henningandersen (Contributor) left a comment:
LGTM.

I left a few smaller comments. Also, it would be good to have a few extra test runs of the server module to ensure it does not introduce spurious test failures.

@dnhatn (Member Author) commented Jan 9, 2020

> Also, it would be good to have a few extra test runs of the server module to ensure it does not introduce spurious test failures.

Yep, I am running it on my CI now.

@dnhatn (Member Author) commented Jan 9, 2020

@ywelsch @henningandersen Thanks for reviewing.

dnhatn merged commit 0510af8 into elastic:master on Jan 9, 2020
dnhatn deleted the indexing-memory branch on January 9, 2020 at 23:18
henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Jan 10, 2020
The test checked queue size and active count; however, ThreadPoolExecutor pulls the request out of the queue before marking the worker active, risking that we think all tasks are done when they are not. Now we check the completed-tasks metric instead, which is guaranteed to be monotonic.

Relates elastic#50769
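
A small illustration of the fix described in this commit message, using only standard `java.util.concurrent` APIs (an assumed sketch, not the actual test change): wait on `ThreadPoolExecutor.getCompletedTaskCount()` instead of polling queue size plus active count, because the completed-task counter only ever increases.

```java
import java.util.concurrent.ThreadPoolExecutor;

final class ThreadPoolWaits {
    // Wait until the executor has completed at least `expectedCompleted` tasks.
    // Unlike getQueue().size() + getActiveCount(), the completed-task count can
    // never transiently undercount work that has been dequeued but not yet
    // marked active, because it is monotonic.
    static void awaitCompletedTasks(ThreadPoolExecutor executor, long expectedCompleted) throws InterruptedException {
        while (executor.getCompletedTaskCount() < expectedCompleted) {
            Thread.sleep(10); // simple polling; a real test would use a helper like assertBusy
        }
    }
}
```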
henningandersen added a commit that referenced this pull request Jan 10, 2020
dnhatn added a commit that referenced this pull request Jan 11, 2020
dnhatn pushed a commit that referenced this pull request Jan 11, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
Labels
:Distributed/Engine (Anything around managing Lucene and the Translog in an open shard.), >enhancement, v7.6.0, v8.0.0-alpha1
5 participants