Disk decider prevents allocation/fast recovery #56578
Labels: >bug; :Distributed/Allocation (all issues relating to the decision making around placing a shard, both master logic & on the nodes); Team:Distributed (meta label for distributed team)
Comments
henningandersen added the >bug, :Distributed/Allocation, and team-discuss labels on May 12, 2020
Pinging @elastic/es-distributed (:Distributed/Allocation)

We discussed this at our weekly sync and while we did not reach a conclusion, we did cover the following:
lockewritesdocs pushed a commit that referenced this issue on Aug 29, 2022:

…89018) Create restart-cluster.asciidoc. As per #49972 and #56578, if a node is above the low disk threshold when being restarted (rolling restart, network disruption, or crash), the disk threshold decider prevents reusing the shard content on the restarted node; as a consequence, the node may take a long time to start. Update docs/reference/setup/restart-cluster.asciidoc. Co-authored-by: Adam Locke <adam.locke@elastic.co>
Leaf-Lin added a commit to Leaf-Lin/elasticsearch that referenced this issue on Aug 29, 2022:

…lastic#89018) Same restart-cluster.asciidoc change, referencing elastic#49972 and elastic#56578. Co-authored-by: Adam Locke <adam.locke@elastic.co>
elasticsearchmachine pushed a commit that referenced this issue on Aug 29, 2022:

…89018) (#89702) Same restart-cluster.asciidoc change, referencing #49972 and #56578. Co-authored-by: Adam Locke <adam.locke@elastic.co>
albertzaharovits pushed a commit to albertzaharovits/elasticsearch that referenced this issue on Aug 31, 2022:

…lastic#89018) Same restart-cluster.asciidoc change, referencing elastic#49972 and elastic#56578. Co-authored-by: Adam Locke <adam.locke@elastic.co>
If a node is above low disk threshold when being restarted (rolling restart, network disruption or crash), the disk threshold decider prevents reusing the shard content on the restarted node.
This seems unfortunate, in particular in the good case where we can do a noop recovery or an operations-based recovery with only a few operations (since that disk usage is already accounted for).
Notice that being above the low disk threshold is not a bad state for a node in itself. The cluster may have plenty of space available, but an imbalance in disk usage can still push some nodes above the low threshold. We only start moving shards off a node when the high threshold is reached.
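The two-watermark behavior described above can be sketched as a toy decision function (a hypothetical simplification for illustration; the real DiskThresholdDecider also considers absolute byte thresholds, relocations in flight, and the flood-stage watermark):

```python
def disk_decision(disk_used_pct: float, low: float = 85.0, high: float = 90.0) -> str:
    """Toy sketch of two-watermark allocation logic.

    Defaults mirror Elasticsearch's default low (85%) and high (90%)
    disk watermarks; this is not the actual decider implementation.
    """
    if disk_used_pct >= high:
        # Above the high watermark: start moving shards off the node.
        return "relocate"
    if disk_used_pct >= low:
        # Above the low watermark: refuse *new* shard allocations,
        # but shards already on the node are left in place.
        return "no-new-shards"
    # Below the low watermark: the node can accept new shards.
    return "yes"
```

For example, `disk_decision(87.0)` returns `"no-new-shards"`: the node is not in any trouble, yet this is the state in which the decider also blocks reusing existing shard content after a restart, which is the crux of this issue.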
The test case here demonstrates this. It also demonstrates that even having nodes with space available for the shard is not enough due to delayed allocation.
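The delayed-allocation interaction mentioned above is governed per index by the real `index.unassigned.node_left.delayed_timeout` setting (default 1m); a sketch, with a hypothetical index name `my-index`:

```
PUT my-index/_settings
{
  "index.unassigned.node_left.delayed_timeout": "5m"
}
```

While this timeout is pending, replicas of the departed node are deliberately left unassigned, so even nodes with free space do not pick them up immediately.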
This might partially repair itself (not demonstrated) in that when enough shards have been recovered elsewhere, the node could drop below the low threshold, making the rest of the shard contents available for faster recoveries.
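One operational mitigation (a sketch of a workaround, not necessarily the fix adopted for this issue) is to temporarily disable the disk threshold decider around a planned restart, using the real `cluster.routing.allocation.disk.threshold_enabled` cluster setting:

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.threshold_enabled": false
  }
}
```

After the restarted nodes have rejoined and recoveries have completed, the setting should be reset (set it back to `true`, or to `null` to restore the default), since running without the decider risks filling disks.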