
Node leaving the cluster does not trigger 'recover', unassigned shards are not re-assigned #15003

Closed · truthtrap opened this issue Nov 25, 2015 · 13 comments
Labels: :Distributed/Allocation, feedback_needed

@truthtrap

Dear people,

After upgrading to ES 2.0.0 I noticed that sometimes shards stay unassigned after a node leaves the cluster. We have looked at all the usual suspects, such as

curl -XPUT localhost:9200/_cluster/settings -d '
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}'

but the only things we found that worked in this situation were a manual reroute or re-enabling allocation (which was already enabled). Later it became clear that it only happens on stable clusters; if there are relocations going on, the unassigned shards are assigned fine.
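For reference, the "manual reroute" workaround is just a reroute call with an empty body, which asks the master to re-run allocation (a sketch, assuming a node listening on localhost:9200):

curl -XPOST 'localhost:9200/_cluster/reroute?pretty'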

Yesterday we applied the 2.1.0 update, but we still have the same issue.

As far as we can see this is an issue (for us it certainly is). If it is a feature, my apologies for misusing this area, and I will move the discussion to the forums.

Groet,
Jurg.

@ywelsch (Contributor) commented Nov 25, 2015

A manual reroute should not be needed. This sounds like a bug. Can you share information about the size of the cluster and the settings used, especially those affecting shard allocation (https://www.elastic.co/guide/en/elasticsearch/reference/2.1/index-modules-allocation.html)?

@truthtrap (Author)

The cluster has been fluctuating between 3 and 6 nodes, to test scaling. It holds 12 indices, each with 3 shards and 1 replica: 75M documents, 100G of data in total. (It is a kind of augmented Logstash setup, ingesting a bit more than 3 documents per second, sustained.)

It runs on EC2. Normally we use 3 t2.medium instances (it is a staging cluster), but we have been rotating up to m4.10xlarge because we had to restore a month of data.

elasticsearch.json:

{
    "_comment": "this configuration is auto-generated (from userdata)",
    "bootstrap": {
        "mlockall": true
    },
    "cloud": {
        "aws": {
            "access_key": "...",
            "region": "eu-west-1",
            "secret_key": "..."
        }
    },
    "cluster": {
        "name": "staging.elasticsearch"
    },
    "discovery": {
        "ec2": {
            "groups": "staging-elasticsearch-30mhz-com",
            "host_type": "public_dns"
        },
        "type": "ec2",
        "zen": {
            "minimum_master_nodes": 2
        }
    },
    "gateway": {
        "expected_nodes": 3,
        "recover_after_nodes": 2
    },
    "indices": {
        "fielddata": {
            "cache": {
                "size": "50%"
            }
        },
        "recovery": {
            "concurrent_streams": 5,
            "max_bytes_per_sec": "50mb"
        }
    },
    "network.host": "_ec2:publicDns_",
    "node": {
        "data": true,
        "master": true
    }
}

@bleskes (Contributor) commented Nov 25, 2015

Just double-checking: this is not by any chance a case of delayed assignment? See https://www.elastic.co/guide/en/elasticsearch/reference/current/delayed-allocation.html#delayed-allocation
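A quick way to check, as a sketch (assuming a node on localhost:9200): the cluster health output reports how many shards are currently being held back, and the delay itself is a dynamic, per-index setting.

# shows "delayed_unassigned_shards" while the timer is running
curl -XGET 'localhost:9200/_cluster/health?pretty'

# the delay is the per-index setting (default 1m), applied here to all indices
curl -XPUT 'localhost:9200/_all/_settings' -d '
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "1m"
  }
}'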

@truthtrap (Author)

(hi boaz :) )

no. i tried to put that in the config, to force an automatic explicit trigger :) (couldn't find how to do that.)

@bqk- commented Nov 25, 2015

I have a similar problem in an 8-node cluster: 7 data nodes and 1 non-data node.
After closing a first node, shards get reassigned nicely on the other nodes, but when closing a second one later, shards stay unassigned.
Both nodes have the exact same configuration, and I cannot see why the behavior would differ between the two nodes.

This is using ES 2.0 also.

@bqk- commented Nov 25, 2015

Somehow it allocated the shards after a few hours, although delayed_unassigned_shards showed 0 the whole time.
Is there any action that triggers it?

@bleskes (Contributor) commented Dec 9, 2015

I tried to reproduce this and couldn't. I started a 4-node cluster, added 12 indices and your settings. Shutting down a node (kill -9) resulted in the expected 60s wait for it to come back, followed by allocation of the missing shards and a return to green. @truthtrap can you supply more info / work on a small and reliable reproduction?
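For the kind of reproduction I mean, something along these lines would do (a sketch only; index names, counts, and the kill step are placeholders, not the exact commands used):

# create a handful of 3-shard / 1-replica indices
for i in $(seq 1 12); do
  curl -XPUT "localhost:9200/test-$i" -d '
  {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }'
done

# hard-kill one data node, then watch the cluster go yellow and back to green
kill -9 <pid-of-one-node>
watch -n 5 "curl -s 'localhost:9200/_cluster/health?pretty'"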

@truthtrap (Author)

@bleskes thanks for trying. can you try an 'orderly' shutdown with the init.d script that is part of the package (we use the rpm)?

i'll try to set up a completely standard cluster, populate it with some indices, and try to reproduce.

@clintongormley

@truthtrap is this something you're still seeing on 2.2?

@truthtrap (Author)

@clinton haven't tried yet. will upgrade to 2.2.0 and let you know...


@truthtrap (Author)

@clinton looks good so far. upgraded (and rotated) our staging cluster. it honors the 60s timeout and then starts re-assigning shards again. when we are ready to upgrade production i'll let you know how that goes.

(our staging cluster has 125G, 14 indices, each 3 shards with 1 replica.)


@truthtrap (Author)

@clinton rotating the production elasticsearch cluster was ok as well. no manual intervention was necessary to start assigning shards.

thanks!!


@clintongormley

thanks @truthtrap - closing

@lcawl added the :Distributed/Distributed label and removed :Allocation on Feb 13, 2018
@clintongormley added the :Distributed/Allocation label and removed :Distributed/Distributed on Feb 14, 2018