ElasticSearch 1.3.4 recovery slow on larger clusters (50+ total nodes) #8487

Closed

TrentStewart opened this issue Nov 14, 2014 · 14 comments

@TrentStewart

We are seeing a situation on clusters running 1.3.4 with greater than 50 total nodes where shard recovery/allocation is either failing or is VERY slow.

Full details:
I currently have 6 clusters, 4 with 24 total nodes and 2 larger clusters with 53 & 63 nodes respectively. Everything is run on VMs running Windows Server 2012 R2 within Azure.

All clusters (except the 63 node) were upgraded from 1.3.2 to 1.3.4 using an offline method of shutting down the cluster clean, swapping out the x64 Windows Service, then restarting all of the nodes. The 63 node cluster was built last week clean with 1.3.4.

For the 24 node clusters, the clusters returned to a green status after the upgrade in less than 30 minutes and are performing very well. I also did a rolling reboot of all of the machines to apply OS updates where I have automation that sets allocation to new_primaries, shuts down the service, reboots the machine, waits for the node to rejoin the cluster, sets allocation back to all, then waits for the cluster to return to green before proceeding to the next node. Again for each of the 24 node clusters, it completed the entire process in just shy of 2 hours.
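
For reference, the allocation toggling in that automation boils down to roughly these calls (localhost:9200 is just a stand-in for the local node):

# before stopping the service: only allow new primaries to be allocated
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{"transient": {"cluster.routing.allocation.enable": "new_primaries"}}'

# after the node rejoins: re-enable allocation and wait for green before moving on
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'
curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60m'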

The 53 node cluster is our oldest cluster (with 144 indices & 3330 shards) and has undergone several upgrades using the offline method, and in each scenario the cluster returned to yellow within 15 minutes and green within 40 minutes. Monday night we upgraded this cluster and it took 3 hours to get to yellow and 6 hours to get to green.

The 63 node cluster is our newest cluster (currently has 12 indices & 272 shards). Last night & today while performing a rolling reboot, each node has a max of 3 shards on it, after the machine rebooted, it was taking > 10 minutes for the node to rejoin the cluster and randomly some shards would never finish initialization.

When I query _cat/recovery?pretty=true&v&active_only=true, no shards are listed; however, _cat/shards would show 1 or 2 shards as INITIALIZING. If I issued a reroute cancel command on the initializing shards, they would almost immediately allocate and the cluster would turn green.
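
For reference, the commands involved look roughly like this (index, shard and node values are placeholders; the reroute body is the standard cancel form):

curl 'http://localhost:9200/_cat/recovery?v&active_only=true'
curl 'http://localhost:9200/_cat/shards?v' | grep INITIALIZING

# cancel the stuck allocation for a given shard
curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{"commands": [{"cancel": {"index": "<idx>", "shard": 2, "node": "<node>", "allow_primary": true}}]}'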

@bleskes bleskes self-assigned this Nov 17, 2014
@bleskes
Contributor

bleskes commented Nov 17, 2014

The 53 node cluster is our oldest cluster (with 144 indices & 3330 shards) and has undergone several upgrades using the offline method, and in each scenario the cluster returned to yellow within 15 minutes and green within 40 minutes. Monday night we upgraded this cluster and it took 3 hours to get to yellow and 6 hours to get to green.

When you bring a node back up and it has replicas on disk, ES will sync the replicas with the current primaries. That can take a varying amount of time, depending on how different the local segment files are. We are working on making it faster.

It was taking > 10 minutes for the node to rejoin the cluster

With 1.4.0 we considerably improved the joining process by using batching; it should be much faster. See #7493

When I query _cat/recovery?pretty=true&v&active_only=true, no shards are listed; however, _cat/shards would show 1 or 2 shards as INITIALIZING.

This is worrying. Can you reproduce the issue? Are there any errors in the logs? I'm looking for something like ConcurrentModificationException (but it may be something else!).

@portante

portante commented Dec 4, 2014

I have a small one-node ES instance (1.4.1, RHEL 7, OpenJDK 1.7.0) with 256 GB of memory, 4 sockets, 10 cores per socket (no hyper-threads), and 623 indexes (no replication, one shard per index). On ES startup, the cluster becomes unresponsive with Java running out of memory (it spends all its time in garbage collection). We just upgraded from 1.3.2-1.

I modified my startup sequence to immediately close all indexes after restarting Elasticsearch and then open them in sets of 10, waiting for the cluster health to go green between sets. That process ends up taking >10 minutes per set of indexes, but eventually it dies midway through the indexes.
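
Concretely, the close/open sequence is roughly along these lines (assuming a single local node on localhost:9200 and a batch size of 10):

# record the index names, then close everything
curl -s 'http://localhost:9200/_cat/indices?h=index' > indices.txt
curl -s -XPOST 'http://localhost:9200/_all/_close'

# re-open in sets of 10, waiting for green between sets
xargs -n 10 < indices.txt | tr ' ' ',' | while read batch; do
  curl -s -XPOST "http://localhost:9200/${batch}/_open"
  curl -s "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30m" > /dev/null
done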

We then opened them one at a time, from the most recent to the oldest, for only about 60 indexes, to see if we can limp along until we find a fix for this problem. The indexes are all around 11-13 GB in size, about 150M documents.

Does this sound familiar to anyone?

@TrentStewart
Author

The 53 node cluster is our oldest cluster (with 144 indices & 3330 shards) and has undergone several upgrades using the offline method, and in each scenario the cluster returned to yellow within 15 minutes and green within 40 minutes. Monday night we upgraded this cluster and it took 3 hours to get to yellow and 6 hours to get to green.

When you bring a node back up and it has replicas on disk, ES will sync the replicas with the current primaries. That can take a varying amount of time, depending on how different the local segment files are. We are working on making it faster.

True; a greater percentage of the recovery time appears to be spent in the translog stage.

It was taking > 10 minutes for the node to rejoin the cluster

With 1.4.0 we considerably improved the joining process by using batching; it should be much faster. See #7493

When I query _cat/recovery?pretty=true&v&active_only=true, no shards are listed; however, _cat/shards would show 1 or 2 shards as INITIALIZING.

This is worrying. Can you reproduce the issue? Are there any errors in the logs? I'm looking for something like ConcurrentModificationException (but it may be something else!).

Yes, it's not 100% reproducible, but it does happen frequently. There are no exceptions in the logs that I have discovered.

Since I posted this, I have done several rolling reboots and node-failure recoveries of our clusters. With 1.3.4 vs 1.3.2, a full recovery has gone from ~40 minutes to over 4 hours. Prior to 1.3.4 I never paid any attention to pending_tasks (it's part of my problem triage process now), so I'm not able to do an apples-to-apples comparison.

When _cat/recovery does not show anything, pending_tasks will grow to over 2000 tasks while it's stuck on a single task, usually a shard-started task that wasn't able to allocate to a node. Sometimes I can cancel the allocation, but most of the time the cancel times out as well and the cluster is left in limbo, waiting and waiting.
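
For reference, I'm watching the backlog with the pending tasks APIs (localhost stands in for a client node):

curl 'http://localhost:9200/_cluster/pending_tasks?pretty'
curl 'http://localhost:9200/_cat/pending_tasks?v'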

@portante

portante commented Dec 6, 2014

#8394 (comment) addressed the problem I saw above.

@bleskes
Contributor

bleskes commented Dec 17, 2014

@TrentStewart sorry for not getting back to you earlier.

When _cat/recovery does not show anything, pending_tasks will grow to over 2000 tasks while it's stuck on a single task, usually a shard-started task that wasn't able to allocate to a node.

When this happens again (master stuck on a single task), can you run the hot threads API on it? That will tell us what it's busy doing.
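
Something along these lines, ideally against the elected master (the node name is a placeholder):

curl 'http://localhost:9200/_nodes/hot_threads?threads=10'
curl 'http://localhost:9200/_nodes/<master_node_name>/hot_threads?threads=10'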

Sometimes I can cancel the allocation, but most of the time the cancel times out as well and the cluster is left in limbo, waiting and waiting.

Cancel is in itself a task, albeit one with the highest priority. It needs to wait until the current task is done before taking effect.

@julien51

We are seeing a similar issue where recovery seems to be stuck in the "translog" state.
When checking /<idx>/_recovery?pretty=true&active_only=true we do find that translog recovery is in progress:

{
      "id" : 2,
      "type" : "GATEWAY",
      "stage" : "TRANSLOG",
      "primary" : true,
      "start_time_in_millis" : 1424704351606,
      "stop_time_in_millis" : 0,
      "total_time_in_millis" : 3657431,
      "source" : {
        "id" : "lv6TBuiTSoS0nz1qQ9a1Bg",
        "host" : "elastic5.<host>.com.",
        "transport_address" : "inet[/192.168.162.33:9300]",
        "ip" : "104.237.137.221",
        "name" : "elastic5.<host>.com"
      },
      "target" : {
        "id" : "lv6TBuiTSoS0nz1qQ9a1Bg",
        "host" : "elastic5.<host>.com.",
        "transport_address" : "inet[/192.168.162.33:9300]",
        "ip" : "104.237.137.221",
        "name" : "elastic5.<host>.com"
      },
      "index" : {
        "files" : {
          "total" : 647,
          "reused" : 647,
          "recovered" : 647,
          "percent" : "100.0%"
        },
        "bytes" : {
          "total" : 8834720289,
          "reused" : 8834720289,
          "recovered" : 8834720289,
          "percent" : "100.0%"
        },
        "total_time_in_millis" : 48
      },
      "translog" : {
        "recovered" : 36274,
        "total_time_in_millis" : 0
      },
      "start" : {
        "check_index_time_in_millis" : 0,
        "total_time_in_millis" : 1021
      }
    }

But it is extremely slow (the 0 value in translog.total_time_in_millis is a lie!)... and we have no idea of the ETA. What does "recovered" mean in the context of the translog? Are these bytes?

@bleskes
Contributor

bleskes commented Feb 23, 2015

@julien51 those are translog operations. To find out how many there are in total and what the size of the translog is, you can run GET /<idx>/_stats?level=shards&human and check the primary shard. I'm working on a change that will supply those as part of the recovery output. Which version are you on?
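
i.e. something like this, with <idx> being your index (pretty is just for readability); the translog section of the primary shard's stats should have the numbers:

curl 'http://localhost:9200/<idx>/_stats?level=shards&human&pretty'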

@julien51

Version: 1.4.2
The problem: this is a primary shard initializing (after a recovery), so it looks like the number of operations is not available :(
Also, in about 5 minutes, we progressed by ~1500 operations... so slow!

@bleskes
Contributor

bleskes commented Feb 23, 2015

@julien51 I see, this is your primary shard? In this case it's hard to tell indeed. You can check the file system for the size of a file called "translog-????.recovering" (again, working on improving this).
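
For example, on the node holding that shard, something like this should find it (the data path is an assumption; adjust to your install):

find /var/lib/elasticsearch -name 'translog-*.recovering' -exec ls -lh {} \;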

When you shut down the cluster (did you?), did you have any relocations/recoveries going on (if you know)?

@julien51

Unfortunately, there was a bit of a disaster and the shutdown of the server was unexpected (we're still trying to understand what happened). We had 2 servers failing (out of 5), and of course they held the primary and secondary for one of our indices. But yes, based on the logs, I believe there was some relocation going on.

The file size is 1752285717 bytes. Any idea of how that roughly converts to a number of operations?
Is there one operation per line? The file has 511,014 lines... so if that's 1 line per operation, we're not even 10% of the way :( Any idea how to make this much faster? (Because at this point it will take 16 hours to recover a file of less than 2GB.)

I should add that it's similarly slow for the secondary shards, which are also initializing and in the "translog" state.

@julien51

Extra stupid question: this would in theory be an absolutely bad time to upgrade to 1.4.4, but I see 1.4.3 does bring improvements on the recovery front.
What if I prevent the cluster from doing any re-allocation and then upgrade the node on which this slow initializing is going on?
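
Concretely, I'm thinking of something like this before the upgrade (and setting it back to "all" afterwards):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'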

@julien51

After 6 hours, it was still processing the translog, but at an excruciatingly slow pace... so I restarted the node, and well, it finished in a matter of seconds on another host. We seem to have the same number of documents.
Still, we'll start re-indexing from our main datastore, just in case we missed anything.

@bleskes
Contributor

bleskes commented Feb 24, 2015

The file size is 1752285717 bytes. Any idea of how that roughly converts to a number of operations? Is there one operation per line? The file has 511,014 lines... so if that's 1 line per operation,

The translog is binary; the new lines are just there by accident (if it is viewed as text).

this would in theory be an absolutely bad time to upgrade to 1.4.4, but I see 1.4.3 does bring improvements on the recovery front.

1.4.3 helps by being more aggressive in trimming the translog post-recovery, but we're still chasing this issue. It is very rare but does occur (as you sadly noticed).

I restarted the node, and well, it finished in a matter of seconds on another host.
We seem to have the same number of documents.

Yes. The translog on the replica is flushed during recovery. It's only the primary that can grow, because we need it as a safety measure to catch all the documents indexed between starting to copy the Lucene files and starting the translog phase. Depending on what exactly happened, there might be none, but it is not guaranteed.

@clintongormley

The never-ending-translog bug was fixed several versions ago. I'm going to close this issue.
