Slow recovery during rolling cluster restarts #21884

Closed
gehel opened this issue Nov 30, 2016 · 15 comments
Comments


gehel commented Nov 30, 2016

Elasticsearch version: 2.3.5

Plugins installed:

JVM version:
java version "1.7.0_121"
OpenJDK Runtime Environment (IcedTea 2.6.8) (7u121-2.6.8-1ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.121-b00, mixed mode)

OS version: Ubuntu Trusty (14.04)

Description of the problem including expected versus actual behavior:

When doing a rolling restart of our Elasticsearch cluster, it takes ~1h to get back to a green cluster state after restarting a node. We tried a few strategies, including the procedure described under "Steps to reproduce" below.

During recovery, we can see that all the shards that were on the restarted node are restored from an active replica, causing a large amount of network traffic that is throttled and therefore slow. With our 2 main Elasticsearch clusters being 24 and 31 nodes respectively, a full cluster restart takes ~24h.

With writes disabled and a synced flush performed, I would expect the shards to be recovered from local disk on the restarted node.

Steps to reproduce (a rough sketch of the corresponding API calls follows the list):

  1. disable writes to the cluster
  2. synced flush
  3. set shard allocation to primaries
  4. restart a node
  5. set shard allocation to all
  6. wait
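
For reference, steps 2 to 6 translate roughly into the following calls (a sketch only: hostnames are placeholders, and how step 1 is implemented depends on the setup, so it is not shown here):

# 2. synced flush (best effort; check the per-index failure counts in the response)
curl -XPOST 'localhost:9200/_flush/synced?pretty'

# 3. only allow primary shard allocation while the node is down
curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.enable": "primaries" } }'

# 4. restart the node (service name may differ per installation)
sudo service elasticsearch restart

# 5. re-enable full shard allocation
curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.enable": "all" } }'

# 6. wait for the cluster to go green again
curl 'localhost:9200/_cluster/health?wait_for_status=green&timeout=60m&pretty'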

I'm at a loss as to which info might make sense to include here, but I'm more than happy to send anything that might be useful.


gehel commented Nov 30, 2016

For the record, this is tracked on our side at https://phabricator.wikimedia.org/T145065

@boogie-byte

We have exactly the same issue.

# /usr/share/elasticsearch/bin/elasticsearch --version
Version: 2.3.2, Build: b9e4a6a/2016-04-21T16:03:47Z, JVM: 1.7.0_79

Installed plugins:

  • marvel-agent
  • repository-hdfs


bleskes commented Dec 4, 2016

@gehel do you have the option to repeat the experiment and debug it? Most specifically, I'm interested to see whether the synced flush in step 2 actually worked. It's a best-effort API and it reports its success in the response: do you check for that? Also, we can validate that it worked by looking at shard-level stats, but any indexing might invalidate it.


gehel commented Dec 4, 2016

@bleskes Yes, I can repeat the experiment. I did check the results of the synced flush and did not see any issue there. I can check again.

I have a full cluster restart coming up this week; if you have a list of things to check during that restart, I'll be more than happy to collect any helpful data!


bleskes commented Dec 4, 2016

Here are a few things to try (a rough curl sketch follows the list):

  1. Check the output of the synced flush to see that all shards were synced-flushed.
  2. Check the GET {index}/_stats?level=shards output and look at the commit section; you should see the synced flush id. It should be the same between primaries and replicas of the same shard. This can be verbose, so maybe use filter_path to trim it down.
  3. Enable trace logging on indices.recovery. This will give us full insight into the decisions made by the recovery logic.
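
A sketch of what those checks could look like with curl (the filter_path expression is an assumption based on the shape of the shard-level stats output, and the logger name is the 2.x one):

# 1. synced flush; the response lists failures per index
curl -XPOST 'localhost:9200/_flush/synced?pretty'

# 2. compare the sync_id of the commit on every copy of every shard
curl 'localhost:9200/_stats?level=shards&filter_path=indices.*.shards.*.commit.user_data.sync_id&pretty'

# 3. enable trace logging for the recovery logic
curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient": { "logger.indices.recovery": "TRACE" } }'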


gehel commented Dec 6, 2016

I'm digging into this and I'm finding a few things. First, I do see failures to flush some shards because of "pending operations". I see 31 failures out of a total of ~3K indices. Unsurprisingly, some of the failures are on our high-traffic / large indices, but some are also on smaller indices.

If I understand correctly, that probably means that, contrary to my belief, we still have some writes going on. I'll continue digging in this direction.
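
One way to check whether indexing is still happening (a bash sketch using the indices stats API; the 30s interval is arbitrary) is to take two snapshots of the per-index indexing counters and compare them:

URL='localhost:9200/_stats/indexing?filter_path=indices.*.total.indexing.index_total'
a=$(curl -s "$URL"); sleep 30; b=$(curl -s "$URL")
# identical snapshots mean no documents were indexed in between
diff <(echo "$a") <(echo "$b") && echo "no indexing observed" || echo "writes still going on"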


gehel commented Dec 6, 2016

Digging further, I also see the shard mgwiki_content_first[0] being recovered from another node:

[2016-12-06 15:47:31,391][TRACE][indices.recovery         ] [elastic2002] [mgwiki_content_first][0] starting recovery from {elastic2012}{TLb2tIskQlaLTa3ibtQ_og}{10.192.16.148}{10.192.16.148:9300}{rack=B8, row=B, master=false}

That shard was flushed without failure:

"mgwiki_content_first" : {
  "total" : 3,
  "successful" : 3,
  "failed" : 0
},

and the sync_ids were identical on all copies after the flush:

"mgwiki_content_first" : {
  "shards" : {
    "0" : [ {
      "commit" : {
        "id" : "TbkUWvbthd6WbrLXrFG1sA==",
        "generation" : 42383,
        "user_data" : {
          "sync_id" : "AVjUsu2snwYQ9LxOcJzQ",
          "translog_uuid" : "iesOxXkjSCiS4O5bW8pE-g",
          "translog_generation" : "807"
        },
        "num_docs" : 117992
      }
    }, {
      "commit" : {
        "id" : "hOeITyQpv6biwUx4OQV06g==",
        "generation" : 42382,
        "user_data" : {
          "sync_id" : "AVjUsu2snwYQ9LxOcJzQ",
          "translog_uuid" : "lFrUP_ZDT9eH3b-JFsrbLA",
          "translog_generation" : "810"
        },
        "num_docs" : 117992
      }
    }, {
      "commit" : {
        "id" : "tIi13S81XgWbMXu5CPv4wQ==",
        "generation" : 42383,
        "user_data" : {
          "sync_id" : "AVjUsu2snwYQ9LxOcJzQ",
          "translog_uuid" : "HWxI6X-cTpyJZrFj2xsrxg",
          "translog_generation" : "846"
        },
        "num_docs" : 117992
      }
    } ]
  }
},

By the time that shard was being recovered, the sync_id had changed, but the number of documents was still the same:

"mgwiki_content_first" : {
  "shards" : {
    "0" : [ {
      "commit" : {
        "id" : "hOeITyQpv6biwUx4OQV75Q==",
        "generation" : 42386,
        "user_data" : {
          "translog_uuid" : "lFrUP_ZDT9eH3b-JFsrbLA",
          "sync_id" : "AVjUy6z-8YIXoWOFGpQ4",
          "translog_generation" : "812"
        },
        "num_docs" : 117992
      }
    }, {
      "commit" : {
        "id" : "TbkUWvbthd6WbrLXrFG8xQ==",
        "generation" : 42387,
        "user_data" : {
          "translog_uuid" : "iesOxXkjSCiS4O5bW8pE-g",
          "sync_id" : "AVjUy6z-8YIXoWOFGpQ4",
          "translog_generation" : "809"
        },
        "num_docs" : 117992
      }
    }, {
      "commit" : {
        "id" : "hOeITyQpv6biwUx4OQV75Q==",
        "generation" : 42386,
        "user_data" : {
          "translog_uuid" : "lFrUP_ZDT9eH3b-JFsrbLA",
          "sync_id" : "AVjUy6z-8YIXoWOFGpQ4",
          "translog_generation" : "812"
        },
        "num_docs" : 117992
      }
    } ]
  }
},

I'm not really sure what all this means...


bleskes commented Dec 6, 2016

being recovered from another node

Shards still go through recovery even when they were synced-flushed (we need to compare the ids), but it's very quick.

I'm not really sure what all this means...

This means that there was another synced flush and the shards got a new sync_id. Since the total doc count is the same, I presume some docs were updated and, after 5m of inactivity, the shards were automatically synced-flushed again.

I'm going to close this for now. Please do reopen if things look wrong again.

bleskes closed this as completed Dec 6, 2016

gehel commented Dec 6, 2016

We actually still had writes going on. We found the source. I'm going to try again tomorrow and see whether I get the same result.
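
The thread doesn't say how writes were disabled; one option for enforcing it on the Elasticsearch side (an assumption, not necessarily what was used here) is a write block on the indices:

# block writes on all indices before the restart...
curl -XPUT 'localhost:9200/_all/_settings' -d '{ "index.blocks.write": true }'

# ...and lift the block once the cluster is green again
curl -XPUT 'localhost:9200/_all/_settings' -d '{ "index.blocks.write": false }'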


gehel commented Dec 7, 2016

With writes properly disabled, it works much better :) We are still having issues with one of our indices, but that needs some more investigation on our side.

@bleskes thanks a lot for the help!


bleskes commented Dec 7, 2016

YW! thanks for coming back to report.


larschri commented Dec 15, 2017

Steps to reproduce:

  1. disable writes to the cluster
  2. synced flush
  3. set shard allocation to primaries
  4. restart a node
  5. set shard allocation to all
  6. wait

Manual synced flush is worthless in some cases. The sync_ids will be overwritten when shards are marked as inactive, even if a synced flush was executed. This happens after 5 minutes (indices.memory.shard_inactive_time) by default, so it might be a lot quicker to wait for all indices to become inactive, instead of relying on synced flush.

https://discuss.elastic.co/t/shard-sync-id-keeps-changing-in-read-only-cluster-es-5-5/110835
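
One way to act on this (a bash sketch reusing the shard-level stats call shown earlier; the 60s interval is arbitrary) is to only restart a node once two consecutive snapshots of the sync_id markers are identical, i.e. no inactivity-triggered synced flush has changed them in between:

STATS='localhost:9200/_stats?level=shards&filter_path=indices.*.shards.*.commit.user_data.sync_id'
before=$(curl -s "$STATS")
sleep 60
after=$(curl -s "$STATS")
if [ "$before" = "$after" ]; then
  echo "sync_ids stable, safe to restart the node"
else
  echo "sync_ids still changing, wait and retry"
fi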


bleskes commented Dec 15, 2017

@larschri This is indeed a shortcoming of the current logic. The synced flush triggered by inactivity forces a new sync id while the node is offline, and thus when it comes back online its marker is different. This is not an issue with version 6.0, since the new ops-based recovery will kick in and recovery will be just as quick. It is, however, something that we should fix for 5.6. Can you please open a new issue?

@larschri

@bleskes #27838


bleskes commented Dec 15, 2017 via email
