Slow recovery during rolling cluster restarts #21884

Closed
gehel opened this issue Nov 30, 2016 · 15 comments
Comments


gehel commented Nov 30, 2016

Elasticsearch version: 2.3.5

Plugins installed:

JVM version:
java version "1.7.0_121"
OpenJDK Runtime Environment (IcedTea 2.6.8) (7u121-2.6.8-1ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.121-b00, mixed mode)

OS version: Ubuntu Trusty (14.04)

Description of the problem including expected versus actual behavior:

When doing a rolling restart of our Elasticsearch cluster, it takes ~1h to get back to a green cluster state after restarting a node. We tried a few strategies, including the procedure described under "Steps to reproduce" below.

During recovery, we can see that all the shards that were on the restarted node are restored from an active replica, causing a large amount of network traffic that is throttled and therefore slow. With our 2 main Elasticsearch clusters being 24 and 31 nodes respectively, a full cluster restart takes ~24h.

With writes disabled and a synced flush performed, I would expect the shards to be recovered from local disk on the restarted node.

Steps to reproduce (a rough sketch of the corresponding API calls follows the list):

  1. disable writes to the cluster
  2. synced flush
  3. set shard allocation to primaries
  4. restart a node
  5. set shard allocation to all
  6. wait
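
For reference, steps 2 to 6 translate roughly into the following calls (a sketch only: hostnames are placeholders, and how step 1 is implemented depends on the setup, so it is not shown here):

# 2. synced flush (best effort; check the per-index failure counts in the response)
curl -XPOST 'localhost:9200/_flush/synced?pretty'

# 3. only allow primary shard allocation while the node is down
curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.enable": "primaries" } }'

# 4. restart the node (service name may differ per installation)
sudo service elasticsearch restart

# 5. re-enable full shard allocation
curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.enable": "all" } }'

# 6. wait for the cluster to go green again
curl 'localhost:9200/_cluster/health?wait_for_status=green&timeout=60m&pretty'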

I'm at a loss as to which info might make sense to include here, but I'm more than happy to send anything that might be useful.


gehel commented Nov 30, 2016

For the record, this is tracked on our side at https://phabricator.wikimedia.org/T145065

@boogie-byte

We have exactly the same issue.

# /usr/share/elasticsearch/bin/elasticsearch --version
Version: 2.3.2, Build: b9e4a6a/2016-04-21T16:03:47Z, JVM: 1.7.0_79

Installed plugins:

  • marvel-agent
  • repository-hdfs


bleskes commented Dec 4, 2016

@gehel do you have the option to repeat the experiment and debug it? Most specifically, I'm interested to see whether the synced flush in step 2 actually worked. It's a best-effort API and it reports its success in the response: do you check for that? Also, we can validate that it worked by looking at shard-level stats, but any indexing might invalidate it.


gehel commented Dec 4, 2016

@bleskes Yes, I can repeat the experiment. I did check the results of the synced flush and did not see any issue there. I can check again.

I have a full cluster restart coming up this week; if you have a list of things to check during that restart, I'll be more than happy to collect any helpful data!


bleskes commented Dec 4, 2016

Here are a few things to try (a rough curl sketch follows the list):

  1. Check the output of the synced flush to see that all shards were synced-flushed.
  2. Check the GET {index}/_stats?level=shards output and look at the commit section; you should see the synced flush id. It should be the same between primaries and replicas of the same shard. This can be verbose, so maybe use filter_path to trim it down.
  3. Enable trace logging on indices.recovery. This will give us full insight into the decisions made by the recovery logic.
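
A sketch of what those checks could look like with curl (the filter_path expression is an assumption based on the shape of the shard-level stats output, and the logger name is the 2.x one):

# 1. synced flush; the response lists failures per index
curl -XPOST 'localhost:9200/_flush/synced?pretty'

# 2. compare the sync_id of the commit on every copy of every shard
curl 'localhost:9200/_stats?level=shards&filter_path=indices.*.shards.*.commit.user_data.sync_id&pretty'

# 3. enable trace logging for the recovery logic
curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient": { "logger.indices.recovery": "TRACE" } }'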


gehel commented Dec 6, 2016

I'm digging into this and I'm finding a few things. First, I do see failures to flush some shards because of "pending operations". I see 31 failures out of a total of ~3K indices. Unsurprisingly, some of the failures are on our high-traffic / large indices, but some are also on smaller indices.

If I understand correctly, that probably means that, contrary to my belief, we still have some writes going on. I'll continue digging in this direction.
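
One way to check whether indexing is still happening (a bash sketch using the indices stats API; the 30s interval is arbitrary) is to take two snapshots of the per-index indexing counters and compare them:

URL='localhost:9200/_stats/indexing?filter_path=indices.*.total.indexing.index_total'
a=$(curl -s "$URL"); sleep 30; b=$(curl -s "$URL")
# identical snapshots mean no documents were indexed in between
diff <(echo "$a") <(echo "$b") && echo "no indexing observed" || echo "writes still going on"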


gehel commented Dec 6, 2016

Digging further, I also see the shard mgwiki_content_first[0] being recovered from another node:

[2016-12-06 15:47:31,391][TRACE][indices.recovery         ] [elastic2002] [mgwiki_content_first][0] starting recovery from {elastic2012}{TLb2tIskQlaLTa3ibtQ_og}{10.192.16.148}{10.192.16.148:9300}{rack=B8, row=B, master=false}

That shard was flushed without failure:

"mgwiki_content_first" : {
  "total" : 3,
  "successful" : 3,
  "failed" : 0
},

and the sync_ids were identical on all copies after the flush:

"mgwiki_content_first" : {
  "shards" : {
    "0" : [ {
      "commit" : {
        "id" : "TbkUWvbthd6WbrLXrFG1sA==",
        "generation" : 42383,
        "user_data" : {
          "sync_id" : "AVjUsu2snwYQ9LxOcJzQ",
          "translog_uuid" : "iesOxXkjSCiS4O5bW8pE-g",
          "translog_generation" : "807"
        },
        "num_docs" : 117992
      }
    }, {
      "commit" : {
        "id" : "hOeITyQpv6biwUx4OQV06g==",
        "generation" : 42382,
        "user_data" : {
          "sync_id" : "AVjUsu2snwYQ9LxOcJzQ",
          "translog_uuid" : "lFrUP_ZDT9eH3b-JFsrbLA",
          "translog_generation" : "810"
        },
        "num_docs" : 117992
      }
    }, {
      "commit" : {
        "id" : "tIi13S81XgWbMXu5CPv4wQ==",
        "generation" : 42383,
        "user_data" : {
          "sync_id" : "AVjUsu2snwYQ9LxOcJzQ",
          "translog_uuid" : "HWxI6X-cTpyJZrFj2xsrxg",
          "translog_generation" : "846"
        },
        "num_docs" : 117992
      }
    } ]
  }
},

By the time that shard was being recovered, the sync_id had changed, but the number of documents was still the same:

"mgwiki_content_first" : {
  "shards" : {
    "0" : [ {
      "commit" : {
        "id" : "hOeITyQpv6biwUx4OQV75Q==",
        "generation" : 42386,
        "user_data" : {
          "translog_uuid" : "lFrUP_ZDT9eH3b-JFsrbLA",
          "sync_id" : "AVjUy6z-8YIXoWOFGpQ4",
          "translog_generation" : "812"
        },
        "num_docs" : 117992
      }
    }, {
      "commit" : {
        "id" : "TbkUWvbthd6WbrLXrFG8xQ==",
        "generation" : 42387,
        "user_data" : {
          "translog_uuid" : "iesOxXkjSCiS4O5bW8pE-g",
          "sync_id" : "AVjUy6z-8YIXoWOFGpQ4",
          "translog_generation" : "809"
        },
        "num_docs" : 117992
      }
    }, {
      "commit" : {
        "id" : "hOeITyQpv6biwUx4OQV75Q==",
        "generation" : 42386,
        "user_data" : {
          "translog_uuid" : "lFrUP_ZDT9eH3b-JFsrbLA",
          "sync_id" : "AVjUy6z-8YIXoWOFGpQ4",
          "translog_generation" : "812"
        },
        "num_docs" : 117992
      }
    } ]
  }
},

I'm not really sure what all this means...


bleskes commented Dec 6, 2016

being recovered from another node

Shards still go through recovery even when they were synced-flushed (we need to compare the ids), but it's very quick.

I'm not really sure what all this means...

This means that there was another synced flush and the shards got a new sync_id. Since the total doc count is the same, I presume some docs were updated and, after 5m of inactivity, the shards were automatically synced-flushed again.

I'm going to close this for now. Please do reopen if things look wrong again.

bleskes closed this as completed Dec 6, 2016

gehel commented Dec 6, 2016

We actually still had writes going on. We found the source. I'm going to try again tomorrow and see whether I get the same result.
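
The thread doesn't say how writes were disabled; one option for enforcing it on the Elasticsearch side (an assumption, not necessarily what was used here) is a write block on the indices:

# block writes on all indices before the restart...
curl -XPUT 'localhost:9200/_all/_settings' -d '{ "index.blocks.write": true }'

# ...and lift the block once the cluster is green again
curl -XPUT 'localhost:9200/_all/_settings' -d '{ "index.blocks.write": false }'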


gehel commented Dec 7, 2016

With writes properly disabled, it works much better :) We are still having issues with one of our indices, but that needs some more investigation on our side.

@bleskes thanks a lot for the help!


bleskes commented Dec 7, 2016

YW! thanks for coming back to report.


larschri commented Dec 15, 2017

Steps to reproduce:

  1. disable writes to the cluster
  2. synced flush
  3. set shard allocation to primaries
  4. restart a node
  5. set shard allocation to all
  6. wait

Manual synced flush is worthless in some cases. The sync_ids will be overwritten when shards are marked as inactive, even if a synced flush was executed. This happens after 5 minutes (indices.memory.shard_inactive_time) by default, so it might be a lot quicker to wait for all indices to become inactive, instead of relying on synced flush.

https://discuss.elastic.co/t/shard-sync-id-keeps-changing-in-read-only-cluster-es-5-5/110835
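
One way to act on this (a bash sketch reusing the shard-level stats call shown earlier; the 60s interval is arbitrary) is to only restart a node once two consecutive snapshots of the sync_id markers are identical, i.e. no inactivity-triggered synced flush has changed them in between:

STATS='localhost:9200/_stats?level=shards&filter_path=indices.*.shards.*.commit.user_data.sync_id'
before=$(curl -s "$STATS")
sleep 60
after=$(curl -s "$STATS")
if [ "$before" = "$after" ]; then
  echo "sync_ids stable, safe to restart the node"
else
  echo "sync_ids still changing, wait and retry"
fi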


bleskes commented Dec 15, 2017

@larschri This is indeed a shortcoming of the current logic. The synced flush triggered by inactivity forces a new sync id while the node is offline, and thus when it comes back online its marker is different. This is not an issue with version 6.0, since the new ops-based recovery will kick in and recovery will be just as quick. It is, however, something that we should fix for 5.6. Can you please open a new issue?

@larschri

@bleskes #27838


bleskes commented Dec 15, 2017 via email
