ElasticSearch 1.3.4 recovery slow on larger clusters (50+ total nodes) #8487
Comments
When you bring a node back up and it has replicas on disk, ES will sync the replicas with the current primaries. That can take a varying amount of time, depending on how different the local segment files are. We are working on making it faster.
With 1.4.0 we considerably improved the joining process by using batching; it should be much faster. See #7493
This is worrying. Can you reproduce the issue? Are there any errors in the logs? I'm looking for something like ConcurrentModificationException (but it may be something else!).
I have a small one-node ES instance (1.4.1, RHEL 7, OpenJDK 1.7.0) [...]. I modified my startup sequence to immediately close all indexes after [...]. We then opened them one at a time, from the most recent to the oldest, for [...]. Does this sound familiar to anyone? (In reply to Boaz Leskes's comment of Mon, Nov 17, 2014, 7:47 AM.)
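The close-then-reopen workaround described above can be sketched with the 1.x indices open/close and cluster health APIs. This is a sketch, not the poster's actual script: the host (localhost:9200) and the index name (logs-2014.11.17) are assumptions.

```shell
# Close all indexes immediately after startup (ES 1.x close API)
curl -XPOST 'http://localhost:9200/_all/_close'

# Reopen them one at a time, most recent first
# ("logs-2014.11.17" is a hypothetical index name)
curl -XPOST 'http://localhost:9200/logs-2014.11.17/_open'

# Wait for that index's shards to become active before opening the next one
curl 'http://localhost:9200/_cluster/health/logs-2014.11.17?wait_for_status=yellow&timeout=5m'
```

Opening indexes serially keeps the number of concurrent shard recoveries small, which is presumably why the sequence helped on a constrained single node.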
True, a greater percentage of the recovery time appears to be spent in the translog stage.
Yes, it's not 100% reproducible, but it does happen frequently. There are no exceptions in the logs that I have discovered. Since I posted this, I have done several rolling reboots and node-failure recoveries of our clusters. With 1.3.4 vs 1.3.2, a full recovery has gone from ~40 minutes to over 4 hours. Prior to 1.3.4 I never paid any attention to pending_tasks; it's part of my problem-triage process now, so I'm not able to do an apples-to-apples comparison. When _cat/recovery does not show anything, pending_tasks will grow to over 2000 tasks while the master is stuck on a single task, usually a shard-started task that wasn't able to allocate to a node. Sometimes I can cancel the allocation, but most of the time the cancel times out as well, and the cluster is left in limbo, waiting and waiting.
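The pending_tasks queue mentioned above can be watched with the cluster pending-tasks API, available in 1.x. A sketch, assuming the cluster answers on localhost:9200:

```shell
# List tasks queued on the master, e.g. stuck shard-started entries
curl 'http://localhost:9200/_cluster/pending_tasks?pretty'

# Rough count of queued tasks (uses python for JSON parsing; a convenience sketch)
curl -s 'http://localhost:9200/_cluster/pending_tasks' \
  | python -c 'import json,sys; print(len(json.load(sys.stdin)["tasks"]))'
```

Each entry includes the task source and how long it has been in the queue, which helps distinguish a slow master from a stuck one.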
#8394 (comment) addressed the problem I saw above.
@TrentStewart sorry for not getting back to you earlier.
When this happens again (master stuck on a single task), can you run the hot threads API on it? That will tell us what it's busy doing.
Cancel is itself a task, albeit with the highest priority. It needs to wait until the current task is done before taking effect.
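The hot threads call suggested above might look like this. A sketch: the host and the node name (node-03) are placeholders, and the node-filter syntax follows the 1.x nodes API.

```shell
# Dump the busiest threads on every node; inspect the master's section
curl 'http://localhost:9200/_nodes/hot_threads'

# Or target a single node by name ("node-03" is a placeholder)
curl 'http://localhost:9200/_nodes/node-03/hot_threads?threads=5'
```

The output is plain text, one section per node, showing the hottest stack traces sampled over a short interval.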
We are seeing a similar issue where recovery seems to be stuck in the "translog" state.
But it is extremely slow (the 0 value in translog.total_time_in_millis is a lie!)... and we have no idea of an ETA. What does [...]
@julien51 those are translog operations. To find out how many there are in total and what the size of the translog is, you can run [...]
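One way to inspect translog totals in 1.x is the index stats API, which reports translog operation counts and size, alongside the shard-level recovery API quoted earlier in the thread. A sketch, assuming a hypothetical index named my-index on localhost:9200:

```shell
# Per-index translog stats: number of operations and their size in bytes
curl 'http://localhost:9200/my-index/_stats/translog?pretty'

# Shard-level recovery view, including per-shard translog progress
curl 'http://localhost:9200/my-index/_recovery?pretty'
```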
Version: 1.4.2
@julien51 I see, this is your primary shard? In this case it's hard to tell indeed. You can check the file system for the size of a file called "translog-????.recovering" (again, working on improving this). When you shut down the cluster (did you?), did you have any relocations/recoveries going on? (If you know.)
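Locating that recovering translog file on disk might look like this. The path is an assumption: the 1.x default layout puts translogs under data/{cluster_name}/nodes/0/indices/{index}/{shard}/translog, relative to path.data.

```shell
# Find and size any in-flight recovery translogs under the data directory
# (the path is an example; adjust to your path.data setting)
find /var/lib/elasticsearch -name 'translog-*.recovering' -exec ls -lh {} \;
```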
Unfortunately, there was a bit of a disaster and the shutdown of the server was unexpected (we're still trying to understand what happened). We had 2 servers fail (out of 5), and of course they had the primary and secondary for one of our indices. But yes, based on the logs, I believe there was some relocation going on. The file size is 1752285717 bytes (~1.75 GB). Any idea how that roughly converts to a number of operations? I should add that it's similarly slow for the secondary shards, which are also initializing and in the "translog" state.
Extra stupid question: this would in theory be an absolutely bad time to upgrade to 1.4.4, but I see 1.4.3 does bring improvements on the recovery front. |
After 6 hours, it was still processing the translog, but at an excruciatingly slow pace... so I restarted the node, and well, it finished in a matter of seconds on another host. We seem to have the same number of documents. |
The translog is binary. New lines are just there by accident (if viewed as text).
1.4.3 helps by being more aggressive in trimming the translog post-recovery, but we're still chasing this issue. It is very rare but does occur (as you sadly noticed).
Yes. The translog on the replica is being flushed during recovery. It's only the primary that can grow, because we need it as a safety measure to catch all the documents indexed between starting to copy the Lucene files and starting the translog phase. Depending on what exactly happened, there might be none, but it is not guaranteed.
The never-ending-translog bug was fixed several versions ago. I'm going to close this issue. |
We are seeing a situation on clusters running 1.3.4 with greater than 50 total nodes where shard recovery/allocation is either failing or VERY slow.
Full details:
I currently have 6 clusters, 4 with 24 total nodes and 2 larger clusters with 53 & 63 nodes respectively. Everything is run on VMs running Windows Server 2012 R2 within Azure.
All clusters (except the 63 node) were upgraded from 1.3.2 to 1.3.4 using an offline method of shutting down the cluster clean, swapping out the x64 Windows Service, then restarting all of the nodes. The 63 node cluster was built last week clean with 1.3.4.
For the 24-node clusters, the clusters returned to a green status after the upgrade in less than 30 minutes and are performing very well. I also did a rolling reboot of all of the machines to apply OS updates. I have automation that sets allocation to new_primaries, shuts down the service, reboots the machine, waits for the node to rejoin the cluster, sets allocation back to all, then waits for the cluster to return to green before proceeding to the next node. Again, each of the 24-node clusters completed the entire process in just shy of 2 hours.
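The per-node automation described above maps roughly onto the 1.x cluster settings and health APIs. A sketch, not the poster's actual tooling: localhost:9200 is an assumption, and the setting used is cluster.routing.allocation.enable, whose new_primaries value matches what the poster describes.

```shell
# 1) Restrict allocation so shards don't shuffle while the node is down
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "new_primaries" }
}'

# 2) Stop the service, reboot, and wait for the node to rejoin (done out of band)

# 3) Re-enable allocation
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'

# 4) Block until the cluster is green before moving to the next node
curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30m'
```

Disabling full allocation during the restart prevents the cluster from rebuilding the down node's replicas elsewhere, so recovery after rejoin is mostly a local sync.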
The 53-node cluster is our oldest cluster (with 144 indices & 3330 shards) and has undergone several upgrades using the offline method; in each scenario the cluster returned to yellow within 15 minutes and green within 40 minutes. Monday night we upgraded this cluster and it took 3 hours to get to yellow and 6 hours to get to green.
The 63-node cluster is our newest cluster (currently 12 indices & 272 shards). Last night & today, while performing a rolling reboot (each node has a max of 3 shards on it), it was taking > 10 minutes after each machine rebooted for the node to rejoin the cluster, and randomly some shards would never finish initializing.
When I query _cat/recovery?pretty=true&v&active_only=true, no shards are listed; however, _cat/shards shows 1 or 2 shards as INITIALIZING. If I issue a reroute cancel command on the initializing shards, they almost immediately allocate and the cluster turns green.
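The reroute cancel mentioned above would look roughly like this with the 1.x cluster reroute API. A sketch: the index name, shard number, and node name are placeholders.

```shell
# Cancel the stuck INITIALIZING shard copy so the master re-attempts allocation
# ("my-index", shard 0, and "node-03" are placeholders)
curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
  "commands": [
    { "cancel": { "index": "my-index", "shard": 0, "node": "node-03", "allow_primary": true } }
  ]
}'
```

allow_primary is only needed when the stuck copy is a primary; omit it for replicas.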