Internal: Upgrade caused shard data to stay on nodes #7386

Closed
nik9000 opened this Issue Aug 21, 2014 · 37 comments

@nik9000
Contributor

nik9000 commented Aug 21, 2014

Upgrade caused shard data to stay on nodes even after it isn't useful any more.

This comes from https://groups.google.com/forum/#!topic/elasticsearch/Mn1N0xmjsL8

What I did:
Started upgrading from Elasticsearch 1.2.1 to Elasticsearch 1.3.2. For each of the 6 nodes I updated:

  • Set allocation to primaries only
  • Sync new plugins into place
  • Update deb package
  • Restart Elasticsearch
  • Wait for Elasticsearch to respond on the local host
  • Set allocation to all
  • Wait for Elasticsearch to report GREEN
  • Sleep for half an hour so the cluster can rebalance itself a bit
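
The per-node procedure above can be sketched as an ordered list of REST calls. This is only an illustration, not the script that was actually run: it assumes the allocation steps map to the dynamic `cluster.routing.allocation.enable` setting, that Elasticsearch listens on the default port 9200, and that a 30 minute health timeout is acceptable. The class name, method names, and host name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: lists the REST calls one node's upgrade might issue. */
class RollingUpgradePlan {

    // Assumption: "set allocation to primaries only / all" refers to the
    // dynamic setting cluster.routing.allocation.enable.
    static String allocationBody(String mode) {
        return "{\"transient\":{\"cluster.routing.allocation.enable\":\"" + mode + "\"}}";
    }

    /** Ordered steps for a single node; package and plugin work is left as comment lines. */
    static List<String> stepsForNode(String host) {
        String base = "http://" + host + ":9200";
        List<String> steps = new ArrayList<>();
        steps.add("PUT " + base + "/_cluster/settings " + allocationBody("primaries"));
        steps.add("# sync plugins, update the deb package, restart Elasticsearch on " + host);
        steps.add("GET " + base + "/ (retry until the node answers)");
        steps.add("PUT " + base + "/_cluster/settings " + allocationBody("all"));
        steps.add("GET " + base + "/_cluster/health?wait_for_status=green&timeout=30m");
        steps.add("# sleep 30 minutes so the cluster can rebalance");
        return steps;
    }

    public static void main(String[] args) {
        // "localhost" is a placeholder host name.
        for (String step : stepsForNode("localhost")) {
            System.out.println(step);
        }
    }
}
```

Keeping the plan as data makes it easy to log or dry-run before pointing it at a production cluster.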

What happened:
The new version of Elasticsearch came up but didn't remove all the shard data it can't use. This picture from Whatson shows the problem pretty well:
https://wikitech.wikimedia.org/wiki/File:Whatson_out_of_disk.png
The nodes on the left were upgraded and blue means disk usage by Elasticsearch and brown is "other" disk usage.

When I dig around on the filesystem all the space usage is in the shard storage directory (/var/lib/elasticsearch/production-search-eqiad/nodes/0/indices) but when I compare the list of open files to the list of files on the file system with this I see that whole directories are just sitting around, unused. Hitting the /_cat/shards/<directory_name> corroborates that the shard in the directory isn't on the node. Oddly, if we keep poking around we find open files in directories representing shards that we don't expect to be on the node either....
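
The comparison described above boils down to a set difference. A minimal sketch, assuming you have already collected the `index/shard` directory names under the indices directory and the shards that `/_cat/shards` reports for this node. The helper and the index names are hypothetical and not part of Elasticsearch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative helper, not part of Elasticsearch. */
class OrphanedShardDirs {

    /**
     * Returns the "index/shard" entries found on disk that the cluster does
     * not report as allocated to this node.
     */
    static Set<String> orphaned(Set<String> onDisk, Set<String> allocatedHere) {
        Set<String> result = new TreeSet<>(onDisk); // sorted copy, inputs stay untouched
        result.removeAll(allocatedHere);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical index names and shard numbers.
        Set<String> onDisk = new HashSet<>(Arrays.asList("enwiki/0", "enwiki/3", "dewiki/1"));
        Set<String> allocatedHere = new HashSet<>(Arrays.asList("enwiki/0"));
        System.out.println(orphaned(onDisk, allocatedHere));
    }
}
```

Since open files also showed up in directories for shards that were not expected on the node, a candidate list like this should be cross-checked against the open-file list before anything is deleted.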

What we're doing now:
We're going to try restarting the upgrade and blasting the data directory on the node as we upgrade it.

Reproduction steps:
No idea. And I'm a bit afraid to keep pushing things on our cluster with it in the state that it is in.

@s1monw

Contributor

s1monw commented Aug 21, 2014

Could this be related to #6692? Did you upgrade all nodes to 1.3, or do you still have nodes < 1.3.0 in the cluster?

@nik9000

Contributor Author

nik9000 commented Aug 21, 2014

We had upgraded only about 1/3 of the nodes before we got warnings about disk space.

@s1monw

Contributor

s1monw commented Aug 21, 2014

I guess it's not freeing the space unless an upgraded node holds a copy of the shard. That is new in 1.3 and I'm still trying to remember what the background was. Can you check whether that assumption is true: are the shards that are not deleted allocated on old nodes?

@nik9000

Contributor Author

nik9000 commented Aug 21, 2014

Well, this is almost certainly the cause:

            // If all nodes have been upgraded to >= 1.3.0 at some point we get back here and have the chance to
            // run this api. (when cluster state is then updated)
            if (node.getVersion().before(Version.V_1_3_0)) {
                logger.debug("Skip deleting deleting shard instance [{}], a node holding a shard instance is < 1.3.0", shardRouting);
                return false;
            }

1.3 won't delete stuff from the disks until the whole cluster is 1.3. That's ugly. I run with disks 50% full and the upgrade process almost filled them just with shuffling.
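
To make the effect of that check concrete, here is a small self-contained model of it. It is a sketch of the single version gate quoted above, not the real `IndicesStore` code, which performs other checks as well. The class, the string-based version comparison, and the method names are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the quoted version gate; not the actual Elasticsearch implementation. */
class ShardDeletionGate {

    /** Compares dotted versions such as "1.2.1" and "1.3.0" numerically. */
    static boolean before(String version, String other) {
        String[] a = version.split("\\.");
        String[] b = other.split("\\.");
        for (int i = 0; i < Math.max(a.length, b.length); i++) {
            int x = i < a.length ? Integer.parseInt(a[i]) : 0;
            int y = i < b.length ? Integer.parseInt(b[i]) : 0;
            if (x != y) {
                return x < y;
            }
        }
        return false; // equal versions are not "before" each other
    }

    /** One holder older than 1.3.0 is enough to block deletion of the unused copy. */
    static boolean canDelete(List<String> versionsOfNodesHoldingCopies) {
        for (String version : versionsOfNodesHoldingCopies) {
            if (before(version, "1.3.0")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(canDelete(Arrays.asList("1.3.2", "1.2.1"))); // false
        System.out.println(canDelete(Arrays.asList("1.3.2", "1.3.2"))); // true
    }
}
```

In a half-upgraded cluster most shards still have at least one copy on a 1.2.1 node, which is why so little data was being freed.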

Side note: if the shards are still in the routing table it'd be nice to see them. Right now they seem to be invisible to the _cat API.

@s1monw

Contributor

s1monw commented Aug 21, 2014

@nik9000 this was a temporary measure to add extra safety. The amount of data left behind will drop as you upgrade more nodes. I agree we could expose more information here if stuff is still on disk.

@nik9000

Contributor Author

nik9000 commented Aug 21, 2014

This gave me quite a scare! I was running this upgrade overnight with a script with extra sleeping to keep the cluster balanced. It woke me up with 99% disk utilization on one of the nodes. I'll keep pushing the upgrade through carefully.

@nik9000 nik9000 closed this Aug 21, 2014

@nik9000

Contributor Author

nik9000 commented Aug 21, 2014

For posterity: if you nuke the contents of your node's disk after stopping Elasticsearch 1.2 but before starting Elasticsearch 1.3 then you won't end up with too much data that can't be cleared. The more nodes you upgrade, the more shards you'll be able to delete anyway, like @s1monw said.

@s1monw

Contributor

s1monw commented Aug 21, 2014

Just to clarify a bit more: we added some safety in 1.3 that required a new API, and we can only call this API if we know the shard is allocated on another 1.3 or newer node. That is why we keep the data around longer. Thanks for opening this, Nik!

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

So far we haven't seen any cleanup of old shards and we've just restarted the last node to pick up 1.3.2.
[Image: whatson_not_yet_cleaning]
Deleting the contents of the node slowed down the upgrade but allowed us to continue the process without space being taken up by indexes we couldn't remove.

@martijnvg

Member

martijnvg commented Aug 22, 2014

Unused shard copies only get deleted if all of the shard's active copies can be verified. Maybe the shards to be cleaned up had copies on the not-yet-upgraded node?

Unused shard copies should get cleaned up now, if that isn't the case then that is bad.

If you enable trace logging for the indices.store category then we can get a peek into ES's decision making.

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

@martijnvg - I'll see what happens once the whole cluster goes green after the last upgrade - that'll be in under an hour.

Did we do anything to allow changing log levels on the fly? I remember seeing something about it but #6416 is still open.

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

And by we I mean you, I guess :)

@martijnvg

Member

martijnvg commented Aug 22, 2014

:) Well this has been in for a while: #2517

It allows changing the log settings via the cluster update settings API.
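
As a rough illustration of what such a request might look like, the sketch below builds a transient settings body for the indices.store category. It assumes that dynamic logger settings use the `logger.<category>` key form; the helper class and method are hypothetical.

```java
/** Illustrative only: builds the JSON body for a dynamic log level change. */
class TraceLoggingRequest {

    // Assumption: the dynamic setting key has the form "logger.<category>".
    static String loggerBody(String category, String level) {
        return "{\"transient\":{\"logger." + category + "\":\"" + level + "\"}}";
    }

    public static void main(String[] args) {
        // The body would be sent with PUT to /_cluster/settings.
        System.out.println("PUT /_cluster/settings " + loggerBody("indices.store", "TRACE"));
    }
}
```

Using a transient setting means the extra logging does not survive a full cluster restart, which suits a short debugging session.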

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

That is getting spit out constantly.

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

Looks like it is on every node as well.

@nik9000 nik9000 reopened this Aug 22, 2014

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

Cluster is now green and lots of old data still sitting around.

@bleskes

Member

bleskes commented Aug 22, 2014

@nik9000 this is very odd. The line points at a null clusterName. Are all the nodes continuously logging this? Can I ask you to enable debug logging for the root logger and share the log? I hope to get more context on when this can happen.

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

I see that the cluster name is something that was introduced in 1.1.1. Maybe a coincidence - but I haven't performed a full cluster restart since upgrading to 1.1.0.

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

Let me see about that debug logging - seems like that'll be a ton of data. Also - looks like this is the only thing that doesn't check if the cluster name is non-null. Probably just a coincidence because it's supposed to be non-null since 1.1.1, I guess.....

@bleskes

Member

bleskes commented Aug 22, 2014

@nik9000 I'm not sure I follow what you mean by

> looks like this is the only thing that doesn't check if the cluster name is non null.

I was referring to this line: https://github.com/elasticsearch/elasticsearch/blob/v1.3.2/src/main/java/org/elasticsearch/indices/store/IndicesStore.java#L418

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

@bleskes - sorry, yeah. I was looking at other code that reads the cluster name, and it's pretty careful about the cluster name potentially being null. Like
https://github.com/elasticsearch/elasticsearch/blob/v1.3.2/src/main/java/org/elasticsearch/cluster/ClusterState.java#L577 and https://github.com/elasticsearch/elasticsearch/blob/v1.3.2/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java#L551 .

I guess what I'm saying is that if the cluster state never picked up the name somehow this looks like the only thing that would break.

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

Tried setting logger to debug and didn't get anything super interesting. Here is some of it: https://gist.github.com/nik9000/b9c40805abb4bcbb5b61

@bleskes

Member

bleskes commented Aug 22, 2014

Thx Nik. I have a theory. Indeed, the cluster name as part of the cluster state was introduced in 1.1.1. When a node of version >= 1.1.1 reads the cluster state from an older node, that field will be populated with null. During the upgrade from 1.1.0 this happened, and the cluster state in memory has its name set to null. Since you never restarted the complete cluster since then, all nodes have kept communicating it, keeping it alive. This trips this new code. A full cluster restart should fix it but that's obviously totally not desirable. I'm still trying to come up with a potential workaround...

@bleskes

Member

bleskes commented Aug 22, 2014

@nik9000 do you use dedicated master nodes? It doesn't look so from the logs but I want to double check.

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

@bleskes no dedicated master nodes.

@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

@bleskes that's what I was thinking - I was digging through places where the cluster state is built from name and they are pretty rare. Still, it'd take me some time to validate that they never get saved.

bleskes added a commit to bleskes/elasticsearch that referenced this issue Aug 22, 2014

[Internal] user node's cluster name as a default for an incoming clus…
…ter state who misses it

ClusterState has a reference to the cluster name since version 1.1.0 (df7474b) . However, if the state was  sent from a master of an older version, this name can be set to null. This is an unexpected and can cause bugs. The bad part is that it will never correct it self until a full cluster restart where the cluster state is rebuilt using the code of the latest version.

 This commit changes the default to the node's cluster name.

 Relates to elastic#7386
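
The behaviour the commit message describes can be modelled in a few lines. This is a toy sketch, not the code from the commit: it treats the cluster name as a plain string, and the class and method names are hypothetical. The example cluster name is taken from the data path quoted at the top of this issue.

```java
/** Toy model of the null cluster name problem and the defaulting fix. */
class ClusterNameDefault {

    /** Pre-fix behaviour as described in the thread: a missing name stays null. */
    static String readNameOld(String nameFromWire) {
        return nameFromWire;
    }

    /** Behaviour described by the commit message: fall back to the node's own cluster name. */
    static String readNameFixed(String nameFromWire, String localClusterName) {
        return nameFromWire != null ? nameFromWire : localClusterName;
    }

    public static void main(String[] args) {
        String local = "production-search-eqiad";

        // A state published by an old master carries no name, so it arrives as null.
        String old = null;
        String fixed = null;

        // Each rolling restart hands the in-memory state on to the next node.
        for (int hop = 0; hop < 3; hop++) {
            old = readNameOld(old);
            fixed = readNameFixed(fixed, local);
        }
        System.out.println("without fix: " + old);   // null survives every hop
        System.out.println("with fix:    " + fixed); // repaired on the first hop
    }
}
```

The loop shows why only a full cluster restart cleared the problem before the fix: nothing in the rolling path ever replaced the null.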
@nik9000

Contributor Author

nik9000 commented Aug 22, 2014

More posterity: this broke for me because when I started the cluster I was using 1.1.0 and I haven't done a full restart since - only rolling restarts. If you are in that boat - do not upgrade to 1.3 until 1.3.3 is released.

bleskes added a commit that referenced this issue Aug 27, 2014

[Internal] user node's cluster name as a default for an incoming clus…
…ter state who misses it

ClusterState has a reference to the cluster name since version 1.1.0 (df7474b) . However, if the state was  sent from a master of an older version, this name can be set to null. This is an unexpected and can cause bugs. The bad part is that it will never correct it self until a full cluster restart where the cluster state is rebuilt using the code of the latest version.

This commit changes the default to the node's cluster name.

Relates to #7386

Closes #7414

@bleskes bleskes added v1.4.0 labels Aug 27, 2014

@bleskes

Member

bleskes commented Aug 27, 2014

I'm going to close this as it is fixed by my change in #7414.

@bleskes bleskes closed this Aug 27, 2014

@nik9000

Contributor Author

nik9000 commented Aug 27, 2014

Thanks!

@clintongormley clintongormley changed the title Upgrade caused shard data to stay on nodes Internal: Upgrade caused shard data to stay on nodes Sep 8, 2014

bleskes added a commit that referenced this issue Sep 8, 2014

[Internal] user node's cluster name as a default for an incoming clus…
…ter state who misses it

@ajhalani

ajhalani commented Sep 21, 2014

Ran into the same issue when upgrading v1.2.2 to v1.3.2. Could you please help by answering:

  • Besides error traces/wasted disk space, does this actually cause search/indexing failures?
  • Until v1.3.3 is released, what is the fix? Will a full cluster restart fix this?
@nik9000

Contributor Author

nik9000 commented Sep 21, 2014

The error I had was caused by not doing a full restart since 1.0.1 or so.
A full cluster restart will fix it. Like turn the whole thing off then back
on again.

You can also fix it by applying the patch for this issue directly to
1.3.2, building a release, and making that node the master, even if it
only stays the master for a minute. That is a bit involved if you aren't
used to building Elasticsearch though.

@ajhalani

ajhalani commented Sep 21, 2014

Thanks Nik. Yes, we have been doing rolling upgrades since v1.0.x, and the issue exploded with the last upgrade from v1.2.2.

Really curious what the impact of leaving v1.3.2 is. So far I only see error traces, but no search/index/alert failures.

Also, I am not sure how we can make an upgraded node master. Is there an option for that?

---- Edit 8:49 PM GMT Time ----
Did a full cluster upgrade, things are back online and green. Don't see the error traces at the moment.

@nik9000

Contributor Author

nik9000 commented Sep 21, 2014

On Sep 21, 2014 11:37 AM, "ajhalani" notifications@github.com wrote:

> Thanks Nik. Yes, we have been doing rolling upgrades since v1.0.x, and the
> issue exploded with the last upgrade from v1.2.2.

Yup, that sounds like this issue then.

> Really curious what the impact of leaving v1.3.2 is. So far I only see
> error traces, but no search/index/alert failures.

You can ignore the errors logged at trace level. The trouble will be that the
disks will get full. You can delete the files that Elasticsearch isn't using
on your own, and it is safe so long as you are right that it isn't using them.
You have to be careful though.

> Also, I am not sure how we can make an upgraded node master. Is there an
> option for that?

That isn't super easy. I can't explain on mobile so it'll have to wait
until Monday. I did it because I'm familiar with the source code. If a
full restart isn't too much of a problem for you, I'd suggest it. If not, ping
here and I can explain on Monday when I'm more awake.



@ajhalani

ajhalani commented Sep 22, 2014

Yeah, don't worry about explaining how to make a node master if it's not a straightforward option. As I said in a later edit, I did a full cluster restart and the issue went away. Thanks again!

@nik9000

Contributor Author

nik9000 commented Sep 22, 2014

Cool! I'm glad it worked for you.

@bleskes I've seen a few people with this issue over the past month - maybe 4. I wonder if it is worth thinking of cutting a 1.3.3 soonish to pick this up?

@kimchy

Member

kimchy commented Sep 22, 2014

@nik9000 yea, we should release 1.3.3 as soon as possible, we were waiting on Lucene 4.9.1, which was released and we pushed it in yesterday. I am still waiting for review on #7811 and a discussion if it makes sense to get it into 1.3.3 as well.

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015

[Internal] user node's cluster name as a default for an incoming clus…
…ter state who misses it
