Internal: Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart #7430
We recently tried to upgrade an ES cluster from 0.2 to 1.3. The actual upgrade worked out fine, but once we restarted the whole cluster, we saw these warnings for all shards (constantly repeating):
When we shut down the cluster a couple of minutes after bringing it up with the new version, we saw this behavior just for the newest index. After about an hour the behavior would be the same for other indexes after a cluster restart.
We found out that the indexes are upgraded, and that on shutdown nearly all segment info (*.si) files are deleted (those which have a corresponding _upgraded.si marker). The surviving .si files seemed not to have been upgraded (at least they don't have those marker files), and their content looks like this or this:
The upgraded ones, by contrast, afterwards contain this kind of information:
We could force the same behavior by triggering an optimize for a given index. By restarting one node at a time and waiting until it had fully rejoined the cluster, we were able to restore the deleted .si files from the other nodes (including the _upgraded.si marker files). Afterwards the .si files were safe and didn't get deleted.
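The rolling-restart recovery described above can be sketched as a small helper that waits for the cluster to report green before moving on to the next node. The endpoint, node names, restart command, and polling interval are all assumptions for illustration:

```shell
#!/bin/sh
# Print the cluster status field; assumes ES listens on localhost:9200.
cluster_status() {
  curl -s 'http://localhost:9200/_cluster/health' | grep -o '"status":"[a-z]*"'
}

# Block until the cluster reports green, polling every 5 seconds.
wait_for_green() {
  until cluster_status | grep -q green; do
    sleep 5
  done
}

# Usage sketch -- restart one node at a time, waiting for green in between
# (node names and the restart command are placeholders):
#   for node in node1 node2 node3; do
#     ssh "$node" 'service elasticsearch restart'
#     wait_for_green
#   done
```

The cluster health API also accepts a `wait_for_status` parameter, which could replace the polling loop.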
To me it looks like either ES or Lucene remembers to delete the _upgraded.si files on VM shutdown, but by accident deletes the actual .si files as well.
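To see which .si files in a shard directory lack their upgrade marker, a check along these lines could be used. The directory layout and the marker naming (`_N_upgraded.si` next to `_N.si`) are assumptions based on the observations above:

```shell
#!/bin/sh
# List .si files in a given index directory that have no corresponding
# _upgraded marker file. Marker naming convention is an assumption.
find_unmarked_si() {
  dir=$1
  for si in "$dir"/_*.si; do
    [ -e "$si" ] || continue                     # glob matched nothing
    case $si in *_upgraded.si) continue ;; esac  # skip the marker files themselves
    marker="${si%.si}_upgraded.si"
    [ -e "$marker" ] || echo "$si"
  done
}
```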
Here is the script to reproduce the issue (just the steps; the script runs slowly and synchronously, but doesn't wait for ES starts...): https://gist.github.com/philnate/cfee1d171022b9eb3b23
Hope this helps
Seems to be all fine. A colleague did some extended testing and everything was good. @s1monw, do you plan to consume Lucene 4.10 in some upcoming release of ES 1.3? That would allow us to upgrade directly to the latest and greatest ES, without migrating first to 1.0 and then to 1.3.
changed the title to "Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart" on Sep 11, 2014
It seems we face this issue when upgrading from 1.1.2 to 1.3.6 (6-node cluster, rolling upgrade). Here is a count from one of the logs:
So most of the errors come from those .si files.
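A per-file count like the one above can be produced from the logs with standard tools; the filename pattern and log format assumed here are illustrative:

```shell
#!/bin/sh
# Count how often each segment-info filename appears in a log file,
# most frequent first. Assumes names like "_4f.si" appear verbatim in
# the error lines.
count_missing_files() {
  grep -o '_[0-9a-z]*\.si' "$1" | sort | uniq -c | sort -rn
}
```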
We also seem to be hitting an issue like
We have > 3000 shards total, so it's hard to say "which phase" or "primary/replica"... I am very curious to know exactly what the steps are
Currently I am attempting to recover from this by setting
So what do we do when this has already happened? Is this error recoverable? Is there a way for us to recreate the missing .si file? Currently it is happening to two of my indices, but only for one shard (out of 5) in each. Even just retaining the data on the other 4 shards would be preferable.
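To narrow down which shard copies are actually affected, the `_cat/shards` output can be filtered down to copies that never reach STARTED. A sketch, assuming the default `_cat/shards` column order (index, shard, prirep, state, ...) and a placeholder host:

```shell
#!/bin/sh
# Filter _cat/shards output (read from stdin) down to shard copies that
# are not in the STARTED state; column 4 is the state column by default.
filter_bad_shards() {
  awk '$4 != "STARTED"'
}

# Usage sketch (host and index name are placeholders):
#   curl -s 'http://localhost:9200/_cat/shards/myindex' | filter_bad_shards
```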