
Internal: Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart #7430

Closed
philnate opened this Issue Aug 25, 2014 · 13 comments


philnate commented Aug 25, 2014

We recently tried to upgrade an ES cluster from 0.2 to 1.3. The actual upgrade worked out fine, but once we restarted the whole cluster, we saw these warnings for all shards (constantly repeating):

IndexShardGatewayRecoveryException[[maki-log-2014-08-21][3] shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock, _checksums-1408609875350, _10f.nrm, segments.gen, _17x.nrm, _10f.tis, _17z.si, _17y.fdt, _17y.tis, _17x.fdx, _10f.frq, _17x.fdt, _17y.tii, _17x.prx, _17y.nrm, _10f.fdx, _10f.fnm, _17x.tis, _10f.tii, _10f.fdt, _17z.cfe, _17x.frq, _17x.tii, segments_j, _10f.prx, _17y.fnm, _17y.fdx, _17y.prx, _17y.frq, _17z.cfs, _17x.fnm]]; nested: FileNotFoundException[No such file [_10f.si]]; ]]
[2014-08-21 13:44:15,826][WARN ][cluster.action.shard     ] [Ghost Dancer] [maki-log-2014-08-21][3] received shard failed for [maki-log-2014-08-21][3], node[QsfMdS40Qve8PukS4er9oA], [P], s[INITIALIZING], indexUUID [_na_], reason [master [Ghost Dancer][QsfMdS40Qve8PukS4er9oA][gboanea-ThinkPad-W520][inet[/10.200.54.63:9300]] marked shard as initializing, but shard is marked as failed, resend shard failure]
[2014-08-21 13:44:15,841][WARN ][cluster.action.shard     ] [Ghost Dancer] [maki-log-2014-08-21][4] received shard failed for [maki-log-2014-08-21][4], node[QsfMdS40Qve8PukS4er9oA], [P], s[INITIALIZING], indexUUID [_na_], reason [master [Ghost Dancer][QsfMdS40Qve8PukS4er9oA][gboanea-ThinkPad-W520][inet[/10.200.54.63:9300]] marked shard as initializing, but shard is marked as failed, resend shard failure]
[2014-08-21 13:44:15,844][WARN ][indices.cluster          ] [Ghost Dancer] [maki-log-2014-08-21][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [maki-log-2014-08-21][1] failed to fetch index version after copying it over
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [maki-log-2014-08-21][1] shard allocated for local recovery (post api), should exist, but doesn't, current files: [_17o.nrm, _zv.fdx, write.lock, _zv.fdt, segments.gen, _17o.fdx, _17o.tis, _17o.fdt, _zv.fnm, _17p.cfe, _zv.prx, _zv.frq, _17n.prx, _17n.frq, _17o.tii, _17n.nrm, _17n.tii, _17o.frq, _checksums-1408609865048, _17p.cfs, segments_j, _17o.fnm, _17o.prx, _17n.tis, _17n.fdx, _17n.fdt, _zv.tis, _zv.nrm, _17n.fnm, _zv.tii, _17p.si]
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:131)
    ... 4 more
Caused by: java.io.FileNotFoundException: No such file [_zv.si]
    at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:173)
    at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:144)
    at org.elasticsearch.index.store.DistributorDirectory.openInput(DistributorDirectory.java:130)
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
    at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:375)
    at org.apache.lucene.codecs.lucene3x.Lucene3xSegmentInfoReader.read(Lucene3xSegmentInfoReader.java:103)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:361)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:457)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:907)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:753)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:453)
    at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:98)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:122)
    ... 4 more

When we shut down the cluster a couple of minutes after bringing it up with the new version, we saw this behavior just for the newest index. After about an hour of uptime, a cluster restart would produce the same behavior for other indices as well.

We found out that the indices are upgraded and that on shutdown nearly all segment info (*.si) files are deleted (those which have a corresponding _upgraded.si marker). The surviving .si files seemed to not have been upgraded (at least they don't have those marker files), and their content looks like one of these:

?�l��Lucene3xSegmentInfo3.6.21�
os.version�2.6.39-300.17.2.el6uek.x86_64�osLinuxlucene.version+3.6.2 1423725 - rmuir - 2012-12-18 19:45:40�sourceflushos.archamd64
                                                                                                                                  java.versio1.7.0_51
         java.vendor�Oracle Corporation
_175.fdt�_175_upgraded.s_175.fdx_175.s_175.fn_175.ti_175.ti_175.nr_175.fr_175.prx%
1408712122907SegmentInfo��4.9�� timestamp
os.version�3.2.0-67-generic�osLinuxlucene.version+4.9.0 1604085 - rmuir - 2014-06-20 06:22:23�sourceflushos.archamd64
                                                                                                                     java.versio1.7.0_65
                                                                                                                                        java.vendor�Oracle Corporation_18n.cf_18n.cfs_18n.si�(����\%

The upgraded .si files, by contrast, contain this kind of information afterwards:

?�l��Lucene3xSegmentInfo3.6.2�2�  
                                        mergeFactor�10
os.version�2.6.39-300.17.2.el6uek.x86_64�osLinuxlucene.version+3.6.2 1423725 - rmuir - 2012-12-18 19:45:40�sourcemergeos.archamd64�mergeMaxNumSegments�-1
             java.versio1.7.0_51
                                java.vendor�Oracle Corporation
_1mx.ti_1mx.fr_1mx.pr_1mx.fd_1mx.nr_1mx.fdt_1mx.si�_1mx_upgraded.s_1mx.fn_1mx.tis%  

We could force the same behavior by triggering an optimize for a given index. By restarting one node at a time and waiting until it had fully rejoined the cluster, we were able to restore the deleted .si files from the other nodes (including the _upgraded.si marker files). Afterwards the .si files were safe and didn't get deleted.

To me it looks like either ES or Lucene remembers to delete the _upgraded.si files on VM shutdown, but accidentally deletes the actual .si files as well.
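
For anyone who wants to check their own shards, here is a minimal sketch (Python; the shard directory layout is an assumption, adjust the path for your setup) that lists which *.si files still have a corresponding *_upgraded.si marker:

import os
import sys

# Point this at a shard's Lucene index directory, e.g.
# <path.data>/<cluster_name>/nodes/0/indices/<index>/<shard>/index
# (the exact layout depends on your configuration).
index_dir = sys.argv[1]

files = set(os.listdir(index_dir))
si_files = [f for f in files if f.endswith(".si") and not f.endswith("_upgraded.si")]

for si in sorted(si_files):
    segment = si[:-len(".si")]            # e.g. "_10f"
    marker = segment + "_upgraded.si"     # e.g. "_10f_upgraded.si"
    status = "has upgrade marker" if marker in files else "no upgrade marker"
    print(si, "->", status)

In our case only the .si files without a marker survived a full cluster shutdown.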


s1monw commented Aug 25, 2014

@philnate let me ask a few questions:

  • did you modify any files on your filesystem?
  • do you have multiple datapaths configured?
  • did you see any other errors in the logs?
  • can you share your node-config?
  • did you change anything in the upgrade process?

s1monw self-assigned this Aug 25, 2014


philnate commented Aug 25, 2014

  • no
  • no
  • no, right at startup this warning is logged over and over
  • I've reproduced the issue with vanilla ES
  • as we were coming from 0.2, we shut down the cluster, updated the binaries, and restarted the cluster; then we stopped it again and after that restart we saw the failure

Here is the script to reproduce the issue (it just runs the steps one after another and doesn't wait for ES to finish starting): https://gist.github.com/philnate/cfee1d171022b9eb3b23
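
Roughly, the flow looks like this (a sketch only, not the gist itself; the install paths, index name, and sleep-based waiting are placeholders/assumptions):

import json
import subprocess
import time
import urllib.request

# Placeholder paths and port -- both versions must point at the same
# path.data in their config for the upgrade scenario to apply.
ES_OLD = "/opt/elasticsearch-0.20.6/bin/elasticsearch"
ES_NEW = "/opt/elasticsearch-1.3.2/bin/elasticsearch"
ES_URL = "http://localhost:9200"

def start(binary):
    proc = subprocess.Popen([binary])
    time.sleep(30)   # crude wait for startup; the real script doesn't wait at all
    return proc

def index_doc():
    req = urllib.request.Request(
        ES_URL + "/maki-log-2014-08-21/doc/1",
        data=json.dumps({"msg": "hello"}).encode(),
        method="PUT")
    urllib.request.urlopen(req)

def shutdown():
    # cluster-wide nodes shutdown API (present in 0.20/1.x, removed later)
    urllib.request.urlopen(ES_URL + "/_shutdown", data=b"")

# 1. index some data on the old version, then shut the cluster down
start(ES_OLD)
index_doc()
shutdown()
# 2. start the new version on the same data, then shut down again
start(ES_NEW)
shutdown()
# 3. start the new version once more -> recovery fails with the missing-*.si errors
start(ES_NEW)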

Hope this helps


philnate commented Aug 26, 2014

Updating from 0.20 to 1.0.3 works. But migrating to any version newer than 1.0.x gives me the described error. So this issue seems to have been introduced with 1.1.0, whether by the newer Lucene version it uses or by ES itself.


mikemccand commented Aug 26, 2014

This is a Lucene issue; I managed to make a standalone test exposing it: https://issues.apache.org/jira/browse/LUCENE-5907


mikemccand commented Aug 26, 2014

The Lucene issue is fixed, and will be included in Lucene 4.10.


philnate commented Aug 27, 2014

@mikemccand thank you for your fast solution. I've patched ES 1.3.2 with the fix and a quick test looked OK. I will do some additional testing.


mikemccand commented Aug 27, 2014

Thanks for testing @philnate


s1monw commented Aug 27, 2014

@philnate thanks so much for verifying!! That is super helpful.


philnate commented Aug 28, 2014

Seems to be all fine. A colleague did some extended testing and everything was good. @s1monw do you plan to pull Lucene 4.10 into an upcoming release of ES 1.3? That would allow us to upgrade directly to the latest and greatest ES instead of migrating first to 1.0 and then to 1.3.


clintongormley commented Sep 6, 2014

@philnate Lucene 4.10 will be in v1.4, but not in v1.3.x

Fixed by #7584

clintongormley changed the title from "Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart" to "Internal: Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart" on Sep 11, 2014

mikemccand added the v1.3.3 label Sep 22, 2014


mikemccand commented Sep 22, 2014

Lucene 4.9.1 is released, and I upgraded 1.3.x to it: c998d8d


cywjackson commented Jan 7, 2015

It seems we hit this issue when upgrading from 1.1.2 to 1.3.6 (6-node cluster, rolling upgrade). Here is a count from one of the logs:

      1 [_gyqx_2k2.del]
      1 [_hcap_2c4.del]
   7951 [_6mef.si]
   9252 [_24gwg.si]
  11295 [_7yti.si]

so most of the errors come from those .si files
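
A tally like that can be produced along these lines (a sketch; the "No such file [...]" pattern and the log path are assumptions, adjust them to whatever error messages your logs contain):

import collections
import re
import sys

# Count how often each missing file name shows up in an ES log.
pattern = re.compile(r"No such file \[([^\]]+)\]")

counts = collections.Counter()
with open(sys.argv[1], errors="replace") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1

for name, count in sorted(counts.items(), key=lambda kv: kv[1]):
    print("%7d [%s]" % (count, name))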

We also seem to be hitting errors like CorruptIndexException[checksum failed ...], "failed to fetch index version after copying it over", and:

[2015-01-07 11:03:59,051 +0000][DEBUG][index.shard.service ] [usw2a-search5-prod] [abid_v3][106] state: [RECOVERING]->[CLOSED], reason [recovery failure [IndexShardGatewayRecoveryException[[abid_v3][106] failed to fetch index version after copying it over]; nested: CorruptIndexException[[abid_v3][106] Corrupted index [corrupted_ySuuD05fT8ae37Z0ABqoMA] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=9j4trc actual=1mqj6pc resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@2d4333ac)]]; ]]

There are corrupted_* files in many places, which feels very similar to #8805, but I am not sure exactly which files to delete for the workaround described there ("We deleted the checksum and the marker file and ES loaded the shards again automatically."). It is very possible, as @s1monw said, that "they might have been corrupted for a long time already but old ES version didn't check this"; we ran ES < 0.90.3 and have been upgrading to 1.0.3 and 1.1.2 (and ran into various issues along the way).

We have > 3000 shards in total, so it's hard to say "which phase" or "primary/replica"... I am very curious to know exactly what steps are behind "We solved the issue on our side by deleting the checksums as well as the corrupted file and updating the indices."
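
In case it helps anyone in the same spot, here is a small sketch (Python; the data path is a placeholder, and the corrupted_* / _checksums-* naming is taken from the messages above and from #8805) that only lists the marker and checksum files per shard directory, so you can at least see which shards are flagged before deciding what, if anything, to delete:

import os
import sys

# Walk an Elasticsearch data directory and report any corruption markers
# and checksum files found under each shard's index directory.
# This only lists them -- it does not delete anything.
data_dir = sys.argv[1]   # e.g. /var/lib/elasticsearch (placeholder)

for root, _dirs, files in os.walk(data_dir):
    markers = [f for f in files if f.startswith("corrupted_")]
    checksums = [f for f in files if f.startswith("_checksums-")]
    if markers or checksums:
        print(root)
        for name in sorted(markers + checksums):
            print("    " + name)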

Currently I am attempting to recover from this by setting index.shard.check_on_startup: fix, but I am not sure it is the right fix, especially with the big warning about data loss in the docs :(


krispyjala commented Feb 20, 2015

So what do we do when this has already happened? Is this error recoverable? Is there a way for us to recreate the missing .si file? Currently it is happening to two of my indices, but just for one shard (out of 5) each. Even being able to retain the data on the other 4 shards would be preferable.
