Corrupted shard uncovered during node decommissioning #8827

Closed
bobrik opened this issue Dec 8, 2014 · 6 comments

bobrik commented Dec 8, 2014

I removed a node from allocation by IP, and one shard turned up corrupted at the end of the relocation.
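
The exclusion itself was done with the allocation filtering setting, roughly like this (I didn't keep the exact command, so treat the IP below as a placeholder):

# curl -X PUT 'http://web605:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.exclude._ip": "<ip-of-decommissioned-node>" } }'

Logs from the last restart: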

[2014-12-08 17:29:15,820][INFO ][node                     ] [statistics04] version[1.4.1], pid[96760], build[89d3241/2014-11-26T15:49:29Z]
[2014-12-08 17:29:15,821][INFO ][node                     ] [statistics04] initializing ...
[2014-12-08 17:29:15,832][INFO ][plugins                  ] [statistics04] loaded [cloud-aws], sites []
[2014-12-08 17:29:18,419][INFO ][node                     ] [statistics04] initialized
[2014-12-08 17:29:18,419][INFO ][node                     ] [statistics04] starting ...
[2014-12-08 17:29:18,520][INFO ][transport                ] [statistics04] bound_address {inet[/192.168.1.212:9300]}, publish_address {inet[/192.168.1.212:9300]}
[2014-12-08 17:29:18,527][INFO ][discovery                ] [statistics04] statistics/VJCAl72ETbmulc5OJu5DQA
[2014-12-08 17:29:22,815][INFO ][cluster.service          ] [statistics04] detected_master [statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]], added {[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]],[statistics06][qnY9nSvXQsmvsTWqu_ayNg][web605][inet[/192.168.2.94:9300]],[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]],}, reason: zen-disco-receive(from master [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]])
[2014-12-08 17:29:24,046][INFO ][http                     ] [statistics04] bound_address {inet[/192.168.1.212:9200]}, publish_address {inet[/192.168.1.212:9200]}
[2014-12-08 17:29:24,046][INFO ][node                     ] [statistics04] started
[2014-12-08 17:33:30,360][WARN ][transport                ] [statistics04] Received response for a request that has timed out, sent [246546ms] ago, timed out [216545ms] ago, action [internal:discovery/zen/fd/master_ping], node [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]], id [13]
[2014-12-08 18:44:45,855][INFO ][cluster.service          ] [statistics04] removed {[statistics06][qnY9nSvXQsmvsTWqu_ayNg][web605][inet[/192.168.2.94:9300]],}, reason: zen-disco-receive(from master [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]])
[2014-12-08 18:44:49,445][INFO ][cluster.service          ] [statistics04] added {[statistics06][YOK_20U7Qee-XSasg0J8VA][web605][inet[/192.168.2.94:9300]],}, reason: zen-disco-receive(from master [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]])
[2014-12-08 18:45:30,882][INFO ][discovery.zen            ] [statistics04] master_left [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]], reason [shut_down]
[2014-12-08 18:45:30,964][WARN ][discovery.zen            ] [statistics04] master left (reason = shut_down), current nodes: {[statistics04][VJCAl72ETbmulc5OJu5DQA][web467][inet[/192.168.1.212:9300]],[statistics06][YOK_20U7Qee-XSasg0J8VA][web605][inet[/192.168.2.94:9300]],[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]],}
[2014-12-08 18:45:31,190][INFO ][discovery.zen            ] [statistics04] master_left [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]], reason [transport disconnected]
[2014-12-08 18:45:31,190][INFO ][cluster.service          ] [statistics04] removed {[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]],}, reason: zen-disco-master_failed ([statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]])
[2014-12-08 18:45:40,396][INFO ][cluster.service          ] [statistics04] detected_master [statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]], reason: zen-disco-receive(from master [[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]]])
[2014-12-08 18:45:47,229][INFO ][cluster.service          ] [statistics04] added {[statistics07][q9ghAwFYTPCKyJ26BFaSqw][web606][inet[/192.168.2.95:9300]],}, reason: zen-disco-receive(from master [[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]]])
[2014-12-08 18:46:02,161][INFO ][discovery.zen            ] [statistics04] master_left [[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]]], reason [shut_down]
[2014-12-08 18:46:02,168][INFO ][discovery.zen            ] [statistics04] master_left [[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]]], reason [transport disconnected]
[2014-12-08 18:46:03,815][WARN ][discovery.zen            ] [statistics04] master left (reason = shut_down), current nodes: {[statistics07][q9ghAwFYTPCKyJ26BFaSqw][web606][inet[/192.168.2.95:9300]],[statistics04][VJCAl72ETbmulc5OJu5DQA][web467][inet[/192.168.1.212:9300]],[statistics06][YOK_20U7Qee-XSasg0J8VA][web605][inet[/192.168.2.94:9300]],}
[2014-12-08 18:46:03,815][INFO ][cluster.service          ] [statistics04] removed {[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]],}, reason: zen-disco-master_failed ([statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]])
[2014-12-08 18:46:08,348][INFO ][cluster.service          ] [statistics04] new_master [statistics04][VJCAl72ETbmulc5OJu5DQA][web467][inet[/192.168.1.212:9300]], reason: zen-disco-join (elected_as_master)
[2014-12-08 18:46:33,513][INFO ][cluster.service          ] [statistics04] added {[statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]],}, reason: zen-disco-receive(join from node[[statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]]])
[2014-12-08 18:49:54,615][INFO ][indices.store            ] [statistics04] Failed to open / find files while reading metadata snapshot
[2014-12-08 18:56:51,471][INFO ][cluster.metadata         ] [statistics04] [statistics-not-so-fast-201312] update_mapping [events]
[2014-12-08 19:55:32,217][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2014-12-08 19:55:32,230][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2014-12-08 19:55:33,219][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2014-12-08 19:55:33,230][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2014-12-08 19:55:33,231][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2014-12-08 19:55:33,243][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2014-12-08 22:59:31,798][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28.si], length [472], checksum [1kcwerm], writtenBy [null] checksum mismatch
[2014-12-08 22:59:31,858][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28.fdx], length [774151], checksum [sumxsg], writtenBy [null] checksum mismatch
[2014-12-08 22:59:31,885][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28.fnm], length [3554], checksum [dubiao], writtenBy [null] checksum mismatch
[2014-12-08 22:59:32,294][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28_es090_0.tip], length [3272251], checksum [hv0fef], writtenBy [null] checksum mismatch
[2014-12-08 23:00:55,791][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28_es090_0.blm], length [9517214], checksum [1ro50g1], writtenBy [null] checksum mismatch
[2014-12-08 23:01:08,564][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28_es090_0.doc], length [311172533], checksum [uy3nf4], writtenBy [null] checksum mismatch
[2014-12-08 23:01:13,610][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28_es090_0.tim], length [275377261], checksum [187kzhk], writtenBy [null] checksum mismatch
[2014-12-08 23:01:44,209][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28.fdt], length [726233746], checksum [i2a7yg], writtenBy [null] checksum mismatch
[2014-12-08 23:01:44,413][WARN ][index.engine.internal    ] [statistics04] [statistics-20140918][4] failed engine [corrupt file detected source: [recovery phase 1]]
org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [statistics-20140918][4] Failed to transfer [27] files with total size of [1.8gb]
    at org.elasticsearch.indices.recovery.RecoverySource$1.phase1(RecoverySource.java:276)
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1116)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:654)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:137)
    at org.elasticsearch.indices.recovery.RecoverySource.access$2600(RecoverySource.java:74)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:464)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:450)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    ... 4 more
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=sumxsg actual=z4yixy resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5c352fb4)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=dubiao actual=18t8jnd resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@4594ef9b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=hv0fef actual=15rxc01 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@36405f3e)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1ro50g1 actual=1qivhre resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5085f5c6)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=uy3nf4 actual=uk9ays resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@50854005)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=187kzhk actual=gstvlk resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@52ce09d)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@74aaf669)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
[2014-12-08 23:01:44,634][WARN ][cluster.action.shard     ] [statistics04] [statistics-20140918][4] sending failed shard for [statistics-20140918][4], node[VJCAl72ETbmulc5OJu5DQA], relocating [Ai0OIXsCTgO_YE1MhJLRiQ], [P], s[RELOCATING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[statistics-20140918][4] Failed to transfer [27] files with total size of [1.8gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)]; ]]
[2014-12-08 23:01:44,634][WARN ][cluster.action.shard     ] [statistics04] [statistics-20140918][4] received shard failed for [statistics-20140918][4], node[VJCAl72ETbmulc5OJu5DQA], relocating [Ai0OIXsCTgO_YE1MhJLRiQ], [P], s[RELOCATING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[statistics-20140918][4] Failed to transfer [27] files with total size of [1.8gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)]; ]]
[2014-12-08 23:01:46,441][WARN ][indices.cluster          ] [statistics04] [statistics-20140918][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [statistics-20140918][4] failed to fetch index version after copying it over
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:158)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.index.CorruptIndexException: [statistics-20140918][4] Preexisting corrupted index [corrupted_9CAF-B8ySZSdvpwbnwVAww] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=sumxsg actual=z4yixy resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5c352fb4)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=dubiao actual=18t8jnd resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@4594ef9b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=hv0fef actual=15rxc01 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@36405f3e)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1ro50g1 actual=1qivhre resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5085f5c6)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=uy3nf4 actual=uk9ays resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@50854005)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=187kzhk actual=gstvlk resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@52ce09d)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]

... and so on.

"Maybe that's just a glitch that could disappear with restart" – was my first thought.

[2014-12-08 23:12:36,360][INFO ][node                     ] [statistics04] stopping ...
[2014-12-08 23:12:36,423][INFO ][node                     ] [statistics04] stopped
[2014-12-08 23:12:36,423][INFO ][node                     ] [statistics04] closing ...
[2014-12-08 23:12:36,455][INFO ][node                     ] [statistics04] closed
[2014-12-08 23:12:45,207][INFO ][node                     ] [statistics04] version[1.4.1], pid[84384], build[89d3241/2014-11-26T15:49:29Z]
[2014-12-08 23:12:45,207][INFO ][node                     ] [statistics04] initializing ...
[2014-12-08 23:12:45,338][INFO ][plugins                  ] [statistics04] loaded [cloud-aws], sites []
[2014-12-08 23:12:49,383][INFO ][node                     ] [statistics04] initialized
[2014-12-08 23:12:49,384][INFO ][node                     ] [statistics04] starting ...
[2014-12-08 23:12:49,479][INFO ][transport                ] [statistics04] bound_address {inet[/192.168.1.212:9300]}, publish_address {inet[/192.168.1.212:9300]}
[2014-12-08 23:12:49,501][INFO ][discovery                ] [statistics04] statistics/xU57iGQbRuuZUB3xyvB-LA
[2014-12-08 23:12:52,669][INFO ][cluster.service          ] [statistics04] detected_master [statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]], added {[statistics07][q9ghAwFYTPCKyJ26BFaSqw][web606][inet[/192.168.2.95:9300]],[statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]],[statistics06][YOK_20U7Qee-XSasg0J8VA][web605][inet[/192.168.2.94:9300]],}, reason: zen-disco-receive(from master [[statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]]])
[2014-12-08 23:12:53,969][INFO ][http                     ] [statistics04] bound_address {inet[/192.168.1.212:9200]}, publish_address {inet[/192.168.1.212:9200]}
[2014-12-08 23:12:53,969][INFO ][node                     ] [statistics04] started
[2014-12-08 23:13:00,485][WARN ][indices.cluster          ] [statistics04] [statistics-20140918][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [statistics-20140918][4] failed to fetch index version after copying it over
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:158)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.index.CorruptIndexException: [statistics-20140918][4] Preexisting corrupted index [corrupted_9CAF-B8ySZSdvpwbnwVAww] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=sumxsg actual=z4yixy resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5c352fb4)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=dubiao actual=18t8jnd resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@4594ef9b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=hv0fef actual=15rxc01 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@36405f3e)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1ro50g1 actual=1qivhre resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5085f5c6)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=uy3nf4 actual=uk9ays resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@50854005)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=187kzhk actual=gstvlk resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@52ce09d)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@74aaf669)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)

    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:452)
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:433)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:120)
    ... 4 more
[2014-12-08 23:13:00,492][WARN ][cluster.action.shard     ] [statistics04] [statistics-20140918][4] sending failed shard for [statistics-20140918][4], node[xU57iGQbRuuZUB3xyvB-LA], [P], s[INITIALIZING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[statistics-20140918][4] failed to fetch index version after copying it over]; nested: CorruptIndexException[[statistics-20140918][4] Preexisting corrupted index [corrupted_9CAF-B8ySZSdvpwbnwVAww] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=sumxsg actual=z4yixy resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5c352fb4)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=dubiao actual=18t8jnd resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@4594ef9b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=hv0fef actual=15rxc01 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@36405f3e)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1ro50g1 actual=1qivhre resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5085f5c6)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]

... and so on.

"Gee, good job on making backups, myself!" — was my second thought.

# curl -X POST 'http://web605:9200/_snapshot/ceph_s3/statistics-2014-12-07/_restore?wait_for_completion=true&pretty' -d '{ "indices": "statistics-20140918", "include_global_state": false }'
{
  "snapshot" : {
    "snapshot" : "statistics-2014-12-07",
    "indices" : [ "statistics-20140918" ],
    "shards" : {
      "total" : 5,
      "failed" : 1,
      "successful" : 4
    }
  }
}

older snapshot:

# curl -X POST 'http://web605:9200/_snapshot/ceph_s3/statistics-2014-11-27/_restore?wait_for_completion=true&pretty' -d '{ "indices": "statistics-20140918", "include_global_state": false }'
{
  "snapshot" : {
    "snapshot" : "statistics-2014-11-27",
    "indices" : [ "statistics-20140918" ],
    "shards" : {
      "total" : 5,
      "failed" : 1,
      "successful" : 4
    }
  }
}

And in the logs, as usual:

[2014-12-08 23:37:41,567][WARN ][cluster.action.shard     ] [statistics08] [statistics-20140918][4] sending failed shard for [statistics-20140918][4], node[Ai0OIXsCTgO_YE1MhJLRiQ], [P], restoring[ceph_s3:statistics-2014-11-27], s[INITIALIZING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[statistics-20140918][4] failed recovery]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] restore failed]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] failed to restore snapshot [statistics-2014-11-27]]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] Can't restore corrupted shard]; nested: CorruptIndexException[[statistics-20140918][4] Preexisting corrupted index [corrupted_iEZPcPv2QT21ve_TEv2S5A] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@30eef865)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@30eef865)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:843)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
]; ]]
[2014-12-08 23:37:41,567][WARN ][cluster.action.shard     ] [statistics08] [statistics-20140918][4] received shard failed for [statistics-20140918][4], node[Ai0OIXsCTgO_YE1MhJLRiQ], [P], restoring[ceph_s3:statistics-2014-11-27], s[INITIALIZING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[statistics-20140918][4] failed recovery]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] restore failed]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] failed to restore snapshot [statistics-2014-11-27]]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] Can't restore corrupted shard]; nested: CorruptIndexException[[statistics-20140918][4] Preexisting corrupted index [corrupted_iEZPcPv2QT21ve_TEv2S5A] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@30eef865)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@30eef865)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:843)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
]; ]]
[2014-12-08 23:37:42,078][WARN ][cluster.action.shard     ] [statistics08] [statistics-20140918][4] received shard failed for [statistics-20140918][4], node[YOK_20U7Qee-XSasg0J8VA], [P], restoring[ceph_s3:statistics-2014-11-27], s[INITIALIZING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[statistics-20140918][4] failed recovery]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] restore failed]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] failed to restore snapshot [statistics-2014-11-27]]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] Can't restore corrupted shard]; nested: CorruptIndexException[[statistics-20140918][4] Preexisting corrupted index [corrupted_DiMHFSlxQCakLjudKt_TaQ] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1ee1a7fe)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1ee1a7fe)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:843)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)

Well, maybe it wasn't such a great decision to remove replicas for old indices.

I thought checksums were supposed to be verified during backups. If you have a live replica at the time of a backup, you can at least start recovering early. If you removed that healthy (?) replica after making the snapshot, you're doomed.

I'm using 1.4.1 with the AWS plugin for S3 snapshots on Ceph (which has checksums too).

Is there a way to "fix" a failed shard by removing its data? If I cannot recover the shard, I at least want my cluster to be green.
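
The only thing I can think of is force-allocating an empty primary with the reroute API, something like the sketch below, but I'm not sure that's the intended way, and it obviously throws away whatever is left of the shard (target node picked arbitrarily):

# curl -X POST 'http://web605:9200/_cluster/reroute' -d '{ "commands": [ { "allocate": { "index": "statistics-20140918", "shard": 4, "node": "statistics04", "allow_primary": true } } ] }'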

cc @imotov

bobrik commented Dec 8, 2014

Found some info about cluster status during shard movement:

epoch      timestamp cluster    status node.total node.data shards  pri relo init unassign
1418059931 21:32:11  statistics green           4         4   2347 2231    6    0        0
epoch      timestamp cluster    status node.total node.data shards  pri relo init unassign
1418065863 23:11:03  statistics red             4         4   2346 2230    0    0        1

Shard relocation started ~19:00:

epoch      timestamp cluster    status node.total node.data shards  pri relo init unassign
1418050789 18:59:49  statistics green           4         4   2347 2231    0    0        0
epoch      timestamp cluster    status node.total node.data shards  pri relo init unassign
1418050845 19:00:45  statistics green           4         4   2347 2231    6    0        0

bobrik commented Dec 8, 2014

The shard is currently trying to restore, loading at least one core to 100%:

curl http://web607:9200/_nodes/_local/hot_threads
::: [statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]]

   92.2% (461.1ms out of 500ms) cpu usage by thread 'elasticsearch[statistics08][clusterService#updateTask][T#1]'
     5/10 snapshots sharing following 15 elements
       org.elasticsearch.cluster.routing.IndexShardRoutingTable.shardsWithState(IndexShardRoutingTable.java:515)
       org.elasticsearch.cluster.routing.IndexRoutingTable.shardsWithState(IndexRoutingTable.java:268)
       org.elasticsearch.cluster.routing.RoutingTable.shardsWithState(RoutingTable.java:114)
       org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.sizeOfRelocatingShards(DiskThresholdDecider.java:225)
       org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.canRemain(DiskThresholdDecider.java:434)
       org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canRemain(AllocationDeciders.java:105)
       org.elasticsearch.cluster.routing.allocation.AllocationService.moveShards(AllocationService.java:257)
       org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:223)
       org.elasticsearch.cluster.routing.allocation.AllocationService.applyFailedShards(AllocationService.java:113)
       org.elasticsearch.cluster.action.shard.ShardStateAction$3.execute(ShardStateAction.java:183)
       org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:329)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:722)
     2/10 snapshots sharing following 20 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
       java.util.concurrent.CountDownLatch.await(CountDownLatch.java:282)
       org.elasticsearch.discovery.BlockingClusterStatePublishResponseHandler.awaitAllNodes(BlockingClusterStatePublishResponseHandler.java:58)
       org.elasticsearch.discovery.zen.publish.PublishClusterStateAction.publish(PublishClusterStateAction.java:153)
       org.elasticsearch.discovery.zen.publish.PublishClusterStateAction.publish(PublishClusterStateAction.java:86)
       org.elasticsearch.discovery.zen.ZenDiscovery.publish(ZenDiscovery.java:318)
       sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
       sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       java.lang.reflect.Method.invoke(Method.java:601)
       org.elasticsearch.common.inject.internal.ConstructionContext$DelegatingInvocationHandler.invoke(ConstructionContext.java:110)
       com.sun.proxy.$Proxy12.publish(Unknown Source)
       org.elasticsearch.discovery.DiscoveryService.publish(DiscoveryService.java:137)
       org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:423)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:722)
     3/10 snapshots sharing following 4 elements
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:722)

    9.3% (46.7ms out of 500ms) cpu usage by thread 'elasticsearch[statistics08][bulk][T#8]'
     10/10 snapshots sharing following 10 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
       java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:735)
       java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:644)
       java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1137)
       org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
       java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:722)

    3.2% (16ms out of 500ms) cpu usage by thread 'elasticsearch[statistics08][bulk][T#9]'
     10/10 snapshots sharing following 10 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
       java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:735)
       java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:644)
       java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1137)
       org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
       java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:722)

bobrik commented Dec 8, 2014

This index was actually fully scrolled with Spark a couple of days ago and there were no issues. I have logs from this node going back a year (since 0.90.5); I hope that could help with the investigation.
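
By "fully scrolled" I mean reading every document via scan/scroll, roughly like this (scroll and size values here are just illustrative), repeating the second call with the returned _scroll_id until no more hits come back:

curl 'http://web607:9200/statistics-20140918/_search?search_type=scan&scroll=5m&size=500' -d '{"query": {"match_all": {}}}'
curl 'http://web607:9200/_search/scroll?scroll=5m' -d '<scroll_id from the previous response>'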

@clintongormley

Sorry, this issue got lost. Given that it is from a year ago, I assume you've resolved it already. So much has changed since then that there's no point in investigating further.

bobrik commented Nov 23, 2015

Yeah, fixed that by changing company and country, thanks!

@clintongormley

:D
