Corrupted shard uncovered during node decommissioning #8827

Closed
bobrik opened this issue Dec 8, 2014 · 6 comments

bobrik commented Dec 8, 2014

I removed a node from allocation by IP, and one shard turned up corrupted at the end of the relocation.
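
The exclusion itself was done with the allocation filtering setting, roughly like this (I didn't keep the exact command, so treat the IP below as a placeholder):

# curl -X PUT 'http://web605:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.exclude._ip": "<ip-of-decommissioned-node>" } }'

Logs from the last restart: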

[2014-12-08 17:29:15,820][INFO ][node                     ] [statistics04] version[1.4.1], pid[96760], build[89d3241/2014-11-26T15:49:29Z]
[2014-12-08 17:29:15,821][INFO ][node                     ] [statistics04] initializing ...
[2014-12-08 17:29:15,832][INFO ][plugins                  ] [statistics04] loaded [cloud-aws], sites []
[2014-12-08 17:29:18,419][INFO ][node                     ] [statistics04] initialized
[2014-12-08 17:29:18,419][INFO ][node                     ] [statistics04] starting ...
[2014-12-08 17:29:18,520][INFO ][transport                ] [statistics04] bound_address {inet[/192.168.1.212:9300]}, publish_address {inet[/192.168.1.212:9300]}
[2014-12-08 17:29:18,527][INFO ][discovery                ] [statistics04] statistics/VJCAl72ETbmulc5OJu5DQA
[2014-12-08 17:29:22,815][INFO ][cluster.service          ] [statistics04] detected_master [statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]], added {[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]],[statistics06][qnY9nSvXQsmvsTWqu_ayNg][web605][inet[/192.168.2.94:9300]],[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]],}, reason: zen-disco-receive(from master [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]])
[2014-12-08 17:29:24,046][INFO ][http                     ] [statistics04] bound_address {inet[/192.168.1.212:9200]}, publish_address {inet[/192.168.1.212:9200]}
[2014-12-08 17:29:24,046][INFO ][node                     ] [statistics04] started
[2014-12-08 17:33:30,360][WARN ][transport                ] [statistics04] Received response for a request that has timed out, sent [246546ms] ago, timed out [216545ms] ago, action [internal:discovery/zen/fd/master_ping], node [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]], id [13]
[2014-12-08 18:44:45,855][INFO ][cluster.service          ] [statistics04] removed {[statistics06][qnY9nSvXQsmvsTWqu_ayNg][web605][inet[/192.168.2.94:9300]],}, reason: zen-disco-receive(from master [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]])
[2014-12-08 18:44:49,445][INFO ][cluster.service          ] [statistics04] added {[statistics06][YOK_20U7Qee-XSasg0J8VA][web605][inet[/192.168.2.94:9300]],}, reason: zen-disco-receive(from master [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]])
[2014-12-08 18:45:30,882][INFO ][discovery.zen            ] [statistics04] master_left [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]], reason [shut_down]
[2014-12-08 18:45:30,964][WARN ][discovery.zen            ] [statistics04] master left (reason = shut_down), current nodes: {[statistics04][VJCAl72ETbmulc5OJu5DQA][web467][inet[/192.168.1.212:9300]],[statistics06][YOK_20U7Qee-XSasg0J8VA][web605][inet[/192.168.2.94:9300]],[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]],}
[2014-12-08 18:45:31,190][INFO ][discovery.zen            ] [statistics04] master_left [[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]]], reason [transport disconnected]
[2014-12-08 18:45:31,190][INFO ][cluster.service          ] [statistics04] removed {[statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]],}, reason: zen-disco-master_failed ([statistics07][EZ1aCHx5RNO1xNEUtNY5YQ][web606][inet[/192.168.2.95:9300]])
[2014-12-08 18:45:40,396][INFO ][cluster.service          ] [statistics04] detected_master [statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]], reason: zen-disco-receive(from master [[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]]])
[2014-12-08 18:45:47,229][INFO ][cluster.service          ] [statistics04] added {[statistics07][q9ghAwFYTPCKyJ26BFaSqw][web606][inet[/192.168.2.95:9300]],}, reason: zen-disco-receive(from master [[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]]])
[2014-12-08 18:46:02,161][INFO ][discovery.zen            ] [statistics04] master_left [[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]]], reason [shut_down]
[2014-12-08 18:46:02,168][INFO ][discovery.zen            ] [statistics04] master_left [[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]]], reason [transport disconnected]
[2014-12-08 18:46:03,815][WARN ][discovery.zen            ] [statistics04] master left (reason = shut_down), current nodes: {[statistics07][q9ghAwFYTPCKyJ26BFaSqw][web606][inet[/192.168.2.95:9300]],[statistics04][VJCAl72ETbmulc5OJu5DQA][web467][inet[/192.168.1.212:9300]],[statistics06][YOK_20U7Qee-XSasg0J8VA][web605][inet[/192.168.2.94:9300]],}
[2014-12-08 18:46:03,815][INFO ][cluster.service          ] [statistics04] removed {[statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]],}, reason: zen-disco-master_failed ([statistics08][QnMcrdd0SxWA4zTMEZtdrQ][web607][inet[/192.168.2.96:9300]])
[2014-12-08 18:46:08,348][INFO ][cluster.service          ] [statistics04] new_master [statistics04][VJCAl72ETbmulc5OJu5DQA][web467][inet[/192.168.1.212:9300]], reason: zen-disco-join (elected_as_master)
[2014-12-08 18:46:33,513][INFO ][cluster.service          ] [statistics04] added {[statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]],}, reason: zen-disco-receive(join from node[[statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]]])
[2014-12-08 18:49:54,615][INFO ][indices.store            ] [statistics04] Failed to open / find files while reading metadata snapshot
[2014-12-08 18:56:51,471][INFO ][cluster.metadata         ] [statistics04] [statistics-not-so-fast-201312] update_mapping [events]
[2014-12-08 19:55:32,217][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2014-12-08 19:55:32,230][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2014-12-08 19:55:33,219][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2014-12-08 19:55:33,230][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2014-12-08 19:55:33,231][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2014-12-08 19:55:33,243][INFO ][index.engine.internal    ] [statistics04] [statistics-not-so-fast-201312][2] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2014-12-08 22:59:31,798][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28.si], length [472], checksum [1kcwerm], writtenBy [null] checksum mismatch
[2014-12-08 22:59:31,858][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28.fdx], length [774151], checksum [sumxsg], writtenBy [null] checksum mismatch
[2014-12-08 22:59:31,885][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28.fnm], length [3554], checksum [dubiao], writtenBy [null] checksum mismatch
[2014-12-08 22:59:32,294][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28_es090_0.tip], length [3272251], checksum [hv0fef], writtenBy [null] checksum mismatch
[2014-12-08 23:00:55,791][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28_es090_0.blm], length [9517214], checksum [1ro50g1], writtenBy [null] checksum mismatch
[2014-12-08 23:01:08,564][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28_es090_0.doc], length [311172533], checksum [uy3nf4], writtenBy [null] checksum mismatch
[2014-12-08 23:01:13,610][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28_es090_0.tim], length [275377261], checksum [187kzhk], writtenBy [null] checksum mismatch
[2014-12-08 23:01:44,209][WARN ][indices.recovery         ] [statistics04] [statistics-20140918][4] Corrupted file detected name [_u28.fdt], length [726233746], checksum [i2a7yg], writtenBy [null] checksum mismatch
[2014-12-08 23:01:44,413][WARN ][index.engine.internal    ] [statistics04] [statistics-20140918][4] failed engine [corrupt file detected source: [recovery phase 1]]
org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [statistics-20140918][4] Failed to transfer [27] files with total size of [1.8gb]
    at org.elasticsearch.indices.recovery.RecoverySource$1.phase1(RecoverySource.java:276)
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1116)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:654)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:137)
    at org.elasticsearch.indices.recovery.RecoverySource.access$2600(RecoverySource.java:74)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:464)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:450)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    ... 4 more
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=sumxsg actual=z4yixy resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5c352fb4)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=dubiao actual=18t8jnd resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@4594ef9b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=hv0fef actual=15rxc01 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@36405f3e)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1ro50g1 actual=1qivhre resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5085f5c6)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=uy3nf4 actual=uk9ays resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@50854005)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=187kzhk actual=gstvlk resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@52ce09d)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@74aaf669)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
[2014-12-08 23:01:44,634][WARN ][cluster.action.shard     ] [statistics04] [statistics-20140918][4] sending failed shard for [statistics-20140918][4], node[VJCAl72ETbmulc5OJu5DQA], relocating [Ai0OIXsCTgO_YE1MhJLRiQ], [P], s[RELOCATING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[statistics-20140918][4] Failed to transfer [27] files with total size of [1.8gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)]; ]]
[2014-12-08 23:01:44,634][WARN ][cluster.action.shard     ] [statistics04] [statistics-20140918][4] received shard failed for [statistics-20140918][4], node[VJCAl72ETbmulc5OJu5DQA], relocating [Ai0OIXsCTgO_YE1MhJLRiQ], [P], s[RELOCATING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[statistics-20140918][4] Failed to transfer [27] files with total size of [1.8gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)]; ]]
[2014-12-08 23:01:46,441][WARN ][indices.cluster          ] [statistics04] [statistics-20140918][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [statistics-20140918][4] failed to fetch index version after copying it over
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:158)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.index.CorruptIndexException: [statistics-20140918][4] Preexisting corrupted index [corrupted_9CAF-B8ySZSdvpwbnwVAww] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=sumxsg actual=z4yixy resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5c352fb4)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=dubiao actual=18t8jnd resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@4594ef9b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=hv0fef actual=15rxc01 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@36405f3e)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1ro50g1 actual=1qivhre resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5085f5c6)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=uy3nf4 actual=uk9ays resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@50854005)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=187kzhk actual=gstvlk resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@52ce09d)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]

... and so on.

"Maybe that's just a glitch that could disappear with restart" – was my first thought.

[2014-12-08 23:12:36,360][INFO ][node                     ] [statistics04] stopping ...
[2014-12-08 23:12:36,423][INFO ][node                     ] [statistics04] stopped
[2014-12-08 23:12:36,423][INFO ][node                     ] [statistics04] closing ...
[2014-12-08 23:12:36,455][INFO ][node                     ] [statistics04] closed
[2014-12-08 23:12:45,207][INFO ][node                     ] [statistics04] version[1.4.1], pid[84384], build[89d3241/2014-11-26T15:49:29Z]
[2014-12-08 23:12:45,207][INFO ][node                     ] [statistics04] initializing ...
[2014-12-08 23:12:45,338][INFO ][plugins                  ] [statistics04] loaded [cloud-aws], sites []
[2014-12-08 23:12:49,383][INFO ][node                     ] [statistics04] initialized
[2014-12-08 23:12:49,384][INFO ][node                     ] [statistics04] starting ...
[2014-12-08 23:12:49,479][INFO ][transport                ] [statistics04] bound_address {inet[/192.168.1.212:9300]}, publish_address {inet[/192.168.1.212:9300]}
[2014-12-08 23:12:49,501][INFO ][discovery                ] [statistics04] statistics/xU57iGQbRuuZUB3xyvB-LA
[2014-12-08 23:12:52,669][INFO ][cluster.service          ] [statistics04] detected_master [statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]], added {[statistics07][q9ghAwFYTPCKyJ26BFaSqw][web606][inet[/192.168.2.95:9300]],[statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]],[statistics06][YOK_20U7Qee-XSasg0J8VA][web605][inet[/192.168.2.94:9300]],}, reason: zen-disco-receive(from master [[statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]]])
[2014-12-08 23:12:53,969][INFO ][http                     ] [statistics04] bound_address {inet[/192.168.1.212:9200]}, publish_address {inet[/192.168.1.212:9200]}
[2014-12-08 23:12:53,969][INFO ][node                     ] [statistics04] started
[2014-12-08 23:13:00,485][WARN ][indices.cluster          ] [statistics04] [statistics-20140918][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [statistics-20140918][4] failed to fetch index version after copying it over
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:158)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.index.CorruptIndexException: [statistics-20140918][4] Preexisting corrupted index [corrupted_9CAF-B8ySZSdvpwbnwVAww] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=sumxsg actual=z4yixy resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5c352fb4)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=dubiao actual=18t8jnd resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@4594ef9b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=hv0fef actual=15rxc01 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@36405f3e)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1ro50g1 actual=1qivhre resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5085f5c6)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=uy3nf4 actual=uk9ays resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@50854005)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=187kzhk actual=gstvlk resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@52ce09d)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@74aaf669)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)

    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:452)
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:433)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:120)
    ... 4 more
[2014-12-08 23:13:00,492][WARN ][cluster.action.shard     ] [statistics04] [statistics-20140918][4] sending failed shard for [statistics-20140918][4], node[xU57iGQbRuuZUB3xyvB-LA], [P], s[INITIALIZING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[statistics-20140918][4] failed to fetch index version after copying it over]; nested: CorruptIndexException[[statistics-20140918][4] Preexisting corrupted index [corrupted_9CAF-B8ySZSdvpwbnwVAww] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1kcwerm actual=n2dftw resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@b0037f7)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=sumxsg actual=z4yixy resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5c352fb4)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=dubiao actual=18t8jnd resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@4594ef9b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=hv0fef actual=15rxc01 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@36405f3e)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1ro50g1 actual=1qivhre resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@5085f5c6)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [statistics08][inet[/192.168.2.96:9300]][internal:index/shard/recovery/file_chunk]

... and so on.

"Gee, good job on making backups, myself!" — was my second thought.

# curl -X POST 'http://web605:9200/_snapshot/ceph_s3/statistics-2014-12-07/_restore?wait_for_completion=true&pretty' -d '{ "indices": "statistics-20140918", "include_global_state": false }'
{
  "snapshot" : {
    "snapshot" : "statistics-2014-12-07",
    "indices" : [ "statistics-20140918" ],
    "shards" : {
      "total" : 5,
      "failed" : 1,
      "successful" : 4
    }
  }
}

older snapshot:

# curl -X POST 'http://web605:9200/_snapshot/ceph_s3/statistics-2014-11-27/_restore?wait_for_completion=true&pretty' -d '{ "indices": "statistics-20140918", "include_global_state": false }'
{
  "snapshot" : {
    "snapshot" : "statistics-2014-11-27",
    "indices" : [ "statistics-20140918" ],
    "shards" : {
      "total" : 5,
      "failed" : 1,
      "successful" : 4
    }
  }
}

And in the logs, as usual:

[2014-12-08 23:37:41,567][WARN ][cluster.action.shard     ] [statistics08] [statistics-20140918][4] sending failed shard for [statistics-20140918][4], node[Ai0OIXsCTgO_YE1MhJLRiQ], [P], restoring[ceph_s3:statistics-2014-11-27], s[INITIALIZING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[statistics-20140918][4] failed recovery]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] restore failed]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] failed to restore snapshot [statistics-2014-11-27]]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] Can't restore corrupted shard]; nested: CorruptIndexException[[statistics-20140918][4] Preexisting corrupted index [corrupted_iEZPcPv2QT21ve_TEv2S5A] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@30eef865)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@30eef865)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:843)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
]; ]]
[2014-12-08 23:37:41,567][WARN ][cluster.action.shard     ] [statistics08] [statistics-20140918][4] received shard failed for [statistics-20140918][4], node[Ai0OIXsCTgO_YE1MhJLRiQ], [P], restoring[ceph_s3:statistics-2014-11-27], s[INITIALIZING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[statistics-20140918][4] failed recovery]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] restore failed]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] failed to restore snapshot [statistics-2014-11-27]]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] Can't restore corrupted shard]; nested: CorruptIndexException[[statistics-20140918][4] Preexisting corrupted index [corrupted_iEZPcPv2QT21ve_TEv2S5A] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@30eef865)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@30eef865)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:843)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
]; ]]
[2014-12-08 23:37:42,078][WARN ][cluster.action.shard     ] [statistics08] [statistics-20140918][4] received shard failed for [statistics-20140918][4], node[YOK_20U7Qee-XSasg0J8VA], [P], restoring[ceph_s3:statistics-2014-11-27], s[INITIALIZING], indexUUID [MgvyngJJQaCtGYfnqKOaZA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[statistics-20140918][4] failed recovery]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] restore failed]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] failed to restore snapshot [statistics-2014-11-27]]; nested: IndexShardRestoreFailedException[[statistics-20140918][4] Can't restore corrupted shard]; nested: CorruptIndexException[[statistics-20140918][4] Preexisting corrupted index [corrupted_DiMHFSlxQCakLjudKt_TaQ] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1ee1a7fe)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=i2a7yg actual=16u6g09 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1ee1a7fe)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:843)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)

Well, maybe it wasn't such a great decision to remove replicas for old indices.

I thought checksums were supposed to be verified during backups. If you have a live replica at the time of a backup, you can at least start recovering early. If you removed that healthy (?) replica after making the snapshot, you're doomed.

I'm using 1.4.1 with the AWS plugin for S3 snapshots on Ceph (which has checksums too).

Is there a way to "fix" a failed shard by removing its data? If I cannot recover the shard, I at least want my cluster to be green.
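
The only thing I can think of is force-allocating an empty primary with the reroute API, something like the sketch below, but I'm not sure that's the intended way, and it obviously throws away whatever is left of the shard (target node picked arbitrarily):

# curl -X POST 'http://web605:9200/_cluster/reroute' -d '{ "commands": [ { "allocate": { "index": "statistics-20140918", "shard": 4, "node": "statistics04", "allow_primary": true } } ] }'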

cc @imotov

bobrik commented Dec 8, 2014

Found some info about cluster status during shard movement:

epoch      timestamp cluster    status node.total node.data shards  pri relo init unassign
1418059931 21:32:11  statistics green           4         4   2347 2231    6    0        0
epoch      timestamp cluster    status node.total node.data shards  pri relo init unassign
1418065863 23:11:03  statistics red             4         4   2346 2230    0    0        1

Shard relocation started ~19:00:

epoch      timestamp cluster    status node.total node.data shards  pri relo init unassign
1418050789 18:59:49  statistics green           4         4   2347 2231    0    0        0
epoch      timestamp cluster    status node.total node.data shards  pri relo init unassign
1418050845 19:00:45  statistics green           4         4   2347 2231    6    0        0

bobrik commented Dec 8, 2014

The shard is currently trying to restore, loading at least one core to 100%:

curl http://web607:9200/_nodes/_local/hot_threads
::: [statistics08][Ai0OIXsCTgO_YE1MhJLRiQ][web607][inet[/192.168.2.96:9300]]

   92.2% (461.1ms out of 500ms) cpu usage by thread 'elasticsearch[statistics08][clusterService#updateTask][T#1]'
     5/10 snapshots sharing following 15 elements
       org.elasticsearch.cluster.routing.IndexShardRoutingTable.shardsWithState(IndexShardRoutingTable.java:515)
       org.elasticsearch.cluster.routing.IndexRoutingTable.shardsWithState(IndexRoutingTable.java:268)
       org.elasticsearch.cluster.routing.RoutingTable.shardsWithState(RoutingTable.java:114)
       org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.sizeOfRelocatingShards(DiskThresholdDecider.java:225)
       org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.canRemain(DiskThresholdDecider.java:434)
       org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canRemain(AllocationDeciders.java:105)
       org.elasticsearch.cluster.routing.allocation.AllocationService.moveShards(AllocationService.java:257)
       org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:223)
       org.elasticsearch.cluster.routing.allocation.AllocationService.applyFailedShards(AllocationService.java:113)
       org.elasticsearch.cluster.action.shard.ShardStateAction$3.execute(ShardStateAction.java:183)
       org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:329)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:722)
     2/10 snapshots sharing following 20 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
       java.util.concurrent.CountDownLatch.await(CountDownLatch.java:282)
       org.elasticsearch.discovery.BlockingClusterStatePublishResponseHandler.awaitAllNodes(BlockingClusterStatePublishResponseHandler.java:58)
       org.elasticsearch.discovery.zen.publish.PublishClusterStateAction.publish(PublishClusterStateAction.java:153)
       org.elasticsearch.discovery.zen.publish.PublishClusterStateAction.publish(PublishClusterStateAction.java:86)
       org.elasticsearch.discovery.zen.ZenDiscovery.publish(ZenDiscovery.java:318)
       sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
       sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       java.lang.reflect.Method.invoke(Method.java:601)
       org.elasticsearch.common.inject.internal.ConstructionContext$DelegatingInvocationHandler.invoke(ConstructionContext.java:110)
       com.sun.proxy.$Proxy12.publish(Unknown Source)
       org.elasticsearch.discovery.DiscoveryService.publish(DiscoveryService.java:137)
       org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:423)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:722)
     3/10 snapshots sharing following 4 elements
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:722)

    9.3% (46.7ms out of 500ms) cpu usage by thread 'elasticsearch[statistics08][bulk][T#8]'
     10/10 snapshots sharing following 10 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
       java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:735)
       java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:644)
       java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1137)
       org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
       java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:722)

    3.2% (16ms out of 500ms) cpu usage by thread 'elasticsearch[statistics08][bulk][T#9]'
     10/10 snapshots sharing following 10 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
       java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:735)
       java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:644)
       java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1137)
       org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
       java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:722)

bobrik commented Dec 8, 2014

This index was actually fully scrolled with Spark a couple of days ago and there were no issues. I have logs from this node going back a year (since 0.90.5); I hope that could help with the investigation.
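
By "fully scrolled" I mean reading every document via scan/scroll, roughly like this (scroll and size values here are just illustrative), repeating the second call with the returned _scroll_id until no more hits come back:

curl 'http://web607:9200/statistics-20140918/_search?search_type=scan&scroll=5m&size=500' -d '{"query": {"match_all": {}}}'
curl 'http://web607:9200/_search/scroll?scroll=5m' -d '<scroll_id from the previous response>'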

@clintongormley

Sorry, this issue got lost. Given that it is from a year ago, I assume you've resolved it already. So much has changed since then that there's no point in investigating further.

bobrik commented Nov 23, 2015

Yeah, fixed that by changing company and country, thanks!

@clintongormley

:D
