Index Corruption #7238

Closed
ghost opened this Issue Aug 12, 2014 · 22 comments

6 participants
@ghost

commented Aug 12, 2014

We have two nodes running in our Elasticsearch setup, with the cluster regularly going into yellow state. As far as we have been able to determine, this is usually caused by a shard of one particular index repeatedly becoming unassigned. Restarting the Elasticsearch instance helps for a short while (green state), but the cluster usually turns yellow again within half an hour or so.

Log File Excerpt:

[2014-08-12 13:16:15,183][INFO ][node ] [Natalie Portman] started
[2014-08-12 13:16:18,742][INFO ][gateway ] [Natalie Portman] recovered [19] indices into cluster_state
[2014-08-12 13:16:20,982][INFO ][cluster.service ] [Natalie Portman] added {[Jennifer Lawrence][cTovvVXzSgmKBSgi9AoYaA][ip-XXXXX][inet[/XXXX]],}, reason: zen-disco-receive(join from node[[Jennifer Lawrence][cTovvVXzSgmKBSgi9AoYaA][XXXXX][inet[/XXXXX]]])
[2014-08-12 14:13:50,511][WARN ][index.merge.scheduler ] [Natalie Portman] [fuelup-accounts-revisions][2] failed to merge
java.lang.ArrayIndexOutOfBoundsException: 219
at org.apache.lucene.util.FixedBitSet.set(FixedBitSet.java:256)
at org.apache.lucene.codecs.PostingsConsumer.merge(PostingsConsumer.java:87)
at org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:110)
at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:399)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:112)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4163)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3759)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:106)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
[2014-08-12 14:13:50,515][WARN ][index.engine.internal ] [Natalie Portman] [fuelup-accounts-revisions][2] failed engine [merge exception]
[2014-08-12 14:13:50,784][WARN ][cluster.action.shard ] [Natalie Portman] [fuelup-accounts-revisions][2] sending failed shard for [fuelup-accounts-revisions][2], node[wqRW6lbSQ6yk1B3bxbCF6w], [R], s[STARTED], indexUUID [PMsFEnTWQ1e_uqGFx8yUXw], reason [engine failure, message [merge exception][MergeException[java.lang.ArrayIndexOutOfBoundsException: 219]; nested: ArrayIndexOutOfBoundsException[219]; ]]
[2014-08-12 14:13:50,785][WARN ][cluster.action.shard ] [Natalie Portman] [fuelup-accounts-revisions][2] received shard failed for [fuelup-accounts-revisions][2], node[wqRW6lbSQ6yk1B3bxbCF6w], [R], s[STARTED], indexUUID [PMsFEnTWQ1e_uqGFx8yUXw], reason [engine failure, message [merge exception][MergeException[java.lang.ArrayIndexOutOfBoundsException: 219]; nested: ArrayIndexOutOfBoundsException[219]; ]]
[2014-08-12 14:13:50,888][WARN ][cluster.action.shard ] [Natalie Portman] [fuelup-accounts-revisions][2] received shard failed for [fuelup-accounts-revisions][2], node[cTovvVXzSgmKBSgi9AoYaA], [P], s[STARTED], indexUUID [PMsFEnTWQ1e_uqGFx8yUXw], reason [engine failure, message [merge exception][MergeException[java.lang.ArrayIndexOutOfBoundsException: 219]; nested: ArrayIndexOutOfBoundsException[219]; ]]
[2014-08-12 14:13:50,890][WARN ][cluster.action.shard ] [Natalie Portman] [fuelup-accounts-revisions][2] received shard failed for [fuelup-accounts-revisions][2], node[cTovvVXzSgmKBSgi9AoYaA], [P], s[STARTED], indexUUID [PMsFEnTWQ1e_uqGFx8yUXw], reason [master [Natalie Portman][wqRW6lbSQ6yk1B3bxbCF6w][ip-XXXXX][inet[/XXXXX]] marked shard as started, but shard has not been created, mark shard as failed]
[2014-08-12 14:13:52,371][WARN ][index.merge.scheduler ] [Natalie Portman] [fuelup-accounts-revisions][2] failed to merge

We have verified that the machines have both enough storage and memory, so we are wondering what is going wrong and how we could further debug this. Suggestions are appreciated.

@mikemccand

Contributor

commented Aug 12, 2014

This looks like either index corruption or a Lucene issue. What ES version are you using? Are you able to reproduce this issue from a clean slate / minimal configuration?

@ghost

Author

commented Aug 14, 2014

We are using ES version 1.3.1 (upgraded from the 1.2.1 we were running before). I have investigated further, and indeed it seems two of our indices were somehow corrupted. I still have the corrupted data, so I could analyze it further to see what caused this. The disks containing the indices were not too large (about 8x larger than the data itself, and filled to about 60-80%). I am hypothesizing that too little disk space when synchronizing shards across the machines could have been the cause, but I am not sure that's a realistic assessment. If you have other suggestions I could look into, they would be welcome. I want to rule out that this is an ES (or Lucene) issue.

@mikemccand

Contributor

commented Aug 14, 2014

A full disk should not cause index corruption...

Are you able to reproduce the creation of the corrupted index? Did you run CheckIndex on it (can you post the output if so)?

Also, I'd recommend upgrading to 1.3.2: there was a rare bug in the compression library that ES uses when moving shards around that could introduce corruption.
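For reference, CheckIndex can be run by hand against a single shard's Lucene directory. A minimal sketch, assuming a standard package install; the jar and data paths are assumptions and will differ per setup:

```shell
# Sketch: run Lucene's CheckIndex tool against one shard's index directory.
# Jar and data paths are assumptions -- adjust for your installation.
# Stop the node (or close the index) first: CheckIndex must not run while
# an IndexWriter has the index open.
java -cp /usr/share/elasticsearch/lib/lucene-core-*.jar \
  org.apache.lucene.index.CheckIndex \
  /var/lib/elasticsearch/mycluster/nodes/0/indices/myindex/0/index
```

Without `-fix` this is a read-only check; `-fix` drops unreadable segments (and the documents in them), so it should only be run as a last resort, and on a copy of the data.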

@ghost

Author

commented Sep 1, 2014

Mike, we:

  1. Upgraded our machines to 1.3.2
  2. Put our indices on separate large volumes

We are still getting regular index corruption, particularly on indices with a lot of write operations; see the examples below.

Do you have any suggestions as to the best approach for debugging this? We'd be happy to dig in, but are having some trouble figuring out the best place to start.

Some examples:

[2014-09-01 15:09:44,799][WARN ][cluster.action.shard ] [Jennifer Lawrence] [fuelup-accounts-revisions][2] received shard failed for [fuelup-accounts-revisions][2], node[eFW-40fxQiye18qoDPiJ_Q], [P], s[INITIALIZING], indexUUID [pEYC0ucUSIGun5KgVOoC0A], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[fuelup-accounts-revisions][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[fuelup-accounts-revisions][2] Corrupted index [corrupted_U1FKXejPRfGBoSRivrDxHQ] caused by: CorruptIndexException[codec footer mismatch: actual footer=1952805376 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/esdata/fuelup-testing/nodes/0/indices/fuelup-accounts-revisions/2/index/_a_XBloomFilter_0.smy"))]]; ]]

[2014-09-01 15:04:19,366][WARN ][indices.cluster ] [Jennifer Lawrence] [fuelup-manifests][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [fuelup-manifests][4] failed to fetch index version after copying it over
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.index.CorruptIndexException: [fuelup-manifests][4] Corrupted index [corrupted_jh_x1y9MS1SWKZefSFlVfQ] caused by: CorruptIndexException[codec footer mismatch: actual footer=1852273495 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/esdata/fuelup-testing/nodes/0/indices/fuelup-manifests/4/index/_mi_XBloomFilter_0.smy"))]
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:343)
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:328)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:119)

... 4 more

@mikemccand

Contributor

commented Sep 1, 2014

Do you have any suggestions as to the best approach for debugging this?

It's important to get to the bottom of this, in case you're hitting a new bug here ...

Does this corruption reproduce on a brand new index created with 1.3.2? I can't tell from your responses so far whether you are carrying forward an index from before...

Can you attach the server logs from all nodes that led up to the above exception? It looks like a shard was migrated to another node, but then when that node tried to start the shard, it detected corruption. We need to isolate where that corruption happened, either during indexing or during replication.

Can you zip up the data directory from the node that this shard was replicated from and post it somewhere, if you're allowed to share? I can have a look and see if the corruption is present there.

Are you running any plugins, any external processes, etc.?
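A minimal sketch of archiving one shard's directory for offline inspection, reusing the data path that appears in the logs above (adjust for your configured `path.data`):

```shell
# Sketch: archive a single shard's directory for sharing/inspection.
# The path below is taken from the log excerpts in this thread;
# substitute your own path.data location.
tar czf fuelup-manifests-shard4.tar.gz \
  /esdata/fuelup-testing/nodes/0/indices/fuelup-manifests/4
```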

@mikemccand

Contributor

commented Sep 1, 2014

Also, which CPU, OS, JVM version are you using?

@almer-t


commented Sep 1, 2014

Mike, thanks for your response. As for your questions:
A) It's a brand new index (actually multiple indices) on 1.3.2; the index has custom mappings.
B) We're running on AWS (Ubuntu 14.04 LTS), on a shared Intel Xeon E5-2650 CPU.
C) Java reports (Oracle Java 7):
java version "1.7.0_65"
Java(TM) SE Runtime Environment (build 1.7.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
D) Plugins are "cloud-aws" (2.3.0) and "head"

As for your other questions: I will look into what logs and index data I can share with you tomorrow. We will likely set up a separate environment for replicating the problem. Naturally, I'll share any "recipe" I can put together for reproducing it.

@mikemccand

Contributor

commented Sep 1, 2014

Thanks @almer-t, it's great that you can reproduce from a clean start; hopefully this means we can iterate until we find the root cause. If you can narrow it down to a dedicated environment / particular set of documents / recipe, that would be very helpful.

As far as I know that JVM version is safe.

Have you changed any settings from defaults (e.g. via settings API or in config/elasticsearch.yml)?

@ghost

Author

commented Sep 2, 2014

Mike: I haven't changed the defaults (apart from the number of master nodes, from 1 to 2, but the problem existed before that change, so it's probably not the root cause).

Here's some log information that shows the first signs of trouble:

[2014-09-02 09:07:51,137][WARN ][index.engine.internal ] [Natalie Portman] [fuelup-manifests][3] failed engine [corrupt file detected source: [recovery phase 1]]
[2014-09-02 09:07:51,154][WARN ][cluster.action.shard ] [Natalie Portman] [fuelup-manifests][3] sending failed shard for [fuelup-manifests][3], node[3R-3HR1jQzOWsct8tIlO8w], [P], s[STARTED], indexUUID [xlRTKIXsSq6HO_BHWIA4Fg], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[fuelup-manifests][3] Failed to transfer [0] files with total size of [0b]]; nested: CorruptIndexException[[fuelup-manifests][3] Corrupted index [corrupted_dXHcg1AkQtGuplAHa87jqw] caused by: CorruptIndexException[codec footer mismatch: actual footer=1852273495 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/esdata/fuelup-testing/nodes/0/indices/fuelup-manifests/3/index/_b_XBloomFilter_0.smy"))]]; ]]
...
[2014-09-02 09:07:51,201][WARN ][indices.cluster ] [Natalie Portman] [fuelup-manifests][3] master [[Natalie Portman][3R-3HR1jQzOWsct8tIlO8w][ip-10-0-1-20][inet[/10.0.1.20:9300]]] marked shard as started, but shard has not been created, mark shard as failed
...
[2014-09-02 09:07:51,352][WARN ][indices.cluster ] [Natalie Portman] [fuelup-manifests][3] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [fuelup-manifests][3] failed to fetch index version after copying it over
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.index.CorruptIndexException: [fuelup-manifests][3] Corrupted index [corrupted_dXHcg1AkQtGuplAHa87jqw] caused by: CorruptIndexException[codec footer mismatch: actual footer=1852273495 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/esdata/fuelup-testing/nodes/0/indices/fuelup-manifests/3/index/_b_XBloomFilter_0.smy"))]
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:343)
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:328)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:119)

... 4 more

All but the first message then repeat (particularly the second and the last).

@ghost ghost changed the title from "Failed to Merge" to "Index Corruption" Sep 2, 2014

@ghost

Author

commented Sep 2, 2014

We are currently using three machines. I split this into a 1/2 situation (by renaming the cluster and moving some files around), so I ended up with a cluster A with one machine and a cluster B with two machines. I have tried our read/write operations on both A and B and cannot trivially reproduce the errors above. I have to say, though, that they usually occur "after a while", so we'll have to wait a bit.

I have found that when corruption does occur, its detection can be triggered by:

  1. creating a backup snapshot.
  2. restarting a machine in the cluster.
  3. changing the replication settings of a corrupted index.
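For reference, a minimal sketch of exercising those three triggers on an ES 1.x cluster; the host, snapshot repository name, index name, and replica count are assumptions:

```shell
# Sketch: the three corruption-detection triggers listed above.
# Host, repository, index name, and replica count are assumptions.

# 1. Create a backup snapshot (requires a previously registered repository).
curl -XPUT 'localhost:9200/_snapshot/my_backup/snap_1?wait_for_completion=true'

# 2. Restart a machine in the cluster (detection happens on shard recovery).
sudo service elasticsearch restart

# 3. Change the replica count of a suspect index.
curl -XPUT 'localhost:9200/fuelup-manifests/_settings' -d '{
  "index": { "number_of_replicas": 2 }
}'
```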

Examining the log files from when corruption did occur (yesterday, when it was still a 3-machine cluster), it seems that this message repeats often, with the exact same values:
"CorruptIndexException[codec footer mismatch: actual footer=1852273495 vs expected footer=-1071082520"

Also this is a bit awkward:
"[RecoverFilesRecoveryException[[fuelup-manifests][3] Failed to transfer [0] files with total size of [0b]]"

A wild guess: can it be that the volume where the elasticsearch data is stored is (for a brief instant) disconnected and then reconnected, which may cause this to occur?

Mike: I am also sending you the complete log files (privately).

@mikemccand

Contributor

commented Sep 2, 2014

Thanks @almer-fuel

Can you enable CheckIndex on startup, by adding "index.shard.check_on_startup: true" to your config/elasticsearch.yml? Then restart a node and it should run CheckIndex on all its shards; it runs for a while and then suddenly prints out lots of output (from Lucene's CheckIndex tool)... I'm hoping we can isolate when the corruption happens.

Are you able to get the corruption to happen if you use 1 shard 0 replicas? Just trying to simplify the recipe as much as possible.

I would definitely not expect this kind of corruption if a volume disconnects/reconnects; only if the volume is buggy (flips/loses/adds bits) should you see this kind of corruption.

Does the corruption always happen in a *.smy file?
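A minimal sketch of the two suggestions above (CheckIndex on startup, plus a 1-shard/0-replica index to simplify the recipe), assuming a package install with the config under /etc/elasticsearch and a hypothetical index name:

```shell
# Sketch: enable CheckIndex on node startup, then create a minimal
# 1-shard/0-replica index for reproduction. Paths and the index name
# are assumptions.
echo 'index.shard.check_on_startup: true' | \
  sudo tee -a /etc/elasticsearch/elasticsearch.yml
sudo service elasticsearch restart

curl -XPUT 'localhost:9200/corruption-test' -d '{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}'
```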

@rmuir

Contributor

commented Sep 2, 2014

This is the .smy file from the pulsing codec. It doesn't have a checksum... it's basically incompatible with Elasticsearch (and was just an experiment). The pulsing codec has actually been removed from Lucene, since this optimization has been in the default codec for years. I recommend turning it off.

@ghost

Author

commented Sep 2, 2014

Mike: CheckIndex is set; restarting any of the machines has not yielded a corrupted index yet.

Mike & rmuir:
Hmmm, yes, the corruption always happens in an .smy file. All the corrupted indices have at least one mapping where "postings_format": "bloom_pulsing" is used. I hypothesized this could have been the problem in a mail to Mike.

rmuir:
Would you recommend in general leaving postings_format alone? (I can't seem to find it anymore in the current Elasticsearch documentation.) Also: if this can cause index corruption, it should be disabled.

@mikemccand

Contributor

commented Sep 2, 2014

Hi @almer-fuel hmm I can't find your private email to me ...

The pulsing postings format is being removed from Lucene; what it does (folding terms with just one doc into the terms dict) is already being done by the default postings format. I would recommend just using the default ...

But, I don't think this can explain the original AIOOBE you hit; are you able to reproduce that exception?
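A minimal before/after sketch of this recommendation on ES 1.x; the index and field names are made up for illustration:

```shell
# Sketch: per-field postings format on ES 1.x.
# Index names and the "account_id" field are hypothetical examples.

# Before (problematic): explicit bloom_pulsing postings format,
# which produces the checksum-less .smy files seen in the logs.
curl -XPUT 'localhost:9200/test-before' -d '{
  "mappings": { "doc": { "properties": {
    "account_id": { "type": "string", "index": "not_analyzed",
                    "postings_format": "bloom_pulsing" }
  }}}
}'

# After (recommended): omit postings_format and use the default codec.
curl -XPUT 'localhost:9200/test-after' -d '{
  "mappings": { "doc": { "properties": {
    "account_id": { "type": "string", "index": "not_analyzed" }
  }}}
}'
```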

@ghost

Author

commented Sep 2, 2014

Check your spam :) (if you can't find it, I'll resend).

No, it can't explain the AIOOBE, but we have not yet seen that problem recur. (Perhaps we should split this bug into two: one for the still-unexplained AIOOBE, the other for the postings_format issue. I am changing our index definitions, so I should be able to confirm within several days whether the problem recurs.)

@budnik


commented Sep 4, 2014

We encountered the same issue multiple times. We were using "postings_format": "bloom_pulsing" for one of the fields. I hope switching back to the default will help, and thanks for helping me figure it out.

@mikemccand

Contributor

commented Sep 4, 2014

Thanks @almer-fuel I found the email ...

Let's leave this issue open to dig into the AIOOBE?

@ghost

Author

commented Sep 4, 2014

@budnik @mikemccand Removing the bloom_pulsing helped restore our cluster to a stable state. So, it's confirmed to have been the cause of our (non-AIOOBE) problems.

@mikemccand Yes, we will keep an eye on whether the AIOOBE occurs again (and/or if we can reproduce it in a separate testing environment). If so, we will post our updates here. It could also be that the issues are somehow linked (the AIOOBE being a side effect of the bloom-pulsing issues), but I can't say that for sure. I'll keep an eye on it.

Thanks for the help! 🙏

@deverton

Contributor

commented Oct 1, 2014

We just rolled out a new cluster on 1.3.2 and got this after a restart.

[2014-10-01 11:24:19,381][WARN ][cluster.action.shard     ] [node1.example.com] [index-20130125][1] sending failed shard for [index-20130125][1], node[2LBNSJCMQByFpj0LgKZjJw], relocating [JT1xYUsPQkyEgLzDmBAvhA], [P], s[INITIALIZING], indexUUID [uNfol8JNQ_GZCMEzT07uvQ], reason [engine failure, message [corrupted preexisting index][CorruptIndexException[[reviews-20130125][1] Corrupted index [corrupted_yHFUTG8eQMWqLbRwRHtPaA] caused by: CorruptIndexException[codec footer mismatch: actual footer=1230204780 vs expected footer=-1071082520 (resource: MMapIndexInput(path="/opt/elasticsearch/data/elasticsearch_1_3/nodes/0/indices/index-20130125/1/index/_am.fdt"))]]]]

This was on a two-node cluster, and the index was created from scratch. There are no other indices in this cluster either. The index mapping is fairly simple and doesn't involve any changes to codecs, so I don't think this is bloom-related.

@s1monw

Contributor

commented Oct 1, 2014

Hmm, the index name says reviews-20130125, so it seems like a pretty old index? You said it was newly created, so I am a bit confused. Can you check if there is a file called corrupted_yHFUTG8eQMWqLbRwRHtPaA inside the index/shard directory and gist its contents?

@deverton

Contributor

commented Oct 1, 2014

Hah, sorry, my attempt at masking data didn't quite work out. While the index name indicates it's old, it's actually just the version of the mapping. It was replicated from a CouchDB instance as part of the deployment of 1.3.2.

I'll see if we still have corrupted_yHFUTG8eQMWqLbRwRHtPaA in the data directory. It might have been destroyed as we recreated the ES 1.3.2 instances just to be sure.

@s1monw

Contributor

commented Oct 1, 2014

I am closing this issue for now...

@s1monw s1monw closed this Oct 1, 2014
