Fix a very rare case of corruption in compression used for internal cluster communication. #7210


@rjernst
Member
rjernst commented Aug 8, 2014

See CorruptedCompressorTests for details on how this bug can be hit.
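For readers who don't want to dig through the test: the failure mode is a round-trip one, where bytes compressed by the pure-Java LZF encoder no longer decode back to the original input. Below is a minimal round-trip sketch, not the actual CorruptedCompressorTests; it assumes compress-lzf 1.0.x on the classpath, and the class name is hypothetical. Repetitive random data like this exercises the encoder's hash table, though it will almost never hit the specific collision the real test constructs.

```java
import com.ning.compress.lzf.LZFDecoder;
import com.ning.compress.lzf.LZFEncoder;

import java.io.IOException;
import java.util.Arrays;
import java.util.Random;

public class LZFRoundTripSketch {
    public static void main(String[] args) throws IOException {
        Random random = new Random(42);
        byte[] original = new byte[1 << 16];
        // Highly repetitive data gives the LZF match-finder (and its hash table)
        // plenty of back-references to emit.
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) random.nextInt(4);
        }
        byte[] compressed = LZFEncoder.encode(original);
        byte[] restored = LZFDecoder.decode(compressed);
        if (!Arrays.equals(original, restored)) {
            throw new AssertionError("LZF round trip corrupted the data");
        }
        System.out.println("round trip OK: " + original.length + " -> " + compressed.length + " bytes");
    }
}
```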

@rjernst rjernst changed the title from Internal: Fix a very rare case of corruption in replication compression. to Fix a very rare case of corruption in compression used for internal cluster communication. Aug 8, 2014
@javanna javanna and 2 others commented on an outdated diff Aug 8, 2014
...csearch/common/compress/CorruptedCompressorTests.java
+import org.elasticsearch.test.ElasticsearchTestCase;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+
+import static org.hamcrest.Matchers.equalTo;
+
+/**
+ * Test an extremely rare corruption produced by the pure java impl of ChunkEncoder.
+ */
+public class CorruptedCompressorTests extends ElasticsearchTestCase {
+
+ public void testCorruption() throws IOException {
@javanna
javanna Aug 8, 2014 Member

missing @Test annotation

@rmuir
rmuir Aug 8, 2014 Contributor

@Test doesn't do anything :)

@javanna
javanna Aug 8, 2014 Member

I know that if the method name starts with test we are good anyway, but why do we use the annotation all over the place then :) Either we remove it everywhere or we stick to it, I'd say...

@rjernst
rjernst Aug 8, 2014 Member

I don't use @Test unless someone makes me, for the exact reason Robert pointed out. It is just extra characters with no benefit.
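An aside for context: under the randomized runner used by the Lucene/Elasticsearch test framework, a public void no-arg method whose name starts with "test" is discovered and run even without the @Test annotation, which is what makes the annotation redundant here. A minimal sketch; the class and method names are hypothetical:

```java
import org.elasticsearch.test.ElasticsearchTestCase;

public class NamingConventionSketchTests extends ElasticsearchTestCase {

    // Runs as a test purely because of the "test" name prefix; no @Test needed.
    public void testWithoutAnnotation() {
        assertEquals(4, 2 + 2);
    }
}
```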

@rmuir
Contributor
rmuir commented Aug 8, 2014

Please disable unsafe encode/decode completely.

  • This may crash machines that don't allow unaligned reads: ning/compress#18
  • if (SUNOS) does not imply it's safe to do such unaligned reads.
  • This may corrupt data on big-endian systems: ning/compress#37
  • We do not test such situations.
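For anyone wanting the safe path explicitly: compress-lzf lets callers pin the pure-Java decoder rather than the sun.misc.Unsafe-backed one. A minimal sketch, assuming compress-lzf 1.0.x; SafeDecoderSketch is a hypothetical class name, while the ChunkDecoderFactory methods are the library's own API:

```java
import com.ning.compress.lzf.ChunkDecoder;
import com.ning.compress.lzf.LZFEncoder;
import com.ning.compress.lzf.util.ChunkDecoderFactory;

import java.io.IOException;

public class SafeDecoderSketch {
    public static void main(String[] args) throws IOException {
        // optimalInstance() may hand back the Unsafe-based decoder, which relies on
        // unaligned reads; safeInstance() always returns the pure-Java implementation.
        ChunkDecoder safe = ChunkDecoderFactory.safeInstance();
        byte[] compressed = LZFEncoder.encode("hello hello hello hello".getBytes("UTF-8"));
        byte[] restored = safe.decode(compressed);
        System.out.println(new String(restored, "UTF-8"));
    }
}
```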
@imotov imotov and 1 other commented on an outdated diff Aug 8, 2014
...search/common/compress/ElasticsearchChunkEncoder.java
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.common.compress;
+
+import com.ning.compress.lzf.LZFChunk;
+import com.ning.compress.lzf.impl.VanillaChunkEncoder;
+
+/**
+ * This is a fork of {@link com.ning.compress.lzf.impl.VanillaChunkEncoder} to quickly fix
+ * an extremely rare bug. See CorruptedCompressorTests for details on reproducing the bug.
+ */
+public class ElasticsearchChunkEncoder extends VanillaChunkEncoder {
@imotov
imotov Aug 8, 2014 Member

Historically, we have used an "X" prefix to designate temporary implementations like this one, so a more traditional name would be XVanillaChunkEncoder. For example 2edde35

@rjernst
rjernst Aug 8, 2014 Member

Ok, changed to XVanillaChunkEncoder.

@rjernst
Member
rjernst commented Aug 8, 2014

Ok, I think I addressed all the comments. The only unchanged thing is the license file, because I don't know which license to put in there (the original file had no license header).

@rjernst
Member
rjernst commented Aug 9, 2014

The PR to the compress-lzf project was merged, and a 1.0.2 release was made. I removed the X encoder and upgraded the dependency to 1.0.2.

@kimchy kimchy and 1 other commented on an outdated diff Aug 11, 2014
pom.xml
@@ -1381,6 +1381,7 @@
<!-- t-digest -->
<exclude>src/main/java/org/elasticsearch/search/aggregations/metrics/percentiles/tdigest/TDigestState.java</exclude>
<exclude>src/test/java/org/elasticsearch/search/aggregations/metrics/GroupTree.java</exclude>
+ <exclude>src/test/java/org/elasticsearch/common/compress/lzf/XVanillaChunkEncoder.java</exclude>
@kimchy
kimchy Aug 11, 2014 Member

do we still need this with 1.0.2?

@rjernst
rjernst Aug 11, 2014 Member

Good catch. Removed.

@rjernst rjernst Internal: Fix a very rare case of corruption in compression used for
internal cluster communication.

See CorruptedCompressorTests for details on how this bug can be hit.
This change also removes the ability to use the unsafe variant of
ChunkEncoder, removing support for the compress.lzf.decoder setting.
e3e5bff
@rmuir
Contributor
rmuir commented Aug 11, 2014

looks good, thanks Ryan.

@jpountz
Contributor
jpountz commented Aug 11, 2014

+1 as well

@rjernst
Member
rjernst commented Aug 11, 2014

Thanks. Pushed.

@rjernst rjernst closed this Aug 11, 2014
@rjernst rjernst added the v1.2.4 label Aug 12, 2014
@clintongormley clintongormley changed the title from Fix a very rare case of corruption in compression used for internal cluster communication. to Internal: Fix a very rare case of corruption in compression used for internal cluster communication. Sep 8, 2014
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this pull request Nov 11, 2014
@s1monw s1monw [TEST] Disable compression in BWC test for version < 1.3.2
The compression bug fixed in #7210 can still strike us since we are
running BWC test against these version. This commit disables compression
forcefully if the compatibility version is < 1.3.2 to prevent debugging
already known issues.
0cf86c0
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this pull request Nov 11, 2014
@s1monw s1monw [TEST] Disable compression in BWC test for version < 1.3.2
The compression bug fixed in #7210 can still strike us since we are
running BWC test against these version. This commit disables compression
forcefully if the compatibility version is < 1.3.2 to prevent debugging
already known issues.
16cb0dc
@rjernst rjernst deleted the rjernst:fix/compress-corruption branch Jan 21, 2015
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this pull request Mar 2, 2015
@s1monw s1monw [RECOVERY] Don't recover from buggy version
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around #7210

Closes #9922
dd78370
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this pull request Mar 2, 2015
@s1monw s1monw [RECOVERY] Don't recover from buggy version
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around #7210

Closes #9922

Conflicts:
	src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
8376043
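For context on what these commits describe: recovery now refuses a compressed recovery whose source node predates the fix. A hedged free-standing sketch of that version gate, not the actual dd78370 diff; in the real commit this logic lives inside the recovery code and uses internal types, so the class, method, and message below are assumptions in the style of the 1.x codebase:

```java
import org.elasticsearch.Version;

public class RecoveryVersionGateSketch {

    static void checkRecoverySource(Version sourceVersion, boolean compressionEnabled) {
        // Pre-1.3.2 nodes may still run the buggy LZF encoder, so refuse compressed
        // recoveries from them outright rather than risk corrupting copied data.
        if (compressionEnabled && sourceVersion.before(Version.V_1_3_2)) {
            throw new IllegalStateException("can't recover from node on version " + sourceVersion
                    + " with [indices.recovery.compress : true] due to compression bugs - see issue #7210");
        }
    }
}
```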
@clintongormley clintongormley changed the title from Internal: Fix a very rare case of corruption in compression used for internal cluster communication. to Fix a very rare case of corruption in compression used for internal cluster communication. Jun 7, 2015
@taf2
taf2 commented Jun 18, 2015

Upgrading from 1.1.1 to 1.6.0 and noticing this output from our cluster:

```
insertOrder timeInQueue priority source
      37659        27ms HIGH     shard-failed ([callers][2], node[Ko3b9KsESN68lTkPtVrHKw], relocating [4mcZCKvBRoKQJS_StGNPng], [P], s[INITIALIZING]), reason [shard failure [failed recovery][RecoveryFailedException[[callers][2]: Recovery failed from [aws_el1][4mcZCKvBRoKQJS_StGNPng][ip-10-55-11-210][inet[/10.55.11.210:9300]]{rack=useast1, master=true, zone=zonea} into [aws_el1a][Ko3b9KsESN68lTkPtVrHKw][ip-10-55-11-211][inet[/10.55.11.211:9300]]{rack=useast1, zone=zonea, master=true} (unexpected error)]; nested: ElasticsearchIllegalStateException[Can't recovery from node [aws_el1][4mcZCKvBRoKQJS_StGNPng][ip-10-55-11-210][inet[/10.55.11.210:9300]]{rack=useast1, master=true, zone=zonea} with [indices.recovery.compress : true] due to compression bugs -  see issue #7210 for details]; ]]
```

what do we do?
@rjernst
Member
rjernst commented Jun 18, 2015

@taf2 Turn off compression before upgrading.

@taf2
taf2 commented Jun 18, 2015

@rjernst thanks! Which kind of compression do we disable?

Is it this option in /etc/elasticsearch/elasticsearch.yml:

#transport.tcp.compress: true

Or another option?

@taf2
taf2 commented Jun 18, 2015

Okay, sorry, it looks like we need to disable indices.recovery.compress. But is this something that needs to be disabled on all nodes in the cluster, or just the new 1.6.0 node we're starting up now?

@rjernst
Member
rjernst commented Jun 18, 2015

All nodes in the cluster, before starting the upgrade. The problem is that old nodes with this setting enabled would use the old buggy code, which can then cause data copied between an old and a new node to become corrupted.

@taf2
taf2 commented Jun 18, 2015

Excellent, thank you. We have run the following on the existing cluster:

curl -XPUT localhost:9200/_cluster/settings -d '{"transient" : {"indices.recovery.compress" : false }}'
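
A related detail worth noting for later readers: transient cluster settings do not survive a full cluster restart, so if the upgrade involves restarting every node it may be safer to also set the persistent variant, i.e. the same command with "persistent" in place of "transient":

curl -XPUT localhost:9200/_cluster/settings -d '{"persistent" : {"indices.recovery.compress" : false }}'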
@taf2
taf2 commented Jun 18, 2015

Thank you that did the trick!

@mute mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
@s1monw s1monw [RECOVERY] Don't recover from buggy version
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around #7210

Closes #9922

Conflicts:
	src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
793c9e2