Fix a very rare case of corruption in compression used for internal cluster communication. #7210

Closed
rjernst wants to merge 1 commit from rjernst:fix/compress-corruption

Conversation

8 participants
@rjernst
Member

rjernst commented Aug 8, 2014

See CorruptedCompressorTests for details on how this bug can be hit.
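(For readers without the test source handy: the essence of the reproduction is a compress/decompress round trip that asserts the output matches the input, driven by a carefully crafted byte sequence that trips the bug in the encoder. A minimal sketch of that shape using compress-lzf's public helpers; the stand-in input below will not actually trigger the bug:)

import com.ning.compress.lzf.LZFDecoder;
import com.ning.compress.lzf.LZFEncoder;

import java.util.Arrays;
import java.util.Random;

public class RoundTripSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in input: the real test constructs a specific byte
        // sequence that triggers the encoder bug; random data round-trips
        // correctly.
        byte[] original = new byte[65536];
        new Random(42).nextBytes(original);

        byte[] compressed = LZFEncoder.encode(original);
        byte[] restored = LZFDecoder.decode(compressed);

        // With the buggy encoder and the crafted input, this check fails.
        if (!Arrays.equals(original, restored)) {
            throw new AssertionError("LZF round trip corrupted the data");
        }
    }
}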

@rjernst rjernst changed the title Internal: Fix a very rare case of corruption in replication compression. Fix a very rare case of corruption in compression used for internal cluster communication. Aug 8, 2014

@javanna

src/test/java/org/elasticsearch/common/compress/CorruptedCompressorTests.java Outdated
*/
public class CorruptedCompressorTests extends ElasticsearchTestCase {

    public void testCorruption() throws IOException {

@javanna

javanna Aug 8, 2014

Member

missing @Test annotation

@rmuir

rmuir Aug 8, 2014

Contributor

@Test doesn't do anything :)

@javanna

javanna Aug 8, 2014

Member

I know we're fine anyway if the method name starts with test, but then why do we use the annotation all over the place? :) Either we remove it everywhere or we stick to it, I'd say...

@rjernst

rjernst Aug 8, 2014

Author Member

I don't use @Test unless someone makes me, for the exact reason Robert pointed out. It is just extra characters with no benefit.
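(Context for readers: as the thread above says, the runner picks up any public void method whose name starts with "test", JUnit-3 style, so the annotation is redundant for such methods. An illustrative sketch with a hypothetical class name:)

// Hypothetical example; would live in the test tree next to the other tests.
public class ExampleTests extends ElasticsearchTestCase {

    // Runs because the name starts with "test"; no @Test needed.
    public void testAddition() {
        assertEquals(4, 2 + 2);
    }
}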

@rmuir

Contributor

rmuir commented Aug 8, 2014

Please disable unsafe encode/decode completely (see the sketch after this list):

  • This may crash machines that don't allow unaligned reads: ning/compress#18
  • if (SUNOS) does not imply it's safe to do such unaligned reads.
  • This may corrupt data on big-endian systems: ning/compress#37
  • We do not test such situations.
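(A minimal sketch of what "disabling unsafe" means at the library level, assuming the factory APIs of the compress-lzf 1.0.x line; the pure-Java "safe" implementations avoid sun.misc.Unsafe, and with it the unaligned, byte-order-sensitive reads listed above:)

import com.ning.compress.lzf.ChunkDecoder;
import com.ning.compress.lzf.ChunkEncoder;
import com.ning.compress.lzf.LZFEncoder;
import com.ning.compress.lzf.util.ChunkDecoderFactory;
import com.ning.compress.lzf.util.ChunkEncoderFactory;

import java.util.Arrays;

public class SafeLzfSketch {
    public static void main(String[] args) throws Exception {
        byte[] data = "example payload example payload".getBytes("UTF-8");

        // safeInstance() selects the pure-Java implementations instead of
        // the Unsafe-backed ones that optimalInstance() may return.
        ChunkEncoder encoder = ChunkEncoderFactory.safeInstance(data.length);
        ChunkDecoder decoder = ChunkDecoderFactory.safeInstance();

        byte[] compressed = LZFEncoder.encode(encoder, data, data.length);
        byte[] restored = decoder.decode(compressed);

        if (!Arrays.equals(data, restored)) {
            throw new AssertionError("round trip failed");
        }
    }
}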
@imotov

src/main/java/org/elasticsearch/common/compress/ElasticsearchChunkEncoder.java Outdated
* This is a fork of {@link com.ning.compress.lzf.impl.VanillaChunkEncoder} to quickly fix
* an extremely rare bug. See CorruptedCompressorTests for details on reproducing the bug.
*/
public class ElasticsearchChunkEncoder extends VanillaChunkEncoder {

@imotov

imotov Aug 8, 2014

Member

Historically, we have used the "X" prefix to designate temporary implementations like this one, so a more traditional name would be XVanillaChunkEncoder. For example, see 2edde35.

@rjernst

rjernst Aug 8, 2014

Author Member

Ok, changed to XVanillaChunkEncoder.

@rjernst

Member Author

rjernst commented Aug 8, 2014

Ok, I think I addressed all the comments. The only unchanged thing is the license file, because I don't know which license to put in there (the original file had no license header).

@rjernst

Member Author

rjernst commented Aug 9, 2014

The PR to the compress-lzf project was merged, and a 1.0.2 release was made. I removed the X encoder and made the upgrade to 1.0.2.
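(The corresponding dependency bump, assuming the library's usual Maven coordinates; surrounding pom.xml context omitted:)

<!-- 1.0.2 contains the upstream fix for the encoder corruption -->
<dependency>
    <groupId>com.ning</groupId>
    <artifactId>compress-lzf</artifactId>
    <version>1.0.2</version>
</dependency>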

@kimchy

pom.xml Outdated
@@ -1381,6 +1381,7 @@
<!-- t-digest -->
<exclude>src/main/java/org/elasticsearch/search/aggregations/metrics/percentiles/tdigest/TDigestState.java</exclude>
<exclude>src/test/java/org/elasticsearch/search/aggregations/metrics/GroupTree.java</exclude>
<exclude>src/test/java/org/elasticsearch/common/compress/lzf/XVanillaChunkEncoder.java</exclude>

@kimchy

kimchy Aug 11, 2014

Member

do we still need this with 1.0.2?

@rjernst

rjernst Aug 11, 2014

Author Member

Good catch. Removed.

Internal: Fix a very rare case of corruption in compression used for
internal cluster communication.

See CorruptedCompressorTests for details on how this bug can be hit.
This change also removes the ability to use the unsafe variant of
ChunkEncoder, removing support for the compress.lzf.decoder setting.
@rmuir

Contributor

rmuir commented Aug 11, 2014

looks good, thanks Ryan.

@jpountz

Contributor

jpountz commented Aug 11, 2014

+1 as well

@rjernst rjernst added the bug label Aug 11, 2014

@rjernst

Member Author

rjernst commented Aug 11, 2014

Thanks. Pushed.

@rjernst rjernst closed this Aug 11, 2014

@rjernst rjernst added the v1.2.4 label Aug 12, 2014

@clintongormley clintongormley changed the title Fix a very rare case of corruption in compression used for internal cluster communication. Internal: Fix a very rare case of corruption in compression used for internal cluster communication. Sep 8, 2014

s1monw added a commit to s1monw/elasticsearch that referenced this pull request Nov 11, 2014

[TEST] Disable compression in BWC test for version < 1.3.2
The compression bug fixed in elastic#7210 can still strike us since we are
running BWC tests against these versions. This commit forcefully disables
compression if the compatibility version is < 1.3.2, to avoid debugging
already known issues.

@rjernst rjernst deleted the rjernst:fix/compress-corruption branch Jan 21, 2015

s1monw added a commit to s1monw/elasticsearch that referenced this pull request Mar 2, 2015

[RECOVERY] Don't recover from buggy version
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre-1.3.2 nodes when compression is enabled,
to work around elastic#7210

Closes elastic#9922

@clintongormley clintongormley changed the title Internal: Fix a very rare case of corruption in compression used for internal cluster communication. Fix a very rare case of corruption in compression used for internal cluster communication. Jun 7, 2015

@taf2

taf2 commented Jun 18, 2015

Upgrading from 1.1.1 to 1.6.0 and noticing this output from our cluster:

insertOrder timeInQueue priority source
      37659        27ms HIGH     shard-failed ([callers][2], node[Ko3b9KsESN68lTkPtVrHKw], relocating [4mcZCKvBRoKQJS_StGNPng], [P], s[INITIALIZING]), reason [shard failure [failed recovery][RecoveryFailedException[[callers][2]: Recovery failed from [aws_el1][4mcZCKvBRoKQJS_StGNPng][ip-10-55-11-210][inet[/10.55.11.210:9300]]{rack=useast1, master=true, zone=zonea} into [aws_el1a][Ko3b9KsESN68lTkPtVrHKw][ip-10-55-11-211][inet[/10.55.11.211:9300]]{rack=useast1, zone=zonea, master=true} (unexpected error)]; nested: ElasticsearchIllegalStateException[Can't recovery from node [aws_el1][4mcZCKvBRoKQJS_StGNPng][ip-10-55-11-210][inet[/10.55.11.210:9300]]{rack=useast1, master=true, zone=zonea} with [indices.recovery.compress : true] due to compression bugs -  see issue #7210 for details]; ]]

What do we do?
@rjernst

Member Author

rjernst commented Jun 18, 2015

@taf2 Turn off compression before upgrading.

@taf2

taf2 commented Jun 18, 2015

@rjernst thanks! Which kind of compression do we disable...

Is it this option in /etc/elasticsearch/elasticsearch.yml?

#transport.tcp.compress: true

Or another option?

@taf2

taf2 commented Jun 18, 2015

Okay, sorry, it looks like we need to disable indices.recovery.compress. But is this something that needs to be disabled on all nodes in the cluster, or just the new 1.6.0 node we're starting up now?

@rjernst

Member Author

rjernst commented Jun 18, 2015

All nodes in the cluster, before starting the upgrade. The problem is that old nodes with this setting enabled would use the old buggy code, which can then cause data copied between an old and a new node to become corrupted.

@taf2

taf2 commented Jun 18, 2015

Excellent, thank you. We have run the following on the existing cluster:

curl -XPUT localhost:9200/_cluster/settings -d '{"transient" : {"indices.recovery.compress" : false }}'
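(One caveat worth noting: transient cluster settings reset on a full cluster restart. To make the change survive a restart, the same call can be made with the persistent scope:)

curl -XPUT localhost:9200/_cluster/settings -d '{"persistent" : {"indices.recovery.compress" : false }}'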
@taf2

taf2 commented Jun 18, 2015

Thank you, that did the trick!

mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015

[RECOVERY] Don't recover from buggy version
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre-1.3.2 nodes when compression is enabled,
to work around elastic#7210

Closes elastic#9922

Conflicts:
	src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java