Fix a very rare case of corruption in compression used for internal cluster communication. #7210

Closed
wants to merge 1 commit into
base: master
from

8 participants
@rjernst
Member

rjernst commented Aug 8, 2014

See CorruptedCompressorTests for details on how this bug can be hit.

@rjernst rjernst changed the title from Internal: Fix a very rare case of corruption in replication compression. to Fix a very rare case of corruption in compression used for internal cluster communication. Aug 8, 2014

@javanna

...est/java/org/elasticsearch/common/compress/CorruptedCompressorTests.java
Contributor

rmuir commented Aug 8, 2014

Please disable unsafe encode/decode completely.

  • This may crash machines that don't allow unaligned reads: ning/compress#18
  • if (SUNOS) does not imply it's safe to do such unaligned reads.
  • This may corrupt data on big-endian systems: ning/compress#37
  • We do not test such situations.
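
For readers unfamiliar with the byte-order concern raised above, here is a minimal sketch. It is not the compress-lzf code: `ByteOrderSketch` and its methods are hypothetical names, and `ByteBuffer` with an explicit byte order stands in for a native-order multi-byte read. It only illustrates that the same four bytes decode to different integers depending on byte order, whereas a byte-by-byte read gives the same result on every platform.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; not taken from compress-lzf or Elasticsearch.
final class ByteOrderSketch {

    // Portable read: assembles a big-endian int one byte at a time,
    // so the result does not depend on the CPU's byte order or alignment.
    static int readIntBigEndian(byte[] buf, int offset) {
        return ((buf[offset] & 0xFF) << 24)
             | ((buf[offset + 1] & 0xFF) << 16)
             | ((buf[offset + 2] & 0xFF) << 8)
             |  (buf[offset + 3] & 0xFF);
    }

    // Stand-in for a multi-byte read whose result depends on byte order.
    static int readIntWithOrder(byte[] buf, int offset, ByteOrder order) {
        return ByteBuffer.wrap(buf).order(order).getInt(offset);
    }

    public static void main(String[] args) {
        byte[] data = {0x01, 0x02, 0x03, 0x04, 0x05};
        // Offset 1 is deliberately not a multiple of 4 (an unaligned position).
        System.out.printf("portable:      0x%08X%n", readIntBigEndian(data, 1));
        System.out.printf("big-endian:    0x%08X%n", readIntWithOrder(data, 1, ByteOrder.BIG_ENDIAN));
        System.out.printf("little-endian: 0x%08X%n", readIntWithOrder(data, 1, ByteOrder.LITTLE_ENDIAN));
    }
}
```

Code that assumes one byte order while reading in native order would silently produce the other value on a machine with the opposite order, which is the kind of corruption the comment warns about.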
@imotov

...in/java/org/elasticsearch/common/compress/ElasticsearchChunkEncoder.java
Member

rjernst commented Aug 8, 2014

Ok, I think I addressed all the comments. The only unchanged thing is the license file, because I don't know which license to put in there (the original file had no license header).

Member

rjernst commented Aug 9, 2014

The PR to the compress-lzf project was merged, and a 1.0.2 release was made. I removed the X encoder and made the upgrade to 1.0.2.

@kimchy

pom.xml
Internal: Fix a very rare case of corruption in compression used for
internal cluster communication.

See CorruptedCompressorTests for details on how this bug can be hit.
This change also removes the ability to use the unsafe variant of
ChunkedEncoder, removing support for the compress.lzf.decoder setting.
Contributor

rmuir commented Aug 11, 2014

looks good, thanks Ryan.

Contributor

jpountz commented Aug 11, 2014

+1 as well

@rjernst rjernst added the bug label Aug 11, 2014

Member

rjernst commented Aug 11, 2014

Thanks. Pushed.

@rjernst rjernst closed this Aug 11, 2014

@rjernst rjernst added the v1.2.4 label Aug 12, 2014

@clintongormley clintongormley changed the title from Fix a very rare case of corruption in compression used for internal cluster communication. to Internal: Fix a very rare case of corruption in compression used for internal cluster communication. Sep 8, 2014

s1monw added a commit to s1monw/elasticsearch that referenced this pull request Nov 11, 2014

[TEST] Disable compression in BWC tests for versions < 1.3.2
The compression bug fixed in #7210 can still strike us since we are
running BWC tests against these versions. This commit forcefully disables
compression if the compatibility version is < 1.3.2 to avoid debugging
already-known issues.

@rjernst rjernst deleted the rjernst:fix/compress-corruption branch Jan 21, 2015

s1monw added a commit to s1monw/elasticsearch that referenced this pull request Mar 2, 2015

[RECOVERY] Don't recover from buggy version
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around #7210

Closes #9922
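
The rule in this commit message can be sketched roughly as follows. This is not the actual `RecoveryTarget` code: `RecoveryGateSketch`, `Decision`, `compareVersions`, and `decide` are hypothetical names, and versions are assumed to be plain dotted numbers such as `1.3.2`.

```java
// Hypothetical sketch of the version gate described in the commit message;
// it does not reproduce Elasticsearch's real implementation.
final class RecoveryGateSketch {

    enum Decision { REJECT, FULL_RECOVERY, NORMAL }

    // Compares dotted numeric versions such as "1.3.2" and "1.4.0".
    // Missing components are treated as 0.
    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    static Decision decide(String sourceVersion, boolean recoveryCompressionEnabled) {
        // Pre-1.3.2 sources still run the buggy compressor: refuse if compression is on.
        if (recoveryCompressionEnabled && compareVersions(sourceVersion, "1.3.2") < 0) {
            return Decision.REJECT;
        }
        // Sources older than 1.4.0 get a full recovery.
        if (compareVersions(sourceVersion, "1.4.0") < 0) {
            return Decision.FULL_RECOVERY;
        }
        return Decision.NORMAL;
    }

    public static void main(String[] args) {
        System.out.println(decide("1.1.1", true));
        System.out.println(decide("1.1.1", false));
        System.out.println(decide("1.4.0", true));
    }
}
```

Under these assumptions, a 1.1.1 source node with `indices.recovery.compress` enabled is rejected, which matches the failure reported later in this thread.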

s1monw added a commit to s1monw/elasticsearch that referenced this pull request Mar 2, 2015

[RECOVERY] Don't recover from buggy version
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around #7210

Closes #9922

Conflicts:
	src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java

@clintongormley clintongormley changed the title from Internal: Fix a very rare case of corruption in compression used for internal cluster communication. to Fix a very rare case of corruption in compression used for internal cluster communication. Jun 7, 2015

taf2 commented Jun 18, 2015

Upgrading from 1.1.1 to 1.6.0 and noticing this output from our cluster

insertOrder timeInQueue priority source
      37659        27ms HIGH     shard-failed ([callers][2], node[Ko3b9KsESN68lTkPtVrHKw], relocating [4mcZCKvBRoKQJS_StGNPng], [P], s[INITIALIZING]), reason [shard failure [failed recovery][RecoveryFailedException[[callers][2]: Recovery failed from [aws_el1][4mcZCKvBRoKQJS_StGNPng][ip-10-55-11-210][inet[/10.55.11.210:9300]]{rack=useast1, master=true, zone=zonea} into [aws_el1a][Ko3b9KsESN68lTkPtVrHKw][ip-10-55-11-211][inet[/10.55.11.211:9300]]{rack=useast1, zone=zonea, master=true} (unexpected error)]; nested: ElasticsearchIllegalStateException[Can't recovery from node [aws_el1][4mcZCKvBRoKQJS_StGNPng][ip-10-55-11-210][inet[/10.55.11.210:9300]]{rack=useast1, master=true, zone=zonea} with [indices.recovery.compress : true] due to compression bugs -  see issue #7210 for details]; ]]

what do we do?
Member

rjernst commented Jun 18, 2015

@taf2 Turn off compression before upgrading.

taf2 commented Jun 18, 2015

@rjernst thanks! which kind of compression do we disable...

is it this option in

/etc/elasticsearch/elasticsearch.yml
#transport.tcp.compress: true

?

or another option?

taf2 commented Jun 18, 2015

okay sorry it looks like we need to disable indices.recovery.compress - but is this something that needs to be disabled on all nodes in the cluster or just the new 1.6.0 node we're starting up now?

Member

rjernst commented Jun 18, 2015

All nodes in the cluster, before starting the upgrade. The problem is that old nodes with this setting enabled would use the old buggy code, which can then cause data copied between an old and a new node to become corrupted.

taf2 commented Jun 18, 2015

excellent thank you - we have run the following on the existing cluster:

curl -XPUT localhost:9200/_cluster/settings -d '{"transient" : {"indices.recovery.compress" : false }}'
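
For anyone scripting this from Java instead of curl, the same request could look roughly like the sketch below. It assumes Java 11+ (`java.net.http`); the class name `DisableRecoveryCompressionSketch` is hypothetical, and the `Content-Type` header is an addition not present in the curl command above, which newer Elasticsearch versions require. The request is only sent when a base URL is passed as an argument.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper mirroring the curl command above.
final class DisableRecoveryCompressionSketch {

    // Same transient cluster setting as in the curl command.
    static final String BODY = "{\"transient\":{\"indices.recovery.compress\":false}}";

    static HttpRequest buildRequest(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
                .header("Content-Type", "application/json") // assumption: not in the original curl call
                .PUT(HttpRequest.BodyPublishers.ofString(BODY))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = buildRequest(args.length > 0 ? args[0] : "http://localhost:9200");
        System.out.println(request.method() + " " + request.uri());
        if (args.length > 0) {
            // Only contact a cluster when a base URL was given explicitly.
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }
}
```

As noted earlier in the thread, the setting has to be applied to the existing cluster before the upgrade starts. Because the setting is cluster-wide, one call to any node should be enough.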
taf2 commented Jun 18, 2015

Thank you that did the trick!

mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015

[RECOVERY] Don't recover from buggy version
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around #7210

Closes #9922

Conflicts:
	src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java