Fix a very rare case of corruption in compression used for internal cluster communication. #7210


@rjernst
Member
rjernst commented Aug 8, 2014

See CorruptedCompressorTests for details on how this bug can be hit.
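For readers who don't want to dig through the test: the failure mode is a round-trip one, where bytes compressed by the pure-Java LZF encoder no longer decode back to the original input. Below is a minimal round-trip sketch, not the actual CorruptedCompressorTests; it assumes compress-lzf 1.0.x on the classpath, and the class name is hypothetical. Repetitive random data like this exercises the encoder's hash table, though it will almost never hit the specific collision the real test constructs.

```java
import com.ning.compress.lzf.LZFDecoder;
import com.ning.compress.lzf.LZFEncoder;

import java.io.IOException;
import java.util.Arrays;
import java.util.Random;

public class LZFRoundTripSketch {
    public static void main(String[] args) throws IOException {
        Random random = new Random(42);
        byte[] original = new byte[1 << 16];
        // Highly repetitive data gives the LZF match-finder (and its hash table)
        // plenty of back-references to emit.
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) random.nextInt(4);
        }
        byte[] compressed = LZFEncoder.encode(original);
        byte[] restored = LZFDecoder.decode(compressed);
        if (!Arrays.equals(original, restored)) {
            throw new AssertionError("LZF round trip corrupted the data");
        }
        System.out.println("round trip OK: " + original.length + " -> " + compressed.length + " bytes");
    }
}
```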

@rjernst rjernst changed the title from Internal: Fix a very rare case of corruption in replication compression. to Fix a very rare case of corruption in compression used for internal cluster communication. Aug 8, 2014
@javanna javanna and 2 others commented on an outdated diff Aug 8, 2014
...csearch/common/compress/CorruptedCompressorTests.java
+import org.elasticsearch.test.ElasticsearchTestCase;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+
+import static org.hamcrest.Matchers.equalTo;
+
+/**
+ * Test an extremely rare corruption produced by the pure java impl of ChunkEncoder.
+ */
+public class CorruptedCompressorTests extends ElasticsearchTestCase {
+
+ public void testCorruption() throws IOException {
@javanna
javanna Aug 8, 2014 Member

missing @Test annotation

@rmuir
rmuir Aug 8, 2014 Contributor

@Test doesn't do anything :)

@javanna
javanna Aug 8, 2014 Member

I know that if the method name starts with test we are good anyway, but why do we use the annotation all over the place then :) Either we remove it everywhere or we stick to it, I'd say...

@rjernst
rjernst Aug 8, 2014 Member

I don't use @Test unless someone makes me, for the exact reason Robert pointed out. It is just extra characters with no benefit.
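An aside for context: under the randomized runner used by the Lucene/Elasticsearch test framework, a public void no-arg method whose name starts with "test" is discovered and run even without the @Test annotation, which is what makes the annotation redundant here. A minimal sketch; the class and method names are hypothetical:

```java
import org.elasticsearch.test.ElasticsearchTestCase;

public class NamingConventionSketchTests extends ElasticsearchTestCase {

    // Runs as a test purely because of the "test" name prefix; no @Test needed.
    public void testWithoutAnnotation() {
        assertEquals(4, 2 + 2);
    }
}
```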

@rmuir
Contributor
rmuir commented Aug 8, 2014

Please disable unsafe encode/decode completely.

  • This may crash machines that don't allow unaligned reads: ning/compress#18
  • if (SUNOS) does not imply it's safe to do such unaligned reads.
  • This may corrupt data on big-endian systems: ning/compress#37
  • We do not test such situations.
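For anyone wanting the safe path explicitly: compress-lzf lets callers pin the pure-Java decoder rather than the sun.misc.Unsafe-backed one. A minimal sketch, assuming compress-lzf 1.0.x; SafeDecoderSketch is a hypothetical class name, while the ChunkDecoderFactory methods are the library's own API:

```java
import com.ning.compress.lzf.ChunkDecoder;
import com.ning.compress.lzf.LZFEncoder;
import com.ning.compress.lzf.util.ChunkDecoderFactory;

import java.io.IOException;

public class SafeDecoderSketch {
    public static void main(String[] args) throws IOException {
        // optimalInstance() may hand back the Unsafe-based decoder, which relies on
        // unaligned reads; safeInstance() always returns the pure-Java implementation.
        ChunkDecoder safe = ChunkDecoderFactory.safeInstance();
        byte[] compressed = LZFEncoder.encode("hello hello hello hello".getBytes("UTF-8"));
        byte[] restored = safe.decode(compressed);
        System.out.println(new String(restored, "UTF-8"));
    }
}
```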
@imotov imotov and 1 other commented on an outdated diff Aug 8, 2014
...search/common/compress/ElasticsearchChunkEncoder.java
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.common.compress;
+
+import com.ning.compress.lzf.LZFChunk;
+import com.ning.compress.lzf.impl.VanillaChunkEncoder;
+
+/**
+ * This is a fork of {@link com.ning.compress.lzf.impl.VanillaChunkEncoder} to quickly fix
+ * an extremely rare bug. See CorruptedCompressorTests for details on reproducing the bug.
+ */
+public class ElasticsearchChunkEncoder extends VanillaChunkEncoder {
@imotov
imotov Aug 8, 2014 Member

Historically, we have used an "X" prefix to designate temporary implementations like this one, so a more traditional name would be XVanillaChunkEncoder. For example 2edde35

@rjernst
rjernst Aug 8, 2014 Member

Ok, changed to XVanillaChunkEncoder.

@rjernst
Member
rjernst commented Aug 8, 2014

Ok, I think I addressed all the comments. The only unchanged thing is the license file, because I don't know which license to put in there (the original file had no license header).

@rjernst
Member
rjernst commented Aug 9, 2014

The PR to the compress-lzf project was merged, and a 1.0.2 release was made. I removed the X encoder and upgraded the dependency to 1.0.2.

@kimchy kimchy and 1 other commented on an outdated diff Aug 11, 2014
pom.xml
@@ -1381,6 +1381,7 @@
<!-- t-digest -->
<exclude>src/main/java/org/elasticsearch/search/aggregations/metrics/percentiles/tdigest/TDigestState.java</exclude>
<exclude>src/test/java/org/elasticsearch/search/aggregations/metrics/GroupTree.java</exclude>
+ <exclude>src/test/java/org/elasticsearch/common/compress/lzf/XVanillaChunkEncoder.java</exclude>
@kimchy
kimchy Aug 11, 2014 Member

do we still need this with 1.0.2?

@rjernst
rjernst Aug 11, 2014 Member

Good catch. Removed.

@rjernst rjernst Internal: Fix a very rare case of corruption in compression used for
internal cluster communication.

See CorruptedCompressorTests for details on how this bug can be hit.
This change also removes the ability to use the unsafe variant of
ChunkEncoder, removing support for the compress.lzf.decoder setting.
e3e5bff
@rmuir
Contributor
rmuir commented Aug 11, 2014

looks good, thanks Ryan.

@jpountz
Contributor
jpountz commented Aug 11, 2014

+1 as well

@rjernst
Member
rjernst commented Aug 11, 2014

Thanks. Pushed.

@rjernst rjernst closed this Aug 11, 2014
@rjernst rjernst added the v1.2.4 label Aug 12, 2014
@clintongormley clintongormley changed the title from Fix a very rare case of corruption in compression used for internal cluster communication. to Internal: Fix a very rare case of corruption in compression used for internal cluster communication. Sep 8, 2014
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this pull request Nov 11, 2014
@s1monw s1monw [TEST] Disable compression in BWC test for version < 1.3.2
The compression bug fixed in #7210 can still strike us since we are
running BWC test against these version. This commit disables compression
forcefully if the compatibility version is < 1.3.2 to prevent debugging
already known issues.
0cf86c0
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this pull request Nov 11, 2014
@s1monw s1monw [TEST] Disable compression in BWC test for version < 1.3.2
The compression bug fixed in #7210 can still strike us since we are
running BWC test against these version. This commit disables compression
forcefully if the compatibility version is < 1.3.2 to prevent debugging
already known issues.
16cb0dc
@rjernst rjernst deleted the rjernst:fix/compress-corruption branch Jan 21, 2015
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this pull request Mar 2, 2015
@s1monw s1monw [RECOVERY] Don't recover from buggy version
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around #7210

Closes #9922
dd78370
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this pull request Mar 2, 2015
@s1monw s1monw [RECOVERY] Don't recover from buggy version
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around #7210

Closes #9922

Conflicts:
	src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
8376043
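For context on what these commits describe: recovery now refuses a compressed recovery whose source node predates the fix. A hedged free-standing sketch of that version gate, not the actual dd78370 diff; in the real commit this logic lives inside the recovery code and uses internal types, so the class, method, and message below are assumptions in the style of the 1.x codebase:

```java
import org.elasticsearch.Version;

public class RecoveryVersionGateSketch {

    static void checkRecoverySource(Version sourceVersion, boolean compressionEnabled) {
        // Pre-1.3.2 nodes may still run the buggy LZF encoder, so refuse compressed
        // recoveries from them outright rather than risk corrupting copied data.
        if (compressionEnabled && sourceVersion.before(Version.V_1_3_2)) {
            throw new IllegalStateException("can't recover from node on version " + sourceVersion
                    + " with [indices.recovery.compress : true] due to compression bugs - see issue #7210");
        }
    }
}
```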
@clintongormley clintongormley changed the title from Internal: Fix a very rare case of corruption in compression used for internal cluster communication. to Fix a very rare case of corruption in compression used for internal cluster communication. Jun 7, 2015
@taf2
taf2 commented Jun 18, 2015

Upgrading from 1.1.1 to 1.6.0 and noticing this output from our cluster:

```
insertOrder timeInQueue priority source
      37659        27ms HIGH     shard-failed ([callers][2], node[Ko3b9KsESN68lTkPtVrHKw], relocating [4mcZCKvBRoKQJS_StGNPng], [P], s[INITIALIZING]), reason [shard failure [failed recovery][RecoveryFailedException[[callers][2]: Recovery failed from [aws_el1][4mcZCKvBRoKQJS_StGNPng][ip-10-55-11-210][inet[/10.55.11.210:9300]]{rack=useast1, master=true, zone=zonea} into [aws_el1a][Ko3b9KsESN68lTkPtVrHKw][ip-10-55-11-211][inet[/10.55.11.211:9300]]{rack=useast1, zone=zonea, master=true} (unexpected error)]; nested: ElasticsearchIllegalStateException[Can't recovery from node [aws_el1][4mcZCKvBRoKQJS_StGNPng][ip-10-55-11-210][inet[/10.55.11.210:9300]]{rack=useast1, master=true, zone=zonea} with [indices.recovery.compress : true] due to compression bugs -  see issue #7210 for details]; ]]
```

what do we do?
@rjernst
Member
rjernst commented Jun 18, 2015

@taf2 Turn off compression before upgrading.

@taf2
taf2 commented Jun 18, 2015

@rjernst thanks! Which kind of compression do we disable?

Is it this option in /etc/elasticsearch/elasticsearch.yml:

#transport.tcp.compress: true

Or another option?

@taf2
taf2 commented Jun 18, 2015

Okay, sorry, it looks like we need to disable indices.recovery.compress. But is this something that needs to be disabled on all nodes in the cluster, or just the new 1.6.0 node we're starting up now?

@rjernst
Member
rjernst commented Jun 18, 2015

All nodes in the cluster, before starting the upgrade. The problem is that old nodes with this setting enabled would use the old buggy code, which can then cause data copied between an old and a new node to become corrupted.

@taf2
taf2 commented Jun 18, 2015

Excellent, thank you. We have run the following on the existing cluster:

curl -XPUT localhost:9200/_cluster/settings -d '{"transient" : {"indices.recovery.compress" : false }}'
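
A related detail worth noting for later readers: transient cluster settings do not survive a full cluster restart, so if the upgrade involves restarting every node it may be safer to also set the persistent variant, i.e. the same command with "persistent" in place of "transient":

curl -XPUT localhost:9200/_cluster/settings -d '{"persistent" : {"indices.recovery.compress" : false }}'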
@taf2
taf2 commented Jun 18, 2015

Thank you that did the trick!

@mute mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
@s1monw s1monw [RECOVERY] Don't recover from buggy version
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around #7210

Closes #9922

Conflicts:
	src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
793c9e2