
intermittent MD5 checksum failures when snapshotting using repository-s3 #25269

Closed
joachimdraeger opened this issue Jun 16, 2017 · 1 comment

@joachimdraeger (Contributor) commented Jun 16, 2017

Elasticsearch version: 5.4.0, master

Plugins installed: repository-s3

JVM version: 1.8.0_131-b11

OS version: Darwin Kernel Version 15.4.0: Fri Feb 26 22:08:05 PST 2016; root:xnu-3248.40.184~3/RELEASE_X86_64 x86_64

Description of the problem including expected versus actual behavior:

We perform hourly snapshots of 10 daily indices. Every day, 2 to 5 snapshot or delete-snapshot operations fail with an exception like this:
AmazonS3Exception: MD5 local [882AktFOIjwJNFwKqtfj+Q==], remote [LMtTgWbPjpF+Ia25VQtfwg==] are not equal...
We never experienced such failures on 2.4.5, and S3 uploads through the AWS SDK should be quite reliable.

Analysis:
AmazonHttpClient uses mark()/reset() on the InputStream to retry failed uploads.
The checksum used by the repository-s3 code is calculated in DefaultS3OutputStream.doUpload() via a java.security.DigestInputStream, whose reset() rewinds the wrapped stream but does not rewind the digest.
When AmazonHttpClient retries after some bytes have already been streamed through the digest, those bytes are hashed a second time and the local checksum is corrupted.
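The mark/reset interaction can be reproduced in isolation with only JDK classes (a minimal sketch; the class name, the payload, and the 8-byte partial read are illustrative, not taken from the actual repository-s3 code):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Base64;

// DigestInputStream inherits mark()/reset() from FilterInputStream, which
// delegate to the wrapped stream but never rewind the MessageDigest. A retry
// that resets the stream therefore hashes the already-read bytes twice.
public class DigestRetryDemo {

    // Simulates a retried upload: read a few bytes, reset (as the AWS retry
    // logic would), then stream the whole payload again.
    public static boolean checksumSurvivesRetry() throws Exception {
        byte[] payload = "snapshot blob contents".getBytes(StandardCharsets.UTF_8);
        String expected = Base64.getEncoder().encodeToString(
                MessageDigest.getInstance("MD5").digest(payload));

        DigestInputStream in = new DigestInputStream(
                new ByteArrayInputStream(payload), MessageDigest.getInstance("MD5"));
        in.mark(payload.length);        // client marks before the attempt
        in.read(new byte[8], 0, 8);     // first attempt streams 8 bytes, then "fails"
        in.reset();                     // retry rewinds the stream, NOT the digest
        byte[] buf = new byte[16];
        while (in.read(buf, 0, buf.length) != -1) {
            // retry re-reads the entire payload through the digest
        }
        String local = Base64.getEncoder().encodeToString(
                in.getMessageDigest().digest());
        return expected.equals(local);  // false: 8 bytes were hashed twice
    }

    public static void main(String[] args) throws Exception {
        System.out.println(checksumSurvivesRetry()); // prints "false"
    }
}
```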

The checksum calculation in the repository-s3 code is redundant: AWS' AmazonS3Client.putObject() already calculates a checksum locally and compares it with the one computed server side. AWS' implementation, MD5DigestCalculatingInputStream, does support mark/reset across retries.

The checksum calculation in the repository-s3 code also contains a second bug: it compares its local checksum against putObjectResult.getContentMd5(), which is itself the local checksum calculated by the S3 client, so the comparison never involves the remote value at all. The MD5 calculated remotely is returned in the ETag property.
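The two values don't even share an encoding: the Content-MD5 value is the Base64-encoded digest the client sent, while for a non-multipart PUT S3 reports its server-side MD5 as the hex-encoded ETag. A small sketch (class and method names are illustrative):

```java
import java.security.MessageDigest;
import java.util.Base64;

// getContentMd5() carries the Base64-encoded MD5 the client computed locally;
// for a non-multipart PUT, S3 returns the digest it computed server-side as
// the hex-encoded ETag. Comparing getContentMd5() with another locally
// computed Base64 digest therefore compares local to local.
public class Md5Encodings {

    // Base64 form, as carried in the Content-MD5 header (local side).
    public static String base64Md5(byte[] data) throws Exception {
        return Base64.getEncoder().encodeToString(
                MessageDigest.getInstance("MD5").digest(data));
    }

    // Hex form, as S3 reports it in the ETag (remote side, non-multipart).
    public static String hexMd5(byte[] data) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (byte b : MessageDigest.getInstance("MD5").digest(data)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] empty = new byte[0];
        System.out.println(base64Md5(empty)); // 1B2M2Y8AsgTpgAmY7PhCfg==
        System.out.println(hexMd5(empty));    // d41d8cd98f00b204e9800998ecf8427e
    }
}
```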

The solution is to remove the faulty MD5 code from repository-s3, because the S3 client already performs a checksum comparison that is retry-safe.
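For reference, the retry-safe technique used by the SDK's MD5DigestCalculatingInputStream can be sketched as snapshotting the digest state on mark() and restoring it on reset(). This is a hypothetical illustration, not the SDK's actual code, and it relies on the standard provider's MD5 implementation supporting clone():

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical sketch: rewind the digest together with the stream by cloning
// the MessageDigest on mark() and restoring the clone on reset().
public class RetrySafeDigestStream extends FilterInputStream {
    private MessageDigest digest;
    private MessageDigest marked;

    public RetrySafeDigestStream(InputStream in, MessageDigest digest) {
        super(in);
        this.digest = digest;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) digest.update((byte) b);
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n = super.read(b, off, len);
        if (n > 0) digest.update(b, off, n);
        return n;
    }

    @Override
    public synchronized void mark(int readlimit) {
        marked = cloneOf(digest);  // snapshot the digest state
        super.mark(readlimit);
    }

    @Override
    public synchronized void reset() throws IOException {
        digest = cloneOf(marked);  // rewind the digest along with the stream
        super.reset();
    }

    public byte[] digest() {
        return digest.digest();
    }

    private static MessageDigest cloneOf(MessageDigest md) {
        try {
            return (MessageDigest) md.clone(); // MD5 in the SUN provider is cloneable
        } catch (CloneNotSupportedException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simulates a retried upload: partial read, reset, then a full re-read.
    public static boolean checksumSurvivesRetry() throws Exception {
        byte[] payload = "snapshot blob contents".getBytes(StandardCharsets.UTF_8);
        byte[] expected = MessageDigest.getInstance("MD5").digest(payload);

        RetrySafeDigestStream in = new RetrySafeDigestStream(
                new ByteArrayInputStream(payload), MessageDigest.getInstance("MD5"));
        in.mark(payload.length);
        in.read(new byte[8], 0, 8);  // partial first attempt
        in.reset();                  // digest is rewound together with the stream
        byte[] buf = new byte[16];
        while (in.read(buf, 0, buf.length) != -1) {
        }
        return MessageDigest.isEqual(expected, in.digest());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(checksumSurvivesRetry()); // prints "true"
    }
}
```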

Steps to reproduce:

In production, the error will eventually occur, depending on how stable your network connection to S3 is.

In a test environment, the error will eventually occur when creating snapshots in a loop.

The behavior can also be triggered deterministically with a debugger:

  1. Enable remote debugging on a local ES installation
  2. Create an index
  3. Register S3 repository
  4. Put a breakpoint in the InputStream chain after the java.security.DigestInputStream, on the return statement of a public int read(byte[] b, int off, int len) method, e.g. in MD5DigestCalculatingInputStream line 102 (for master).
  5. Perform a snapshot.
  6. When the breakpoint is hit, the DigestInputStream has already updated its hash. Disable your network connection and continue the program.
  7. The next time the breakpoint is hit, the S3 client is retrying. Re-enable the network connection and continue the program.
  8. Observe the exception in the logs.

Provide logs (if relevant):

[2017-06-09T16:50:02,172][INFO ][o.e.s.SnapshotShardsService] [xrZgqIf] snapshot [elasticsearch-local:20170609t1525-140/4UI1XpYuRhutCG82cfUThA] is done
[2017-06-09T16:50:13,913][WARN ][o.e.s.SnapshotShardsService] [xrZgqIf] [[index4][3]] [elasticsearch-local:20170609t1525-141/SYTtvRUGSiyEBtsMmBvWMw] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to write commit point
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1231) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:815) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:380) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:334) [elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.io.IOException: Unable to upload object elasticsearch/local600/indices/5v3fWPPXQdiTuLkWoXgs1Q/3/snap-SYTtvRUGSiyEBtsMmBvWMw.dat
	at org.elasticsearch.repositories.s3.DefaultS3OutputStream.upload(DefaultS3OutputStream.java:112) ~[?:?]
	at org.elasticsearch.repositories.s3.DefaultS3OutputStream.flush(DefaultS3OutputStream.java:99) ~[?:?]
	at org.elasticsearch.repositories.s3.S3OutputStream.flushBuffer(S3OutputStream.java:69) ~[?:?]
	at org.elasticsearch.repositories.s3.S3OutputStream.close(S3OutputStream.java:87) ~[?:?]
	at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$writeBlob$2(S3BlobContainer.java:98) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_131]
	at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedIOException(SocketAccess.java:48) ~[?:?]
	at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:95) ~[?:?]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeBlob(ChecksumBlobStoreFormat.java:187) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.write(ChecksumBlobStoreFormat.java:157) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1229) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	... 9 more
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: MD5 local [882AktFOIjwJNFwKqtfj+Q==], remote [LMtTgWbPjpF+Ia25VQtfwg==] are not equal... (Service: null; Status Code: 0; Error Code: null; Request ID: null)
	at org.elasticsearch.repositories.s3.DefaultS3OutputStream.doUpload(DefaultS3OutputStream.java:145) ~[?:?]
	at org.elasticsearch.repositories.s3.DefaultS3OutputStream.upload(DefaultS3OutputStream.java:110) ~[?:?]
	at org.elasticsearch.repositories.s3.DefaultS3OutputStream.flush(DefaultS3OutputStream.java:99) ~[?:?]
	at org.elasticsearch.repositories.s3.S3OutputStream.flushBuffer(S3OutputStream.java:69) ~[?:?]
	at org.elasticsearch.repositories.s3.S3OutputStream.close(S3OutputStream.java:87) ~[?:?]
	at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$writeBlob$2(S3BlobContainer.java:98) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_131]
	at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedIOException(SocketAccess.java:48) ~[?:?]
	at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:95) ~[?:?]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeBlob(ChecksumBlobStoreFormat.java:187) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.write(ChecksumBlobStoreFormat.java:157) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1229) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	... 9 more
[2017-06-09T16:50:16,866][INFO ][o.e.s.SnapshotShardsService] [xrZgqIf] snapshot [elasticsearch-local:20170609t1525-141/SYTtvRUGSiyEBtsMmBvWMw] is done
joachimdraeger added a commit to joachimdraeger/elasticsearch that referenced this issue Jun 16, 2017
Remove redundant and not resettable (fails on retries) check-summing. Checksums are calculated and compared by the S3 client already. elastic#25269

@abeyad abeyad closed this in 98b0267 Jun 21, 2017

abeyad pushed a commit that referenced this issue Jun 21, 2017
Remove redundant and broken MD5 checksum from repository-s3 (#25270)
Remove redundant and not resettable (fails on retries) check-summing. Checksums are calculated and compared by the S3 client already. 

Closes #25269
utkarshsins added a commit to rahul1193/elasticsearch that referenced this issue Jun 29, 2017
Remove redundant and broken MD5 checksum from repository-s3 (elastic#25270)

Remove redundant and not resettable (fails on retries) check-summing. Checksums are calculated and compared by the S3 client already. 

Closes elastic#25269
@joachimdraeger (Contributor, Author) commented Oct 20, 2017 (comment minimized)