-
Notifications
You must be signed in to change notification settings - Fork 414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RATIS-2147. Md5 mismatch when snapshot install #1142
RATIS-2147. Md5 mismatch when snapshot install #1142
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@133tosakarin , thanks a lot for working on this! Please remove the unused code instead of commenting them out. See also the comment inlined and below
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
index 794604d66b..88a7555dcb 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
@@ -27,7 +27,6 @@ import org.apache.ratis.statemachine.SnapshotInfo;
import org.apache.ratis.statemachine.StateMachine;
import org.apache.ratis.statemachine.StateMachineStorage;
import org.apache.ratis.util.FileUtils;
-import org.apache.ratis.util.JavaUtils;
import org.apache.ratis.util.MD5FileUtil;
import org.apache.ratis.util.MemoizedSupplier;
import org.apache.ratis.util.Preconditions;
@@ -63,7 +62,6 @@ public class SnapshotManager {
private final Supplier<File> snapshotDir;
private final Supplier<File> snapshotTmpDir;
private final Function<FileChunkProto, String> getRelativePath;
- private final Supplier<MessageDigest> digester = JavaUtils.memoize(MD5Hash::getDigester);
SnapshotManager(RaftPeerId selfId, Supplier<RaftStorageDirectory> dir, StateMachineStorage smStorage) {
this.selfId = selfId;
@@ -88,7 +86,6 @@ public class SnapshotManager {
}
// create the temp snapshot file and put padding inside
out = FileUtils.newFileChannel(tmpSnapshotFile, StandardOpenOption.WRITE, StandardOpenOption.CREATE);
- digester.get().reset();
} else {
if (!exists) {
throw new FileNotFoundException("Chunk offset is non-zero but file is not found: " + tmpSnapshotFile
@@ -114,7 +111,7 @@ public class SnapshotManager {
// TODO: Make sure that subsequent requests for the same installSnapshot are coming in order,
// and are not lost when whole request cycle is done. Check requestId and requestIndex here
-
+ final MessageDigest digester = MD5Hash.getDigester();
for (FileChunkProto chunk : snapshotChunkRequest.getFileChunksList()) {
SnapshotInfo pi = stateMachine.getLatestSnapshot();
if (pi != null && pi.getTermIndex().getIndex() >= lastIncludedIndex) {
@@ -128,7 +125,7 @@ public class SnapshotManager {
try (FileChannel out = open(chunk, tmpSnapshotFile)) {
final ByteBuffer data = chunk.getData().asReadOnlyByteBuffer();
- digester.get().update(data.duplicate());
+ digester.update(data.duplicate());
int written = 0;
for(; data.remaining() > 0; ) {
@@ -144,7 +141,7 @@ public class SnapshotManager {
new MD5Hash(chunk.getFileDigest().toByteArray());
// calculate the checksum of the snapshot file and compare it with the
// file digest in the request
- final MD5Hash digest = new MD5Hash(digester.get().digest());
+ final MD5Hash digest = new MD5Hash(digester.digest());
if (!digest.equals(expectedDigest)) {
LOG.warn("The snapshot md5 digest {} does not match expected {}",
digest, expectedDigest);
MessageDigest digester = newMd5Digest(); | ||
digester.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use MD5Hash.getDigester()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I am not sure if it is a problem with JavaUtils.memoize, but I can confirm that it is a problem with digest, my idea is to use a new digest each time, which is more secure and reliable, and also avoids the situation where the digest calculation buffer is destroyed and the final result is incorrect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MD5Hash.getDigester()
is safe since it is a Threadlocal
and reset the digest
every time, i.e. the same thread will always use the same digest
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ThreadLocal variables should be safe. However, we have stably reproduced this issue in multiple different testing environments, and we are also identifying whether there is a problem with the JDK version we are using. During this process, we have made an important discovery that even if we receive an empty file, Corrupt will still appear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. For an empty file, we probably need to handle it differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well! I am testing in our environment to see if there are still problems.
Before that, I will convert this PR to draft
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just taking a quick look. I think simply removing the MemoizedSupplier
(just calling MD5Hash#getDigester
every time) should technically work. From what I see, MemoizedSupplier
only call the MessageDigest#reset
at the first invocation. The subsequent MemoizedSupplier
invocation will just return the MessageDigest
instance without calling the MessageDigest#reset
which might cause the mismatch issue.
Ozone encountered some thread local MD5 issue (apache/ozone#6435), might be related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just taking a quick look. I think simply removing the
MemoizedSupplier
(just callingMD5Hash#getDigester
every time) should technically work. From what I see,MemoizedSupplier
only call theMessageDigest#reset
at the first invocation. The subsequentMemoizedSupplier
invocation will just return theMessageDigest
instance without calling theMessageDigest#reset
which might cause the mismatch issue.Ozone encountered some thread local MD5 issue (apache/ozone#6435), might be related.
Thank you for your comment!
I also think that whether to remove MemoizedSupplier or not is irrelevant to the problem. My idea is to use a new digester for each file to solve this problem well.
I have pushed the code to this PR, you can take a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, instantiating a new digester should also be fine, but there might be more memory overhead and short lived objects compared to the ThreadLocal implementation (which might also be fine if the operation is infrequent).
If the issue can be resolved by simply removing the MemoizedSupplier
, I think we can consider this instead of instantiating a new digester every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just taking a quick look. I think simply removing the
MemoizedSupplier
(just callingMD5Hash#getDigester
every time) should technically work. From what I see,MemoizedSupplier
only call theMessageDigest#reset
at the first invocation. The subsequentMemoizedSupplier
invocation will just return theMessageDigest
instance without calling theMessageDigest#reset
which might cause the mismatch issue.Ozone encountered some thread local MD5 issue (apache/ozone#6435), might be related.
In fact, digester.get ().reset () will be executed every time the first chunk is received.Just taking a quick look. I think simply removing the
MemoizedSupplier
(just callingMD5Hash#getDigester
every time) should technically work. From what I see,MemoizedSupplier
only call theMessageDigest#reset
at the first invocation. The subsequentMemoizedSupplier
invocation will just return theMessageDigest
instance without calling theMessageDigest#reset
which might cause the mismatch issue.
In addition, I have done JMH testing, and it takes 0.5s for new 1000000 digests, but in fact, we do not send so many files so frequently, and this overhead is not a problem
We spent a lot of time troubleshooting this issue, and what we were able to confirm in the end is that there seems to be some problem with However, we found that instead of using reset, creating a new digester each time a new file appears should solve the issue (our testing process can verify this). Therefore, I recommend making this change directly. What's your opinion? @szetszwo |
@szetszwo What do you think of this way? |
private String randomString(int length) {
|
ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
Outdated
Show resolved
Hide resolved
digester = MessageDigest.getInstance("MD5"); | ||
digester.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think calling MessageDigest#reset
here is unnecessary since a newly instantiated MessageDigest
should already be "reset".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your reminder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@133tosakarin , thanks for working on this!
Could we covert the digester
to a local variable?
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
index 794604d66b..6f36757189 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
@@ -27,7 +27,6 @@ import org.apache.ratis.statemachine.SnapshotInfo;
import org.apache.ratis.statemachine.StateMachine;
import org.apache.ratis.statemachine.StateMachineStorage;
import org.apache.ratis.util.FileUtils;
-import org.apache.ratis.util.JavaUtils;
import org.apache.ratis.util.MD5FileUtil;
import org.apache.ratis.util.MemoizedSupplier;
import org.apache.ratis.util.Preconditions;
@@ -63,7 +62,6 @@ public class SnapshotManager {
private final Supplier<File> snapshotDir;
private final Supplier<File> snapshotTmpDir;
private final Function<FileChunkProto, String> getRelativePath;
- private final Supplier<MessageDigest> digester = JavaUtils.memoize(MD5Hash::getDigester);
SnapshotManager(RaftPeerId selfId, Supplier<RaftStorageDirectory> dir, StateMachineStorage smStorage) {
this.selfId = selfId;
@@ -88,7 +86,6 @@ public class SnapshotManager {
}
// create the temp snapshot file and put padding inside
out = FileUtils.newFileChannel(tmpSnapshotFile, StandardOpenOption.WRITE, StandardOpenOption.CREATE);
- digester.get().reset();
} else {
if (!exists) {
throw new FileNotFoundException("Chunk offset is non-zero but file is not found: " + tmpSnapshotFile
@@ -112,9 +109,9 @@ public class SnapshotManager {
LOG.info("Installing snapshot:{}, to tmp dir:{}",
ServerStringUtils.toInstallSnapshotRequestString(request), tmpDir);
+ final MessageDigest digester = MD5Hash.getDigester();
// TODO: Make sure that subsequent requests for the same installSnapshot are coming in order,
// and are not lost when whole request cycle is done. Check requestId and requestIndex here
-
for (FileChunkProto chunk : snapshotChunkRequest.getFileChunksList()) {
SnapshotInfo pi = stateMachine.getLatestSnapshot();
if (pi != null && pi.getTermIndex().getIndex() >= lastIncludedIndex) {
@@ -128,7 +125,7 @@ public class SnapshotManager {
try (FileChannel out = open(chunk, tmpSnapshotFile)) {
final ByteBuffer data = chunk.getData().asReadOnlyByteBuffer();
- digester.get().update(data.duplicate());
+ digester.update(data.duplicate());
int written = 0;
for(; data.remaining() > 0; ) {
@@ -144,7 +141,7 @@ public class SnapshotManager {
new MD5Hash(chunk.getFileDigest().toByteArray());
// calculate the checksum of the snapshot file and compare it with the
// file digest in the request
- final MD5Hash digest = new MD5Hash(digester.get().digest());
+ final MD5Hash digest = new MD5Hash(digester.digest());
if (!digest.equals(expectedDigest)) {
LOG.warn("The snapshot md5 digest {} does not match expected {}",
digest, expectedDigest);
@@ -63,7 +64,7 @@ public class SnapshotManager { | |||
private final Supplier<File> snapshotDir; | |||
private final Supplier<File> snapshotTmpDir; | |||
private final Function<FileChunkProto, String> getRelativePath; | |||
private final Supplier<MessageDigest> digester = JavaUtils.memoize(MD5Hash::getDigester); | |||
private MessageDigest digester; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move it to a local variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explained it in the code snippet you provided, please take a look.
Thanks for running the benchmarks! Let's use a new digester. How about moving the new MD5 method to diff --git a/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java b/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java
index e60bef9652..8aacd9a65d 100644
--- a/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java
+++ b/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java
@@ -29,14 +29,15 @@ import java.util.Arrays;
public class MD5Hash {
public static final int MD5_LEN = 16;
- private static final ThreadLocal<MessageDigest> DIGESTER_FACTORY =
- ThreadLocal.withInitial(() -> {
+ private static final ThreadLocal<MessageDigest> DIGESTER_FACTORY = ThreadLocal.withInitial(MD5Hash::newDigester);
+
+ public static MessageDigest newDigester() {
try {
return MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException e) {
- throw new RuntimeException(e);
+ throw new IllegalStateException("Failed to create MessageDigest for MD5", e);
}
- });
+ }
private byte[] digest;
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
index 794604d66b..5b5158aebd 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
@@ -112,9 +109,9 @@ public class SnapshotManager {
LOG.info("Installing snapshot:{}, to tmp dir:{}",
ServerStringUtils.toInstallSnapshotRequestString(request), tmpDir);
+ final MessageDigest digester = MD5Hash.newDigester();
// TODO: Make sure that subsequent requests for the same installSnapshot are coming in order,
// and are not lost when whole request cycle is done. Check requestId and requestIndex here
-
for (FileChunkProto chunk : snapshotChunkRequest.getFileChunksList()) {
SnapshotInfo pi = stateMachine.getLatestSnapshot();
if (pi != null && pi.getTermIndex().getIndex() >= lastIncludedIndex) { |
Good idea, but there are some problems in the provided code, we should construct digester when receiving the first chunk and calculate md5 after receiving a complete file. Therefore, we add a MessageDigest member variable in SnapshotManager and assign it to the first chunk every time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@133tosakarin , thanks for the update! The change looks good. Could you fix the checkstyle issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 the change looks good.
Maybe we need a atomicReference for this digester? @szetszwo |
@OneSizeFitsQuorum , That's a good point! We should have an |
see https://issues.apache.org/jira/browse/RATIS-2147