Skip to content

RATIS-1587. Fix snapshot multi-chunk bug & support snapshot hierarchy#655

Merged
szetszwo merged 4 commits intoapache:masterfrom
SzyWilliam:snapshot_bugfix
Jun 8, 2022
Merged

RATIS-1587. Fix snapshot multi-chunk bug & support snapshot hierarchy#655
szetszwo merged 4 commits intoapache:masterfrom
SzyWilliam:snapshot_bugfix

Conversation

@SzyWilliam
Copy link
Member

What changes were proposed in this pull request?

  • Fix snapshot multiple-chunk bug. Currently, when leader install a snapshot(multiple chunks) to a newly joined follower, leader will send multiple InstallSnapshot RPCs. However, each RPC will create a tmp dir with Random UUID, place the chunk in this tmp dir, and only renames the last tmp dir to sm-dir. In this PR, I propose to create tmp dir using request.uuid(), which remains unchanged among multiple RPCs.

  • Fix Grpc Stream errors. Currently In grpc.proto, InstallSnapshot is declaimed as client-end streaming rpc, but it is actual bi-directional streaming rpc. In this PR, I addded stream to InstallSnapshot proto so that it becomes bi-directional.

  • Support snapshot file hierarchy. Currently all files of a snapshot will be placed in statemachine dir and file hierarchy is flattened. In this PR, I name each file using its original filename (which contains hierarchy information).

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-1587

How was this patch tested?

  • Unit Tests. It passed all unit tests in my machine (MacOS 12.13)
  • Manual Test. I packaged this version of ratis and tested its snapshot functionality in our project Apache IoTDB, and everything is working as expected

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SzyWilliam , thanks a lot for working on this. Some comments/questions inlined. Please see if you could create a unit test.

Comment on lines +86 to +87
String fileNameToStateMachineDir = fileName.substring(
(dir.STATE_MACHINE_DIR_NAME.length()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need handle the path separator. How about using Path to relativize it?

@@ -74,6 +74,7 @@ public class SnapshotManager {
     // TODO: Make sure that subsequent requests for the same installSnapshot are coming in order,
     // and are not lost when whole request cycle is done. Check requestId and requestIndex here
 
+    final Path stateMachineDir = dir.getStateMachineDir().toPath();
     for (FileChunkProto chunk : snapshotChunkRequest.getFileChunksList()) {
       SnapshotInfo pi = stateMachine.getLatestSnapshot();
       if (pi != null && pi.getTermIndex().getIndex() >= lastIncludedIndex) {
@@ -83,9 +84,9 @@ public class SnapshotManager {
       }
 
       String fileName = chunk.getFilename(); // this is relative to the root dir
-      // TODO: assumes flat layout inside SM dir
-      File tmpSnapshotFile = new File(tmpDir,
-          new File(dir.getRoot(), fileName).getName());
+      final Path relative = stateMachineDir.relativize(new File(dir.getRoot(), fileName).toPath());
+      final File tmpSnapshotFile = new File(tmpDir, relative.toString());
+      FileUtils.createDirectories(tmpSnapshotFile);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great, I'll take this way


rpc installSnapshot(stream ratis.common.InstallSnapshotRequestProto)
returns(ratis.common.InstallSnapshotReplyProto) {}
returns(stream ratis.common.InstallSnapshotReplyProto) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an incompatible change? If yes, we should document it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a bug when using GRPC as communication protocol and involves multiple snapshot chunks. The handler of InstallSnapshot is implemented bidirectional-streaming, but the proto is declaimed client-streaming. The mismatch will cause the leader received a HTTP RST CANCEL for the last installSnapshot request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is a bug. If it is an incompatible change, we still have to document it even for a bug since it will break existing, working applications. Currently, applications with single InstallSnapshotRequestProto are working.

Indeed, it may not be incompatible. Could you test it if an old server (without stream) can talk to a new server (with stream) ?

@SzyWilliam
Copy link
Member Author

@szetszwo I added unit test which covers the scenario of Leader InstallSnapshot to Followers with snapshot containing multiple files in nested folder.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SzyWilliam , thanks for the update! Both new files need license header. Please test if adding stream is incompatible or not.

@@ -0,0 +1,154 @@
package org.apache.ratis;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need license header.

@SzyWilliam
Copy link
Member Author

SzyWilliam commented Jun 8, 2022

@szetszwo Add License header. Also, I did the test and I think the stream change is compatible. That's how I did the test:
I compiled the Arithmetic example in ratis-examples both with&w/o stream. Then I started Arithmetic Server w/o stream as leader, applied trxs, took a snaphot and purged logs. After that I started Arithmetic Server with stream and joined it to the group. I discovered that the leader successfully installed the snapshot to the new follower, which means old server w/o stream can install single-chunk snapshot to new server with stream.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@szetszwo szetszwo merged commit 3c4e6f1 into apache:master Jun 8, 2022
@szetszwo
Copy link
Contributor

szetszwo commented Jun 8, 2022

@SzyWilliam , thanks a lot for testing the compatibility!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants