Node does not catch up after restart in RAFT cluster #978
Comments
The problem is independent of the primitive used (I reproduced this also with DistributedMap). The key point is that two of the nodes wrote a lot of events and compacted their log while the third node was unavailable. My assumption is that the third node wants to append the events received from the leader, but can't find the corresponding file (for example …).
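For illustration, here is a minimal, generic sketch of the decision this scenario forces on the leader (hypothetical names, not Atomix's actual code): once the leader has compacted its log past a follower's position, the missing entries no longer exist on disk, so the only way to bring that follower up to date is to install a snapshot instead of appending entries.

// Generic, simplified leader-side replication choice; not Atomix code.
final class ReplicationPlanner {

  enum Action { APPEND_ENTRIES, INSTALL_SNAPSHOT }

  private final long snapshotIndex; // entries up to this index were compacted away
  private final long lastLogIndex;  // leader's last log entry

  ReplicationPlanner(long snapshotIndex, long lastLogIndex) {
    this.snapshotIndex = snapshotIndex;
    this.lastLogIndex = lastLogIndex;
  }

  // Decide how to bring a follower whose next expected index is followerNextIndex up to date.
  Action planFor(long followerNextIndex) {
    if (followerNextIndex <= snapshotIndex) {
      // The follower needs entries that were already compacted: only a snapshot can help.
      return Action.INSTALL_SNAPSHOT;
    }
    // The follower is still within the retained part of the log: a normal append works.
    return Action.APPEND_ENTRIES;
  }

  public static void main(String[] args) {
    // Two nodes compacted up to index 10_000 while the third node stopped much earlier.
    ReplicationPlanner planner = new ReplicationPlanner(10_000, 12_500);
    System.out.println(planner.planFor(43));     // INSTALL_SNAPSHOT
    System.out.println(planner.planFor(11_000)); // APPEND_ENTRIES
  }
}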
I'm able to reproduce this behavior with the following unit test:

@Test
public void shouldCatchup() throws Throwable {
  // given: a three node cluster where one node is shut down
  createServers(3);
  servers.get(0).shutdown();

  RaftClient client = createClient();
  TestPrimitive primitive = createPrimitive(client);

  final int entries = 10;
  final int entrySize = 1024;
  final String entry = RandomStringUtils.random(entrySize);
  for (int i = 0; i < entries; i++) {
    primitive.write(entry).get(1_000, TimeUnit.MILLISECONDS);
  }

  // when: the two remaining nodes compact their logs
  CompletableFuture
      .allOf(servers.get(1).compact(), servers.get(2).compact())
      .get(15_000, TimeUnit.MILLISECONDS);

  // then: the stopped node should be able to rejoin and catch up
  final RaftServer server = createServer(members.get(0).memberId());
  final List<MemberId> memberIds = members.stream()
      .map(RaftMember::memberId)
      .collect(Collectors.toList());
  server.join(memberIds).get(10_000, TimeUnit.MILLISECONDS);
}
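The test above only asserts that join() completes within the timeout. A possible follow-up would be to poll until the rejoined member has actually caught up; the helper below is just a sketch in plain Java, and the predicate in the usage comment (isRunning(), or comparing commit indexes) is an assumption about what the server and test harness expose.

import java.time.Duration;
import java.util.function.BooleanSupplier;

// Generic polling helper; nothing Atomix-specific in here.
final class Await {
  static void until(BooleanSupplier condition, Duration timeout) throws InterruptedException {
    final long deadline = System.nanoTime() + timeout.toNanos();
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() > deadline) {
        throw new AssertionError("condition not met within " + timeout);
      }
      Thread.sleep(100);
    }
  }
}

// Possible usage at the end of the test (the predicate is an assumption about
// what the server/test harness exposes, e.g. a running state or a matching commit index):
// Await.until(() -> server.isRunning(), Duration.ofSeconds(10));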
Referenced in a controller update from branch 'stable/magnesium' (31dd70488c7396e91abc51f9a9d594eaaac7f849):
- Bump atomix to 3.1.7
  No changelogs, but the following issue is fixed:
  atomix/atomix#978
  Change-Id: Ib9248d7b9cc5e50d1789778b18650f5b7804e802
  Signed-off-by: Robert Varga <robert.varga@pantheon.tech>
Expected behavior
I expected that the node with id member2 catches up with the other nodes after restart.
Actual behavior
Currently the following exceptions are thrown:
Steps to reproduce
Under high load (while member2 is stopped):
After the client is stopped and the cluster is idle again:
Compaction is done: the oldest logs are deleted and new snapshots are written on the nodes.
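To double-check that compaction actually removed the old segment files, one can simply list the partition's storage directory before and after compaction. This is plain Java; the directory path and the assumption that segments and snapshots live as separate files in that directory are about the local setup, not an Atomix API.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Lists the files (log segments, snapshots) in a partition's storage directory.
public class ListRaftFiles {
  public static void main(String[] args) throws IOException {
    // Hypothetical path; adjust to the storage directory configured for the node.
    Path partitionDir = Paths.get("data/raft/partition-2");
    try (Stream<Path> files = Files.list(partitionDir)) {
      files.sorted().forEach(p -> {
        try {
          System.out.printf("%-40s %10d bytes%n", p.getFileName(), Files.size(p));
        } catch (IOException e) {
          System.out.printf("%-40s <unreadable>%n", p.getFileName());
        }
      });
    }
  }
}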
I assume the leader sends the follower a snapshot, because on member 3 (the follower of partition 2) this is printed:
On the leader only the backup log statement is printed.
If member 2 is now restarted, it gets a lot of exceptions:
It seems that the snapshot was replicated correctly, but appending to the log fails.
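That hypothesis fits the usual Raft consistency check on the follower side. Below is a generic, simplified sketch of that check (hypothetical names, not Atomix's actual implementation): an append is only acceptable if the entry it builds on is still covered by the snapshot or present in the local log, so if the log is not reset to the snapshot boundary after the snapshot is installed, later appends reference indexes whose backing files are gone.

// Generic, simplified follower-side append check; not Atomix's implementation.
final class FollowerLog {

  private long snapshotIndex; // last index covered by the installed snapshot
  private long lastIndex;     // last entry physically present in the local log

  FollowerLog(long snapshotIndex, long lastIndex) {
    this.snapshotIndex = snapshotIndex;
    this.lastIndex = lastIndex;
  }

  // Called once a snapshot received from the leader has been installed.
  void onSnapshotInstalled(long index) {
    snapshotIndex = index;
    // The log must be reset to the snapshot boundary; otherwise later appends
    // reference a range whose segment files no longer exist.
    lastIndex = index;
  }

  // Raft consistency check for an AppendEntries request that follows prevLogIndex.
  boolean canAppend(long prevLogIndex) {
    return prevLogIndex >= snapshotIndex && prevLogIndex <= lastIndex;
  }

  public static void main(String[] args) {
    FollowerLog log = new FollowerLog(0, 42);
    log.onSnapshotInstalled(10_000);
    System.out.println(log.canAppend(10_000)); // true: appends continue right after the snapshot
    System.out.println(log.canAppend(50));     // false: that range was compacted away
  }
}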
Please check out the branch bug-catch-up in my repository to reproduce this.
Environment
Linux zell-arch 4.20.1-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 9 20:25:43 UTC 2019 x86_64 GNU/Linux