-
Notifications
You must be signed in to change notification settings - Fork 565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not fail consistency check if log is empty #10463
Conversation
The check for consistency is moved to a util class, so that it can be tested using unit test. Otherwise, the only way to test is by starting a cluster of 3 nodes. This is too cumbersome, as you can see in the RaftConsistencyCheckTest.
@Rule public RaftRule raftRule = RaftRule.withBootstrappedNodes(3); | ||
|
||
@Test // Regression test for https://github.com/camunda/zeebe/issues/10451 | ||
public void shouldNotFailRestartIfFollowerCrashedBeforeCommittingSnapshot() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this test for reproducing the bug. I'm not fully satisfied with this test. May be the unit test StateUtilTest
is enough, a we can remove this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep the test 👍 I like it. I guess would be also great if we build us some more tooling to start nodes in a specific state, like here I want to start a follower with no log at certain index and no snapshot or something like that.
Give me an ~1 hour I want to take a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @deepthidevaki especially for the PR description this helped me to better understand the issue 👍
I have some remarks, but since Ole already accepted I guess you can decided what you want to do with them :)
public class StateUtil { | ||
|
||
/** throws IllegalStateException if state is inconsistent */ | ||
public static void verifySnapshotLogConsistent( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔧 I feel this method does too much. It says it verifies, but it also resets the log, which is a bit unexpected to me. Could we move this out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it there, for unit testing it easily. Otherwise, I don't know how to test it property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok then maybe we split it into two methods like, one for verify and one like handleInconsistency or ensureConsistent log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. I already triggered merge. I didn't see you reply. I had thought about it before, but didn't find it good enough. I will think about it on how to split as well as test it.
|
||
private static Stream<Arguments> provideInconsistentState() { | ||
return Stream.of( | ||
Arguments.of(0, 5, false), Arguments.of(1, 5, false), Arguments.of(3, 6, false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💭 Is it necessary to test these cases? What is the difference here? They are not triggering a different code path or? 🤔
🔧 Personally, I would like to have separate test methods with specific names instead of the parameterized tests, so I know what is actually tested here :) Feels a bit hard to determine, but maybe it is just me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's different cases - inconsistent state when there is not snapshot, when there is a snapshot. It is important to test these cases.
atomix/cluster/src/test/java/io/atomix/raft/RaftStartupConsistencyCheckTest.java
Show resolved
Hide resolved
atomix/cluster/src/test/java/io/atomix/raft/RaftStartupConsistencyCheckTest.java
Outdated
Show resolved
Hide resolved
atomix/cluster/src/test/java/io/atomix/raft/RaftStartupConsistencyCheckTest.java
Show resolved
Hide resolved
atomix/cluster/src/test/java/io/atomix/raft/RaftStartupConsistencyCheckTest.java
Show resolved
Hide resolved
@Rule public RaftRule raftRule = RaftRule.withBootstrappedNodes(3); | ||
|
||
@Test // Regression test for https://github.com/camunda/zeebe/issues/10451 | ||
public void shouldNotFailRestartIfFollowerCrashedBeforeCommittingSnapshot() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep the test 👍 I like it. I guess would be also great if we build us some more tooling to start nodes in a specific state, like here I want to start a follower with no log at certain index and no snapshot or something like that.
bors merge |
10443: Do not take a backup if it already exists r=deepthidevaki a=deepthidevaki ## Description After restore, the log is truncated to the checkpoint position. So the checkpoint record is processed again and will trigger a new backup with the same Id of the backup it restored from. With this PR, `BackupService` handles this case gracefully. In addition, we also do not take a new backup if existing backup is failed or in progress. Alternatively, we can delete this backup and take a new one. But chances of it happening (i.e triggering a new backup when one already is in progress/failed) is very low. So we can keep this simple. ## Related issues closes #10430 10463: Do not fail consistency check if log is empty r=deepthidevaki a=deepthidevaki ## Description When a follower receives a snapshot from the leader, it has to throw away the log and reset the log to `snapshotIndex + 1`. Previously we were doing it in the following order: 1. commit snapshot 2. reset In this case, if the system crashed after step 1, when the node restarts it is in an invalid state because the log is not reset after the snapshot. To prevent this case, we reset the log on startup based on the existing snapshot. This was buggy and caused issues, which was fixed by #10183. The fix was to reverse the order: 1. reset log 2. commit snapshot. So on restart, there is no need to reset the log. If the system crashes after step 1, we have any empty log and no snapshot (or a previous snapshot). This is a valid state because this follower is not in the quorum, so no data is lost. After the restart the follower will receive the snapshot and the following events. But this caused the consistency check to fail because it detected gaps between the snapshot and the first log entry. The state is not actually inconsistent, because no data is lost. So we fix this by updating the consistency check to treat this is a valid state. To make the state valid, if the log is empty, we reset it based on the available snapshot. ## Related issues closes #10451 Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Build failed (retrying...): |
10463: Do not fail consistency check if log is empty r=deepthidevaki a=deepthidevaki ## Description When a follower receives a snapshot from the leader, it has to throw away the log and reset the log to `snapshotIndex + 1`. Previously we were doing it in the following order: 1. commit snapshot 2. reset In this case, if the system crashed after step 1, when the node restarts it is in an invalid state because the log is not reset after the snapshot. To prevent this case, we reset the log on startup based on the existing snapshot. This was buggy and caused issues, which was fixed by #10183. The fix was to reverse the order: 1. reset log 2. commit snapshot. So on restart, there is no need to reset the log. If the system crashes after step 1, we have any empty log and no snapshot (or a previous snapshot). This is a valid state because this follower is not in the quorum, so no data is lost. After the restart the follower will receive the snapshot and the following events. But this caused the consistency check to fail because it detected gaps between the snapshot and the first log entry. The state is not actually inconsistent, because no data is lost. So we fix this by updating the consistency check to treat this is a valid state. To make the state valid, if the log is empty, we reset it based on the available snapshot. ## Related issues closes #10451 10482: deps(maven): bump snakeyaml from 1.32 to 1.33 r=Zelldon a=dependabot[bot] Bumps [snakeyaml](https://bitbucket.org/snakeyaml/snakeyaml) from 1.32 to 1.33. <details> <summary>Commits</summary> <ul> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/eafb23ec31a0babe591c00e1b50e557a5e3f9a1d"><code>eafb23e</code></a> [maven-release-plugin] prepare for next development iteration</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/26624702fab8e0a1c301d7fad723c048528f75c3"><code>2662470</code></a> Improve JavaDoc</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/80827798f06aeb3d4f2632b94075ca7633418829"><code>8082779</code></a> Always emit numberish strings with quotes</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/42d6c79430431fe9033d3ba50f6a7dc6798ba7ad"><code>42d6c79</code></a> Reformat test</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/1962a437263348c3b90857cda4bbfa2bd97908f8"><code>1962a43</code></a> Refactor: rename variables in Emitter</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/bc594ad6e2b87c3fc26844e407276796fd866a40"><code>bc594ad</code></a> Issue 553: honor code point limit in loadAll</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/c3e98fd755a949f65cf11f2ff39e55a1c2afd1c2"><code>c3e98fd</code></a> Update changes.xml</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/a06f76859f2f07580b1d9fa6b66ea84aaad26cf8"><code>a06f768</code></a> Remove deprecated Tag manipulation</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/5a0027a3781b92f59bf92cdeb1b7590589993efd"><code>5a0027a</code></a> Remove unused WhitespaceToken</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/3f05838828b8df36ab961bf836f373b8c20cb8ff"><code>3f05838</code></a> Improve JavaDoc</li> <li>Additional commits viewable in <a href="https://bitbucket.org/snakeyaml/snakeyaml/branches/compare/snakeyaml-1.33..snakeyaml-1.32">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.yaml:snakeyaml&package-manager=maven&previous-version=1.32&new-version=1.33)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting ``@dependabot` rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - ``@dependabot` rebase` will rebase this PR - ``@dependabot` recreate` will recreate this PR, overwriting any edits that have been made to it - ``@dependabot` merge` will merge this PR after your CI passes on it - ``@dependabot` squash and merge` will squash and merge this PR after your CI passes on it - ``@dependabot` cancel merge` will cancel a previously requested merge and block automerging - ``@dependabot` reopen` will reopen this PR if it is closed - ``@dependabot` close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - ``@dependabot` ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> 10486: test: `PartitionRestoreServiceTest` does not block on taking a backup r=oleschoenburg a=oleschoenburg We saw some unit tests timing out in `PartitionRestoreServiceTest`: ``` "ForkJoinPool-1-worker-1" #19 daemon prio=5 os_prio=0 cpu=1567.91ms elapsed=914.45s tid=0x00007facfca78b60 nid=0x15ab5 waiting on condition [0x00007facb83df000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@17.0.4.1/Native Method) - parking to wait for <0x0000000511f04c68> (a java.util.concurrent.CompletableFuture$Signaller) at java.util.concurrent.locks.LockSupport.park(java.base@17.0.4.1/LockSupport.java:211) at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.4.1/CompletableFuture.java:1864) at java.util.concurrent.ForkJoinPool.compensatedBlock(java.base@17.0.4.1/ForkJoinPool.java:3449) at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.4.1/ForkJoinPool.java:3432) at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.4.1/CompletableFuture.java:1898) at java.util.concurrent.CompletableFuture.join(java.base@17.0.4.1/CompletableFuture.java:2117) at io.camunda.zeebe.restore.PartitionRestoreServiceTest.takeBackup(PartitionRestoreServiceTest.java:212) at io.camunda.zeebe.restore.PartitionRestoreServiceTest.shouldFailToRestoreWhenSnapshotIsCorrupted(PartitionRestoreServiceTest.java:182) ``` With these changes here we ensure that the test does not wait forever on a backup, instead setting a maximum of 30 seconds. Additionally, `TestRestorableBackupStore` now fails the future when a backup is marked as failed. 10489: Do not use DefaultActorClock r=Zelldon a=Zelldon ## Description The default ActorClock is not thread safe and shouldn't be shared with multiple threads. This means we need to set the clock in the ActorClockConfiguration to null. Creating the ActorScheduler with no clock will cause that each threads gets its own clock. Note: This is a quick fix, at some point, we want to make DefaultActorClock threadsafe so we can use always the same clock. See #10400 <!-- Please explain the changes you made here. --> ## Related issues <!-- Which issues are closed by this PR or are related --> related #10400 10490: ci(macos): set code cache size of 64m r=megglos a=megglos To counter occasional out of code cache errors observed on macos builds. Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com> Co-authored-by: Christopher Zell <zelldon91@googlemail.com> Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
Build failed (retrying...): |
10463: Do not fail consistency check if log is empty r=deepthidevaki a=deepthidevaki ## Description When a follower receives a snapshot from the leader, it has to throw away the log and reset the log to `snapshotIndex + 1`. Previously we were doing it in the following order: 1. commit snapshot 2. reset In this case, if the system crashed after step 1, when the node restarts it is in an invalid state because the log is not reset after the snapshot. To prevent this case, we reset the log on startup based on the existing snapshot. This was buggy and caused issues, which was fixed by #10183. The fix was to reverse the order: 1. reset log 2. commit snapshot. So on restart, there is no need to reset the log. If the system crashes after step 1, we have any empty log and no snapshot (or a previous snapshot). This is a valid state because this follower is not in the quorum, so no data is lost. After the restart the follower will receive the snapshot and the following events. But this caused the consistency check to fail because it detected gaps between the snapshot and the first log entry. The state is not actually inconsistent, because no data is lost. So we fix this by updating the consistency check to treat this is a valid state. To make the state valid, if the log is empty, we reset it based on the available snapshot. ## Related issues closes #10451 10482: deps(maven): bump snakeyaml from 1.32 to 1.33 r=Zelldon a=dependabot[bot] Bumps [snakeyaml](https://bitbucket.org/snakeyaml/snakeyaml) from 1.32 to 1.33. <details> <summary>Commits</summary> <ul> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/eafb23ec31a0babe591c00e1b50e557a5e3f9a1d"><code>eafb23e</code></a> [maven-release-plugin] prepare for next development iteration</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/26624702fab8e0a1c301d7fad723c048528f75c3"><code>2662470</code></a> Improve JavaDoc</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/80827798f06aeb3d4f2632b94075ca7633418829"><code>8082779</code></a> Always emit numberish strings with quotes</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/42d6c79430431fe9033d3ba50f6a7dc6798ba7ad"><code>42d6c79</code></a> Reformat test</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/1962a437263348c3b90857cda4bbfa2bd97908f8"><code>1962a43</code></a> Refactor: rename variables in Emitter</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/bc594ad6e2b87c3fc26844e407276796fd866a40"><code>bc594ad</code></a> Issue 553: honor code point limit in loadAll</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/c3e98fd755a949f65cf11f2ff39e55a1c2afd1c2"><code>c3e98fd</code></a> Update changes.xml</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/a06f76859f2f07580b1d9fa6b66ea84aaad26cf8"><code>a06f768</code></a> Remove deprecated Tag manipulation</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/5a0027a3781b92f59bf92cdeb1b7590589993efd"><code>5a0027a</code></a> Remove unused WhitespaceToken</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/3f05838828b8df36ab961bf836f373b8c20cb8ff"><code>3f05838</code></a> Improve JavaDoc</li> <li>Additional commits viewable in <a href="https://bitbucket.org/snakeyaml/snakeyaml/branches/compare/snakeyaml-1.33..snakeyaml-1.32">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.yaml:snakeyaml&package-manager=maven&previous-version=1.32&new-version=1.33)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting ``@dependabot` rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - ``@dependabot` rebase` will rebase this PR - ``@dependabot` recreate` will recreate this PR, overwriting any edits that have been made to it - ``@dependabot` merge` will merge this PR after your CI passes on it - ``@dependabot` squash and merge` will squash and merge this PR after your CI passes on it - ``@dependabot` cancel merge` will cancel a previously requested merge and block automerging - ``@dependabot` reopen` will reopen this PR if it is closed - ``@dependabot` close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - ``@dependabot` ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Build failed (retrying...): |
Build succeeded: |
Successfully created backport PR #10494 for |
10494: [Backport stable/8.0] Do not fail consistency check if log is empty r=deepthidevaki a=backport-action # Description Backport of #10463 to `stable/8.0`. relates to #10451 10540: [Backport stable/8.0] Ensure retries are not interleaved even on multiple sequential calls r=deepthidevaki a=npepinpe ## Description This PR backports part of the changes found in #10289, notably the ones in the scheduler about the retry strategies (skipping the engine ones which are all around new 8.1.0 code). It seems I forgot to backport that part, as we did remove `runUntilDone` in 8.0 as well, so we should fix the retry strategies. ## Related issues backports #10289 Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com> Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
10494: [Backport stable/8.0] Do not fail consistency check if log is empty r=deepthidevaki a=backport-action # Description Backport of #10463 to `stable/8.0`. relates to #10451 10576: [Backport stable/8.0] Add verification timeout to SuccessfulDeploymentTest.shouldSendResponse r=saig0 a=backport-action # Description Backport of #10573 to `stable/8.0`. relates to #10492 Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com> Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Description
When a follower receives a snapshot from the leader, it has to throw away the log and reset the log to
snapshotIndex + 1
. Previously we were doing it in the following order:In this case, if the system crashed after step 1, when the node restarts it is in an invalid state because the log is not reset after the snapshot. To prevent this case, we reset the log on startup based on the existing snapshot. This was buggy and caused issues, which was fixed by #10183. The fix was to reverse the order:
So on restart, there is no need to reset the log. If the system crashes after step 1, we have any empty log and no snapshot (or a previous snapshot). This is a valid state because this follower is not in the quorum, so no data is lost. After the restart the follower will receive the snapshot and the following events.
But this caused the consistency check to fail because it detected gaps between the snapshot and the first log entry. The state is not actually inconsistent, because no data is lost. So we fix this by updating the consistency check to treat this is a valid state. To make the state valid, if the log is empty, we reset it based on the available snapshot.
Related issues
closes #10451
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
backport stable/1.3
) to the PR, in case that fails you need to create backports manually.Testing:
Documentation:
Please refer to our review guidelines.