
test: PartitionRestoreServiceTest does not block on taking a backup #10486

Merged · 3 commits into main from os-fix-partition-restore-service-test-block · Sep 26, 2022

Conversation

@lenaschoenburg (Member)

We saw some unit tests timing out in `PartitionRestoreServiceTest`:

```
"ForkJoinPool-1-worker-1" #19 daemon prio=5 os_prio=0 cpu=1567.91ms elapsed=914.45s tid=0x00007facfca78b60 nid=0x15ab5 waiting on condition  [0x00007facb83df000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17.0.4.1/Native Method)
	- parking to wait for  <0x0000000511f04c68> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@17.0.4.1/LockSupport.java:211)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.4.1/CompletableFuture.java:1864)
	at java.util.concurrent.ForkJoinPool.compensatedBlock(java.base@17.0.4.1/ForkJoinPool.java:3449)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.4.1/ForkJoinPool.java:3432)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.4.1/CompletableFuture.java:1898)
	at java.util.concurrent.CompletableFuture.join(java.base@17.0.4.1/CompletableFuture.java:2117)
	at io.camunda.zeebe.restore.PartitionRestoreServiceTest.takeBackup(PartitionRestoreServiceTest.java:212)
	at io.camunda.zeebe.restore.PartitionRestoreServiceTest.shouldFailToRestoreWhenSnapshotIsCorrupted(PartitionRestoreServiceTest.java:182)
```

These changes ensure that the test no longer waits indefinitely for a backup; the wait is now capped at 30 seconds. Additionally, `TestRestorableBackupStore` now fails the future when a backup is marked as failed.
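A minimal sketch of that bounded-wait pattern, assuming the standard `CompletableFuture` API (the helper name is illustrative, not the exact diff):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

final class BoundedBackupWait {

  // Instead of backupTaken.join(), which parks the calling thread forever if
  // the backup never completes, cap the wait: orTimeout completes the future
  // exceptionally with a TimeoutException after 30 seconds, so join() throws
  // and the test fails with a diagnosable error instead of hanging.
  static <T> T awaitBackup(final CompletableFuture<T> backupTaken) {
    return backupTaken.orTimeout(30, TimeUnit.SECONDS).join();
  }
}
```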

@lenaschoenburg (Member, Author)

With the changes here, `PartitionRestoreServiceTest` will hopefully fail with a helpful message instead of silently blocking. I can also wait and see whether it already fails in this PR; if so, I could add a fix for the underlying test issue.

@deepthidevaki (Contributor) left a comment


Thanks for looking into it. Just two points:

  1. Maybe we can add `@Timeout` to the test class, so that even if we forget an explicit timeout on new tests, they will still time out eventually.
  2. `TestSnapshotStore` is not thread-safe. Is it causing the issues? The backup future could be `volatile`, and the map should be a concurrent map.

Both points are sketched below.
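A minimal sketch of both suggestions, assuming JUnit 5 and illustrative names (the store class here is a stand-in, not the actual test code):

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import org.junit.jupiter.api.Timeout;

// A class-level JUnit 5 @Timeout is a safety net: every test method in the
// class times out eventually, even when an explicit per-call timeout was
// forgotten.
@Timeout(60) // interpreted as seconds by default
final class PartitionRestoreServiceTestSketch {

  // Stand-in for the test store's shared state:
  static final class ThreadSafeStoreState {
    // volatile, so a future swapped in or completed on the backup thread is
    // visible to the asserting test thread
    volatile CompletableFuture<Void> backupFuture = new CompletableFuture<>();

    // concurrent map, so backups can be registered and looked up from
    // different threads without racing
    final Map<Long, Object> backupsById = new ConcurrentHashMap<>();
  }
}
```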

github-actions bot (Contributor) commented on Sep 26, 2022

Test Results

763 files (−166) · 763 suites (−166) · 2h 11m 20s ⏱️ (+5m 38s)
6,038 tests (−1,403) · 6,027 ✔️ (−1,404) · 10 💤 (±0) · 1 (+1)
6,216 runs (−1,413) · 6,205 ✔️ (−1,414) · 10 💤 (±0) · 1 (+1)

For more details on these failures, see this check.

Results for commit 7eb8001. ± Comparison against base commit 531997e.

♻️ This comment has been updated with latest results.

@lenaschoenburg (Member, Author)

> `TestSnapshotStore` is not thread-safe. Is it causing the issues?

This test uses `FileBasedSnapshotStore`, not `TestSnapshotStore`.

@deepthidevaki (Contributor)

> > `TestSnapshotStore` is not thread-safe. Is it causing the issues?
>
> This test uses `FileBasedSnapshotStore`, not `TestSnapshotStore`.

Oops, sorry. I meant `TestRestorableBackupStore`.

@deepthidevaki (Contributor) left a comment


🚀 Thanks.

@lenaschoenburg (Member, Author)

bors r+

zeebe-bors-camunda bot added a commit that referenced this pull request Sep 26, 2022
10463: Do not fail consistency check if log is empty r=deepthidevaki a=deepthidevaki

## Description

When a follower receives a snapshot from the leader, it has to throw away its log and reset it to `snapshotIndex + 1`. Previously we did this in the following order:
1. commit snapshot
2. reset 

In this case, if the system crashed after step 1, the node restarts in an invalid state because the log was not reset after the snapshot. To prevent this, we reset the log on startup based on the existing snapshot. This was buggy and caused issues, which were fixed by #10183. The fix was to reverse the order:

1. reset log
2. commit snapshot.

So on restart, there is no need to reset the log. If the system crashes after step 1, we have an empty log and no snapshot (or a previous snapshot). This is a valid state because this follower is not counted in the quorum, so no data is lost. After the restart, the follower will receive the snapshot and the following events.

But this caused the consistency check to fail because it detected gaps between the snapshot and the first log entry. The state is not actually inconsistent because no data is lost, so we fix this by updating the consistency check to treat this as a valid state. To make the state valid, if the log is empty we reset it based on the available snapshot.
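A hedged sketch of the adjusted check (the type and method names are stand-ins for illustration, not the actual Zeebe code):

```java
final class ConsistencyCheckSketch {

  // Minimal log interface assumed by this sketch:
  interface RaftLogSketch {
    boolean isEmpty();
    long firstIndex();
    void reset(long nextIndex);
  }

  // If the log is empty after a restart, the node may have crashed between
  // resetting the log and committing the snapshot. No data is lost (this
  // follower was not counted in the quorum), so treat the state as valid and
  // re-derive the log position from the available snapshot instead of
  // failing the check.
  static void checkConsistency(final long snapshotIndex, final RaftLogSketch log) {
    if (log.isEmpty()) {
      log.reset(snapshotIndex + 1); // next entry appends at snapshotIndex + 1
      return;
    }
    if (log.firstIndex() > snapshotIndex + 1) {
      throw new IllegalStateException(
          "Gap between snapshot index " + snapshotIndex
              + " and first log index " + log.firstIndex());
    }
  }
}
```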

## Related issues

closes #10451 



10482: deps(maven): bump snakeyaml from 1.32 to 1.33 r=Zelldon a=dependabot[bot]

Bumps [snakeyaml](https://bitbucket.org/snakeyaml/snakeyaml) from 1.32 to 1.33.
<details>
<summary>Commits</summary>
<ul>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/eafb23ec31a0babe591c00e1b50e557a5e3f9a1d"><code>eafb23e</code></a> [maven-release-plugin] prepare for next development iteration</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/26624702fab8e0a1c301d7fad723c048528f75c3"><code>2662470</code></a> Improve JavaDoc</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/80827798f06aeb3d4f2632b94075ca7633418829"><code>8082779</code></a> Always emit numberish strings with quotes</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/42d6c79430431fe9033d3ba50f6a7dc6798ba7ad"><code>42d6c79</code></a> Reformat test</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/1962a437263348c3b90857cda4bbfa2bd97908f8"><code>1962a43</code></a> Refactor: rename variables in Emitter</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/bc594ad6e2b87c3fc26844e407276796fd866a40"><code>bc594ad</code></a> Issue 553: honor code point limit in loadAll</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/c3e98fd755a949f65cf11f2ff39e55a1c2afd1c2"><code>c3e98fd</code></a> Update changes.xml</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/a06f76859f2f07580b1d9fa6b66ea84aaad26cf8"><code>a06f768</code></a> Remove deprecated Tag manipulation</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/5a0027a3781b92f59bf92cdeb1b7590589993efd"><code>5a0027a</code></a> Remove unused WhitespaceToken</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/3f05838828b8df36ab961bf836f373b8c20cb8ff"><code>3f05838</code></a> Improve JavaDoc</li>
<li>Additional commits viewable in <a href="https://bitbucket.org/snakeyaml/snakeyaml/branches/compare/snakeyaml-1.33..snakeyaml-1.32">compare view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.yaml:snakeyaml&package-manager=maven&previous-version=1.32&new-version=1.33)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting ``@dependabot` rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- ``@dependabot` rebase` will rebase this PR
- ``@dependabot` recreate` will recreate this PR, overwriting any edits that have been made to it
- ``@dependabot` merge` will merge this PR after your CI passes on it
- ``@dependabot` squash and merge` will squash and merge this PR after your CI passes on it
- ``@dependabot` cancel merge` will cancel a previously requested merge and block automerging
- ``@dependabot` reopen` will reopen this PR if it is closed
- ``@dependabot` close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- ``@dependabot` ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- ``@dependabot` ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- ``@dependabot` ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)


</details>

10486: test: `PartitionRestoreServiceTest` does not block on taking a backup r=oleschoenburg a=oleschoenburg

We saw some unit tests timing out in `PartitionRestoreServiceTest`:

```
"ForkJoinPool-1-worker-1" #19 daemon prio=5 os_prio=0 cpu=1567.91ms elapsed=914.45s tid=0x00007facfca78b60 nid=0x15ab5 waiting on condition  [0x00007facb83df000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17.0.4.1/Native Method)
	- parking to wait for  <0x0000000511f04c68> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@17.0.4.1/LockSupport.java:211)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.4.1/CompletableFuture.java:1864)
	at java.util.concurrent.ForkJoinPool.compensatedBlock(java.base@17.0.4.1/ForkJoinPool.java:3449)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.4.1/ForkJoinPool.java:3432)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.4.1/CompletableFuture.java:1898)
	at java.util.concurrent.CompletableFuture.join(java.base@17.0.4.1/CompletableFuture.java:2117)
	at io.camunda.zeebe.restore.PartitionRestoreServiceTest.takeBackup(PartitionRestoreServiceTest.java:212)
	at io.camunda.zeebe.restore.PartitionRestoreServiceTest.shouldFailToRestoreWhenSnapshotIsCorrupted(PartitionRestoreServiceTest.java:182)
```

These changes ensure that the test no longer waits indefinitely for a backup; the wait is now capped at 30 seconds. Additionally, `TestRestorableBackupStore` now fails the future when a backup is marked as failed.

10489: Do not use DefaultActorClock r=Zelldon a=Zelldon

## Description
The default ActorClock is not thread-safe and shouldn't be shared between multiple threads. This means we need to set the clock in the ActorClockConfiguration to null.

Creating the ActorScheduler with no clock causes each thread to get its own clock.


Note: This is a quick fix; at some point we want to make DefaultActorClock thread-safe so we can always use the same clock. See #10400
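A hedged sketch of that setup, assuming a builder-style scheduler API (the package and method names here are assumptions based on the description, not verified against the codebase):

```java
import io.camunda.zeebe.scheduler.ActorScheduler; // package assumed for illustration

final class SchedulerSetupSketch {
  static ActorScheduler buildWithoutSharedClock() {
    // With no clock configured, each actor thread falls back to its own
    // clock instance, so the non-thread-safe DefaultActorClock is never
    // shared across threads.
    return ActorScheduler.newActorScheduler()
        .setActorClock(null)
        .build();
  }
}
```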

## Related issues


related #10400 



10490: ci(macos): set code cache size of 64m r=megglos a=megglos

To counter occasional out-of-code-cache errors observed on macOS builds.
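For reference, the JVM's reserved code cache is sized with a standard HotSpot flag; a plausible form of the change (the exact CI config location isn't shown here):

```
-XX:ReservedCodeCacheSize=64m
```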

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
@zeebe-bors-camunda (Contributor)

Build failed (retrying...):

@zeebe-bors-camunda (Contributor)

Build succeeded:

zeebe-bors-camunda bot merged commit 5cee53f into main on Sep 26, 2022
zeebe-bors-camunda bot deleted the os-fix-partition-restore-service-test-block branch on September 26, 2022 at 15:03