DefaultActorClock is not thread safe #10400

Zelldon · 2022-09-19T14:04:25Z

Describe the bug

It seems we now create the ActorScheduler with a predefined actor clock, either ControlledClock or the DefaultActorClock https://github.com/camunda/zeebe/blob/main/dist/src/main/java/io/camunda/zeebe/shared/ActorClockConfiguration.java#L34-L35

The problem is that the DefaultActorClock is not thread safe! Previously we created the ActorScheduler without clock, which caused that each thread has its own clock. Then there is no issue.

If we use the DefaultActorClock in all threads this might cause issues on updating the time https://github.com/camunda/zeebe/blob/main/scheduler/src/main/java/io/camunda/zeebe/scheduler/clock/DefaultActorClock.java#L24-L31 which is not thread safe and called by all threads https://github.com/camunda/zeebe/blob/main/scheduler/src/main/java/io/camunda/zeebe/scheduler/ActorThread.java#L78

and also when timers are scheduled https://github.com/camunda/zeebe/blob/main/scheduler/src/main/java/io/camunda/zeebe/scheduler/ActorTimerQueue.java#L53

I found that during investigating #10390 and trying to make ActorTimerQueue threadsafe...

To Reproduce

Expected behavior

Either we not set the clock or we use a thread safe structure.

Log/Stacktrace

Full Stacktrace

<STACKTRACE>

Environment:

OS:
Zeebe Version: 8.x
Configuration:

The text was updated successfully, but these errors were encountered:

Zelldon · 2022-09-19T14:06:38Z

Right now I can't really estimate the impact, but I guess it might cause issues in executing scheduled tasks 🤷

npepinpe · 2022-09-20T07:52:37Z

Oh, damn, good catch! I didn't think of that. We could do two approaches, one is making the clock thread-safe, the other is simply going back to the original behavior with an asterisk: you can inject a pre-made clock (which is then used by all threads), or nothing, which then means all threads use their own clock. I think there's something to be said about the safety of having all threads use the same source of truth for time, but I'm also not sure what the impact is when they don't - we never really noticed clock skew/drift before (though, would we have anyway until it becomes a lot? 🤷).

npepinpe · 2022-09-20T08:20:41Z

Looking at it, I think the impact is quite volatile. It could be limited, but it could also be terrible if you hit some thread scheduling snag (like a thread being suspended right before updating the timeMillis, then for some reason no scheduled for 10 seconds, resetting that clock backwards 10 seconds).

So we should 100% make it thread-safe or go with the "let every thread create it when not provided" approach.

lenaschoenburg · 2022-09-20T09:17:09Z

Maybe we could look into using InstantSource which is threadsafe and supports custom tick durations.

npepinpe · 2022-09-20T09:19:34Z

I thought about it (well, the impl, Clock#tick), but its slightly different from what I understand. It doesn't cache anything but rather just truncated the result so it looks like it's ticking every milli only. Unless InstantSource is different?

Zelldon · 2022-09-20T09:19:57Z

What do you folks think of first (as quick fix before the release) we go with not setting the actor clock, which means DefaultActorClock for each thread. After the release we can experiment with making it thread safe?

npepinpe · 2022-09-20T09:20:46Z

Sounds good to me 👍

10463: Do not fail consistency check if log is empty r=deepthidevaki a=deepthidevaki ## Description When a follower receives a snapshot from the leader, it has to throw away the log and reset the log to `snapshotIndex + 1`. Previously we were doing it in the following order: 1. commit snapshot 2. reset In this case, if the system crashed after step 1, when the node restarts it is in an invalid state because the log is not reset after the snapshot. To prevent this case, we reset the log on startup based on the existing snapshot. This was buggy and caused issues, which was fixed by #10183. The fix was to reverse the order: 1. reset log 2. commit snapshot. So on restart, there is no need to reset the log. If the system crashes after step 1, we have any empty log and no snapshot (or a previous snapshot). This is a valid state because this follower is not in the quorum, so no data is lost. After the restart the follower will receive the snapshot and the following events. But this caused the consistency check to fail because it detected gaps between the snapshot and the first log entry. The state is not actually inconsistent, because no data is lost. So we fix this by updating the consistency check to treat this is a valid state. To make the state valid, if the log is empty, we reset it based on the available snapshot. ## Related issues closes #10451 10482: deps(maven): bump snakeyaml from 1.32 to 1.33 r=Zelldon a=dependabot[bot] Bumps [snakeyaml](https://bitbucket.org/snakeyaml/snakeyaml) from 1.32 to 1.33. <details> <summary>Commits</summary> <ul> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/eafb23ec31a0babe591c00e1b50e557a5e3f9a1d"><code>eafb23e</code></a> [maven-release-plugin] prepare for next development iteration</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/26624702fab8e0a1c301d7fad723c048528f75c3"><code>2662470</code></a> Improve JavaDoc</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/80827798f06aeb3d4f2632b94075ca7633418829"><code>8082779</code></a> Always emit numberish strings with quotes</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/42d6c79430431fe9033d3ba50f6a7dc6798ba7ad"><code>42d6c79</code></a> Reformat test</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/1962a437263348c3b90857cda4bbfa2bd97908f8"><code>1962a43</code></a> Refactor: rename variables in Emitter</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/bc594ad6e2b87c3fc26844e407276796fd866a40"><code>bc594ad</code></a> Issue 553: honor code point limit in loadAll</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/c3e98fd755a949f65cf11f2ff39e55a1c2afd1c2"><code>c3e98fd</code></a> Update changes.xml</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/a06f76859f2f07580b1d9fa6b66ea84aaad26cf8"><code>a06f768</code></a> Remove deprecated Tag manipulation</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/5a0027a3781b92f59bf92cdeb1b7590589993efd"><code>5a0027a</code></a> Remove unused WhitespaceToken</li> <li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/3f05838828b8df36ab961bf836f373b8c20cb8ff"><code>3f05838</code></a> Improve JavaDoc</li> <li>Additional commits viewable in <a href="https://bitbucket.org/snakeyaml/snakeyaml/branches/compare/snakeyaml-1.33..snakeyaml-1.32">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.yaml:snakeyaml&package-manager=maven&previous-version=1.32&new-version=1.33)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting ``@dependabot` rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - ``@dependabot` rebase` will rebase this PR - ``@dependabot` recreate` will recreate this PR, overwriting any edits that have been made to it - ``@dependabot` merge` will merge this PR after your CI passes on it - ``@dependabot` squash and merge` will squash and merge this PR after your CI passes on it - ``@dependabot` cancel merge` will cancel a previously requested merge and block automerging - ``@dependabot` reopen` will reopen this PR if it is closed - ``@dependabot` close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - ``@dependabot` ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> 10486: test: `PartitionRestoreServiceTest` does not block on taking a backup r=oleschoenburg a=oleschoenburg We saw some unit tests timing out in `PartitionRestoreServiceTest`: ``` "ForkJoinPool-1-worker-1" #19 daemon prio=5 os_prio=0 cpu=1567.91ms elapsed=914.45s tid=0x00007facfca78b60 nid=0x15ab5 waiting on condition [0x00007facb83df000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@17.0.4.1/Native Method) - parking to wait for <0x0000000511f04c68> (a java.util.concurrent.CompletableFuture$Signaller) at java.util.concurrent.locks.LockSupport.park(java.base@17.0.4.1/LockSupport.java:211) at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.4.1/CompletableFuture.java:1864) at java.util.concurrent.ForkJoinPool.compensatedBlock(java.base@17.0.4.1/ForkJoinPool.java:3449) at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.4.1/ForkJoinPool.java:3432) at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.4.1/CompletableFuture.java:1898) at java.util.concurrent.CompletableFuture.join(java.base@17.0.4.1/CompletableFuture.java:2117) at io.camunda.zeebe.restore.PartitionRestoreServiceTest.takeBackup(PartitionRestoreServiceTest.java:212) at io.camunda.zeebe.restore.PartitionRestoreServiceTest.shouldFailToRestoreWhenSnapshotIsCorrupted(PartitionRestoreServiceTest.java:182) ``` With these changes here we ensure that the test does not wait forever on a backup, instead setting a maximum of 30 seconds. Additionally, `TestRestorableBackupStore` now fails the future when a backup is marked as failed. 10489: Do not use DefaultActorClock r=Zelldon a=Zelldon ## Description The default ActorClock is not thread safe and shouldn't be shared with multiple threads. This means we need to set the clock in the ActorClockConfiguration to null. Creating the ActorScheduler with no clock will cause that each threads gets its own clock. Note: This is a quick fix, at some point, we want to make DefaultActorClock threadsafe so we can use always the same clock. See #10400  ## Related issues  related #10400 10490: ci(macos): set code cache size of 64m r=megglos a=megglos To counter occasional out of code cache errors observed on macos builds. Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com> Co-authored-by: Christopher Zell <zelldon91@googlemail.com> Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>

10486: test: `PartitionRestoreServiceTest` does not block on taking a backup r=oleschoenburg a=oleschoenburg We saw some unit tests timing out in `PartitionRestoreServiceTest`: ``` "ForkJoinPool-1-worker-1" #19 daemon prio=5 os_prio=0 cpu=1567.91ms elapsed=914.45s tid=0x00007facfca78b60 nid=0x15ab5 waiting on condition [0x00007facb83df000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@17.0.4.1/Native Method) - parking to wait for <0x0000000511f04c68> (a java.util.concurrent.CompletableFuture$Signaller) at java.util.concurrent.locks.LockSupport.park(java.base@17.0.4.1/LockSupport.java:211) at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.4.1/CompletableFuture.java:1864) at java.util.concurrent.ForkJoinPool.compensatedBlock(java.base@17.0.4.1/ForkJoinPool.java:3449) at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.4.1/ForkJoinPool.java:3432) at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.4.1/CompletableFuture.java:1898) at java.util.concurrent.CompletableFuture.join(java.base@17.0.4.1/CompletableFuture.java:2117) at io.camunda.zeebe.restore.PartitionRestoreServiceTest.takeBackup(PartitionRestoreServiceTest.java:212) at io.camunda.zeebe.restore.PartitionRestoreServiceTest.shouldFailToRestoreWhenSnapshotIsCorrupted(PartitionRestoreServiceTest.java:182) ``` With these changes here we ensure that the test does not wait forever on a backup, instead setting a maximum of 30 seconds. Additionally, `TestRestorableBackupStore` now fails the future when a backup is marked as failed. 10489: Do not use DefaultActorClock r=Zelldon a=Zelldon ## Description The default ActorClock is not thread safe and shouldn't be shared with multiple threads. This means we need to set the clock in the ActorClockConfiguration to null. Creating the ActorScheduler with no clock will cause that each threads gets its own clock. Note: This is a quick fix, at some point, we want to make DefaultActorClock threadsafe so we can use always the same clock. See #10400  ## Related issues  related #10400 10490: ci(macos): set code cache size of 64m r=megglos a=megglos To counter occasional out of code cache errors observed on macos builds. Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com> Co-authored-by: Christopher Zell <zelldon91@googlemail.com> Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>

megglos · 2022-09-30T09:23:47Z

Issue got mitigated with #10489

Remaining impact is not severe.

We decided to close it for now as we don't thin it's a big deal going forward.

Zelldon added the kind/bug Categorizes an issue or PR as a bug label Sep 19, 2022

Zelldon self-assigned this Sep 26, 2022

Zelldon mentioned this issue Sep 26, 2022

Do not use DefaultActorClock #10489

Merged

10 tasks

megglos closed this as completed Sep 30, 2022

Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022

Zelldon mentioned this issue Apr 3, 2024

Tests that increase time by >5s are flaky #17258

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DefaultActorClock is not thread safe #10400

DefaultActorClock is not thread safe #10400

Zelldon commented Sep 19, 2022

Zelldon commented Sep 19, 2022

npepinpe commented Sep 20, 2022 •

edited

Loading

npepinpe commented Sep 20, 2022

lenaschoenburg commented Sep 20, 2022

npepinpe commented Sep 20, 2022

Zelldon commented Sep 20, 2022

npepinpe commented Sep 20, 2022

megglos commented Sep 30, 2022

DefaultActorClock is not thread safe #10400

DefaultActorClock is not thread safe #10400

Comments

Zelldon commented Sep 19, 2022

Zelldon commented Sep 19, 2022

npepinpe commented Sep 20, 2022 • edited Loading

npepinpe commented Sep 20, 2022

lenaschoenburg commented Sep 20, 2022

npepinpe commented Sep 20, 2022

Zelldon commented Sep 20, 2022

npepinpe commented Sep 20, 2022

megglos commented Sep 30, 2022

npepinpe commented Sep 20, 2022 •

edited

Loading