Journal reset resulting in an intermediate state which is detected as corruption #12754
Comments
A potential solution might be to delete the segments in reverse order, combined with the PR that fixed #12374. In a related issue, I discussed a different approach with @npepinpe. I was planning to create a separate issue for it, but I'm dumping my thoughts here, as I think it is also relevant for this bug.

Terminology

- In-sync replicas - the followers that are in sync with the leader and participate in the commit. The in-sync replicas form the quorum.
- Out-of-sync follower - a follower which receives events potentially after they are already committed.

In a setup with replicationFactor 3, there will be 2 in-sync replicas, one of which is the leader. The third one can be an out-of-sync follower. Raft guarantees that any committed event is available on all in-sync replicas. The events are replicated to the out-of-sync follower eventually, but there is no guarantee when. Ideally they are replicated immediately, but a slow follower or a slow network can cause the out-of-sync follower to lag behind the leader by a large amount. At some point the leader sends a snapshot instead of the events, because the follower is lagging too far behind. At this point, the current behavior is as follows:

1. Follower resets the log (deletes all segments).
2. Follower persists the received snapshot.

The above process can lead to the following scenario: steps 1 and 2 are not atomic, so if the replica crashes in between, it restarts with an empty state. In a normal scenario this is not problematic, because the in-sync replicas are healthy and it is guaranteed that only one of the in-sync replicas becomes the leader. However, in a disaster scenario where all in-sync replicas are gone, it might be acceptable to continue functioning with whatever data is available on the out-of-sync replica. It might not have up-to-date state, but for some use cases it is acceptable to lose the last X amount of data.
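The non-atomic install described above can be sketched as a small simulation. This is illustrative only; the class, the step ordering, and the crash point are assumptions based on the description, not Zeebe's actual code:

```python
class Follower:
    """Illustrative model of an out-of-sync follower's persisted state."""

    def __init__(self):
        self.snapshot = {"index": 100}   # old snapshot
        self.segments = [101, 102, 103]  # log segments following it

    def install_snapshot(self, new_snapshot, crash_between_steps=False):
        # Step 1: reset the log (delete all segments) and drop the old snapshot.
        self.segments = []
        self.snapshot = None
        if crash_between_steps:
            return  # simulated crash: steps 1 and 2 are not atomic
        # Step 2: persist the newly received snapshot.
        self.snapshot = new_snapshot


follower = Follower()
follower.install_snapshot({"index": 500}, crash_between_steps=True)
# The follower restarts with an empty state: no snapshot and no log.
print(follower.snapshot, follower.segments)  # None []
```

A crash between the two steps leaves neither the old state nor the new snapshot on disk, which is exactly the disaster-recovery problem described above.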
In such cases, we want to enable users to recover the cluster from the state of the out-of-sync replica. However, in the above scenario the out-of-sync replica might end up with an empty state or an unusable snapshot. So it would be good if we could ensure that the state on the out-of-sync follower is always valid. To achieve that, we should not delete the old snapshot and reset the log immediately when the follower receives a snapshot.

Solution 1: Keep the old snapshot + logs in one folder and the new snapshot + new logs in another. On restart, attempt to use the new snapshot + log and fall back to the old state if necessary.

Solution 2: Do not reset the log when a new snapshot is received. Instead, add a marker record to the journal to indicate that there is a snapshot at that position. The readers must know how to handle these "gaps". For example, if the stream processor reader hits this marker record, it has to throw away its state and replace it with the snapshot. If the raft leader reads this record, it should replicate the snapshot instead of the events, etc. The compaction logic should take this into account and compact only if there is one valid snapshot with its follow-up event in the log, so the follower can roll back to an old valid state if necessary. This would also help with an issue we observed recently, where restarting the node while resetting the log can incorrectly lead to detecting the log as corrupted.
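A minimal sketch of Solution 2; the record kinds and payload shapes are hypothetical, not Zeebe's actual journal format. The log keeps a marker record pointing at the snapshot, and a replaying reader that hits the marker throws away its accumulated state and continues from the snapshot's state:

```python
from dataclasses import dataclass


@dataclass
class Record:
    index: int
    kind: str       # "EVENT" or "SNAPSHOT_MARKER" (hypothetical record kinds)
    payload: object


journal = [
    Record(1, "EVENT", "a"),
    Record(2, "EVENT", "b"),
    # The follower received a snapshot covering everything up to index 500;
    # instead of truncating the log, a marker record is appended.
    Record(3, "SNAPSHOT_MARKER", {"snapshot_index": 500, "state": {"x": 1}}),
    Record(501, "EVENT", "c"),
]


def replay(journal):
    """Replay the journal, honoring snapshot markers as state replacements."""
    state, applied = {}, []
    for record in journal:
        if record.kind == "SNAPSHOT_MARKER":
            # Throw away accumulated state and replace it with the snapshot.
            state = dict(record.payload["state"])
            applied = []
        else:
            applied.append(record.payload)
    return state, applied


print(replay(journal))  # ({'x': 1}, ['c'])
```

Because the old records are only removed later by compaction, an interrupted install leaves the reader able to fall back to the pre-snapshot state.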
With #12868, we implemented a quick fix to prevent the specific case we're aware of. By deleting the segments in reverse order during reset, we ensure that there are no gaps between the log and the snapshot, and thus no corruption. Data loss is acceptable here, as it was the desired outcome of the reset operation. However, we agreed not to close the issue: this quick fix only addresses a specific symptom, not the root cause, which is that the install operation on the follower is not atomic. Persisting the snapshot, clearing the log, updating the meta store, getting the first record for the snapshot, etc. are all independent operations, and the node may stop at any point in between, leaving us in an inconsistent state.
12868: fix(journal): avoid gaps in logs due to reset r=npepinpe a=npepinpe

## Description

Avoid gaps in the log when resetting it by deleting the segments in reverse order. This may cause data loss, but that's acceptable in the context of a reset operation. Deleting segments in reverse order ensures we have no gaps between the remaining snapshot and segments should the operation fail midway. Note that this unfortunately does not make persisting a new snapshot/resetting the log atomic. The tests were also not ideal and require heavy use of mocks; it's also not possible to write integration tests without major refactoring. As this is a quick fix, the hope is that the proper solution of making the Raft install operation atomic will redeem this. Not marked as closing the original issue because this is, after all, a quick fix: it only fixes a specific case we're aware of.

## Related issues

related to #12754

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
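The effect of the reverse-order deletion can be illustrated with a small sketch; the segment bounds and function names here are made up for illustration. Interrupting an oldest-first reset leaves a hole between the start of the log and the surviving segments, while a newest-first reset never does:

```python
def reset(segments, order, crash_after=1):
    """Delete segments (list of (first_index, last_index) tuples) one by one,
    simulating a crash after `crash_after` deletions."""
    remaining = list(segments)
    victims = remaining if order == "oldest-first" else list(reversed(remaining))
    for deleted, seg in enumerate(list(victims), start=1):
        remaining.remove(seg)
        if deleted == crash_after:
            break  # simulated crash mid-reset
    return sorted(remaining)


def has_gap(snapshot_index, segments):
    """True if the surviving segments are not contiguous from the snapshot."""
    expected = snapshot_index + 1
    for first, last in segments:
        if first > expected:
            return True  # hole in the log -> detected as corruption on startup
        expected = last + 1
    return False


segments = [(1, 100), (101, 200), (201, 300)]
print(has_gap(0, reset(segments, "oldest-first")))  # True: (1, 100) is gone
print(has_gap(0, reset(segments, "newest-first")))  # False: only (201, 300) is gone
```

Newest-first deletion only ever truncates the tail of the log, so a crash mid-reset leaves a shorter but still contiguous log rather than one with a hole in it.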
12698: Replace go-bindata with go's embed FS r=npepinpe a=npepinpe

## Description

Took some slack time to complete this quick win: replaces usage of go-bindata, an external tool dependency, with the built-in capabilities of the go compiler using `go:embed` and the embed FS. The version file is embedded at build time into the binary, much like when we used go-bindata, and read back as a package field.

## Related issues

closes #6034
12877: Event subprocess attaches to inner instance of multi-instance only r=saig0 a=korthout

## Description

Event sub processes inside a multi-instance sub process would be attached to the multi-instance body in addition to the inner instance. However, there is no reason why an event sub process should be attached to the multi-instance body; event sub processes should only be attached to the inner instances.

**Example scenario**

The parallel multi-instance sub process creates an `item` variable as input element in each inner instance.

<img width="865" alt="Screenshot 2023-05-26 at 18 25 08" src="https://github.com/camunda/zeebe/assets/3511026/cf37717b-50cf-4478-84d3-d32009f54972">

Each message event sub process subscribes to the same message with a correlation key expression that uses the input element variable `item`.

<img width="922" alt="Screenshot 2023-05-26 at 18 25 39" src="https://github.com/camunda/zeebe/assets/3511026/9fc40f17-a2e7-4628-a9fd-a70cb6bffe90">

The bug surfaces as an incident on the multi-instance body stating:

> failed to evaluate expression 'item.id': no variable found for name 'item'

This was because we attempted to subscribe to the event sub process already when activating the multi-instance body, **before** activating any of the inner instances. The problem is resolved by no longer attaching the event sub process to the multi-instance body.

## Related issues

closes #11578

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
Describe the bug
Related to #12374
A customer observed another case where the startup failed with the error:
I could not verify it, but our current assumption is that this is a false positive. The following might have happened:
I think this is plausible, as the reset/deleting of segments is not an atomic operation. We might not be handling it correctly.
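A hedged sketch of the kind of startup check involved (the function name is hypothetical): the node treats a hole between the last snapshot and the first log record as corruption, so a half-finished reset can trip it even when it is a false positive:

```python
def verify_on_startup(snapshot_index, first_log_index):
    """Flag a gap between the last snapshot and the first log record."""
    if first_log_index > snapshot_index + 1:
        raise RuntimeError(
            f"gap detected: snapshot ends at {snapshot_index}, "
            f"log starts at {first_log_index}"
        )


verify_on_startup(100, 101)  # contiguous: ok
try:
    # A crash mid-reset deleted the snapshot but left a later segment behind.
    verify_on_startup(0, 101)
except RuntimeError as err:
    print(err)  # gap detected: snapshot ends at 0, log starts at 101
```

No committed data was silently lost in this scenario, but the check cannot distinguish an interrupted reset from genuine corruption, which is why the intermediate state must be avoided in the first place.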
Expected behavior
The reset or snapshot commit process should not result in an invalid intermediate state which is detected as corruption.
Environment:
Related to support