Stop raft server when going inactive due to unrecoverable errors #10776

Merged
merged 3 commits on Oct 24, 2022

Conversation

deepthidevaki
Contributor

Description

Previously, on an unrecoverable error the raft server only transitioned to inactive. However, the member is still marked as active in the configuration, so it transitions back to active as soon as it receives a new message from the leader. We cannot change the configuration to mark this member as inactive, because that would change the quorum. What we really require is that this partition is "dead" (at least temporarily) so that it doesn't become leader again. We also don't want it to become a follower, because that can lead to partial functionality, which causes problems of its own. For example, in the follower role raft keeps replicating events, but the stream processor or snapshotting is not working because of the error, so the log cannot be compacted. This eventually fills the disk and affects other, possibly healthy partitions.

To fix this, this PR stops the raft server instead of only transitioning it to inactive. The replication factor and quorum remain the same, but this node cannot become leader again until the member is restarted.
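
The difference between the two behaviors can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual Zeebe or Atomix code: the class `RaftStopSketch`, the `Node` type, the `Role` enum, and the `onLeaderMessage` and `quorum` helpers are all hypothetical names invented for illustration. It shows an inactive member bouncing back to follower when a leader message arrives, a stopped member staying put, and a majority quorum that depends only on the configured replication factor.

```java
import java.util.concurrent.CompletableFuture;

// Toy model only: every name here is hypothetical and not taken from Zeebe.
final class RaftStopSketch {

  enum Role { FOLLOWER, LEADER, INACTIVE }

  static final class Node {
    // The configuration still lists this member as active,
    // because changing it would change the quorum.
    private final boolean activeInConfiguration = true;
    private Role role = Role.FOLLOWER;
    private boolean running = true;

    // Old behavior: only the role changes; the server keeps handling messages.
    CompletableFuture<Void> goInactive() {
      role = Role.INACTIVE;
      return CompletableFuture.completedFuture(null);
    }

    // New behavior: the server stops and ignores messages until it is restarted.
    CompletableFuture<Void> stop() {
      role = Role.INACTIVE;
      running = false;
      return CompletableFuture.completedFuture(null);
    }

    // A leader message reactivates a running member that the configuration
    // still considers active; a stopped member ignores the message.
    void onLeaderMessage() {
      if (running && role == Role.INACTIVE && activeInConfiguration) {
        role = Role.FOLLOWER;
      }
    }

    Role role() {
      return role;
    }
  }

  // Majority quorum depends only on the configured replication factor,
  // so stopping one node does not change it.
  static int quorum(int replicationFactor) {
    return replicationFactor / 2 + 1;
  }

  public static void main(String[] args) {
    Node inactive = new Node();
    inactive.goInactive().join();
    inactive.onLeaderMessage();
    System.out.println("after goInactive: " + inactive.role()); // FOLLOWER

    Node stopped = new Node();
    stopped.stop().join();
    stopped.onLeaderMessage();
    System.out.println("after stop: " + stopped.role()); // INACTIVE

    System.out.println("quorum for 3 replicas: " + quorum(3)); // 2
  }
}
```

In this model the stopped node never rejoins as follower or leader, while the remaining members still need the same majority as before.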

Related issues

closes #9924

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; if that fails, you need to create the backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

When raft transitions to inactive, ZeebePartition is notified and triggers a new transition
@github-actions
Contributor

github-actions bot commented Oct 20, 2022

Test Results

  947 files ±0    947 suites ±0    1h 42m 36s ⏱️ +11s
7 528 tests +9    7 520 ✔️ +8    7 💤 ±0    1 ❌ +1
7 720 runs  +9    7 710 ✔️ +8    9 💤 ±0    1 ❌ +1

For more details on these failures, see this check.

Results for commit 346ce34. ± Comparison against base commit 9a1ad47.

♻️ This comment has been updated with latest results.

Comment on lines 249 to 251
   public CompletableFuture<Void> goInactive() {
-    return server.goInactive();
+    return server.stop();
   }
Member

goInactive is used in two places, on unrecoverable errors and on recoverable errors on followers. Is it intended that recoverable errors on followers now stop the raft server?
Either way, I think we should rename the method to stop or similar.

Contributor Author

Previously it was going inactive in both places, and that was intended to behave like a stop. So we are not changing the expected behavior.
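
To make the two call sites concrete, here is a minimal sketch assuming a renamed `stop()` method as suggested in the review. All names (`FailureHandlingSketch`, `RaftServerHandle`, `CountingServer`, `PartitionSketch`, `onUnrecoverableFailure`, `onRecoverableFailure`) are hypothetical and are not the Zeebe API. What a leader does on a recoverable failure is not covered in this thread, so the sketch simply does nothing in that case.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names throughout; this is not the Zeebe implementation.
final class FailureHandlingSketch {

  enum Role { LEADER, FOLLOWER }

  // Stand-in for whatever handle the partition holds on the raft server.
  interface RaftServerHandle {
    CompletableFuture<Void> stop();
  }

  // Test double that counts how often stop() is requested.
  static final class CountingServer implements RaftServerHandle {
    final AtomicInteger stopCalls = new AtomicInteger();

    @Override
    public CompletableFuture<Void> stop() {
      stopCalls.incrementAndGet();
      return CompletableFuture.completedFuture(null);
    }
  }

  static final class PartitionSketch {
    private final RaftServerHandle server;

    PartitionSketch(RaftServerHandle server) {
      this.server = server;
    }

    // Named after what it does now, rather than goInactive().
    CompletableFuture<Void> stop() {
      return server.stop();
    }

    // Call site 1: unrecoverable errors stop the raft server.
    CompletableFuture<Void> onUnrecoverableFailure() {
      return stop();
    }

    // Call site 2: recoverable errors on followers take the same path.
    CompletableFuture<Void> onRecoverableFailure(Role role) {
      if (role == Role.FOLLOWER) {
        return stop();
      }
      // Leader handling is outside the scope of this sketch.
      return CompletableFuture.completedFuture(null);
    }
  }

  public static void main(String[] args) {
    CountingServer server = new CountingServer();
    PartitionSketch partition = new PartitionSketch(server);
    partition.onUnrecoverableFailure().join();
    partition.onRecoverableFailure(Role.FOLLOWER).join();
    System.out.println("stop requested " + server.stopCalls.get() + " times"); // 2 times
  }
}
```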

Member

@oleschoenburg oleschoenburg left a comment

Thanks @deepthidevaki, LGTM 👍

@deepthidevaki
Contributor Author

bors merge

zeebe-bors-camunda bot added a commit that referenced this pull request Oct 24, 2022
10759: deps(maven): bump version.zpt from 8.1.1 to 8.1.2 r=oleschoenburg a=dependabot[bot]


10776: Stop raft server when going inactive due to unrecoverable errors r=deepthidevaki a=deepthidevaki


10794: deps(maven): bump aws-java-sdk-core from 1.12.325 to 1.12.326 r=npepinpe a=dependabot[bot]


Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
@zeebe-bors-camunda
Contributor

Build failed (retrying...):

zeebe-bors-camunda bot added a commit that referenced this pull request Oct 24, 2022
10776: Stop raft server when going inactive due to unrecoverable errors r=deepthidevaki a=deepthidevaki


10794: deps(maven): bump aws-java-sdk-core from 1.12.325 to 1.12.326 r=npepinpe a=dependabot[bot]


Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@zeebe-bors-camunda
Contributor

Build failed (retrying...):

@zeebe-bors-camunda
Contributor

Build succeeded:

@backport-action
Collaborator

Successfully created backport PR #10798 for stable/8.0.

@backport-action
Collaborator

Successfully created backport PR #10799 for stable/8.1.

zeebe-bors-camunda bot added a commit that referenced this pull request Oct 24, 2022
10798: [Backport stable/8.0] Stop raft server when going inactive due to unrecoverable errors r=deepthidevaki a=backport-action

# Description
Backport of #10776 to `stable/8.0`.

closes #9924

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Oct 24, 2022
10799: [Backport stable/8.1] Stop raft server when going inactive due to unrecoverable errors r=deepthidevaki a=backport-action

# Description
Backport of #10776 to `stable/8.1`.

closes #9924

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
@korthout added the release/8.0.8 and version:8.1.3 labels on Nov 1, 2022
Labels
version:8.1.3 Marks an issue as being completely or in parts released in 8.1.3
Development

Successfully merging this pull request may close these issues.

ZeebePartition caught in a loop transition to inactive after a dead partition
4 participants