Stop raft server when going inactive due to unrecoverable errors #10776
When raft transitions to inactive, ZeebePartition is notified and triggers a new transition.
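The notification described above can be pictured as a plain listener registration. The following is a minimal sketch only: `ToyRaft`, `ToyPartition`, and `addRoleChangeListener` are hypothetical names invented for illustration and are not taken from the Zeebe code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for a raft server; not the real raft class. */
final class ToyRaft {
  enum Role { INACTIVE, FOLLOWER, LEADER }

  private final List<Consumer<Role>> roleListeners = new ArrayList<>();

  void addRoleChangeListener(Consumer<Role> listener) {
    roleListeners.add(listener);
  }

  void transition(Role newRole) {
    // Notify every registered listener about the new role.
    roleListeners.forEach(listener -> listener.accept(newRole));
  }
}

/** Hypothetical stand-in for a partition; not the real ZeebePartition. */
final class ToyPartition {
  private final List<String> transitions = new ArrayList<>();

  ToyPartition(ToyRaft raft) {
    raft.addRoleChangeListener(this::onRoleChange);
  }

  private void onRoleChange(ToyRaft.Role newRole) {
    // The partition reacts to the raft role change with a transition of its own.
    transitions.add("transition to " + newRole);
  }

  List<String> transitions() {
    return transitions;
  }
}
```

In this sketch, calling `raft.transition(ToyRaft.Role.INACTIVE)` causes the partition to record one transition of its own; how the real partition reacts is not shown in this PR excerpt.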
Test Results: 947 files ±0, 947 suites ±0, 1h 42m 36s ⏱️ +11s. For more details on these failures, see this check. Results for commit 346ce34. ± Comparison against base commit 9a1ad47.
 public CompletableFuture<Void> goInactive() {
-  return server.goInactive();
+  return server.stop();
 }
❓ `goInactive` is used in two places: on unrecoverable errors and on recoverable errors on followers. Is it intended that recoverable errors on followers now stop the raft server?
Either way, I think we should rename the method to `stop` or similar.
Previously it went inactive in both places, and that was already intended to behave like a stop. So we are not changing the expected behavior.
Thanks @deepthidevaki, LGTM 👍
bors merge
Build failed (retrying...)
Build failed (retrying...)
Build succeeded
Successfully created backport PR #10798 for
Successfully created backport PR #10799 for
Description
Previously, the raft server only transitioned to inactive. However, the member is still marked as active in the configuration, so it transitions back to active as soon as it receives a new message from the leader. We cannot change the configuration and mark this member as inactive, because that would change the quorum. What we really require is that this partition is "dead" (at least temporarily) so that it doesn't become leader again. We also don't want it to become a follower, because that can lead to partial functionality, which causes problems of its own. For example, in the follower role raft keeps replicating events, but the stream processor or snapshotting is not working because of the error, so the log cannot be compacted. Eventually the disk fills up, which affects other, possibly healthy, partitions.
To fix this, this PR stops the raft server instead of only transitioning to inactive. The replication factor and quorum remain the same, but this node cannot become leader again until the member is restarted.
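To make the difference concrete, here is a small toy model of the two behaviors. It is only a sketch under stated assumptions: `ToyRaftMember`, its `running` flag, and `onLeaderMessage` are hypothetical and do not mirror the actual raft implementation; only the method names `goInactive` and `stop` come from the diff in this PR.

```java
import java.util.concurrent.CompletableFuture;

/** Toy model only; none of these types are real Zeebe or raft classes. */
final class ToyRaftMember {
  enum Role { INACTIVE, FOLLOWER, LEADER }

  private Role role = Role.FOLLOWER;
  private boolean running = true;

  /** Old behavior: only the role changes; the member stays active in the configuration. */
  CompletableFuture<Void> goInactive() {
    role = Role.INACTIVE;
    return CompletableFuture.completedFuture(null);
  }

  /** New behavior: the server is stopped and ignores messages until it is restarted. */
  CompletableFuture<Void> stop() {
    role = Role.INACTIVE;
    running = false;
    return CompletableFuture.completedFuture(null);
  }

  /** Simulates a message from the leader arriving at this member. */
  void onLeaderMessage() {
    if (running && role == Role.INACTIVE) {
      // Still active in the configuration, so the member rejoins as a follower.
      role = Role.FOLLOWER;
    }
  }

  Role role() {
    return role;
  }
}

final class InactiveVsStoppedDemo {
  public static void main(String[] args) {
    ToyRaftMember inactiveOnly = new ToyRaftMember();
    inactiveOnly.goInactive().join();
    inactiveOnly.onLeaderMessage();
    System.out.println("after goInactive: " + inactiveOnly.role());

    ToyRaftMember stopped = new ToyRaftMember();
    stopped.stop().join();
    stopped.onLeaderMessage();
    System.out.println("after stop: " + stopped.role());
  }
}
```

In this model, the member that only went inactive ends up as `FOLLOWER` again after the leader's message, while the stopped member stays `INACTIVE`, which is the property the PR is after.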
Related issues
closes #9924