-
Notifications
You must be signed in to change notification settings - Fork 565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deployment distribution is only retried twice #8525
Comments
…e failed broker is up This test reproduces the bug #8525
Hm it is somehow hard for me to believe this 🙈 I think I have seen multiple retries if a leader is not avail in benchmarks but also in chaos experiments. If we take a look at the code we see that the request is send via a timeout https://github.com/camunda-cloud/zeebe/blob/develop/broker/src/main/java/io/camunda/zeebe/broker/engine/impl/DeploymentDistributorImpl.java#L108 And we handle the ERROR via retry https://github.com/camunda-cloud/zeebe/blob/develop/broker/src/main/java/io/camunda/zeebe/broker/engine/impl/DeploymentDistributorImpl.java#L118 I'm wondering your described scenario could happen 🤔 would this mean the send is never timedout? Which would mean we have a bug in the transport. |
The problematic code is this Here it is waiting or the response from the StreamProcessor of other partition. As far as I understood, the code that you linked is retrying when the sending itself failed (eg:- immediately rejected by the receiver because it is not the leader). |
To my understanding, @Zelldon refers to this code in the messaging layer/service: Meaning, when a request times out then the future is completed exceptionally with a That's why I am wondering why this code actually is in place WDYT? |
Ah.. understood what you are refering to. |
This is necessary because it is not actually using a request-response pattern. |
Thanks @deepthidevaki makes sense I think I can now remember again. For the sake of completeness be aware that these things are also retried after replay https://github.com/camunda-cloud/zeebe/blob/d89a46ebadbdeb1b3f947fe8d955dbce84eca420/engine/src/main/java/io/camunda/zeebe/engine/processing/deployment/distribute/DeploymentRedistributor.java |
This also goes through |
Yes this uses also this impl class |
@deepthidevaki, I got it now! Thanks for your clarification. |
8526: fix(broker): retry deployment distribution until success r=deepthidevaki a=deepthidevaki ## Description Previously the retry was only scheduled once. So if the retry did not get a response, it is never retried again. As a result a partition that did not receive the deployment in the first two attempts, will never receives it (unless a new leader is elected). Also added a test that can reproduce the bug. ## Related issues closes #8445 closes #8525 8532: deps(maven): bump protobuf-bom from 3.19.1 to 3.19.2 r=npepinpe a=dependabot[bot] Bumps [protobuf-bom](https://github.com/protocolbuffers/protobuf) from 3.19.1 to 3.19.2. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/protocolbuffers/protobuf/releases">protobuf-bom's releases</a>.</em></p> <blockquote> <h2>Protocol Buffers v3.19.2</h2> <h1>Python</h1> <ul> <li>Fix missing Windows wheel for Python 3.10 on PyPI</li> </ul> <h1>Java</h1> <ul> <li>Improve performance characteristics of UnknownFieldSet parsing (<a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/9371">#9371</a>)</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/protocolbuffers/protobuf/commit/cb46755e6405e083b45481f5ea4754b180705529"><code>cb46755</code></a> Tweak wording of CHANGES.txt</li> <li><a href="https://github.com/protocolbuffers/protobuf/commit/eb94f17a8b8babc4851fb9bb27f9fb1919fe9cfc"><code>eb94f17</code></a> Update protobuf version</li> <li><a href="https://github.com/protocolbuffers/protobuf/commit/ea9a01a0f5c259e9597c199e447d99eaab19930b"><code>ea9a01a</code></a> Update CHANGES.txt for 3.19.2 release</li> <li><a href="https://github.com/protocolbuffers/protobuf/commit/9638a5e5315bf73f5e7148c16181676372321892"><code>9638a5e</code></a> Improve performance of parsing unknown fields in Java (<a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/9371">#9371</a>)</li> <li><a href="https://github.com/protocolbuffers/protobuf/commit/0e02f95b876c40ecb04a76e729dd3f0d9185431e"><code>0e02f95</code></a> Fix Ruby release build by pinning rake-compiler-dock version (<a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/9372">#9372</a>)</li> <li><a href="https://github.com/protocolbuffers/protobuf/commit/9057466a7aea0b927f3bf7ab125355be3317d4a0"><code>9057466</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/9336">#9336</a> from acozzette/merge-3.18.x</li> <li><a href="https://github.com/protocolbuffers/protobuf/commit/9227f60f85701be60f5a0585a27d0e881ebc84b0"><code>9227f60</code></a> Merge branch '3.18.x' into merge-3.18.x</li> <li><a href="https://github.com/protocolbuffers/protobuf/commit/1fe07f9ce2e98ca66a403b989d78a060f3153d5b"><code>1fe07f9</code></a> Cherry-pick fix from <a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/9311">#9311</a> into 3.18.x (<a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/9320">#9320</a>)</li> <li><a href="https://github.com/protocolbuffers/protobuf/commit/cb1f49ae73cf26dd61368f8bd1297e11c62a5f03"><code>cb1f49a</code></a> Cherry-pick Python 3.10 fix to 3.19.x (<a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/9323">#9323</a>)</li> <li><a href="https://github.com/protocolbuffers/protobuf/commit/bb5094881c6a496a458cb5d1c939ac4d0e822964"><code>bb50948</code></a> Update python 3.10 install MD5 Sum (<a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/9322">#9322</a>)</li> <li>Additional commits viewable in <a href="https://github.com/protocolbuffers/protobuf/compare/v3.19.1...v3.19.2">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=com.google.protobuf:protobuf-bom&package-manager=maven&previous-version=3.19.1&new-version=3.19.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Describe the bug
When a push deployment response is not received with in a specified timeout,
DeploymentDistributor
is expected to retry sending the deployment. However it was observed that it retries once, but if this retry also did not receive a response it is not retried again. As a result, the partition that did not receive the deployment will never receive it (unless a new leader is elected).To Reproduce
Observed in the flaky test #8445
Expected behavior
Distribution is retried until success.
Environment:
The text was updated successfully, but these errors were encountered: