
Deployment distribution is only retried twice #8525

Closed
deepthidevaki opened this issue Jan 5, 2022 · 9 comments · Fixed by #8526
Labels
kind/bug · scope/broker · version:1.3.1

Comments

@deepthidevaki
Contributor

Describe the bug

When a push deployment response is not received within a specified timeout, the DeploymentDistributor is expected to retry sending the deployment. However, it was observed that it retries only once; if that retry also does not receive a response, it is never retried again. As a result, the partition that did not receive the deployment will never receive it (unless a new leader is elected).
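To make the failure mode concrete, here is a minimal, hypothetical sketch of the problematic scheduling pattern. The class and method names, the `ScheduledExecutorService`, and the 30-second delay are illustrative assumptions, not the actual `DeploymentDistributorImpl` code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the buggy behaviour, not the actual DeploymentDistributorImpl.
public final class SingleRetrySketch {

  private static final ScheduledExecutorService SCHEDULER =
      Executors.newSingleThreadScheduledExecutor();

  static void pushDeployment(final Runnable push, final CompletableFuture<Void> responseReceived) {
    push.run(); // first attempt

    // The retry is scheduled exactly once: if this second attempt also gets no
    // response, nothing is ever scheduled again and the partition never
    // receives the deployment (unless a new leader is elected).
    SCHEDULER.schedule(
        () -> {
          if (!responseReceived.isDone()) {
            push.run(); // second and last attempt
          }
        },
        30,
        TimeUnit.SECONDS);
  }
}
```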

To Reproduce

Observed in the flaky test #8445

Expected behavior
Distribution is retried until success.

Environment:

  • Zeebe Version: 1.3.0
@deepthidevaki deepthidevaki added the kind/bug and scope/broker labels Jan 5, 2022
@deepthidevaki deepthidevaki self-assigned this Jan 5, 2022
deepthidevaki added a commit that referenced this issue Jan 5, 2022
…e failed broker is up

This test reproduces the bug #8525
@deepthidevaki deepthidevaki added this to In progress in Zeebe Jan 5, 2022
@Zelldon
Member

Zelldon commented Jan 5, 2022

Hm, it is somehow hard for me to believe this 🙈 I think I have seen multiple retries when a leader is not available, both in benchmarks and in chaos experiments.

If we take a look at the code, we see that the request is sent with a timeout https://github.com/camunda-cloud/zeebe/blob/develop/broker/src/main/java/io/camunda/zeebe/broker/engine/impl/DeploymentDistributorImpl.java#L108

And we handle the ERROR via retry https://github.com/camunda-cloud/zeebe/blob/develop/broker/src/main/java/io/camunda/zeebe/broker/engine/impl/DeploymentDistributorImpl.java#L118

https://github.com/camunda-cloud/zeebe/blob/06f2f683ddc27e181ecdf7a89ea4f64f5e76f0a5/broker/src/main/java/io/camunda/zeebe/broker/engine/impl/DeploymentDistributorImpl.java#L193-L207

I'm wondering how your described scenario could happen 🤔 Would this mean the send never times out? That would mean we have a bug in the transport.

@deepthidevaki
Contributor Author

The problematic code is this
https://github.com/camunda-cloud/zeebe/blob/d89a46ebadbdeb1b3f947fe8d955dbce84eca420/broker/src/main/java/io/camunda/zeebe/broker/engine/impl/DeploymentDistributorImpl.java#L70-L75

Here it is waiting for the response from the StreamProcessor of the other partition.

As far as I understood, the code that you linked retries when the sending itself fails (e.g., the request is immediately rejected by the receiver because it is not the leader).

@romansmirnov
Member

romansmirnov commented Jan 5, 2022

To my understanding, @Zelldon refers to this code in the messaging layer/service:

https://github.com/camunda-cloud/zeebe/blob/06f2f683ddc27e181ecdf7a89ea4f64f5e76f0a5/atomix/cluster/src/main/java/io/atomix/cluster/messaging/impl/NettyMessagingService.java#L225-L233

Meaning, when a request times out, the future is completed exceptionally with a TimeoutException. This TimeoutException should be handled by the DeploymentDistributor, which then retries the push request accordingly.
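For illustration, a hedged sketch of that pattern, where `orTimeout` stands in for the messaging layer's internal timeout and the retry helper is an assumption rather than the actual `NettyMessagingService`/`DeploymentDistributorImpl` code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch: retry a push request whose future was completed
// exceptionally with a TimeoutException by the messaging layer.
final class TimeoutRetrySketch {

  static CompletableFuture<byte[]> sendWithRetry(final Supplier<CompletableFuture<byte[]>> send) {
    return send
        .get()
        // stand-in for the messaging layer's internal timeout handling
        .orTimeout(15, TimeUnit.SECONDS)
        .exceptionallyCompose(
            error -> {
              final Throwable cause = error.getCause() == null ? error : error.getCause();
              if (cause instanceof TimeoutException) {
                // the push request timed out, so simply send it again
                return sendWithRetry(send);
              }
              return CompletableFuture.failedFuture(error);
            });
  }
}
```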

That's why I am wondering why this code is actually in place:

https://github.com/camunda-cloud/zeebe/blob/06f2f683ddc27e181ecdf7a89ea4f64f5e76f0a5/broker/src/main/java/io/camunda/zeebe/broker/engine/impl/DeploymentDistributorImpl.java#L70-L82

WDYT?

@deepthidevaki
Contributor Author

Ah, understood what you are referring to.
But in this case the request did not reach the transport layer because
https://github.com/camunda-cloud/zeebe/blob/d89a46ebadbdeb1b3f947fe8d955dbce84eca420/broker/src/main/java/io/camunda/zeebe/broker/engine/impl/DeploymentDistributorImpl.java#L94-L96
there was no leader at that time. (In the reproducer code, the only replica is shut down.)
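Roughly, the situation can be sketched like this (the `partitionLeaders` map and the method names are hypothetical, not the actual code): when no leader is known for the partition, nothing reaches the transport, so no transport-level timeout can ever trigger a retry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the "no leader known" branch; the partitionLeaders map
// and method names are made up for illustration.
final class NoLeaderSketch {

  private final Map<Integer, Integer> partitionLeaders = new ConcurrentHashMap<>();

  void pushDeploymentToPartition(final int partitionId, final Runnable sendRequest) {
    final Integer leaderNodeId = partitionLeaders.get(partitionId);
    if (leaderNodeId == null) {
      // No leader is known (e.g. the only replica is shut down): nothing is
      // handed to the transport, so no transport-level timeout can ever fire.
      // Only the retry scheduled by the distributor itself can recover this.
      return;
    }
    sendRequest.run(); // only now does the request reach the messaging layer
  }
}
```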

@deepthidevaki
Contributor Author

> That's why I am wondering why this code is actually in place

This is necessary because it is not actually using a request-response pattern.

When the deployment distribution request is sent, the receiver writes it to the log stream and immediately sends a response. This response can fail due to a timeout in the transport, or because the receiver rejects the message since it is not the leader. A success response here does not mean that the command was processed by that partition.

So another message is sent by the StreamProcessor on the receiver after the command is processed. This can happen after a long delay for various reasons, such as leader changes. It can also happen that the command is never processed, because of a leader change where this record was never committed. This second message is received by the subscription created before sending the request. If this message is not received within the timeout, we should retry sending the request. This retrying is handled by https://github.com/camunda-cloud/zeebe/blob/06f2f683ddc27e181ecdf7a89ea4f64f5e76f0a5/broker/src/main/java/io/camunda/zeebe/broker/engine/impl/DeploymentDistributorImpl.java#L70-L82
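For readers following along, here is a hedged sketch of that flow; the subscription hook, names, and blocking wait are assumptions, and only the shape of the protocol follows the description above:

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

// Hypothetical sketch of the two-step distribution flow described above.
final class DistributionProtocolSketch {

  // Completed only when the receiving partition has processed the command and
  // its StreamProcessor has sent the completion message back.
  private final CompletableFuture<Void> distributionCompleted = new CompletableFuture<>();

  /** Returns true if the completion message arrived in time; false means the caller must retry. */
  boolean distributeOnce(
      final Consumer<Runnable> subscribeForCompletion, // assumed subscription hook
      final Runnable pushRequest,
      final Duration timeout)
      throws InterruptedException {
    // 1. Subscribe before sending, so the completion message cannot be missed.
    subscribeForCompletion.accept(() -> distributionCompleted.complete(null));

    // 2. Push the deployment. The immediate response only means the command was
    //    written to the receiver's log stream, not that it was processed.
    pushRequest.run();

    // 3. Wait for the completion message; if it does not arrive within the
    //    timeout, the distribution must be pushed again until it succeeds.
    try {
      distributionCompleted.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
      return true;
    } catch (final ExecutionException | TimeoutException e) {
      return false;
    }
  }
}
```

A `false` return here is exactly the point where the deployment must be pushed again, which is what the linked retry code is responsible for.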

@npepinpe npepinpe moved this from In progress to Review in progress in Zeebe Jan 5, 2022
@Zelldon
Member

Zelldon commented Jan 5, 2022

Thanks @deepthidevaki, that makes sense, I think I can now remember again. For the sake of completeness, be aware that these things are also retried after replay: https://github.com/camunda-cloud/zeebe/blob/d89a46ebadbdeb1b3f947fe8d955dbce84eca420/engine/src/main/java/io/camunda/zeebe/engine/processing/deployment/distribute/DeploymentRedistributor.java

@deepthidevaki
Contributor Author

> Thanks @deepthidevaki, that makes sense, I think I can now remember again. For the sake of completeness, be aware that these things are also retried after replay: https://github.com/camunda-cloud/zeebe/blob/d89a46ebadbdeb1b3f947fe8d955dbce84eca420/engine/src/main/java/io/camunda/zeebe/engine/processing/deployment/distribute/DeploymentRedistributor.java

This also goes through DeploymentDistributorImpl, right? So we don't handle this differently?

@Zelldon
Member

Zelldon commented Jan 5, 2022

Yes, this also uses this impl class.

@romansmirnov
Member

@deepthidevaki, I got it now! Thanks for your clarification.

ghost pushed a commit that referenced this issue Jan 6, 2022
8526: fix(broker): retry deployment distribution until success r=deepthidevaki a=deepthidevaki

## Description

Previously, the retry was only scheduled once. So if the retry did not get a response, it was never retried again. As a result, a partition that did not receive the deployment in the first two attempts will never receive it (unless a new leader is elected).

Also added a test that can reproduce the bug.
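As a hedged counterpart to the sketch in the issue description, the fixed pattern re-schedules the retry until the response arrives; the names and the delay are again assumptions, not the actual change in this PR:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the fixed pattern: keep rescheduling until a response arrives.
final class EndlessRetrySketch {

  private static final ScheduledExecutorService SCHEDULER =
      Executors.newSingleThreadScheduledExecutor();

  static void pushDeployment(final Runnable push, final CompletableFuture<Void> responseReceived) {
    push.run();
    SCHEDULER.schedule(
        () -> {
          if (!responseReceived.isDone()) {
            // Re-schedule the retry itself instead of giving up after one attempt.
            pushDeployment(push, responseReceived);
          }
        },
        30,
        TimeUnit.SECONDS);
  }
}
```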

## Related issues

closes #8445 
closes #8525 



8532: deps(maven): bump protobuf-bom from 3.19.1 to 3.19.2 r=npepinpe a=dependabot[bot]

Bumps [protobuf-bom](https://github.com/protocolbuffers/protobuf) from 3.19.1 to 3.19.2.
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@ghost ghost closed this as completed in 28f5468 Jan 6, 2022
Zeebe automation moved this from Review in progress to Done Jan 6, 2022
github-actions bot pushed a commit that referenced this issue Jan 6, 2022
…e failed broker is up

This test reproduces the bug #8525

(cherry picked from commit d60fa48)
github-actions bot pushed a commit that referenced this issue Jan 6, 2022
…e failed broker is up

This test reproduces the bug #8525

(cherry picked from commit d60fa48)
ghost pushed a commit that referenced this issue Jan 7, 2022
8545: [Backport stable/1.2] fix(broker): retry deployment distribution until success r=deepthidevaki a=github-actions[bot]

# Description
Backport of #8526 to `stable/1.2`.

closes #8445 #8525

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
ghost pushed a commit that referenced this issue Jan 7, 2022
8546: [Backport stable/1.3] fix(broker): retry deployment distribution until success r=deepthidevaki a=github-actions[bot]

# Description
Backport of #8526 to `stable/1.3`.

closes #8445 #8525

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
@npepinpe npepinpe added the version:1.3.1 label Jan 17, 2022
This issue was closed.