fix flaky test DeploymentClusteredTest.shouldRedistributeDeploymentWhenDeploymentPartitionIsRestarted #19358

@berkaycanbc berkaycanbc commented Jun 13, 2024

Description

Background
Not receiving an acknowledgement command back from the partition that processed the distributed command is an expected scenario. To handle this in production, we have a CommandRedistributor. Every 10 seconds (with an exponential backoff), the redistributor retries the distributions for which the distributing partition has not received an acknowledgement.

Flakiness
The logic explained above is inherently flaky: the partition is sometimes expected not to receive an acknowledgement. This is exactly what happens in the test case. Sometimes, the acknowledgement from partition 3 is not received by partition 1 (the deployment partition). Since the assertion waits only 5 seconds before failing, the redistributor has no chance to retry the distribution within that timeframe.

Solution
To fix this, we removed the assertion for CommandDistributionIntent.ACKNOWLEDGED. The test already calls clientRule.waitUntilDeploymentIsDone() later on, which verifies that the distribution completes within (partitionCount * 10L) seconds. Within that time frame, the distribution can be retried and finished. This solution doesn't eliminate the flakiness entirely, but it reduces its likelihood. There is no way to eliminate it completely unless we are willing to wait forever until the distribution is finished.
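
The timing argument above can be illustrated with a small, self-contained sketch. It is not Zeebe's CommandRedistributor: the class RetryWindowSketch, the method retriesWithin, the doubling backoff factor, and the partition count of 3 are all assumptions made for illustration. Only the 10 second retry interval, the 5 second assertion timeout, and the (partitionCount * 10L) second wait come from the description.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical timing model for illustration; not Zeebe's CommandRedistributor. */
public final class RetryWindowSketch {

  /** Returns the elapsed times of all retries that fire at or before the end of the window. */
  static List<Duration> retriesWithin(final Duration initialInterval, final Duration window) {
    if (initialInterval.isZero() || initialInterval.isNegative()) {
      throw new IllegalArgumentException("initialInterval must be positive");
    }
    final List<Duration> retries = new ArrayList<>();
    Duration interval = initialInterval;
    Duration elapsed = initialInterval; // the first retry fires after one full interval
    while (elapsed.compareTo(window) <= 0) {
      retries.add(elapsed);
      interval = interval.multipliedBy(2); // assumed backoff factor, not taken from the PR
      elapsed = elapsed.plus(interval);
    }
    return retries;
  }

  public static void main(final String[] args) {
    final Duration retryInterval = Duration.ofSeconds(10); // from the PR description
    final int partitionCount = 3; // assumption for this sketch

    // Window of the removed assertion: 5 seconds.
    System.out.println(retriesWithin(retryInterval, Duration.ofSeconds(5)));
    // Window of waitUntilDeploymentIsDone(): partitionCount * 10L seconds.
    System.out.println(retriesWithin(retryInterval, Duration.ofSeconds(partitionCount * 10L)));
  }
}
```

Under these assumptions, the 5 second window ends before the first retry can fire, while the longer window leaves room for at least one retry. That is why relying on the later wait makes the test less likely to fail.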

In addition, processDefinitionKey is renamed to deploymentKey, as the deployment event's key is a deployment key. That key is also added to the distribution assertion, since in the future other distributions with different keys might occur during the deployment alongside the deployment distribution.

Related issues

closes #17303

@github-actions github-actions bot added the component/zeebe Related to the Zeebe component/team label Jun 13, 2024
@berkaycanbc berkaycanbc added backport stable/8.3 Backport a pull request to 8.3.x backport stable/8.4 Backport a pull request to 8.4.x backport stable/8.5 Backport a pull request to stable/8.5 labels Jun 13, 2024
@berkaycanbc berkaycanbc requested a review from a team June 13, 2024 15:20
@mustafadagher mustafadagher left a comment

🚀

@berkaycanbc berkaycanbc added this pull request to the merge queue Jun 14, 2024
Merged via the queue into main with commit 7fc6cb6 Jun 14, 2024
44 checks passed
@berkaycanbc berkaycanbc deleted the bcan-17303-fix-flaky-DeploymentClusteredTest.shouldRedistributeDeploymentWhenDeploymentPartitionIsRestarted branch June 14, 2024 08:43
@backport-action
Git push to origin failed for stable/8.3 with exitcode 1

@backport-action
Git push to origin failed for stable/8.4 with exitcode 1

@backport-action
Git push to origin failed for stable/8.5 with exitcode 1

github-merge-queue bot pushed a commit that referenced this pull request Jun 14, 2024
…distributeDeploymentWhenDeploymentPartitionIsRestarted (#19369)

# Description
Backport of #19358 to `stable/8.3`.

relates to #17303
original author: @berkaycanbc
github-merge-queue bot pushed a commit that referenced this pull request Jun 14, 2024
…distributeDeploymentWhenDeploymentPartitionIsRestarted (#19370)

# Description
Backport of #19358 to `stable/8.4`.

relates to #17303
original author: @berkaycanbc
github-merge-queue bot pushed a commit that referenced this pull request Jun 14, 2024
…distributeDeploymentWhenDeploymentPartitionIsRestarted (#19371)

# Description
Backport of #19358 to `stable/8.5`.

relates to #17303
original author: @berkaycanbc
@github-actions github-actions bot added version:8.3.13 Marks an issue as being completely or in parts released in 8.3.13 version:8.4.9 Marks an issue as being completely or in parts released in 8.4.9 version:8.5.4 Marks an issue as being completely or in parts released in 8.5.4 labels Jun 26, 2024