fix flaky test DeploymentClusteredTest.shouldRedistributeDeploymentWhenDeploymentPartitionIsRestarted #19358
Conversation
🚀
Git push to origin failed for `stable/8.3` with exitcode 1.
Successfully created backport PR for `stable/8.3`: #19369.
Git push to origin failed for `stable/8.4` with exitcode 1.
Successfully created backport PR for `stable/8.4`: #19370.
Git push to origin failed for `stable/8.5` with exitcode 1.
Successfully created backport PR for `stable/8.5`: #19371.
fix flaky test DeploymentClusteredTest.shouldRedistributeDeploymentWhenDeploymentPartitionIsRestarted (#19369): backport of #19358 to `stable/8.3`; relates to #17303; original author: @berkaycanbc
fix flaky test DeploymentClusteredTest.shouldRedistributeDeploymentWhenDeploymentPartitionIsRestarted (#19370): backport of #19358 to `stable/8.4`; relates to #17303; original author: @berkaycanbc
fix flaky test DeploymentClusteredTest.shouldRedistributeDeploymentWhenDeploymentPartitionIsRestarted (#19371): backport of #19358 to `stable/8.5`; relates to #17303; original author: @berkaycanbc
Description
Background
Not receiving an acknowledgement command back from the partition that processed the distributed command is an expected scenario. To handle that in production, we have a `CommandRedistributor` implemented: every 10 seconds (with an exponential backoff), the redistributor simply retries the distributions for which no acknowledgement was received, as sketched below.
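For intuition, here is a minimal, hypothetical sketch of that retry loop. It is not the actual `CommandRedistributor` implementation; the class, fields, and `resendDistribution` hook are made up to illustrate the "fixed 10 s tick plus per-entry exponential backoff" idea.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; not Zeebe's real CommandRedistributor. */
final class RedistributorSketch {

  /** Per-distribution retry state, counted in 10 s ticks. */
  private static final class RetryState {
    int ticksUntilRetry = 1; // first retry on the next tick
    int currentBackoff = 1;  // doubles after each retry (exponential backoff)
  }

  // distributions still waiting for an acknowledgement, keyed by distribution key
  private final Map<Long, RetryState> pending = new ConcurrentHashMap<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start() {
    // fixed 10 s tick; each entry's backoff decides whether it is retried on a tick
    scheduler.scheduleAtFixedRate(this::onTick, 10, 10, TimeUnit.SECONDS);
  }

  void onDistributionSent(final long distributionKey) {
    pending.put(distributionKey, new RetryState());
  }

  void onAcknowledged(final long distributionKey) {
    pending.remove(distributionKey); // acknowledged, so nothing left to retry
  }

  private void onTick() {
    pending.forEach((key, state) -> {
      if (--state.ticksUntilRetry <= 0) {
        resendDistribution(key);
        state.currentBackoff *= 2; // wait twice as many ticks before the next retry
        state.ticksUntilRetry = state.currentBackoff;
      }
    });
  }

  private void resendDistribution(final long distributionKey) {
    // In the real engine this would re-send the distributed command to the
    // partition that has not acknowledged it yet; omitted in this sketch.
  }
}
```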
Flakiness
The logic explained above is timing-sensitive by nature: the partition is sometimes expected not to receive an acknowledgement back, and this is exactly what happens in the test case. Sometimes the acknowledgement from partition 3 is not received by partition 1 (the deployment partition). Since the assertion waits only 5 seconds before failing, the redistributor cannot retry the distribution within that time frame.
Solution
As a solution, we removed the assertion for `CommandDistributionIntent.ACKNOWLEDGED`, because the test already calls `clientRule.waitUntilDeploymentIsDone()` later on, which verifies that the distribution completes within (partitionCount * 10L) seconds. Within that time frame the redistribution can be retried and the distribution can finish. This doesn't eliminate the flakiness entirely, but it reduces its likelihood; there is no way to remove it completely unless we accept waiting indefinitely for the distribution to finish.
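As a rough illustration of the test-side change, here is a hedged sketch of the resulting flow, assuming Zeebe's Java client and a `GrpcClientRule`-style `clientRule` (the PR text itself only names `clientRule.waitUntilDeploymentIsDone()`); the process model and resource name are made up:

```java
import io.camunda.zeebe.model.bpmn.Bpmn;
import io.camunda.zeebe.model.bpmn.BpmnModelInstance;

// Inside the test method (assumes the test class provides a clientRule):
final BpmnModelInstance process =
    Bpmn.createExecutableProcess("process").startEvent().endEvent().done();

final long deploymentKey =
    clientRule
        .getClient()
        .newDeployResourceCommand()
        .addProcessModel(process, "process.bpmn")
        .send()
        .join()
        .getKey();

// The fixed 5 s assertion on CommandDistributionIntent.ACKNOWLEDGED is removed.
// This wait allows up to partitionCount * 10 s, which is enough for the
// CommandRedistributor to retry an unacknowledged distribution at least once.
clientRule.waitUntilDeploymentIsDone(deploymentKey);
```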
In addition, `processDefinitionKey` is renamed to `deploymentKey`, as the deployment event's key is a deployment key. That key is also added to the distribution assertion, because in the future the deployment might trigger other distributions with different keys alongside the deployment distribution; see the sketch at the end of this description.
Related issues
closes #17303
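For completeness, a hypothetical sketch of what filtering the distribution assertion by the deployment key could look like, assuming Zeebe's `RecordingExporter` test utility (the real test's method names and filters may differ):

```java
import static org.assertj.core.api.Assertions.assertThat;

import io.camunda.zeebe.test.util.record.RecordingExporter;

/** Illustrative only: assert that a record exists for the deployment key. */
final class DistributionAssertionSketch {
  static void assertDistributionForDeployment(final long deploymentKey) {
    assertThat(
            RecordingExporter.records()
                // filter by deploymentKey (formerly processDefinitionKey) so that
                // future distributions with other keys don't satisfy the assertion
                .filter(r -> r.getKey() == deploymentKey)
                .limit(1))
        .isNotEmpty();
  }
}
```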