fix flaky test DeploymentClusteredTest.shouldRedistributeDeploymentWhenDeploymentPartitionIsRestarted #19358
Conversation
🚀
Git push to origin failed for `stable/8.3` with exitcode 1.
Successfully created backport PR for `stable/8.3`: #19369.
Git push to origin failed for `stable/8.4` with exitcode 1.
Successfully created backport PR for `stable/8.4`: #19370.
Git push to origin failed for `stable/8.5` with exitcode 1.
Successfully created backport PR for `stable/8.5`: #19371.
fix flaky test DeploymentClusteredTest.shouldRedistributeDeploymentWhenDeploymentPartitionIsRestarted (#19369): backport of #19358 to `stable/8.3`; relates to #17303; original author: @berkaycanbc
fix flaky test DeploymentClusteredTest.shouldRedistributeDeploymentWhenDeploymentPartitionIsRestarted (#19370): backport of #19358 to `stable/8.4`; relates to #17303; original author: @berkaycanbc
fix flaky test DeploymentClusteredTest.shouldRedistributeDeploymentWhenDeploymentPartitionIsRestarted (#19371): backport of #19358 to `stable/8.5`; relates to #17303; original author: @berkaycanbc
Description
Background
Not receiving an acknowledgement command back from the partition that processed the distributed command is an expected scenario. To handle that in production, we have a `CommandRedistributor` implemented: every 10 seconds (with an exponential backoff), the redistributor simply retries the distributions for which no acknowledgement was received, as sketched below.
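For intuition, here is a minimal, hypothetical sketch of that retry loop. It is not the actual `CommandRedistributor` implementation; the class, fields, and `resendDistribution` hook are made up to illustrate the "fixed 10 s tick plus per-entry exponential backoff" idea.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; not Zeebe's real CommandRedistributor. */
final class RedistributorSketch {

  /** Per-distribution retry state, counted in 10 s ticks. */
  private static final class RetryState {
    int ticksUntilRetry = 1; // first retry on the next tick
    int currentBackoff = 1;  // doubles after each retry (exponential backoff)
  }

  // distributions still waiting for an acknowledgement, keyed by distribution key
  private final Map<Long, RetryState> pending = new ConcurrentHashMap<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start() {
    // fixed 10 s tick; each entry's backoff decides whether it is retried on a tick
    scheduler.scheduleAtFixedRate(this::onTick, 10, 10, TimeUnit.SECONDS);
  }

  void onDistributionSent(final long distributionKey) {
    pending.put(distributionKey, new RetryState());
  }

  void onAcknowledged(final long distributionKey) {
    pending.remove(distributionKey); // acknowledged, so nothing left to retry
  }

  private void onTick() {
    pending.forEach((key, state) -> {
      if (--state.ticksUntilRetry <= 0) {
        resendDistribution(key);
        state.currentBackoff *= 2; // wait twice as many ticks before the next retry
        state.ticksUntilRetry = state.currentBackoff;
      }
    });
  }

  private void resendDistribution(final long distributionKey) {
    // In the real engine this would re-send the distributed command to the
    // partition that has not acknowledged it yet; omitted in this sketch.
  }
}
```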
Flakiness
The logic explained above is timing-sensitive by nature: the partition is sometimes expected not to receive an acknowledgement back, and this is exactly what happens in the test case. Sometimes the acknowledgement from partition 3 is not received by partition 1 (the deployment partition). Since the assertion waits only 5 seconds before failing, the redistributor cannot retry the distribution within that time frame.
Solution
As a solution, we removed the assertion for `CommandDistributionIntent.ACKNOWLEDGED`, because the test already calls `clientRule.waitUntilDeploymentIsDone()` later on, which verifies that the distribution completes within (partitionCount * 10L) seconds. Within that time frame the redistribution can be retried and the distribution can finish. This doesn't eliminate the flakiness entirely, but it reduces its likelihood; there is no way to remove it completely unless we accept waiting indefinitely for the distribution to finish.
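As a rough illustration of the test-side change, here is a hedged sketch of the resulting flow, assuming Zeebe's Java client and a `GrpcClientRule`-style `clientRule` (the PR text itself only names `clientRule.waitUntilDeploymentIsDone()`); the process model and resource name are made up:

```java
import io.camunda.zeebe.model.bpmn.Bpmn;
import io.camunda.zeebe.model.bpmn.BpmnModelInstance;

// Inside the test method (assumes the test class provides a clientRule):
final BpmnModelInstance process =
    Bpmn.createExecutableProcess("process").startEvent().endEvent().done();

final long deploymentKey =
    clientRule
        .getClient()
        .newDeployResourceCommand()
        .addProcessModel(process, "process.bpmn")
        .send()
        .join()
        .getKey();

// The fixed 5 s assertion on CommandDistributionIntent.ACKNOWLEDGED is removed.
// This wait allows up to partitionCount * 10 s, which is enough for the
// CommandRedistributor to retry an unacknowledged distribution at least once.
clientRule.waitUntilDeploymentIsDone(deploymentKey);
```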
In addition, `processDefinitionKey` is renamed to `deploymentKey`, as the deployment event's key is a deployment key. That key is also added to the distribution assertion, because in the future the deployment might trigger other distributions with different keys alongside the deployment distribution; see the sketch at the end of this description.
Related issues
closes #17303
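For completeness, a hypothetical sketch of what filtering the distribution assertion by the deployment key could look like, assuming Zeebe's `RecordingExporter` test utility (the real test's method names and filters may differ):

```java
import static org.assertj.core.api.Assertions.assertThat;

import io.camunda.zeebe.test.util.record.RecordingExporter;

/** Illustrative only: assert that a record exists for the deployment key. */
final class DistributionAssertionSketch {
  static void assertDistributionForDeployment(final long deploymentKey) {
    assertThat(
            RecordingExporter.records()
                // filter by deploymentKey (formerly processDefinitionKey) so that
                // future distributions with other keys don't satisfy the assertion
                .filter(r -> r.getKey() == deploymentKey)
                .limit(1))
        .isNotEmpty();
  }
}
```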