Flaky MultiPartitionDeploymentLifecycleTest.shouldTestLifecycle
#9964
Comments
Increased severity, as this week this has been the main reason why builds failed due to flaky tests.
Occurred again here: https://github.com/camunda/zeebe/actions/runs/3150907128/jobs/5124227135
Happened again on main:
https://github.com/camunda/zeebe/actions/runs/3225048569/jobs/5276879600
This is failing quite frequently now.
I was giving some priority to the Hacktoberfest contributions, but I will start digging into this again today. So far my observation is that it fails because in the exporter we get all the commands first for some reason.
After running it a lot of times I finally got a failure locally. This allowed me to see the positions of the exported records. What is odd about this is the 3rd line in this log. Here we can see that the
Looking at the code, we can see that we always write the `DISTRIBUTING` event before distributing the deployment:

```java
otherPartitions.forEach(
    partitionId -> {
      deploymentDistributionRecord.setPartition(partitionId);
      stateWriter.appendFollowUpEvent(
          key, DeploymentDistributionIntent.DISTRIBUTING, deploymentDistributionRecord);
      distributeDeploymentToPartition(key, partitionId, copiedDeploymentBuffer);
    });
```

Somehow we are writing the response of this deployment distribution to the log stream before we are writing the distributing event.

Hypothesis
Tl;dr: I think the inter-partition communication does not take event buffering into consideration, resulting in a strange order of commands/events on the log stream. @Zelldon could you verify if my hypothesis makes any sense?
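The hypothesis above can be sketched with a small, self-contained simulation (all names here are illustrative, not Zeebe's actual API): events appended through the state writer are buffered until the processing transaction commits, while an inter-partition message is sent immediately, so a fast response from the other partition can reach the log stream first.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferedWriteOrder {

  // Simulates one processing transaction. If the inter-partition send
  // bypasses the write buffer, the remote response can hit the log stream
  // before the buffered DISTRIBUTING event is flushed on commit.
  static List<String> simulate(boolean sendBypassesBuffer) {
    List<String> logStream = new ArrayList<>();
    List<String> writeBuffer = new ArrayList<>();

    writeBuffer.add("DeploymentDistribution:DISTRIBUTING"); // buffered event

    if (sendBypassesBuffer) {
      // The remote partition answers before our transaction commits.
      logStream.add("DeploymentDistribution:COMPLETE");
    }

    logStream.addAll(writeBuffer); // transaction commit flushes the buffer

    if (!sendBypassesBuffer) {
      logStream.add("DeploymentDistribution:COMPLETE");
    }
    return logStream;
  }

  public static void main(String[] args) {
    System.out.println(simulate(true));
    System.out.println(simulate(false));
  }
}
```

In the first case the `COMPLETE` response lands on the log before `DISTRIBUTING`, which matches the odd ordering observed in the exporter.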
Sounds plausible, but I think this was already the case before that abstraction, since the LogStreamWriters also just "buffered" the records during processing. I guess there was just some timing change which causes this issue more often.
🤔 There were also changes to how the deployment distribution works somewhat recently. It might be that this caused a change in the timing; I can't fully remember the details of that change. I think the order shouldn't matter either way, so I will modify the test to be a bit more lenient about the command ordering, and that should fix the flakiness.
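A hypothetical sketch of such a lenient check (illustrative only, not the actual test code): compare the exported record intents as multisets instead of asserting an exact sequence.

```java
import java.util.List;
import java.util.stream.Collectors;

public class LenientOrderCheck {

  // Passes as long as both sides contain the same records, in any order.
  static boolean sameRecordsAnyOrder(List<String> actual, List<String> expected) {
    return actual.stream().sorted().collect(Collectors.toList())
        .equals(expected.stream().sorted().collect(Collectors.toList()));
  }

  public static void main(String[] args) {
    // The swapped order that made the strict test fail now passes.
    List<String> exported = List.of("COMPLETE", "DISTRIBUTING");
    List<String> expected = List.of("DISTRIBUTING", "COMPLETE");
    System.out.println(sameRecordsAnyOrder(exported, expected));
  }
}
```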
@remcowesterhoud maybe it makes sense to use the PostCommitTasks for the deployment distribution? Then the request would only be executed IF the processing is done and the transaction is committed. Might make more sense, I guess?
Summary
Failures
Example assertion failure
Hypotheses
The ordering of events from multiple deployment distribution attempts is not guaranteed; the test seems to be too strict.