
Flaky MultiPartitionDeploymentLifecycleTest.shouldTestLifecycle #9964

Closed
lenaschoenburg opened this issue Aug 3, 2022 · 10 comments · Fixed by #10065 or #10689
Assignees
Labels
kind/flake - Categorizes issue or PR as related to a flaky test
version:1.3.14 - Marks an issue as being completely or in parts released in 1.3.14
version:8.1.0-alpha5 - Marks an issue as being completely or in parts released in 8.1.0-alpha5
version:8.1.0 - Marks an issue as being completely or in parts released in 8.1.0
version:8.1.2 - Marks an issue as being completely or in parts released in 8.1.2
version:8.2.0-alpha1 - Marks an issue as being completely or in parts released in 8.2.0-alpha1
version:8.2.0 - Marks an issue as being completely or in parts released in 8.2.0

Comments


lenaschoenburg commented Aug 3, 2022

Summary

Try to answer the following as best as possible

  • How often does the test fail? More than once
  • Does it block your work? No
  • Do we suspect that it is a real failure? No

Failures

Outline known failure cases, e.g. a failed assertion and its stacktrace obtained from Jenkins

Example assertion failure
Error:  io.camunda.zeebe.engine.processing.deployment.MultiPartitionDeploymentLifecycleTest.shouldTestLifecycle  Time elapsed: 0.601 s  <<< FAILURE!
java.lang.AssertionError: 

Actual and expected have the same elements but not in the same order, at index 6 actual element was:
(COMPLETED, EVENT)
whereas expected element was:
(COMPLETE, COMMAND)

at io.camunda.zeebe.engine.processing.deployment.MultiPartitionDeploymentLifecycleTest.shouldTestLifecycle(MultiPartitionDeploymentLifecycleTest.java:59)

Hypotheses

The ordering of events from multiple deployment distribution attempts is not guaranteed; the test seems to be too strict.
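The hypothesis can be illustrated with a minimal sketch (plain Java, not the actual test code): an order-sensitive assertion over the distribution records flakes whenever the partitions answer in a different order, while an order-insensitive comparison of the same records stays stable. The `Rec` type and helper names here are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class OrderingDemo {
  // Simplified stand-in for the (intent, recordType) tuples asserted by the test.
  record Rec(String intent, String type) {}

  // Order-sensitive comparison: flakes when distribution records arrive reordered.
  static boolean sameOrder(List<Rec> actual, List<Rec> expected) {
    return actual.equals(expected);
  }

  // Order-insensitive comparison: same elements with the same multiplicities.
  static boolean sameElements(List<Rec> actual, List<Rec> expected) {
    return counts(actual).equals(counts(expected));
  }

  private static Map<Rec, Long> counts(List<Rec> records) {
    return records.stream()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
  }

  public static void main(String[] args) {
    var expected = List.of(new Rec("COMPLETE", "COMMAND"), new Rec("COMPLETED", "EVENT"));
    // In this run the COMPLETED event was exported before the COMPLETE command.
    var actual = List.of(new Rec("COMPLETED", "EVENT"), new Rec("COMPLETE", "COMMAND"));

    System.out.println(sameOrder(actual, expected));    // false -> strict test fails
    System.out.println(sameElements(actual, expected)); // true  -> lenient check passes
  }
}
```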

@lenaschoenburg lenaschoenburg added the kind/flake Categorizes issue or PR as related to a flaky test label Aug 3, 2022
@pihme pihme added the severity/high Marks a bug as having a noticeable impact on the user with no known workaround label Aug 10, 2022

pihme commented Aug 10, 2022

Increased severity as this week this has been the main reason why builds failed due to flaky tests.

@remcowesterhoud remcowesterhoud self-assigned this Aug 15, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Aug 17, 2022
10091: [Backport stable/8.0] Fix flaky MultiPartitionDeploymentLifecycleTest r=remcowesterhoud a=backport-action

# Description
Backport of #10065 to `stable/8.0`.

relates to #9964

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
zeebe-bors-camunda bot added a commit that referenced this issue Aug 17, 2022
10098: Backport 10065 to stable/1.3 r=remcowesterhoud a=remcowesterhoud

# Description
Backport of #10065 to `stable/1.3`.

relates to #9964

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@saig0 saig0 added release/8.0.8 version:1.3.14 Marks an issue as being completely or in parts released in 1.3.14 labels Sep 1, 2022

Zelldon commented Sep 29, 2022

@Zelldon Zelldon reopened this Sep 29, 2022
@remcowesterhoud remcowesterhoud removed their assignment Sep 29, 2022
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
@remcowesterhoud remcowesterhoud self-assigned this Oct 10, 2022

Zelldon commented Oct 11, 2022

Happened again on main:

Warning:  Flakes: 
Warning:  io.camunda.zeebe.engine.processing.deployment.MultiPartitionDeploymentLifecycleTest.shouldTestLifecycle
Error:    Run 1: MultiPartitionDeploymentLifecycleTest.shouldTestLifecycle:73 
Expecting actual:
  [(CREATE, COMMAND, 1),
    (COMPLETE, COMMAND, 3),
    (COMPLETE, COMMAND, 2),
    (CREATED, EVENT, 1),
    (CREATED, EVENT, 1),
    (DISTRIBUTING, EVENT, 2),
    (DISTRIBUTING, EVENT, 3),
    (COMPLETED, EVENT, 3),
    (COMPLETED, EVENT, 2),
    (FULLY_DISTRIBUTED, EVENT, 1)]
to start with:
  [(CREATE, COMMAND, 1), (CREATED, EVENT, 1), (CREATED, EVENT, 1)]

https://github.com/camunda/zeebe/actions/runs/3225048569/jobs/5276879600

@deepthidevaki

This is failing quite frequently now.


remcowesterhoud commented Oct 12, 2022

I was giving some priority to the hacktoberfest contributions, but I will start digging into this again today.

So far my observation is that it fails because in the exporter we get all the commands first for some reason.

@remcowesterhoud

After running it a lot of times I finally got a failure locally. This allowed me to see the positions of the exported records. What is odd is the third line in this log: here we can see that the DeploymentDistribution.COMPLETE command has position 2.

Position: 1 TypedRecordImpl{metadata=RecordMetadata{recordType=COMMAND, intentValue=255, intent=CREATE, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"resources":[{"resourceName":"process.bpmn","resource":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8+CjxkZWZpbml0aW9ucyB4bWxuczpicG1uZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvREkiIHhtbG5zOmRjPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9ERC8yMDEwMDUyNC9EQyIgeG1sbnM6ZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0RELzIwMTAwNTI0L0RJIiBpZD0iZGVmaW5pdGlvbnNfZmNkYTJlY2MtYzBhYS00NDNmLWI5NDItNWZkNjFkMmQzMjVkIiB0YXJnZXROYW1lc3BhY2U9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvTU9ERUwiIHhtbG5zPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9CUE1OLzIwMTAwNTI0L01PREVMIj4KICA8cHJvY2VzcyBpZD0ic2hvdWxkUmVEaXN0cmlidXRlQWZ0ZXJSZWNvdmVyeSIgaXNFeGVjdXRhYmxlPSJ0cnVlIj4KICAgIDxzdGFydEV2ZW50IGlkPSJzdGFydEV2ZW50XzgxY2FhNmQ2LTA5YmUtNDgzZC05YjZkLTg4NGZlODYyYjU2YiI+CiAgICAgIDxvdXRnb2luZz5zZXF1ZW5jZUZsb3dfODJmMjYzOTAtOWM4MS00NzhiLTgzM2YtYmQwZmViZmVlNmFiPC9vdXRnb2luZz4KICAgIDwvc3RhcnRFdmVudD4KICAgIDxlbmRFdmVudCBpZD0iZW5kRXZlbnRfM2NiODUyZDgtMThiYS00YjBlLTg2MGMtOTg1ZDc2ODZhYWMwIj4KICAgICAgPGluY29taW5nPnNlcXVlbmNlRmxvd184MmY...}
Position: 1 TypedRecordImpl{metadata=RecordMetadata{recordType=COMMAND, intentValue=255, intent=DISTRIBUTE, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"resources":[{"resourceName":"process.bpmn","resource":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8+CjxkZWZpbml0aW9ucyB4bWxuczpicG1uZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvREkiIHhtbG5zOmRjPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9ERC8yMDEwMDUyNC9EQyIgeG1sbnM6ZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0RELzIwMTAwNTI0L0RJIiBpZD0iZGVmaW5pdGlvbnNfZmNkYTJlY2MtYzBhYS00NDNmLWI5NDItNWZkNjFkMmQzMjVkIiB0YXJnZXROYW1lc3BhY2U9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvTU9ERUwiIHhtbG5zPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9CUE1OLzIwMTAwNTI0L01PREVMIj4KICA8cHJvY2VzcyBpZD0ic2hvdWxkUmVEaXN0cmlidXRlQWZ0ZXJSZWNvdmVyeSIgaXNFeGVjdXRhYmxlPSJ0cnVlIj4KICAgIDxzdGFydEV2ZW50IGlkPSJzdGFydEV2ZW50XzgxY2FhNmQ2LTA5YmUtNDgzZC05YjZkLTg4NGZlODYyYjU2YiI+CiAgICAgIDxvdXRnb2luZz5zZXF1ZW5jZUZsb3dfODJmMjYzOTAtOWM4MS00NzhiLTgzM2YtYmQwZmViZmVlNmFiPC9vdXRnb2luZz4KICAgIDwvc3RhcnRFdmVudD4KICAgIDxlbmRFdmVudCBpZD0iZW5kRXZlbnRfM2NiODUyZDgtMThiYS00YjBlLTg2MGMtOTg1ZDc2ODZhYWMwIj4KICAgICAgPGluY29taW5nPnNlcXVlbmNlRmxvd184MmY...}
Position: 2 TypedRecordImpl{metadata=RecordMetadata{recordType=COMMAND, intentValue=255, intent=COMPLETE, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT_DISTRIBUTION, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"partitionId":2}}
Position: 1 TypedRecordImpl{metadata=RecordMetadata{recordType=COMMAND, intentValue=255, intent=DISTRIBUTE, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"resources":[{"resourceName":"process.bpmn","resource":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8+CjxkZWZpbml0aW9ucyB4bWxuczpicG1uZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvREkiIHhtbG5zOmRjPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9ERC8yMDEwMDUyNC9EQyIgeG1sbnM6ZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0RELzIwMTAwNTI0L0RJIiBpZD0iZGVmaW5pdGlvbnNfZmNkYTJlY2MtYzBhYS00NDNmLWI5NDItNWZkNjFkMmQzMjVkIiB0YXJnZXROYW1lc3BhY2U9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvTU9ERUwiIHhtbG5zPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9CUE1OLzIwMTAwNTI0L01PREVMIj4KICA8cHJvY2VzcyBpZD0ic2hvdWxkUmVEaXN0cmlidXRlQWZ0ZXJSZWNvdmVyeSIgaXNFeGVjdXRhYmxlPSJ0cnVlIj4KICAgIDxzdGFydEV2ZW50IGlkPSJzdGFydEV2ZW50XzgxY2FhNmQ2LTA5YmUtNDgzZC05YjZkLTg4NGZlODYyYjU2YiI+CiAgICAgIDxvdXRnb2luZz5zZXF1ZW5jZUZsb3dfODJmMjYzOTAtOWM4MS00NzhiLTgzM2YtYmQwZmViZmVlNmFiPC9vdXRnb2luZz4KICAgIDwvc3RhcnRFdmVudD4KICAgIDxlbmRFdmVudCBpZD0iZW5kRXZlbnRfM2NiODUyZDgtMThiYS00YjBlLTg2MGMtOTg1ZDc2ODZhYWMwIj4KICAgICAgPGluY29taW5nPnNlcXVlbmNlRmxvd184MmY...}
Position: 3 TypedRecordImpl{metadata=RecordMetadata{recordType=EVENT, intentValue=255, intent=CREATED, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=PROCESS, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"bpmnProcessId":"shouldReDistributeAfterRecovery","version":1,"processDefinitionKey":2251799813685249,"resourceName":"process.bpmn","checksum":"XkCnzocF31uHoqDJOltpSQ==","resource":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8+CjxkZWZpbml0aW9ucyB4bWxuczpicG1uZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvREkiIHhtbG5zOmRjPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9ERC8yMDEwMDUyNC9EQyIgeG1sbnM6ZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0RELzIwMTAwNTI0L0RJIiBpZD0iZGVmaW5pdGlvbnNfZmNkYTJlY2MtYzBhYS00NDNmLWI5NDItNWZkNjFkMmQzMjVkIiB0YXJnZXROYW1lc3BhY2U9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvTU9ERUwiIHhtbG5zPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9CUE1OLzIwMTAwNTI0L01PREVMIj4KICA8cHJvY2VzcyBpZD0ic2hvdWxkUmVEaXN0cmlidXRlQWZ0ZXJSZWNvdmVyeSIgaXNFeGVjdXRhYmxlPSJ0cnVlIj4KICAgIDxzdGFydEV2ZW50IGlkPSJzdGFydEV2ZW50XzgxY2FhNmQ2LTA5YmUtNDgzZC05YjZkLTg4NGZlODYyYjU2YiI+CiAgICAgIDxvdXRnb2luZz5zZXF1ZW5jZUZsb3dfODJmMjYzOTAtOWM4MS00NzhiLTgzM2YtYmQwZmViZmVlNmFiPC9vdXRnb2luZz4KICAgIDwvc3RhcnRFdmVudD4KICAgI...}
Position: 4 TypedRecordImpl{metadata=RecordMetadata{recordType=EVENT, intentValue=255, intent=CREATED, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"resources":[{"resourceName":"process.bpmn","resource":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8+CjxkZWZpbml0aW9ucyB4bWxuczpicG1uZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvREkiIHhtbG5zOmRjPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9ERC8yMDEwMDUyNC9EQyIgeG1sbnM6ZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0RELzIwMTAwNTI0L0RJIiBpZD0iZGVmaW5pdGlvbnNfZmNkYTJlY2MtYzBhYS00NDNmLWI5NDItNWZkNjFkMmQzMjVkIiB0YXJnZXROYW1lc3BhY2U9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvTU9ERUwiIHhtbG5zPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9CUE1OLzIwMTAwNTI0L01PREVMIj4KICA8cHJvY2VzcyBpZD0ic2hvdWxkUmVEaXN0cmlidXRlQWZ0ZXJSZWNvdmVyeSIgaXNFeGVjdXRhYmxlPSJ0cnVlIj4KICAgIDxzdGFydEV2ZW50IGlkPSJzdGFydEV2ZW50XzgxY2FhNmQ2LTA5YmUtNDgzZC05YjZkLTg4NGZlODYyYjU2YiI+CiAgICAgIDxvdXRnb2luZz5zZXF1ZW5jZUZsb3dfODJmMjYzOTAtOWM4MS00NzhiLTgzM2YtYmQwZmViZmVlNmFiPC9vdXRnb2luZz4KICAgIDwvc3RhcnRFdmVudD4KICAgIDxlbmRFdmVudCBpZD0iZW5kRXZlbnRfM2NiODUyZDgtMThiYS00YjBlLTg2MGMtOTg1ZDc2ODZhYWMwIj4KICAgICAgPGluY29taW5nPnNlcXVlbmNlRmxvd184MmY...}
Position: 5 TypedRecordImpl{metadata=RecordMetadata{recordType=EVENT, intentValue=255, intent=DISTRIBUTING, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT_DISTRIBUTION, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"partitionId":2}}
Position: 6 TypedRecordImpl{metadata=RecordMetadata{recordType=EVENT, intentValue=255, intent=DISTRIBUTING, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT_DISTRIBUTION, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"partitionId":3}}
Position: 7 TypedRecordImpl{metadata=RecordMetadata{recordType=COMMAND, intentValue=255, intent=COMPLETE, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT_DISTRIBUTION, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"partitionId":3}}
Position: 8 TypedRecordImpl{metadata=RecordMetadata{recordType=EVENT, intentValue=255, intent=COMPLETED, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT_DISTRIBUTION, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"partitionId":2}}
Position: 9 TypedRecordImpl{metadata=RecordMetadata{recordType=EVENT, intentValue=255, intent=COMPLETED, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT_DISTRIBUTION, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"partitionId":3}}
Position: 10 TypedRecordImpl{metadata=RecordMetadata{recordType=EVENT, intentValue=255, intent=FULLY_DISTRIBUTED, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"resources":[],"processesMetadata":[],"decisionRequirementsMetadata":[],"decisionsMetadata":[]}}
Position: 2 TypedRecordImpl{metadata=RecordMetadata{recordType=EVENT, intentValue=255, intent=DISTRIBUTED, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"resources":[{"resourceName":"process.bpmn","resource":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8+CjxkZWZpbml0aW9ucyB4bWxuczpicG1uZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvREkiIHhtbG5zOmRjPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9ERC8yMDEwMDUyNC9EQyIgeG1sbnM6ZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0RELzIwMTAwNTI0L0RJIiBpZD0iZGVmaW5pdGlvbnNfZmNkYTJlY2MtYzBhYS00NDNmLWI5NDItNWZkNjFkMmQzMjVkIiB0YXJnZXROYW1lc3BhY2U9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvTU9ERUwiIHhtbG5zPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9CUE1OLzIwMTAwNTI0L01PREVMIj4KICA8cHJvY2VzcyBpZD0ic2hvdWxkUmVEaXN0cmlidXRlQWZ0ZXJSZWNvdmVyeSIgaXNFeGVjdXRhYmxlPSJ0cnVlIj4KICAgIDxzdGFydEV2ZW50IGlkPSJzdGFydEV2ZW50XzgxY2FhNmQ2LTA5YmUtNDgzZC05YjZkLTg4NGZlODYyYjU2YiI+CiAgICAgIDxvdXRnb2luZz5zZXF1ZW5jZUZsb3dfODJmMjYzOTAtOWM4MS00NzhiLTgzM2YtYmQwZmViZmVlNmFiPC9vdXRnb2luZz4KICAgIDwvc3RhcnRFdmVudD4KICAgIDxlbmRFdmVudCBpZD0iZW5kRXZlbnRfM2NiODUyZDgtMThiYS00YjBlLTg2MGMtOTg1ZDc2ODZhYWMwIj4KICAgICAgPGluY29taW5nPnNlcXVlbmNlRmxvd184MmY...}
Position: 2 TypedRecordImpl{metadata=RecordMetadata{recordType=EVENT, intentValue=255, intent=DISTRIBUTED, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=DEPLOYMENT, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, value={"resources":[{"resourceName":"process.bpmn","resource":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8+CjxkZWZpbml0aW9ucyB4bWxuczpicG1uZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvREkiIHhtbG5zOmRjPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9ERC8yMDEwMDUyNC9EQyIgeG1sbnM6ZGk9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0RELzIwMTAwNTI0L0RJIiBpZD0iZGVmaW5pdGlvbnNfZmNkYTJlY2MtYzBhYS00NDNmLWI5NDItNWZkNjFkMmQzMjVkIiB0YXJnZXROYW1lc3BhY2U9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvTU9ERUwiIHhtbG5zPSJodHRwOi8vd3d3Lm9tZy5vcmcvc3BlYy9CUE1OLzIwMTAwNTI0L01PREVMIj4KICA8cHJvY2VzcyBpZD0ic2hvdWxkUmVEaXN0cmlidXRlQWZ0ZXJSZWNvdmVyeSIgaXNFeGVjdXRhYmxlPSJ0cnVlIj4KICAgIDxzdGFydEV2ZW50IGlkPSJzdGFydEV2ZW50XzgxY2FhNmQ2LTA5YmUtNDgzZC05YjZkLTg4NGZlODYyYjU2YiI+CiAgICAgIDxvdXRnb2luZz5zZXF1ZW5jZUZsb3dfODJmMjYzOTAtOWM4MS00NzhiLTgzM2YtYmQwZmViZmVlNmFiPC9vdXRnb2luZz4KICAgIDwvc3RhcnRFdmVudD4KICAgIDxlbmRFdmVudCBpZD0iZW5kRXZlbnRfM2NiODUyZDgtMThiYS00YjBlLTg2MGMtOTg1ZDc2ODZhYWMwIj4KICAgICAgPGluY29taW5nPnNlcXVlbmNlRmxvd184MmY...}

Looking at the code we can see that we always write a DeploymentDistribution.DISTRIBUTING event before we start distributing to other partitions:

    otherPartitions.forEach(
        partitionId -> {
          deploymentDistributionRecord.setPartition(partitionId);
          stateWriter.appendFollowUpEvent(
              key, DeploymentDistributionIntent.DISTRIBUTING, deploymentDistributionRecord);

          distributeDeploymentToPartition(key, partitionId, copiedDeploymentBuffer);
        });

Somehow we are writing the response of this deployment distribution to the log stream before we are writing the distributing event.

Hypothesis
I am not sure how to verify this, but I feel like this could be a result of the engine abstraction. My theory is that it has to do with the buffering of records, which are not written until the command has been fully processed.

  1. We receive a Deployment.CREATE command
  2. During processing we write the DeploymentDistribution.DISTRIBUTING event and also send a message to the other partitions in order to distribute the deployment.
  3. The DeploymentDistribution.DISTRIBUTING event is written to the buffer; nothing is written to the log stream yet. Meanwhile, the other partition receives the message and writes a Deployment.DISTRIBUTE command. Since that partition is idle, it immediately starts processing this command.
  4. Here we run into a race condition: if the second partition sends its response back before the first partition has finished processing the Deployment.CREATE command, the first partition writes the DeploymentDistribution.COMPLETE command before it writes the buffered events.

Tl;dr: I think the inter-partition communication does not take event buffering into consideration, resulting in a strange order of commands/events on the log stream.

@Zelldon Could you verify if my hypothesis makes any sense?
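The suspected race in steps 1-4 can be sketched in a few lines (hypothetical structure, not Zeebe's actual writer API): follow-up events go into a buffer that is only flushed once processing finishes, while the remote partition's response is appended directly, so an early response overtakes the buffered DISTRIBUTING event.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferingRaceSketch {
  // Returns the resulting log-stream order for one Deployment.CREATE processing.
  static List<String> run(boolean remoteRespondsEarly) {
    List<String> logStream = new ArrayList<>();
    List<String> eventBuffer = new ArrayList<>();

    logStream.add("Deployment.CREATE");
    // The DISTRIBUTING follow-up event is buffered, not yet on the log.
    eventBuffer.add("DeploymentDistribution.DISTRIBUTING");
    // The distribution message was already sent; the other partition may
    // answer before this partition finishes processing.
    if (remoteRespondsEarly) {
      logStream.add("DeploymentDistribution.COMPLETE"); // lands before the flush
    }
    // Processing done: the buffered events are flushed to the log.
    logStream.addAll(eventBuffer);
    if (!remoteRespondsEarly) {
      logStream.add("DeploymentDistribution.COMPLETE");
    }
    return logStream;
  }

  public static void main(String[] args) {
    System.out.println(run(false)); // expected order: DISTRIBUTING before COMPLETE
    System.out.println(run(true));  // flaky order: COMPLETE before DISTRIBUTING
  }
}
```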


Zelldon commented Oct 12, 2022

Sounds plausible, but I think this was already the case before the abstraction, since the LogStreamWriters also just "buffered" the records during processing. I guess there was just some timing change which might cause this issue more often.

@remcowesterhoud

🤔 There were also changes to how the deployment distribution works somewhat recently. It might be that this caused a change in the timing. I can't fully remember the details of that change.

I think the order shouldn't matter either way, so I will modify the test to be a bit more lenient about the command ordering; that should fix the flakiness.


Zelldon commented Oct 12, 2022

@remcowesterhoud maybe it makes sense to use the PostCommitTasks for the Deployment distribution? So the request would only be executed IF the processing is done and the transaction is committed. Might make more sense I guess?
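The post-commit-task idea could look roughly like this (hypothetical names, only a sketch of the pattern): the distribution request is registered as a deferred task and only runs after the buffered events are committed to the log, so the remote response can no longer overtake the DISTRIBUTING event.

```java
import java.util.ArrayList;
import java.util.List;

public class PostCommitSketch {
  // Returns the resulting log-stream order when the distribution request is
  // deferred to a post-commit task instead of being sent during processing.
  static List<String> run() {
    List<String> logStream = new ArrayList<>();
    List<Runnable> postCommitTasks = new ArrayList<>();

    logStream.add("Deployment.CREATE");
    logStream.add("DeploymentDistribution.DISTRIBUTING"); // committed first
    // The remote request (and therefore the COMPLETE it triggers) is deferred.
    postCommitTasks.add(() -> logStream.add("DeploymentDistribution.COMPLETE"));

    // Transaction committed: now run the deferred side effects.
    postCommitTasks.forEach(Runnable::run);
    return logStream;
  }

  public static void main(String[] args) {
    System.out.println(run()); // DISTRIBUTING is always ahead of COMPLETE
  }
}
```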

@korthout korthout self-assigned this Oct 13, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Oct 13, 2022
10711: [Backport stable/8.1] Distribute deployment in post commit tasks r=remcowesterhoud a=backport-action

# Description
Backport of #10689 to `stable/8.1`.

relates to #9964

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@korthout korthout added the version:8.1.2 Marks an issue as being completely or in parts released in 8.1.2 label Oct 18, 2022
@korthout korthout added the version:8.2.0-alpha1 Marks an issue as being completely or in parts released in 8.2.0-alpha1 label Nov 1, 2022
@npepinpe npepinpe added the version:8.2.0 Marks an issue as being completely or in parts released in 8.2.0 label Apr 5, 2023