Not possible to cancel process instance with many active element instances #11355

Closed
Zelldon opened this issue Jan 3, 2023 · 10 comments · Fixed by #12604
Labels
component/engine kind/bug Categorizes an issue or PR as a bug severity/high Marks a bug as having a noticeable impact on the user with no known workaround support Marks an issue as related to a customer support request version:8.1.12 Marks an issue as being completely or in parts released in 8.1.12 version:8.2.4 Marks an issue as being completely or in parts released in 8.2.4 version:8.3.0-alpha1 Marks an issue as being completely or in parts released in 8.3.0-alpha1 version:8.3.0-alpha2 Marks an issue as being completely or in parts released in 8.3.0-alpha2 version:8.3.0 Marks an issue as being completely or in parts released in 8.3.0

Comments

@Zelldon
Member

Zelldon commented Jan 3, 2023

Describe the bug

We got reports of crash-looping Zeebe brokers on prod. It looks like the running process does some nesting or looping over certain activities. TODO: I will add the process model later.

The user tried to cancel the corresponding process instance but this failed because there were too many activities to terminate.

Expected to write one or more follow-up records for record 'LoggedEvent [type=0, version=0, streamId=2, position=299792, key=4503599627371681, timestamp=1672654759877, sourceEventPosition=297539] RecordMetadata{recordType=COMMAND, intentValue=255, intent=TERMINATE_ELEMENT, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=PROCESS_INSTANCE, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}' without errors, but exception was thrown.

Error group: https://console.cloud.google.com/errors/detail/COWzpqvwz4Cg0wE;service=zeebe;time=P7D?project=camunda-cloud-240911

Note: Even though we replaced the dispatcher, this error will still happen because the max message size limit still applies.

I set the severity to high since I see no workaround. BTW, due to the loop, which causes the pods to crash loop, the cluster was unusable in this case.

To Reproduce
Have a process instance with a lot of activities active, and terminate the corresponding process instance.

Expected behavior
Termination of instances takes the batch size into account and terminates activities batch-wise. This is a similar issue to the activation of multi-instances.
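A minimal sketch of what batch-wise termination could look like, purely for illustration. The class and method names (`BatchedTermination`, `planBatches`) are hypothetical and not part of the Zeebe code base. The 335-byte command size and the 4194304-byte limit are borrowed from the log messages in this issue as example inputs, not as the engine's actual configuration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split child terminations into size-bounded batches. */
public final class BatchedTermination {

  /**
   * Groups child element instance keys so that the estimated size of each
   * group (count * bytesPerCommand) never exceeds maxBatchBytes.
   */
  public static List<List<Long>> planBatches(
      final List<Long> childKeys, final int bytesPerCommand, final int maxBatchBytes) {
    if (bytesPerCommand <= 0 || bytesPerCommand > maxBatchBytes) {
      throw new IllegalArgumentException("a single command must fit into one batch");
    }
    final List<List<Long>> batches = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    int currentBytes = 0;
    for (final long key : childKeys) {
      // Start a new batch before the next command would exceed the limit.
      if (currentBytes + bytesPerCommand > maxBatchBytes) {
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(key);
      currentBytes += bytesPerCommand;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(final String[] args) {
    final List<Long> keys = new ArrayList<>();
    for (long k = 0; k < 20_000; k++) {
      keys.add(k);
    }
    // Example inputs taken from the logs in this issue (assumed, not authoritative).
    final List<List<Long>> batches = planBatches(keys, 335, 4 * 1024 * 1024);
    System.out.println(batches.size() + " batches, first holds " + batches.get(0).size());
  }
}
```

In a real engine each batch would presumably be written as its own follow-up command, so that no single processing step has to fit all terminations into one write.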

Log/Stacktrace

Full Stacktrace

java.lang.IllegalArgumentException: Expected to claim segment of size 4481608, but can't claim more than 4194304 bytes.
	at io.camunda.zeebe.dispatcher.Dispatcher.offer(Dispatcher.java:207) ~[zeebe-dispatcher-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.dispatcher.Dispatcher.claimFragmentBatch(Dispatcher.java:164) ~[zeebe-dispatcher-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.logstreams.impl.log.LogStreamBatchWriterImpl.claimBatchForEvents(LogStreamBatchWriterImpl.java:235) ~[zeebe-logstreams-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.logstreams.impl.log.LogStreamBatchWriterImpl.tryWrite(LogStreamBatchWriterImpl.java:212) ~[zeebe-logstreams-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.stream.impl.ProcessingStateMachine.lambda$writeRecords$9(ProcessingStateMachine.java:354) ~[zeebe-stream-platform-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.retry.ActorRetryMechanism.run(ActorRetryMechanism.java:28) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.retry.AbortableRetryStrategy.run(AbortableRetryStrategy.java:45) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:92) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:45) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:119) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:106) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:87) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:198) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]

Environment:

  • OS:
  • Zeebe Version: 8.2.0-alpha2
  • Configuration: Production G3-S

relates to https://jira.camunda.com/browse/SUPPORT-16499

@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug severity/high Marks a bug as having a noticeable impact on the user with no known workaround component/engine labels Jan 3, 2023
@Zelldon
Member Author

Zelldon commented Jan 4, 2023

Another, related error occurred on PROD:

io.camunda.zeebe.stream.api.records.ExceededBatchRecordSizeException: Can't append entry: 'RecordBatchEntry[key=2251799813801783, sourceIndex=-1, recordMetadata=RecordMetadata{recordType=COMMAND, intentValue=10, intent=TERMINATE_ELEMENT, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=PROCESS_INSTANCE, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, unifiedRecordValue={"bpmnProcessId":"Process_372fbfc7-9a4a-4f0b-aee5-bd96ed3e3e5d","version":1,"processDefinitionKey":2251799813685320,"processInstanceKey":2251799813685333,"elementId":"Activity_0vhm20h","flowScopeKey":2251799813685333,"bpmnElementType":"USER_TASK","bpmnEventType":"UNSPECIFIED","parentProcessInstanceKey":-1,"parentElementInstanceKey":-1}]' with size: 335 this would exceed the maximum batch size. [ currentBatchEntryCount: 11814, currentBatchSize: 3957709]

	at io.camunda.zeebe.stream.impl.records.RecordBatch.appendRecord(RecordBatch.java:66)
	at io.camunda.zeebe.stream.impl.BufferedProcessingResultBuilder.appendRecordReturnEither(BufferedProcessingResultBuilder.java:62)
	at io.camunda.zeebe.stream.api.ProcessingResultBuilder.appendRecord(ProcessingResultBuilder.java:38)
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.ResultBuilderBackedTypedCommandWriter.appendRecord(ResultBuilderBackedTypedCommandWriter.java:37)
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.ResultBuilderBackedTypedCommandWriter.appendFollowUpCommand(ResultBuilderBackedTypedCommandWriter.java:32)
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnStateTransitionBehavior.lambda$terminateChildInstances$3(BpmnStateTransitionBehavior.java:332)
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnStateTransitionBehavior.terminateChildInstances(BpmnStateTransitionBehavior.java:330)
	at io.camunda.zeebe.engine.processing.bpmn.container.ProcessProcessor.onTerminate(ProcessProcessor.java:85)
	at io.camunda.zeebe.engine.processing.bpmn.container.ProcessProcessor.onTerminate(ProcessProcessor.java:27)
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processEvent(BpmnStreamProcessor.java:122)
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.lambda$processRecord$0(BpmnStreamProcessor.java:95)
	at io.camunda.zeebe.util.Either$Right.ifRightOrLeft(Either.java:381)
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processRecord(BpmnStreamProcessor.java:92)
	at io.camunda.zeebe.engine.Engine.process(Engine.java:128)
	at io.camunda.zeebe.stream.impl.ProcessingStateMachine.lambda$processCommand$3(ProcessingStateMachine.java:264)
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84)
	at io.camunda.zeebe.stream.impl.ProcessingStateMachine.processCommand(ProcessingStateMachine.java:260)
	at io.camunda.zeebe.stream.impl.ProcessingStateMachine.tryToReadNextRecord(ProcessingStateMachine.java:209)
	at io.camunda.zeebe.stream.impl.ProcessingStateMachine.readNextRecord(ProcessingStateMachine.java:185)
	at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:92)
	at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:45)
	at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:119)
	at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:106)
	at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:87)
	at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:198)

Error group: https://console.cloud.google.com/errors/detail/CJujpJmq_NqemgE;service=zeebe;time=P7D?project=camunda-cloud-240911

@saig0
Member

saig0 commented Feb 17, 2023

ℹ️ Currently, the cancel command is excluded from blacklisting (see here). As a result, the process instance continues with processing.

@Zelldon
Member Author

Zelldon commented Mar 9, 2023

⚠️ This happened again this week and caused another incident.

It happened on 8.1.3: https://console.cloud.google.com/errors/detail/CKvjvtrYm_SiuwE;service=zeebe;time=P7D?project=camunda-cloud-240911

@Zelldon
Member Author

Zelldon commented Mar 9, 2023

I would like to ask @camunda/zeebe-process-automation to re-evaluate the priority of this.

Incidents shouldn't happen twice. This is an issue that people seem to run into easily, and there is no good way to resolve it.

@korthout
Member

korthout commented Mar 15, 2023

Triage summary:

  • Create an EPIC to tackle this problem correctly: support cancelling instances with many tokens (@aleksander-dytko )
  • Provide a quick and dirty solution to avoid this producing further incidents.

Let's continue working on this issue by providing this quick and dirty solution.

@aleksander-dytko

@korthout could you please check if I have summarized all the details in https://github.com/camunda/product-hub/issues/1067 ?
Thanks!

@korthout
Member

@aleksander-dytko Thanks for creating the EPIC. I think you cover all the details.

@npepinpe
Member

npepinpe commented Apr 3, 2023

This happened again, except this time the number of child element instances is so great that it causes the nodes to first slow down to a crawl due to very high GC times, and then to be killed due to OOM.

Incident link: https://camunda.slack.com/archives/C051HA4V63D
Data link (incl. heap dump, process BPMN, and the complete node state): https://drive.google.com/drive/folders/1VkseQsD8Czi33dQi_kE_vV-YnfOTuJgu?usp=share_link

In case of investigation with this data, the key of the command is 4503599643148887 and its position is 93582578. It is a ProcessInstance.TERMINATE_ELEMENT command.

Affected version is 8.1.9, though I imagine most versions are affected.

From the heap dump:

image

The thread io.camunda.zeebe.scheduler.ActorThread @ 0xab7760e8 Broker-2-zb-actors-2 keeps local variables with total size 1.90 GB (98.54%) bytes.
The memory is accumulated in one instance of java.lang.Object[], loaded by , which occupies 1.90 GB (98.52%) bytes.
The stacktrace of this Thread is available. See stacktrace. See stacktrace with involved local variables.

Keywords

  • java.lang.Object[]
  • io.camunda.zeebe.engine.state.instance.DbElementInstanceState.lambda$getChildren$2(Ljava/util/List;Lio/camunda/zeebe/db/impl/DbCompositeKey;Lio/camunda/zeebe/db/impl/DbNil;)V
    DbElementInstanceState.java:258
  • io.camunda.zeebe.engine.state.instance.DbElementInstanceState.getChildren(J)Ljava/util/List;
  • DbElementInstanceState.java:254
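The heap dump points at a single list that collects every child before anything is processed. As a rough sketch of the alternative, the children could be visited in bounded pages using a cursor. Everything here is hypothetical: `PagedChildren` and `nextPage` are illustrative names, and the sorted in-memory set only stands in for the state store; this is not how `DbElementInstanceState` is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical sketch: read child keys page by page instead of all at once. */
public final class PagedChildren {

  /** Returns up to pageSize child keys strictly greater than afterKey, in ascending order. */
  public static List<Long> nextPage(
      final NavigableSet<Long> childKeys, final long afterKey, final int pageSize) {
    final List<Long> page = new ArrayList<>(pageSize);
    for (final long key : childKeys.tailSet(afterKey, false)) {
      if (page.size() == pageSize) {
        break;
      }
      page.add(key);
    }
    return page;
  }

  public static void main(final String[] args) {
    // Stand-in for the persisted children of one element instance (assumed data).
    final NavigableSet<Long> children = new TreeSet<>();
    for (long k = 1; k <= 25; k++) {
      children.add(k);
    }

    long cursor = 0; // keys in this example start at 1, so 0 means "from the beginning"
    int pages = 0;
    List<Long> page;
    while (!(page = nextPage(children, cursor, 10)).isEmpty()) {
      pages++;
      cursor = page.get(page.size() - 1); // resume after the last key that was handled
    }
    System.out.println("visited " + pages + " pages");
  }
}
```

With this shape, memory use is bounded by the page size rather than by the total number of children, at the cost of one lookup per page.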

Memory metrics:

image

In our case, the cluster was also unusable, and likely the only way to recover it is to give it ludicrous amounts of memory.

@npepinpe npepinpe added the support Marks an issue as related to a customer support request label Apr 4, 2023
@npepinpe
Member

npepinpe commented Apr 4, 2023

Relevant support issue: https://jira.camunda.com/browse/SUPPORT-16499

And clusters which run into this are likely to be affected by #12239 as well (relevant support issue: https://jira.camunda.com/browse/SUPPORT-16394).

Please update the support team once these issues are fixed with a patch ETA 🙏

@remcowesterhoud remcowesterhoud self-assigned this Apr 17, 2023
@remcowesterhoud remcowesterhoud changed the title Not possible to cancel deep nested process instance Not possible to cancel process instance with many of active element instances Apr 19, 2023
@remcowesterhoud remcowesterhoud changed the title Not possible to cancel process instance with many of active element instances Not possible to cancel process instance with many active element instances Apr 19, 2023
@remcowesterhoud
Contributor

remcowesterhoud commented Apr 19, 2023

I've renamed this issue because the descriptions are not related to deep nesting. They are related to a process instance which contains many active element instances.

For the deep-nesting we have another issue:

I've created an epic to do a proper task breakdown #12485

zeebe-bors-camunda bot added a commit that referenced this issue May 1, 2023
12604: Terminate children using the new `ProcessInstanceBatch` command r=berkaycanbc a=remcowesterhoud

## Description

<!-- Please explain the changes you made here. -->

This PR switches the termination of child instances to use the new `ProcessInstanceBatch` command.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #12538 
closes #11355 



Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
zeebe-bors-camunda bot added a commit that referenced this issue May 1, 2023
12616: [Backport release-8.3.0-alpha1] Terminate children using the new `ProcessInstanceBatch` command r=remcowesterhoud a=backport-action

# Description
Backport of #12604 to `release-8.3.0-alpha1`.

relates to #12538 #11355

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
zeebe-bors-camunda bot added a commit that referenced this issue May 1, 2023
12614: [Backport stable/8.1] Terminate children using the new `ProcessInstanceBatch` command r=remcowesterhoud a=backport-action

# Description
Backport of #12604 to `stable/8.1`.

relates to #12538 #11355

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
zeebe-bors-camunda bot added a commit that referenced this issue May 2, 2023
12614: [Backport stable/8.1] Terminate children using the new `ProcessInstanceBatch` command r=remcowesterhoud a=backport-action

# Description
Backport of #12604 to `stable/8.1`.

relates to #12538 #11355

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
zeebe-bors-camunda bot added a commit that referenced this issue May 2, 2023
12615: [Backport stable/8.2] Terminate children using the new `ProcessInstanceBatch` command r=remcowesterhoud a=backport-action

# Description
Backport of #12604 to `stable/8.2`.

relates to #12538 #11355

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@remcowesterhoud remcowesterhoud added version:8.3.0-alpha1 Marks an issue as being completely or in parts released in 8.3.0-alpha1 version:8.1.12 Marks an issue as being completely or in parts released in 8.1.12 version:8.2.4 Marks an issue as being completely or in parts released in 8.2.4 and removed version:8.3.0-alpha1 Marks an issue as being completely or in parts released in 8.3.0-alpha1 labels May 3, 2023
@oleschoenburg oleschoenburg added the version:8.3.0-alpha2 Marks an issue as being completely or in parts released in 8.3.0-alpha2 label Jun 7, 2023
@megglos megglos removed the support Marks an issue as related to a customer support request label Jun 12, 2023
@github-actions github-actions bot added the support Marks an issue as related to a customer support request label Jun 12, 2023
@megglos megglos added the version:8.3.0 Marks an issue as being completely or in parts released in 8.3.0 label Oct 5, 2023