Not possible to cancel process instance with many active element instances #11355

Zelldon · 2023-01-03T07:33:30Z

Describe the bug

We received reports of crash-looping Zeebe brokers in production. It looks like the running process nests or loops over certain activities. TODO: I will add the process model later.

The user tried to cancel the corresponding process instance but this failed because there were too many activities to terminate.

Expected to write one or more follow-up records for record 'LoggedEvent [type=0, version=0, streamId=2, position=299792, key=4503599627371681, timestamp=1672654759877, sourceEventPosition=297539] RecordMetadata{recordType=COMMAND, intentValue=255, intent=TERMINATE_ELEMENT, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=PROCESS_INSTANCE, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}' without errors, but exception was thrown.

Error group: https://console.cloud.google.com/errors/detail/COWzpqvwz4Cg0wE;service=zeebe;time=P7D?project=camunda-cloud-240911

Note: Even though we replaced the dispatcher, this error will still happen because the max message size limit still applies.

I set the severity to high since I see no workaround. In this case the loop caused the pod to crash-loop, which made the cluster unusable.

To Reproduce
Have a process instance with many active activities, then terminate that process instance.

Expected behavior
Termination of instances should take the batch size into account and terminate activities batch-wise. This is similar to the issue with activating multi-instances.
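The expected behavior could be sketched roughly as follows. This is only an illustration and not Zeebe code: `TerminationBatcher` and `partition` are hypothetical names, the 300-byte command size is an assumed value, and the 4194304-byte budget is the limit quoted in the stack trace of this report. A real implementation would also have to account for per-record framing overhead, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split follow-up commands into batches that respect a byte budget. */
final class TerminationBatcher {

  /** Greedily groups command sizes (in bytes) so that no batch exceeds maxBatchBytes. */
  static List<List<Integer>> partition(List<Integer> commandSizes, int maxBatchBytes) {
    List<List<Integer>> batches = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    int currentBytes = 0;
    for (int size : commandSizes) {
      if (size > maxBatchBytes) {
        // A single oversized command can never fit, so fail early instead of looping.
        throw new IllegalArgumentException("command of " + size + " bytes exceeds the budget");
      }
      if (currentBytes + size > maxBatchBytes) {
        // The current batch is full: close it and start a new one.
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    // Assumed workload: 30,000 terminate commands of 300 bytes each.
    List<Integer> sizes = new ArrayList<>();
    for (int i = 0; i < 30_000; i++) {
      sizes.add(300);
    }
    List<List<Integer>> batches = partition(sizes, 4_194_304);
    System.out.println(batches.size() + " batches, first holds " + batches.get(0).size());
  }
}
```

With a budget like this, no single write would ask for more bytes than the limit allows, at the cost of spreading the termination over several processing steps.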

Log/Stacktrace

Full Stacktrace

java.lang.IllegalArgumentException: Expected to claim segment of size 4481608, but can't claim more than 4194304 bytes.
	at io.camunda.zeebe.dispatcher.Dispatcher.offer(Dispatcher.java:207) ~[zeebe-dispatcher-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.dispatcher.Dispatcher.claimFragmentBatch(Dispatcher.java:164) ~[zeebe-dispatcher-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.logstreams.impl.log.LogStreamBatchWriterImpl.claimBatchForEvents(LogStreamBatchWriterImpl.java:235) ~[zeebe-logstreams-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.logstreams.impl.log.LogStreamBatchWriterImpl.tryWrite(LogStreamBatchWriterImpl.java:212) ~[zeebe-logstreams-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.stream.impl.ProcessingStateMachine.lambda$writeRecords$9(ProcessingStateMachine.java:354) ~[zeebe-stream-platform-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.retry.ActorRetryMechanism.run(ActorRetryMechanism.java:28) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.retry.AbortableRetryStrategy.run(AbortableRetryStrategy.java:45) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:92) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:45) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:119) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:106) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:87) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]
	at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:198) ~[zeebe-scheduler-8.2.0-alpha2.jar:8.2.0-alpha2]

Environment:

OS:
Zeebe Version: 8.2.0-alpha2
Configuration: Production G3-S

relates to https://jira.camunda.com/browse/SUPPORT-16499


Zelldon · 2023-01-04T08:08:27Z

Another, related error occurred on PROD:

io.camunda.zeebe.stream.api.records.ExceededBatchRecordSizeException: Can't append entry: 'RecordBatchEntry[key=2251799813801783, sourceIndex=-1, recordMetadata=RecordMetadata{recordType=COMMAND, intentValue=10, intent=TERMINATE_ELEMENT, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=PROCESS_INSTANCE, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=8.2.0}, unifiedRecordValue={"bpmnProcessId":"Process_372fbfc7-9a4a-4f0b-aee5-bd96ed3e3e5d","version":1,"processDefinitionKey":2251799813685320,"processInstanceKey":2251799813685333,"elementId":"Activity_0vhm20h","flowScopeKey":2251799813685333,"bpmnElementType":"USER_TASK","bpmnEventType":"UNSPECIFIED","parentProcessInstanceKey":-1,"parentElementInstanceKey":-1}]' with size: 335 this would exceed the maximum batch size. [ currentBatchEntryCount: 11814, currentBatchSize: 3957709]

	at io.camunda.zeebe.stream.impl.records.RecordBatch.appendRecord(RecordBatch.java:66)
	at io.camunda.zeebe.stream.impl.BufferedProcessingResultBuilder.appendRecordReturnEither(BufferedProcessingResultBuilder.java:62)
	at io.camunda.zeebe.stream.api.ProcessingResultBuilder.appendRecord(ProcessingResultBuilder.java:38)
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.ResultBuilderBackedTypedCommandWriter.appendRecord(ResultBuilderBackedTypedCommandWriter.java:37)
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.ResultBuilderBackedTypedCommandWriter.appendFollowUpCommand(ResultBuilderBackedTypedCommandWriter.java:32)
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnStateTransitionBehavior.lambda$terminateChildInstances$3(BpmnStateTransitionBehavior.java:332)
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnStateTransitionBehavior.terminateChildInstances(BpmnStateTransitionBehavior.java:330)
	at io.camunda.zeebe.engine.processing.bpmn.container.ProcessProcessor.onTerminate(ProcessProcessor.java:85)
	at io.camunda.zeebe.engine.processing.bpmn.container.ProcessProcessor.onTerminate(ProcessProcessor.java:27)
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processEvent(BpmnStreamProcessor.java:122)
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.lambda$processRecord$0(BpmnStreamProcessor.java:95)
	at io.camunda.zeebe.util.Either$Right.ifRightOrLeft(Either.java:381)
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processRecord(BpmnStreamProcessor.java:92)
	at io.camunda.zeebe.engine.Engine.process(Engine.java:128)
	at io.camunda.zeebe.stream.impl.ProcessingStateMachine.lambda$processCommand$3(ProcessingStateMachine.java:264)
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84)
	at io.camunda.zeebe.stream.impl.ProcessingStateMachine.processCommand(ProcessingStateMachine.java:260)
	at io.camunda.zeebe.stream.impl.ProcessingStateMachine.tryToReadNextRecord(ProcessingStateMachine.java:209)
	at io.camunda.zeebe.stream.impl.ProcessingStateMachine.readNextRecord(ProcessingStateMachine.java:185)
	at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:92)
	at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:45)
	at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:119)
	at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:106)
	at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:87)
	at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:198)

Error group: https://console.cloud.google.com/errors/detail/CJujpJmq_NqemgE;service=zeebe;time=P7D?project=camunda-cloud-240911

saig0 · 2023-02-17T12:32:47Z

ℹ️ Currently, the cancel command is excluded from blacklisting (see here). As a result, the process instance continues with processing.

Zelldon · 2023-03-09T07:53:23Z

⚠️ Happened again this week, and caused another incident

Happened on 8.1.3 https://console.cloud.google.com/errors/detail/CKvjvtrYm_SiuwE;service=zeebe;time=P7D?project=camunda-cloud-240911

Zelldon · 2023-03-09T07:58:52Z

I would like to ask @camunda/zeebe-process-automation to re-evaluate the priority of this issue.

Incidents shouldn't happen twice. This seems to be an issue that people run into easily, and there is no good way to resolve it.

korthout · 2023-03-15T09:12:20Z

Triage summary:

Create an EPIC to tackle this problem correctly: support cancelling instances with many tokens (@aleksander-dytko )
Provide a quick and dirty solution to avoid this producing further incidents.

Let's continue working on this issue by providing this quick and dirty solution.

aleksander-dytko · 2023-03-28T09:55:02Z

@korthout could you please check if I have summarized all the details in https://github.com/camunda/product-hub/issues/1067 ?
Thanks!

korthout · 2023-03-30T08:33:08Z

@aleksander-dytko Thanks for creating the EPIC. I think you covered all the details.

npepinpe · 2023-04-03T13:56:18Z

This happened again, except this time the number of child element instances is so large that the nodes first slow to a crawl due to very high GC times and are then killed due to OOM.

Incident link: https://camunda.slack.com/archives/C051HA4V63D
Data link (incl. heap dump, process BPMN, and the complete node state): https://drive.google.com/drive/folders/1VkseQsD8Czi33dQi_kE_vV-YnfOTuJgu?usp=share_link

In case of investigation with this data, the key of the command is 4503599643148887 and its position is 93582578. It is a ProcessInstance.TERMINATE_ELEMENT command.

Affected version is 8.1.9, though I imagine most versions are affected.

From the heap dump:

The thread io.camunda.zeebe.scheduler.ActorThread @ 0xab7760e8 Broker-2-zb-actors-2 keeps local variables with total size 1.90 GB (98.54%) bytes.
The memory is accumulated in one instance of java.lang.Object[], loaded by , which occupies 1.90 GB (98.52%) bytes.
The stacktrace of this Thread is available. See stacktrace. See stacktrace with involved local variables.

Keywords

java.lang.Object[]

io.camunda.zeebe.engine.state.instance.DbElementInstanceState.lambda$getChildren$2(Ljava/util/List;Lio/camunda/zeebe/db/impl/DbCompositeKey;Lio/camunda/zeebe/db/impl/DbNil;)V
DbElementInstanceState.java:258

io.camunda.zeebe.engine.state.instance.DbElementInstanceState.getChildren(J)Ljava/util/List;

DbElementInstanceState.java:254

Memory metrics:

In our case, the cluster was also unusable, and likely the only way to recover it is to give it ludicrous amounts of memory.
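The heap dump above points at `getChildren`, which collects the children into a `java.util.List`. A minimal sketch of the difference between collecting every child and visiting children one at a time is shown below. It is an assumption-laden illustration, not the Zeebe state API: `ChildIndex`, `addChild`, and `forEachChild` are hypothetical names, and the in-memory map only stands in for an ordered on-disk store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.function.LongPredicate;

/** Hypothetical in-memory stand-in for an ordered parent-to-children index. */
final class ChildIndex {
  // parent key -> sorted child keys (a real state store would keep this on disk)
  private final Map<Long, NavigableSet<Long>> childrenByParent = new HashMap<>();

  void addChild(long parentKey, long childKey) {
    childrenByParent.computeIfAbsent(parentKey, k -> new TreeSet<>()).add(childKey);
  }

  /** Copies every child into a new list: the result grows with the number of children. */
  List<Long> getChildren(long parentKey) {
    return new ArrayList<>(childrenByParent.getOrDefault(parentKey, new TreeSet<>()));
  }

  /**
   * Visits children in key order, starting at startAtKey (inclusive), until the visitor
   * returns false. No result list is built, so the caller decides how much work to do.
   */
  void forEachChild(long parentKey, long startAtKey, LongPredicate visitor) {
    NavigableSet<Long> children = childrenByParent.getOrDefault(parentKey, new TreeSet<>());
    for (long childKey : children.tailSet(startAtKey, true)) {
      if (!visitor.test(childKey)) {
        return;
      }
    }
  }

  public static void main(String[] args) {
    ChildIndex index = new ChildIndex();
    for (long key = 1; key <= 1_000; key++) {
      index.addChild(7L, key);
    }
    long[] visited = {0};
    // Stop after 100 children instead of loading all 1,000.
    index.forEachChild(7L, 1L, key -> ++visited[0] < 100);
    System.out.println("visited " + visited[0] + " children");
  }
}
```

Because the visitor can stop early and resume later from a known key, the memory needed per processing step would no longer depend on the total number of children.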

npepinpe · 2023-04-04T11:21:33Z

Relevant support issue: https://jira.camunda.com/browse/SUPPORT-16499

And clusters which run into this are likely to be affected by #12239 as well (relevant support issue: https://jira.camunda.com/browse/SUPPORT-16394).

Please update the support team once these issues are fixed with a patch ETA 🙏

remcowesterhoud · 2023-04-19T12:35:04Z

I've renamed this issue because the descriptions are not related to deep nesting. They relate to a process instance that contains many active element instances.

For the deep-nesting we have another issue:

StackOverflowError when terminating process with a lot of child processes #8955

I've created an epic to do a proper task breakdown: #12485

12604: Terminate children using the new `ProcessInstanceBatch` command r=berkaycanbc a=remcowesterhoud ## Description  This PR switches the termination of child instances to use the new `ProcessInstanceBatch` command. ## Related issues  closes #12538 closes #11355 Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>

12616: [Backport release-8.3.0-alpha1] Terminate children using the new `ProcessInstanceBatch` command r=remcowesterhoud a=backport-action # Description Backport of #12604 to `release-8.3.0-alpha1`. relates to #12538 #11355 Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>

12614: [Backport stable/8.1] Terminate children using the new `ProcessInstanceBatch` command r=remcowesterhoud a=backport-action # Description Backport of #12604 to `stable/8.1`. relates to #12538 #11355 Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>

12615: [Backport stable/8.2] Terminate children using the new `ProcessInstanceBatch` command r=remcowesterhoud a=backport-action # Description Backport of #12604 to `stable/8.2`. relates to #12538 #11355 Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
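The PRs above switch child termination to a `ProcessInstanceBatch` command, but this thread does not show how that command works. The following is only a sketch of one way a self-continuing batch command could behave, assuming each command terminates a bounded number of children and then appends a follow-up command that names the next child key. `BatchTerminationSketch`, `BatchCommand`, `processOne`, and `terminateAll` are hypothetical names, and the `Deque` merely simulates the log.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class BatchTerminationSketch {

  /** Hypothetical batch command: terminate children, resuming at startKey (inclusive). */
  static final class BatchCommand {
    final long startKey;

    BatchCommand(long startKey) {
      this.startKey = startKey;
    }
  }

  /**
   * Processes one batch command: terminates at most maxPerBatch children and, if more
   * remain, appends a follow-up command pointing at the next child key.
   */
  static List<Long> processOne(
      BatchCommand command, NavigableSet<Long> activeChildren, int maxPerBatch, Deque<BatchCommand> log) {
    List<Long> terminated = new ArrayList<>();
    for (long childKey : activeChildren.tailSet(command.startKey, true)) {
      if (terminated.size() == maxPerBatch) {
        log.addLast(new BatchCommand(childKey)); // continue here in the next step
        break;
      }
      terminated.add(childKey);
    }
    activeChildren.removeAll(terminated);
    return terminated;
  }

  /** Drains the simulated log and returns the number of processing steps it took. */
  static int terminateAll(NavigableSet<Long> activeChildren, int maxPerBatch) {
    if (maxPerBatch <= 0) {
      throw new IllegalArgumentException("maxPerBatch must be positive");
    }
    Deque<BatchCommand> log = new ArrayDeque<>();
    log.addLast(new BatchCommand(Long.MIN_VALUE));
    int steps = 0;
    while (!log.isEmpty()) {
      processOne(log.pollFirst(), activeChildren, maxPerBatch, log);
      steps++;
    }
    return steps;
  }

  public static void main(String[] args) {
    NavigableSet<Long> children = new TreeSet<>();
    for (long key = 1; key <= 25; key++) {
      children.add(key);
    }
    int steps = terminateAll(children, 10);
    System.out.println(steps + " steps, " + children.size() + " children left");
  }
}
```

In this sketch every step writes a bounded number of follow-up records, so a single step stays bounded no matter how many children the instance has.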

Zelldon added kind/bug Categorizes an issue or PR as a bug severity/high Marks a bug as having a noticeable impact on the user with no known workaround component/engine labels Jan 3, 2023

npepinpe added the support Marks an issue as related to a customer support request label Apr 4, 2023

epollum mentioned this issue Apr 6, 2023

I can spawn inner instances for a large input collection #2890

Closed

remcowesterhoud self-assigned this Apr 17, 2023

remcowesterhoud changed the title ~~Not possible to cancel deep nested process instance~~ Not possible to cancel process instance with many of active element instances Apr 19, 2023

remcowesterhoud changed the title ~~Not possible to cancel process instance with many of active element instances~~ Not possible to cancel process instance with many active element instances Apr 19, 2023

remcowesterhoud mentioned this issue Apr 19, 2023

[EPIC] Terminate Process Instance in batches #12485

Closed

5 tasks

remcowesterhoud mentioned this issue May 1, 2023

Terminate children using the new ProcessInstanceBatch command #12604

Merged

14 tasks

zeebe-bors-camunda bot closed this as completed in e3b025a May 1, 2023

Zelldon mentioned this issue May 16, 2023

Allow to cancel bannend instances #12772

Open

oleschoenburg added the version:8.3.0-alpha2 Marks an issue as being completely or in parts released in 8.3.0-alpha2 label Jun 7, 2023

megglos removed the support Marks an issue as related to a customer support request label Jun 12, 2023

github-actions bot added the support Marks an issue as related to a customer support request label Jun 12, 2023

megglos added the version:8.3.0 Marks an issue as being completely or in parts released in 8.3.0 label Oct 5, 2023

