
When assignee or candidate groups expression evaluates to an empty string, the logged event cannot be processed nor exported #8611

Closed
pihme opened this issue Jan 19, 2022 · 21 comments
Assignees
Labels
kind/bug - Categorizes an issue or PR as a bug
severity/critical - Marks a stop-the-world bug, with a high impact and no existing workaround
version:1.3.2 - Marks an issue as being completely or in parts released in 1.3.2

Comments

@pihme
Contributor

pihme commented Jan 19, 2022

Describe the bug

Whenever the assignee (or candidate groups) expression evaluates to an empty string, the logged event contains a wrong count of how many elements the customHeaders map contains.

Basically, this returns an empty string "":

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L151

The assignee is not null, so the empty string is put into the headers map under the key assignee:

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L153-L155

When encoding the headers, it checks for valid headers, i.e., whether the key and the value are not null and not empty:

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L259-L261

Since the assignee has an empty string as a value, the header is not valid and is skipped when the headers are written.
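For illustration, here is a minimal sketch (not the actual BpmnJobBehavior code) of how an empty, non-null assignee leads to a mismatch between the counted headers and the headers that actually get written:

```java
// Minimal sketch, not the actual BpmnJobBehavior code: illustrates how an empty
// assignee expression result produces a size/content mismatch in the encoded headers.
import java.util.HashMap;
import java.util.Map;

public class HeaderMismatchSketch {

  public static void main(String[] args) {
    final String assignee = ""; // expression evaluated to an empty string, not null

    final Map<String, String> taskHeaders = new HashMap<>();
    if (assignee != null) {
      // non-null, so the empty string ends up in the map
      taskHeaders.put("assignee", assignee);
    }

    // the declared map size counts the entry ...
    final int declaredSize = taskHeaders.size(); // 1

    // ... but encoding skips "invalid" headers (empty key or value), so nothing is written
    final long writtenEntries =
        taskHeaders.entrySet().stream()
            .filter(e -> isValidHeader(e.getKey(), e.getValue()))
            .count(); // 0

    System.out.println("declared=" + declaredSize + ", written=" + writtenEntries);
  }

  private static boolean isValidHeader(final String key, final String value) {
    return key != null && !key.isEmpty() && value != null && !value.isEmpty();
  }
}
```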

When reading this record, the reader assumes that the customHeaders property has one element, so the subsequent variables property (its key and value) is consumed as the customHeaders property's value. That way, the variables property is "skipped". As a consequence, the reader assumes it has read only 15 out of 16 properties and tries to read the last missing property, but since it has already reached the end of the byte array, it fails with the following exception:

java.lang.RuntimeException: Could not deserialize object [JobRecord]. Deserialization stuck at offset 306 of length 306

After reading the customHeaders property key, it reads the next byte in #skipValues()

https://github.com/camunda-cloud/zeebe/blob/9e189a1d087ba08191cb888bca2ccc32c325cf8e/msgpack-core/src/main/java/io/camunda/zeebe/msgpack/spec/MsgPackReader.java#L355-L356

The byte suggests that the value is a fixmap with a map length of 1:

https://github.com/camunda-cloud/zeebe/blob/9e189a1d087ba08191cb888bca2ccc32c325cf8e/msgpack-core/src/main/java/io/camunda/zeebe/msgpack/spec/MsgPackReader.java#L366-L371

This increases the count, so the reader continues in the while loop and reads the next byte, and so on. That way, the entire variables property is consumed as part of the custom headers.
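For reference, a MessagePack fixmap header byte encodes the entry count in its low nibble; the following is illustrative bit math only, not the MsgPackReader implementation:

```java
// Illustrative bit math only (not the MsgPackReader implementation): in MessagePack,
// bytes 0x80..0x8f encode a "fixmap" whose low nibble is the number of key/value pairs.
public class FixMapByteSketch {
  public static void main(String[] args) {
    final byte headerByte = (byte) 0x81; // -127 as a signed Java byte
    final boolean isFixMap = (headerByte & 0xF0) == 0x80;
    final int mapSize = headerByte & 0x0F;
    // prints: fixmap=true, size=1 -> the reader expects one key/value pair to follow,
    // so it consumes the next property (variables) as the missing map entry
    System.out.println("fixmap=" + isFixMap + ", size=" + mapSize);
  }
}
```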

To reproduce

  1. Deploy a BPMN process with a user task
  2. Start a process instance
  3. Let the assignee (or candidate groups) expression evaluate to an empty string

Log/Stacktrace

Full Stacktrace

java.lang.RuntimeException: Could not deserialize object [JobRecord]. Deserialization stuck at offset 306 of length 306
	at io.camunda.zeebe.msgpack.UnpackedObject.wrap(UnpackedObject.java:38) ~[zeebe-msgpack-value-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.logstreams.impl.log.LoggedEventImpl.readValue(LoggedEventImpl.java:135) ~[zeebe-logstreams-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.RecordValues.readRecordValue(RecordValues.java:35) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.broker.exporter.stream.ExporterDirector$RecordExporter.wrap(ExporterDirector.java:489) ~[zeebe-broker-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.broker.exporter.stream.ExporterDirector.lambda$exportEvent$9(ExporterDirector.java:391) ~[zeebe-broker-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.retry.ActorRetryMechanism.run(ActorRetryMechanism.java:36) ~[zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.retry.EndlessRetryStrategy.run(EndlessRetryStrategy.java:50) ~[zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:95) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:192) [zeebe-util-1.3.0.jar:1.3.0]
Caused by: java.lang.IndexOutOfBoundsException: index=306 capacity=306
	at org.agrona.concurrent.UnsafeBuffer.boundsCheck(UnsafeBuffer.java:2212) ~[agrona-1.14.0.jar:1.14.0]
	at org.agrona.concurrent.UnsafeBuffer.getByte(UnsafeBuffer.java:947) ~[agrona-1.14.0.jar:1.14.0]
	at io.camunda.zeebe.msgpack.spec.MsgPackReader.readStringLength(MsgPackReader.java:115) ~[zeebe-msgpack-core-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.msgpack.value.StringValue.read(StringValue.java:96) ~[zeebe-msgpack-value-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.msgpack.value.ObjectValue.read(ObjectValue.java:91) ~[zeebe-msgpack-value-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.msgpack.UnpackedObject.wrap(UnpackedObject.java:32) ~[zeebe-msgpack-value-1.3.0.jar:1.3.0]
	... 12 more

Relevant links

Possible solutions

  • Ensure that the correct size of valid headers is written
  • For already existing records in the logstream, the reader should handle such "corrupted" jobs, so that the exporter can export that job and the stream processor can replay that event (in case of a failover or restart).

Environment:

  • Zeebe Version: 1.3.0
@pihme pihme added the kind/bug Categorizes an issue or PR as a bug label Jan 19, 2022
@pihme
Contributor Author

pihme commented Jan 19, 2022

@npepinpe Can you maybe assign someone to look at it? This seems to be a recurring issue in that cluster. I won't have time to root-cause this today because of the Scala training.

@npepinpe
Member

Will do 👍

@romansmirnov
Member

romansmirnov commented Jan 20, 2022

@npepinpe, a quick update on the impact:

The exporter fails to export a job record because reading the record fails with the exception above. Meaning, the job record won't be exported, and the exporter retries exporting the record forever (and therefore most likely occupies the actor thread). This results in the following:

  • No other records are exported (for that partition),
  • Potentially, the broker might fail with an OutOfMemoryError because a lot of actor jobs will pile up in the exporter actor (that won't be executed),
  • The journal cannot be compacted anymore, which may result in higher disk space usage, and
  • Operate is not up-to-date.

Next step: root cause the issue

@npepinpe
Member

Thanks Roman! I'm interested to know the root cause, but if it proves difficult to diagnose, also the likelihood of this occurring, and whether there's any way to recover from the fault.

@romansmirnov
Member

Basically, whenever the property customHeaders of a job is empty, the variables property is skipped when reading the logged job event. I will update the issue description to describe the actual bug accordingly.

@npepinpe, I assume that the property customHeaders can be empty when there isn't any assignee nor any candidate group given in the BPMN process, for example when using user tasks. (Not sure about service tasks, though, but I guess there are also cases where the custom headers can be empty - at least the code suggests so.) Regarding the likelihood, not sure - is it safe to assume that jobs with empty customHeaders are exceptional cases (in how user tasks and service tasks are used)?

However, I personally think it is a critical bug, because if a broker runs into this issue, the partition cannot be recovered. Meaning, the job cannot be exported. Also, in case of a failover (or restart) the logged job event cannot be replayed anymore, because it fails with the same exception as the exporter does.

@romansmirnov romansmirnov changed the title RuntimeException: Could not deserialize object [JobRecord]. Deserialization stuck at offset Jobs with empty customHeaders cannot be read by the stream processor and the exporter Jan 20, 2022
@npepinpe npepinpe added severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround Impact: Availability labels Jan 21, 2022
@npepinpe npepinpe added this to Ready in Zeebe Jan 21, 2022
@npepinpe
Member

@saig0 @korthout - is this the expected behavior of the message pack parser? Is there no possibility of recovering by patching the message pack parsing code to handle this case, so no data would be lost?

If not, then yes this is likely a critical bug.

@npepinpe
Member

It seems there's some dispute over the root cause, but consensus that once you hit this state there's no way to recover, so I will treat this as a critical bug.

@npepinpe npepinpe assigned remcowesterhoud and unassigned npepinpe Jan 24, 2022
@npepinpe npepinpe moved this from Ready to In progress in Zeebe Jan 24, 2022
@romansmirnov
Member

romansmirnov commented Jan 24, 2022

@npepinpe & @remcowesterhoud, I had a second look at the issue. I was wrong with the empty customHeaders, I misinterpreted the byte expressing how many elements the customHeaders property contains. Actually, the byte suggests that there is one element in the custom header map but this expected element does not exist in the byte array.

The event's custom headers get written here:

https://github.com/camunda-cloud/zeebe/blob/9b41a8160d5432fe148d4425670fd4ed40e87424/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L230-L257

To my understanding, when creating the corresponding job, the taskHeaders map must be non-empty and contain one element, so that when writing the map headers it will write the byte -127 (indicating that there is one element in the map) as the first byte to the given byte array:

https://github.com/camunda-cloud/zeebe/blob/9b41a8160d5432fe148d4425670fd4ed40e87424/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L240-L241

But writing the actual element to the byte array does not happen (or does not succeed):

https://github.com/camunda-cloud/zeebe/blob/9b41a8160d5432fe148d4425670fd4ed40e87424/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L243-L252

Basically, it does not check whether everything got written to the byte array as expected (note: it checks for a valid header, and if the header is not valid, it skips that element in the map).
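A hedged, byte-level sketch of that write pattern (hypothetical helper code, not the actual MsgPackWriter or BpmnJobBehavior): the map header byte is derived from the map's size before the entries are validated, so a skipped entry leaves a dangling count:

```java
// Hypothetical sketch of the write pattern described above (not the real Zeebe code):
// the fixmap header byte is written from taskHeaders.size() before validation,
// so an invalid (skipped) entry leaves a header byte that promises a missing entry.
import java.io.ByteArrayOutputStream;
import java.util.Map;

public class TaskHeaderWriteSketch {

  static byte[] encode(final Map<String, String> taskHeaders) {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    // fixmap header: 0x80 | size -- for one entry this is 0x81 (-127 as a signed byte)
    out.write(0x80 | taskHeaders.size());
    taskHeaders.forEach(
        (key, value) -> {
          // the "is this header valid?" check happens only here, after the size byte
          if (key != null && !key.isEmpty() && value != null && !value.isEmpty()) {
            writeString(out, key);
            writeString(out, value);
          }
        });
    return out.toByteArray(); // for {"assignee": ""} this is just [0x81] -- header without entry
  }

  static void writeString(final ByteArrayOutputStream out, final String s) {
    // fixstr header (0xa0 | length) followed by the raw bytes; enough for this sketch
    final byte[] bytes = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    out.write(0xa0 | bytes.length);
    out.write(bytes, 0, bytes.length);
  }

  public static void main(String[] args) {
    System.out.println(encode(Map.of("assignee", "")).length); // 1 -> only the header byte
  }
}
```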

When I debug that part with an empty custom header and manually manipulate the map during its execution by adding an empty element taskHeaders.put("", ""), I can eventually reproduce the exception. But it requires restarting the broker so that the stream processor fails to replay that event.

However, I could not produce a BPMN process (containing a user task and a service task with different headers) that reproduces that exception without any manual manipulation.

@npepinpe
Member

Thanks for double checking! About reproducing, @remcowesterhoud you can ask @menski, he mentioned he might be able to reproduce this as it came from one of his clusters.

@remcowesterhoud
Contributor

I have created #8638 as a possible fix for this issue (therefore the PR does not close the issue yet). It's hard to know if this was the root cause of the problem since I've been unable to reproduce it via a BPMN process. I have asked @menski if he knows how to do this.

@npepinpe
Member

The tentative fix only fixes how we write the data, but wouldn't help us recover the broken state, would it? If not yet done, please spend some time on this afterwards to at least figure out if there's anything we can do to help someone recover their data after they ran into this.

@menski
Contributor

menski commented Jan 24, 2022

@remcowesterhoud I tried to reproduce it with the setup of the last cluster that failed, but it didn't happen again. I will keep the cluster running overnight, but it doesn't look like it is reproducible. The cluster I set up is https://console.cloud.ultrawombat.com/org/6ff582aa-a62e-4a28-aac7-4d2224d8c58a/cluster/9d464e7f-5a93-4a4f-9d22-9aa8dedec331

@Zelldon
Member

Zelldon commented Jan 25, 2022

Do we still have the cluster which was causing this? If so we could take a look at the data via zdb. @menski did you update your cluster to 1.3.0? Maybe this caused that issue, since we changed something in the custom header handling?

@menski
Contributor

menski commented Jan 25, 2022

No, we don't have the cluster anymore; it was deleted as it was constantly throwing exceptions, but the raft partition should be available here: https://drive.google.com/drive/folders/1ZDiXZ_gox4TJf7JbENeZJODOGbYZzw_K?usp=sharing

The cluster I deleted was from my release presentation. I directly created it with 1.3.0 and then deployed 4 definitions on it to demonstrate the call activity hierarchy in Operate. As far as I remember, I started one instance which, as expected, led to an incident for demonstration purposes. But maybe I also did something else which I can't remember.

Maybe somebody can check the log and tell me which process the job belongs to; maybe I did something else which I forgot, as the incident shouldn't create a job.

Parent.bpmn20.xml.txt
Child 1.bpmn20.xml.txt
Child 2.bpmn20.xml.txt
Child 3.bpmn20.xml.txt

@romansmirnov
Member

romansmirnov commented Jan 25, 2022

Could it be that when evaluating the assignee or candidate groups expression an empty string is returned?

Basically, this returns an empty string "":

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L151

The assignee is not null, so the empty string is put into the headers map under the key assignee:

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L153-L155

When encoding the headers, it checks for valid headers, i.e., whether the key and the value are not null and not empty:

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L259-L261

Since the assignee has an empty string as a value, the header is not valid and is skipped when the headers are written.

@remcowesterhoud, does it maybe help to reproduce the issue?

@korthout
Member

Nice find @romansmirnov. That definitely looks like it would cause problems. @remcowesterhoud You had the actual problematic process, right? Could this situation arise in there?

@remcowesterhoud
Contributor

I just tested this, and it does indeed reproduce the issue. Nice find @romansmirnov!

@romansmirnov
Member

Great! I am relieved 😄 Sorry for my first wrong analysis.

@romansmirnov romansmirnov changed the title Jobs with empty customHeaders cannot be read by the stream processor and the exporter When assignee or candidate groups expression evaluates to empty string, the logged event cannot be processed nor exported Jan 25, 2022
@romansmirnov romansmirnov changed the title When assignee or candidate groups expression evaluates to empty string, the logged event cannot be processed nor exported When assignee or candidate groups expression evaluates to an empty string, the logged event cannot be processed nor exported Jan 25, 2022
ghost pushed a commit that referenced this issue Jan 26, 2022
8638: Write correct header size when encoding task headers r=remcowesterhoud a=remcowesterhoud

## Description

When we write the header size we would include invalid headers. These headers would then be filtered and not written to the buffer. Therefore, the size could differ from the actual number of headers written to the buffer.

With this fix we first validate all the headers before writing the header size. This way we will never write the wrong size.
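A minimal sketch of that approach (hypothetical code, not the actual PR diff): filter out invalid headers first, then derive the size from what remains, so the declared count always matches what ends up in the buffer.

```java
// Hypothetical sketch of the fix's approach (not the actual PR diff): validate first,
// then write the size of the remaining entries, so declared and written counts match.
import java.util.Map;
import java.util.stream.Collectors;

public class TaskHeaderWriteFixSketch {

  static Map<String, String> validHeaders(final Map<String, String> taskHeaders) {
    return taskHeaders.entrySet().stream()
        .filter(e -> isValidHeader(e.getKey(), e.getValue()))
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }

  static boolean isValidHeader(final String key, final String value) {
    return key != null && !key.isEmpty() && value != null && !value.isEmpty();
  }

  public static void main(String[] args) {
    final Map<String, String> valid = validHeaders(Map.of("assignee", ""));
    // write the map header from the already-filtered map, then write exactly those entries
    System.out.println("declared and written size: " + valid.size()); // 0
  }
}
```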

## Related issues


Relates to #8611

Since I have not yet been able to reproduce the problem I have decided against closing this issue with this PR. First I'd like to discuss if this could indeed be the root cause.



Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
ghost pushed a commit that referenced this issue Jan 26, 2022
8661: [Backport stable/1.3] Write correct header size when encoding task headers r=remcowesterhoud a=github-actions[bot]

# Description
Backport of #8638 to `stable/1.3`.

relates to #8611

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@remcowesterhoud
Contributor

@npepinpe To get back to your question about how to recover from this problem: I can't think of any way to do this.

If I understand everything correctly, when we read the records from the log there is an expected number of bytes belonging to a specific record. Because of this bug, the expected number will be greater than the actual number of bytes, thus resulting in the IndexOutOfBoundsException.
In order to recover, we would have to somehow skip the difference between the expected and actual byte counts. The problem I see here is that this is all generic code, and I can't think of a way to tell whether a record is corrupted by this problem, or how many bytes should be skipped.

@npepinpe
Member

All good. Let me know when this is merged and closed so we can do a patch release then. The sooner this is out, the fewer users will run into this unrecoverable issue.

@remcowesterhoud
Contributor

It is merged and backported to 1.3, so I'll close this issue now :)

Zeebe automation moved this from In progress to Done Jan 27, 2022
@npepinpe npepinpe added version:1.3.2 Marks an issue as being completely or in parts released in 1.3.2 1.3.0 labels Jan 28, 2022
@KerstinHebel KerstinHebel removed this from Done in Zeebe Mar 23, 2022