
When assignee or candidate groups expression evaluates to an empty string, the logged event cannot be processed nor exported #8611

Closed
pihme opened this issue Jan 19, 2022 · 21 comments
Assignees
Labels
kind/bug - Categorizes an issue or PR as a bug
severity/critical - Marks a stop-the-world bug, with a high impact and no existing workaround
version:1.3.2 - Marks an issue as being completely or in parts released in 1.3.2

Comments

@pihme
Contributor

pihme commented Jan 19, 2022

Describe the bug

Whenever the assignee (or candidate groups) expression evaluates to an empty string, the logged event contains a wrong count of how many elements the customHeaders map contains.

Basically, this returns an empty string "":

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L151

The assignee is not null, so the empty string is put into the headers map under the key assignee:

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L153-L155

When encoding the headers, it checks for valid headers, i.e., whether the key and the value are not null and not empty:

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L259-L261

Since the assignee has an empty string as a value, the header is not valid and is skipped when the headers are written.
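For illustration, here is a minimal sketch (not the actual BpmnJobBehavior code) of how an empty, non-null assignee leads to a mismatch between the counted headers and the headers that actually get written:

```java
// Minimal sketch, not the actual BpmnJobBehavior code: illustrates how an empty
// assignee expression result produces a size/content mismatch in the encoded headers.
import java.util.HashMap;
import java.util.Map;

public class HeaderMismatchSketch {

  public static void main(String[] args) {
    final String assignee = ""; // expression evaluated to an empty string, not null

    final Map<String, String> taskHeaders = new HashMap<>();
    if (assignee != null) {
      // non-null, so the empty string ends up in the map
      taskHeaders.put("assignee", assignee);
    }

    // the declared map size counts the entry ...
    final int declaredSize = taskHeaders.size(); // 1

    // ... but encoding skips "invalid" headers (empty key or value), so nothing is written
    final long writtenEntries =
        taskHeaders.entrySet().stream()
            .filter(e -> isValidHeader(e.getKey(), e.getValue()))
            .count(); // 0

    System.out.println("declared=" + declaredSize + ", written=" + writtenEntries);
  }

  private static boolean isValidHeader(final String key, final String value) {
    return key != null && !key.isEmpty() && value != null && !value.isEmpty();
  }
}
```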

When reading this record, the reader assumes that the customHeaders property has one element, so the subsequent variables property (its key and value) is consumed as the customHeaders property's value. That way, the variables property is "skipped". As a consequence, the reader assumes it has read only 15 out of 16 properties and tries to read the last missing property, but since it has already reached the end of the byte array, it fails with the following exception:

java.lang.RuntimeException: Could not deserialize object [JobRecord]. Deserialization stuck at offset 306 of length 306

After reading the customHeaders property key, it reads the next byte in #skipValues()

https://github.com/camunda-cloud/zeebe/blob/9e189a1d087ba08191cb888bca2ccc32c325cf8e/msgpack-core/src/main/java/io/camunda/zeebe/msgpack/spec/MsgPackReader.java#L355-L356

The byte suggests that the value is a fixmap with a map length of 1:

https://github.com/camunda-cloud/zeebe/blob/9e189a1d087ba08191cb888bca2ccc32c325cf8e/msgpack-core/src/main/java/io/camunda/zeebe/msgpack/spec/MsgPackReader.java#L366-L371

This increases the count, so the reader continues in the while loop and reads the next byte, and so on. That way, the entire variables property is consumed as part of the custom headers.
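For reference, a MessagePack fixmap header byte encodes the entry count in its low nibble; the following is illustrative bit math only, not the MsgPackReader implementation:

```java
// Illustrative bit math only (not the MsgPackReader implementation): in MessagePack,
// bytes 0x80..0x8f encode a "fixmap" whose low nibble is the number of key/value pairs.
public class FixMapByteSketch {
  public static void main(String[] args) {
    final byte headerByte = (byte) 0x81; // -127 as a signed Java byte
    final boolean isFixMap = (headerByte & 0xF0) == 0x80;
    final int mapSize = headerByte & 0x0F;
    // prints: fixmap=true, size=1 -> the reader expects one key/value pair to follow,
    // so it consumes the next property (variables) as the missing map entry
    System.out.println("fixmap=" + isFixMap + ", size=" + mapSize);
  }
}
```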

To reproduce

  1. Deploy a BPMN process with a user task
  2. Start a process instance
  3. Let the assignee (or candidate groups) expression evaluate to an empty string

Log/Stacktrace

Full Stacktrace

java.lang.RuntimeException: Could not deserialize object [JobRecord]. Deserialization stuck at offset 306 of length 306
	at io.camunda.zeebe.msgpack.UnpackedObject.wrap(UnpackedObject.java:38) ~[zeebe-msgpack-value-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.logstreams.impl.log.LoggedEventImpl.readValue(LoggedEventImpl.java:135) ~[zeebe-logstreams-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.RecordValues.readRecordValue(RecordValues.java:35) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.broker.exporter.stream.ExporterDirector$RecordExporter.wrap(ExporterDirector.java:489) ~[zeebe-broker-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.broker.exporter.stream.ExporterDirector.lambda$exportEvent$9(ExporterDirector.java:391) ~[zeebe-broker-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.retry.ActorRetryMechanism.run(ActorRetryMechanism.java:36) ~[zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.retry.EndlessRetryStrategy.run(EndlessRetryStrategy.java:50) ~[zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:95) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:192) [zeebe-util-1.3.0.jar:1.3.0]
Caused by: java.lang.IndexOutOfBoundsException: index=306 capacity=306
	at org.agrona.concurrent.UnsafeBuffer.boundsCheck(UnsafeBuffer.java:2212) ~[agrona-1.14.0.jar:1.14.0]
	at org.agrona.concurrent.UnsafeBuffer.getByte(UnsafeBuffer.java:947) ~[agrona-1.14.0.jar:1.14.0]
	at io.camunda.zeebe.msgpack.spec.MsgPackReader.readStringLength(MsgPackReader.java:115) ~[zeebe-msgpack-core-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.msgpack.value.StringValue.read(StringValue.java:96) ~[zeebe-msgpack-value-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.msgpack.value.ObjectValue.read(ObjectValue.java:91) ~[zeebe-msgpack-value-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.msgpack.UnpackedObject.wrap(UnpackedObject.java:32) ~[zeebe-msgpack-value-1.3.0.jar:1.3.0]
	... 12 more

Relevant links

Possible solutions

  • Ensure that the correct size of valid headers is written
  • For already existing records in the logstream, the reader should handle such "corrupted" jobs, so that the exporter can export that job and the stream processor can replay that event (in case of a failover or restart).

Environment:

  • Zeebe Version: 1.3.0
@pihme pihme added the kind/bug Categorizes an issue or PR as a bug label Jan 19, 2022
@pihme
Contributor Author

pihme commented Jan 19, 2022

@npepinpe Can you maybe assign someone to look at it? This seems to be a recurring issue in that cluster. I won't have time to root-cause this today because of the Scala training.

@npepinpe
Member

Will do 👍

@romansmirnov
Member

romansmirnov commented Jan 20, 2022

@npepinpe, a quick update on the impact:

The exporter fails to export a job record because reading the record fails with the exception above. Meaning, the job record won't be exported, and the exporter retries exporting the record forever (and therefore most likely occupies the actor thread). This results in the following:

  • No other records are exported (for that partition),
  • Potentially, the broker might fail with an OutOfMemoryError because a lot of actor jobs will pile up in the exporter actor (that won't be executed),
  • The journal cannot be compacted anymore, which may result in higher disk space usage, and
  • Operate is not up-to-date.

Next step: root cause the issue

@npepinpe
Member

Thanks Roman! I'm interested to know the root cause, but if it proves difficult to diagnose, also the likelihood of this occurring, and whether there's any way to recover from the fault.

@romansmirnov
Member

Basically, whenever the property customHeaders of a job is empty, the variables property is skipped when reading the logged job event. I will update the issue description to describe the actual bug accordingly.

@npepinpe, I assume that the property customHeaders can be empty when there isn't any assignee nor any candidate group given in the BPMN process, for example when using user tasks. (Not sure about service tasks, though, but I guess there are also cases where the custom headers can be empty - at least the code suggests so.) Regarding the likelihood, not sure - is it safe to assume that jobs with empty customHeaders are exceptional cases (in how user tasks and service tasks are used)?

However, I personally think it is a critical bug, because if a broker runs into this issue, the partition cannot be recovered. Meaning, the job cannot be exported. Also, in case of a failover (or restart) the logged job event cannot be replayed anymore, because it fails with the same exception as the exporter does.

@romansmirnov romansmirnov changed the title RuntimeException: Could not deserialize object [JobRecord]. Deserialization stuck at offset Jobs with empty customHeaders cannot be read by the stream processor and the exporter Jan 20, 2022
@npepinpe npepinpe added severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround Impact: Availability labels Jan 21, 2022
@npepinpe npepinpe added this to Ready in Zeebe Jan 21, 2022
@npepinpe
Member

@saig0 @korthout - is this the expected behavior of the message pack parser? Is there no possibility of recovering by patching the message pack parsing code to handle this case, so no data would be lost?

If not, then yes this is likely a critical bug.

@npepinpe
Member

It seems there's some dispute over the root cause, but consensus that once you hit this state there's no way to recover, so I will treat this as a critical bug.

@npepinpe npepinpe assigned remcowesterhoud and unassigned npepinpe Jan 24, 2022
@npepinpe npepinpe moved this from Ready to In progress in Zeebe Jan 24, 2022
@romansmirnov
Member

romansmirnov commented Jan 24, 2022

@npepinpe & @remcowesterhoud, I had a second look at the issue. I was wrong with the empty customHeaders, I misinterpreted the byte expressing how many elements the customHeaders property contains. Actually, the byte suggests that there is one element in the custom header map but this expected element does not exist in the byte array.

The event's custom headers get written here:

https://github.com/camunda-cloud/zeebe/blob/9b41a8160d5432fe148d4425670fd4ed40e87424/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L230-L257

To my understanding, when creating the corresponding job, the taskHeaders map must be non-empty and contain one element, so that when writing the map headers it will write the byte -127 (indicating that there is one element in the map) as the first byte to the given byte array:

https://github.com/camunda-cloud/zeebe/blob/9b41a8160d5432fe148d4425670fd4ed40e87424/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L240-L241

But writing the actual element to the byte array does not happen (or does not succeed):

https://github.com/camunda-cloud/zeebe/blob/9b41a8160d5432fe148d4425670fd4ed40e87424/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L243-L252

Basically, it does not check whether everything got written to the byte array as expected (note: it checks for a valid header, and if the header is not valid, it skips that element in the map).
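A hedged, byte-level sketch of that write pattern (hypothetical helper code, not the actual MsgPackWriter or BpmnJobBehavior): the map header byte is derived from the map's size before the entries are validated, so a skipped entry leaves a dangling count:

```java
// Hypothetical sketch of the write pattern described above (not the real Zeebe code):
// the fixmap header byte is written from taskHeaders.size() before validation,
// so an invalid (skipped) entry leaves a header byte that promises a missing entry.
import java.io.ByteArrayOutputStream;
import java.util.Map;

public class TaskHeaderWriteSketch {

  static byte[] encode(final Map<String, String> taskHeaders) {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    // fixmap header: 0x80 | size -- for one entry this is 0x81 (-127 as a signed byte)
    out.write(0x80 | taskHeaders.size());
    taskHeaders.forEach(
        (key, value) -> {
          // the "is this header valid?" check happens only here, after the size byte
          if (key != null && !key.isEmpty() && value != null && !value.isEmpty()) {
            writeString(out, key);
            writeString(out, value);
          }
        });
    return out.toByteArray(); // for {"assignee": ""} this is just [0x81] -- header without entry
  }

  static void writeString(final ByteArrayOutputStream out, final String s) {
    // fixstr header (0xa0 | length) followed by the raw bytes; enough for this sketch
    final byte[] bytes = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    out.write(0xa0 | bytes.length);
    out.write(bytes, 0, bytes.length);
  }

  public static void main(String[] args) {
    System.out.println(encode(Map.of("assignee", "")).length); // 1 -> only the header byte
  }
}
```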

When I debug that part with an empty custom header and manually manipulate the map during its execution by adding an empty element taskHeaders.put("", ""), I can eventually reproduce the exception. But it requires restarting the broker so that the stream processor fails to replay that event.

However, I could not produce a BPMN process (containing a user task and a service task with different headers) that reproduces that exception without any manual manipulation.

@npepinpe
Member

Thanks for double checking! About reproducing, @remcowesterhoud you can ask @menski, he mentioned he might be able to reproduce this as it came from one of his clusters.

@remcowesterhoud
Contributor

I have created #8638 as a possible fix for this issue (therefore the PR does not close the issue yet). It's hard to know if this was the root cause of the problem since I've been unable to reproduce it via a BPMN process. I have asked @menski if he knows how to do this.

@npepinpe
Member

The tentative fix only fixes how we write the data, but wouldn't help us recover the broken state, would it? If not yet done, please spend some time on this afterwards to at least figure out if there's anything we can do to help someone recover their data after they ran into this.

@menski
Contributor

menski commented Jan 24, 2022

@remcowesterhoud I tried to reproduce it with the setup of the last cluster that failed, but it didn't happen again. I will keep the cluster running overnight, but it doesn't look like it is reproducible. The cluster I set up is https://console.cloud.ultrawombat.com/org/6ff582aa-a62e-4a28-aac7-4d2224d8c58a/cluster/9d464e7f-5a93-4a4f-9d22-9aa8dedec331

@Zelldon
Member

Zelldon commented Jan 25, 2022

Do we still have the cluster which was causing this? If so we could take a look at the data via zdb. @menski did you update your cluster to 1.3.0? Maybe this caused that issue, since we changed something in the custom header handling?

@menski
Contributor

menski commented Jan 25, 2022

No, we don't have the cluster anymore; it was deleted as it was constantly throwing exceptions, but the raft partition should be available here: https://drive.google.com/drive/folders/1ZDiXZ_gox4TJf7JbENeZJODOGbYZzw_K?usp=sharing

The cluster I deleted was from my release presentation. I directly created it with 1.3.0 and then deployed 4 definitions on it to demonstrate the call activity hierarchy in Operate. As far as I remember, I started one instance which, as expected, led to an incident for demonstration purposes. But maybe I also did something else which I can't remember.

Maybe somebody can check the log and tell me which process the job belongs to; maybe I did something else which I forgot, as the incident shouldn't create a job.

Parent.bpmn20.xml.txt
Child 1.bpmn20.xml.txt
Child 2.bpmn20.xml.txt
Child 3.bpmn20.xml.txt

@romansmirnov
Member

romansmirnov commented Jan 25, 2022

Could it be that when evaluating the assignee or candidate groups expression an empty string is returned?

Basically, this returns an empty string "":

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L151

The assignee is not null, so the empty string is put into the headers map under the key assignee:

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L153-L155

When encoding the headers, it checks for valid headers, i.e., whether the key and the value are not null and not empty:

https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L259-L261

Since the assignee has an empty string as a value, the header is not valid and is skipped when the headers are written.

@remcowesterhoud, does it maybe help to reproduce the issue?

@korthout
Member

Nice find @romansmirnov. That definitely looks like it would cause problems. @remcowesterhoud You had the actual problematic process, right? Could this situation arise in there?

@remcowesterhoud
Contributor

I just tested this, and it does indeed reproduce the issue. Nice find @romansmirnov!

@romansmirnov
Member

Great! I am relieved 😄 Sorry for my first wrong analysis.

@romansmirnov romansmirnov changed the title Jobs with empty customHeaders cannot be read by the stream processor and the exporter When assignee or candidate groups expression evaluates to empty string, the logged event cannot be processed nor exported Jan 25, 2022
@romansmirnov romansmirnov changed the title When assignee or candidate groups expression evaluates to empty string, the logged event cannot be processed nor exported When assignee or candidate groups expression evaluates to an empty string, the logged event cannot be processed nor exported Jan 25, 2022
ghost pushed a commit that referenced this issue Jan 26, 2022
8638: Write correct header size when encoding task headers r=remcowesterhoud a=remcowesterhoud

## Description

When we write the header size we would include invalid headers. These headers would then be filtered and not written to the buffer. Therefore, the size could differ from the actual number of headers written to the buffer.

With this fix we first validate all the headers before writing the header size. This way we will never write the wrong size.
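A minimal sketch of that approach (hypothetical code, not the actual PR diff): filter out invalid headers first, then derive the size from what remains, so the declared count always matches what ends up in the buffer.

```java
// Hypothetical sketch of the fix's approach (not the actual PR diff): validate first,
// then write the size of the remaining entries, so declared and written counts match.
import java.util.Map;
import java.util.stream.Collectors;

public class TaskHeaderWriteFixSketch {

  static Map<String, String> validHeaders(final Map<String, String> taskHeaders) {
    return taskHeaders.entrySet().stream()
        .filter(e -> isValidHeader(e.getKey(), e.getValue()))
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }

  static boolean isValidHeader(final String key, final String value) {
    return key != null && !key.isEmpty() && value != null && !value.isEmpty();
  }

  public static void main(String[] args) {
    final Map<String, String> valid = validHeaders(Map.of("assignee", ""));
    // write the map header from the already-filtered map, then write exactly those entries
    System.out.println("declared and written size: " + valid.size()); // 0
  }
}
```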

## Related issues


Relates to #8611

Since I have not yet been able to reproduce the problem I have decided against closing this issue with this PR. First I'd like to discuss if this could indeed be the root cause.



Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
ghost pushed a commit that referenced this issue Jan 26, 2022
8661: [Backport stable/1.3] Write correct header size when encoding task headers r=remcowesterhoud a=github-actions[bot]

# Description
Backport of #8638 to `stable/1.3`.

relates to #8611

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@remcowesterhoud
Contributor

@npepinpe To get back to your question about how to recover from this problem: I can't think of any way to do this.

If I understand everything correctly, when we read the records from the log there is an expected number of bytes belonging to a specific record. Because of this bug, the expected number will be greater than the actual number of bytes, thus resulting in the IndexOutOfBoundsException.
In order to recover, we would have to somehow skip the difference between the expected and actual byte counts. The problem I see here is that this is all generic code, and I can't think of a way to tell whether a record is corrupted by this problem, or how many bytes should be skipped.

@npepinpe
Member

All good. Let me know when this is merged and closed so we can do a patch release then. The sooner this is out, the fewer users will run into this unrecoverable issue.

@remcowesterhoud
Contributor

It is merged and backported to 1.3, so I'll close this issue now :)

Zeebe automation moved this from In progress to Done Jan 27, 2022
@npepinpe npepinpe added version:1.3.2 Marks an issue as being completely or in parts released in 1.3.2 1.3.0 labels Jan 28, 2022
@KerstinHebel KerstinHebel removed this from Done in Zeebe Mar 23, 2022