When assignee or candidate groups expression evaluates to an empty string, the logged event cannot be processed nor exported #8611
Comments
@npepinpe Can you maybe assign someone to look at it? This seems to be a recurring issue in that cluster. I won't have time to root cause this today because of the Scala training.
Will do 👍
@npepinpe, a quick update on the impact: the exporter fails to export a job record because reading that record throws the exception above. Meaning, the job record won't be exported, and the exporter retries exporting it forever (most likely occupying the actor thread as a result).

Next step: root cause the issue.
Thanks Roman! I'm interested to know the root cause, but if it proves difficult to diagnose, also the likelihood of this occurring and whether there's any way to recover from the fault.
Basically, whenever the `customHeaders` property is empty, it cannot be read by the stream processor and the exporter. @npepinpe, that is my current assumption; I still need to confirm the root cause. However, I personally think it is a critical bug, because if a broker runs into this issue, the partition cannot be recovered. Meaning, the job cannot be exported. Also, in case of a failover (or restart), the logged job event cannot be replayed anymore, because it fails with the same exception as the exporter does.
It seems there's some dispute over the root cause, but consensus that once you hit this state there's no way to recover, so I will treat this as a critical bug.
@npepinpe & @remcowesterhoud, I had a second look at the issue. I was wrong with my earlier assumption about the empty `customHeaders`. The event's custom headers get written when creating the corresponding job; to my understanding, the declared number of header elements is written at that point. But writing the actual element to the byte array does not happen (or does not succeed): basically, it does not check if everything got written to the byte array as expected (note: it checks for a valid header, and if it is not valid, it skips that element in the map). When I debug that part with an empty custom header and manually manipulate the map during its execution by adding an empty element, the exception occurs. However, I could not produce a BPMN process (containing a user task and a service task with different headers) that reproduces that exception without any manual manipulation.
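For illustration, a minimal, self-contained sketch of the suspected write path (plain Java, not the actual Zeebe encoder; the map, the `isValidHeader` check, and the string-based "buffer" are stand-ins): the declared element count is taken from the unfiltered headers map, while invalid entries are skipped when writing, so the declared count and the written entries can diverge.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HeaderEncodingSketch {

  // stand-in for the validity check described above: key and value must be non-null and non-empty
  static boolean isValidHeader(final String key, final String value) {
    return key != null && !key.isEmpty() && value != null && !value.isEmpty();
  }

  public static void main(final String[] args) {
    final Map<String, String> headers = new LinkedHashMap<>();
    headers.put("assignee", ""); // the assignee expression evaluated to an empty string

    final List<String> buffer = new ArrayList<>();
    // suspected bug: the declared element count comes from the unfiltered map ...
    buffer.add("mapHeader(elementCount=" + headers.size() + ")");
    // ... but invalid entries are silently skipped while writing
    headers.forEach((key, value) -> {
      if (isValidHeader(key, value)) {
        buffer.add(key + "=" + value);
      }
    });

    // prints [mapHeader(elementCount=1)]: one entry is declared, but none follows,
    // so a reader will consume whatever comes next as the "missing" map entry
    System.out.println(buffer);
  }
}
```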
Thanks for double checking! About reproducing, @remcowesterhoud, you can ask @menski; he mentioned he might be able to reproduce this as it came from one of his clusters.
The tentative fix only fixes how we write the data, but it wouldn't help us recover the broken state, would it? If not yet done, please spend some time on this afterwards to at least figure out whether there's anything we can do to help someone recover their data after they run into this.
@remcowesterhoud I tried to reproduce it with the setup of the last cluster that failed, but it didn't happen again. I will keep the cluster running overnight, but it doesn't look like it is reproducible. The cluster I set up is https://console.cloud.ultrawombat.com/org/6ff582aa-a62e-4a28-aac7-4d2224d8c58a/cluster/9d464e7f-5a93-4a4f-9d22-9aa8dedec331
Do we still have the cluster which was causing this? If so, we could take a look at the data via zdb. @menski, did you update your cluster to 1.3.0? Maybe that caused the issue, since we changed something in the custom header handling?
No, we don't have the cluster anymore; it was deleted as it was constantly throwing exceptions, but the raft partition should be available here: https://drive.google.com/drive/folders/1ZDiXZ_gox4TJf7JbENeZJODOGbYZzw_K?usp=sharing

The cluster I deleted was from my release presentation. I created it directly with 1.3.0 and then deployed 4 definitions on it to demonstrate the call activity hierarchy in Operate. As far as I remember, I started one instance, which as expected led to an incident for demonstration purposes. But maybe I also did something else which I can't remember. Maybe somebody can check the log and tell me which process the job belongs to; maybe I did something else which I forgot, as the incident shouldn't create a job.

Parent.bpmn20.xml.txt
Could it be that when evaluating the assignee or candidate groups expression, an empty string is returned? Basically, the expression evaluation returns an empty string `""`. The `assignee` is not null, so the empty string is put into the `headers` map under the key `assignee`. When encoding the headers, it checks for valid headers, i.e., whether the key and the value are not `null` and not empty. Since the `assignee` has an empty string as its value, the header is not valid and is skipped while writing it. @remcowesterhoud, does it maybe help to reproduce the issue?
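If it helps, a hedged reproduction sketch along those lines, assuming the zeebe-bpmn-model user task builder exposes `zeebeAssigneeExpression(...)`; the process id and variable name are made up. The idea is simply a user task whose assignee expression resolves to an empty string at runtime.

```java
import io.camunda.zeebe.model.bpmn.Bpmn;
import io.camunda.zeebe.model.bpmn.BpmnModelInstance;

public class EmptyAssigneeRepro {

  public static void main(final String[] args) {
    // assumption: zeebeAssigneeExpression(...) is available on the user task builder
    final BpmnModelInstance process =
        Bpmn.createExecutableProcess("empty-assignee")
            .startEvent()
            .userTask("task")
            .zeebeAssigneeExpression("emptyVar") // expression evaluates to "" at runtime
            .endEvent()
            .done();

    // Deploy this model and create an instance with the variable {"emptyVar": ""};
    // the resulting user task job record should then carry an empty assignee header.
    System.out.println(Bpmn.convertToString(process));
  }
}
```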
Nice find @romansmirnov. That definitely looks like it would cause problems. @remcowesterhoud, you had the actual problematic process, right? Could this situation arise in there?
I just tested this, and it does indeed reproduce the issue. Nice find @romansmirnov!
Great! I am relieved 😄 Sorry for my initial wrong analysis.
8638: Write correct header size when encoding task headers r=remcowesterhoud a=remcowesterhoud

## Description

When we write the headers size we would include invalid headers. These headers would be filtered and not written to the buffer. Therefore, the size could differ from the actual amount of headers written in the buffer. With this fix we first validate all the headers before writing the header size. This way we will never write the wrong size.

## Related issues

Relates to #8611

Since I have not yet been able to reproduce the problem, I have decided against closing this issue with this PR. First I'd like to discuss if this could indeed be the root cause.

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
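Read alongside the PR description above, a simplified sketch of the approach it describes (plain Java, not the actual PR code; the example header keys are made up): validate the headers first, then derive the declared count from the validated entries, so the count always matches what gets written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FixedHeaderEncodingSketch {

  static boolean isValidHeader(final String key, final String value) {
    return key != null && !key.isEmpty() && value != null && !value.isEmpty();
  }

  public static void main(final String[] args) {
    final Map<String, String> headers = new LinkedHashMap<>();
    headers.put("assignee", "");      // invalid: empty value, will be dropped
    headers.put("formKey", "form-1"); // valid (made-up example header)

    // 1. validate first
    final Map<String, String> validHeaders = new LinkedHashMap<>();
    headers.forEach((key, value) -> {
      if (isValidHeader(key, value)) {
        validHeaders.put(key, value);
      }
    });

    // 2. write the count of the *validated* map, then exactly those entries
    final StringBuilder buffer =
        new StringBuilder("mapHeader(elementCount=" + validHeaders.size() + ")");
    validHeaders.forEach((key, value) -> buffer.append(' ').append(key).append('=').append(value));

    // prints "mapHeader(elementCount=1) formKey=form-1": declared count matches written entries
    System.out.println(buffer);
  }
}
```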
@npepinpe To get back to your question about how to recover from this problem: I can't think of any way to do this. If I understand everything correctly, when we read the records from the log there is an expected amount of bytes belonging to a specific record. Because of this bug, the expected amount will be greater than the actual amount of bytes, thus resulting in the exception above.
All good. Let me know when this is merged and closed so we can do a patch release then. The sooner this is out, the fewer users will run into this unrecoverable issue.
It is merged and backported to 1.3, so I'll close this issue now :)
Describe the bug
Whenever the assignee (or candidate groups) expression evaluates to an empty string, the logged event contains the wrong number expressing how many elements the `customHeaders` map contains.

Basically, this returns an empty string `""`:
https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L151

The `assignee` is not null, so it puts the empty string as the value with the key `assignee` into the `headers` map:
https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L153-L155

When encoding the headers, it checks for valid headers, i.e., whether the key and the value are not `null` and not empty:
https://github.com/camunda-cloud/zeebe/blob/c683c401309a07f2d38aaeee3f42e6227104a306/engine/src/main/java/io/camunda/zeebe/engine/processing/bpmn/behavior/BpmnJobBehavior.java#L259-L261

Since the `assignee` has an empty string as a value, the header is not valid and is skipped while writing it.

When reading this record, the reader assumes that the `customHeaders` property has one element, so the subsequent `variables` property (key and value) is set as the `customHeaders` property's value. That way, the `variables` property is "skipped". As a consequence, the reader assumes it has read only 15 of the 16 properties and tries to read the last missing property, but since it has already reached the end of the byte array, it fails with the exception below.

After reading the `customHeaders` property key, it reads the next byte in `#skipValues()`:
https://github.com/camunda-cloud/zeebe/blob/9e189a1d087ba08191cb888bca2ccc32c325cf8e/msgpack-core/src/main/java/io/camunda/zeebe/msgpack/spec/MsgPackReader.java#L355-L356

The byte suggests that the value is a fixed map with a map length of 1:
https://github.com/camunda-cloud/zeebe/blob/9e189a1d087ba08191cb888bca2ccc32c325cf8e/msgpack-core/src/main/java/io/camunda/zeebe/msgpack/spec/MsgPackReader.java#L366-L371

This increases the count, so the reader continues in the while-loop and reads the next byte, and so on. That way, the entire `variables` property is read.
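To make the read-side behaviour above concrete, a small byte-level sketch (the surrounding record layout is simplified; only the MessagePack fixmap encoding fact is relied on): the `customHeaders` value byte declares a fixmap with one entry, but no entry follows, so the reader treats the next property on the record as that entry and eventually runs out of bytes.

```java
public class MisreadSketch {

  public static void main(final String[] args) {
    // what ends up on the log for the customHeaders value: a MessagePack fixmap header
    // (1000_xxxx) whose low nibble declares how many key/value pairs follow
    final byte customHeadersValue = (byte) 0x81; // declares 1 entry

    // <-- nothing follows here, because the only header ("assignee" -> "") was skipped
    // when writing; the next bytes on the record belong to the "variables" property,
    // which the reader now consumes as the single customHeaders entry, so at the end
    // it still expects one more property and fails once it reaches the end of the byte array.

    System.out.printf(
        "fixmap header 0x%02x declares %d entries%n",
        customHeadersValue, customHeadersValue & 0x0F);
  }
}
```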
To reproduce
Log/Stacktrace
Full Stacktrace
Relevant links
Possible solutions
Environment: