[Backport stable/8.4] Fix: Zeebe gateway fails to stream out activated jobs by not respecting the max message size #15784 #16229
Conversation
…ed jobs by not respecting the maxMessageSize

A test is added to verify the gateway failure with RESOURCE_EXHAUSTED. In the test case, a process instance with a 1 MB variable is created. In the broker, JobBatchCollector sends jobs whose size is below 4 MB to the gateway, but since the response grows beyond 4 MB after converting MessagePack to JSON, the gateway fails to stream out the jobs with RESOURCE_EXHAUSTED.

(cherry picked from commit a6a6b95)
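The size blow-up described above can be illustrated with a toy sketch. This is not Zeebe's actual MessagePack-to-JSON conversion; the class name and the naive escaping logic are invented for illustration. The point is that a payload that is compact in a binary encoding can grow once rendered as escaped JSON text.

```java
// Illustrative only: shows why a payload under the limit in binary form
// can exceed the limit after being serialized as JSON text.
public class JsonExpansionDemo {

  /** Naive JSON string escaping, for demonstration purposes only. */
  static String toJsonString(String raw) {
    StringBuilder sb = new StringBuilder("\"");
    for (char c : raw.toCharArray()) {
      if (c == '"' || c == '\\') {
        sb.append('\\'); // each quote/backslash doubles in size
      }
      sb.append(c);
    }
    return sb.append('"').toString();
  }

  public static void main(String[] args) {
    // Worst case: a 1 MB variable value consisting entirely of quotes.
    String raw = "\"".repeat(1024 * 1024);
    String json = toJsonString(raw);
    System.out.println(raw.length());  // 1048576
    System.out.println(json.length()); // 2097154: roughly double after escaping
  }
}
```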
…t respecting the maxMessageSize

There were three options to solve the original problem:

1. When mapping the broker response to the client response, do bookkeeping on the response size to ensure it stays within maxMessageSize.
2. Double the max inbound message size in the Zeebe Java Client by default.
3. Bring the WARNING: Stream Error exception into the control flow of the gateway so that the gateway can react to it and potentially make the jobs re-activatable again.

I used a mix of options 1 and 3 for better UX. With option 2, client users could still override the default max inbound message size, which might result in the same issue.

In the ResponseMapper, we check whether the gRPC response size of the activated job will exceed the allowed maxMessageSize in the gateway. Afterward, we re-activate the exceeding jobs in RoundRobinActivateJobsHandler, similar to what we do for failed jobs.

Using the `getSerializedSize()` method shouldn't have any performance impact here, since its result is cached and reused when the response is actually returned. See: https://protobuf.dev/reference/java/api-docs/com/google/protobuf/AbstractMessage.html#getSerializedSize--

(cherry picked from commit 58c0e20)
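The bookkeeping from option 1 combined with the re-activation from option 3 can be sketched as follows. The class and method names here are hypothetical, not the actual Zeebe gateway code: jobs are represented only by their serialized sizes, and jobs that would push the response over the limit are collected for re-activation instead of being dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the response-size bookkeeping described above.
public class ResponseSizeBookkeeping {

  /**
   * Splits jobs (given by serialized size) into those that fit under
   * maxMessageSize and those that must be re-activated, like failed jobs.
   * Returns [includedSizes, exceedingSizes].
   */
  static List<List<Long>> partitionBySize(List<Long> jobSizes, long maxMessageSize) {
    List<Long> included = new ArrayList<>();
    List<Long> exceeding = new ArrayList<>();
    long total = 0;
    for (long size : jobSizes) {
      // Each repeated ActivatedJob also adds a small per-element framing
      // overhead on top of its payload (2 bytes assumed here).
      long withOverhead = size + 2;
      if (total + withOverhead <= maxMessageSize) {
        total += withOverhead;
        included.add(size);
      } else {
        exceeding.add(size); // re-activated later instead of being lost
      }
    }
    return List.of(included, exceeding);
  }
}
```

For example, with a 2100-byte limit and three 1020-byte jobs, the first two fit (2044 bytes including overhead) and the third is handed back for re-activation.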
refactor(gateway): use long for maxMessageSize

We would like to depend on the Spring framework only for running application containers (e.g. gateway and broker). Therefore, Spring's DataSize object is replaced with `long`.

refactor(gateway): rename maxMessageSize and add JavaDoc to toActivateJobsResponse

Since this maxResponseSize is specific to the toActivateJobsResponse method in the mapper, it is better to document the reasons for it in the JavaDoc.

refactor(gateway): correct the actual size of the returned response

For every ActivatedJob added to ActivateJobsResponse, 2 bytes are added to the serialized size. While calculating the total response size, we need to account for this as well.

feat(gateway): handle the case where the max message size is exceeded after building the response

The statically added metadata size is removed because I discovered that it can change based on the response data. See:
https://github.com/protocolbuffers/protobuf/blob/main/java/core/src/main/java/com/google/protobuf/CodedOutputStream.java#L607
https://github.com/protocolbuffers/protobuf/blob/main/java/core/src/main/java/com/google/protobuf/CodedOutputStream.java#L700

Instead, a final check is added to validate whether the actual response size exceeds the max message size. I added the re-building at the end because building a gRPC response is costly. Additionally, the case where the built response size exceeds the max message size is rare, since the metadata is expected to be less than 100 bytes long.

test(gateway): handle the case where the max message size is exceeded after building the response

A test is added for the case where the gRPC response size is bigger than the max message size once the response is built. The numbers 5100 and 144 are used to actually trigger this case: setting jobVariableSize to 144 produces an ActivatedJob of 1020 bytes, and 5 jobs of 1020 bytes equal 5100 bytes (the configured maxMessageSize). But when we build the actual response, it exceeds 5100 bytes and falls into our case.

(cherry picked from commit b2ad953)
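The 5100/144 arithmetic can be checked with a small sketch of protobuf's length-delimited framing: each repeated message element is written as a field tag plus a varint-encoded length, then the payload. This is a simplified model (it assumes a 1-byte tag, i.e. a low field number), not the gateway's actual size computation. For payloads under 128 bytes the overhead is 2 bytes (tag plus 1-byte length), matching the "2 bytes per ActivatedJob" mentioned above; a 1020-byte payload needs a 2-byte length varint, so framing pushes five such jobs past the 5100-byte limit.

```java
// Simplified model of protobuf length-delimited framing cost.
public class ProtoFramingCost {

  /** Number of bytes a non-negative value occupies as a protobuf varint. */
  static int varintSize(long value) {
    int bytes = 1;
    while ((value >>>= 7) != 0) {
      bytes++;
    }
    return bytes;
  }

  /** Serialized size of one repeated element: 1-byte tag + length varint + payload. */
  static int framedSize(int payloadBytes) {
    return 1 + varintSize(payloadBytes) + payloadBytes;
  }

  public static void main(String[] args) {
    // Five 1020-byte jobs sum to exactly the configured 5100-byte limit,
    // but the built response carries framing bytes on top.
    int total = 0;
    for (int i = 0; i < 5; i++) {
      total += framedSize(1020);
    }
    System.out.println(total); // 5115, which exceeds the 5100-byte limit
  }
}
```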
* Removed unnecessary if check
* Updated tests to make sure all jobs are created; otherwise, they might become flaky. In the old implementation, when we limited the stream to 5 records, it only threw an exception when the stream had no entries after 5 seconds; it would wait 5 seconds and then continue even if there was only a single job record, which can cause flakiness.

fix(gateway): review fixes for gateway max message size assertion

* Removed blocking join() calls
* Removed unused variables
* Added a limit to the job-created assertion to speed up test execution

test: assert jobs created before worker opened

This helps avoid multiple job activation roundtrips, because the worker is only opened once all jobs are created. Otherwise, there's a chance that the worker only receives some of the jobs and has to poll for more jobs until the final assertion is met.

(cherry picked from commit d5e3deb)
gateway/src/main/java/io/camunda/zeebe/gateway/ResponseMapper.java
There's a silly checkstyle error:
Nice work on backporting @berkaycanbc ✍️
🔧 I'm sure you'll resolve the checkstyle error before merging, not gonna ask you for another review on this
LGTM 👍
🔧 Might be useful to add some links to the backported PR and the related issue to the pull request description (here is a recent example I made)
Description
The only change other than the actual PR is the conflict resolution in RoundRobinActivateJobsHandler.java: Git was conflicting on import statements.

Backport of #15784 to stable/8.4.
Related issue #12778
Related issues
closes #
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:
Documentation:
Other teams:
If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.
Please refer to our review guidelines.