
[Backport stable/8.4] Fix: Zeebe gateway fails to stream out activated jobs by not respecting the max message size #15784 #16229

Merged
5 commits merged into stable/8.4 from bcan-backport-15784-to-stable/8.4 on Feb 5, 2024

Conversation

@berkaycanbc berkaycanbc commented Feb 5, 2024

Description

The only change compared to the original PR is the conflict resolution in RoundRobinActivateJobsHandler.java, where Git reported conflicts in the import statements.

Backport of #15784 to stable/8.4.

Related issue #12778

Related issues

closes #

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Other teams:
If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.

Please refer to our review guidelines.

…ed jobs by not respecting the maxMessageSize

A test is added to verify the gateway failure with RESOURCE_EXHAUSTED.
In the test case, a process instance with a 1MB variable is created.
In the broker, the JobBatchCollector sends jobs below the 4MB limit to the
gateway, but since the response grows beyond 4MB after converting
MessagePack to JSON, the gateway fails to stream out the jobs with
RESOURCE_EXHAUSTED.

(cherry picked from commit a6a6b95)
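For illustration, here is a minimal sketch of how this failure mode could be reproduced with the Zeebe Java client; the gateway address, the process id "process", and the job type "task" are placeholders, not values from the actual test:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.Map;

public class LargeVariableRepro {
  public static void main(String[] args) {
    try (ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // placeholder address
            .usePlaintext()
            .build()) {

      // A ~1MB variable fits into the broker's 4MB job batch as MessagePack,
      // but the JSON-encoded gRPC response can grow past the gateway's
      // maxMessageSize once the jobs are mapped for the client.
      final Map<String, Object> variables = Map.of("payload", "x".repeat(1024 * 1024));
      client
          .newCreateInstanceCommand()
          .bpmnProcessId("process") // placeholder process id
          .latestVersion()
          .variables(variables)
          .send()
          .join();

      // Before the fix, activating such jobs could fail with RESOURCE_EXHAUSTED.
      client.newActivateJobsCommand().jobType("task").maxJobsToActivate(10).send().join();
    }
  }
}
```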
…t respecting the maxMessageSize

There were three options to solve the original problem:
1) When mapping the broker response to the client response, do bookkeeping
on the response size to ensure it stays within the maxMessageSize.
2) Double the max inbound message size in the Zeebe Java Client by default.
3) Bring the WARNING: Stream Error exception into the control flow of the Gateway
so that the Gateway can react to it and potentially make the jobs re-activatable again.

I used a mix of options 1 and 3 for better UX. With option 2, it would still be
possible for client users to override the default max inbound message size, which
might result in the same issue.

In the ResponseMapper, we check whether the gRPC response size of each activated job
would exceed the allowed maxMessageSize in the gateway. Afterward, we re-activate the
exceeding jobs in RoundRobinActivateJobsHandler, similar to how we handle failed jobs.

Using the `getSerializedSize()` method shouldn't have any performance impact here,
since its result is cached and reused when the response is actually returned.
See: https://protobuf.dev/reference/java/api-docs/com/google/protobuf/AbstractMessage.html#getSerializedSize--

(cherry picked from commit 58c0e20)
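As a rough sketch of the bookkeeping described above (this is not the actual ResponseMapper code; the class and the Split holder are made up for illustration):

```java
import com.google.protobuf.Message;
import java.util.ArrayList;
import java.util.List;

final class ResponseSizeBookkeeping {

  /** Hypothetical holder for the split result; the real PR uses its own types. */
  record Split<T extends Message>(List<T> accepted, List<T> toReactivate) {}

  static <T extends Message> Split<T> splitBySize(final List<T> jobs, final long maxMessageSize) {
    final List<T> accepted = new ArrayList<>();
    final List<T> toReactivate = new ArrayList<>();
    long responseSize = 0;

    for (final T job : jobs) {
      // getSerializedSize() is cheap to call: protobuf caches the result and
      // reuses it when the message is eventually written to the wire.
      final int jobSize = job.getSerializedSize();
      if (responseSize + jobSize <= maxMessageSize) {
        accepted.add(job);
        responseSize += jobSize;
      } else {
        // Jobs that would push the response past maxMessageSize are handed
        // back to be made activatable again, similar to failed jobs.
        toReactivate.add(job);
      }
    }
    return new Split<>(accepted, toReactivate);
  }
}
```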
refactor(gateway): use long for maxMessageSize

We would like to depend on the Spring framework only for
running application containers (e.g. the gateway and broker).
Therefore, Spring's DataSize object is replaced with a plain `long`.

refactor(gateway): rename maxMessageSize and add JavaDoc to toActivateJobsResponse

Since this maxResponseSize is specific to the toActivateJobsResponse
method in the mapper, it is better to document the reasoning for it in the JavaDoc.

refactor(gateway): correct the actual size of returned response

For every ActivatedJob added to the ActivateJobsResponse, 2 bytes are
added to the serialized size. While calculating the total response size,
we need to account for this overhead as well.
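To make that overhead concrete, here is a hedged illustration using protobuf's CodedOutputStream size helpers (the class and field number are placeholders): each element of a repeated message field is written as tag + length varint + body, which works out to body + 2 bytes as long as the field number is at most 15 and the body is under 128 bytes.

```java
import com.google.protobuf.CodedOutputStream;
import com.google.protobuf.Message;

final class RepeatedFieldOverhead {

  /**
   * Wire size of one element of a repeated message field: tag + length varint
   * + body. For field numbers 1..15 and bodies under 128 bytes this is exactly
   * body + 2, which is the "2 bytes" mentioned above; larger bodies need a
   * longer length varint, so the overhead is not a constant.
   */
  static int wireSize(final int fieldNumber, final Message element) {
    final int body = element.getSerializedSize();
    return CodedOutputStream.computeTagSize(fieldNumber)
        + CodedOutputStream.computeUInt32SizeNoTag(body)
        + body;
  }
}
```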

feat(gateway): handle if max message size exceeded after building the response

The statically added metadata size is removed because I discovered that it can
change based on the response data. See:
https://github.com/protocolbuffers/protobuf/blob/main/java/core/src/main/java/com/google/protobuf/CodedOutputStream.java#L607
https://github.com/protocolbuffers/protobuf/blob/main/java/core/src/main/java/com/google/protobuf/CodedOutputStream.java#L700

Instead, a final check is added to validate whether the actual response size exceeds
the max message size. I moved the re-building to the end because building a gRPC
response is costly, and the case where the built response exceeds the max message
size is rare, since the metadata is expected to be less than 100 bytes long.
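A minimal sketch of that final check, assuming the generated GatewayOuterClass response types; the helper and the trim-and-rebuild loop are illustrative, not the gateway's actual code:

```java
import io.camunda.zeebe.gateway.protocol.GatewayOuterClass.ActivateJobsResponse;
import io.camunda.zeebe.gateway.protocol.GatewayOuterClass.ActivatedJob;
import java.util.List;

final class FinalSizeCheck {

  static ActivateJobsResponse buildWithinLimit(
      final ActivateJobsResponse.Builder builder,
      final long maxMessageSize,
      final List<ActivatedJob> toReactivate) {
    ActivateJobsResponse response = builder.build();
    // Rare case: per-element wire metadata pushed the built message past the
    // limit even though the bookkept sum of job sizes was within it.
    while (response.getSerializedSize() > maxMessageSize && builder.getJobsCount() > 0) {
      final int last = builder.getJobsCount() - 1;
      toReactivate.add(builder.getJobs(last)); // hand the job back for re-activation
      builder.removeJobs(last);
      response = builder.build(); // building is costly, so this only runs in the rare case
    }
    return response;
  }
}
```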

test(gateway): handle if max message size exceeded after building the response

A test is added for the case where the gRPC response size is bigger than the
max message size once the response is built. The numbers 5100 and 144 are used to
actually trigger the case: setting jobVariableSize to 144 produces ActivatedJobs of
1020 bytes each, and 5 jobs of 1020 bytes equal exactly 5100 bytes (the configured
maxMessageSize). But when we build the actual response, it exceeds 5100 bytes and
falls into our case.

(cherry picked from commit b2ad953)
* Removed an unnecessary if check
* Updated the tests to make sure all jobs are created; otherwise they might end up
flaky. In the old implementation, when we limited the stream to 5 records, it only
threw an exception when the stream had no entries after 5 seconds. It would wait
5 seconds and then continue even if there was only a single job record, which can
cause flakiness.

fix(gateway): review fixes for gateway max message size assertion

* Removed blocking join() calls
* Removed unused variables
* Added a limit to the job-created assertion to speed up test execution

test: assert jobs created before worker opened

This helps avoid multiple job activation roundtrips, because the worker
is only opened once all jobs are created.

Otherwise, there's a chance that the worker only receives some of the
jobs and has to poll for more jobs until the final assertion is met.

(cherry picked from commit d5e3deb)
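A hedged sketch of this pattern using Zeebe's RecordingExporter test utility (the helper name and parameters are made up, and the exact assertion in the PR may differ):

```java
import static org.assertj.core.api.Assertions.assertThat;

import io.camunda.zeebe.protocol.record.intent.JobIntent;
import io.camunda.zeebe.test.util.record.RecordingExporter;

final class JobCreationAwait {

  /** Asserts that {@code jobCount} jobs of the given type have been created. */
  static void awaitJobsCreated(final String jobType, final int jobCount) {
    assertThat(
            RecordingExporter.jobRecords(JobIntent.CREATED)
                .withType(jobType)
                .limit(jobCount) // the limit keeps the stream from waiting out its timeout
                .count())
        .isEqualTo(jobCount);
  }
}
```

Opening the worker only after this assertion passes means a single activation roundtrip can pick up all jobs.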
@berkaycanbc berkaycanbc changed the title [Backport stable/8.1] Fix: Zeebe gateway fails to stream out activated jobs by not respecting the max message size #15784 [Backport stable/8.4] Fix: Zeebe gateway fails to stream out activated jobs by not respecting the max message size #15784 Feb 5, 2024
korthout commented Feb 5, 2024

There's a silly checkstyle error:

src/main/java/io/camunda/zeebe/gateway/impl/job/RoundRobinActivateJobsHandler.java:[16,1] (imports) CustomImportOrder: Wrong lexicographical order for 'io.camunda.zeebe.gateway.ResponseMapper.JobActivationResult' import. Should be before 'io.camunda.zeebe.gateway.cmd.BrokerRejectionException'.

@korthout korthout left a comment


Nice work on backporting @berkaycanbc ✍️

🔧 I'm sure you'll resolve the checkstyle error before merging, not gonna ask you for another review on this

LGTM 👍

🔧 Might be useful to add some links to the backported PR and the related issue to the pull request description (here is a recent example I made)

@berkaycanbc berkaycanbc added this pull request to the merge queue Feb 5, 2024
Merged via the queue into stable/8.4 with commit 27ba208 Feb 5, 2024
31 checks passed
@berkaycanbc berkaycanbc deleted the bcan-backport-15784-to-stable/8.4 branch February 5, 2024 17:51