
[Backport stable/8.4] Fix: Zeebe gateway fails to stream out activated jobs by not respecting the max message size #15784 #16229

Merged
5 commits merged into stable/8.4 from bcan-backport-15784-to-stable/8.4 on Feb 5, 2024

Conversation

@berkaycanbc berkaycanbc commented Feb 5, 2024

Description

The only change compared to the original PR is the conflict resolution in RoundRobinActivateJobsHandler.java, where Git reported conflicts in the import statements.

Backport of #15784 to stable/8.4.

Related issue #12778

Related issues

closes #

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Other teams:
If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.

Please refer to our review guidelines.

…ed jobs by not respecting the maxMessageSize

A test is added to verify the gateway failure with RESOURCE_EXHAUSTED.
In the test case, a process instance with a 1MB variable is created.
In the broker, the JobBatchCollector sends jobs below the 4MB limit to the
gateway, but since the response grows beyond 4MB after converting
MessagePack to JSON, the gateway fails to stream out the jobs with
RESOURCE_EXHAUSTED.

(cherry picked from commit a6a6b95)
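For illustration, here is a minimal sketch of how this failure mode could be reproduced with the Zeebe Java client; the gateway address, the process id "process", and the job type "task" are placeholders, not values from the actual test:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.Map;

public class LargeVariableRepro {
  public static void main(String[] args) {
    try (ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // placeholder address
            .usePlaintext()
            .build()) {

      // A ~1MB variable fits into the broker's 4MB job batch as MessagePack,
      // but the JSON-encoded gRPC response can grow past the gateway's
      // maxMessageSize once the jobs are mapped for the client.
      final Map<String, Object> variables = Map.of("payload", "x".repeat(1024 * 1024));
      client
          .newCreateInstanceCommand()
          .bpmnProcessId("process") // placeholder process id
          .latestVersion()
          .variables(variables)
          .send()
          .join();

      // Before the fix, activating such jobs could fail with RESOURCE_EXHAUSTED.
      client.newActivateJobsCommand().jobType("task").maxJobsToActivate(10).send().join();
    }
  }
}
```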
…t respecting the maxMessageSize

There were three options to solve the original problem:
1) When mapping the broker response to the client response, do bookkeeping
on the response size to ensure it stays within the maxMessageSize.
2) Double the max inbound message size in the Zeebe Java Client by default.
3) Bring the WARNING: Stream Error exception into the control flow of the Gateway
so that the Gateway can react to it and potentially make the jobs re-activatable again.

I used a mix of options 1 and 3 for better UX. With option 2, it would still be
possible for client users to override the default max inbound message size, which
might result in the same issue.

In the ResponseMapper, we check whether the gRPC response size of each activated job
would exceed the allowed maxMessageSize in the gateway. Afterward, we re-activate the
exceeding jobs in RoundRobinActivateJobsHandler, similar to how we handle failed jobs.

Using the `getSerializedSize()` method shouldn't have any performance impact here,
since its result is cached and reused when the response is actually returned.
See: https://protobuf.dev/reference/java/api-docs/com/google/protobuf/AbstractMessage.html#getSerializedSize--

(cherry picked from commit 58c0e20)
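As a rough sketch of the bookkeeping described above (this is not the actual ResponseMapper code; the class and the Split holder are made up for illustration):

```java
import com.google.protobuf.Message;
import java.util.ArrayList;
import java.util.List;

final class ResponseSizeBookkeeping {

  /** Hypothetical holder for the split result; the real PR uses its own types. */
  record Split<T extends Message>(List<T> accepted, List<T> toReactivate) {}

  static <T extends Message> Split<T> splitBySize(final List<T> jobs, final long maxMessageSize) {
    final List<T> accepted = new ArrayList<>();
    final List<T> toReactivate = new ArrayList<>();
    long responseSize = 0;

    for (final T job : jobs) {
      // getSerializedSize() is cheap to call: protobuf caches the result and
      // reuses it when the message is eventually written to the wire.
      final int jobSize = job.getSerializedSize();
      if (responseSize + jobSize <= maxMessageSize) {
        accepted.add(job);
        responseSize += jobSize;
      } else {
        // Jobs that would push the response past maxMessageSize are handed
        // back to be made activatable again, similar to failed jobs.
        toReactivate.add(job);
      }
    }
    return new Split<>(accepted, toReactivate);
  }
}
```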
refactor(gateway): use long for maxMessageSize

We would like to depend on the Spring framework only for
running application containers (e.g. the gateway and broker).
Therefore, Spring's DataSize object is replaced with a plain `long`.

refactor(gateway): rename maxMessageSize and add JavaDoc to toActivateJobsResponse

Since this maxResponseSize is specific to the toActivateJobsResponse
method in the mapper, it is better to document the reasoning for it in the JavaDoc.

refactor(gateway): correct the actual size of returned response

For every ActivatedJob added to the ActivateJobsResponse, 2 bytes are
added to the serialized size. While calculating the total response size,
we need to account for this overhead as well.
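To make that overhead concrete, here is a hedged illustration using protobuf's CodedOutputStream size helpers (the class and field number are placeholders): each element of a repeated message field is written as tag + length varint + body, which works out to body + 2 bytes as long as the field number is at most 15 and the body is under 128 bytes.

```java
import com.google.protobuf.CodedOutputStream;
import com.google.protobuf.Message;

final class RepeatedFieldOverhead {

  /**
   * Wire size of one element of a repeated message field: tag + length varint
   * + body. For field numbers 1..15 and bodies under 128 bytes this is exactly
   * body + 2, which is the "2 bytes" mentioned above; larger bodies need a
   * longer length varint, so the overhead is not a constant.
   */
  static int wireSize(final int fieldNumber, final Message element) {
    final int body = element.getSerializedSize();
    return CodedOutputStream.computeTagSize(fieldNumber)
        + CodedOutputStream.computeUInt32SizeNoTag(body)
        + body;
  }
}
```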

feat(gateway): handle if max message size exceeded after building the response

The statically added metadata size is removed because I discovered that it can
change based on the response data. See:
https://github.com/protocolbuffers/protobuf/blob/main/java/core/src/main/java/com/google/protobuf/CodedOutputStream.java#L607
https://github.com/protocolbuffers/protobuf/blob/main/java/core/src/main/java/com/google/protobuf/CodedOutputStream.java#L700

Instead, a final check is added to validate whether the actual response size exceeds
the max message size. I moved the re-building to the end because building a gRPC
response is costly, and the case where the built response exceeds the max message
size is rare, since the metadata is expected to be less than 100 bytes long.
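A minimal sketch of that final check, assuming the generated GatewayOuterClass response types; the helper and the trim-and-rebuild loop are illustrative, not the gateway's actual code:

```java
import io.camunda.zeebe.gateway.protocol.GatewayOuterClass.ActivateJobsResponse;
import io.camunda.zeebe.gateway.protocol.GatewayOuterClass.ActivatedJob;
import java.util.List;

final class FinalSizeCheck {

  static ActivateJobsResponse buildWithinLimit(
      final ActivateJobsResponse.Builder builder,
      final long maxMessageSize,
      final List<ActivatedJob> toReactivate) {
    ActivateJobsResponse response = builder.build();
    // Rare case: per-element wire metadata pushed the built message past the
    // limit even though the bookkept sum of job sizes was within it.
    while (response.getSerializedSize() > maxMessageSize && builder.getJobsCount() > 0) {
      final int last = builder.getJobsCount() - 1;
      toReactivate.add(builder.getJobs(last)); // hand the job back for re-activation
      builder.removeJobs(last);
      response = builder.build(); // building is costly, so this only runs in the rare case
    }
    return response;
  }
}
```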

test(gateway): handle if max message size exceeded after building the response

A test is added for the case where the gRPC response size is bigger than the
max message size once the response is built. The numbers 5100 and 144 are used to
actually trigger the case: setting jobVariableSize to 144 produces ActivatedJobs of
1020 bytes each, and 5 jobs of 1020 bytes equal exactly 5100 bytes (the configured
maxMessageSize). But when we build the actual response, it exceeds 5100 bytes and
falls into our case.

(cherry picked from commit b2ad953)
* Removed an unnecessary if check
* Updated the tests to make sure all jobs are created; otherwise they might end up
flaky. In the old implementation, when we limited the stream to 5 records, it only
threw an exception when the stream had no entries after 5 seconds. It would wait
5 seconds and then continue even if there was only a single job record, which can
cause flakiness.

fix(gateway): review fixes for gateway max message size assertion

* Removed blocking join() calls
* Removed unused variables
* Added a limit to the job-created assertion to speed up test execution

test: assert jobs created before worker opened

This helps avoid multiple job activation roundtrips, because the worker
is only opened once all jobs are created.

Otherwise, there's a chance that the worker only receives some of the
jobs and has to poll for more jobs until the final assertion is met.

(cherry picked from commit d5e3deb)
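A hedged sketch of this pattern using Zeebe's RecordingExporter test utility (the helper name and parameters are made up, and the exact assertion in the PR may differ):

```java
import static org.assertj.core.api.Assertions.assertThat;

import io.camunda.zeebe.protocol.record.intent.JobIntent;
import io.camunda.zeebe.test.util.record.RecordingExporter;

final class JobCreationAwait {

  /** Asserts that {@code jobCount} jobs of the given type have been created. */
  static void awaitJobsCreated(final String jobType, final int jobCount) {
    assertThat(
            RecordingExporter.jobRecords(JobIntent.CREATED)
                .withType(jobType)
                .limit(jobCount) // the limit keeps the stream from waiting out its timeout
                .count())
        .isEqualTo(jobCount);
  }
}
```

Opening the worker only after this assertion passes means a single activation roundtrip can pick up all jobs.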
@berkaycanbc berkaycanbc changed the title [Backport stable/8.1] Fix: Zeebe gateway fails to stream out activated jobs by not respecting the max message size #15784 [Backport stable/8.4] Fix: Zeebe gateway fails to stream out activated jobs by not respecting the max message size #15784 Feb 5, 2024
korthout commented Feb 5, 2024

There's a silly checkstyle error:

src/main/java/io/camunda/zeebe/gateway/impl/job/RoundRobinActivateJobsHandler.java:[16,1] (imports) CustomImportOrder: Wrong lexicographical order for 'io.camunda.zeebe.gateway.ResponseMapper.JobActivationResult' import. Should be before 'io.camunda.zeebe.gateway.cmd.BrokerRejectionException'.

@korthout korthout left a comment


Nice work on backporting @berkaycanbc ✍️

🔧 I'm sure you'll resolve the checkstyle error before merging, not gonna ask you for another review on this

LGTM 👍

🔧 Might be useful to add some links to the backported PR and the related issue to the pull request description (here is a recent example I made)

@berkaycanbc berkaycanbc added this pull request to the merge queue Feb 5, 2024
Merged via the queue into stable/8.4 with commit 27ba208 Feb 5, 2024
31 checks passed
@berkaycanbc berkaycanbc deleted the bcan-backport-15784-to-stable/8.4 branch February 5, 2024 17:51