[Backports stable/1.0] Ensure requests are always completed #7383

npepinpe · 2021-06-28T07:25:06Z

Description

This PR backports #7362. The only conflicts were in PartitionCommandSenderImpl, which still uses the Atomix object directly in 1.0.x, whereas 1.1.x uses the communication service directly instead.

Related issues

relates to #7360

Definition of Done

Depending on the issue and the pull request, not all items need to be done.

Code changes:

The changes are backwards compatible with previous versions
If it fixes a bug, then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/0.25) to the PR; if that fails, you need to create the backports manually.

Testing:

There are unit/integration tests that verify all acceptance criteria of the issue
New tests are written to ensure backwards compatibility with future versions
The behavior is tested manually
The change has been verified by a QA run
The impact of the changes is verified by a benchmark

Documentation:

The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
New content is added to the release announcement

Rewrites the request timeout logic in NettyMessagingService so that the request timeout is scheduled to cover the complete request lifecycle: creating the connection, performing the handshake, sending the request, receiving the response, etc. Previously, it was scheduled only after the handshake was performed, so if anything took too long or went wrong before that point, the method's contract (completing the returned future within the timeout) was broken. The number of threads for the timeout executor was also lowered from 4 to 1, as the executor should do nothing but fail futures (so it's not expected to be doing much). (cherry picked from commit 5b6d286)
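To illustrate arming the timeout before any connection work starts, here is a minimal sketch using only JDK `CompletableFuture`s and a single-threaded scheduler, not the actual Netty-based code. `LifecycleTimeout` and `send` are hypothetical names; the `Supplier` stands in for the whole connect/handshake/send/receive chain.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper; not the actual NettyMessagingService code. */
final class LifecycleTimeout implements AutoCloseable {
  // One thread suffices: its only job is to fail futures whose deadline has passed.
  private final ScheduledExecutorService timeoutExecutor =
      Executors.newSingleThreadScheduledExecutor();

  /**
   * Arms the timeout BEFORE the request starts, so connecting, handshaking,
   * sending, and awaiting the response all count against one deadline.
   */
  <T> CompletableFuture<T> send(Supplier<CompletableFuture<T>> lifecycle, Duration timeout) {
    final CompletableFuture<T> result = new CompletableFuture<>();
    final ScheduledFuture<?> timer =
        timeoutExecutor.schedule(
            () ->
                result.completeExceptionally(
                    new TimeoutException("Request timed out after " + timeout)),
            timeout.toMillis(),
            TimeUnit.MILLISECONDS);
    // Whichever way the result completes, the timer is no longer needed.
    result.whenComplete((value, error) -> timer.cancel(false));

    try {
      lifecycle
          .get()
          .whenComplete(
              (value, error) -> {
                if (error != null) {
                  result.completeExceptionally(error);
                } else {
                  result.complete(value);
                }
              });
    } catch (RuntimeException e) {
      // Failures while starting the lifecycle must complete the future too.
      result.completeExceptionally(e);
    }
    return result;
  }

  @Override
  public void close() {
    timeoutExecutor.shutdownNow();
  }
}
```

Because the timer is armed first, a hang at any stage, including a handshake that never finishes, still completes the returned future.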

Simplifies local and remote client connections by replacing the callback construct with a simple map of message ID to response futures. The futures are returned already set up to clean themselves up. Additionally, when the connection is closed while a request is in flight, the returned exception now has a slightly improved error message (though it's still far from ideal). (cherry picked from commit 8713f4c)
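A minimal sketch of a self-cleaning message ID to future table, assuming `long` message IDs and `byte[]` payloads. `PendingResponses` and its methods are hypothetical and do not mirror the real client connection classes.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a per-connection table of in-flight requests. */
final class PendingResponses {
  private final Map<Long, CompletableFuture<byte[]>> pending = new ConcurrentHashMap<>();

  /** Registers a request; the returned future removes its own entry once completed. */
  CompletableFuture<byte[]> register(long messageId) {
    final CompletableFuture<byte[]> future = new CompletableFuture<>();
    pending.put(messageId, future);
    // Self-cleaning: success, failure, or timeout all drop the map entry.
    future.whenComplete((payload, error) -> pending.remove(messageId, future));
    return future;
  }

  /** Completes the matching request; responses for unknown IDs (e.g. already timed out) are dropped. */
  void onResponse(long messageId, byte[] payload) {
    final CompletableFuture<byte[]> future = pending.get(messageId);
    if (future != null) {
      future.complete(payload);
    }
  }

  /** Fails every in-flight request, e.g. when the connection closes mid-flight. */
  void failAll(Throwable cause) {
    // ConcurrentHashMap iterators tolerate the removals triggered by completing each future.
    for (CompletableFuture<byte[]> future : pending.values()) {
      future.completeExceptionally(cause);
    }
  }

  int inFlight() {
    return pending.size();
  }
}
```

Since cleanup is attached to the future itself, callers never need to remember to unregister a request, whichever path completes it.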

Ensures we report the channel close event by failing the future if it has not yet been completed. Previously, if the future was not completed (e.g. the channel was closed midway through the handshake), we were not notified of the channel closing, and the future remained open forever. (cherry picked from commit 3c2772f)
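The close handling can be sketched with plain `CompletableFuture`s. `SketchChannel` is a hypothetical stand-in for a Netty channel's close future, and failing with `ClosedChannelException` is an assumption rather than the exact exception the PR uses.

```java
import java.nio.channels.ClosedChannelException;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; a real implementation would hook the Netty channel's close future. */
final class CloseAwareHandshake {
  /** Minimal stand-in for a channel: only exposes "closed" as a future. */
  static final class SketchChannel {
    final CompletableFuture<Void> closeFuture = new CompletableFuture<>();

    void close() {
      closeFuture.complete(null);
    }
  }

  /** Creates a handshake future that cannot outlive its channel. */
  static CompletableFuture<Void> newHandshake(SketchChannel channel) {
    final CompletableFuture<Void> handshake = new CompletableFuture<>();
    // completeExceptionally is a no-op if the handshake already finished,
    // so only futures still pending when the channel closes are failed.
    channel.closeFuture.whenComplete(
        (ignored, error) -> handshake.completeExceptionally(new ClosedChannelException()));
    return handshake;
  }
}
```

Relying on `completeExceptionally` being a no-op for completed futures gives the "only if not yet completed" behavior without an explicit check.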

Zelldon

bors r+

The quiet and timeout periods used when shutting down the service are now configurable. As a result, the messaging service test now takes 10 seconds instead of 40, since previously every service instance had to wait at least 2 seconds to shut down, and the test spawns a couple of them. (cherry picked from commit 2a10b0f)
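Netty's `EventExecutorGroup.shutdownGracefully(long quietPeriod, long timeout, TimeUnit unit)` is the kind of call such periods would feed; its no-argument overload uses a 2 second quiet period and a 15 second timeout, which matches the minimum 2 second shutdown wait mentioned above. The hypothetical `ShutdownConfig` below only sketches how the two settings might be held and validated; it is not Zeebe's actual configuration class.

```java
import java.time.Duration;
import java.util.Objects;

/** Hypothetical shutdown settings for a Netty-based messaging service. */
final class ShutdownConfig {
  // Same values as Netty's defaults for the no-argument shutdownGracefully().
  static final ShutdownConfig DEFAULTS =
      new ShutdownConfig(Duration.ofSeconds(2), Duration.ofSeconds(15));

  private final Duration quietPeriod;
  private final Duration timeout;

  ShutdownConfig(Duration quietPeriod, Duration timeout) {
    this.quietPeriod = Objects.requireNonNull(quietPeriod, "quietPeriod");
    this.timeout = Objects.requireNonNull(timeout, "timeout");
    if (quietPeriod.isNegative()) {
      throw new IllegalArgumentException("quietPeriod must not be negative: " + quietPeriod);
    }
    // Netty rejects a timeout shorter than the quiet period, so fail early with a clear message.
    if (timeout.compareTo(quietPeriod) < 0) {
      throw new IllegalArgumentException(
          "timeout " + timeout + " must be >= quietPeriod " + quietPeriod);
    }
  }

  Duration quietPeriod() {
    return quietPeriod;
  }

  Duration timeout() {
    return timeout;
  }

  // Usage with Netty on the classpath (not compiled here):
  //   group.shutdownGracefully(
  //       cfg.quietPeriod().toMillis(), cfg.timeout().toMillis(), TimeUnit.MILLISECONDS);
}
```

Tests could then use a zero quiet period and a short timeout, while production keeps the defaults.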

ghost · 2021-06-28T07:35:21Z

Canceled.

npepinpe · 2021-06-28T07:35:31Z

Some linting/formatting issues 😅

npepinpe · 2021-06-28T07:35:38Z

bors retry

7383: [Backports stable/1.0] Ensure requests are always completed r=Zelldon a=npepinpe

## Description

This PR backports #7362. The only conflicts were with the PartitionCommandSenderImpl, where we still use the Atomix object directly in 1.0.x, but have replaced it with using the communication service directly in 1.1.x.

## Related issues

relates to #7360

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>

ghost · 2021-06-28T07:41:10Z

Build failed:

continuous-integration/jenkins/branch

Removes usages of sendAndReceive where the timeout is null. This means adding back the unicast method to the ClusterCommunicationService (so that cases where we don't care about the timeout can use unicast), and making sure previous callers now use the previous maximum timeout (5 seconds). We could consider making these timeouts configurable, but in general, when using a request-reply pattern, we should ALWAYS set an explicit timeout! (cherry picked from commit 1726194)
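A sketch of the "always set an explicit timeout" rule as a guard, assuming JDK 9+ for `CompletableFuture.orTimeout`. `RequestReply` and its `sendAndReceive` are hypothetical and do not reproduce the Atomix `ClusterCommunicationService` signatures; the 5 second default mirrors the value from the commit message.

```java
import java.time.Duration;
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical guard around a request-reply call; not the real Atomix API. */
final class RequestReply {
  // Former null-timeout callers fall back to the previous maximum of 5 seconds.
  static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(5);

  static <T> CompletableFuture<T> sendAndReceive(
      Supplier<CompletableFuture<T>> request, Duration timeout) {
    // Reject missing or non-positive timeouts up front instead of waiting forever.
    Objects.requireNonNull(timeout, "request-reply calls must set an explicit timeout");
    if (timeout.isZero() || timeout.isNegative()) {
      throw new IllegalArgumentException("timeout must be positive: " + timeout);
    }
    // orTimeout fails the future with a TimeoutException once the deadline passes.
    return request.get().orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS);
  }
}
```

Failing fast on a null timeout turns a silent "wait forever" into an immediate programming error at the call site.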

As the PartitionCommandSenderImpl doesn't await a reply, it now uses a reliable (TCP) unicast, i.e. it sends the message without waiting for a reply. This removes another usage of a "null" timeout, and generally makes more sense since we don't care about the reply here. (cherry picked from commit ac67e89)
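A minimal fire-and-forget sketch: the sender hands the message to a one-way transport and never awaits a reply, so no timeout is involved. `OneWayTransport`, `CommandSender`, and `RecordingTransport` are hypothetical names for illustration, not the real `PartitionCommandSenderImpl` or Atomix types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical one-way transport: the future tracks the send only, never a reply. */
interface OneWayTransport {
  CompletableFuture<Void> unicast(String subject, byte[] payload, String memberId);
}

/** Hypothetical fire-and-forget sender; no reply is awaited, so no timeout is needed. */
final class CommandSender {
  private final OneWayTransport transport;

  CommandSender(OneWayTransport transport) {
    this.transport = transport;
  }

  void sendCommand(String memberId, String subject, byte[] command) {
    transport
        .unicast(subject, command, memberId)
        // Only a failed send is reported; there is no response to wait for.
        .exceptionally(
            error -> {
              System.err.println("Failed to send " + subject + " to " + memberId + ": " + error);
              return null;
            });
  }
}

/** In-memory transport used to exercise the sketch. */
final class RecordingTransport implements OneWayTransport {
  final List<String> delivered = new ArrayList<>();

  @Override
  public CompletableFuture<Void> unicast(String subject, byte[] payload, String memberId) {
    delivered.add(memberId + ":" + subject);
    return CompletableFuture.completedFuture(null);
  }
}
```

Compared to request-reply, there is nothing left pending once the message is handed off, so a forgotten timeout cannot leave a future open.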

npepinpe · 2021-06-28T08:30:07Z

bors retry

ghost · 2021-06-28T08:56:48Z

Build succeeded:

continuous-integration/jenkins/branch

npepinpe added 3 commits June 28, 2021 09:20

Zelldon approved these changes Jun 28, 2021


npepinpe force-pushed the backport-7362-to-stable/1.0 branch from ea5e611 to b714573 on June 28, 2021 07:35

npepinpe added 2 commits June 28, 2021 10:13

npepinpe force-pushed the backport-7362-to-stable/1.0 branch from b714573 to 11e1532 on June 28, 2021 08:13

ghost merged commit 5a30f74 into stable/1.0 Jun 28, 2021

ghost deleted the backport-7362-to-stable/1.0 branch June 28, 2021 08:56

npepinpe added the Release: 1.0.2 label Jul 6, 2021

This pull request was closed.
