
Provide default service config for gateway_protocol.Gateway service #15230

Merged
merged 5 commits into main from np-default-service-config on Nov 21, 2023

Conversation

@npepinpe (Member)

Description

This PR creates a default service config for the gRPC Gateway service. It defines a per-call retry policy, with a default set of retryable calls, a maximum number of retry attempts, and an exponential backoff policy.

As an example, this configuration is also loaded as the default service config for the Java client. To use it with the Go client, you would simply pass it via the WithDefaultServiceConfig DialOption.

The default service config can be superseded by local user configuration or by a DNS-provided service config for the given gateway endpoint.
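For illustration, here is a minimal sketch of what such a gRPC service config looks like. The values and method grouping mirror the retry policy visible in the test output later in this thread, so treat it as an abridged approximation of the actual gateway-service-config.json, not its verbatim contents:

```json
{
  "methodConfig": [
    {
      "name": [
        { "service": "gateway_protocol.Gateway", "method": "ActivateJobs" },
        { "service": "gateway_protocol.Gateway", "method": "Topology" }
      ],
      "retryPolicy": {
        "maxAttempts": 5,
        "initialBackoff": "0.1s",
        "maxBackoff": "5s",
        "backoffMultiplier": 3,
        "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED"]
      }
    }
  ]
}
```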

This may deprecate the use of the backoff policy in the worker, but not necessarily: there, the policy throttles the polling logic, whereas the retry here happens on a per-call basis, so the two could be made to work in conjunction.

The nice thing about providing a default service config is that other clients could also use it in the future, as most gRPC libraries support it.
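As a rough sketch of what that looks like with grpc-java directly (this standalone example is hypothetical and not the Java client's actual wiring; the localhost:26500 address and the single catch-all method entry are assumptions):

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.List;
import java.util.Map;

public final class DefaultServiceConfigSketch {
  public static void main(final String[] args) {
    // grpc-java expects the service config as the Map/List/Double representation
    // of the parsed JSON, hence 5.0 and 3.0 rather than integers
    final Map<String, ?> retryPolicy =
        Map.of(
            "maxAttempts", 5.0,
            "initialBackoff", "0.1s",
            "maxBackoff", "5s",
            "backoffMultiplier", 3.0,
            "retryableStatusCodes", List.of("UNAVAILABLE", "RESOURCE_EXHAUSTED"));
    final Map<String, ?> serviceConfig =
        Map.of(
            "methodConfig",
            List.of(
                Map.of(
                    // omitting "method" applies the policy to all methods of the service
                    "name", List.of(Map.of("service", "gateway_protocol.Gateway")),
                    "retryPolicy", retryPolicy)));

    final ManagedChannel channel =
        ManagedChannelBuilder.forAddress("localhost", 26500)
            .usePlaintext()
            .defaultServiceConfig(serviceConfig) // used unless the resolver provides one
            .enableRetry() // make sure per-call retries are enabled on the channel
            .build();
    channel.shutdownNow();
  }
}
```

A name-resolver-provided service config for the target (e.g. via DNS), if present, takes precedence over the default set here, which matches the supersession behavior described above.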

I did not add tests for this, but I did confirm manually that this works.

One thing to note: where DEADLINE_EXCEEDED is listed as a retryable status code, this refers to DEADLINE_EXCEEDED returned by the server. If the client is the one cancelling with a timeout (which the client also observes as DEADLINE_EXCEEDED, though it's generated locally), no retry is performed, as it's assumed the user would not want to retry after the request timeout.
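To make the distinction concrete, here is a minimal sketch (assuming the generated GatewayGrpc/GatewayOuterClass stub classes; the 5-second deadline is an arbitrary example value). The deadline below is enforced client-side, so if it fires, the resulting DEADLINE_EXCEEDED is generated locally and is not retried:

```java
import io.camunda.zeebe.gateway.protocol.GatewayGrpc;
import io.camunda.zeebe.gateway.protocol.GatewayOuterClass.TopologyRequest;
import io.grpc.ManagedChannel;
import java.util.concurrent.TimeUnit;

final class DeadlineSketch {
  static void topologyWithDeadline(final ManagedChannel channel) {
    // If this client-side deadline fires, the call fails with a locally generated
    // DEADLINE_EXCEEDED and the retry policy does not kick in; only a
    // DEADLINE_EXCEEDED returned by the server counts as retryable
    GatewayGrpc.newBlockingStub(channel)
        .withDeadlineAfter(5, TimeUnit.SECONDS)
        .topology(TopologyRequest.getDefaultInstance());
  }
}
```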

Regarding the policy itself, I suspect the ZPA team will have a better idea of which codes are actually retryable and which aren't. In the future, it also means that whenever you add new calls/RPCs, you will need to think about adding them to the default config.

Related issues

Related to #14217

Definition of Done

Depending on the issue and the pull request, not all items need to be done.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug, then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; in case that fails, you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with future versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Other teams:
If the change impacts another team, an issue has been created for this team explaining what they need to do to support this change.

Please refer to our review guidelines.

@korthout (Member) left a comment

Thanks @npepinpe ❤️ I really like this idea

💭 I wonder what impact this has on the hardcoded retry mechanisms implemented in all of the clients

@npepinpe (Member Author)

> I wonder what impact this has on the hardcoded retry mechanisms implemented in all of the clients

What do you mean by this? Users who wrote retry behavior around commands? I guess it depends. I think all things considered, it makes more sense to keep it off by default, and let users figure out if it fits their use case. We can turn it on by default later down the line. We could keep it on for ourselves though (QA, E2E, benchmarks, etc.) to validate its usefulness in practice.

@npepinpe force-pushed the np-default-service-config branch 2 times, most recently from 2550059 to 91def78 on November 16, 2023, 14:22
@npepinpe (Member Author)

Alright, so I can explain the previously failing tests. With the default retry policy enabled, the default timeout would kick in: we retry the request for longer than the default client timeout, so the client timeout fires and closes the request, and you don't see the previously expected error. (With the policy's 0.1s initial backoff and 3x multiplier, the backoff alone across five attempts can add up to roughly 0.1 + 0.3 + 0.9 + 2.7 ≈ 4s, on top of the per-attempt request time.) If you set a longer timeout, then you see the expected error and the test passes.

Now that this is opt-in, all the tests will pass since retries are disabled in the tests. I think that's a strong case as to why we shouldn't enable it by default yet 👍

@npepinpe (Member Author)

I've updated it to make the whole thing an opt-in feature. I think we should enable it for ourselves, but as it changes the behavior (e.g. your request might time out client-side before the retries are exhausted), and it's not very flexible (you can't specify a custom retry policy per call), I think it's better to keep it as an optional feature for users.

@korthout (Member)

> I wonder what impact this has on the hardcoded retry mechanisms implemented in all of the clients

> What do you mean by this? Users who wrote retry behavior around commands? I guess it depends. I think all things considered, it makes more sense to keep it off by default, and let users figure out if it fits their use case. We can turn it on by default later down the line. We could keep it on for ourselves though (QA, E2E, benchmarks, etc.) to validate its usefulness in practice.

I was referring to the hardcoded backoffs in the Java, Spring, etc. clients.

Let's say that the request to ActivateJobs (by a job worker's job poller) fails. This config will transparently retry a few times with exponential backoff. Will there be a time when the actual error is reported to the job poller? And if users make this same call explicitly, will they receive the actual error at some point?

@npepinpe (Member Author) commented Nov 16, 2023

> Let's say that the request to ActivateJobs (by a job worker's job poller) fails. This config will transparently retry a few times with exponential backoff. Will there be a time when the actual error is reported to the job poller?

Yes, two possible ways:

  1. The maximum number of retry attempts is reached (5 right now)
  2. The client request timeout (specified in JobWorkerBuilderStep3#requestTimeout) is reached

In the first case, you'll get the last error. In the second case, you'll get a DEADLINE_EXCEEDED. Enabling debug/trace logging for gRPC will show you more context about each request.
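For reference, a minimal sketch of where that request timeout is configured on a Java client job worker (the job type and handler below are hypothetical placeholders):

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

final class WorkerRequestTimeoutSketch {
  public static void main(final String[] args) {
    try (final ZeebeClient client = ZeebeClient.newClientBuilder().usePlaintext().build()) {
      client
          .newWorker()
          .jobType("payment") // hypothetical job type
          .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send())
          // JobWorkerBuilderStep3#requestTimeout: once this is reached, the poller
          // observes DEADLINE_EXCEEDED even if retry attempts remain
          .requestTimeout(Duration.ofSeconds(10))
          .open();
    }
  }
}
```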

@korthout (Member) left a comment

👍 I like it!

Sorry for the late review 🙇

🔧 / ❓ Please have a look at my suggestions and questions. Feel free to merge when you feel this is ready.

Comment on lines 62 to 64
```java
String.format(
    "a service config with a retry policy for method '%s'", method.getFullMethodName()),
config -> " but no such retry policy was found");
```
@korthout (Member)

👍 I like the test

🔧 While you and I understand the need for this now, it might get lost over time, leading to confusion when this test fails. Could we add some additional context to this message describing what this is, or why all methods need to have a retry policy?

🔧 It would also be very helpful to describe how to resolve this, i.e. where this policy should be defined.

@npepinpe (Member Author)

New error message:

```
Expecting actual:
  PartialServiceConfig{methodConfig=[
    PartialMethodConfig{
      name=[MethodName{service='gateway_protocol.Gateway', method='ActivateJobs'},
            MethodName{service='gateway_protocol.Gateway', method='CancelProcessInstance'},
            MethodName{service='gateway_protocol.Gateway', method='CompleteJob'},
            MethodName{service='gateway_protocol.Gateway', method='DeleteResource'},
            MethodName{service='gateway_protocol.Gateway', method='EvaluateDecision'},
            MethodName{service='gateway_protocol.Gateway', method='FailJob'},
            MethodName{service='gateway_protocol.Gateway', method='ResolveIncident'},
            MethodName{service='gateway_protocol.Gateway', method='SetVariables'},
            MethodName{service='gateway_protocol.Gateway', method='StreamActivatedJobs'},
            MethodName{service='gateway_protocol.Gateway', method='Topology'}],
      retryPolicy={maxAttempts=5.0, initialBackoff=0.1s, maxBackoff=5s, backoffMultiplier=3.0,
                   retryableStatusCodes=[UNAVAILABLE, RESOURCE_EXHAUSTED, DEADLINE_EXCEEDED]}},
    PartialMethodConfig{
      name=[MethodName{service='gateway_protocol.Gateway', method='BroadcastSignal'},
            MethodName{service='gateway_protocol.Gateway', method='CreateProcessInstance'},
            MethodName{service='gateway_protocol.Gateway', method='CreateProcessInstanceWithResult'},
            MethodName{service='gateway_protocol.Gateway', method='DeployProcess'},
            MethodName{service='gateway_protocol.Gateway', method='DeployResource'},
            MethodName{service='gateway_protocol.Gateway', method='ModifyProcessInstance'},
            MethodName{service='gateway_protocol.Gateway', method='PublishMessage'},
            MethodName{service='gateway_protocol.Gateway', method='ThrowError'},
            MethodName{service='gateway_protocol.Gateway', method='UpdateJobRetries'},
            MethodName{service='gateway_protocol.Gateway', method='UpdateJobTimeout'}],
      retryPolicy={maxAttempts=5.0, initialBackoff=0.1s, maxBackoff=5s, backoffMultiplier=3.0,
                   retryableStatusCodes=[UNAVAILABLE, RESOURCE_EXHAUSTED]}}]}
to have a service config with a retry policy for method 'gateway_protocol.Gateway/ActivateJobs' defined
but no such retry policy was found in
'file:/home/nicolas/src/github.com/camunda-cloud/zeebe/gateway-protocol-impl/target/classes/gateway-service-config.json'
```

(obviously the path will change based on your local machine ;))

@npepinpe (Member Author)

Docs follow-up issue: camunda/camunda-docs#2939

@npepinpe (Member Author)

bors merge

@zeebe-bors-camunda (Contributor)

Build succeeded:

@zeebe-bors-camunda bot merged commit 3fc2d8d into main on Nov 21, 2023
30 of 32 checks passed
@zeebe-bors-camunda bot deleted the np-default-service-config branch on November 21, 2023, 13:16
@korthout linked an issue on Nov 22, 2023 that may be closed by this pull request: "As a user I can retry commands on retryable failed responses"