-
Notifications
You must be signed in to change notification settings - Fork 557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide default service config for gateway_protocol.Gateway service #15230
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @npepinpe ❤️ I really like this idea
💭 I wonder what impact this makes on the hardcoded retry mechanisms implemented in all of the clients
gateway-protocol-impl/src/main/resources/gateway-service-config.json
Outdated
Show resolved
Hide resolved
clients/java/src/main/java/io/camunda/zeebe/client/impl/ZeebeClientImpl.java
Outdated
Show resolved
Hide resolved
clients/java/src/main/java/io/camunda/zeebe/client/impl/ZeebeClientImpl.java
Outdated
Show resolved
Hide resolved
What do you mean by this? Users who wrote retry behavior around commands? I guess it depends. I think all things considered, it makes more sense to keep it off by default, and let users figure out if it fits their use case. We can turn it on by default later down the line. We could keep it on for ourselves though (QA, E2E, benchmarks, etc.) to validate its usefulness in practice. |
2550059
to
91def78
Compare
Alright, so I can explain the previously failing tests. With the default retry policy enabled, what happened is the default time out would kick in. Basically, we retry the request longer than the default client time out, resulting in the client time out kicking in and closing the request, so you don't see the previously expected error. If you set a longer time out, then you will see the expected error and the test pass. Now that this is opt-in, all the test will pass since retry will be disabled in the tests. I think that's a strong case as to why we shouldn't enable it by default yet 👍 |
I've updated it to make the whole thing an opt-in feature. I think we should enable it for ourselves, but as it changes the behavior (e.g. your request might time out client-side before the retries are up), and it's not very flexible (you can't specify a custom retry policy per call), I think it's better to keep it as an optional feature for users. |
I was referring to the hardcoded backoffs in the java, spring, etc, clients. Let's say that the request to |
Yes, two possible ways:
In the first case, you'll get the last error. In the second case, you'll get a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I like it!
Sorry for the late review 🙇
🔧 / ❓ Please have a look at my suggestions and questions. Feel free to merge when you feel this is ready.
String.format( | ||
"a service config with a retry policy for method '%s'", method.getFullMethodName()), | ||
config -> " but no such retry policy was found"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I like the test
🔧 While you and I understand the need for this now, it might get lost over time leading to confusion when this test fails. Could we add some additional context to this message describing what this is or why it's needed that all methods have a retry policy?
🔧 It would also be very helpful to describe how to resolve this, so where should this policy be defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New error message:
Expecting actual:
PartialServiceConfig{methodConfig=[PartialMethodConfig{name=[MethodName{service='gateway_protocol.Gateway', method='ActivateJobs'}, MethodName{service='gateway_protocol.Gateway', method='CancelProcessInstance'}, MethodName{service='gateway_protocol.Gateway', method='CompleteJob'}, MethodName{service='gateway_protocol.Gateway', method='DeleteResource'}, MethodName{service='gateway_protocol.Gateway', method='EvaluateDecision'}, MethodName{service='gateway_protocol.Gateway', method='FailJob'}, MethodName{service='gateway_protocol.Gateway', method='ResolveIncident'}, MethodName{service='gateway_protocol.Gateway', method='SetVariables'}, MethodName{service='gateway_protocol.Gateway', method='StreamActivatedJobs'}, MethodName{service='gateway_protocol.Gateway', method='Topology'}], retryPolicy={maxAttempts=5.0, initialBackoff=0.1s, maxBackoff=5s, backoffMultiplier=3.0, retryableStatusCodes=[UNAVAILABLE, RESOURCE_EXHAUSTED, DEADLINE_EXCEEDED]}}, PartialMethodConfig{name=[MethodName{service='gateway_protocol.Gateway', method='BroadcastSignal'}, MethodName{service='gateway_protocol.Gateway', method='CreateProcessInstance'}, MethodName{service='gateway_protocol.Gateway', method='CreateProcessInstanceWithResult'}, MethodName{service='gateway_protocol.Gateway', method='DeployProcess'}, MethodName{service='gateway_protocol.Gateway', method='DeployResource'}, MethodName{service='gateway_protocol.Gateway', method='ModifyProcessInstance'}, MethodName{service='gateway_protocol.Gateway', method='PublishMessage'}, MethodName{service='gateway_protocol.Gateway', method='ThrowError'}, MethodName{service='gateway_protocol.Gateway', method='UpdateJobRetries'}, MethodName{service='gateway_protocol.Gateway', method='UpdateJobTimeout'}], retryPolicy={maxAttempts=5.0, initialBackoff=0.1s, maxBackoff=5s, backoffMultiplier=3.0, retryableStatusCodes=[UNAVAILABLE, RESOURCE_EXHAUSTED]}}]}
to have a service config with a retry policy for method 'gateway_protocol.Gateway/ActivateJobs' defined but no such retry policy was found in 'file:/home/nicolas/src/github.com/camunda-cloud/zeebe/gateway-protocol-impl/target/classes/gateway-service-config.json'
(obviously the path will change based on your local machine ;))
clients/java/src/main/java/io/camunda/zeebe/client/ClientProperties.java
Show resolved
Hide resolved
clients/java/src/main/java/io/camunda/zeebe/client/ZeebeClientBuilder.java
Outdated
Show resolved
Hide resolved
4ef93d4
to
4b20e9d
Compare
4b20e9d
to
98eb89e
Compare
Docs follow issue: camunda/camunda-docs#2939 |
bors merge |
Build succeeded: |
Description
This PR creates a default service config for the gRPC Gateway service. It defines a retry policy per call, with default retry-able calls, a max number of retries, as well as a back off policy.
As an example, this configuration is also loaded as the default service config for the Java client. To use with the Go client, you would simply use the WithDefaultServiceConfig
DialOption
.The default service config can be superseded by local user configuration, or a DNS provided service config for the given gateway endpoint.
This may deprecate the use of the back off policy in the worker, but not necessarily. There, the policy throttles the polling logic, whereas the retry here is on a per call basis. So they could be made to work in conjunction.
The nice thing about providing a default service config is that other clients could also use it in the future, as most gRPC libraries provide support for it.
I did not add tests for this, but I did confirm manually that this works.
One thing to note: when
DEADLINE_EXCEEDED
is added as a retry-able call, this refers toDEADLINE_EXCEEDED
returned by the server. In case the client is the one cancelling with a timeout (which the client will observe asDEADLINE_EXCEEDED
, though it's generated locally), no retry is performed, as it's assumed the user would not want to retry after the request timeout.Regarding the policy itself, I suspect the ZPA team will have a better idea of what codes are actually retry-able and what aren't. In the future, it also means that whenever you add new calls/RPCs, you would need to think about adding them to the default config.
Related issues
related to #14217
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
backport stable/1.3
) to the PR, in case that fails you need to create backports manually.Testing:
Documentation:
Other teams:
If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.
Please refer to our review guidelines.