
Provide default service config for gateway_protocol.Gateway service #15230

Merged
merged 5 commits into main from np-default-service-config on Nov 21, 2023

Conversation

@npepinpe (Member)

Description

This PR creates a default service config for the gRPC Gateway service. It defines a per-call retry policy, with a default set of retryable calls, a maximum number of retry attempts, and an exponential backoff policy.

As an example, this configuration is also loaded as the default service config for the Java client. To use it with the Go client, you would simply pass it via the WithDefaultServiceConfig DialOption.

The default service config can be superseded by local user configuration or by a DNS-provided service config for the given gateway endpoint.
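For illustration, here is a minimal sketch of what such a gRPC service config looks like. The values and method grouping mirror the retry policy visible in the test output later in this thread, so treat it as an abridged approximation of the actual gateway-service-config.json, not its verbatim contents:

```json
{
  "methodConfig": [
    {
      "name": [
        { "service": "gateway_protocol.Gateway", "method": "ActivateJobs" },
        { "service": "gateway_protocol.Gateway", "method": "Topology" }
      ],
      "retryPolicy": {
        "maxAttempts": 5,
        "initialBackoff": "0.1s",
        "maxBackoff": "5s",
        "backoffMultiplier": 3,
        "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED"]
      }
    }
  ]
}
```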

This may deprecate the use of the backoff policy in the worker, but not necessarily: there, the policy throttles the polling logic, whereas the retry here happens on a per-call basis, so the two could be made to work in conjunction.

The nice thing about providing a default service config is that other clients could also use it in the future, as most gRPC libraries support it.
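As a rough sketch of what that looks like with grpc-java directly (this standalone example is hypothetical and not the Java client's actual wiring; the localhost:26500 address and the single catch-all method entry are assumptions):

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.List;
import java.util.Map;

public final class DefaultServiceConfigSketch {
  public static void main(final String[] args) {
    // grpc-java expects the service config as the Map/List/Double representation
    // of the parsed JSON, hence 5.0 and 3.0 rather than integers
    final Map<String, ?> retryPolicy =
        Map.of(
            "maxAttempts", 5.0,
            "initialBackoff", "0.1s",
            "maxBackoff", "5s",
            "backoffMultiplier", 3.0,
            "retryableStatusCodes", List.of("UNAVAILABLE", "RESOURCE_EXHAUSTED"));
    final Map<String, ?> serviceConfig =
        Map.of(
            "methodConfig",
            List.of(
                Map.of(
                    // omitting "method" applies the policy to all methods of the service
                    "name", List.of(Map.of("service", "gateway_protocol.Gateway")),
                    "retryPolicy", retryPolicy)));

    final ManagedChannel channel =
        ManagedChannelBuilder.forAddress("localhost", 26500)
            .usePlaintext()
            .defaultServiceConfig(serviceConfig) // used unless the resolver provides one
            .enableRetry() // make sure per-call retries are enabled on the channel
            .build();
    channel.shutdownNow();
  }
}
```

A name-resolver-provided service config for the target (e.g. via DNS), if present, takes precedence over the default set here, which matches the supersession behavior described above.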

I did not add tests for this, but I did confirm manually that this works.

One thing to note: where DEADLINE_EXCEEDED is listed as a retryable status code, this refers to DEADLINE_EXCEEDED returned by the server. If the client is the one cancelling with a timeout (which the client also observes as DEADLINE_EXCEEDED, though it's generated locally), no retry is performed, as it's assumed the user would not want to retry after the request timeout.
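To make the distinction concrete, here is a minimal sketch (assuming the generated GatewayGrpc/GatewayOuterClass stub classes; the 5-second deadline is an arbitrary example value). The deadline below is enforced client-side, so if it fires, the resulting DEADLINE_EXCEEDED is generated locally and is not retried:

```java
import io.camunda.zeebe.gateway.protocol.GatewayGrpc;
import io.camunda.zeebe.gateway.protocol.GatewayOuterClass.TopologyRequest;
import io.grpc.ManagedChannel;
import java.util.concurrent.TimeUnit;

final class DeadlineSketch {
  static void topologyWithDeadline(final ManagedChannel channel) {
    // If this client-side deadline fires, the call fails with a locally generated
    // DEADLINE_EXCEEDED and the retry policy does not kick in; only a
    // DEADLINE_EXCEEDED returned by the server counts as retryable
    GatewayGrpc.newBlockingStub(channel)
        .withDeadlineAfter(5, TimeUnit.SECONDS)
        .topology(TopologyRequest.getDefaultInstance());
  }
}
```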

Regarding the policy itself, I suspect the ZPA team will have a better idea of which codes are actually retryable and which aren't. In the future, it also means that whenever you add new calls/RPCs, you will need to think about adding them to the default config.

Related issues

Related to #14217

Definition of Done

Depending on the issue and the pull request, not all items need to be done.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug, then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; in case that fails, you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with future versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Other teams:
If the change impacts another team, an issue has been created for this team explaining what they need to do to support this change.

Please refer to our review guidelines.

@korthout (Member) left a comment

Thanks @npepinpe ❤️ I really like this idea

💭 I wonder what impact this has on the hardcoded retry mechanisms implemented in all of the clients

@npepinpe (Member Author)

> I wonder what impact this has on the hardcoded retry mechanisms implemented in all of the clients

What do you mean by this? Users who wrote retry behavior around commands? I guess it depends. I think all things considered, it makes more sense to keep it off by default, and let users figure out if it fits their use case. We can turn it on by default later down the line. We could keep it on for ourselves though (QA, E2E, benchmarks, etc.) to validate its usefulness in practice.

@npepinpe force-pushed the np-default-service-config branch 2 times, most recently from 2550059 to 91def78 on November 16, 2023, 14:22
@npepinpe (Member Author)

Alright, so I can explain the previously failing tests. With the default retry policy enabled, the default timeout would kick in: we retry the request for longer than the default client timeout, so the client timeout fires and closes the request, and you don't see the previously expected error. (With the policy's 0.1s initial backoff and 3x multiplier, the backoff alone across five attempts can add up to roughly 0.1 + 0.3 + 0.9 + 2.7 ≈ 4s, on top of the per-attempt request time.) If you set a longer timeout, then you see the expected error and the test passes.

Now that this is opt-in, all the tests will pass since retries are disabled in the tests. I think that's a strong case as to why we shouldn't enable it by default yet 👍

@npepinpe (Member Author)

I've updated it to make the whole thing an opt-in feature. I think we should enable it for ourselves, but as it changes the behavior (e.g. your request might time out client-side before the retries are exhausted), and it's not very flexible (you can't specify a custom retry policy per call), I think it's better to keep it as an optional feature for users.

@korthout (Member)

> I wonder what impact this has on the hardcoded retry mechanisms implemented in all of the clients

> What do you mean by this? Users who wrote retry behavior around commands? I guess it depends. I think all things considered, it makes more sense to keep it off by default, and let users figure out if it fits their use case. We can turn it on by default later down the line. We could keep it on for ourselves though (QA, E2E, benchmarks, etc.) to validate its usefulness in practice.

I was referring to the hardcoded backoffs in the Java, Spring, etc. clients.

Let's say that the request to ActivateJobs (by a job worker's job poller) fails. This config will transparently retry a few times with exponential backoff. Will there be a time when the actual error is reported to the job poller? And if users make this same call explicitly, will they receive the actual error at some point?

@npepinpe (Member Author) commented Nov 16, 2023

> Let's say that the request to ActivateJobs (by a job worker's job poller) fails. This config will transparently retry a few times with exponential backoff. Will there be a time when the actual error is reported to the job poller?

Yes, two possible ways:

  1. The maximum number of retry attempts is reached (5 right now)
  2. The client request timeout (specified in JobWorkerBuilderStep3#requestTimeout) is reached

In the first case, you'll get the last error. In the second case, you'll get a DEADLINE_EXCEEDED. Enabling debug/trace logging for gRPC will show you more context about each request.
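For reference, a minimal sketch of where that request timeout is configured on a Java client job worker (the job type and handler below are hypothetical placeholders):

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

final class WorkerRequestTimeoutSketch {
  public static void main(final String[] args) {
    try (final ZeebeClient client = ZeebeClient.newClientBuilder().usePlaintext().build()) {
      client
          .newWorker()
          .jobType("payment") // hypothetical job type
          .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send())
          // JobWorkerBuilderStep3#requestTimeout: once this is reached, the poller
          // observes DEADLINE_EXCEEDED even if retry attempts remain
          .requestTimeout(Duration.ofSeconds(10))
          .open();
    }
  }
}
```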

@korthout (Member) left a comment

👍 I like it!

Sorry for the late review 🙇

🔧 / ❓ Please have a look at my suggestions and questions. Feel free to merge when you feel this is ready.

Comment on lines 62 to 64
```java
String.format(
    "a service config with a retry policy for method '%s'", method.getFullMethodName()),
config -> " but no such retry policy was found");
```
@korthout (Member)

👍 I like the test

🔧 While you and I understand the need for this now, it might get lost over time, leading to confusion when this test fails. Could we add some additional context to this message describing what this is, or why all methods need to have a retry policy?

🔧 It would also be very helpful to describe how to resolve this, i.e. where this policy should be defined.

@npepinpe (Member Author)

New error message:

```
Expecting actual:
  PartialServiceConfig{methodConfig=[
    PartialMethodConfig{
      name=[MethodName{service='gateway_protocol.Gateway', method='ActivateJobs'},
            MethodName{service='gateway_protocol.Gateway', method='CancelProcessInstance'},
            MethodName{service='gateway_protocol.Gateway', method='CompleteJob'},
            MethodName{service='gateway_protocol.Gateway', method='DeleteResource'},
            MethodName{service='gateway_protocol.Gateway', method='EvaluateDecision'},
            MethodName{service='gateway_protocol.Gateway', method='FailJob'},
            MethodName{service='gateway_protocol.Gateway', method='ResolveIncident'},
            MethodName{service='gateway_protocol.Gateway', method='SetVariables'},
            MethodName{service='gateway_protocol.Gateway', method='StreamActivatedJobs'},
            MethodName{service='gateway_protocol.Gateway', method='Topology'}],
      retryPolicy={maxAttempts=5.0, initialBackoff=0.1s, maxBackoff=5s, backoffMultiplier=3.0,
                   retryableStatusCodes=[UNAVAILABLE, RESOURCE_EXHAUSTED, DEADLINE_EXCEEDED]}},
    PartialMethodConfig{
      name=[MethodName{service='gateway_protocol.Gateway', method='BroadcastSignal'},
            MethodName{service='gateway_protocol.Gateway', method='CreateProcessInstance'},
            MethodName{service='gateway_protocol.Gateway', method='CreateProcessInstanceWithResult'},
            MethodName{service='gateway_protocol.Gateway', method='DeployProcess'},
            MethodName{service='gateway_protocol.Gateway', method='DeployResource'},
            MethodName{service='gateway_protocol.Gateway', method='ModifyProcessInstance'},
            MethodName{service='gateway_protocol.Gateway', method='PublishMessage'},
            MethodName{service='gateway_protocol.Gateway', method='ThrowError'},
            MethodName{service='gateway_protocol.Gateway', method='UpdateJobRetries'},
            MethodName{service='gateway_protocol.Gateway', method='UpdateJobTimeout'}],
      retryPolicy={maxAttempts=5.0, initialBackoff=0.1s, maxBackoff=5s, backoffMultiplier=3.0,
                   retryableStatusCodes=[UNAVAILABLE, RESOURCE_EXHAUSTED]}}]}
to have a service config with a retry policy for method 'gateway_protocol.Gateway/ActivateJobs' defined
but no such retry policy was found in
'file:/home/nicolas/src/github.com/camunda-cloud/zeebe/gateway-protocol-impl/target/classes/gateway-service-config.json'
```

(obviously the path will change based on your local machine ;))

@npepinpe (Member Author)

Docs follow-up issue: camunda/camunda-docs#2939

@npepinpe (Member Author)

bors merge

@zeebe-bors-camunda (Contributor)

Build succeeded:

@zeebe-bors-camunda bot merged commit 3fc2d8d into main on Nov 21, 2023
30 of 32 checks passed
@zeebe-bors-camunda bot deleted the np-default-service-config branch on November 21, 2023, 13:16
@korthout linked an issue on Nov 22, 2023 that may be closed by this pull request: "As a user I can retry commands on retryable failed responses"