Handle api throttling #301

Closed
nhomble opened this issue Mar 1, 2023 · 4 comments · Fixed by #306

nhomble commented Mar 1, 2023

The issue tracker is ONLY used for reporting bugs. New features and questions should be discussed on our Slack channel.

Expected Behavior

An API client should back off appropriately in response to HTTP 429s. In particular, when clients are calling Camunda Cloud services such as Cloud Identity, they should throttle themselves better.

Current Behavior

Workers keep polling and logging 429s, which prolongs the outage.

Possible Solution

In the case of a 429, check for a Retry-After header (or fall back to the current strategy) and sleep for that amount plus jitter.

Additionally, this static wait could be enhanced with exponential backoff.
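A rough sketch of that combined strategy, assuming a plain fetch call and a hypothetical getWithThrottle helper (neither is an existing API in this library):

```typescript
// Hypothetical sketch (not part of this library): honour Retry-After on 429,
// otherwise fall back to exponential backoff, and add jitter either way.
async function getWithThrottle(url: string, maxRetries = 5): Promise<Response> {
  let fallbackMs = 1_000
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url)
    if (res.status !== 429 || attempt >= maxRetries) {
      return res
    }
    // Prefer the server-provided Retry-After (in seconds) when present and numeric.
    const retryAfterSec = Number(res.headers.get('retry-after'))
    const baseMs = Number.isFinite(retryAfterSec) && retryAfterSec > 0
      ? retryAfterSec * 1000
      : fallbackMs
    const jitterMs = Math.random() * 1_000
    await new Promise(resolve => setTimeout(resolve, baseMs + jitterMs))
    fallbackMs = Math.min(fallbackMs * 2, 60_000) // back off further on repeated 429s
  }
}
```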

Steps to Reproduce

  1. Provision OAuth credentials to a set of workers
  2. Invalidate those credentials

Context (Environment)

New workers with valid credentials are also throttled by Camunda Identity.

This causes an outage, since no workers can make progress.

In our case, we see the 429s coming from Identity.

Detailed Description

Possible Implementation

nhomble commented Mar 5, 2023

This seems to be solved in the Java worker by adding a backoff on a failed poll:

https://github.com/camunda/zeebe/blob/main/clients/java/src/main/java/io/camunda/zeebe/client/impl/worker/JobWorkerImpl.java#L196

Without getting too deep into the 429 weeds at a first pass, adding backoff logic in general would be a good next step before over-indexing on just the 429 use case.
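For illustration, the equivalent shape in a Node polling loop could look roughly like this; pollJobs and the loop below are hypothetical stand-ins, not the library's actual worker code:

```typescript
// Hypothetical polling loop: back off after a failed poll, reset after a successful one.
async function pollLoop(
  pollJobs: () => Promise<void>, // stand-in for the worker's job activation call
  minBackoffMs = 1_000,
  maxBackoffMs = 60_000
): Promise<void> {
  let backoffMs = minBackoffMs
  while (true) {
    try {
      await pollJobs()
      backoffMs = minBackoffMs // successful poll: reset the backoff
    } catch (err) {
      console.warn(`Poll failed, retrying in ${backoffMs}ms`, err)
      await new Promise(resolve => setTimeout(resolve, backoffMs))
      backoffMs = Math.min(backoffMs * 2, maxBackoffMs) // exponential backoff on repeated failures
    }
  }
}
```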

nhomble commented Mar 6, 2023

It looks like the library is trying to handle this with this change/issue, but I am seeing error code 13 in my logs.

From the gRPC docs, I see the library is trying to do the right thing, but I suspect the error code is just not mapped correctly.
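For reference, in the standard gRPC status codes RESOURCE_EXHAUSTED (the mapping for HTTP 429) is code 8 and INTERNAL is code 13, so a retry predicate keyed on the code would never treat a mis-mapped 429 as retryable. A hypothetical check (not the library's code):

```typescript
// Hypothetical retry predicate based on the standard gRPC status codes.
// 8 = RESOURCE_EXHAUSTED (the mapping for HTTP 429), 14 = UNAVAILABLE, 13 = INTERNAL.
const RETRYABLE_GRPC_CODES = new Set([8, 14])

function isRetryableGrpcError(err: { code?: number }): boolean {
  // If 429s surface as code 13 (INTERNAL) instead of 8, a check like this
  // would not recognise them as retryable.
  return err.code !== undefined && RETRYABLE_GRPC_CODES.has(err.code)
}
```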

nhomble commented Mar 6, 2023

@jwulf does this analysis make sense? This looks like an old feature, so I am surprised that I am seeing this. I'm currently on 8.1.2 but nothing in the changelog tells me that 8.1.5 changes this behavior.

jwulf self-assigned this Mar 8, 2023

jwulf commented Mar 21, 2023

I’ve published 8.1.7-debug.

It does a few things.

  1. It adds debugging output using the debug library, so you can run your application with the env var DEBUG='oauth' to get debugging output from the OAuthProvider.
  2. The OAuthProvider getToken method returns an in-flight request promise, so only one request to the token endpoint will be made per client. Any parallel requests for a token by other operations while the first request is in-flight will use the same promise.
  3. It uses exponential backoff for retries when the token endpoint returns a failure: 1, 2, 4, 8, 16, 32, then 60 seconds (see the sketch at the end of this comment).

The last two items may solve the issue you are seeing.

If not, please run your application with DEBUG='oauth' and we'll examine the output.
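For illustration only, a simplified stand-in for the behaviour described in items 2 and 3, not the actual OAuthProvider implementation:

```typescript
// Simplified stand-in: share one in-flight token request between parallel callers
// and back off exponentially when the token endpoint fails. Token caching is omitted.
class TokenProvider {
  private inflight?: Promise<string>

  constructor(private fetchToken: () => Promise<string>) {}

  getToken(): Promise<string> {
    // Parallel callers reuse the same pending request instead of hitting the endpoint again.
    if (!this.inflight) {
      this.inflight = this.requestWithBackoff().finally(() => {
        this.inflight = undefined
      })
    }
    return this.inflight
  }

  private async requestWithBackoff(): Promise<string> {
    // Wait 1, 2, 4, 8, 16, 32, then 60 seconds between failed attempts.
    const delaysSec = [1, 2, 4, 8, 16, 32, 60]
    for (let attempt = 0; ; attempt++) {
      try {
        return await this.fetchToken()
      } catch (err) {
        const delaySec = delaysSec[Math.min(attempt, delaysSec.length - 1)]
        console.warn(`Token request failed, retrying in ${delaySec}s`, err)
        await new Promise(resolve => setTimeout(resolve, delaySec * 1000))
      }
    }
  }
}
```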
