Gateway can easily overwhelm broker with handling blocking jobs (long polling) #7659

Zelldon · 2021-08-20T07:33:28Z

Describe the bug

The gateway can overwhelm the broker with activation requests if many workers are waiting for jobs.

I can reproduce that with a normal benchmark where I created 100 workers and then created a single instance with one task.

Did we know that we have metrics about blocked requests?:

The issue is here https://github.com/camunda-cloud/zeebe/blob/e0c789dfe77f3d210ed4dd06c3177db145369794/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/LongPollingActivateJobsHandler.java#L198-L202
when we get the Broker notification:

  private void resetFailedAttemptsAndHandlePendingRequests(final String jobType) {
    final InFlightLongPollingActivateJobsRequestsState state = getJobTypeState(jobType);

    state.resetFailedAttempts();

    final Queue<LongPollingActivateJobsRequest> pendingRequests = state.getPendingRequests();

    if (!pendingRequests.isEmpty()) {
      pendingRequests.forEach(
          nextPendingRequest -> {
            LOG.trace("Unblocking ActivateJobsRequest {}", nextPendingRequest.getRequest());
            activateJobs(nextPendingRequest);
          });
    } else {
      if (!state.hasActiveRequests()) {
        jobTypeState.remove(jobType);
      }
    }
  }

To Reproduce

Run multiple workers, wait, and then start an instance; you can observe multiple activations.

Expected behavior

The gateway sends one activation request per notification, not 100.

Environment:

OS: k8
Zeebe Version: 1.1.2
Configuration: benchmark

npepinpe · 2021-08-20T07:51:33Z

You're right. Generally I'd like to rewrite the long polling approach. I was hoping to wait until we switched to gRPC transport also for internal communication, but we could already do it now. Can you elaborate on the impact here?

Zelldon · 2021-08-20T08:21:53Z

Regarding impact:

If you have multiple workers waiting for a task and create a new process instance, or reach a point where a new task becomes available, then the gateway will write activation commands for all waiting workers. This can overwhelm the broker: backpressure spikes and we drop more requests than usual. This can affect other things like execution of instances, creation, etc.

npepinpe · 2021-08-23T07:35:32Z

I think this will take some time to fix, and as such we should plan it as a KR. I don't see any quick fix or anything we can do in between other topics, so I will leave it in the backlog until we do the planning for Q4.

romansmirnov · 2021-12-14T15:38:31Z

@Zelldon, just want to share a thought here.

In a scenario with a replicated standalone gateway, this issue might become more obvious. Meaning, if the gateway is replicated 3 times and each gateway receives 100 activate jobs requests but no jobs are available right now, then there are in total 300 pending activate jobs requests. Now, the broker creates one job and notifies all gateways, which results in 300 requests to the broker, but only one request will actually activate a job.
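
The arithmetic in this scenario can be sketched as a small simulation. This is only an illustration under stated assumptions: the class and method names are hypothetical (not Zeebe classes), every pending request is forwarded on a notification, and each forwarded request activates at most one job.

```java
// Hypothetical sketch; none of these names exist in Zeebe.
final class NotificationFanOutSketch {

  /** Outcome of one broker notification. */
  static final class FanOut {
    final int requestsSent;    // requests forwarded to the broker
    final int requestsWithJob; // requests that actually activated a job

    FanOut(int requestsSent, int requestsWithJob) {
      this.requestsSent = requestsSent;
      this.requestsWithJob = requestsWithJob;
    }
  }

  /**
   * Every gateway forwards all of its pending requests when notified.
   * Assumption: one forwarded request takes at most one available job.
   */
  static FanOut onNotification(int gateways, int pendingPerGateway, int availableJobs) {
    int sent = 0;
    int withJob = 0;
    int remainingJobs = availableJobs;
    for (int g = 0; g < gateways; g++) {
      for (int r = 0; r < pendingPerGateway; r++) {
        sent++;
        if (remainingJobs > 0) {
          remainingJobs--;
          withJob++;
        }
      }
    }
    return new FanOut(sent, withJob);
  }

  public static void main(String[] args) {
    FanOut result = onNotification(3, 100, 1);
    System.out.println(
        result.requestsSent + " requests sent, " + result.requestsWithJob + " activated a job");
  }
}
```

With 3 gateways, 100 pending requests each, and one new job, the sketch counts 300 forwarded requests of which a single one activates the job.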

npepinpe · 2021-12-14T15:44:43Z

Regarding this scenario, would it make sense to aggregate requests for the same job type in each gateway, and queue them? In that case, only one request for this job type would be sent, and the gateway would return the jobs to the client at the front of the queue. Of course this adds some complexity in the gateway. I would try to avoid coordinating the gateways together as much as possible though.

romansmirnov · 2021-12-14T20:30:39Z

Actually, the requests for the same job type are already aggregated and queued as pending requests, when they don't activate jobs immediately. As outlined by @Zelldon, this might be the problematic part (when a notification is received):

https://github.com/camunda-cloud/zeebe/blob/e0c789dfe77f3d210ed4dd06c3177db145369794/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/LongPollingActivateJobsHandler.java#L198-L202

If I understand your suggestion correctly, then instead of going through the entire pending queue and triggering all contained requests, it would get only the first request from the queue and execute this request, for example like this:

if (!pendingRequests.isEmpty()) {
  final var pendingRequest = pendingRequests.poll();
  activateJobs(pendingRequest);
} else ...

That means:

Each notification would send only one pending request.
Each successful job activation would send only one pending request, see
https://github.com/camunda-cloud/zeebe/blob/df9710cc95436a9b512ca751114a1e5fbdc7c685/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/LongPollingActivateJobsHandler.java#L183-L188

Am I correct with my understanding? If so, then this should be an easy pick, and when thinking about it (also in terms of pros and cons), then I don't see any big issues during the runtime.
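
To make the difference between the two strategies concrete, here is a minimal, self-contained sketch. It is not the handler's real code: `RecordingBroker`, `forwardAll`, and `forwardHead` are hypothetical stand-ins, and pending requests are reduced to plain string ids.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch; the types here only mimic the shape of the handler.
final class HeadOnlyDispatchSketch {

  /** Stand-in for the broker call; it records which requests were forwarded. */
  static final class RecordingBroker {
    final List<String> forwarded = new ArrayList<>();

    void activateJobs(String requestId) {
      forwarded.add(requestId);
    }
  }

  /** Behavior quoted in the issue: a notification forwards every pending request. */
  static void forwardAll(Queue<String> pending, RecordingBroker broker) {
    pending.forEach(broker::activateJobs);
  }

  /** Proposed behavior: a notification forwards only the head of the queue. */
  static void forwardHead(Queue<String> pending, RecordingBroker broker) {
    String head = pending.poll(); // removes the head, or returns null if empty
    if (head != null) {
      broker.activateJobs(head);
    }
  }

  static Queue<String> pendingRequests(int count) {
    Queue<String> pending = new ArrayDeque<>();
    for (int i = 0; i < count; i++) {
      pending.add("request-" + i);
    }
    return pending;
  }

  public static void main(String[] args) {
    RecordingBroker stampede = new RecordingBroker();
    forwardAll(pendingRequests(100), stampede);

    RecordingBroker headOnly = new RecordingBroker();
    Queue<String> pending = pendingRequests(100);
    forwardHead(pending, headOnly);

    System.out.println("forwardAll sent " + stampede.forwarded.size() + " requests");
    System.out.println("forwardHead sent " + headOnly.forwarded.size()
        + " request, " + pending.size() + " still pending");
  }
}
```

The remaining requests stay queued until the next notification or the next successful activation, which is the trade-off discussed in the following comments.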

npepinpe · 2021-12-14T20:54:07Z

That's what I was thinking, yes - only pick the first one, then if it completes and filled out all the jobs, we can try the next one (possibly we should even if it didn't fill out all of them? not sure, didn't think through, just want to point out the case). This will avoid a stampede effect. I'm also not 100% sure what the impact would be if you get a notification after you've sent the first request, but still have more queued - do you wait for the first one to complete, or immediately send the next? There's a chance the first one will grab the next job anyway, but also maybe it won't...
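
One possible way to handle the cases raised here, sketched under loud assumptions: all names are hypothetical, at most one request per job type is in flight, a notification that arrives while a request is in flight is remembered rather than sent immediately, and a request that activated nothing goes back to the tail of the queue. The thread does not settle on any of these choices; the sketch only shows that they can be expressed in a few lines.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of chained dispatch; not taken from the Zeebe code base.
final class ChainedDispatchSketch {

  static final class PendingRequest {
    final String id;
    final int maxJobs; // how many jobs the worker asked for

    PendingRequest(String id, int maxJobs) {
      this.id = id;
      this.maxJobs = maxJobs;
    }
  }

  private final Queue<PendingRequest> pending = new ArrayDeque<>();
  final List<String> sent = new ArrayList<>(); // ids forwarded to the broker, in order
  private PendingRequest inFlight;
  private boolean notifiedWhileInFlight;

  void enqueue(String id, int maxJobs) {
    pending.add(new PendingRequest(id, maxJobs));
  }

  /** Broker notification: forward one request unless one is already in flight. */
  void onJobsAvailable() {
    if (inFlight != null) {
      notifiedWhileInFlight = true; // assumption: remember it instead of sending now
      return;
    }
    sendNext();
  }

  /** Called when the in-flight request has completed with the given number of jobs. */
  void onCompleted(int activatedJobs) {
    PendingRequest done = inFlight;
    inFlight = null;
    if (done == null) {
      return;
    }
    if (activatedJobs == 0) {
      pending.add(done); // got nothing: wait for a later notification
    }
    boolean filled = activatedJobs >= done.maxJobs;
    if (filled || notifiedWhileInFlight) {
      notifiedWhileInFlight = false;
      sendNext(); // more jobs are likely available, so try the next request
    }
  }

  int pendingCount() {
    return pending.size();
  }

  private void sendNext() {
    PendingRequest next = pending.poll();
    if (next != null) {
      inFlight = next;
      sent.add(next.id);
    }
  }

  public static void main(String[] args) {
    ChainedDispatchSketch gateway = new ChainedDispatchSketch();
    gateway.enqueue("w1", 1);
    gateway.enqueue("w2", 1);
    gateway.enqueue("w3", 1);

    gateway.onJobsAvailable(); // forwards w1
    gateway.onCompleted(1);    // w1 was filled, so w2 is forwarded
    gateway.onCompleted(0);    // w2 got nothing: re-queued, chain stops
    System.out.println("sent=" + gateway.sent + ", pending=" + gateway.pendingCount());
  }
}
```

In the usage above, three waiting workers and one available job lead to two broker requests instead of three, and the chain stops as soon as a request comes back empty.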

I didn't spend a great deal of time thinking about it, so it's worth thinking through before doing it. I'm happy to join a discussion on this, or let others do that 🙂 Of course, if you already have, then I trust your judgement if you say there's no downsides.

To be honest, I think it might be worth rethinking long polling again, but in order to get resources for this I would need to have an outline of the problems we want to fix by reworking it (and maybe this time documenting it 🙃)

korthout · 2022-08-30T08:44:12Z

Marking this under the area performance as this is about optimizing long polling for scaling the number of job workers

korthout · 2022-08-30T08:45:42Z

Marking priority as later because scaling job workers is not a main priority for the process automation team right now.

Please comment if you think this should have a higher priority.

Zelldon added the kind/bug, severity/mid, and scope/gateway labels Aug 20, 2021

Zelldon mentioned this issue Oct 7, 2021

Iterate over long polling and notification mechanism #4542

Closed

Zelldon mentioned this issue Dec 2, 2021

Requests to activate jobs may result in infinite execution from the Gateway to the brokers #8310

Closed

2 tasks

Zelldon added team/distributed labels Jun 2, 2022

Zelldon removed the team/distributed label Jun 10, 2022

menski removed the team/process-automation label Jul 11, 2022

korthout added the area/performance label Aug 30, 2022

Zelldon added the component/gateway label Dec 29, 2022

romansmirnov added the component/zeebe label Mar 5, 2024
