Gateway can easily overwhelm broker with handling blocking jobs (long polling) #7659

Open
Zelldon opened this issue Aug 20, 2021 · 9 comments
Labels
area/performance Marks an issue as performance related component/gateway component/zeebe Related to the Zeebe component/team kind/bug Categorizes an issue or PR as a bug scope/gateway Marks an issue or PR to appear in the gateway section of the changelog severity/mid Marks a bug as having a noticeable impact but with a known workaround

Comments

@Zelldon
Member

Zelldon commented Aug 20, 2021

Describe the bug

The gateway can overwhelm the broker with activation requests if many workers are waiting for jobs.

[Image: gw]

I can reproduce this with a normal benchmark where I created 100 workers and then created a single instance with one task.

[Images: general, grpc]

Did we know that we have metrics for blocked requests?

[Image: blocked-req]

The issue is here https://github.com/camunda-cloud/zeebe/blob/e0c789dfe77f3d210ed4dd06c3177db145369794/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/LongPollingActivateJobsHandler.java#L198-L202
when we get the Broker notification:

  private void resetFailedAttemptsAndHandlePendingRequests(final String jobType) {
    final InFlightLongPollingActivateJobsRequestsState state = getJobTypeState(jobType);

    state.resetFailedAttempts();

    final Queue<LongPollingActivateJobsRequest> pendingRequests = state.getPendingRequests();

    if (!pendingRequests.isEmpty()) {
      pendingRequests.forEach(
          nextPendingRequest -> {
            LOG.trace("Unblocking ActivateJobsRequest {}", nextPendingRequest.getRequest());
            activateJobs(nextPendingRequest);
          });
    } else {
      if (!state.hasActiveRequests()) {
        jobTypeState.remove(jobType);
      }
    }
  }

To Reproduce

Run multiple workers, wait, and start an instance. You can then observe multiple activations.
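The reproduction can be modeled in miniature. The following is a minimal sketch, not the actual Zeebe code: `NotificationFanOutSketch` and all of its members are hypothetical stand-ins, and the broker call is reduced to a counter. It only mirrors the `forEach` over the pending queue shown above, to illustrate why one notification turns into one broker request per waiting worker.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for the gateway's per-job-type state; not a real Zeebe class.
class NotificationFanOutSketch {
  private final Queue<String> pendingRequests = new ArrayDeque<>();
  private int brokerRequestsSent = 0;

  // A worker long-polls and finds no job, so its request is parked.
  void addPendingWorker(final String workerName) {
    pendingRequests.add(workerName);
  }

  // Mirrors the forEach in resetFailedAttemptsAndHandlePendingRequests:
  // every parked request is re-sent to the broker for a single notification.
  void onJobAvailableNotification() {
    pendingRequests.forEach(request -> brokerRequestsSent++);
  }

  int brokerRequestsSent() {
    return brokerRequestsSent;
  }

  // Convenience: park `workers` requests, deliver one notification, count broker requests.
  static int requestsForOneNotification(final int workers) {
    final NotificationFanOutSketch gateway = new NotificationFanOutSketch();
    for (int i = 0; i < workers; i++) {
      gateway.addPendingWorker("worker-" + i);
    }
    gateway.onJobAvailableNotification();
    return gateway.brokerRequestsSent();
  }
}
```

With 100 parked workers, `requestsForOneNotification(100)` counts 100 broker requests for a single notification, which matches the behavior described in this report.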

Expected behavior

The gateway sends one activation request per notification, not 100.

Environment:

  • OS: k8
  • Zeebe Version: 1.1.2
  • Configuration: benchmark
@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug severity/mid Marks a bug as having a noticeable impact but with a known workaround scope/gateway Marks an issue or PR to appear in the gateway section of the changelog labels Aug 20, 2021
@npepinpe
Member

You're right. Generally I'd like to rewrite the long polling approach. I was hoping to wait until we switched to gRPC transport also for internal communication, but we could already do it now. Can you elaborate on the impact here?

@Zelldon
Member Author

Zelldon commented Aug 20, 2021

Regarding impact:

If you have multiple workers waiting for a task and create a new process instance, or reach a point where a new task becomes available, then the gateway will write activation commands for all waiting workers. This can overwhelm the broker: backpressure spikes and we drop more requests than usual. This can affect other things, such as the execution and creation of instances.

@npepinpe
Copy link
Member

I think this will take some time to fix, and as such we should plan it as a KR. I don't see any quick fix or anything we can do in between other topics, so I will leave it in the backlog until we do the planning for Q4.

@romansmirnov
Member

@Zelldon, just want to share a thought here.

In a scenario with a replicated standalone gateway, this issue might become more obvious. For example, if the gateway is replicated 3 times and each gateway receives 100 activate jobs requests while no jobs are available, then there are 300 pending activate jobs requests in total. Now the broker creates one job and notifies all gateways, which results in 300 requests to the broker, but only one request will actually activate a job.
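The arithmetic of this scenario can be written down as a small sketch. `ReplicatedGatewaySketch` and its methods are hypothetical names, and the model assumes that every gateway re-sends all of its pending requests on a notification and that each created job is activated by exactly one request.

```java
// Hypothetical helper for the replicated-gateway arithmetic; not part of Zeebe.
class ReplicatedGatewaySketch {

  // Every gateway re-sends all of its pending requests when notified.
  static int brokerRequestsPerNotification(final int gateways, final int pendingPerGateway) {
    return gateways * pendingPerGateway;
  }

  // Assumption: each created job is picked up by exactly one request,
  // so all remaining requests reach the broker and activate nothing.
  static int requestsThatActivateNothing(
      final int gateways, final int pendingPerGateway, final int jobsCreated) {
    final int total = brokerRequestsPerNotification(gateways, pendingPerGateway);
    return Math.max(0, total - jobsCreated);
  }
}
```

For 3 gateways with 100 pending requests each and one new job, this gives 300 broker requests, of which 299 activate nothing.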

@npepinpe
Member

Regarding this scenario, would it make sense to aggregate requests for the same job type in each gateway, and queue them? In that case, only one request for this job type would be sent, and the gateway would return the jobs to the client at the front of the queue. Of course this adds some complexity in the gateway. I would try to avoid coordinating the gateways together as much as possible though.

@romansmirnov
Member

romansmirnov commented Dec 14, 2021

Actually, the requests for the same job type are already aggregated and queued as pending requests, when they don't activate jobs immediately. As outlined by @Zelldon, this might be the problematic part (when a notification is received):

https://github.com/camunda-cloud/zeebe/blob/e0c789dfe77f3d210ed4dd06c3177db145369794/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/LongPollingActivateJobsHandler.java#L198-L202

If I understand your suggestion correctly, then instead of going through the entire pending queue and triggering all contained requests, it would get only the first request from the queue and execute this request, for example like this:

if (!pendingRequests.isEmpty()) {
  final var pendingRequest = pendingRequests.poll();
  activateJobs(pendingRequest);
} else ...

That means:

Is my understanding correct? If so, then this should be an easy pick, and having thought about it (also in terms of pros and cons), I don't see any big issues at runtime.
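A self-contained sketch of the poll-first idea, under stated assumptions: `PollFirstSketch` and its members are hypothetical, requests are represented by worker names, and "sending to the broker" is modeled as appending to a list. It is not the real `LongPollingActivateJobsHandler`, only an illustration of taking a single request off the queue per notification.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model of the proposed change; not the real handler.
class PollFirstSketch {
  private final Queue<String> pendingRequests = new ArrayDeque<>();
  private final List<String> sentToBroker = new ArrayList<>();

  void enqueue(final String workerName) {
    pendingRequests.add(workerName);
  }

  // On a notification, only the head of the queue is sent to the broker.
  void onJobAvailableNotification() {
    final String next = pendingRequests.poll(); // returns null when the queue is empty
    if (next != null) {
      sentToBroker.add(next);
    }
  }

  List<String> sentToBroker() {
    return sentToBroker;
  }

  int pendingCount() {
    return pendingRequests.size();
  }
}
```

With 100 queued workers, one notification sends only the oldest request and leaves 99 waiting. What should happen to those 99 if more jobs exist than the first request can take is left open by this sketch.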

@npepinpe
Member

npepinpe commented Dec 14, 2021

That's what I was thinking, yes - only pick the first one, then if it completes and filled out all the jobs, we can try the next one (possibly we should even if it didn't fill out all of them? not sure, didn't think through, just want to point out the case). This will avoid a stampede effect. I'm also not 100% sure what the impact would be if you get a notification after you've sent the first request, but still have more queued - do you wait for the first one to complete, or immediately send the next? There's a chance the first one will grab the next job anyway, but also maybe it won't...
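One possible reading of "pick the first one, then try the next one if it was filled" can be sketched as follows. This is a simplified, synchronous model built on assumptions: `ChainedActivationSketch` is a hypothetical class, the broker is reduced to a counter of available jobs, and each pending request is represented only by its maximum number of jobs. The sketch stops as soon as a request is not completely filled, which is just one answer to the open questions raised here, and it does not model a second notification arriving while a request is in flight.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, synchronous model of chained activation; not the real gateway.
class ChainedActivationSketch {
  // Each entry is the maximum number of jobs a pending request wants.
  private final Deque<Integer> pendingMaxJobs = new ArrayDeque<>();
  private int availableJobs; // stand-in for jobs currently available on the broker
  private int brokerRequestsSent = 0;

  ChainedActivationSketch(final int availableJobs) {
    this.availableJobs = availableJobs;
  }

  void addPending(final int maxJobsToActivate) {
    pendingMaxJobs.addLast(maxJobsToActivate);
  }

  // Send one request at a time; continue only while requests are completely filled.
  void onJobAvailableNotification() {
    while (!pendingMaxJobs.isEmpty()) {
      final int maxJobs = pendingMaxJobs.pollFirst();
      brokerRequestsSent++;
      final int activated = Math.min(availableJobs, maxJobs);
      availableJobs -= activated;
      if (activated == 0) {
        // Nothing activated: the request keeps waiting at the front of the queue.
        pendingMaxJobs.addFirst(maxJobs);
        return;
      }
      if (activated < maxJobs) {
        // Partially filled: the broker has no more jobs, so stop here.
        return;
      }
    }
  }

  int brokerRequestsSent() {
    return brokerRequestsSent;
  }

  int pendingCount() {
    return pendingMaxJobs.size();
  }
}
```

In this model, 100 pending requests of one job each and 3 available jobs lead to 4 broker requests (three filled, one empty probe that is re-queued) instead of 100, with 97 requests still pending.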

I didn't spend a great deal of time thinking about it, so it's worth thinking through before doing it. I'm happy to join a discussion on this, or let others do that 🙂 Of course, if you already have, then I trust your judgement if you say there's no downsides.

To be honest, I think it might be worth rethinking long polling again, but in order to get resources for this I would need to have an outline of the problems we want to fix by reworking it (and maybe this time documenting it 🙃)

@korthout korthout added the area/performance Marks an issue as performance related label Aug 30, 2022
@korthout
Member

Marking this under the performance area, as this is about optimizing long polling for scaling the number of job workers.

@korthout
Member

Marking priority as later because scaling job workers is not a main priority for the process automation team right now.

Please comment if you think this should have a higher priority.

@romansmirnov romansmirnov added the component/zeebe Related to the Zeebe component/team label Mar 5, 2024
5 participants