Gateway can easily overwhelm broker with handling blocking jobs (long polling) #7659
You're right. Generally I'd like to rewrite the long-polling approach. I was hoping to wait until we switch to gRPC transport for internal communication as well, but we could already do it now. Can you elaborate on the impact here?
Regarding impact: if you have multiple workers waiting for a task and you create a new process instance, or reach a point where a new task becomes available, then the gateway writes activation commands for all waiting workers. This can overwhelm the broker, which means backpressure spikes and we drop more requests than usual. This can affect other things, like the execution and creation of instances.
I think this will take some time to fix, and as such we should plan it as a KR. I don't see any quick fix or anything we can do in between other topics, so I will leave it in the backlog until we do the planning for Q4.
@Zelldon, just want to share a thought here. In a scenario with a replicated standalone gateway, this issue might become even more obvious. Meaning, if the gateway is replicated 3 times and each gateway receives 100 activate jobs requests but no jobs are available right now, then there are in total 300 pending activate jobs requests. Now the broker creates one job and notifies all gateways, which results in 300 requests to the broker, but only one of them will actually activate the job.
Regarding this scenario, would it make sense to aggregate requests for the same job type in each gateway and queue them? In that case, only one request per job type would be sent, and the gateway would return the jobs to the client at the front of the queue. Of course this adds some complexity to the gateway, but I would try to avoid coordinating the gateways with each other as much as possible.
Actually, the requests for the same job type are already aggregated and queued as pending requests when they don't activate jobs immediately. As outlined by @Zelldon, this might be the problematic part (when a notification is received). If I understand your suggestion correctly, then instead of going through the entire pending queue and triggering all contained requests, it would take only the first request from the queue and execute it, for example like this:
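A minimal sketch of what that could look like, assuming a per-job-type queue of pending requests (all names here are illustrative, not the actual Zeebe classes or API):

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the gateway's long-polling state.
final class LongPollingSketch {

  record PendingRequest(String jobType, int maxJobsToActivate) {}

  // Pending activate-jobs requests, aggregated and queued per job type.
  private final Map<String, Queue<PendingRequest>> pendingRequests = new ConcurrentHashMap<>();

  // Suggested behavior: on a "jobs available" notification, trigger only
  // the request at the head of the queue instead of all of them.
  void onNotification(final String jobType) {
    final Queue<PendingRequest> queue = pendingRequests.get(jobType);
    if (queue == null) {
      return;
    }
    final PendingRequest first = queue.poll();
    if (first != null) {
      sendActivationCommand(first); // one notification -> one broker command
    }
  }

  private void sendActivationCommand(final PendingRequest request) {
    // Placeholder for the actual broker round trip.
  }
}
```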
That means: one notification would result in a single activation command to the broker, instead of one command per pending request.
Am I correct with my understanding? If so, then this should be an easy pick, and when thinking it through (also in terms of pros and cons), I don't see any big issues at runtime.
That's what I was thinking, yes: only pick the first one, then if it completes and filled out all its jobs, we can try the next one (possibly we should do so even if it didn't fill out all of them? I'm not sure, I didn't think it through, I just want to point out the case). This will avoid a stampede effect.

I'm also not 100% sure what the impact would be if you get a notification after you've sent the first request but still have more queued: do you wait for the first one to complete, or immediately send the next? There's a chance the first one will grab the next job anyway, but maybe it won't... I didn't spend a great deal of time thinking about it, so it's worth thinking through before doing it. I'm happy to join a discussion on this, or let others do that 🙂 Of course, if you already have, then I trust your judgement if you say there are no downsides.

To be honest, I think it might be worth rethinking long polling again, but in order to get resources for this I would need an outline of the problems we want to fix by reworking it (and maybe this time documenting it 🙃)
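To make the completion handling concrete, here is a sketch of that follow-up step, continuing the illustrative model from the snippet above (the callback shape and names are assumptions, not Zeebe's actual API):

```java
// Continuing the hypothetical sketch above: when a triggered request
// completes, decide whether to trigger the next queued request.
void onRequestCompleted(final String jobType, final boolean activatedFullBatch) {
  // If the completed request filled its whole batch, more jobs may still
  // be available, so trigger the next pending request. Whether to also
  // do this for a partially filled batch is the open question above.
  if (activatedFullBatch) {
    final Queue<PendingRequest> queue = pendingRequests.get(jobType);
    final PendingRequest next = (queue != null) ? queue.poll() : null;
    if (next != null) {
      sendActivationCommand(next);
    }
  }
}
```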
Marking this under the corresponding area label.
Marking the priority; please comment if you think this should have a higher priority.
Describe the bug
The gateway can overwhelm the broker with activation requests if many workers are waiting for jobs.
I can reproduce this with a normal benchmark where I created 100 workers and then created a single instance with one task.
Did you know that we have metrics about blocked requests?
The issue is here, in the code that runs when we get the broker notification:
https://github.com/camunda-cloud/zeebe/blob/e0c789dfe77f3d210ed4dd06c3177db145369794/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/LongPollingActivateJobsHandler.java#L198-L202
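As a rough, hedged sketch of the behavior described there (not the actual Zeebe source; it reuses the illustrative per-job-type queue and names from the sketches above), the linked code boils down to:

```java
// Illustrative sketch of the described behavior, NOT the actual source:
// on a "jobs available" notification, every pending request for that
// job type is re-triggered.
void onNotification(final String jobType) {
  final Queue<PendingRequest> queue = pendingRequests.get(jobType);
  if (queue == null) {
    return;
  }
  // Problem: one new job causes N activation commands, one per pending
  // long-polling request, to hit the broker at once.
  for (final PendingRequest request : queue) {
    sendActivationCommand(request);
  }
}
```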
To Reproduce
Run multiple workers, wait, and start an instance; you can observe multiple activation requests.
Expected behavior
The gateway should send one activation request per notification, not 100.
Environment: