Broken job stream aggregation #17513
Comments
A quick glance at the gateway shows that the aggregation fails because the streams get different server stream IDs, so it seems the "logical" ID comparison breaks. It's unclear yet whether it's the metadata or the stream type that causes it to "fail". It could also be a race condition, but the code path is extremely simple, so I fail to see how that would be possible.
I can confirm it's not aggregated properly on either side, broker or gateway. On the gateway the server ID clearly ends up being wrong, but on the broker it's unclear, as the aggregation is done purely over the job type and metadata. Again, there could be a race condition on both sides. Having checked the code path, it seems unlikely, as there's a clear single entrypoint wrapped by the actor. Of course, it could be an actor issue (:scream:), but that's unlikely.
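For readers unfamiliar with the mechanism under discussion, here is a minimal sketch of what aggregation over the job type and metadata looks like. The class and field names are hypothetical, not the actual gateway/broker code; only the grouping idea is taken from the comments above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch with hypothetical names: client streams are grouped by a key derived
// from the job type and the stream metadata, so two workers with identical
// configuration should share a single aggregated server stream.
final class StreamAggregatorSketch {

  // If any part of the metadata differs between two otherwise identical workers
  // (e.g. the list of tenant IDs), the keys differ and the streams are never merged.
  record LogicalId(String jobType, List<String> tenantIds) {}

  private final Map<LogicalId, List<Long>> aggregated = new HashMap<>();

  void add(final long serverStreamId, final String jobType, final List<String> tenantIds) {
    final var key = new LogicalId(jobType, List.copyOf(tenantIds));
    aggregated.computeIfAbsent(key, ignored -> new ArrayList<>()).add(serverStreamId);
  }

  int aggregatedStreamCount() {
    return aggregated.size();
  }
}
```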
I tried writing a quick test to reproduce it, but I can't; it always seems to aggregate properly locally.
I think the issue is the tenant IDs. I've added them to the endpoint, and you can see the following values:
The issue is that we reuse the same command builder (…).
## Description

Adding tenants to the job activation and streaming commands can end up accumulating the same tenant IDs over and over if the command builder is reused to send new commands. This is because the commands keep adding the tenant IDs to the underlying gRPC request without ever clearing them, so each successive `send()` call accumulates IDs. Say you use only the default tenant ID (`<default>`): on the first send the request carries `[<default>]`, on the second `[<default>, <default>]`, and so on.

For job streaming, this breaks the aggregation, as the tenant IDs are part of the stream aggregation key. This also highlights that the aggregation for job streaming is not ideal: for example, if one stream has tenants `foo, bar` and the other only `foo`, then technically any job for tenant `foo` could go to either of them. This is out of scope for now, but I will open another issue to improve the aggregation.

## Related issues

closes #17513
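To make the failure mode concrete, here is a minimal sketch of the reuse pattern described above, assuming the Java client's `newStreamJobsCommand()` builder methods; the exact method names and the point at which the tenant IDs are appended are assumptions for illustration, not taken from the fix itself.

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public final class BuilderReuseSketch {
  public static void main(final String[] args) {
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build()) {

      // One builder, configured once with a single tenant ID, then reused.
      final var command =
          client
              .newStreamJobsCommand()
              .jobType("payment")
              .consumer(job -> {})
              .workerName("worker-1")
              .timeout(Duration.ofSeconds(10))
              .tenantId("<default>");

      // Per the description above, each send() appends the configured tenant IDs to the
      // underlying gRPC request without clearing it first, so the first stream is opened
      // with [<default>] and the second with [<default>, <default>] -- two different
      // aggregation keys for logically identical workers.
      command.send();
      command.send();
    }
  }
}
```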
Describe the bug
From running a benchmark with > 100 workers, it seems we sometimes fail to aggregate streams properly on the gateway and on the broker. In my benchmark, I had 5 job types and 20 workers (with the exact same configuration) per job type. I would expect to see at most 5 aggregated streams per gateway, but instead some gateways had 30, some had 17, etc.
So there seems to be a race condition or some other issue breaking the aggregation. This has an impact on performance, as it means jobs cannot be retried against logically equivalent workers and may have to be resubmitted to the engine.
This also impacts performance because, while you think you're scaling out workers, one aggregated stream may not actually be scaling out, leading to jobs being yielded because the stream appears blocked. This means scaling out workers may not yield any performance gain at all, which negates Zeebe's horizontal scalability goal in that respect.
To Reproduce
Run a benchmark with multiple different job types and lots of workers per job type. Then randomly restart your gateways and brokers. After some restarts (sometimes only one), you will see that the aggregation is broken.
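For reference, the worker setup in such a benchmark looks roughly like the sketch below. This is a hedged example using the Java client's `newWorker()` builder; the job type names, worker counts, and the `streamEnabled` flag are assumptions for illustration, not taken from the actual benchmark.

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.worker.JobWorker;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public final class BenchmarkWorkersSketch {
  public static void main(final String[] args) {
    final ZeebeClient client =
        ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build();

    final List<JobWorker> workers = new ArrayList<>();
    // 5 job types, 20 identically configured workers per type: with working aggregation,
    // each gateway should end up with at most 5 aggregated job streams.
    for (final String jobType : List.of("task-1", "task-2", "task-3", "task-4", "task-5")) {
      for (int i = 0; i < 20; i++) {
        workers.add(
            client
                .newWorker()
                .jobType(jobType)
                .handler((jobClient, job) -> jobClient.newCompleteCommand(job).send())
                .name("benchmark-worker")
                .timeout(Duration.ofSeconds(30))
                .streamEnabled(true)
                .open());
      }
    }
    // ... run the benchmark, restart gateways/brokers, then inspect the aggregated
    // stream counts; close the workers and the client when done.
  }
}
```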
Expected behavior
Given 5 job types and 20 workers per type with the exact same config, I would expect to see at most 5 aggregated streams.
Environment: