
Broken job stream aggregation #17513

Closed
npepinpe opened this issue Apr 16, 2024 · 5 comments · Fixed by #17545
Assignees
npepinpe

Labels
- area/performance: Marks an issue as performance related
- component/zeebe: Related to the Zeebe component/team
- impact/high: Marks issues relieving a recurring pain or fulfilling an immediate need
- kind/bug: Categorizes an issue or PR as a bug
- likelihood/high: A recurring issue
- severity/high: Marks a bug as having a noticeable impact on the user with no known workaround
- version:8.3.11: Marks an issue as being completely or in parts released in 8.3.11
- version:8.4.7: Marks an issue as being completely or in parts released in 8.4.7
- version:8.5.1: Marks an issue as being completely or in parts released in 8.5.1
- version:8.6.0-alpha1: Label that represents issues released on version 8.6.0-alpha1
- version:8.6.0: Label that represents issues released on version 8.6.0

Comments

@npepinpe
Member

npepinpe commented Apr 16, 2024

Describe the bug

From running a benchmark with > 100 workers, it seems we sometimes fail to aggregate streams properly on the gateway and on the broker. In my benchmark, I had 5 job types, and 20 workers (with the exact same configuration) per job type. I would expect to see at most 5 aggregated streams per gateway, but instead some had 30, some had 17, etc.

So there seems to be a race condition or some other issue breaking the aggregation. This has an impact on performance, as it means jobs cannot be retried against logically equivalent workers and may instead have to be resubmitted to the engine.

This also impacts performance because while you think you're scaling out workers, one aggregated stream may not be scaling out, leading to jobs being yielded because the stream seems blocked. This means scaling out workers may not yield any performance gain at all, which negates the horizontal scalability goal of Zeebe in that respect.

To Reproduce

Run a benchmark with multiple different job types, with lots of workers per job type. Then, randomly restart your gateways and brokers. After some restarts (sometimes only 1), you will see the aggregation is broken.

Expected behavior

Given 5 job types and 20 workers per type with the exact same config, I would expect to see at most 5 aggregated streams.

Environment:

  • Zeebe Version: 8.6.0-SNAPSHOT (and earlier)
@npepinpe npepinpe added kind/bug Categorizes an issue or PR as a bug component/zeebe Related to the Zeebe component/team labels Apr 16, 2024
@npepinpe npepinpe changed the title Broken job push aggregation Broken job stream aggregation Apr 16, 2024
@npepinpe
Member Author

npepinpe commented Apr 16, 2024

A quick glance at the gateway shows that the aggregation fails because the streams get different server stream IDs, so it seems the "logical" ID comparison breaks. It's unclear yet whether it's the metadata or the stream type that causes it to "fail".

It could also be a race condition, but the code path is extremely simple, so I fail to see how that would be possible.

@npepinpe
Member Author

Can confirm it's not aggregated properly on either side, broker or gateway. On the gateway, the server ID clearly ends up being wrong, but on the broker it's unclear, as the aggregation is done purely over the job type and metadata.

Again, could be there's a race condition on both sides. Having checked the code path, it seems unlikely, as there's a clear single entrypoint wrapped by the actor. Of course, it could be an actor issue (:scream:), but that's unlikely.
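
For context, here is a minimal sketch of what aggregation keyed on job type and metadata could look like; the names are illustrative only, not the actual broker or gateway classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: consumer streams are grouped under a logical key built from
// the job type and the stream metadata. If any metadata field differs (worker,
// timeout, fetch variables, tenant IDs), a separate aggregated stream is created.
record StreamMetadata(String worker, String timeout, List<String> fetchVariables, List<String> tenantIds) {}

record LogicalStreamKey(String jobType, StreamMetadata metadata) {}

final class StreamAggregator {
  private final Map<LogicalStreamKey, List<String>> consumersByKey = new HashMap<>();

  void add(final String jobType, final StreamMetadata metadata, final String consumerId) {
    consumersByKey
        .computeIfAbsent(new LogicalStreamKey(jobType, metadata), key -> new ArrayList<>())
        .add(consumerId);
  }

  int aggregatedStreamCount() {
    return consumersByKey.size();
  }
}
```

With equality based on the full metadata record, two otherwise identical workers whose tenant ID lists differ end up under two different keys and are never aggregated.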

@npepinpe npepinpe added area/performance Marks an issue as performance related severity/high Marks a bug as having a noticeable impact on the user with no known workaround impact/high Marks issues relieving a recurring pain or fulfilling an immediate need likelihood/high A recurring issue labels Apr 16, 2024
@npepinpe
Member Author

I tried writing a quick test to reproduce it, but I can't; it always seems to aggregate properly locally.

@npepinpe
Member Author

I think the issue is the tenant IDs. I've added them to the endpoint, and you can see the following values:

[
  RemoteJobStream[
    jobType=id-d22ebb23-8ac5-49ee-816a-dc47ca450633, 
    metadata=Metadata[
      worker=command,
      timeout=PT0.5S,
      fetchVariables=[foo, bar],
      tenantIds=[<default>, <default>, <default>, <default>, <default>, <default>, <default>, <default>]
    ], 
    consumers=[
      RemoteStreamId[id=27eb2a1b-d1c2-4033-b804-1cde40101f97, receiver=gateway-1]
    ]
  ],
  RemoteJobStream[
    jobType=id-d22ebb23-8ac5-49ee-816a-dc47ca450633,
    metadata=Metadata[
      worker=command,
      timeout=PT0.5S,
      fetchVariables=[foo, bar],
      tenantIds=[<default>, <default>, <default>, <default>, <default>, <default>, <default>, <default>, <default>]
    ],
    consumers=[
      RemoteStreamId[id=3195f3fd-2be6-4b09-94b3-71a7f7764683, receiver=gateway-1]
    ]
  ]
]
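
The two Metadata values above differ only in the length of the duplicated tenantIds list, which is presumably enough to fail the equality check used for aggregation, since list (and therefore record) equality is element by element:

```java
// Duplicated entries make the lists, and thus the containing metadata records, unequal.
List.of("<default>").equals(List.of("<default>", "<default>")); // false
```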

@npepinpe
Member Author

The issue is that we reuse the same command builder (StreamJobsCommandImpl) in the JobStreamerImpl, and that one keeps adding the tenants on every send :)
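
For illustration, a sketch of how a reused builder ends up accumulating entries in the repeated tenant ID field; the class and field names are simplified and hypothetical, not the actual client code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the bug: a single long-lived command/request builder is
// reused for every send, and the configured tenant IDs are appended each time
// without clearing what was added before.
final class ReusedStreamCommandSketch {
  private final List<String> requestTenantIds = new ArrayList<>(); // stands in for the gRPC repeated field
  private final List<String> configuredTenantIds = List.of("<default>");

  void send() {
    requestTenantIds.addAll(configuredTenantIds); // appends, never clears
    // 1st send: [<default>]
    // 2nd send: [<default>, <default>]
    // 3rd send: [<default>, <default>, <default>] ...
  }
}
```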

@npepinpe npepinpe self-assigned this Apr 17, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 18, 2024
## Description

Adding tenants to the job activation and streaming commands can end up
accumulating the same tenant IDs over and over if the command builder is
reused to send new commands. This is because the commands keep adding
the tenant IDs to the underlying gRPC request without ever clearing
them, so each successive `send()` call ends up accumulating IDs.

Say you use only the default tenant ID (`<default>`), then on the first
send it would be `[<default>]`. On the second, `[<default>, <default>]`,
and so on.

For job streaming, this would break the aggregation as the tenant IDs
are part of the stream aggregation.

This highlights that the aggregation for job streaming is not ideal:
for example, if one stream declares tenants `foo, bar` and another only
`foo`, then technically any job for tenant `foo` could go to either of
them. This is out of scope for now, but I will open another issue to
improve the aggregation.
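
As for the accumulation itself, one way to avoid it is to reset the
repeated field before re-adding the configured IDs, roughly as in the
sketch below (an illustration of the idea, not necessarily the exact
diff in this PR):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the idea behind the fix: clear the previously added tenant IDs before
// adding the configured ones, so repeated send() calls no longer accumulate entries.
final class ResettingStreamCommandSketch {
  private final List<String> requestTenantIds = new ArrayList<>();
  private final List<String> configuredTenantIds = List.of("<default>");

  void send() {
    requestTenantIds.clear();                     // reset before re-adding
    requestTenantIds.addAll(configuredTenantIds); // every send now carries exactly [<default>]
  }
}
```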

## Related issues

closes #17513
github-merge-queue bot pushed a commit that referenced this issue Apr 18, 2024
… restarts (#17580)

# Description
Backport of #17545 to `stable/8.5`.

relates to #17513
original author: @npepinpe
github-merge-queue bot pushed a commit that referenced this issue Apr 23, 2024
… restarts (#17601)

# Description
Backport of #17545 to `stable/8.3`.

> [!Note]
> There was a very small conflict in the `JobStreamEndpoint` actuator
> around the `Metadata` record, I guess because of the annotations, but
> nothing major.

relates to #17513
original author: @npepinpe
github-merge-queue bot pushed a commit that referenced this issue Apr 23, 2024
… restarts (#17600)

# Description
Backport of #17545 to `stable/8.4`.

> [!Note]
> There was a very small conflict in the `JobStreamEndpoint` actuator
> around the `Metadata` record, I guess because of the annotations, but
> nothing major.

relates to #17513
original author: @npepinpe
@Zelldon Zelldon added version:8.4.7 Marks an issue as being completely or in parts released in 8.4.7 version:8.5.1 Marks an issue as being completely or in parts released in 8.5.1 version:8.3.11 Marks an issue as being completely or in parts released in 8.3.11 version:8.6.0-alpha1 labels May 7, 2024
@tmetzke tmetzke added the version:8.6.0 Label that represents issues released on version 8.6.0 label Oct 2, 2024
3 participants