
Broken job stream aggregation #17513

Closed
npepinpe opened this issue Apr 16, 2024 · 5 comments · Fixed by #17545
Assignees
npepinpe

Labels
- area/performance: Marks an issue as performance related
- component/zeebe: Related to the Zeebe component/team
- impact/high: Marks issues relieving a recurring pain or fulfilling an immediate need
- kind/bug: Categorizes an issue or PR as a bug
- likelihood/high: A recurring issue
- severity/high: Marks a bug as having a noticeable impact on the user with no known workaround
- version:8.3.11: Marks an issue as being completely or in parts released in 8.3.11
- version:8.4.7: Marks an issue as being completely or in parts released in 8.4.7
- version:8.5.1: Marks an issue as being completely or in parts released in 8.5.1
- version:8.6.0-alpha1: Label that represents issues released on version 8.6.0-alpha1
- version:8.6.0: Label that represents issues released on version 8.6.0

Comments

@npepinpe
Member

npepinpe commented Apr 16, 2024

Describe the bug

From running a benchmark with > 100 workers, it seems we sometimes fail to aggregate streams properly on the gateway and on the broker. In my benchmark, I had 5 job types, and 20 workers (with the exact same configuration) per job type. I would expect to see at most 5 aggregated streams per gateway, but instead some had 30, some had 17, etc.

So there seems to be a race condition or some other issue breaking the aggregation. This has an impact on performance, as it means jobs cannot be retried against logically equivalent workers and may instead have to be resubmitted to the engine.

This also impacts performance because while you think you're scaling out workers, one aggregated stream may not be scaling out, leading to jobs being yielded because the stream seems blocked. This means scaling out workers may not yield any performance gain at all, which negates the horizontal scalability goal of Zeebe in that respect.

To Reproduce

Run a benchmark with multiple different job types, with lots of workers per job type. Then, randomly restart your gateways and brokers. After some restarts (sometimes only 1), you will see the aggregation is broken.

Expected behavior

Given 5 job types and 20 workers per type with the exact same config, I would expect to see at most 5 aggregated streams.

Environment:

  • Zeebe Version: 8.6.0-SNAPSHOT (and earlier)
@npepinpe npepinpe added kind/bug Categorizes an issue or PR as a bug component/zeebe Related to the Zeebe component/team labels Apr 16, 2024
@npepinpe npepinpe changed the title Broken job push aggregation Broken job stream aggregation Apr 16, 2024
@npepinpe
Member Author

npepinpe commented Apr 16, 2024

A quick glance at the gateway shows that the aggregation fails because the streams get different server stream IDs, so it seems the "logical" ID comparison breaks. It's unclear yet whether it's the metadata or the stream type that causes it to "fail".

It could also be a race condition, but the code path is extremely simple, so I fail to see how that would be possible.

@npepinpe
Member Author

Can confirm it's not aggregated properly on either side, broker or gateway. On the gateway, the server ID clearly ends up being wrong, but on the broker it's unclear, as the aggregation is done purely over the job type and metadata.

Again, could be there's a race condition on both sides. Having checked the code path, it seems unlikely, as there's a clear single entrypoint wrapped by the actor. Of course, it could be an actor issue (:scream:), but that's unlikely.
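
For context, here is a minimal sketch of what aggregation keyed on job type and metadata could look like; the names are illustrative only, not the actual broker or gateway classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: consumer streams are grouped under a logical key built from
// the job type and the stream metadata. If any metadata field differs (worker,
// timeout, fetch variables, tenant IDs), a separate aggregated stream is created.
record StreamMetadata(String worker, String timeout, List<String> fetchVariables, List<String> tenantIds) {}

record LogicalStreamKey(String jobType, StreamMetadata metadata) {}

final class StreamAggregator {
  private final Map<LogicalStreamKey, List<String>> consumersByKey = new HashMap<>();

  void add(final String jobType, final StreamMetadata metadata, final String consumerId) {
    consumersByKey
        .computeIfAbsent(new LogicalStreamKey(jobType, metadata), key -> new ArrayList<>())
        .add(consumerId);
  }

  int aggregatedStreamCount() {
    return consumersByKey.size();
  }
}
```

With equality based on the full metadata record, two otherwise identical workers whose tenant ID lists differ end up under two different keys and are never aggregated.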

@npepinpe npepinpe added area/performance Marks an issue as performance related severity/high Marks a bug as having a noticeable impact on the user with no known workaround impact/high Marks issues relieving a recurring pain or fulfilling an immediate need likelihood/high A recurring issue labels Apr 16, 2024
@npepinpe
Member Author

I tried writing a quick test to reproduce it, but I can't; it always seems to aggregate properly locally.

@npepinpe
Member Author

I think the issue is the tenant IDs. I've added them to the endpoint, and you can see the following values:

[
  RemoteJobStream[
    jobType=id-d22ebb23-8ac5-49ee-816a-dc47ca450633, 
    metadata=Metadata[
      worker=command,
      timeout=PT0.5S,
      fetchVariables=[foo, bar],
      tenantIds=[<default>, <default>, <default>, <default>, <default>, <default>, <default>, <default>]
    ], 
    consumers=[
      RemoteStreamId[id=27eb2a1b-d1c2-4033-b804-1cde40101f97, receiver=gateway-1]
    ]
  ],
  RemoteJobStream[
    jobType=id-d22ebb23-8ac5-49ee-816a-dc47ca450633,
    metadata=Metadata[
      worker=command,
      timeout=PT0.5S,
      fetchVariables=[foo, bar],
      tenantIds=[<default>, <default>, <default>, <default>, <default>, <default>, <default>, <default>, <default>]
    ],
    consumers=[
      RemoteStreamId[id=3195f3fd-2be6-4b09-94b3-71a7f7764683, receiver=gateway-1]
    ]
  ]
]
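
The two Metadata values above differ only in the length of the duplicated tenantIds list, which is presumably enough to fail the equality check used for aggregation, since list (and therefore record) equality is element by element:

```java
// Duplicated entries make the lists, and thus the containing metadata records, unequal.
List.of("<default>").equals(List.of("<default>", "<default>")); // false
```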

@npepinpe
Member Author

The issue is that we reuse the same command builder (StreamJobsCommandImpl) in the JobStreamerImpl, and that one keeps adding the tenants on every send :)
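
For illustration, a sketch of how a reused builder ends up accumulating entries in the repeated tenant ID field; the class and field names are simplified and hypothetical, not the actual client code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the bug: a single long-lived command/request builder is
// reused for every send, and the configured tenant IDs are appended each time
// without clearing what was added before.
final class ReusedStreamCommandSketch {
  private final List<String> requestTenantIds = new ArrayList<>(); // stands in for the gRPC repeated field
  private final List<String> configuredTenantIds = List.of("<default>");

  void send() {
    requestTenantIds.addAll(configuredTenantIds); // appends, never clears
    // 1st send: [<default>]
    // 2nd send: [<default>, <default>]
    // 3rd send: [<default>, <default>, <default>] ...
  }
}
```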

@npepinpe npepinpe self-assigned this Apr 17, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 18, 2024
## Description

Adding tenants to the job activation and streaming commands can end up
accumulating the same tenant IDs over and over if the command builder is
reused to send new commands. This is because the commands keep adding
the tenant IDs to the underlying gRPC request without ever clearing
them, so each successive `send()` call ends up accumulating IDs.

Say you use only the default tenant ID (`<default>`), then on the first
send it would be `[<default>]`. On the second, `[<default>, <default>]`,
and so on.

For job streaming, this would break the aggregation as the tenant IDs
are part of the stream aggregation.

This highlights that the aggregation for job streaming is not ideal:
for example, if one stream declares tenants `foo, bar` and another only
`foo`, then technically any job for tenant `foo` could go to either of
them. This is out of scope for now, but I will open another issue to
improve the aggregation.
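
As for the accumulation itself, one way to avoid it is to reset the
repeated field before re-adding the configured IDs, roughly as in the
sketch below (an illustration of the idea, not necessarily the exact
diff in this PR):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the idea behind the fix: clear the previously added tenant IDs before
// adding the configured ones, so repeated send() calls no longer accumulate entries.
final class ResettingStreamCommandSketch {
  private final List<String> requestTenantIds = new ArrayList<>();
  private final List<String> configuredTenantIds = List.of("<default>");

  void send() {
    requestTenantIds.clear();                     // reset before re-adding
    requestTenantIds.addAll(configuredTenantIds); // every send now carries exactly [<default>]
  }
}
```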

## Related issues

closes #17513
github-merge-queue bot pushed a commit that referenced this issue Apr 18, 2024
… restarts (#17580)

# Description
Backport of #17545 to `stable/8.5`.

relates to #17513
original author: @npepinpe
github-merge-queue bot pushed a commit that referenced this issue Apr 23, 2024
… restarts (#17601)

# Description
Backport of #17545 to `stable/8.3`.

> [!Note]
> There was a very small conflict in the `JobStreamEndpoint` actuator
> around the `Metadata` record, I guess because of the annotations, but
> nothing major.

relates to #17513
original author: @npepinpe
github-merge-queue bot pushed a commit that referenced this issue Apr 23, 2024
… restarts (#17600)

# Description
Backport of #17545 to `stable/8.4`.

> [!Note]
> There was a very small conflict in the `JobStreamEndpoint` actuator
> around the `Metadata` record, I guess because of the annotations, but
> nothing major.

relates to #17513
original author: @npepinpe
@Zelldon Zelldon added version:8.4.7 Marks an issue as being completely or in parts released in 8.4.7 version:8.5.1 Marks an issue as being completely or in parts released in 8.5.1 version:8.3.11 Marks an issue as being completely or in parts released in 8.3.11 version:8.6.0-alpha1 labels May 7, 2024
@tmetzke tmetzke added the version:8.6.0 Label that represents issues released on version 8.6.0 label Oct 2, 2024
3 participants