
Too many workers can break cluster performance #8267

Closed
Zelldon opened this issue Nov 24, 2021 · 6 comments · Fixed by #8317
Assignees: romansmirnov
Labels: area/performance (Marks an issue as performance related) · kind/bug (Categorizes an issue or PR as a bug) · severity/high (Marks a bug as having a noticeable impact on the user with no known workaround) · version:1.3.0 (Marks an issue as being completely or in parts released in 1.3.0)

Comments

@Zelldon
Member

Zelldon commented Nov 24, 2021

Describe the bug

I ran a new Chaos Day where I observed the following:

With three brokers, three partitions, one starter (300 PI/s), and 16 workers:

worker-16-p3-300-general-reproduce

We can observe that the throughput drops after a short period of time. We are no longer able to complete process instances, while new instances are still being created.

Looks like it is related to #7955 and #8244

To Reproduce

For more details about the setup please take a look at the Chaos Day Summary and the resource files:
16-workers.zip

Expected behavior

We should be able to keep completing the created process instances (300 PI/s) without such issues.

Log/Stacktrace

There is nothing visible in the gateway logs:

log

Analysis

I started a small analysis, took a heap dump and created a flame graph with async profiler.

Based on the metrics, we can see that the gateway is working at its limit (2 CPUs):

cpu

Looking at the gateway threads, we can see that they are idle; this is also visible in the flame graph:
flame.zip

flame

Since we know that instance creation still works, we suspect that the long-polling handler might have an issue.

Taking a look at the code and the heap dump, we found this:

heap

The pending requests seem to be empty, but the queue of requests to be repeated contains 38 entries, and all of them are the same object. It looks like this queue is not cleaned up correctly. Currently, I'm not sure whether this is a problem or not. The requests should be removed here: https://github.com/camunda-cloud/zeebe/blob/b0fec6391814ff7a6f575086520115dccdbe5930/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/InFlightLongPollingActivateJobsRequestsState.java#L94-L103

I think it makes sense to investigate that further.
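
To make the suspicion concrete, here is a hypothetical sketch (not the actual Zeebe source; class and method names are assumptions based on the heap dump) of how the same request object could pile up in the repeat queue if it is re-added without a duplicate check and only partially cleaned up:

```java
import java.util.LinkedList;
import java.util.Queue;

// Hypothetical sketch of the state class -- not the real implementation. The point:
// a request has to be removed from *both* collections, and Queue#remove only removes
// a single occurrence, so duplicates can accumulate.
final class InFlightRequestsStateSketch<R> {
  private final Queue<R> pendingRequests = new LinkedList<>();
  private final Queue<R> activeRequestsToBeRepeated = new LinkedList<>();

  void markRequestAsPending(final R request) {
    pendingRequests.add(request);
  }

  void markRequestToBeRepeated(final R request) {
    // If the same request object is re-added here without a duplicate check, it piles up,
    // matching the 38 identical objects observed in the heap dump.
    activeRequestsToBeRepeated.add(request);
  }

  void removeRequest(final R request) {
    pendingRequests.remove(request);
    // Only removes one occurrence; any remaining duplicates survive the cleanup.
    activeRequestsToBeRepeated.remove(request);
  }
}
```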

Environment:

  • OS:
  • Zeebe Version: 1.3.0-SNAPSHOT
  • Configuration: see above
@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug area/performance Marks an issue as performance related Impact: Availability severity/high Marks a bug as having a noticeable impact on the user with no known workaround labels Nov 24, 2021
@menski
Contributor

menski commented Nov 29, 2021

I think it's worth looking into; maybe it's something we can prioritize for next quarter. The workaround is to have fewer workers.

@romansmirnov romansmirnov self-assigned this Dec 1, 2021
@romansmirnov romansmirnov added this to In progress in Zeebe Dec 1, 2021
@romansmirnov
Member

romansmirnov commented Dec 2, 2021

The pending requests seem to be empty, but the queue of requests to be repeated contains 38 entries, and all of them are the same object. It looks like this queue is not cleaned up correctly. Currently, I'm not sure whether this is a problem or not. The requests should be removed here

When running my own benchmark, I could confirm this observation. For example, in one case, the activeRequestsToBeRepeated Queue had a size of ~34k entries whereby one request was present ~10k times in the queue.

image

image

In my opinion, this is not the root cause of the bug/observation described in this ticket (i.e., jobs are created but not completed). I will therefore create a separate issue to document the multiple occurrences of the same request in the queue and their potential impact.

@romansmirnov
Member

romansmirnov commented Dec 2, 2021

We can observe that the throughput drops after a short period of time. We are no longer able to complete process instances, while new instances are still being created.

To investigate this observation, I collected multiple thread dumps and heap dumps of the gateway over a time frame of up to 3 minutes. The following are my observations.

Actor Thread executes LongPollingActivateJobsHandler#resetFailedAttemptsAndHandlePendingRequests()

According to the thread dumps, the actor thread was exclusively busy executing LongPollingActivateJobsHandler#resetFailedAttemptsAndHandlePendingRequests():

"gateway-scheduler-zb-actors-0" #24 prio=5 os_prio=0 cpu=174011.46ms elapsed=690.09s tid=0x00007f60558b1980 nid=0x2a runnable  [0x00007f5ff1bf9000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Collection.removeIf(java.base@17.0.1/Unknown Source)
        at io.camunda.zeebe.gateway.impl.job.InFlightLongPollingActivateJobsRequestsState.removeObsoleteRequestsAndUpdateMetrics(InFlightLongPollingActivateJobsRequestsState.java:74)
        at io.camunda.zeebe.gateway.impl.job.InFlightLongPollingActivateJobsRequestsState.getPendingRequests(InFlightLongPollingActivateJobsRequestsState.java:67)
        at io.camunda.zeebe.gateway.impl.job.LongPollingActivateJobsHandler.resetFailedAttemptsAndHandlePendingRequests(LongPollingActivateJobsHandler.java:194)
        at io.camunda.zeebe.gateway.impl.job.LongPollingActivateJobsHandler.lambda$onNotification$5(LongPollingActivateJobsHandler.java:130)
        at io.camunda.zeebe.gateway.impl.job.LongPollingActivateJobsHandler$$Lambda$1170/0x00000001008251f8.run(Unknown Source)
        at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73)
        at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39)
        at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122)
        at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:95)
        at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78)
        at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:192)

Increasing Heap Size

The heap size increased over time (to over 80 MB) and then decreased again (to ~30 MB). When looking at the heap dumps, especially the one taken when the heap reaches its peak, it is noticeable that there is one big object of the class ActorTask with a retained size of ~36 MB:

image

ActorTask with thousands of ActorJobs to execute

When investigating the biggest ActorTask object, there are a lot of ActorJob instances in the fastLaneJobs array, but also in the submittedJobs queue.

image

When checking the total number of ActorJobs, according to the heap dump there are ~19k ActorJob instances (note that 19k might not be accurate, because each actor thread has an array of at most 2k recycled ActorJob objects).

image

Almost all of them are submitted to execute LongPollingActivateJobsHandler#resetFailedAttemptsAndHandlePendingRequests():

image

Only occasionally in between are there other ActorJobs submitted for other work:

image

Conclusion

The ActorThread is busy handling the broker's notifications that new jobs are available. As a consequence, the ActorThread does almost exclusively that and only very rarely anything else. Meaning: when 300 PI/s are started and a new job is created for each instance, this results in up to 300 "job available" notifications per second.

From a client perspective (i.e., the workers who communicate with the gateway), the following might happen:

  • They do not get any response, so the request times out on the client side after 72s.
  • They may get supposedly "activated" jobs that have already timed out in the meantime.
  • ...

So, code-wise, the following subscription is the problematic part, as it results in a flood of notifications in such a load scenario:

https://github.com/camunda-cloud/zeebe/blob/a33682e411724107e962faf6a4a069cc2b675c65/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/LongPollingActivateJobsHandler.java#L74
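
To illustrate the conclusion, here is a rough sketch of the pattern (simplified; not the actual LongPollingActivateJobsHandler or actor-scheduler code, and the names are assumptions): every broker notification submits yet another job to the single gateway actor, and nothing deduplicates them.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Rough sketch of the problematic pattern, not the actual Zeebe code.
final class NotificationFloodSketch {
  private final Queue<Runnable> actorJobQueue = new ConcurrentLinkedQueue<>();

  // Called once per "jobs available" notification -- i.e. up to ~300 times per second in the benchmark.
  void onNotification(final String jobType) {
    actorJobQueue.add(() -> resetFailedAttemptsAndHandlePendingRequests(jobType));
  }

  // The single gateway actor thread drains this queue and therefore spends almost all of its
  // time here, which matches the thread dump above.
  void drainOnActorThread() {
    Runnable job;
    while ((job = actorJobQueue.poll()) != null) {
      job.run();
    }
  }

  private void resetFailedAttemptsAndHandlePendingRequests(final String jobType) {
    // reset failed attempts for the job type, iterate the pending long-polling requests,
    // and trigger job activation for each of them ...
  }
}
```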

@romansmirnov
Member

To prove the conclusion above, I executed the same benchmark but without the mentioned subscription. As a result, jobs were completed at almost the same rate as they were created, and process instances were completed as well.

After ~8 minutes

8-minutes

After ~16 minutes

15-minutes

After ~20 minutes

20-minutes

@romansmirnov
Member

romansmirnov commented Dec 2, 2021

Looking at the implementation of how the gateway is notified about newly available jobs, it basically confirms that a notification happens for each job that is created or times out:

https://github.com/camunda-cloud/zeebe/blob/12eeb3f81af0c1b65d2257cab52b71a712562757/engine/src/main/java/io/camunda/zeebe/engine/state/instance/DbJobState.java#L297-L301
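
Conceptually, the behaviour looks like the following simplified sketch (these are not the actual DbJobState method signatures; the callback shape and names are assumptions): the job state invokes the notification callback once per created job and once per timed-out job.

```java
import java.util.function.Consumer;

// Simplified sketch of the behaviour described above (names are assumptions, not the actual
// DbJobState API): every single job creation or timeout produces one gateway notification.
final class JobStateSketch {
  private final Consumer<String> onJobsAvailable; // notification callback, keyed by job type

  JobStateSketch(final Consumer<String> onJobsAvailable) {
    this.onJobsAvailable = onJobsAvailable;
  }

  void create(final long jobKey, final String jobType) {
    // ... persist the job ...
    onJobsAvailable.accept(jobType); // one notification per created job
  }

  void timeout(final long jobKey, final String jobType) {
    // ... make the job activatable again ...
    onJobsAvailable.accept(jobType); // and one per timed-out job
  }
}
```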

Just to recap: why is it necessary to notify the gateway about newly available jobs?

To my understanding: because the latency between job creation and activation (and thus its execution by a worker/client) should be kept as low as possible. This applies especially to the case where the workers can execute more jobs in total than are being created, i.e., there are fewer jobs available than the workers can handle. In such a scenario, some of the workers' long-polling activate-jobs requests would be pending in the gateway, waiting for new jobs. With the notification, they immediately start polling those newly available jobs, which reduces the latency to the minimum possible.

However, in the scenario where there are as many or more jobs available than the workers can handle, the notification is not necessary, at least in the best case 😉 Meaning: a long-polling activate request should always activate at least one job to execute, and there shouldn't be any pending activate requests in the gateway.

The current implementation (more or less) optimizes for the first case (i.e., fewer jobs available than the workers can handle): each notification results in resetting failed attempts and handling pending requests. In a load scenario like the one executed with the benchmark, this produces as many notifications as jobs created at the same time, which keeps the actor busy exclusively executing these notifications.

What could be possible solutions?

  • Batching (see the sketch after this list):
    • Batching on the broker side: For example, the gateway is only notified for every 1000 job creations, or if the last notification happened more than 1 second ago.
    • Batching on the gateway side: Basically the same as on the broker side, only that the actual notification handling (resetting failed attempts and ...) happens after 1000 notifications, or if the last notification happened more than 1 second ago.
  • Collect data and try to infer the usage scenario: For example, collect how many jobs have been activated in the last second and how many were created:
    • if more jobs are created than activated, do nothing (i.e., ignore the notification)
    • if fewer jobs are created than activated, handle the notification
  • Keep it as it is, but try to mitigate the situation: Currently, there is only one actor handling activate-jobs requests for all job types. Instead of having only one actor, there could be some sort of partitioning, for example by job type, so that multiple actors do the work. That way, only one actor would be affected when a lot of jobs are created for a specific job type. However, this would not change the results of the benchmark.
  • ...
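
For illustration, a minimal sketch of the gateway-side batching idea from the list above (the thresholds, names, and the assumption that this runs on the single gateway actor thread are all invented for the example; this is not what was implemented):

```java
// Illustrative sketch of count/time-based batching of notifications. Assumes it runs on the
// single gateway actor thread, so no synchronization is needed.
final class BatchingNotificationSketch {
  private static final int COUNT_THRESHOLD = 1000;
  private static final long TIME_THRESHOLD_MS = 1000;

  private final Runnable handleNotification; // e.g. reset failed attempts + handle pending requests
  private int notificationsSinceLastHandled;
  private long lastHandledAtMs;

  BatchingNotificationSketch(final Runnable handleNotification) {
    this.handleNotification = handleNotification;
  }

  // Called for every broker notification; only every 1000th notification, or at most one per
  // second, actually triggers the (expensive) handling.
  void onNotification() {
    notificationsSinceLastHandled++;
    final long now = System.currentTimeMillis();
    if (notificationsSinceLastHandled >= COUNT_THRESHOLD || now - lastHandledAtMs >= TIME_THRESHOLD_MS) {
      notificationsSinceLastHandled = 0;
      lastHandledAtMs = now;
      handleNotification.run();
    }
  }
}
```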

@romansmirnov
Member

Implementation-wise, I followed the approach of ignoring/skipping incoming notifications if a notification is already submitted for the given job type. Meaning: at any time there will be at most one submitted notification per job type.
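
For illustration, a minimal sketch of that idea (this is not the actual fix in #8317; class and method names are invented for the example):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch: a notification is only submitted to the actor if none is currently in flight
// for the given job type, so at any time there is at most one scheduled notification per type.
final class DedupingNotificationSketch {
  private final Set<String> scheduledJobTypes = ConcurrentHashMap.newKeySet();

  void onNotification(final String jobType, final Runnable submitToActor) {
    // Set#add returns false if a notification for this job type is already scheduled.
    if (scheduledJobTypes.add(jobType)) {
      submitToActor.run();
    }
  }

  // Called by the actor after it has handled the notification for this job type,
  // so that the next incoming notification can be scheduled again.
  void onNotificationHandled(final String jobType) {
    scheduledJobTypes.remove(jobType);
  }
}
```

With such a guard, a burst of notifications for the same job type collapses into a single actor job instead of thousands.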

Running the same benchmark (i.e., 300 PI/s and 16 workers), the broker now constantly activates and completes jobs at the same rate as they are created.

ongoing-notification-30-minutes

@romansmirnov romansmirnov moved this from In progress to Review in progress in Zeebe Dec 9, 2021
@ghost ghost closed this as completed in 8431348 Dec 9, 2021
Zeebe automation moved this from Review in progress to Done Dec 9, 2021
ghost pushed a commit that referenced this issue Dec 16, 2021
8345: [Backport stable/1.1] fix(gtw/jobs): ignore notifications if already scheduled r=romansmirnov a=github-actions[bot]

# Description
Backport of #8317 to `stable/1.1`.

relates to #8267

Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Dec 16, 2021
8346: [Backport stable/1.2] fix(gtw/jobs): ignore notifications if already scheduled r=romansmirnov a=github-actions[bot]

# Description
Backport of #8317 to `stable/1.2`.

relates to #8267

8387: [Backport stable/1.2] fix(journal): always release acquired read lock r=romansmirnov a=github-actions[bot]

# Description
Backport of #8372 to `stable/1.2`.

relates to #8369

Co-authored-by: Roman <roman.smirnov@camunda.com>
@korthout korthout added the version:1.3.0 Marks an issue as being completely or in parts released in 1.3.0 label Jan 4, 2022
@KerstinHebel KerstinHebel removed this from Done in Zeebe Mar 23, 2022
This issue was closed.