
Too many workers can break cluster performance #8267

Closed
Zelldon opened this issue Nov 24, 2021 · 6 comments · Fixed by #8317
Assignees: romansmirnov
Labels: area/performance (Marks an issue as performance related) · kind/bug (Categorizes an issue or PR as a bug) · severity/high (Marks a bug as having a noticeable impact on the user with no known workaround) · version:1.3.0 (Marks an issue as being completely or in parts released in 1.3.0)

Comments

@Zelldon
Member

Zelldon commented Nov 24, 2021

Describe the bug

I ran a new Chaos Day where I observed the following:

With three brokers, three partitions, one starter (300 PI/s), and 16 workers:

worker-16-p3-300-general-reproduce

We can observe that the throughput drops after a short period of time. We are no longer able to complete process instances, while new instances are still being created.

Looks like it is related to #7955 and #8244

To Reproduce

For more details about the setup please take a look at the Chaos Day Summary and the resource files:
16-workers.zip

Expected behavior

We should be able to keep completing the created process instances (300 PI/s) without such issues.

Log/Stacktrace

There is nothing visible in the gateway logs:

log

Analysis

I started a small analysis, took a heap dump and created a flame graph with async profiler.

Based on the metrics, we can see that the gateway is working at its limit (2 CPUs):

cpu

Looking at the gateway threads, we can see that they are idle; this is also visible in the flame graph:
flame.zip

flame

Since we know that instance creation still works, we suspect that the long-polling handler might have an issue.

Taking a look at the code and the heap dump, we found this:

heap

The pending requests seem to be empty, but the queue of requests to be repeated contains 38 entries, and all of them are the same object. It looks like this queue is not cleaned up correctly. Currently, I'm not sure whether this is a problem or not. The requests should be removed here: https://github.com/camunda-cloud/zeebe/blob/b0fec6391814ff7a6f575086520115dccdbe5930/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/InFlightLongPollingActivateJobsRequestsState.java#L94-L103

I think it makes sense to investigate that further.
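
To make the suspicion concrete, here is a hypothetical sketch (not the actual Zeebe source; class and method names are assumptions based on the heap dump) of how the same request object could pile up in the repeat queue if it is re-added without a duplicate check and only partially cleaned up:

```java
import java.util.LinkedList;
import java.util.Queue;

// Hypothetical sketch of the state class -- not the real implementation. The point:
// a request has to be removed from *both* collections, and Queue#remove only removes
// a single occurrence, so duplicates can accumulate.
final class InFlightRequestsStateSketch<R> {
  private final Queue<R> pendingRequests = new LinkedList<>();
  private final Queue<R> activeRequestsToBeRepeated = new LinkedList<>();

  void markRequestAsPending(final R request) {
    pendingRequests.add(request);
  }

  void markRequestToBeRepeated(final R request) {
    // If the same request object is re-added here without a duplicate check, it piles up,
    // matching the 38 identical objects observed in the heap dump.
    activeRequestsToBeRepeated.add(request);
  }

  void removeRequest(final R request) {
    pendingRequests.remove(request);
    // Only removes one occurrence; any remaining duplicates survive the cleanup.
    activeRequestsToBeRepeated.remove(request);
  }
}
```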

Environment:

  • OS:
  • Zeebe Version: 1.3.0-SNAPSHOT
  • Configuration: see above
@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug area/performance Marks an issue as performance related Impact: Availability severity/high Marks a bug as having a noticeable impact on the user with no known workaround labels Nov 24, 2021
@menski
Contributor

menski commented Nov 29, 2021

I think it's worth looking into; maybe it's something we can prioritize for next quarter. The workaround is to have fewer workers.

@romansmirnov romansmirnov self-assigned this Dec 1, 2021
@romansmirnov romansmirnov added this to In progress in Zeebe Dec 1, 2021
@romansmirnov
Member

romansmirnov commented Dec 2, 2021

The pending requests seem to be empty, but the queue of requests to be repeated contains 38 entries, and all of them are the same object. It looks like this queue is not cleaned up correctly. Currently, I'm not sure whether this is a problem or not. The requests should be removed here

When running my own benchmark, I could confirm this observation. For example, in one case, the activeRequestsToBeRepeated Queue had a size of ~34k entries whereby one request was present ~10k times in the queue.

image

image

In my opinion, this is not the root cause of the bug/observation described in this ticket (i.e., jobs are created but not completed). I will therefore create a separate issue to document the multiple occurrences of the same request in the queue and their potential impact.

@romansmirnov
Member

romansmirnov commented Dec 2, 2021

We can observe that the throughput drops after a short period of time. We are no longer able to complete process instances, while new instances are still being created.

To investigate this observation, I collected multiple thread dumps and heap dumps of the gateway over a time frame of up to 3 minutes. The following are my observations.

Actor Thread executes LongPollingActivateJobsHandler#resetFailedAttemptsAndHandlePendingRequests()

According to the thread dumps, the actor thread was exclusively busy executing LongPollingActivateJobsHandler#resetFailedAttemptsAndHandlePendingRequests():

"gateway-scheduler-zb-actors-0" #24 prio=5 os_prio=0 cpu=174011.46ms elapsed=690.09s tid=0x00007f60558b1980 nid=0x2a runnable  [0x00007f5ff1bf9000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Collection.removeIf(java.base@17.0.1/Unknown Source)
        at io.camunda.zeebe.gateway.impl.job.InFlightLongPollingActivateJobsRequestsState.removeObsoleteRequestsAndUpdateMetrics(InFlightLongPollingActivateJobsRequestsState.java:74)
        at io.camunda.zeebe.gateway.impl.job.InFlightLongPollingActivateJobsRequestsState.getPendingRequests(InFlightLongPollingActivateJobsRequestsState.java:67)
        at io.camunda.zeebe.gateway.impl.job.LongPollingActivateJobsHandler.resetFailedAttemptsAndHandlePendingRequests(LongPollingActivateJobsHandler.java:194)
        at io.camunda.zeebe.gateway.impl.job.LongPollingActivateJobsHandler.lambda$onNotification$5(LongPollingActivateJobsHandler.java:130)
        at io.camunda.zeebe.gateway.impl.job.LongPollingActivateJobsHandler$$Lambda$1170/0x00000001008251f8.run(Unknown Source)
        at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73)
        at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39)
        at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122)
        at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:95)
        at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78)
        at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:192)

Increasing Heap Size

The heap size increased over time (to over 80 MB) and then decreased again (to ~30 MB). When looking at the heap dumps, especially the one taken when the heap reaches its peak, it is noticeable that there is one big object of the class ActorTask with a retained size of ~36 MB:

image

ActorTask with thousands of ActorJobs to execute

When investigating the biggest ActorTask object, there are a lot of ActorJob instances in the fastLaneJobs array, but also in the submittedJobs queue.

image

When checking the total number of ActorJobs, according to the heap dump there are ~19k ActorJob instances (note that 19k might not be accurate, because each actor thread has an array of at most 2k recycled ActorJob objects).

image

Almost all of them are submitted to execute LongPollingActivateJobsHandler#resetFailedAttemptsAndHandlePendingRequests():

image

Only occasionally in between are there other ActorJobs submitted for other work:

image

Conclusion

The ActorThread is busy handling the broker's notifications that new jobs are available. As a consequence, the ActorThread does almost exclusively that and only very rarely anything else. Meaning: when 300 PI/s are started and a new job is created for each instance, this results in up to 300 "job available" notifications per second.

From a client perspective (i.e., the workers who communicate with the gateway), the following might happen:

  • They do not get any response, so the request times out on the client side after 72s.
  • They may get supposedly "activated" jobs that have already timed out in the meantime.
  • ...

So, code-wise, the following subscription is the problematic part, as it results in a flood of notifications in such a load scenario:

https://github.com/camunda-cloud/zeebe/blob/a33682e411724107e962faf6a4a069cc2b675c65/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/LongPollingActivateJobsHandler.java#L74
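
To illustrate the conclusion, here is a rough sketch of the pattern (simplified; not the actual LongPollingActivateJobsHandler or actor-scheduler code, and the names are assumptions): every broker notification submits yet another job to the single gateway actor, and nothing deduplicates them.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Rough sketch of the problematic pattern, not the actual Zeebe code.
final class NotificationFloodSketch {
  private final Queue<Runnable> actorJobQueue = new ConcurrentLinkedQueue<>();

  // Called once per "jobs available" notification -- i.e. up to ~300 times per second in the benchmark.
  void onNotification(final String jobType) {
    actorJobQueue.add(() -> resetFailedAttemptsAndHandlePendingRequests(jobType));
  }

  // The single gateway actor thread drains this queue and therefore spends almost all of its
  // time here, which matches the thread dump above.
  void drainOnActorThread() {
    Runnable job;
    while ((job = actorJobQueue.poll()) != null) {
      job.run();
    }
  }

  private void resetFailedAttemptsAndHandlePendingRequests(final String jobType) {
    // reset failed attempts for the job type, iterate the pending long-polling requests,
    // and trigger job activation for each of them ...
  }
}
```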

@romansmirnov
Member

To prove the conclusion above, I executed the same benchmark but without the mentioned subscription. As a result, jobs were completed at almost the same rate as they were created, and process instances were completed as well.

After ~8 minutes

8-minutes

After ~16 minutes

15-minutes

After ~20 minutes

20-minutes

@romansmirnov
Member

romansmirnov commented Dec 2, 2021

Looking at the implementation of how the gateway is notified about newly available jobs, it basically confirms that a notification happens for each job that is created or times out:

https://github.com/camunda-cloud/zeebe/blob/12eeb3f81af0c1b65d2257cab52b71a712562757/engine/src/main/java/io/camunda/zeebe/engine/state/instance/DbJobState.java#L297-L301
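
Conceptually, the behaviour looks like the following simplified sketch (these are not the actual DbJobState method signatures; the callback shape and names are assumptions): the job state invokes the notification callback once per created job and once per timed-out job.

```java
import java.util.function.Consumer;

// Simplified sketch of the behaviour described above (names are assumptions, not the actual
// DbJobState API): every single job creation or timeout produces one gateway notification.
final class JobStateSketch {
  private final Consumer<String> onJobsAvailable; // notification callback, keyed by job type

  JobStateSketch(final Consumer<String> onJobsAvailable) {
    this.onJobsAvailable = onJobsAvailable;
  }

  void create(final long jobKey, final String jobType) {
    // ... persist the job ...
    onJobsAvailable.accept(jobType); // one notification per created job
  }

  void timeout(final long jobKey, final String jobType) {
    // ... make the job activatable again ...
    onJobsAvailable.accept(jobType); // and one per timed-out job
  }
}
```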

Just to recap: why is it necessary to notify the gateway about newly available jobs?

To my understanding: because the latency between job creation and activation (and thus its execution by a worker/client) should be kept as low as possible. This applies especially to the case where the workers can execute more jobs in total than are being created, i.e., there are fewer jobs available than the workers can handle. In such a scenario, some of the workers' long-polling activate-jobs requests would be pending in the gateway, waiting for new jobs. With the notification, they immediately start polling those newly available jobs, which reduces the latency to the minimum possible.

However, in the scenario where there are as many or more jobs available than the workers can handle, the notification is not necessary, at least in the best case 😉 Meaning: a long-polling activate request should always activate at least one job to execute, and there shouldn't be any pending activate requests in the gateway.

The current implementation (more or less) optimizes for the first case (i.e., fewer jobs available than the workers can handle): each notification results in resetting failed attempts and handling pending requests. In a load scenario like the one executed with the benchmark, this produces as many notifications as jobs created at the same time, which keeps the actor busy exclusively executing these notifications.

What could be possible solutions?

  • Batching (see the sketch after this list):
    • Batching on the broker side: For example, the gateway is only notified for every 1000 job creations, or if the last notification happened more than 1 second ago.
    • Batching on the gateway side: Basically the same as on the broker side, only that the actual notification handling (resetting failed attempts and ...) happens after 1000 notifications, or if the last notification happened more than 1 second ago.
  • Collect data and try to infer the usage scenario: For example, collect how many jobs have been activated in the last second and how many were created:
    • if more jobs are created than activated, do nothing (i.e., ignore the notification)
    • if fewer jobs are created than activated, handle the notification
  • Keep it as it is, but try to mitigate the situation: Currently, there is only one actor handling activate-jobs requests for all job types. Instead of having only one actor, there could be some sort of partitioning, for example by job type, so that multiple actors do the work. That way, only one actor would be affected when a lot of jobs are created for a specific job type. However, this would not change the results of the benchmark.
  • ...
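
For illustration, a minimal sketch of the gateway-side batching idea from the list above (the thresholds, names, and the assumption that this runs on the single gateway actor thread are all invented for the example; this is not what was implemented):

```java
// Illustrative sketch of count/time-based batching of notifications. Assumes it runs on the
// single gateway actor thread, so no synchronization is needed.
final class BatchingNotificationSketch {
  private static final int COUNT_THRESHOLD = 1000;
  private static final long TIME_THRESHOLD_MS = 1000;

  private final Runnable handleNotification; // e.g. reset failed attempts + handle pending requests
  private int notificationsSinceLastHandled;
  private long lastHandledAtMs;

  BatchingNotificationSketch(final Runnable handleNotification) {
    this.handleNotification = handleNotification;
  }

  // Called for every broker notification; only every 1000th notification, or at most one per
  // second, actually triggers the (expensive) handling.
  void onNotification() {
    notificationsSinceLastHandled++;
    final long now = System.currentTimeMillis();
    if (notificationsSinceLastHandled >= COUNT_THRESHOLD || now - lastHandledAtMs >= TIME_THRESHOLD_MS) {
      notificationsSinceLastHandled = 0;
      lastHandledAtMs = now;
      handleNotification.run();
    }
  }
}
```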

@romansmirnov
Member

Implementation-wise, I followed the approach of ignoring/skipping incoming notifications if a notification is already submitted for the given job type. Meaning: at any time there will be at most one submitted notification per job type.
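
For illustration, a minimal sketch of that idea (this is not the actual fix in #8317; class and method names are invented for the example):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch: a notification is only submitted to the actor if none is currently in flight
// for the given job type, so at any time there is at most one scheduled notification per type.
final class DedupingNotificationSketch {
  private final Set<String> scheduledJobTypes = ConcurrentHashMap.newKeySet();

  void onNotification(final String jobType, final Runnable submitToActor) {
    // Set#add returns false if a notification for this job type is already scheduled.
    if (scheduledJobTypes.add(jobType)) {
      submitToActor.run();
    }
  }

  // Called by the actor after it has handled the notification for this job type,
  // so that the next incoming notification can be scheduled again.
  void onNotificationHandled(final String jobType) {
    scheduledJobTypes.remove(jobType);
  }
}
```

With such a guard, a burst of notifications for the same job type collapses into a single actor job instead of thousands.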

Running the same benchmark (i.e., 300 PI/s and 16 workers), the broker now constantly activates and completes jobs at the same rate as they are created.

ongoing-notification-30-minutes

@romansmirnov romansmirnov moved this from In progress to Review in progress in Zeebe Dec 9, 2021
@ghost ghost closed this as completed in 8431348 Dec 9, 2021
Zeebe automation moved this from Review in progress to Done Dec 9, 2021
ghost pushed a commit that referenced this issue Dec 16, 2021
8345: [Backport stable/1.1] fix(gtw/jobs): ignore notifications if already scheduled r=romansmirnov a=github-actions[bot]

# Description
Backport of #8317 to `stable/1.1`.

relates to #8267

Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Dec 16, 2021
8346: [Backport stable/1.2] fix(gtw/jobs): ignore notifications if already scheduled r=romansmirnov a=github-actions[bot]

# Description
Backport of #8317 to `stable/1.2`.

relates to #8267

8387: [Backport stable/1.2] fix(journal): always release acquired read lock r=romansmirnov a=github-actions[bot]

# Description
Backport of #8372 to `stable/1.2`.

relates to #8369

Co-authored-by: Roman <roman.smirnov@camunda.com>
@korthout korthout added the version:1.3.0 Marks an issue as being completely or in parts released in 1.3.0 label Jan 4, 2022
@KerstinHebel KerstinHebel removed this from Done in Zeebe Mar 23, 2022
This issue was closed.