Too many workers can break cluster performance #8267
Comments
I think it's worth looking into; maybe it's something we can prioritize for next quarter. The workaround is having fewer workers.
When running my own benchmark, I could confirm this observation (for example, in one case, the […]). In my opinion, this is not the root cause of the actual bug/observation described in this ticket (i.e., jobs are created but not completed). I will therefore create a separate issue to document the multiple occurrences of the same request in the queue and its potential impact.
To investigate this observation, I collected multiple thread dumps and heap dumps of the Gateway over a time frame of up to 3 minutes. The following are my observations.

Actor Thread executes […]

According to the thread dumps, the actor thread was exclusively busy executing […]
Increasing Heap Size

The heap size increased over time (to over 80 MB) and then decreased again (to ~30 MB). When looking at the heap dumps, especially the one where the heap reaches its peak, it is noticeable that there is one big object of the class […]
When investigating the biggest object […]

When checking the total number of […], almost all of them are submitted to execute […]; only sometimes in between are there other […].

Conclusion

The […]

From a client perspective (i.e., the workers that communicate with the gateway), the following might happen:
So, code-wise, the following subscription is problematic, as it results in multiple notifications in such a load scenario:
Looking at the implementation of how the Gateway is notified about newly available jobs basically confirms that a notification happens for each job that is created or timed out.

Just to recap: why is it necessary to notify the gateway about newly available jobs at all? To my understanding, because the latency between job creation and activation (and thus its execution by a worker/client) should be kept as low as possible. This matters especially in the case where the workers can handle more jobs in total than are actually created, i.e., there are fewer jobs available than the workers can handle. In such a scenario, some of the workers' long-polling activate job requests would be pending in the gateway, waiting for new jobs. With the notification, they immediately start polling those newly available jobs, which keeps the latency to a minimum.

However, in the scenario where there are at least as many jobs available as the workers can handle, the notification is not necessary, at least in the best case 😉 Meaning, a long-polling activate request should always activate at least one job, and there shouldn't be any pending activate requests in the gateway.

The current implementation (more or less) optimizes for the first case (i.e., fewer jobs available than the workers can handle): each notification resets the failed attempts and handles the pending requests. In a load scenario like the one executed in the benchmark, this results in as many notifications as jobs created at the same time, which keeps the actor exclusively busy executing these notifications (a minimal sketch of this pattern follows below).

What could be possible solutions?
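To make the problematic pattern concrete, here is a minimal sketch. This is not the actual Zeebe code; the class and method names (`LongPollingJobNotifier`, `onJobAvailable`, etc.) are hypothetical, and the gateway actor is modeled as a plain single-threaded executor:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration: the gateway actor behaves like a
// single-threaded executor that processes submitted tasks sequentially.
final class LongPollingJobNotifier {

  private final ExecutorService actor = Executors.newSingleThreadExecutor();

  // Called once per created (or timed-out) job. At 300 created jobs per
  // second, this enqueues 300 notifications per second, each of which
  // resets the failed attempts and re-handles the pending long-polling
  // requests for the job type - keeping the single actor thread busy.
  void onJobAvailable(final String jobType) {
    actor.submit(() -> {
      resetFailedAttempts(jobType);
      handlePendingRequests(jobType);
    });
  }

  private void resetFailedAttempts(final String jobType) { /* ... */ }

  private void handlePendingRequests(final String jobType) { /* ... */ }
}
```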
Implementation-wise, I followed the approach of ignoring/skipping an incoming notification if a notification is already submitted for the given job type. Meaning, at any time there is at most one submitted notification per job type. Running the same benchmark (i.e., 300 pi/s and 16 workers), the broker constantly activates and completes jobs at the same rate as they are created.
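As a rough sketch of this skip/ignore approach (again with hypothetical names, not the actual patch in #8317): a per-job-type flag ensures that at most one notification is queued at any time:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration of the dedup idea: skip a notification if one
// is already submitted for the same job type.
final class DedupingJobNotifier {

  private final ExecutorService actor = Executors.newSingleThreadExecutor();
  private final Map<String, AtomicBoolean> scheduled = new ConcurrentHashMap<>();

  void onJobAvailable(final String jobType) {
    final AtomicBoolean pending =
        scheduled.computeIfAbsent(jobType, type -> new AtomicBoolean(false));

    // Only the first notification wins; further ones for the same job type
    // are ignored until the scheduled notification has started running.
    if (pending.compareAndSet(false, true)) {
      actor.submit(() -> {
        // Clear the flag first, so a job created while this notification
        // runs still triggers a fresh one afterwards.
        pending.set(false);
        handlePendingRequests(jobType);
      });
    }
  }

  private void handlePendingRequests(final String jobType) { /* ... */ }
}
```

With this, a burst of 300 job creations per second collapses into (at most) one queued notification per job type instead of 300 actor tasks.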
8346: [Backport stable/1.2] fix(gtw/jobs): ignore notifications if already scheduled r=romansmirnov a=github-actions[bot]
Description: Backport of #8317 to `stable/1.2`. Relates to #8267.

8387: [Backport stable/1.2] fix(journal): always release acquired read lock r=romansmirnov a=github-actions[bot]
Description: Backport of #8372 to `stable/1.2`. Relates to #8369.

Co-authored-by: Roman <roman.smirnov@camunda.com>
Describe the bug
I ran a new Chaos Day, where I observed the following:
With three brokers, three partitions, one starter (300 pi/s), and 16 workers, we can observe that the throughput drops after a short period of time. We are no longer able to complete process instances, but new instances are still being created.
It looks like this is related to #7955 and #8244.
To Reproduce
For more details about the setup, please take a look at the Chaos Day Summary and the resource files:
16-workers.zip
Expected behavior
We can complete the 300 instances per second without such issues.
Log/Stacktrace
There is nothing visible in the gateway logs.
Analysis
I started a small analysis: I took a heap dump and created a flame graph with async-profiler.
Based on the metrics, we can see that the gateway is working at its limit (2 CPUs).
Looking at the gateway threads, we can see that they are idle; this is also visible in the flame graph.
flame.zip
Since we know that instance creation still works, we suspect that the long-polling handler might have some issues.
Taking a look at the code and the heap dump, we found the following:
The pending requests seem to be empty, but the list of requests that should be repeated contains 38 entries, and all of these objects are identical. It looks like this list is not cleaned up correctly. Currently, I'm not sure whether this is a problem or not. The requests should be removed here: https://github.com/camunda-cloud/zeebe/blob/b0fec6391814ff7a6f575086520115dccdbe5930/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/InFlightLongPollingActivateJobsRequestsState.java#L94-L103
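As a hedged illustration of how such identical entries could pile up (a hypothetical simplification, not the actual `InFlightLongPollingActivateJobsRequestsState` code): if a request is added to the to-be-repeated list once per notification but removed only once on completion, duplicates remain:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification: a request can be enqueued for repetition
// several times, but removal only drops the first occurrence.
final class RepeatedRequestsState<R> {

  private final List<R> activeRequestsToBeRepeated = new ArrayList<>();

  void markForRepetition(final R request) {
    // Without a contains() check, a request that is notified several times
    // ends up in the list several times.
    activeRequestsToBeRepeated.add(request);
  }

  void onRequestCompleted(final R request) {
    // List.remove(Object) removes only the first matching entry, so the
    // remaining duplicate entries are never cleaned up.
    activeRequestsToBeRepeated.remove(request);
  }

  int size() {
    return activeRequestsToBeRepeated.size();
  }
}
```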
I think it makes sense to investigate that further.
Environment: