Iterate over long polling and notification mechanism #4542
Labels
kind/toil
Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc.
scope/broker
Marks an issue or PR to appear in the broker section of the changelog
Description
I had a quick look at how long polling and notification work, and to me the current mechanism seems a bit suboptimal.
Scenario
Say we have multiple workers that all listen for the same job type X. After starting, all of them begin to poll, which means they try to activate jobs on all partitions via the gateway. If no jobs are available, all activation requests block. Say we have 12 workers: then we have 12 blocked requests, each of which wants to activate 120 jobs.
If we now create one workflow instance, one job of type X is created. The gateway is notified that a job of this type was created. It then re-enables the blocked requests and iterates over all of them. This means it starts 12 activation cycles (one per worker), each of which iterates over all partitions to activate a job of type X. One will succeed; the others will fail and block again.
Possible Improvements
What is missing is information about which partition the job was created on, so currently the gateway has to iterate over all partitions, which adds latency to handling the actual job. Furthermore, we re-activate all blocked requests when waking just one would probably be sufficient.
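To make the cost concrete, here is a toy model of the scenario. It is only a sketch and not the actual gateway code: `ToyGateway`, `wake_all`, `wake_one`, and `run` are hypothetical names, the partition count of 3 is an assumption (the scenario above does not state one), and it assumes each woken request polls every partition, as described above. It counts partition polls for the current behaviour against a notification that carries the partition id and wakes a single blocked request.

```python
class ToyGateway:
    """Toy model of blocked activation requests; not the real gateway."""

    def __init__(self, partition_count, workers):
        # partition id -> list of pending job types on that partition
        self.jobs = {pid: [] for pid in range(partition_count)}
        self.blocked = list(workers)  # workers with a blocked request
        self.polls = 0                # number of partition activation attempts

    def create_job(self, partition_id, job_type):
        self.jobs[partition_id].append(job_type)

    def _poll(self, partition_id, job_type):
        """Try to activate one job of the given type on one partition."""
        self.polls += 1
        pending = self.jobs[partition_id]
        if job_type in pending:
            pending.remove(job_type)
            return True
        return False

    def wake_all(self, job_type):
        """Current behaviour: every blocked request retries on every partition."""
        served = []
        for worker in list(self.blocked):
            # The list comprehension polls all partitions, with no early exit.
            results = [self._poll(pid, job_type) for pid in self.jobs]
            if any(results):
                self.blocked.remove(worker)
                served.append(worker)
        return served

    def wake_one(self, job_type, partition_id):
        """Proposed: the notification names the partition; wake one request."""
        if not self.blocked:
            return []
        worker = self.blocked[0]
        if self._poll(partition_id, job_type):
            self.blocked.pop(0)
            return [worker]
        return []


def run(strategy, workers=12, partitions=3):
    """Create one job of type X and return (polls, served workers, still blocked)."""
    gateway = ToyGateway(partitions, [f"worker-{i}" for i in range(workers)])
    gateway.create_job(partition_id=1, job_type="X")
    if strategy == "wake_all":
        served = gateway.wake_all("X")
    else:
        served = gateway.wake_one("X", partition_id=1)
    return gateway.polls, served, len(gateway.blocked)
```

Under these assumptions, `run("wake_all")` performs 36 partition polls (12 requests times 3 partitions) to hand out a single job, while `run("wake_one")` performs 1. In both cases exactly one worker is served and 11 stay blocked.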