Description
Shift broadcast commands from running in the server thread to running in the scheduler main loop.
Find another way to return the command outcome (see #3329).
The Problem
At present, broadcast commands are not queued on the scheduler to be run in the main loop; they are actioned in the server thread.
I suspect the reason for this is to allow the server to synchronously respond to the client, i.e. return an error if there was any issue applying the broadcast (you really don't want this command to fail quietly).
In order to prevent contention issues (?), there is a locking mechanism:
cylc-flow/cylc/flow/broadcast_mgr.py, line 71 in 0e738b4:
    self.lock = RLock()
When large numbers of broadcast commands are issued in parallel, this can cause ClientTimeout errors in the broadcast command. It can also cause other commands to log spurious ClientTimeout errors. These errors are misleading: the commands have been queued (so will be actioned by the main loop in due course); the client just hasn't received the "command queued" message, presumably because the server is busy actioning a broadcast.
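To illustrate the suspected mechanism, here is a minimal, self-contained sketch (not Cylc code). It assumes each request handler holds a shared lock for the duration of the broadcast, so parallel requests are answered one after another and the later clients wait longest. The names `broadcast_lock`, `handle_broadcast`, `simulate` and the timeout value are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the RLock held by the broadcast manager (illustrative only).
broadcast_lock = threading.RLock()


def handle_broadcast(start, work_seconds):
    """Hypothetical server-thread handler: holds the lock while applying."""
    with broadcast_lock:
        time.sleep(work_seconds)  # pretend to apply the broadcast
    # Time this client waited for its response, measured from a common start.
    return time.monotonic() - start


def simulate(n_clients=5, work_seconds=0.05):
    """Issue n_clients parallel broadcasts and return each client's wait."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        futures = [
            pool.submit(handle_broadcast, start, work_seconds)
            for _ in range(n_clients)
        ]
        return sorted(f.result() for f in futures)


if __name__ == "__main__":
    waits = simulate()
    client_timeout = 0.125  # hypothetical client timeout, in seconds
    late = [w for w in waits if w > client_timeout]
    print(f"waits: {[round(w, 2) for w in waits]}")
    print(f"{len(late)} of {len(waits)} clients would have timed out")
```

Because the handlers are serialised by the lock, the longest wait grows roughly linearly with the number of parallel requests, which is consistent with timeouts only appearing under heavy parallel broadcasting.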
This issue is fairly easy to work around:
- Merge multiple single broadcasts into a single cylc broadcast command (generally a good idea where possible).
- Use Cylc queues to limit the number of broadcasting tasks, preventing parallel broadcasts coming in from multiple tasks.
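As a rough illustration of the first workaround, the sketch below builds one `cylc broadcast` argument list from several settings rather than issuing one command per setting. It assumes the `-p` (cycle point), `-n` (namespace) and `-s` (setting) options can each be repeated; check `cylc broadcast --help` for your version. The helper name `merged_broadcast_argv` and the example workflow ID are hypothetical.

```python
def merged_broadcast_argv(workflow_id, settings, namespaces=(), points=()):
    """Build a single `cylc broadcast` argv covering all given settings.

    Assumes -p, -n and -s are repeatable options (verify with --help).
    """
    argv = ["cylc", "broadcast", workflow_id]
    for point in points:
        argv += ["-p", point]
    for namespace in namespaces:
        argv += ["-n", namespace]
    for setting in settings:
        argv += ["-s", setting]
    return argv


if __name__ == "__main__":
    argv = merged_broadcast_argv(
        "my-workflow",  # hypothetical workflow ID
        ["[environment]A=1", "[environment]B=2"],
        namespaces=["foo"],
    )
    # One client request instead of two.
    print(" ".join(argv))
```

The resulting list could be passed to `subprocess.run` from a task script, so each task makes one request to the scheduler however many settings it broadcasts.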
The Solution
After #3329 has been implemented, we will be able to return the command status rather than just "command queued".
As a result, we will be able to handle broadcasts using the regular command queue like all the other commands (assuming I'm right that this is the reason they are implemented in this way).
This might create an inter-version compatibility issue.
This is relevant to #6429, which would be easier after this change.
This would probably remove the need for the locking mechanism and resolve the timeout issue.
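The general pattern proposed here can be sketched as follows. This is a simplified illustration, not the design from #3329 and not how the Cylc scheduler is implemented: the server thread queues the command with a `Future`, and the main loop runs it and records the outcome (or the error) for the client. All names (`submit_command`, `main_loop_iteration`, `put_broadcast`) are hypothetical.

```python
import queue
from concurrent.futures import Future

command_queue = queue.Queue()
broadcasts = {}  # only ever touched by the main loop, so no lock is needed


def submit_command(func, *args):
    """Server thread: queue a command and return a Future for its outcome."""
    future = Future()
    command_queue.put((func, args, future))
    return future


def main_loop_iteration():
    """Main loop: run every queued command in order, recording outcomes."""
    while True:
        try:
            func, args, future = command_queue.get_nowait()
        except queue.Empty:
            return
        try:
            future.set_result(func(*args))
        except Exception as exc:  # report failures back to the client
            future.set_exception(exc)


def put_broadcast(namespace, key, value):
    """Hypothetical broadcast command, executed in the main loop."""
    if not key:
        raise ValueError("empty setting name")
    broadcasts.setdefault(namespace, {})[key] = value
    return f"broadcast set: [{namespace}]{key}={value}"


if __name__ == "__main__":
    ok = submit_command(put_broadcast, "root", "VAR", "1")
    bad = submit_command(put_broadcast, "root", "", "1")
    main_loop_iteration()
    print(ok.result(timeout=1))
    print(repr(bad.exception(timeout=1)))
```

Since only the main loop mutates the broadcast state in this sketch, the lock becomes unnecessary, and a failed broadcast still reaches the client as an error rather than failing quietly.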