- start up fleet on a single node
- start 50 simple units - they should all be scheduled to the one node
- kill and restart fleet: sudo systemctl kill --signal SIGKILL fleet && sudo systemctl start --no-block fleet
- unload all 50 units
At this point, I see log lines repeated every 5s for several of the units:
Sep 17 00:13:14 deis-1 fleetd[10287]: INFO reconcile.go:267: AgentReconciler task chain failed: chain={rugged-newsreel_v2.web.6-announce.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Looking at fleetctl list-units, I still see state for these problem units, but not for the units that appeared to unload successfully.
Now for the interesting piece: I instrumented fleet to log the systemd job IDs whenever it does something. For this particular unit, I see that fleet called the StartUnit dbus method, which returned the job ID 929681. During all this commotion, I was also running dbus-monitor (sudo dbus-monitor --system "type=signal"). I see the JobNew signal for 929681, but no JobRemoved!
So this looks like fleet, go-systemd, and godbus are all doing the right thing. The lack of a JobRemoved event means that fleet is going to hang on the StartUnit call and block all other operations for a given unit.
The only hint I have is that fleet calls Reload a lot: several times a second when it's starting 50+ units all at once. Maybe jobs are getting dropped inside of systemd during the Reload? I seem to remember running into a problem like this in the past.