Unkillable units - possible systemd bug

1. Start up fleet on a single node.
2. Start 50 simple units. They should all be scheduled to that one node.
3. `sudo systemctl kill --signal SIGKILL fleet && sudo systemctl start --no-block fleet`
4. Unload all 50 units.

At this point, I see log lines repeated every 5s for several of the units:

```
Sep 17 00:13:14 deis-1 fleetd[10287]: INFO reconcile.go:267: AgentReconciler task chain failed: chain={rugged-newsreel_v2.web.6-announce.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
```

Looking at `fleetctl list-units`, the problem units are still listed with their state, while the units that appeared to unload successfully no longer show up.

Now for the interesting piece: I instrumented fleet to log the systemd job IDs whenever it does something. For this particular unit, I see that fleet called the `StartUnit` dbus method, which returned the job ID 929681. During all this commotion, I was also running dbus-monitor (`sudo dbus-monitor --system "type=signal"`). I see the `JobNew` signal for 929681, but no `JobRemoved`!

So this looks like fleet, go-systemd, and godbus are all doing the right thing. The lack of a JobRemoved event means that fleet is going to hang on the `StartUnit` call and block all other operations for a given unit.
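To make the blocking behavior concrete, here is a toy model of a per-unit "one task at a time" guard. This is not fleet's actual code: `taskTracker`, `begin`, and `finish` are hypothetical names, and the only thing taken from the log above is the error text. The point is that if the task holding a unit never finishes (because the call waiting on `JobRemoved` never returns), every later task for that unit is rejected, while other units are unaffected.

```go
package main

import (
	"errors"
	"sync"
)

// errInFlight mirrors the "task already in flight" message in the log above.
var errInFlight = errors.New("task already in flight")

// taskTracker is a hypothetical guard that allows at most one running task
// per unit. It only models the symptom; it is not fleet's implementation.
type taskTracker struct {
	mu       sync.Mutex
	inFlight map[string]string // unit name -> task currently running
}

func newTaskTracker() *taskTracker {
	return &taskTracker{inFlight: make(map[string]string)}
}

// begin claims the unit for a task, or fails if another task still holds it.
func (t *taskTracker) begin(unit, task string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, busy := t.inFlight[unit]; busy {
		return errInFlight
	}
	t.inFlight[unit] = task
	return nil
}

// finish releases the unit. In the scenario described above this would never
// run for the stuck task, so the unit stays claimed indefinitely.
func (t *taskTracker) finish(unit string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.inFlight, unit)
}
```

With this model, a `begin("x.service", "UnloadUnit")` issued while an earlier start task on `x.service` is still pending returns `errInFlight` on every retry, which matches the line repeating every 5s.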

The only hint I have is that fleet calls Reload a lot: several times a second when it's starting 50+ units all at once. Maybe jobs are getting dropped inside of systemd during the Reload? I seem to remember running into a problem like this in the past.

