This repository was archived by the owner on Jan 30, 2020. It is now read-only.

Unkillable units - possible systemd bug #903

@bcwaldon

  1. start up fleet on a single node
  2. start 50 simple units - they should all be scheduled to the one node
  3. sudo systemctl kill --signal SIGKILL fleet && sudo systemctl start --no-block fleet
  4. unload all 50 units

At this point, I see log lines repeated every 5s for several of the units:

Sep 17 00:13:14 deis-1 fleetd[10287]: INFO reconcile.go:267: AgentReconciler task chain failed: chain={rugged-newsreel_v2.web.6-announce.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight

Looking at fleetctl list-units, I see the state for these problem units, but not the units that appeared to unload successfully.

Now for the interesting piece: I instrumented fleet to log the systemd job IDs whenever it does something. For this particular unit, I see that fleet called the StartUnit dbus method, which returned the job ID 929681. During all this commotion, I was also running dbus-monitor (sudo dbus-monitor --system "type=signal"). I see the JobNew signal for 929681, but no JobRemoved!
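The JobNew/JobRemoved pairing described above can be checked mechanically. The following is a minimal sketch, assuming the signals have already been transcribed into a simple list; the `jobEvent` type, the `pendingJobs` function, and job ID 929680 are hypothetical and used only for illustration (929681 is the job ID from this report). It does not parse dbus-monitor output.

```go
package main

import (
	"fmt"
	"sort"
)

// jobEvent is a hypothetical, simplified record of one systemd job signal,
// e.g. as transcribed by hand from dbus-monitor output.
type jobEvent struct {
	Signal string // "JobNew" or "JobRemoved"
	ID     uint32 // job ID carried by the signal
}

// pendingJobs returns, in ascending order, the IDs that have a JobNew
// with no later JobRemoved.
func pendingJobs(events []jobEvent) []uint32 {
	open := make(map[uint32]bool)
	for _, ev := range events {
		switch ev.Signal {
		case "JobNew":
			open[ev.ID] = true
		case "JobRemoved":
			delete(open, ev.ID)
		}
	}
	ids := make([]uint32, 0, len(open))
	for id := range open {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	events := []jobEvent{
		{"JobNew", 929680}, // illustrative job that completes normally
		{"JobNew", 929681}, // the job from this report
		{"JobRemoved", 929680},
	}
	fmt.Println(pendingJobs(events)) // prints [929681]
}
```

Any ID left in the result is a job that systemd announced but never reported as finished.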

So this looks like fleet, go-systemd, and godbus are all doing the right thing. The lack of a JobRemoved event means that fleet is going to hang on the StartUnit call and block all other operations for a given unit.
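To illustrate why a missing JobRemoved blocks the caller, and how a bounded wait would behave differently, here is a minimal sketch using only the Go standard library. It is not fleet's or go-systemd's actual code: `waitForJob`, `errJobTimeout`, and the `done` channel are hypothetical stand-ins for however the caller learns that a job has finished.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errJobTimeout is a hypothetical sentinel error for a job whose
// completion signal never arrived.
var errJobTimeout = errors.New("timed out waiting for job result")

// waitForJob blocks until a result arrives on done or the timeout elapses.
// Without the timer case, a lost JobRemoved would block here forever.
func waitForJob(done <-chan string, timeout time.Duration) (string, error) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case result := <-done:
		return result, nil
	case <-timer.C:
		return "", errJobTimeout
	}
}

func main() {
	// A job whose completion was delivered.
	finished := make(chan string, 1)
	finished <- "done"
	fmt.Println(waitForJob(finished, time.Second))

	// A job whose JobRemoved never arrives: nothing is ever sent.
	lost := make(chan string)
	fmt.Println(waitForJob(lost, 50*time.Millisecond))
}
```

A timeout only bounds the wait; the caller would still have to decide what to do with a job whose real state is unknown.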

The only hint I have is that fleet calls Reload a lot: several times a second when it's starting 50+ units all at once. Maybe jobs are getting dropped inside of systemd during the Reload? I seem to remember running into a problem like this in the past.
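One way to probe the Reload hypothesis would be to reduce how often Reload is issued and see whether jobs still go missing. The sketch below shows the general idea of coalescing many reload requests into a single call. It is an assumption-laden illustration, not something fleet implements: `reloadCoalescer`, `Request`, and `Flush` are hypothetical names, and the `reload` callback stands in for the real call to systemd.

```go
package main

import (
	"fmt"
	"sync"
)

// reloadCoalescer is a hypothetical helper that collapses any number of
// reload requests into at most one reload per Flush.
type reloadCoalescer struct {
	mu     sync.Mutex
	dirty  bool
	reload func() error // stand-in for the real daemon reload call
}

// Request records that a reload is needed without performing it.
func (c *reloadCoalescer) Request() {
	c.mu.Lock()
	c.dirty = true
	c.mu.Unlock()
}

// Flush calls reload once if any request arrived since the last
// successful flush. On failure the request is kept for the next Flush.
func (c *reloadCoalescer) Flush() error {
	c.mu.Lock()
	pending := c.dirty
	c.dirty = false
	c.mu.Unlock()
	if !pending {
		return nil
	}
	if err := c.reload(); err != nil {
		c.mu.Lock()
		c.dirty = true
		c.mu.Unlock()
		return err
	}
	return nil
}

func main() {
	calls := 0
	c := &reloadCoalescer{reload: func() error { calls++; return nil }}

	// 50 units each ask for a reload, but only one reload is issued.
	for i := 0; i < 50; i++ {
		c.Request()
	}
	_ = c.Flush()
	_ = c.Flush() // nothing pending, so no second reload
	fmt.Println("reload calls:", calls) // prints reload calls: 1
}
```

If the unkillable units disappear when reloads are batched like this, that would support the theory that jobs are being lost during Reload.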
