Skip to content
This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

unit with lengthy deactivation procedure remains inactive (No such file or directory bug) #1158

Closed
bcwaldon opened this issue Mar 20, 2015 · 11 comments · Fixed by #1168
Closed

Comments

@bcwaldon
Copy link
Contributor

I'm scheduling the following units bar.service and baz.service to a single-node cluster:

core@core-01 ~ $ cat bar.service
[Service]
ExecStart=/usr/bin/sleep infinity

core@core-01 ~ $ cat baz.service
[Unit]
After=bar.service
BindsTo=bar.service
[Service]
ExecStart=/usr/bin/sleep infinity
ExecStop=/usr/bin/sleep 20

First, start the two units:

core@core-01 ~ $ fleetctl start --no-block bar baz
Triggered unit bar.service start
Triggered unit baz.service start

Check the status of fleet and systemd:

core@core-01 ~ $ fleetctl list-unit-files && fleetctl list-units && systemctl status bar baz
UNIT        HASH    DSTATE      STATE       TARGET
bar.service 40ea664 launched    launched    a84622dd.../172.17.8.101
baz.service 221b757 launched    launched    a84622dd.../172.17.8.101
UNIT        MACHINE             ACTIVE  SUB
bar.service a84622dd.../172.17.8.101    active  running
baz.service a84622dd.../172.17.8.101    active  running
● bar.service
   Loaded: loaded (/run/fleet/units/bar.service; linked-runtime; vendor preset: disabled)
   Active: active (running) since Fri 2015-03-20 03:21:50 UTC; 1s ago
 Main PID: 2193 (sleep)
   CGroup: /system.slice/bar.service
           └─2193 /usr/bin/sleep infinity

Mar 20 03:21:50 core-01 systemd[1]: Starting bar.service...
Mar 20 03:21:50 core-01 systemd[1]: Started bar.service.

● baz.service
   Loaded: loaded (/run/fleet/units/baz.service; linked-runtime; vendor preset: disabled)
   Active: active (running) since Fri 2015-03-20 03:21:50 UTC; 1s ago
 Main PID: 2194 (sleep)
   CGroup: /system.slice/baz.service
           └─2194 /usr/bin/sleep infinity

Mar 20 03:21:50 core-01 systemd[1]: Starting baz.service...
Mar 20 03:21:50 core-01 systemd[1]: Started baz.service.

Everything is OK. Now unload the units:

core@core-01 ~ $ fleetctl unload --no-block bar baz
Triggered unit bar.service unload
Triggered unit baz.service unload
core@core-01 ~ $ fleetctl list-unit-files && fleetctl list-units && systemctl status bar baz
UNIT        HASH    DSTATE      STATE       TARGET
bar.service 40ea664 inactive    inactive    -
baz.service 221b757 inactive    inactive    -
UNIT    MACHINE ACTIVE  SUB
● bar.service
   Loaded: loaded (/run/fleet/units/bar.service; linked-runtime; vendor preset: disabled)
   Active: active (running) since Fri 2015-03-20 03:21:50 UTC; 12s ago
 Main PID: 2193 (sleep)
   CGroup: /system.slice/bar.service
           └─2193 /usr/bin/sleep infinity

Mar 20 03:21:50 core-01 systemd[1]: Starting bar.service...
Mar 20 03:21:50 core-01 systemd[1]: Started bar.service.

Warning: Unit file changed on disk, 'systemctl daemon-reload' recommended.

● baz.service
   Loaded: loaded (/run/fleet/units/baz.service; linked-runtime; vendor preset: disabled)
   Active: deactivating (stop) since Fri 2015-03-20 03:22:00 UTC; 2s ago
 Main PID: 2194 (sleep);         : 2215 (sleep)
   CGroup: /system.slice/baz.service
           ├─2194 /usr/bin/sleep infinity
           └─control
             └─2215 /usr/bin/sleep 20

Mar 20 03:21:50 core-01 systemd[1]: Starting baz.service...
Mar 20 03:21:50 core-01 systemd[1]: Started baz.service.
Mar 20 03:22:00 core-01 systemd[1]: Stopping baz.service...

Warning: Unit file changed on disk, 'systemctl daemon-reload' recommended.

fleetctl stops reporting state for the units immediately, but baz.service is still deactivating. Now start the two units again before baz.service finishes its ExecStop:

core@core-01 ~ $ fleetctl start --no-block bar baz
Triggered unit bar.service start
Triggered unit baz.service start

Check the status of fleet and systemd immediately:

core@core-01 ~ $ fleetctl list-unit-files && fleetctl list-units && systemctl status bar baz
UNIT        HASH    DSTATE      STATE       TARGET
bar.service 40ea664 launched    launched    a84622dd.../172.17.8.101
baz.service 221b757 launched    launched    a84622dd.../172.17.8.101
UNIT        MACHINE             ACTIVE      SUB
bar.service a84622dd.../172.17.8.101    active      running
baz.service a84622dd.../172.17.8.101    deactivating    stop
● bar.service
   Loaded: loaded (/run/fleet/units/bar.service; linked-runtime; vendor preset: disabled)
   Active: active (running) since Fri 2015-03-20 03:22:15 UTC; 2s ago
 Main PID: 2269 (sleep)
   CGroup: /system.slice/bar.service
           └─2269 /usr/bin/sleep infinity

Mar 20 03:22:15 core-01 systemd[1]: Started bar.service.

● baz.service
   Loaded: not-found (Reason: No such file or directory)
   Active: deactivating (stop) since Fri 2015-03-20 03:22:00 UTC; 17s ago
 Main PID: 2194 (sleep);         : 2215 (sleep)
   CGroup: /system.slice/baz.service
           ├─2194 /usr/bin/sleep infinity
           └─control
             └─2215 /usr/bin/sleep 20

Mar 20 03:21:50 core-01 systemd[1]: Starting baz.service...
Mar 20 03:21:50 core-01 systemd[1]: Started baz.service.
Mar 20 03:22:00 core-01 systemd[1]: Stopping baz.service...

baz.service is still not done deactivating, but oddly enough, it's still not-found. Checking the status after deactivation is complete:

core@core-01 ~ $ fleetctl list-unit-files && fleetctl list-units && systemctl status bar baz
UNIT        HASH    DSTATE      STATE       TARGET
bar.service 40ea664 launched    launched    a84622dd.../172.17.8.101
baz.service 221b757 launched    launched    a84622dd.../172.17.8.101
UNIT        MACHINE             ACTIVE      SUB
bar.service a84622dd.../172.17.8.101    active      running
baz.service a84622dd.../172.17.8.101    inactive    dead
● bar.service
   Loaded: loaded (/run/fleet/units/bar.service; linked-runtime; vendor preset: disabled)
   Active: active (running) since Fri 2015-03-20 03:22:15 UTC; 10s ago
 Main PID: 2269 (sleep)
   CGroup: /system.slice/bar.service
           └─2269 /usr/bin/sleep infinity

Mar 20 03:22:15 core-01 systemd[1]: Started bar.service.

● baz.service
   Loaded: loaded (/run/fleet/units/baz.service; linked-runtime; vendor preset: disabled)
   Active: inactive (dead)

Mar 20 03:20:33 core-01 systemd[1]: Starting baz.service...
Mar 20 03:20:38 core-01 systemd[1]: Starting baz.service...
Mar 20 03:20:38 core-01 systemd[1]: Starting baz.service...
Mar 20 03:20:38 core-01 systemd[1]: Started baz.service.
Mar 20 03:21:18 core-01 systemd[1]: Stopping baz.service...
Mar 20 03:21:38 core-01 systemd[1]: Stopped baz.service.
Mar 20 03:21:50 core-01 systemd[1]: Starting baz.service...
Mar 20 03:21:50 core-01 systemd[1]: Started baz.service.
Mar 20 03:22:00 core-01 systemd[1]: Stopping baz.service...
Mar 20 03:22:20 core-01 systemd[1]: Stopped baz.service.

Now baz.service is inactive. Given that I just called fleetctl start on it, though, I would expect it to be active. Checking the logs, I see the dreaded No such file or directory error (fifth from the bottom):

Mar 20 03:21:49 core-01 fleetd[963]: INFO engine.go:272: Scheduled Unit(bar.service) to Machine(a84622dda07549d0b4d855ca2b78948c)
Mar 20 03:21:49 core-01 fleetd[963]: INFO reconciler.go:163: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: bar.service, MachineID: a84622dda07549d0b4d855ca2b78948c, Reason: "target state launched and unit not scheduled"}
Mar 20 03:21:49 core-01 fleetd[963]: INFO engine.go:272: Scheduled Unit(baz.service) to Machine(a84622dda07549d0b4d855ca2b78948c)
Mar 20 03:21:49 core-01 fleetd[963]: INFO reconciler.go:163: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: baz.service, MachineID: a84622dda07549d0b4d855ca2b78948c, Reason: "target state launched and unit not scheduled"}
Mar 20 03:21:50 core-01 fleetd[963]: INFO manager.go:262: Writing systemd unit bar.service (44b)
Mar 20 03:21:50 core-01 fleetd[963]: INFO manager.go:262: Writing systemd unit baz.service (117b)
Mar 20 03:21:50 core-01 fleetd[963]: INFO manager.go:134: Triggered systemd unit bar.service start: job=7560
Mar 20 03:21:50 core-01 fleetd[963]: INFO manager.go:134: Triggered systemd unit baz.service start: job=7640
Mar 20 03:21:50 core-01 fleetd[963]: INFO reconcile.go:311: AgentReconciler completed task: type=LoadUnit job=bar.service reason="unit scheduled here but not loaded"
Mar 20 03:21:50 core-01 fleetd[963]: INFO reconcile.go:311: AgentReconciler completed task: type=LoadUnit job=baz.service reason="unit scheduled here but not loaded"
Mar 20 03:21:50 core-01 fleetd[963]: INFO reconcile.go:311: AgentReconciler completed task: type=StartUnit job=bar.service reason="unit currently loaded but desired state is launched"
Mar 20 03:21:50 core-01 fleetd[963]: INFO reconcile.go:311: AgentReconciler completed task: type=StartUnit job=baz.service reason="unit currently loaded but desired state is launched"
Mar 20 03:22:00 core-01 fleetd[963]: INFO manager.go:145: Triggered systemd unit bar.service stop: job=7721
Mar 20 03:22:00 core-01 fleetd[963]: INFO manager.go:275: Removing systemd unit bar.service
Mar 20 03:22:00 core-01 fleetd[963]: INFO manager.go:145: Triggered systemd unit baz.service stop: job=7722
Mar 20 03:22:00 core-01 fleetd[963]: INFO manager.go:275: Removing systemd unit baz.service
Mar 20 03:22:00 core-01 fleetd[963]: INFO reconcile.go:311: AgentReconciler completed task: type=UnloadUnit job=bar.service reason="unit loaded but not scheduled here"
Mar 20 03:22:00 core-01 fleetd[963]: INFO reconcile.go:311: AgentReconciler completed task: type=UnloadUnit job=baz.service reason="unit loaded but not scheduled here"
Mar 20 03:22:00 core-01 fleetd[963]: INFO engine.go:257: Unscheduled Job(bar.service) from Machine(a84622dda07549d0b4d855ca2b78948c)
Mar 20 03:22:00 core-01 fleetd[963]: INFO reconciler.go:163: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: bar.service, MachineID: a84622dda07549d0b4d855ca2b78948c, Reason: "target state inactive"}
Mar 20 03:22:00 core-01 fleetd[963]: INFO engine.go:257: Unscheduled Job(baz.service) from Machine(a84622dda07549d0b4d855ca2b78948c)
Mar 20 03:22:00 core-01 fleetd[963]: INFO reconciler.go:163: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: baz.service, MachineID: a84622dda07549d0b4d855ca2b78948c, Reason: "target state inactive"}
Mar 20 03:22:14 core-01 fleetd[963]: INFO engine.go:272: Scheduled Unit(bar.service) to Machine(a84622dda07549d0b4d855ca2b78948c)
Mar 20 03:22:14 core-01 fleetd[963]: INFO reconciler.go:163: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: bar.service, MachineID: a84622dda07549d0b4d855ca2b78948c, Reason: "target state launched and unit not scheduled"}
Mar 20 03:22:14 core-01 fleetd[963]: INFO engine.go:272: Scheduled Unit(baz.service) to Machine(a84622dda07549d0b4d855ca2b78948c)
Mar 20 03:22:14 core-01 fleetd[963]: INFO reconciler.go:163: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: baz.service, MachineID: a84622dda07549d0b4d855ca2b78948c, Reason: "target state launched and unit not scheduled"}
Mar 20 03:22:15 core-01 fleetd[963]: INFO manager.go:262: Writing systemd unit bar.service (44b)
Mar 20 03:22:15 core-01 fleetd[963]: INFO manager.go:198: Instructing systemd to reload units
Mar 20 03:22:15 core-01 fleetd[963]: INFO manager.go:262: Writing systemd unit baz.service (117b)
Mar 20 03:22:15 core-01 fleetd[963]: INFO manager.go:134: Triggered systemd unit bar.service start: job=7806
Mar 20 03:22:15 core-01 fleetd[963]: ERROR manager.go:136: Failed to trigger systemd unit baz.service start: Unit baz.service failed to load: No such file or directory.
Mar 20 03:22:15 core-01 fleetd[963]: INFO reconcile.go:311: AgentReconciler completed task: type=LoadUnit job=bar.service reason="unit scheduled here but not loaded"
Mar 20 03:22:15 core-01 fleetd[963]: INFO reconcile.go:311: AgentReconciler completed task: type=LoadUnit job=baz.service reason="unit scheduled here but not loaded"
Mar 20 03:22:15 core-01 fleetd[963]: INFO reconcile.go:311: AgentReconciler completed task: type=StartUnit job=bar.service reason="unit currently loaded but desired state is launched"
Mar 20 03:22:15 core-01 fleetd[963]: INFO reconcile.go:311: AgentReconciler completed task: type=StartUnit job=baz.service reason="unit currently loaded but desired state is launched"
@tleyden
Copy link
Contributor

tleyden commented Mar 20, 2015

Just curious, is it possible to write a unit test to detect this error? (I guess that would require a mock dbus api?)

@bcwaldon
Copy link
Contributor Author

It would likely be difficult to reproduce this without testing against an actual systemd instance. However, we do already have a functional testing platform against which we can design a repeatable test case.

On Mar 20, 2015, at 8:07 AM, Traun Leyden notifications@github.com wrote:

Just curious, is it possible to write a unit test to detect this error? (I guess that would require a mock dbus api?)


Reply to this email directly or view it on GitHub.

@arunoda
Copy link

arunoda commented Mar 23, 2015

Is this issue also related to this: #1159

@wsimmonds
Copy link

+1 to this.

Currently running Deis and this sporadically occurs when it is scaling (scheduling units in fleet). Presently am manually resolving by logging in to each box and issuing systemctl daemon-reload then systemctl start for each failed unit.

@scole-scea
Copy link

@wsimmonds: The hack at https://gist.github.com/scole-scea/177a367c51d0b93d51f6 may be a useful workaround for your issue. (It's automation for what you're doing by hand.)

@tleyden
Copy link
Contributor

tleyden commented Mar 31, 2015

Great news! What's the ETA for the next alpha build?

@bcwaldon
Copy link
Contributor Author

Just merged the update into github.com/coreos/coreos-overlay, and I believe the next Alpha is slated for Thursday.

@tleyden
Copy link
Contributor

tleyden commented Mar 31, 2015

Is it possible to test this before the new alpha is released?

I tried manually overwriting the binaries but got this error:

core@ip-10-35-198-130 ~ $ sudo cp fleet-v0.9.2-linux-amd64/fleet* /usr/bin
cp: cannot create regular file '/usr/bin/fleetctl': Read-only file system
cp: cannot create regular file '/usr/bin/fleetd': Read-only file system

@robszumski
Copy link
Member

You can try this out with a systemd dropin. What this does is append/overwrite just part of the fleet unit on disk. After copying the binary on disk somewhere:

#cloud-config

coreos:
  units:
    - name: fleet.service
      drop-ins:
        - name: 99-custom-binary.conf
          content: |
            [Service]
            ExecStart=
            ExecStart=/home/core/fleetd

or just create 99-custom-binary.conf manually, add that content, then run:

$ sudo systemctl daemon-reload
$ sudo systemctl restart fleet

To revert back, just remove 99-custom-binary.conf and then:

$ sudo systemctl daemon-reload
$ sudo systemctl restart fleet

@PaulCapestany
Copy link

FWIW, I'm pretty new to this all, and couldn't get the append/overwrite method @robszumski suggested to work (I kept getting a fleet.service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing. error), but the following did work:

wget https://github.com/coreos/fleet/releases/download/v0.9.2/fleet-v0.9.2-linux-amd64.tar.gz && \
tar -xzvf fleet-v0.9.2-linux-amd64.tar.gz && \
mv /home/core/fleet-v0.9.2-linux-amd64/fleetd /home/core/fleetd && \
echo "[Unit]
Description=fleet daemon

Wants=etcd.service
After=etcd.service

Wants=fleet.socket
After=fleet.socket

[Service]
ExecStart=/home/core/fleetd
Restart=always
RestartSec=10s
" > /home/core/fleet.service && \
sudo mv /home/core/fleet.service /etc/systemd/system/fleet.service && \
sudo systemctl daemon-reload && \
sudo systemctl restart fleet

Followup question: now that I'm using the 0.9.2 version of fleetd I'm getting a warning (shown below) whenever I use fleetctl — is it OK to ignore it, or is there also a way to force CoreOS to use a binary when it's not controlled by systemctl?

WARNING: fleetctl (0.9.1) is older than the latest registered
version of fleet found in the cluster (0.9.2). You are strongly
recommended to upgrade fleetctl to prevent incompatibility issues.

Thanks!

@robszumski
Copy link
Member

@PaulCapestany Sorry about that, it was appending two ExecStart= lines. Apparently if you do this, it resets them and will only use the last one provided:

ExecStart=
ExecStart=/home/core/fleetd

I've updated my example above. Everyone else, let me know if this works, or you can create a entirely new file like the post above.


That version warning is there because in the early days of fleet the version on the client (like your laptop) had to match what was on the server for it to function correctly. This change didn't impact anything compatibility-wise so it is safe to ignore.

carmstrong added a commit to carmstrong/deis that referenced this issue Apr 6, 2015
This commit introduces a systemd unit to upgrade fleet to 0.9.2
before the daemon is started. This is intended to address
coreos/fleet#1158 which was a regression
from fixes introduced in fleet 0.9.1, which is included for the first
time in CoreOS 607.0.0.
carmstrong added a commit to carmstrong/deis that referenced this issue Apr 7, 2015
This commit introduces a systemd unit to upgrade fleet to 0.9.2
before the daemon is started. This is intended to address
coreos/fleet#1158 which was a regression
from fixes introduced in fleet 0.9.1, which is included for the first
time in CoreOS 607.0.0.
carmstrong added a commit to carmstrong/deis that referenced this issue Apr 7, 2015
This commit introduces a systemd unit to upgrade fleet to 0.9.2
before the daemon is started. This is intended to address
coreos/fleet#1158 which was a regression
from fixes introduced in fleet 0.9.1, which is included for the first
time in CoreOS 607.0.0.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants