This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

fleet can start units out of order on startup #997

Closed
bcwaldon opened this issue Oct 22, 2014 · 25 comments · Fixed by #1134

Comments

@bcwaldon
Contributor

  1. fleetctl start the following units
$ cat bar.service
[Service]
ExecStart=/usr/bin/sleep infinity

$ cat foo.service
[Unit]
Requires=bar.service
After=bar.service

[Service]
ExecStart=/usr/bin/sleep infinity

[X-Fleet]
MachineOf=bar.service
  2. systemctl kill -s SIGKILL fleet && systemctl start fleet

There is a chance that fleet will load and attempt to start bar.service before loading foo.service, causing the following error:

Oct 22 22:15:57 core-01 fleetd[6111]: ERROR manager.go:80: Failed to trigger systemd unit bar.service start: Unit foo.service failed to load: No such file or directory.

This bug was originally reported in #974.

@rynbrd

rynbrd commented Nov 10, 2014

Is there a halfway decent workaround for this? This has been biting us through pretty much every CoreOS update cycle. I've tried playing with systemd and fleet dependencies but there's always at least one service that gets lost on a reboot cycle.

@balboah

balboah commented Dec 16, 2014

I'm having this issue as well on initial provisioning of my test Vagrant cluster. Doing fleetctl submit first and then fleetctl start didn't make it any happier; I was hoping that submit would have written the unit files.

Requirements don't get fulfilled because of "No such file or directory", leaving the service that requires it in an inactive state.

@bcwaldon
Contributor Author

@balboah fleetctl load is the step that actually lands unit files on disk. Using that in place of fleetctl submit above should work for demonstration purposes. This is not a solution to the core problem, though.
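For demonstration purposes, a minimal sequence using the bar.service/foo.service example from the original report would be something like:

$ fleetctl load bar.service foo.service
$ fleetctl start bar.service foo.service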

@balboah

balboah commented Dec 16, 2014

@bcwaldon Thanks for the hint, that makes a usable workaround for my setup. I hope fleet will be production-ready by the time I'm migrating my Vagrant setup :)

@rynbrd

rynbrd commented Dec 16, 2014

@balboah that's not a workaround, that's how it's supposed to work =)

@balboah

balboah commented Dec 17, 2014

@bluedragonx If fleetctl start path/*.service also writes the unit files, I would expect it to write them all in place before starting them, not leave some of them inactive even when all dependencies are available.

To me it seems confusing that start does both jobs if load && start is how you are supposed to do it.

@rynbrd

rynbrd commented Dec 17, 2014

If you start a set of units it will submit, load, and start them together as necessary. The bug associated with this issue does not manifest itself in those circumstances. You do not need to load and then start if you've written your unit files correctly.

@jonboulle
Contributor

related - #993

@msumme

msumme commented Feb 9, 2015

Additionally, I've seen this happen after fleet reschedules failed units to different machines in the cluster.

That basically means it can randomly stop working after a successful deploy.

@sukrit007

👍 for a fix for this.

Just a note: if a service uses the "Wants" directive and we follow the steps described in #1079, the issue is always reproducible even while deploying the service.

@bepremeg

Upon auto-updating from 522.6.0 to 557.2.0 we were hit by the same problem as described by @sukrit007. We're using Wants too, and the only workaround seems to be starting the sidekick unit before the worker unit. Since the sidekick units bind to the services, this is pretty bad.

@ryantanner

We're getting hit with this too. Our current workaround is to pepper our scripts with calls to systemctl daemon-reload but this is pretty ugly.

@msumme

msumme commented Feb 24, 2015

Interesting that 557.2.0 just got moved to stable with this being reported so widely for a very common use case (sidekick discovery services).


@lordelph

I'm getting this issue too. I experience it every morning when I open my laptop and the database running in my Vagrant cluster has died; the only way to get it started again is fleetctl destroy / fleetctl start.

If it helps, the database has dependencies like this (I've omitted all the other elements of the unit for clarity):

#mysql@%i.service
[Unit]
# Requirements
Requires=etcd.service
Requires=docker.service
Requires=mysql-discovery@%i.service
Requires=mysql-backup@%i.service

# Dependency ordering
After=etcd.service
After=docker.service
Before=mysql-discovery@%i.service
Before=mysql-backup@%i.service

#constrain mysql to a specific machine-id
[X-Fleet]
MachineID=%i

The two sidekick units are similar, and have dependencies like this:

#mysql-backup@%i.service
[Unit]
# Requirements
Requires=docker.service
Requires=mysql@%i.service

# Dependency ordering and binding
After=docker.service
After=mysql@%i.service
BindsTo=mysql@%i.service

[X-Fleet]
MachineOf=mysql@%i.service

@akaspin

akaspin commented Feb 26, 2015

#944 - our hero? Only one question: how do we specify RequiresDaemonReload in a fleet unit? Any docs?

@msumme

msumme commented Feb 27, 2015

@bcwaldon Thanks for fixing - this is awesome. Do you know if there will be a patch into the stable branch?

@bcwaldon
Contributor Author

@msumme I am going to cut v0.9.1 and we will roll it to all CoreOS release channels shortly. I will update #1134 when I have more information.

@bcwaldon
Contributor Author

bcwaldon commented Mar 2, 2015

Please see #1134

@sukrit007

Just ran into this issue again with 607.0.0 / fleet 0.9.1. The probability of occurrence has gone down since the fix, but it still happened once in 10 deploys.

● meltmedia-talu-website-feature_totem-1425496747312-yoda-register@1.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)

Mar 04 19:19:14 ip-10-227-136-198.us-west-1.compute.internal systemd[1]: Cannot add dependency job for unit meltmedia-talu-website-feature_totem-1425496747312-yoda-register@1.service, ignoring: Unit meltmedia-talu-website-feature_totem-1425496747312-yoda-register@1.service failed to load: No such file or directory.
Mar 04 19:19:15 ip-10-227-136-198.us-west-1.compute.internal systemd[1]: Cannot add dependency job for unit meltmedia-talu-website-feature_totem-1425496747312-yoda-register@1.service, ignoring: Unit meltmedia-talu-website-feature_totem-1425496747312-yoda-register@1.service failed to load: No such file or directory.

Note: this did not happen on startup, but while deploying the services using the sidekick pattern with the "Wants" directive as described in #1079.

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=607.0.0
VERSION_ID=607.0.0
BUILD_ID=
PRETTY_NAME="CoreOS 607.0.0"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

@bcwaldon
Contributor Author

bcwaldon commented Mar 4, 2015

@sukrit007 I'm surprised you're not running into it more often, to be honest. Your master unit has a dependency on your sidekick unit, but fleet will not schedule your sidekick unit out to the cluster until the master has already been scheduled. Remove the Wants= directive from your master and the problem will be resolved. The reason you see it so infrequently now is likely that you start the units at the same time and the fleet engine is slow enough that both units are available for reconciliation (and therefore scheduling) together.
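As a rough sketch of the suggested layout (hypothetical unit names; the main unit carries no reference to the sidekick, and the sidekick follows the main unit via BindsTo=):

# main@%i.service
[Service]
ExecStart=/usr/bin/sleep infinity

# main-discovery@%i.service
[Unit]
After=main@%i.service
BindsTo=main@%i.service

[Service]
ExecStart=/usr/bin/sleep infinity

[X-Fleet]
MachineOf=main@%i.service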

@sukrit007

@bcwaldon I think I would still need Wants in order to ensure that the sidekick restarts when my main unit restarts. Even though I have Restart=always, my sidekick won't restart itself (when the main unit restarts) if I do not have "Wants". (Ideally I would have used "Requires" instead of "Wants", but due to a bug in systemd (#1089) I can not use that either.)

I will keep monitoring this, but any suggestion to workaround the issue is appreciated.

@bcwaldon
Contributor Author

bcwaldon commented Mar 9, 2015

@sukrit007 I see two short-term paths forward for you:

  1. Remove the Wants= from your master and rely on health checks from a load balancer or something to keep unhealthy instances out of service, regardless of them being discoverable
  2. Move whatever logic the sidekick process is doing into an ExecStartPost option in your main unit, removing the need for this inter-unit dependency (a rough sketch follows below)
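A rough sketch of option 2, assuming the sidekick's only job is registering the unit in etcd (the image name and etcd key below are hypothetical placeholders):

# main unit with the former sidekick logic inlined
[Service]
ExecStart=/usr/bin/docker run --rm --name app example/app
ExecStartPost=/usr/bin/etcdctl set /services/app/%H up
ExecStopPost=/usr/bin/etcdctl rm /services/app/%H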

@sukrit007

@bcwaldon Thanks for that. I will move to option 2 if I start seeing this more frequently, but I would love to stick with a sidekick and an inter-unit dependency in the future, to keep the concerns isolated in two separate unit files.

@umiller

umiller commented Mar 22, 2015

Same here. I thought that "Cannot add dependency job for unit discovery ..." was resolved in fleet 0.9.1, but it happens to me every time I launch the sidekick.
Is there a way to roll back to an older version of fleet?

@bcwaldon
Contributor Author

@umiller if you cannot use either of the options above, you can roll back the version of fleet: place a fleetd binary at /opt/bin/fleetd, override the built-in fleet unit by copying /usr/lib64/systemd/system/fleet.service to /etc/systemd/system/fleet.service, and modify the ExecStart line to point to /opt/bin/fleetd.
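Roughly, assuming the desired fleetd binary is already in place at /opt/bin/fleetd, the override looks like this (the sed expression assumes the stock ExecStart line has no extra arguments; otherwise edit the copied unit by hand):

$ sudo cp /usr/lib64/systemd/system/fleet.service /etc/systemd/system/fleet.service
$ sudo sed -i 's|^ExecStart=.*|ExecStart=/opt/bin/fleetd|' /etc/systemd/system/fleet.service
$ sudo systemctl daemon-reload
$ sudo systemctl restart fleet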
