This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

fleet can start units out of order on startup #997

Closed
bcwaldon opened this issue Oct 22, 2014 · 25 comments · Fixed by #1134

Comments

@bcwaldon
Contributor

  1. fleetctl start the following units
$ cat bar.service
[Service]
ExecStart=/usr/bin/sleep infinity

$ cat foo.service
[Unit]
Requires=bar.service
After=bar.service

[Service]
ExecStart=/usr/bin/sleep infinity

[X-Fleet]
MachineOf=bar.service
  2. systemctl kill -s SIGKILL fleet && systemctl start fleet

There is a chance that fleet will load and attempt to start bar.service before loading foo.service, causing the following error:

Oct 22 22:15:57 core-01 fleetd[6111]: ERROR manager.go:80: Failed to trigger systemd unit bar.service start: Unit foo.service failed to load: No such file or directory.

This bug was originally reported in #974.

@rynbrd

rynbrd commented Nov 10, 2014

Is there a halfway decent workaround for this? This has been biting us through pretty much every CoreOS update cycle. I've tried playing with systemd and fleet dependencies but there's always at least one service that gets lost on a reboot cycle.

@balboah

balboah commented Dec 16, 2014

I'm having this issue as well on initial provisioning of my test Vagrant cluster. Doing fleetctl submit first and then fleetctl start didn't make it any happier; I was hoping that submit would have written the unit files.

Requirements don't get fulfilled because of "No such file or directory", leaving the service that requires it in an inactive state.

@bcwaldon
Contributor Author

@balboah fleetctl load is the step that actually lands unit files on disk. Using that in place of fleetctl submit above should work for demonstration purposes. This is not a solution to the core problem, though.
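For demonstration purposes, a minimal sequence using the bar.service/foo.service example from the original report would be something like:

$ fleetctl load bar.service foo.service
$ fleetctl start bar.service foo.service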

@balboah

balboah commented Dec 16, 2014

@bcwaldon Thanks for the hint, that makes a usable workaround for my setup. I hope fleet will be production-ready by the time I'm migrating my Vagrant setup :)

@rynbrd

rynbrd commented Dec 16, 2014

@balboah that's not a workaround, that's how it's supposed to work =)

@balboah

balboah commented Dec 17, 2014

@bluedragonx If fleetctl start path/*.service also writes the unit files, I would expect it to write them all in place before starting them, not leave some of them inactive even when all dependencies are available.

To me it seems confusing that start does both jobs if load && start is how you are supposed to do it.

@rynbrd

rynbrd commented Dec 17, 2014

If you start a set of units it will submit, load, and start them together as necessary. The bug associated with this issue does not manifest itself in those circumstances. You do not need to load and then start if you've written your unit files correctly.

@jonboulle
Contributor

related - #993

@msumme

msumme commented Feb 9, 2015

Additionally, I've seen this happen after fleet reschedules failed units to different machines in the cluster.

That basically means it can randomly stop working after a successful deploy.

@sukrit007

👍 for a fix for this.

Just a note: if a service uses the "Wants" directive and we follow the steps described in #1079, the issue is always reproducible even while deploying the service.

@bepremeg

Upon auto-updating from 522.6.0 to 557.2.0 we were hit by the same problem as described by @sukrit007. We're using Wants too, and the only workaround seems to be starting the sidekick unit before the worker unit. Since the sidekick units bind to the services, this is pretty bad.

@ryantanner

We're getting hit with this too. Our current workaround is to pepper our scripts with calls to systemctl daemon-reload but this is pretty ugly.

@msumme

msumme commented Feb 24, 2015

Interesting that 557.2.0 just got moved to stable with this being reported so widely for a very common use case (sidekick discovery services).


@lordelph

I'm getting this issue too. I experience it every morning when I open my laptop and the database running in my Vagrant cluster has died; the only way to get it started again is fleetctl destroy / fleetctl start.

If it helps, the database has dependencies like this (I've omitted all the other elements of the unit for clarity):

#mysql@%i.service
[Unit]
# Requirements
Requires=etcd.service
Requires=docker.service
Requires=mysql-discovery@%i.service
Requires=mysql-backup@%i.service

# Dependency ordering
After=etcd.service
After=docker.service
Before=mysql-discovery@%i.service
Before=mysql-backup@%i.service

#constrain mysql to a specific machine-id
[X-Fleet]
MachineID=%i

The two sidekick units are similar, and have dependencies like this:

#mysql-backup@%i.service
[Unit]
# Requirements
Requires=docker.service
Requires=mysql@%i.service

# Dependency ordering and binding
After=docker.service
After=mysql@%i.service
BindsTo=mysql@%i.service

[X-Fleet]
MachineOf=mysql@%i.service

@akaspin

akaspin commented Feb 26, 2015

#944 - our hero? Only one question: how do we specify RequiresDaemonReload in a fleet unit? Any docs?

@msumme

msumme commented Feb 27, 2015

@bcwaldon Thanks for fixing - this is awesome. Do you know if there will be a patch into the stable branch?

@bcwaldon
Contributor Author

@msumme I am going to cut v0.9.1 and we will roll it to all CoreOS release channels shortly. I will update #1134 when I have more information.

@bcwaldon
Contributor Author

bcwaldon commented Mar 2, 2015

Please see #1134

@sukrit007

Just ran into this issue again with 607.0.0 / fleet 0.9.1. The probability of occurrence has gone down since the fix, but it still happened once in 10 deploys.

● meltmedia-talu-website-feature_totem-1425496747312-yoda-register@1.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)

Mar 04 19:19:14 ip-10-227-136-198.us-west-1.compute.internal systemd[1]: Cannot add dependency job for unit meltmedia-talu-website-feature_totem-1425496747312-yoda-register@1.service, ignoring: Unit meltmedia-talu-website-feature_totem-1425496747312-yoda-register@1.service failed to load: No such file or directory.
Mar 04 19:19:15 ip-10-227-136-198.us-west-1.compute.internal systemd[1]: Cannot add dependency job for unit meltmedia-talu-website-feature_totem-1425496747312-yoda-register@1.service, ignoring: Unit meltmedia-talu-website-feature_totem-1425496747312-yoda-register@1.service failed to load: No such file or directory.

Note: this did not happen on startup, but while deploying the services using the sidekick pattern with the "Wants" directive as described in #1079.

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=607.0.0
VERSION_ID=607.0.0
BUILD_ID=
PRETTY_NAME="CoreOS 607.0.0"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

@bcwaldon
Contributor Author

bcwaldon commented Mar 4, 2015

@sukrit007 I'm surprised you're not running into it more often, to be honest. Your master unit has a dependency on your sidekick unit, but fleet will not schedule your sidekick unit out to the cluster until the master has already been scheduled. Remove the Wants= directive from your master and the problem will be resolved. The reason you see it so infrequently now is likely that you start the units at the same time and the fleet engine is slow enough that both units are available for reconciliation (and therefore scheduling) together.
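As a rough sketch of the suggested layout (hypothetical unit names; the main unit carries no reference to the sidekick, and the sidekick follows the main unit via BindsTo=):

# main@%i.service
[Service]
ExecStart=/usr/bin/sleep infinity

# main-discovery@%i.service
[Unit]
After=main@%i.service
BindsTo=main@%i.service

[Service]
ExecStart=/usr/bin/sleep infinity

[X-Fleet]
MachineOf=main@%i.service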

@sukrit007

@bcwaldon I think I would still need Wants in order to ensure that the sidekick restarts when my main unit restarts. Even though I have Restart=always, my sidekick won't restart itself (when the main unit restarts) if I do not have "Wants". (Ideally I would have used "Requires" instead of "Wants", but due to a bug in systemd (#1089) I can not use that either.)

I will keep monitoring this, but any suggestion to workaround the issue is appreciated.

@bcwaldon
Contributor Author

bcwaldon commented Mar 9, 2015

@sukrit007 I see two short-term paths forward for you:

  1. Remove the Wants= from your master and rely on health checks from a load balancer or something to keep unhealthy instances out of service, regardless of them being discoverable
  2. Move whatever logic the sidekick process is doing into an ExecStartPost option in your main unit, removing the need for this inter-unit dependency (a rough sketch follows below)
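A rough sketch of option 2, assuming the sidekick's only job is registering the unit in etcd (the image name and etcd key below are hypothetical placeholders):

# main unit with the former sidekick logic inlined
[Service]
ExecStart=/usr/bin/docker run --rm --name app example/app
ExecStartPost=/usr/bin/etcdctl set /services/app/%H up
ExecStopPost=/usr/bin/etcdctl rm /services/app/%H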

@sukrit007

@bcwaldon Thanks for that. I will move to option 2 if I start seeing this more frequently, but I would love to stick with a sidekick and an inter-unit dependency in the future, to keep the concerns isolated in two separate unit files.

@umiller

umiller commented Mar 22, 2015

Same here. I thought that "Cannot add dependency job for unit discovery ..." was resolved in fleet 0.9.1, but it happens to me every time I launch the sidekick.
Is there a way to roll back to an older version of fleet?

@bcwaldon
Contributor Author

@umiller if you cannot use either of the options above, you can roll back the version of fleet: place a fleetd binary at /opt/bin/fleetd, override the built-in fleet unit by copying /usr/lib64/systemd/system/fleet.service to /etc/systemd/system/fleet.service, and modify the ExecStart line to point to /opt/bin/fleetd.
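Roughly, assuming the desired fleetd binary is already in place at /opt/bin/fleetd, the override looks like this (the sed expression assumes the stock ExecStart line has no extra arguments; otherwise edit the copied unit by hand):

$ sudo cp /usr/lib64/systemd/system/fleet.service /etc/systemd/system/fleet.service
$ sudo sed -i 's|^ExecStart=.*|ExecStart=/opt/bin/fleetd|' /etc/systemd/system/fleet.service
$ sudo systemctl daemon-reload
$ sudo systemctl restart fleet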
