
improve experience working with template units #969

Closed
jonboulle opened this Issue Oct 14, 2014 · 32 comments

Contributor

jonboulle commented Oct 14, 2014

It should not be possible for template units to be scheduled to a system. Right now the experience is not great: the template will be scheduled, but will then cause chronic issues with the agent on that machine, e.g.:

Oct 14 17:26:41 core-01 fleetd[557]: ERROR generator.go:51: Failed fetching current unit states: Unit name foo@.service is not valid.
Oct 14 17:26:42 core-01 fleetd[557]: ERROR generator.go:51: Failed fetching current unit states: Unit name foo@.service is not valid.
Oct 14 17:26:43 core-01 fleetd[557]: ERROR generator.go:51: Failed fetching current unit states: Unit name foo@.service is not valid.

(Really, any unit with a bad name should never be scheduled; but fleetctl should now block all bad names except for template units).

Related: #541
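The guard fleetctl could apply before scheduling can be sketched as a simple name check (a hypothetical helper, not fleet's actual validation code): a unit name whose instance part is empty, like foo@.service, is a template and should be refused outright.

```shell
# Hypothetical sketch of the check fleetctl could apply before scheduling:
# a name like foo@.service has an empty instance part, i.e. it is a
# template, and must never be scheduled to a machine directly.
is_template() {
  name="${1%.*}"          # strip the type suffix (.service, .mount, ...)
  case "$name" in
    *@) return 0 ;;       # ends in '@' -> empty instance -> template
    *)  return 1 ;;
  esac
}

is_template "foo@.service"  && echo "foo@.service: template, refuse to schedule"
is_template "foo@1.service" || echo "foo@1.service: instance, ok to schedule"
```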


irony commented Dec 7, 2014

Until this is fixed: if you've arrived here, you have probably tried to start a template directly (fleetctl start foo@.service), which prevents the rest of the units from being loaded. To resolve the problem, run fleetctl destroy foo@.service.

An ideal solution would be for fleetctl to prevent users from starting templates, or to ignore them when starting.


gegere commented Dec 28, 2014

@irony Great job providing the additional information, but I regret to inform you this does not work properly:

fleetctl destroy Postfix@.service

This issue persists. :(


Contributor

bcwaldon commented Dec 29, 2014

@gegere What version of CoreOS & fleetd are you running?


gegere commented Dec 29, 2014

fleet version 0.8.3
etcd version 0.4.6
Docker version 1.3.3, build 5dc1c5a
NAME=CoreOS
ID=coreos
VERSION=494.5.0
VERSION_ID=494.5.0
BUILD_ID=
PRETTY_NAME="CoreOS 494.5.0"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

The solution for this stuck instance was to reboot the node. I believe the cause of the problem was loading the template, which is of course incorrect. The proper way to add a template unit file is to use submit.

Contributor

bcwaldon commented Dec 29, 2014

@gegere The right thing to do is actually to leave your template units on your local system altogether, and only run fleetctl commands against instances of the template unit. For example, leave Postfix@.service in your CWD:

$ ls .
Postfix@.service

Then run fleetctl start Postfix@1.service (or load, stop, submit, etc).
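The workflow above can be sketched as a dry run (echoing the commands instead of executing them, since this is only illustrative; the Postfix@.service name and instance numbers are taken from the example above):

```shell
# Dry-run sketch of the recommended workflow: the template file stays in
# the CWD and only instance names are ever passed to fleetctl. Deriving
# instance names from the template name, echoing rather than executing.
template="Postfix@.service"
base="${template%@.service}"            # -> "Postfix"
for i in 1 2 3; do
  echo fleetctl start "${base}@${i}.service"
done
```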


gegere commented Dec 29, 2014

Well this looks to be a pretty great solution also! 👍


efuquen commented Feb 5, 2015

👍 on this issue. The above comment's suggestion of running fleetctl destroy foo@.service does not resolve this. I still get errors in the journal every few seconds like this:

ERROR reconcile.go:80: Unable to determine agent's current state: failed fetching unit states from UnitManager: Unit name foo@.service is not valid.

I also tried restarting fleetd, and that still didn't work. I'm running the latest CoreOS alpha channel version, 575.0.0, which includes fleetd 0.9.0 and etcd 0.4.6. Regardless of any fix to prevent invalid template names, it shouldn't take a full system restart to resolve this.


Contributor

kayrus commented Feb 18, 2015

Have the same issue.

Feb 16 14:47:31 coreos3 fleetd[20310]: ERROR reconcile.go:80: Unable to determine agent's current state: failed fetching unit states from UnitManager: Unit name servicename@.service is not valid.

I cannot remove it. Is there a temporary solution to remove the invalid service?


BugRoger commented Feb 20, 2015

A workaround that I found is to just delete the broken units manually. Something like:

find / -name "mnt@*.mount" | xargs rm
systemctl daemon-reload
systemctl restart fleet
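A somewhat safer variant of the workaround above, assuming the broken template files live under /run/fleet/units (as later comments suggest): restrict the find to that directory and match only template names (empty instance part), rather than scanning all of /. Demonstrated here against a scratch directory so nothing real is touched.

```shell
# Safer sketch: delete only template unit files (name ends in "@.service")
# from fleet's runtime unit directory, leaving instances intact.
UNIT_DIR=$(mktemp -d)                    # stand-in for /run/fleet/units
touch "$UNIT_DIR/mysvc@.service" "$UNIT_DIR/mysvc@1.service"

find "$UNIT_DIR" -maxdepth 1 -name '*@.service' -print -delete

ls "$UNIT_DIR"                           # only mysvc@1.service remains
# afterwards: sudo systemctl daemon-reload && sudo systemctl restart fleet
```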

radek-senfeld commented Feb 20, 2015

Purging /run/fleet/units did indeed solve the problem! There are also symlinks in /run/systemd/system. I also tried removing the etcd keys first, but that didn't help:

etcdctl ls _coreos.com/fleet/unit

@bcwaldon bcwaldon modified the milestone: v0.10.0 Feb 21, 2015


chrisfarms commented Mar 10, 2015

I am not doing fleetctl start template@.service.

I am doing fleetctl submit template@.service, followed later by multiple fleetctl start template@N.service where N is a number 0 - 5.

This works fine 95% of the time, but every now and then one of my nodes will get stuck in the Failed fetching current unit states loop, referencing the template@.service unit.

The only thing I can think of is some kind of race condition between the submit and the start calls, but I cannot reliably recreate it at present.

(fleet 0.9.0 / coreos 557.2.0)


HarryR commented Mar 21, 2015

I ran into this problem today with CoreOS 612.1.0 and fleet 0.9.1 in a similar situation.

I submitted template@.service, then started an instance, template@1.service. After starting another service, the fleetctl command hung whenever I tried to do anything, and on one machine journalctl was repeating Failed fetching current unit states.


hauptmedia commented Mar 31, 2015

The problem still exists in CoreOS 633.1.0 & fleet 0.9.1. I submitted a template@.service file via fleetctl and started an instance of template@1.service. After that fleetctl hangs when it tries to schedule work on a specific machine.

In the journal of this machine you can read:

coreos-1 core # journalctl -u fleet -f

(...) fleetd[10224]: ERROR reconcile.go:81: Unable to determine agent's current state: failed fetching unit states from UnitManager: Unit name template@.service is not valid.


maccman commented Apr 6, 2015

We're having exactly the same problem with the same error message.


geniousphp commented Apr 19, 2015

I still get the same error message as you guys; I'm running CoreOS 647.0.0. Is there any clean solution?


duffqiu commented Apr 23, 2015

Following @BugRoger's suggestion, but you need to change the command to match your own fleet unit name.

The correct paths in CoreOS are:

/run/fleet/units/
/run/systemd/
/run/systemd/system/

Then reboot all the fleet servers in your cluster.

Maybe I found the root cause: it happened when I ran fleetctl as the root user.


duffqiu commented Apr 23, 2015

We need to run the fleetctl command as the core user. And if you ran it as the root user, you need to clean the unit files manually before switching to the core user, because the core user can't delete those unit files.


rosskukulinski commented Apr 23, 2015

FWIW, ran into this issue today -- accidentally started service@.service. It subtly caused problems whose root cause took a while to find. HUGE nightmare.


Vishant0031 commented Jun 23, 2015

@duffqiu, do we need to restart fleet on the entire cluster, or just on the nodes where the 'template' units were started by fleet?
Is there a way I can reload the fleet daemon without restarting it (like systemctl daemon-reload)?


duffqiu commented Jun 23, 2015

@Vish0007 You just need to clean the template unit(s) on the specific node and restart fleet.

I haven't found a way to reload the daemon without a restart; if you find one, please tell me.


Vishant0031 commented Jun 23, 2015

Would this only reload fleet.conf?
systemctl kill -s SIGHUP fleet


duffqiu commented Jun 24, 2015

I am not sure whether systemctl kill -s SIGHUP fleet works or not. Can anyone help?


beginrescueend commented Jun 24, 2015

This still exists in 681.2.0.

I can do this, however, as a workaround:
sudo rm /run/fleet/units/mysvc@.service
sudo systemctl daemon-reload
sudo systemctl restart fleet


arthur-c commented Jun 26, 2015

I accidentally ran a template; @beginrescueend's trick solved the problem.


spiddy commented Jul 16, 2015

I just hit the same issue (CoreOS 647.0.0). Because the solution is a combination of all of the above, I'm putting it here for the next victims:

  • First, remove the misconfigured template from fleet (otherwise it will bite you again):
fleetctl destroy templatename@.service
  • Then clean up the local service file and restart fleet:
sudo rm /run/fleet/units/templatename\@.service
sudo systemctl daemon-reload
sudo systemctl restart fleet

Is there a roadmap on how to fix this permanently?


sukrit007 commented Jul 18, 2015

Just ran into this issue in latest stable version of CoreOS :

DISTRIB_ID=CoreOS
DISTRIB_RELEASE=717.3.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 717.3.0"

Cleaning the unit, doing a daemon-reload, and then restarting fleet fixes this issue, but I'm running into it a lot more frequently with the 717.3.0 update.


knguyen142 commented Jul 29, 2015

I'm running into this issue too, at 717.3.0. Restarting fleet seems to reshuffle the containers on the machine, so I don't think that's a viable solution for us. Any other ideas on how to fix?

@jonboulle jonboulle added kind/bug and removed bug labels Sep 24, 2015


yhzhao commented Sep 27, 2015

Running into the same nightmare just now. Any plans for this bug to get fixed?


lfittl commented Sep 27, 2015

@yhzhao I think this was fixed in #1273 / release 0.11.0

CoreOS beta+stable images are still on fleet 0.10, but I assume they will move to 0.11 soon-ish.


Contributor

polvi commented Sep 30, 2015

yes, alpha is shipping 0.11.5 and will be promoted to beta soon.

@jonboulle jonboulle added priority/P1 and removed area/api labels Jan 26, 2016

@jonboulle jonboulle added this to the v0.13.0 milestone Jan 26, 2016


sukrit007 commented Feb 23, 2016

So far I have not seen this issue with 0.11.5. Any reason it is still open? Is there any other scenario (with the 0.11.5 release) in which this can still occur?

@jonboulle jonboulle closed this Feb 23, 2016


Contributor

jonboulle commented Feb 23, 2016

thanks for the ping, closing
