This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

Sporadic fleet launch errors: Unit failed to load: No such file or directory #1149

Closed
tleyden opened this issue Mar 11, 2015 · 1 comment

tleyden commented Mar 11, 2015

I'm running CoreOS alpha 612.1.0 and launching / destroying units via the fleet REST API. I'm seeing sporadic failures where units fail to start.

Here's a full walkthrough of what I'm doing to reproduce the issue:

Start units

  • Launch a 3 node cluster on EC2 using this CloudFormation template
  • Launch fleet units by running sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet launch-cbs --version 3.0.1 --num-nodes 3 --userpass "user:passw0rd". This dynamically generates fleet units based on templates, then submits them via the fleet API.

At this point, my journalctl -b -u fleet.service --no-pager logs are:

Stop + destroy units

This just stops and destroys all units; it's essentially the equivalent of fleetctl stop * && fleetctl destroy *:

sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet stop --all-units && sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet destroy --all-units

Verify everything is clean:

$ fleetctl list-units
UNIT    MACHINE ACTIVE  SUB
$ fleetctl list-unit-files
UNIT    HASH    DSTATE  STATE   TARGET

At this point, my journalctl -b -u fleet.service --no-pager logs are:

Restart units

Run the same command as earlier to kick things off: sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet launch-cbs --version 3.0.1 --num-nodes 3 --userpass "user:passw0rd"

This didn't reproduce the bug, but after repeating the Stop + destroy units and Restart units steps three times (third time's a charm!), I was able to reproduce it.
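Since the failure is sporadic, the launch / check / teardown cycle above could be automated. The following is only a rough sketch: the docker and fleetctl commands are the ones from this report, but the helper names (`reproduce`, `any_failed`, `settle_seconds`) are made up for illustration, and the wait time before checking unit state is a guess that would need tuning.

```python
import subprocess
import time

# Command prefix taken from the report; everything else here is illustrative.
BASE = ["sudo", "docker", "run", "--net=host", "tleyden5iwx/couchbase-cluster-go",
        "update-wrapper", "couchbase-fleet"]


def launch():
    """Submit the generated units, as in the 'Start units' step."""
    subprocess.run(BASE + ["launch-cbs", "--version", "3.0.1", "--num-nodes", "3",
                           "--userpass", "user:passw0rd"], check=True)


def teardown():
    """Stop, then destroy, all units, as in the 'Stop + destroy units' step."""
    subprocess.run(BASE + ["stop", "--all-units"], check=True)
    subprocess.run(BASE + ["destroy", "--all-units"], check=True)


def any_failed():
    """Return True if any row of `fleetctl list-units` has ACTIVE == failed."""
    out = subprocess.run(["fleetctl", "list-units"], capture_output=True,
                         text=True, check=True).stdout
    # Skip the header row; the third whitespace-separated column is ACTIVE.
    return any(line.split()[2:3] == ["failed"] for line in out.splitlines()[1:])


def reproduce(max_cycles, launch=launch, teardown=teardown,
              any_failed=any_failed, settle_seconds=60):
    """Return the 1-based cycle on which a unit failed, or None if none did."""
    for cycle in range(1, max_cycles + 1):
        launch()
        time.sleep(settle_seconds)  # assumed delay so units can reach a final state
        if any_failed():
            return cycle  # leave the cluster as-is so logs can be inspected
        teardown()
    return None
```

Calling `reproduce(10)` on a cluster node would stop at the first cycle that leaves a failed unit, without tearing it down, so the journalctl logs can be collected.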

Fleet units:

$ fleetctl list-units
UNIT                            MACHINE                     ACTIVE  SUB
couchbase_node@1.service        8995d6d7.../10.156.7.12     active  running
couchbase_node@2.service        ad8cb97d.../10.239.174.35   active  running
couchbase_node@3.service        cc2b61a5.../10.141.247.11   active  running
couchbase_sidekick@1.service    8995d6d7.../10.156.7.12     failed  failed
couchbase_sidekick@2.service    ad8cb97d.../10.239.174.35   active  running
couchbase_sidekick@3.service    cc2b61a5.../10.141.247.11   failed  failed
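To pick the failed units out of output like this automatically, something along these lines might work. It is a minimal sketch that assumes the four whitespace-separated columns shown above (UNIT, MACHINE, ACTIVE, SUB); `failed_units` is a hypothetical helper name, not part of fleet.

```python
def failed_units(list_units_output):
    """Return (unit, machine) pairs whose ACTIVE column is 'failed'.

    Assumes the four-column layout printed by `fleetctl list-units` above.
    """
    failed = []
    for line in list_units_output.splitlines():
        fields = line.split()
        # Ignore the prompt line, the header row, and anything malformed.
        if len(fields) != 4 or fields[0] == "UNIT":
            continue
        unit, machine, active, _sub = fields
        if active == "failed":
            failed.append((unit, machine))
    return failed
```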

Unit files:

$ fleetctl list-unit-files
UNIT                            HASH     DSTATE    STATE     TARGET
couchbase_node@1.service        97550db  launched  launched  8995d6d7.../10.156.7.12
couchbase_node@2.service        97550db  launched  launched  ad8cb97d.../10.239.174.35
couchbase_node@3.service        97550db  launched  launched  cc2b61a5.../10.141.247.11
couchbase_sidekick@1.service    3c37fd3  launched  launched  8995d6d7.../10.156.7.12
couchbase_sidekick@2.service    f438171  launched  launched  ad8cb97d.../10.239.174.35
couchbase_sidekick@3.service    5c1369d  launched  launched  cc2b61a5.../10.141.247.11

Journalctl logs:

Analyzing the logs

On the machine at 10.141.247.11 (cc2b61a5...), which runs the failed couchbase_sidekick@3.service unit, there is an error:

ERROR manager.go:136: Failed to trigger systemd unit couchbase_sidekick@3.service start: Unit couchbase_sidekick@3.service failed to load: No such file or directory.

Likewise, on the machine at 10.156.7.12 (8995d6d7...), which runs the failed couchbase_sidekick@1.service unit, there is an equivalent error:

ERROR manager.go:136: Failed to trigger systemd unit couchbase_sidekick@1.service start: Unit couchbase_sidekick@1.service failed to load: No such file or directory.
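For scanning the fleet journal on each machine for this specific failure, a small filter like the following could help. It is a sketch that matches only the message format of the two error lines quoted above; `load_failures` and the regular expression are my own illustration, and other fleet versions may word the message differently.

```python
import re

# Matches the "failed to load" errors quoted above; the unit name must
# appear identically in both places within the message.
LOAD_FAILURE = re.compile(
    r"Failed to trigger systemd unit (?P<unit>\S+) start: "
    r"Unit (?P=unit) failed to load: (?P<reason>.+?)\.?$"
)


def load_failures(log_text):
    """Return (unit, reason) pairs for each load failure found in log_text."""
    found = []
    for line in log_text.splitlines():
        match = LOAD_FAILURE.search(line.strip())
        if match:
            found.append((match.group("unit"), match.group("reason")))
    return found
```

The input would be the text from journalctl -b -u fleet.service --no-pager on a given machine.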