store-volume fails after update to v0.15.0 #2300

mboersma · 2014-10-28T19:58:11Z

Following the upgrade instructions didn't work for me on 3-node vagrant.

$ git checkout v0.14.1
$ make -C deisctl/ build
$ deisctl --version
0.14.1
$ deisctl refresh-units
$ deisctl config platform set domain=local3.deisapp.com
$ deisctl install platform
$ deisctl start platform
$ deisctl uninstall platform
$ git checkout v0.15.0
$ make -C deisctl/ build
$ deisctl --version
0.15.0
$ deisctl refresh-units
$ deisctl config platform set version=v0.15.0
$ deisctl install platform
$ deisctl start platform
● ▴ ■
■ ● ▴ Starting Deis...
▴ ■ ●

Storage subsystem...
deis-store-monitor.service: activating/start-pre                                 
deis-store-monitor.service: active/running                                 
deis-store-daemon.service: activating/start-pre                                 
deis-store-daemon.service: active/running                                 
deis-store-metadata.service: activating/start-pre                                 
deis-store-metadata.service: active/running                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-gateway.service: activating/start-pre                                 
deis-store-gateway.service: active/running                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre 
...

mboersma · 2014-10-28T20:37:22Z

$ journalctl -u deis-store-volume
-- Logs begin at Tue 2014-10-28 19:13:28 UTC, end at Tue 2014-10-28 20:36:19 UTC. --
Oct 28 19:50:41 deis-1 systemd[1]: Starting deis-store-volume...
Oct 28 19:50:41 deis-1 sh[12964]: waiting for store-monitor...
Oct 28 19:50:41 deis-1 bash[12972]: cut: write error: Broken pipe
Oct 28 19:52:11 deis-1 systemd[1]: deis-store-volume.service start-pre operation timed out. Terminating.
Oct 28 19:52:11 deis-1 systemd[1]: Failed to start deis-store-volume.
Oct 28 19:52:11 deis-1 systemd[1]: Unit deis-store-volume.service entered failed state.
Oct 28 19:52:16 deis-1 systemd[1]: deis-store-volume.service holdoff time over, scheduling restart.
Oct 28 19:52:16 deis-1 systemd[1]: Stopping deis-store-volume...
Oct 28 19:52:16 deis-1 systemd[1]: Starting deis-store-volume...
Oct 28 19:52:16 deis-1 sh[13163]: waiting for store-monitor...
Oct 28 19:52:16 deis-1 bash[13169]: cut: write error: Broken pipe
Oct 28 19:53:46 deis-1 systemd[1]: deis-store-volume.service start-pre operation timed out. Terminating.
Oct 28 19:53:46 deis-1 systemd[1]: Failed to start deis-store-volume.
...
$ etcdctl get /deis/store/monSetupComplete
youBetcha

mboersma · 2014-10-28T20:42:09Z

Based on cut: write error: Broken pipe in the above log, I am guessing store-volume is faling somewhere in here:
ExecStartPre=/bin/bash -c "HOSTS=etcdctl ls /deis/store/hosts | cut -d/ -f5 | etcdctl ls /deis/store/hosts | cut -d/ -f5 | awk '{if(NR == 1) {printf $0} else {printf ","$0}}' && mount -t ceph $HOSTS:/ /var/lib/deis/store -o name=admin,secret=etcdctl get /deis/store/adminKeyring | grep 'key =' | cut -d' ' -f3"

carmstrong · 2014-10-28T21:36:24Z

That means either /deis/store/hosts or /deis/store/adminKeyring isn't what we expect it to be, but both should certainly be set in a working cluster.

mboersma · 2014-10-28T21:50:34Z

Both those look to be correctly populated in etcd. I ran the StartPre command by hand and it just hangs:

# HOSTS=`etcdctl ls /deis/store/hosts | cut -d/ -f5 | awk '{if(NR == 1) {printf $0} else {printf ","$0}}'` && mount -t ceph $HOSTS:/ /var/lib/deis/store -o name=admin,secret=`etcdctl get /deis/store/adminKeyring | grep 'key =' | cut -d' ' -f3`

carmstrong · 2014-10-28T21:51:44Z

it just hangs:

Sounds like some of your monitors are down. What does ceph -s show in one of the Ceph containers?

mboersma · 2014-10-28T21:53:35Z

Actually I'm wrong, it eventually returned:

# HOSTS=`etcdctl ls /deis/store/hosts | cut -d/ -f5 | awk '{if(NR == 1) {printf $0} else {printf ","$0}}'` && mount -t ceph $HOSTS:/ /var/lib/deis/store -o name=admin,secret=`etcdctl get /deis/store/adminKeyring | grep 'key =' | cut -d' ' -f3`
mount: 172.17.8.100,172.17.8.102,172.17.8.101:/: can't read superblock
$ fleetctl list-machines
MACHINE     IP      METADATA
11de358a... 172.17.8.101    -
6c5a5b0a... 172.17.8.100    -
ea7309c6... 172.17.8.102    -
$ nse deis-store-monitor
groups: cannot find name for group ID 11
root@deis-1:/# ceph -s
    cluster 82a7a99c-797a-44a4-8ffa-e3dd27a5daf4
     health HEALTH_WARN 677 pgs degraded; 667 pgs down; 667 pgs peering; 667 pgs stuck inactive; 1344 pgs stuck unclean; 17 requests are blocked > 32 sec; recovery 328/903 objects degraded (36.323%)
     monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 8, quorum 0,1,2 deis-1,deis-2,deis-3
     mdsmap e4: 1/1/1 up {0=deis-2=up:creating}, 2 up:standby
     osdmap e34: 3 osds: 1 up, 1 in
      pgmap v252: 1344 pgs, 12 pools, 340 MB data, 301 objects
            9175 MB used, 6944 MB / 16402 MB avail
            328/903 objects degraded (36.323%)
                 677 active+degraded
                 667 down+peering

carmstrong · 2014-10-28T21:55:19Z

osdmap e34: 3 osds: 1 up, 1 in. One of your OSDs is down, resulting in placement groups moving hosts. The CephFS mount request is blocked: That's what 17 requests are blocked > 32 sec; means.

It sounds like the larger issue here is that during an upgrade, we're either not restarting all the daemons properly, or in the right order.

mboersma · 2014-10-28T21:56:21Z

during an upgrade, we're either not restarting all the daemons properly, or in the right order.

That seems likely, since this was easy to reproduce on upgrade but I haven't seen it on a clean install.

carmstrong · 2014-10-28T22:03:07Z

@mboersma Does restarting the downed deis-store-daemon component recover things?

mboersma · 2014-10-28T22:14:23Z

@carmstrong yes indeed it does recover store-volume. deisctl restart store-daemon ftw.

carmstrong · 2014-10-28T22:19:07Z

yes indeed it does recover store-volume.

Excellent. I'm digging into this now - it's definitely not an ideal upgrade experience, but that workaround is pretty straightforward.

mboersma · 2014-10-28T22:24:18Z

It is possible I hit a non-deterministic error and that this doesn't actually have to do with upgrading. It's just that all my fresh v0.15.0 provisions have gone smoothly, and this was the first upgrade I tried since the release and my experience matched that of at least one IRC user, so I assumed upgrade was the key.

carmstrong · 2014-10-28T22:31:54Z

Yeah, definitely appropriate to raise a red flag. I'm seeing if I can reproduce this on an upgrade.

carmstrong · 2014-10-29T00:17:17Z

@mboersma The bad news is that I'm seeing is. But the good news is that I'm seeing this.

root@deis-1:/# ceph -s
    cluster c50a4ca5-36ba-4796-a7ae-c8fa173350fe
     health HEALTH_WARN 444 pgs peering; 444 pgs stuck inactive; 444 pgs stuck unclean; 9 requests are blocked > 32 sec
     monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 10, quorum 0,1,2 deis-1,deis-2,deis-3
     mdsmap e4: 1/1/1 up {0=deis-3=up:creating}, 2 up:standby
     osdmap e33: 3 osds: 3 up, 3 in
      pgmap v154: 1344 pgs, 12 pools, 340 MB data, 292 objects
            21617 MB used, 26563 MB / 49206 MB avail
                 444 peering
                 900 active+clean

@mboersma In your case, you had a downed OSD. In my case, I have a downed MDS:

4413:   172.17.8.102:6805/1 'deis-3' mds.0.1 up:creating seq 1
4416:   172.17.8.101:6805/1 'deis-2' mds.-1.0 up:standby seq 1
4419:   172.17.8.100:6805/1 'deis-1' mds.-1.0 up:standby seq 1

2014-10-29 00:05:18.856786 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6756308 avail 9687820
2014-10-29 00:06:18.857303 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6761712 avail 9687344
2014-10-29 00:07:00.572592 7fa9a24ed700  0 -- 172.17.8.102:6789/0 >> 172.17.8.101:0/3607606102 pipe(0x304e580 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x301c7e0).accept peer addr is really 172.17.8.101:0/3607606102 (socket is 172.17.8.101:48241/0)
2014-10-29 00:07:18.857574 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6762240 avail 9686848
2014-10-29 00:08:18.857905 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6762692 avail 9686396
2014-10-29 00:09:00.616919 7fa9a24ed700  0 -- 172.17.8.102:6789/0 >> 172.17.8.102:0/3071133433 pipe(0x304f980 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x301d180).accept peer addr is really 172.17.8.102:0/3071133433 (socket is 172.17.8.102:46427/0)
2014-10-29 00:09:18.858163 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6763196 avail 9685892
2014-10-29 00:10:18.858526 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6763460 avail 9685660
2014-10-29 00:11:18.858880 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6763884 avail 9685268
2014-10-29 00:11:39.991963 7fa9a3ff3700  0 mon.deis-3@2(peon) e3 handle_command mon_command({"prefix": "status"} v 0) v1
2014-10-29 00:11:53.448984 7fa9a3ff3700  0 mon.deis-3@2(peon) e3 handle_command mon_command({"prefix": "status"} v 0) v1
2014-10-29 00:11:54.777507 7fa9a3ff3700  0 mon.deis-3@2(peon) e3 handle_command mon_command({"prefix": "status"} v 0) v1
2014-10-29 00:12:01.541955 7fa9a3ff3700  0 mon.deis-3@2(peon) e3 handle_command mon_command({"prefix": "status"} v 0) v1
2014-10-29 00:12:16.611664 7fa9a24ed700  0 -- 172.17.8.102:6789/0 >> 172.17.8.101:0/1825565583 pipe(0x304e300 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x301d020).accept peer addr is really 172.17.8.101:0/1825565583 (socket is 172.17.8.101:48945/0)
2014-10-29 00:12:17.000572 7fa9a22eb700  0 -- 172.17.8.102:6789/0 >> 172.17.8.100:0/3657654136 pipe(0x304e580 sd=22 :6789 s=0 pgs=0 cs=0 l=0 c=0x301cec0).accept peer addr is really 172.17.8.100:0/3657654136 (socket is 172.17.8.100:54392/0)
2014-10-29 00:12:18.859323 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6764144 avail 9685008
2014-10-29 00:12:45.677039 7fa9a3ff3700  0 mon.deis-3@2(peon) e3 handle_command mon_command({"prefix": "mds stat"} v 0) v1
2014-10-29 00:13:18.861155 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6759708 avail 9689348

I'm going to make a change to perform an ordered stop/uninstall with deisctl and see if that remedies this.

carmstrong · 2014-10-29T00:20:20Z

Doing deisctl restart store-daemon brought everything up again just fine.

EvanK · 2014-10-29T16:52:20Z

FWIW, I'm seeing the same death loop on deis-store-volume.service (activating/start-pre -> activating/auto-restart, et al) on a fresh bare-metal install of 0.15.0...so this may not just be an upgrade problem?

carmstrong · 2014-10-29T16:54:28Z

@EvanK :( That means there's still a race condition somewhere. Can you log into one of your machines, nse deis-store-monitor, and ceph -s for me please?

EvanK · 2014-10-29T17:32:19Z

@carmstrong I got errors for both: https://gist.github.com/EvanK/9c5bea91d9e47367c076

carmstrong · 2014-10-29T17:33:48Z

@EvanK Can you deisctl list? Sounds like your monitors aren't running.

EvanK · 2014-10-29T17:37:38Z

@carmstrong That's what it looks like:

core@deis-staging-node1 ~ $ deisctl list
UNIT                MACHINE         LOAD    ACTIVE      SUB
deis-builder.service        1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-cache.service      f11329f5.../10.60.61.36 loaded  inactive    dead
deis-controller.service     1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-database.service       95fcde77.../10.60.61.35 loaded  inactive    dead
deis-logger.service     95fcde77.../10.60.61.35 loaded  inactive    dead
deis-logspout.service       1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-logspout.service       95fcde77.../10.60.61.35 loaded  inactive    dead
deis-logspout.service       f11329f5.../10.60.61.36 loaded  inactive    dead
deis-publisher.service      1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-publisher.service      95fcde77.../10.60.61.35 loaded  inactive    dead
deis-publisher.service      f11329f5.../10.60.61.36 loaded  inactive    dead
deis-registry.service       f11329f5.../10.60.61.36 loaded  inactive    dead
deis-router@1.service       95fcde77.../10.60.61.35 loaded  inactive    dead
deis-router@2.service       f11329f5.../10.60.61.36 loaded  inactive    dead
deis-router@3.service       1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-store-daemon.service   1fef40c1.../10.60.61.37 loaded  active      running
deis-store-daemon.service   95fcde77.../10.60.61.35 loaded  active      running
deis-store-daemon.service   f11329f5.../10.60.61.36 loaded  active      running
deis-store-gateway.service  1fef40c1.../10.60.61.37 loaded  active      running
deis-store-metadata.service 1fef40c1.../10.60.61.37 loaded  active      running
deis-store-metadata.service 95fcde77.../10.60.61.35 loaded  active      running
deis-store-metadata.service f11329f5.../10.60.61.36 loaded  active      running
deis-store-monitor.service  1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-store-monitor.service  95fcde77.../10.60.61.35 loaded  inactive    dead
deis-store-monitor.service  f11329f5.../10.60.61.36 loaded  inactive    dead
deis-store-volume.service   1fef40c1.../10.60.61.37 loaded  activating  start-pre
deis-store-volume.service   95fcde77.../10.60.61.35 loaded  activating  start-pre
deis-store-volume.service   f11329f5.../10.60.61.36 loaded  activating  start-pre

This happened right out the gate from a deisctl start platform, where everything appeared to start -- up until the store-volume loop.

carmstrong · 2014-10-29T17:41:34Z

@EvanK That's really odd - a state of "inactive/dead" implies that the services were stopped, not failed. Can you log into one of your CoreOS hosts and journalctl -u deis-store-monitor? I'd like to see what happened.

But without your store-monitors, none of the store components will function properly. In either event, you'll need to deisctl start platform from the beginning so we make sure those are started. Alternatively, deisctl start store-monitor.

carmstrong · 2014-10-29T17:46:43Z

@EvanK I also deployed a fresh Deis 3 times with the deisctl built in #2307 and haven't run into an issue with store. Are you in a position where you can easily reprovision the cluster? I'd like to have some confidence that the fixes in that branch are an improvement on the current experience, and I haven't seen any store issues locally.

If you wouldn't mind starting over, I could build you a deisctl off that branch.

EvanK · 2014-10-29T18:10:29Z

@carmstrong I've stopped and restarted the platform a few times now since the coreos install. Here's the output of the latest start platform, as well as the entire log from that journalctl: https://gist.github.com/EvanK/d14303cc0decfe6955f9

The relevant lines for the latest stop/start are from 17:47 on. I am going to tear down these machines & completely recreate them again, following the same bare-metal docs

carmstrong · 2014-10-29T19:05:00Z

@EvanK We just merged #2307. If you still see issues starting the platform after using the deisctl I built for you, please open a new issue and I'll dive in. Thanks!

adrianmoya · 2014-10-29T19:07:03Z

I've just hit this issue installing fresh 0.15.0 on a private openstack. After provisioning the cluster, I installed the units, and then run deis platform start and it hanged:

● ▴ ■
■ ● ▴ Starting Deis...
▴ ■ ●

Storage subsystem...
deis-store-monitor.service: activating/start-pre                                 
deis-store-monitor.service: active/running                                 
deis-store-daemon.service: activating/start-pre                                 
deis-store-daemon.service: active/running                                 
deis-store-metadata.service: activating/start-pre                                 
deis-store-metadata.service: active/running                                 
deis-store-gateway.service: activating/start-pre                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-gateway.service: active/running                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: active/running                                 
Logging subsystem...

I did some of the advice in this thread:

deisctl list

UNIT                MACHINE         LOAD    ACTIVE      SUB
deis-builder.service        2036302f.../10.21.12.10 loaded  inactive    dead
deis-cache.service      25a78c22.../10.21.12.9  loaded  inactive    dead
deis-controller.service     e07df3c0.../10.21.12.11 loaded  inactive    dead
deis-database.service       2036302f.../10.21.12.10 loaded  inactive    dead
deis-logger.service     25a78c22.../10.21.12.9  loaded  inactive    dead
deis-logspout.service       2036302f.../10.21.12.10 loaded  inactive    dead
deis-logspout.service       25a78c22.../10.21.12.9  loaded  inactive    dead
deis-logspout.service       e07df3c0.../10.21.12.11 loaded  inactive    dead
deis-publisher.service      2036302f.../10.21.12.10 loaded  inactive    dead
deis-publisher.service      25a78c22.../10.21.12.9  loaded  inactive    dead
deis-publisher.service      e07df3c0.../10.21.12.11 loaded  inactive    dead
deis-registry.service       e07df3c0.../10.21.12.11 loaded  inactive    dead
deis-router@1.service       25a78c22.../10.21.12.9  loaded  inactive    dead
deis-router@2.service       2036302f.../10.21.12.10 loaded  inactive    dead
deis-router@3.service       e07df3c0.../10.21.12.11 loaded  inactive    dead
deis-store-daemon.service   2036302f.../10.21.12.10 loaded  active      running
deis-store-daemon.service   25a78c22.../10.21.12.9  loaded  active      running
deis-store-daemon.service   e07df3c0.../10.21.12.11 loaded  active      running
deis-store-gateway.service  2036302f.../10.21.12.10 loaded  active      running
deis-store-metadata.service 2036302f.../10.21.12.10 loaded  active      running
deis-store-metadata.service 25a78c22.../10.21.12.9  loaded  active      running
deis-store-metadata.service e07df3c0.../10.21.12.11 loaded  active      running
deis-store-monitor.service  2036302f.../10.21.12.10 loaded  active      running
deis-store-monitor.service  25a78c22.../10.21.12.9  loaded  active      running
deis-store-monitor.service  e07df3c0.../10.21.12.11 loaded  active      running
deis-store-volume.service   2036302f.../10.21.12.10 loaded  active      running
deis-store-volume.service   25a78c22.../10.21.12.9  loaded  active      running
deis-store-volume.service   e07df3c0.../10.21.12.11 loaded  active      running

ceph -s command inside Ceph container:

    cluster c97e62d4-4920-4c3d-baa5-2543a9269026
     health HEALTH_OK
     monmap e3: 3 mons at {deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal=10.21.12.10:6789/0,deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal=10.21.12.9:6789/0,deis-e07df3c0-3d13-409c-93e8-0e7b2fdf63f0.novalocal=10.21.12.11:6789/0}, election epoch 8, quorum 0,1,2 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal,deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal,deis-e07df3c0-3d13-409c-93e8-0e7b2fdf63f0.novalocal
     mdsmap e5: 1/1/1 up {0=deis-e07df3c0-3d13-409c-93e8-0e7b2fdf63f0.novalocal=up:active}, 2 up:standby
     osdmap e25: 3 osds: 3 up, 3 in
      pgmap v337: 960 pgs, 9 pools, 8216 bytes data, 65 objects
            16602 MB used, 37174 MB / 54326 MB avail
                 960 active+clean

journalctl -u deis-store-monitor

Oct 29 18:57:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:57:14.597274 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 18:58:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:58:14.597508 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 18:58:26 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:58:26.152722 7f55a12df700  0 log [INF] : pgmap v341: 960 pgs: 960 active+clean; 8216 bytes data, 16598 MB used, 3
Oct 29 18:58:28 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:58:28.216490 7f55a12df700  0 log [INF] : pgmap v342: 960 pgs: 960 active+clean; 8216 bytes data, 16592 MB used, 3
Oct 29 18:58:31 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:58:31.316164 7f55a12df700  0 log [INF] : pgmap v343: 960 pgs: 960 active+clean; 8216 bytes data, 16567 MB used, 3
Oct 29 18:59:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:59:14.597686 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 19:00:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:00:14.597916 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 19:00:26 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:00:26.155586 7f55a12df700  0 log [INF] : pgmap v344: 960 pgs: 960 active+clean; 8216 bytes data, 16567 MB used, 3
Oct 29 19:00:28 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:00:28.322457 7f55a12df700  0 log [INF] : pgmap v345: 960 pgs: 960 active+clean; 8216 bytes data, 16568 MB used, 3
Oct 29 19:00:31 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:00:31.339104 7f55a12df700  0 log [INF] : pgmap v346: 960 pgs: 960 active+clean; 8216 bytes data, 16564 MB used, 3
Oct 29 19:01:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:01:14.598108 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 19:02:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:02:14.598344 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 19:02:26 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:02:26.153705 7f55a12df700  0 log [INF] : pgmap v347: 960 pgs: 960 active+clean; 8216 bytes data, 16559 MB used, 3
Oct 29 19:02:28 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:02:28.361769 7f55a12df700  0 log [INF] : pgmap v348: 960 pgs: 960 active+clean; 8216 bytes data, 16554 MB used, 3
Oct 29 19:02:31 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:02:31.353229 7f55a12df700  0 log [INF] : pgmap v349: 960 pgs: 960 active+clean; 8216 bytes data, 16555 MB used, 3

journalctl -u deis-store-volume

Oct 29 17:10:27 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[11737]: waiting for store-monitor...
Oct 29 17:10:27 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal bash[11743]: cut: write error: Broken pipe
Oct 29 17:11:57 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: deis-store-volume.service start-pre operation timed out. Terminating.
Oct 29 17:11:57 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: Failed to start deis-store-volume.
Oct 29 17:11:57 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: Unit deis-store-volume.service entered failed state.
Oct 29 17:12:02 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: deis-store-volume.service holdoff time over, scheduling restart.
Oct 29 17:12:02 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: Stopping deis-store-volume...
Oct 29 17:12:02 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: Starting deis-store-volume...
Oct 29 17:12:02 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[11794]: waiting for store-monitor...
Oct 29 17:12:02 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal bash[11800]: cut: write error: Broken pipe
Oct 29 17:12:27 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: Started deis-store-volume.

If you want me to try anything else, please ask.

carmstrong · 2014-10-29T19:08:11Z

@adrianmoya I suspect your issue is also solved by #2307. If you're using OS X locally, can you try using this deisctl instead of v0.15.0: https://s3-us-west-2.amazonaws.com/opdemand/deisctl-pr2307-darwin

adrianmoya · 2014-10-29T19:08:42Z

@carmstrong I'm on linux

adrianmoya · 2014-10-29T19:15:33Z

I',ve just looked at #2307 and wanted to add that in my case, this is a fresh install, and not an upgrade. It hanged on my first run of deis start platform.

carmstrong · 2014-10-29T19:17:13Z

and wanted to add that in my case, this is a fresh install, and not an upgrade.

Right, I saw that too and tested in that PR. Commit 59f2bb5 should help on start.

carmstrong · 2014-10-29T19:24:37Z

@adrianmoya If you have a Go environment set up locally, you should be able to build deisctl with make -C deisctl install build. I'm going to fire up a Linux VM and see if I can't build it for you.

adrianmoya · 2014-10-29T22:38:55Z

@carmstrong ok, I builded deisctl from master, stopped/uninstalled what was done, and installed/started with this new deisctl, but the result seems to be the same:

● ▴ ■
■ ● ▴ Starting Deis...
▴ ■ ●

Storage subsystem...
deis-store-monitor.service: activating/start-pre                                 
deis-store-monitor.service: active/running                                 
deis-store-daemon.service: activating/start-pre                                 
deis-store-daemon.service: active/running                                 
deis-store-metadata.service: active/running                                 
deis-store-gateway.service: active/running                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: deactivating/final-sigterm                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart

carmstrong · 2014-10-29T22:42:25Z

@adrianmoya Can you post a gist of etcdctl ls --recursive /?

adrianmoya · 2014-10-29T22:43:34Z

https://gist.github.com/adrianmoya/83ef9e0be61f68cb8d75

carmstrong · 2014-10-29T22:46:09Z

@adrianmoya And ceph -s still looks healthy? Honestly, I'm at a loss here. I'm unable to reproduce this locally :(

adrianmoya · 2014-10-29T22:48:00Z

There's a warn:

    cluster c97e62d4-4920-4c3d-baa5-2543a9269026
     health HEALTH_WARN 649 pgs peering; 649 pgs stuck inactive; 649 pgs stuck unclean; 3 requests are blocked > 32 sec; mds cluster is degraded
     monmap e3: 3 mons at {deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal=10.21.12.10:6789/0,deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal=10.21.12.9:6789/0,deis-e07df3c0-3d13-409c-93e8-0e7b2fdf63f0.novalocal=10.21.12.11:6789/0}, election epoch 12, quorum 0,1,2 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal,deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal,deis-e07df3c0-3d13-409c-93e8-0e7b2fdf63f0.novalocal
     mdsmap e10: 1/1/1 up {0=deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal=up:replay}, 2 up:standby
     osdmap e36: 3 osds: 3 up, 3 in
      pgmap v706: 960 pgs, 9 pools, 8430 bytes data, 65 objects
            16553 MB used, 37217 MB / 54326 MB avail
                 649 peering
                 311 active+clean

carmstrong · 2014-10-29T22:59:19Z

Ok. mds cluster is degraded. You can also see that your main mds is in a status of 'replaying': mdsmap e10: 1/1/1 up {0=deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal=up:replay}, 2 up:standby.

I suspect that the cluster will recover on its own once the placement groups have migrated - and if it doesn't, you should be able to deisctl restart store-daemon. Did you stop the old cluster with deisctl stop platform before you uninstalled it?

adrianmoya · 2014-10-29T23:02:29Z

Yes, I stoped, then uninstalled, and then installed/started again. I've just notice that the installation continued and it's currently installing the Control plane...

carmstrong · 2014-10-29T23:02:58Z

I've just notice that the installation continued and it's currently installing the Control plane...

Good, that's what I was hoping. It looks like your cluster just took a little longer to recover.

adrianmoya · 2014-10-30T02:58:58Z

After starting from scratch, the only component that didn't start was the builder:

deisctl list
UNIT                MACHINE         LOAD    ACTIVE      SUB
deis-builder.service        a9444298.../10.21.12.13 loaded  activating  start-post
deis-cache.service      be5aa61b.../10.21.12.12 loaded  active      running
deis-controller.service     e71e4d58.../10.21.12.14 loaded  active      running
deis-database.service       a9444298.../10.21.12.13 loaded  active      running
deis-logger.service     be5aa61b.../10.21.12.12 loaded  active      running
deis-logspout.service       a9444298.../10.21.12.13 loaded  active      running
deis-logspout.service       be5aa61b.../10.21.12.12 loaded  active      running
deis-logspout.service       e71e4d58.../10.21.12.14 loaded  active      running
deis-publisher.service      a9444298.../10.21.12.13 loaded  active      running
deis-publisher.service      be5aa61b.../10.21.12.12 loaded  active      running
deis-publisher.service      e71e4d58.../10.21.12.14 loaded  active      running
deis-registry.service       e71e4d58.../10.21.12.14 loaded  active      running
deis-router@1.service       e71e4d58.../10.21.12.14 loaded  active      running
deis-router@2.service       a9444298.../10.21.12.13 loaded  active      running
deis-router@3.service       be5aa61b.../10.21.12.12 loaded  active      running
deis-store-daemon.service   a9444298.../10.21.12.13 loaded  active      running
deis-store-daemon.service   be5aa61b.../10.21.12.12 loaded  active      running
deis-store-daemon.service   e71e4d58.../10.21.12.14 loaded  active      running
deis-store-gateway.service  a9444298.../10.21.12.13 loaded  active      running
deis-store-metadata.service a9444298.../10.21.12.13 loaded  active      running
deis-store-metadata.service be5aa61b.../10.21.12.12 loaded  active      running
deis-store-metadata.service e71e4d58.../10.21.12.14 loaded  active      running
deis-store-monitor.service  a9444298.../10.21.12.13 loaded  active      running
deis-store-monitor.service  be5aa61b.../10.21.12.12 loaded  active      running
deis-store-monitor.service  e71e4d58.../10.21.12.14 loaded  active      running
deis-store-volume.service   a9444298.../10.21.12.13 loaded  active      running
deis-store-volume.service   be5aa61b.../10.21.12.12 loaded  active      running
deis-store-volume.service   e71e4d58.../10.21.12.14 loaded  active      running

Journal for builder: https://gist.github.com/adrianmoya/d71f43e66dd76159cdb9

carmstrong · 2014-10-30T04:32:28Z

@adrianmoya That's a known issue, at least - I think #2214. Try deisctl restart builder and see if it comes back up the second try.

adrianmoya · 2014-10-30T15:12:22Z

@carmstrong No luck restarting the builder. Is there anything else I can check?

Oct 30 13:55:07 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[6952]: 2014-10-30T13:55:07Z c8f6bf49cc9e confd[15655]: ERROR Config check failed: exit status 1
Oct 30 13:55:07 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[6952]: 2014-10-30T13:55:07Z c8f6bf49cc9e confd[15655]: ERROR Config check failed: exit status 1
Oct 30 13:55:07 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[6952]: builder: waiting for confd to write initial templates...
Oct 30 13:55:12 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal systemd[1]: deis-builder.service stop-sigterm timed out. Killing.
Oct 30 13:55:12 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal systemd[1]: deis-builder.service: control process exited, code=killed status=9
Oct 30 13:55:13 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal systemd[1]: Stopped deis-builder.
Oct 30 13:55:13 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal systemd[1]: Unit deis-builder.service entered failed state.
Oct 30 13:55:13 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal docker[31321]: deis-builder
Oct 30 15:09:44 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal systemd[1]: Starting deis-builder...
Oct 30 15:09:44 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[9722]: Error: No such image or container: deis-builder
Oct 30 15:09:44 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[9734]: Waiting for builder on 2223/tcp...
Oct 30 15:10:16 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[9733]: 2014-10-30T15:10:16Z 6ebc9c43e463 confd[77]: ERROR Config check failed: exit status 1
Oct 30 15:10:16 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[9733]: 2014-10-30T15:10:16Z 6ebc9c43e463 confd[77]: ERROR Config check failed: exit status 1
Oct 30 15:10:16 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[9733]: builder: waiting for confd to write initial templates...

geovanisouza92 · 2014-11-01T11:31:55Z

Hello guys,

I'm starting on deis with my notebook:

Ubuntu Trusty 14.04;
4GB RAM;
i5 1st gen;
Vagrant 1.6.5;
Virtualbox 4.3.10.

And I'm stuck on same issue: store-volume loop. Tried restarting VM's, restarting store-daemon and so on. Same logs. Now I'll starting again from master branch.

carmstrong · 2014-11-01T17:19:46Z

4GB RAM;

You're likely going to run into issues deploying a 3-node Deis cluster with so little RAM. However, that shouldn't be the cause of the store-volume issues. Can you check the journal for the service on all of your VMs? I'm thinking one of them is failing.

EvanK · 2014-11-01T19:23:08Z

@carmstrong Are there minimum viable specs published anywhere? I couldn't find anything in the documentation beyond a reference to "large" EC2 instances...

adrianmoya · 2014-11-01T21:22:09Z

I had to give up on 0.15.0, will wait to 0.16.0 to try again the installation and see if the "component X got stuck during installation" errors get better. :(

carmstrong · 2014-11-02T09:30:50Z

Are there minimum viable specs published anywhere?

We only allude to the memory requirements in the main README. Having hard requirements is difficult because Deis is a distributed system, but ideally you want all components to be able to run on a single host if you're running in a degraded state. So I'd say 8GB of RAM per node to give you enough breathing room for that.

I'm working on putting together "Troubleshooting Deis" documentation which will detail commonly-encountered issues

carmstrong · 2014-11-02T09:31:51Z

I had to give up on 0.15.0, will wait to 0.16.0 to try again the installation and see if the "component X got stuck during installation" errors get better. :(

I'm really sorry you're running into issues :( We may cut a v0.15.1 with some fixes for both the confd loop and the start order for the deis-store components so that they're more reliable.

geovanisouza92 · 2014-11-02T15:02:22Z

@carmstrong You're right. First, I tried to run 3 instances and the result was so much memory pagination. Then, I changed the $vm_memory to 1024 and deployed 2 instances.

In both, the results was the same. The store-volume start-pre/auto-restart infinite loop, with some OSD nodes down (3 nodes, 1 up; 2 nodes, 1 up). The logs is identical to those reported in this issue and some related that I found near.

I tried restart the nodes, the deamon, the gateway, and so on. No one worked.

Anyway this project is very very promising. Please, don't give up! :D

I'll be here testing and reporting what I can, because I don't know Go (I'm an Python/Android programmer), but I really want to test it on my Django/Android project.

bacongobbler · 2014-11-02T18:14:41Z

because I don't know Go (I'm an Python/Android programmer), but I really want to test it on my Django/Android project.

You'd be happy to hear that our controller is written in Python, using Django as the web framework of choice :)

mboersma added the bug label Oct 28, 2014

mboersma added this to the 0.16 milestone Oct 28, 2014

mboersma changed the title ~~store-volume fails after update from v0.14.1 to v0.15.0~~ store-volume fails after update to v0.15.0 Oct 28, 2014

mboersma mentioned this issue Oct 28, 2014

fix(deisctl/units): remove extra commands in store-volume Start-Pre #2302

Merged

carmstrong mentioned this issue Oct 28, 2014

Allow to use any S3 compatible store #2295

Closed

carmstrong mentioned this issue Oct 28, 2014

deis-store-volume fails to restart #2286

Closed

carmstrong mentioned this issue Oct 29, 2014

fix(deisctl): use wait groups for stopping and uninstalling store #2307

Merged

EvanK mentioned this issue Oct 29, 2014

store-volume stuck in start-pre/auto-restart loop on deisctl start platform #2317

Closed

carmstrong mentioned this issue Oct 29, 2014

store: block until Ceph is healthy #2321

Closed

adrianmoya mentioned this issue Oct 30, 2014

v0.14.0 Builder Start Failure: No such id: deis/builder:v0.14.0 #2248

Closed

hellojustin mentioned this issue Mar 9, 2015

deis-store-* stuck in auto-restart loop after EC2 AutoScaling (Deis 1.3.0) #3240

Closed

store-volume fails after update to v0.15.0 #2300

store-volume fails after update to v0.15.0 #2300

Comments

mboersma commented Oct 28, 2014

mboersma commented Oct 28, 2014

mboersma commented Oct 28, 2014

carmstrong commented Oct 28, 2014

mboersma commented Oct 28, 2014

carmstrong commented Oct 28, 2014

mboersma commented Oct 28, 2014

carmstrong commented Oct 28, 2014

mboersma commented Oct 28, 2014

carmstrong commented Oct 28, 2014

mboersma commented Oct 28, 2014

carmstrong commented Oct 28, 2014

mboersma commented Oct 28, 2014

carmstrong commented Oct 28, 2014

carmstrong commented Oct 29, 2014

carmstrong commented Oct 29, 2014

EvanK commented Oct 29, 2014

carmstrong commented Oct 29, 2014

EvanK commented Oct 29, 2014

carmstrong commented Oct 29, 2014

EvanK commented Oct 29, 2014

carmstrong commented Oct 29, 2014

carmstrong commented Oct 29, 2014

EvanK commented Oct 29, 2014

carmstrong commented Oct 29, 2014

adrianmoya commented Oct 29, 2014

carmstrong commented Oct 29, 2014

adrianmoya commented Oct 29, 2014

adrianmoya commented Oct 29, 2014

carmstrong commented Oct 29, 2014

carmstrong commented Oct 29, 2014

adrianmoya commented Oct 29, 2014

carmstrong commented Oct 29, 2014

adrianmoya commented Oct 29, 2014

carmstrong commented Oct 29, 2014

adrianmoya commented Oct 29, 2014

carmstrong commented Oct 29, 2014

adrianmoya commented Oct 29, 2014

carmstrong commented Oct 29, 2014

adrianmoya commented Oct 30, 2014

carmstrong commented Oct 30, 2014

adrianmoya commented Oct 30, 2014

geovanisouza92 commented Nov 1, 2014

carmstrong commented Nov 1, 2014

EvanK commented Nov 1, 2014

adrianmoya commented Nov 1, 2014

carmstrong commented Nov 2, 2014

carmstrong commented Nov 2, 2014

geovanisouza92 commented Nov 2, 2014

bacongobbler commented Nov 2, 2014