This repository has been archived by the owner on Nov 30, 2021. It is now read-only.

store-volume fails after update to v0.15.0 #2300

Closed
mboersma opened this issue Oct 28, 2014 · 52 comments · Fixed by #2307

@mboersma
Member

Following the upgrade instructions didn't work for me on 3-node vagrant.

$ git checkout v0.14.1
$ make -C deisctl/ build
$ deisctl --version
0.14.1
$ deisctl refresh-units
$ deisctl config platform set domain=local3.deisapp.com
$ deisctl install platform
$ deisctl start platform
$ deisctl uninstall platform
$ git checkout v0.15.0
$ make -C deisctl/ build
$ deisctl --version
0.15.0
$ deisctl refresh-units
$ deisctl config platform set version=v0.15.0
$ deisctl install platform
$ deisctl start platform
● ▴ ■
■ ● ▴ Starting Deis...
▴ ■ ●

Storage subsystem...
deis-store-monitor.service: activating/start-pre                                 
deis-store-monitor.service: active/running                                 
deis-store-daemon.service: activating/start-pre                                 
deis-store-daemon.service: active/running                                 
deis-store-metadata.service: activating/start-pre                                 
deis-store-metadata.service: active/running                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-gateway.service: activating/start-pre                                 
deis-store-gateway.service: active/running                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre 
...
mboersma added the bug label on Oct 28, 2014
mboersma added this to the 0.16 milestone on Oct 28, 2014
mboersma changed the title from "store-volume fails after update from v0.14.1 to v0.15.0" to "store-volume fails after update to v0.15.0" on Oct 28, 2014
@mboersma
Member Author

$ journalctl -u deis-store-volume
-- Logs begin at Tue 2014-10-28 19:13:28 UTC, end at Tue 2014-10-28 20:36:19 UTC. --
Oct 28 19:50:41 deis-1 systemd[1]: Starting deis-store-volume...
Oct 28 19:50:41 deis-1 sh[12964]: waiting for store-monitor...
Oct 28 19:50:41 deis-1 bash[12972]: cut: write error: Broken pipe
Oct 28 19:52:11 deis-1 systemd[1]: deis-store-volume.service start-pre operation timed out. Terminating.
Oct 28 19:52:11 deis-1 systemd[1]: Failed to start deis-store-volume.
Oct 28 19:52:11 deis-1 systemd[1]: Unit deis-store-volume.service entered failed state.
Oct 28 19:52:16 deis-1 systemd[1]: deis-store-volume.service holdoff time over, scheduling restart.
Oct 28 19:52:16 deis-1 systemd[1]: Stopping deis-store-volume...
Oct 28 19:52:16 deis-1 systemd[1]: Starting deis-store-volume...
Oct 28 19:52:16 deis-1 sh[13163]: waiting for store-monitor...
Oct 28 19:52:16 deis-1 bash[13169]: cut: write error: Broken pipe
Oct 28 19:53:46 deis-1 systemd[1]: deis-store-volume.service start-pre operation timed out. Terminating.
Oct 28 19:53:46 deis-1 systemd[1]: Failed to start deis-store-volume.
...
$ etcdctl get /deis/store/monSetupComplete
youBetcha

@mboersma
Member Author

Based on cut: write error: Broken pipe in the above log, I am guessing store-volume is failing somewhere in here:
ExecStartPre=/bin/bash -c "HOSTS=`etcdctl ls /deis/store/hosts | cut -d/ -f5 | awk '{if(NR == 1) {printf $0} else {printf ","$0}}'` && mount -t ceph $HOSTS:/ /var/lib/deis/store -o name=admin,secret=`etcdctl get /deis/store/adminKeyring | grep 'key =' | cut -d' ' -f3`"
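
For readers following along, here is that one-liner broken into stages (a sketch; the example IPs are the ones from this cluster):

# Stage 1: list the store hosts registered in etcd and join them with commas.
# etcdctl ls /deis/store/hosts | cut -d/ -f5 prints one IP per line,
# e.g. 172.17.8.100, 172.17.8.101, 172.17.8.102.
HOSTS=`etcdctl ls /deis/store/hosts | cut -d/ -f5 | awk '{if(NR == 1) {printf $0} else {printf ","$0}}'`
# Stage 2: read the admin keyring ("key = <secret>") and keep only the secret.
SECRET=`etcdctl get /deis/store/adminKeyring | grep 'key =' | cut -d' ' -f3`
# Stage 3: mount CephFS. This is the step that blocks when the Ceph cluster is
# unhealthy; systemd's start-pre timeout (90 seconds in the journal above) then
# kills it, producing the activating/auto-restart loop.
mount -t ceph $HOSTS:/ /var/lib/deis/store -o name=admin,secret=$SECRET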

@carmstrong
Contributor

That means either /deis/store/hosts or /deis/store/adminKeyring isn't what we expect it to be, but both should certainly be set in a working cluster.

@mboersma
Member Author

Both those look to be correctly populated in etcd. I ran the StartPre command by hand and it just hangs:

# HOSTS=`etcdctl ls /deis/store/hosts | cut -d/ -f5 | awk '{if(NR == 1) {printf $0} else {printf ","$0}}'` && mount -t ceph $HOSTS:/ /var/lib/deis/store -o name=admin,secret=`etcdctl get /deis/store/adminKeyring | grep 'key =' | cut -d' ' -f3`

@carmstrong
Contributor

it just hangs:

Sounds like some of your monitors are down. What does ceph -s show in one of the Ceph containers?

@mboersma
Member Author

Actually I'm wrong, it eventually returned:

# HOSTS=`etcdctl ls /deis/store/hosts | cut -d/ -f5 | awk '{if(NR == 1) {printf $0} else {printf ","$0}}'` && mount -t ceph $HOSTS:/ /var/lib/deis/store -o name=admin,secret=`etcdctl get /deis/store/adminKeyring | grep 'key =' | cut -d' ' -f3`
mount: 172.17.8.100,172.17.8.102,172.17.8.101:/: can't read superblock
$ fleetctl list-machines
MACHINE     IP      METADATA
11de358a... 172.17.8.101    -
6c5a5b0a... 172.17.8.100    -
ea7309c6... 172.17.8.102    -
$ nse deis-store-monitor
groups: cannot find name for group ID 11
root@deis-1:/# ceph -s
    cluster 82a7a99c-797a-44a4-8ffa-e3dd27a5daf4
     health HEALTH_WARN 677 pgs degraded; 667 pgs down; 667 pgs peering; 667 pgs stuck inactive; 1344 pgs stuck unclean; 17 requests are blocked > 32 sec; recovery 328/903 objects degraded (36.323%)
     monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 8, quorum 0,1,2 deis-1,deis-2,deis-3
     mdsmap e4: 1/1/1 up {0=deis-2=up:creating}, 2 up:standby
     osdmap e34: 3 osds: 1 up, 1 in
      pgmap v252: 1344 pgs, 12 pools, 340 MB data, 301 objects
            9175 MB used, 6944 MB / 16402 MB avail
            328/903 objects degraded (36.323%)
                 677 active+degraded
                 667 down+peering

@carmstrong
Contributor

osdmap e34: 3 osds: 1 up, 1 in. Only one of your three OSDs is up and in, so placement groups are being moved between hosts, and the CephFS mount request is blocked while they recover. That's what 17 requests are blocked > 32 sec means.

It sounds like the larger issue here is that during an upgrade, we're either not restarting all the daemons properly, or in the right order.
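
To pin down which daemon is unhealthy before restarting anything, the standard Ceph status commands (run inside the store-monitor container, as above) are enough. A minimal sketch:

# Expand HEALTH_WARN into the individual failing checks and stuck PGs.
ceph health detail
# List every OSD with its host and up/down, in/out state, to identify the
# one behind "osdmap e34: 3 osds: 1 up, 1 in".
ceph osd tree
# Show the MDS states (up:active vs. up:creating or up:replay).
ceph mds stat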

@mboersma
Member Author

during an upgrade, we're either not restarting all the daemons properly, or in the right order.

That seems likely, since this was easy to reproduce on upgrade but I haven't seen it on a clean install.

@carmstrong
Contributor

@mboersma Does restarting the downed deis-store-daemon component recover things?

@mboersma
Member Author

@carmstrong yes indeed it does recover store-volume. deisctl restart store-daemon ftw.

@carmstrong
Contributor

yes indeed it does recover store-volume.

Excellent. I'm digging into this now - it's definitely not an ideal upgrade experience, but that workaround is pretty straightforward.
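
Distilled, the workaround from this exchange (assuming a cluster stuck in the state shown above):

# Restart the OSD containers so the downed one rejoins the cluster.
deisctl restart store-daemon
# Then, from any host, confirm that Ceph reports all OSDs up and in:
nse deis-store-monitor
ceph -s    # wait for "3 osds: 3 up, 3 in" before retrying the mount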

@mboersma
Member Author

It is possible I hit a non-deterministic error and that this doesn't actually have to do with upgrading. But all my fresh v0.15.0 provisions have gone smoothly, this was the first upgrade I had tried since the release, and my experience matched that of at least one IRC user, so I assumed upgrading was the key.

@carmstrong
Contributor

Yeah, definitely appropriate to raise a red flag. I'm seeing if I can reproduce this on an upgrade.

@carmstrong
Contributor

@mboersma The bad news is that I'm seeing this too. But the good news is that I can now reproduce it:

root@deis-1:/# ceph -s
    cluster c50a4ca5-36ba-4796-a7ae-c8fa173350fe
     health HEALTH_WARN 444 pgs peering; 444 pgs stuck inactive; 444 pgs stuck unclean; 9 requests are blocked > 32 sec
     monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 10, quorum 0,1,2 deis-1,deis-2,deis-3
     mdsmap e4: 1/1/1 up {0=deis-3=up:creating}, 2 up:standby
     osdmap e33: 3 osds: 3 up, 3 in
      pgmap v154: 1344 pgs, 12 pools, 340 MB data, 292 objects
            21617 MB used, 26563 MB / 49206 MB avail
                 444 peering
                 900 active+clean

@mboersma In your case, you had a downed OSD. In my case, the MDS is stuck in up:creating:

4413:   172.17.8.102:6805/1 'deis-3' mds.0.1 up:creating seq 1
4416:   172.17.8.101:6805/1 'deis-2' mds.-1.0 up:standby seq 1
4419:   172.17.8.100:6805/1 'deis-1' mds.-1.0 up:standby seq 1

2014-10-29 00:05:18.856786 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6756308 avail 9687820
2014-10-29 00:06:18.857303 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6761712 avail 9687344
2014-10-29 00:07:00.572592 7fa9a24ed700  0 -- 172.17.8.102:6789/0 >> 172.17.8.101:0/3607606102 pipe(0x304e580 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x301c7e0).accept peer addr is really 172.17.8.101:0/3607606102 (socket is 172.17.8.101:48241/0)
2014-10-29 00:07:18.857574 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6762240 avail 9686848
2014-10-29 00:08:18.857905 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6762692 avail 9686396
2014-10-29 00:09:00.616919 7fa9a24ed700  0 -- 172.17.8.102:6789/0 >> 172.17.8.102:0/3071133433 pipe(0x304f980 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x301d180).accept peer addr is really 172.17.8.102:0/3071133433 (socket is 172.17.8.102:46427/0)
2014-10-29 00:09:18.858163 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6763196 avail 9685892
2014-10-29 00:10:18.858526 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6763460 avail 9685660
2014-10-29 00:11:18.858880 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6763884 avail 9685268
2014-10-29 00:11:39.991963 7fa9a3ff3700  0 mon.deis-3@2(peon) e3 handle_command mon_command({"prefix": "status"} v 0) v1
2014-10-29 00:11:53.448984 7fa9a3ff3700  0 mon.deis-3@2(peon) e3 handle_command mon_command({"prefix": "status"} v 0) v1
2014-10-29 00:11:54.777507 7fa9a3ff3700  0 mon.deis-3@2(peon) e3 handle_command mon_command({"prefix": "status"} v 0) v1
2014-10-29 00:12:01.541955 7fa9a3ff3700  0 mon.deis-3@2(peon) e3 handle_command mon_command({"prefix": "status"} v 0) v1
2014-10-29 00:12:16.611664 7fa9a24ed700  0 -- 172.17.8.102:6789/0 >> 172.17.8.101:0/1825565583 pipe(0x304e300 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x301d020).accept peer addr is really 172.17.8.101:0/1825565583 (socket is 172.17.8.101:48945/0)
2014-10-29 00:12:17.000572 7fa9a22eb700  0 -- 172.17.8.102:6789/0 >> 172.17.8.100:0/3657654136 pipe(0x304e580 sd=22 :6789 s=0 pgs=0 cs=0 l=0 c=0x301cec0).accept peer addr is really 172.17.8.100:0/3657654136 (socket is 172.17.8.100:54392/0)
2014-10-29 00:12:18.859323 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6764144 avail 9685008
2014-10-29 00:12:45.677039 7fa9a3ff3700  0 mon.deis-3@2(peon) e3 handle_command mon_command({"prefix": "mds stat"} v 0) v1
2014-10-29 00:13:18.861155 7fa9a47f4700  0 mon.deis-3@2(peon).data_health(10) update_stats avail 57% total 16795648 used 6759708 avail 9689348

I'm going to make a change to perform an ordered stop/uninstall with deisctl and see if that remedies this.
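
A sketch of what an ordered restart of the storage components might look like (the component names are from deisctl list above; the actual ordering adopted in #2307 may differ):

# Stop the store units in reverse dependency order...
deisctl stop store-volume
deisctl stop store-gateway
deisctl stop store-metadata
deisctl stop store-daemon
deisctl stop store-monitor
# ...then start the monitors first, so the OSDs and MDS can reach quorum
# before anything tries to mount CephFS.
deisctl start store-monitor
deisctl start store-daemon
deisctl start store-metadata
deisctl start store-volume
deisctl start store-gateway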

@carmstrong
Contributor

Doing deisctl restart store-daemon brought everything up again just fine.

@EvanK

EvanK commented Oct 29, 2014

FWIW, I'm seeing the same death loop on deis-store-volume.service (activating/start-pre -> activating/auto-restart, et al) on a fresh bare-metal install of 0.15.0...so this may not just be an upgrade problem?

@carmstrong
Contributor

@EvanK :( That means there's still a race condition somewhere. Can you log into one of your machines, nse deis-store-monitor, and ceph -s for me please?
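
(For anyone without it: nse is a small helper from the Deis CoreOS user-data for entering a running container's namespaces. It is roughly the function sketched below, assuming Docker and util-linux's nsenter on the host.)

nse() {
  # Enter all namespaces of the named container via its PID.
  sudo nsenter --pid --uts --mount --ipc --net \
    --target $(docker inspect --format '{{ .State.Pid }}' "$1")
}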

@EvanK

EvanK commented Oct 29, 2014

@carmstrong
Contributor

@EvanK Can you deisctl list? Sounds like your monitors aren't running.

@EvanK

EvanK commented Oct 29, 2014

@carmstrong That's what it looks like:

core@deis-staging-node1 ~ $ deisctl list
UNIT                MACHINE         LOAD    ACTIVE      SUB
deis-builder.service        1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-cache.service      f11329f5.../10.60.61.36 loaded  inactive    dead
deis-controller.service     1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-database.service       95fcde77.../10.60.61.35 loaded  inactive    dead
deis-logger.service     95fcde77.../10.60.61.35 loaded  inactive    dead
deis-logspout.service       1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-logspout.service       95fcde77.../10.60.61.35 loaded  inactive    dead
deis-logspout.service       f11329f5.../10.60.61.36 loaded  inactive    dead
deis-publisher.service      1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-publisher.service      95fcde77.../10.60.61.35 loaded  inactive    dead
deis-publisher.service      f11329f5.../10.60.61.36 loaded  inactive    dead
deis-registry.service       f11329f5.../10.60.61.36 loaded  inactive    dead
deis-router@1.service       95fcde77.../10.60.61.35 loaded  inactive    dead
deis-router@2.service       f11329f5.../10.60.61.36 loaded  inactive    dead
deis-router@3.service       1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-store-daemon.service   1fef40c1.../10.60.61.37 loaded  active      running
deis-store-daemon.service   95fcde77.../10.60.61.35 loaded  active      running
deis-store-daemon.service   f11329f5.../10.60.61.36 loaded  active      running
deis-store-gateway.service  1fef40c1.../10.60.61.37 loaded  active      running
deis-store-metadata.service 1fef40c1.../10.60.61.37 loaded  active      running
deis-store-metadata.service 95fcde77.../10.60.61.35 loaded  active      running
deis-store-metadata.service f11329f5.../10.60.61.36 loaded  active      running
deis-store-monitor.service  1fef40c1.../10.60.61.37 loaded  inactive    dead
deis-store-monitor.service  95fcde77.../10.60.61.35 loaded  inactive    dead
deis-store-monitor.service  f11329f5.../10.60.61.36 loaded  inactive    dead
deis-store-volume.service   1fef40c1.../10.60.61.37 loaded  activating  start-pre
deis-store-volume.service   95fcde77.../10.60.61.35 loaded  activating  start-pre
deis-store-volume.service   f11329f5.../10.60.61.36 loaded  activating  start-pre

This happened right out the gate from a deisctl start platform, where everything appeared to start -- up until the store-volume loop.

@carmstrong
Contributor

@EvanK That's really odd - a state of "inactive/dead" implies that the services were stopped, not failed. Can you log into one of your CoreOS hosts and journalctl -u deis-store-monitor? I'd like to see what happened.

But without your store-monitors, none of the store components will function properly. In either event, you'll need to deisctl start platform from the beginning so we make sure those are started. Alternatively, deisctl start store-monitor.
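
In other words, the minimal recovery from this state is:

# Bring the monitors up first; every other store component waits on them.
deisctl start store-monitor
# Then re-run the full start so the remaining units come up in order.
deisctl start platform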

@carmstrong
Contributor

@EvanK I also deployed a fresh Deis 3 times with the deisctl built in #2307 and haven't run into an issue with store. Are you in a position where you can easily reprovision the cluster? I'd like to have some confidence that the fixes in that branch are an improvement on the current experience, and I haven't seen any store issues locally.

If you wouldn't mind starting over, I could build you a deisctl off that branch.

@EvanK

EvanK commented Oct 29, 2014

@carmstrong I've stopped and restarted the platform a few times now since the CoreOS install. Here's the output of the latest start platform, as well as the entire log from that journalctl: https://gist.github.com/EvanK/d14303cc0decfe6955f9

The relevant lines for the latest stop/start are from 17:47 on. I am going to tear down these machines and completely recreate them, following the same bare-metal docs.

@carmstrong
Contributor

@EvanK We just merged #2307. If you still see issues starting the platform after using the deisctl I built for you, please open a new issue and I'll dive in. Thanks!

@adrianmoya
Contributor

I've just hit this issue installing a fresh 0.15.0 on a private OpenStack. After provisioning the cluster, I installed the units and then ran deisctl start platform, and it hung:

● ▴ ■
■ ● ▴ Starting Deis...
▴ ■ ●

Storage subsystem...
deis-store-monitor.service: activating/start-pre                                 
deis-store-monitor.service: active/running                                 
deis-store-daemon.service: activating/start-pre                                 
deis-store-daemon.service: active/running                                 
deis-store-metadata.service: activating/start-pre                                 
deis-store-metadata.service: active/running                                 
deis-store-gateway.service: activating/start-pre                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-gateway.service: active/running                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: active/running                                 
Logging subsystem...

I followed some of the advice in this thread:

deisctl list

UNIT                MACHINE         LOAD    ACTIVE      SUB
deis-builder.service        2036302f.../10.21.12.10 loaded  inactive    dead
deis-cache.service      25a78c22.../10.21.12.9  loaded  inactive    dead
deis-controller.service     e07df3c0.../10.21.12.11 loaded  inactive    dead
deis-database.service       2036302f.../10.21.12.10 loaded  inactive    dead
deis-logger.service     25a78c22.../10.21.12.9  loaded  inactive    dead
deis-logspout.service       2036302f.../10.21.12.10 loaded  inactive    dead
deis-logspout.service       25a78c22.../10.21.12.9  loaded  inactive    dead
deis-logspout.service       e07df3c0.../10.21.12.11 loaded  inactive    dead
deis-publisher.service      2036302f.../10.21.12.10 loaded  inactive    dead
deis-publisher.service      25a78c22.../10.21.12.9  loaded  inactive    dead
deis-publisher.service      e07df3c0.../10.21.12.11 loaded  inactive    dead
deis-registry.service       e07df3c0.../10.21.12.11 loaded  inactive    dead
deis-router@1.service       25a78c22.../10.21.12.9  loaded  inactive    dead
deis-router@2.service       2036302f.../10.21.12.10 loaded  inactive    dead
deis-router@3.service       e07df3c0.../10.21.12.11 loaded  inactive    dead
deis-store-daemon.service   2036302f.../10.21.12.10 loaded  active      running
deis-store-daemon.service   25a78c22.../10.21.12.9  loaded  active      running
deis-store-daemon.service   e07df3c0.../10.21.12.11 loaded  active      running
deis-store-gateway.service  2036302f.../10.21.12.10 loaded  active      running
deis-store-metadata.service 2036302f.../10.21.12.10 loaded  active      running
deis-store-metadata.service 25a78c22.../10.21.12.9  loaded  active      running
deis-store-metadata.service e07df3c0.../10.21.12.11 loaded  active      running
deis-store-monitor.service  2036302f.../10.21.12.10 loaded  active      running
deis-store-monitor.service  25a78c22.../10.21.12.9  loaded  active      running
deis-store-monitor.service  e07df3c0.../10.21.12.11 loaded  active      running
deis-store-volume.service   2036302f.../10.21.12.10 loaded  active      running
deis-store-volume.service   25a78c22.../10.21.12.9  loaded  active      running
deis-store-volume.service   e07df3c0.../10.21.12.11 loaded  active      running

ceph -s command inside Ceph container:

    cluster c97e62d4-4920-4c3d-baa5-2543a9269026
     health HEALTH_OK
     monmap e3: 3 mons at {deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal=10.21.12.10:6789/0,deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal=10.21.12.9:6789/0,deis-e07df3c0-3d13-409c-93e8-0e7b2fdf63f0.novalocal=10.21.12.11:6789/0}, election epoch 8, quorum 0,1,2 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal,deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal,deis-e07df3c0-3d13-409c-93e8-0e7b2fdf63f0.novalocal
     mdsmap e5: 1/1/1 up {0=deis-e07df3c0-3d13-409c-93e8-0e7b2fdf63f0.novalocal=up:active}, 2 up:standby
     osdmap e25: 3 osds: 3 up, 3 in
      pgmap v337: 960 pgs, 9 pools, 8216 bytes data, 65 objects
            16602 MB used, 37174 MB / 54326 MB avail
                 960 active+clean

journalctl -u deis-store-monitor

Oct 29 18:57:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:57:14.597274 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 18:58:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:58:14.597508 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 18:58:26 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:58:26.152722 7f55a12df700  0 log [INF] : pgmap v341: 960 pgs: 960 active+clean; 8216 bytes data, 16598 MB used, 3
Oct 29 18:58:28 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:58:28.216490 7f55a12df700  0 log [INF] : pgmap v342: 960 pgs: 960 active+clean; 8216 bytes data, 16592 MB used, 3
Oct 29 18:58:31 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:58:31.316164 7f55a12df700  0 log [INF] : pgmap v343: 960 pgs: 960 active+clean; 8216 bytes data, 16567 MB used, 3
Oct 29 18:59:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 18:59:14.597686 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 19:00:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:00:14.597916 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 19:00:26 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:00:26.155586 7f55a12df700  0 log [INF] : pgmap v344: 960 pgs: 960 active+clean; 8216 bytes data, 16567 MB used, 3
Oct 29 19:00:28 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:00:28.322457 7f55a12df700  0 log [INF] : pgmap v345: 960 pgs: 960 active+clean; 8216 bytes data, 16568 MB used, 3
Oct 29 19:00:31 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:00:31.339104 7f55a12df700  0 log [INF] : pgmap v346: 960 pgs: 960 active+clean; 8216 bytes data, 16564 MB used, 3
Oct 29 19:01:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:01:14.598108 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 19:02:14 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:02:14.598344 7f55a1ae0700  0 mon.deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal@0(leader).data_health(8) upd
Oct 29 19:02:26 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:02:26.153705 7f55a12df700  0 log [INF] : pgmap v347: 960 pgs: 960 active+clean; 8216 bytes data, 16559 MB used, 3
Oct 29 19:02:28 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:02:28.361769 7f55a12df700  0 log [INF] : pgmap v348: 960 pgs: 960 active+clean; 8216 bytes data, 16554 MB used, 3
Oct 29 19:02:31 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[10606]: 2014-10-29 19:02:31.353229 7f55a12df700  0 log [INF] : pgmap v349: 960 pgs: 960 active+clean; 8216 bytes data, 16555 MB used, 3

journalctl -u deis-store-volume

Oct 29 17:10:27 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[11737]: waiting for store-monitor...
Oct 29 17:10:27 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal bash[11743]: cut: write error: Broken pipe
Oct 29 17:11:57 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: deis-store-volume.service start-pre operation timed out. Terminating.
Oct 29 17:11:57 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: Failed to start deis-store-volume.
Oct 29 17:11:57 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: Unit deis-store-volume.service entered failed state.
Oct 29 17:12:02 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: deis-store-volume.service holdoff time over, scheduling restart.
Oct 29 17:12:02 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: Stopping deis-store-volume...
Oct 29 17:12:02 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: Starting deis-store-volume...
Oct 29 17:12:02 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal sh[11794]: waiting for store-monitor...
Oct 29 17:12:02 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal bash[11800]: cut: write error: Broken pipe
Oct 29 17:12:27 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal systemd[1]: Started deis-store-volume.

If you want me to try anything else, please ask.

@carmstrong
Contributor

@adrianmoya I suspect your issue is also solved by #2307. If you're using OS X locally, can you try using this deisctl instead of v0.15.0: https://s3-us-west-2.amazonaws.com/opdemand/deisctl-pr2307-darwin

@adrianmoya
Contributor

@carmstrong I'm on Linux

@adrianmoya
Contributor

I've just looked at #2307 and wanted to add that in my case, this is a fresh install, not an upgrade. It hung on my first run of deisctl start platform.

@carmstrong
Contributor

and wanted to add that in my case, this is a fresh install, and not an upgrade.

Right, I saw that too and tested in that PR. Commit 59f2bb5 should help on start.

@carmstrong
Contributor

@adrianmoya If you have a Go environment set up locally, you should be able to build deisctl with make -C deisctl install build. I'm going to fire up a Linux VM and see if I can't build it for you.

@adrianmoya
Contributor

@carmstrong OK, I built deisctl from master, stopped/uninstalled what was done, and installed/started with this new deisctl, but the result seems to be the same:

● ▴ ■
■ ● ▴ Starting Deis...
▴ ■ ●

Storage subsystem...
deis-store-monitor.service: activating/start-pre                                 
deis-store-monitor.service: active/running                                 
deis-store-daemon.service: activating/start-pre                                 
deis-store-daemon.service: active/running                                 
deis-store-metadata.service: active/running                                 
deis-store-gateway.service: active/running                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: deactivating/final-sigterm                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart                                 
deis-store-volume.service: activating/start-pre                                 
deis-store-volume.service: activating/auto-restart   

@carmstrong
Contributor

@adrianmoya Can you post a gist of etcdctl ls --recursive /?

@adrianmoya
Contributor

@carmstrong
Contributor

@adrianmoya And ceph -s still looks healthy? Honestly, I'm at a loss here. I'm unable to reproduce this locally :(

@adrianmoya
Contributor

There's a warning:

    cluster c97e62d4-4920-4c3d-baa5-2543a9269026
     health HEALTH_WARN 649 pgs peering; 649 pgs stuck inactive; 649 pgs stuck unclean; 3 requests are blocked > 32 sec; mds cluster is degraded
     monmap e3: 3 mons at {deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal=10.21.12.10:6789/0,deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal=10.21.12.9:6789/0,deis-e07df3c0-3d13-409c-93e8-0e7b2fdf63f0.novalocal=10.21.12.11:6789/0}, election epoch 12, quorum 0,1,2 deis-25a78c22-0bb1-43a9-bf27-8d0265f3ed51.novalocal,deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal,deis-e07df3c0-3d13-409c-93e8-0e7b2fdf63f0.novalocal
     mdsmap e10: 1/1/1 up {0=deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal=up:replay}, 2 up:standby
     osdmap e36: 3 osds: 3 up, 3 in
      pgmap v706: 960 pgs, 9 pools, 8430 bytes data, 65 objects
            16553 MB used, 37217 MB / 54326 MB avail
                 649 peering
                 311 active+clean

@carmstrong
Contributor

Ok. mds cluster is degraded. You can also see that your main mds is in a status of 'replaying': mdsmap e10: 1/1/1 up {0=deis-2036302f-67e2-4c66-a609-9d0fb9b97bcf.novalocal=up:replay}, 2 up:standby.

I suspect that the cluster will recover on its own once the placement groups have migrated - and if it doesn't, you should be able to deisctl restart store-daemon. Did you stop the old cluster with deisctl stop platform before you uninstalled it?
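
To watch that recovery rather than poll for it, something like this inside the monitor container works (standard Ceph CLI):

# Stream cluster events as placement groups finish peering.
ceph -w
# Or check just the MDS; it should move from up:replay to up:active.
ceph mds stat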

@adrianmoya
Contributor

Yes, I stopped, then uninstalled, and then installed/started again. I've just noticed that the installation continued and it's currently installing the Control plane...

@carmstrong
Contributor

I've just noticed that the installation continued and it's currently installing the Control plane...

Good, that's what I was hoping. It looks like your cluster just took a little longer to recover.

@adrianmoya
Contributor

After starting from scratch, the only component that didn't start was the builder:

deisctl list
UNIT                MACHINE         LOAD    ACTIVE      SUB
deis-builder.service        a9444298.../10.21.12.13 loaded  activating  start-post
deis-cache.service      be5aa61b.../10.21.12.12 loaded  active      running
deis-controller.service     e71e4d58.../10.21.12.14 loaded  active      running
deis-database.service       a9444298.../10.21.12.13 loaded  active      running
deis-logger.service     be5aa61b.../10.21.12.12 loaded  active      running
deis-logspout.service       a9444298.../10.21.12.13 loaded  active      running
deis-logspout.service       be5aa61b.../10.21.12.12 loaded  active      running
deis-logspout.service       e71e4d58.../10.21.12.14 loaded  active      running
deis-publisher.service      a9444298.../10.21.12.13 loaded  active      running
deis-publisher.service      be5aa61b.../10.21.12.12 loaded  active      running
deis-publisher.service      e71e4d58.../10.21.12.14 loaded  active      running
deis-registry.service       e71e4d58.../10.21.12.14 loaded  active      running
deis-router@1.service       e71e4d58.../10.21.12.14 loaded  active      running
deis-router@2.service       a9444298.../10.21.12.13 loaded  active      running
deis-router@3.service       be5aa61b.../10.21.12.12 loaded  active      running
deis-store-daemon.service   a9444298.../10.21.12.13 loaded  active      running
deis-store-daemon.service   be5aa61b.../10.21.12.12 loaded  active      running
deis-store-daemon.service   e71e4d58.../10.21.12.14 loaded  active      running
deis-store-gateway.service  a9444298.../10.21.12.13 loaded  active      running
deis-store-metadata.service a9444298.../10.21.12.13 loaded  active      running
deis-store-metadata.service be5aa61b.../10.21.12.12 loaded  active      running
deis-store-metadata.service e71e4d58.../10.21.12.14 loaded  active      running
deis-store-monitor.service  a9444298.../10.21.12.13 loaded  active      running
deis-store-monitor.service  be5aa61b.../10.21.12.12 loaded  active      running
deis-store-monitor.service  e71e4d58.../10.21.12.14 loaded  active      running
deis-store-volume.service   a9444298.../10.21.12.13 loaded  active      running
deis-store-volume.service   be5aa61b.../10.21.12.12 loaded  active      running
deis-store-volume.service   e71e4d58.../10.21.12.14 loaded  active      running

Journal for builder: https://gist.github.com/adrianmoya/d71f43e66dd76159cdb9

@carmstrong
Contributor

@adrianmoya That's a known issue, at least - I think #2214. Try deisctl restart builder and see if it comes back up the second try.

@adrianmoya
Contributor

@carmstrong No luck restarting the builder. Is there anything else I can check?

Oct 30 13:55:07 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[6952]: 2014-10-30T13:55:07Z c8f6bf49cc9e confd[15655]: ERROR Config check failed: exit status 1
Oct 30 13:55:07 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[6952]: 2014-10-30T13:55:07Z c8f6bf49cc9e confd[15655]: ERROR Config check failed: exit status 1
Oct 30 13:55:07 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[6952]: builder: waiting for confd to write initial templates...
Oct 30 13:55:12 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal systemd[1]: deis-builder.service stop-sigterm timed out. Killing.
Oct 30 13:55:12 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal systemd[1]: deis-builder.service: control process exited, code=killed status=9
Oct 30 13:55:13 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal systemd[1]: Stopped deis-builder.
Oct 30 13:55:13 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal systemd[1]: Unit deis-builder.service entered failed state.
Oct 30 13:55:13 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal docker[31321]: deis-builder
Oct 30 15:09:44 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal systemd[1]: Starting deis-builder...
Oct 30 15:09:44 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[9722]: Error: No such image or container: deis-builder
Oct 30 15:09:44 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[9734]: Waiting for builder on 2223/tcp...
Oct 30 15:10:16 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[9733]: 2014-10-30T15:10:16Z 6ebc9c43e463 confd[77]: ERROR Config check failed: exit status 1
Oct 30 15:10:16 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[9733]: 2014-10-30T15:10:16Z 6ebc9c43e463 confd[77]: ERROR Config check failed: exit status 1
Oct 30 15:10:16 deis-a9444298-2a4d-4c58-ab53-855782330803.novalocal sh[9733]: builder: waiting for confd to write initial templates...

@geovanisouza92

Hello guys,

I'm getting started with Deis on my notebook:

  • Ubuntu Trusty 14.04;
  • 4GB RAM;
  • i5 1st gen;
  • Vagrant 1.6.5;
  • Virtualbox 4.3.10.

And I'm stuck on the same issue: the store-volume loop. I tried restarting the VMs, restarting store-daemon, and so on. Same logs. Now I'll start again from the master branch.

@carmstrong
Contributor

4GB RAM;

You're likely going to run into issues deploying a 3-node Deis cluster with so little RAM. However, that shouldn't be the cause of the store-volume issues. Can you check the journal for the service on all of your VMs? I'm thinking one of them is failing.

@EvanK

EvanK commented Nov 1, 2014

@carmstrong Are there minimum viable specs published anywhere? I couldn't find anything in the documentation beyond a reference to "large" EC2 instances...

@adrianmoya
Contributor

I had to give up on 0.15.0; I'll wait for 0.16.0 to try the installation again and see if the "component X got stuck during installation" errors get better. :(

@carmstrong
Contributor

Are there minimum viable specs published anywhere?

We only allude to the memory requirements in the main README. Having hard requirements is difficult because Deis is a distributed system, but ideally you want all components to be able to run on a single host if you're running in a degraded state. So I'd say 8GB of RAM per node to give you enough breathing room for that.

I'm working on putting together "Troubleshooting Deis" documentation which will detail commonly-encountered issues.

@carmstrong
Contributor

I had to give up on 0.15.0; I'll wait for 0.16.0 to try the installation again and see if the "component X got stuck during installation" errors get better. :(

I'm really sorry you're running into issues :( We may cut a v0.15.1 with some fixes for both the confd loop and the start order for the deis-store components so that they're more reliable.

@geovanisouza92

@carmstrong You're right. First I tried to run 3 instances, and the result was heavy memory paging. Then I changed $vm_memory to 1024 and deployed 2 instances.

In both cases the result was the same: the store-volume start-pre/auto-restart infinite loop, with some OSD nodes down (3 nodes, 1 up; 2 nodes, 1 up). The logs are identical to those reported in this issue and in related ones I found.

I tried restarting the nodes, the daemon, the gateway, and so on. None of it worked.

Anyway, this project is very, very promising. Please don't give up! :D

I'll be here testing and reporting what I can, because I don't know Go (I'm a Python/Android programmer), but I really want to test it on my Django/Android project.

@bacongobbler
Member

because I don't know Go (I'm a Python/Android programmer), but I really want to test it on my Django/Android project.

You'd be happy to hear that our controller is written in Python, using Django as the web framework of choice :)
