store-volume fails after update to v0.15.0 #2300
Comments
$ journalctl -u deis-store-volume
-- Logs begin at Tue 2014-10-28 19:13:28 UTC, end at Tue 2014-10-28 20:36:19 UTC. --
Oct 28 19:50:41 deis-1 systemd[1]: Starting deis-store-volume...
Oct 28 19:50:41 deis-1 sh[12964]: waiting for store-monitor...
Oct 28 19:50:41 deis-1 bash[12972]: cut: write error: Broken pipe
Oct 28 19:52:11 deis-1 systemd[1]: deis-store-volume.service start-pre operation timed out. Terminating.
Oct 28 19:52:11 deis-1 systemd[1]: Failed to start deis-store-volume.
Oct 28 19:52:11 deis-1 systemd[1]: Unit deis-store-volume.service entered failed state.
Oct 28 19:52:16 deis-1 systemd[1]: deis-store-volume.service holdoff time over, scheduling restart.
Oct 28 19:52:16 deis-1 systemd[1]: Stopping deis-store-volume...
Oct 28 19:52:16 deis-1 systemd[1]: Starting deis-store-volume...
Oct 28 19:52:16 deis-1 sh[13163]: waiting for store-monitor...
Oct 28 19:52:16 deis-1 bash[13169]: cut: write error: Broken pipe
Oct 28 19:53:46 deis-1 systemd[1]: deis-store-volume.service start-pre operation timed out. Terminating.
Oct 28 19:53:46 deis-1 systemd[1]: Failed to start deis-store-volume.
...
$ etcdctl get /deis/store/monSetupComplete
youBetcha
Based on the ExecStartPre in the unit file, that means either /deis/store/hosts or /deis/store/adminKeyring isn't populated in etcd.
Both those look to be correctly populated in etcd. I ran the StartPre command by hand and it just hangs: # HOSTS=`etcdctl ls /deis/store/hosts | cut -d/ -f5 | awk '{if(NR == 1) {printf $0} else {printf ","$0}}'` && mount -t ceph $HOSTS:/ /var/lib/deis/store -o name=admin,secret=`etcdctl get /deis/store/adminKeyring | grep 'key =' | cut -d' ' -f3`
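For anyone else debugging this, the two etcd keys that mount command reads can be checked directly; the IPs below match the Vagrant cluster in this thread, and the output shown is illustrative only:
$ etcdctl ls /deis/store/hosts
/deis/store/hosts/172.17.8.100
/deis/store/hosts/172.17.8.101
/deis/store/hosts/172.17.8.102
$ etcdctl get /deis/store/adminKeyring | grep 'key ='
key = <admin keyring secret>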
Sounds like some of your monitors are down. What does ceph -s from within one of the store-monitor containers show?
Actually I'm wrong, it eventually returned: # HOSTS=`etcdctl ls /deis/store/hosts | cut -d/ -f5 | awk '{if(NR == 1) {printf $0} else {printf ","$0}}'` && mount -t ceph $HOSTS:/ /var/lib/deis/store -o name=admin,secret=`etcdctl get /deis/store/adminKeyring | grep 'key =' | cut -d' ' -f3`
mount: 172.17.8.100,172.17.8.102,172.17.8.101:/: can't read superblock
$ fleetctl list-machines
MACHINE IP METADATA
11de358a... 172.17.8.101 -
6c5a5b0a... 172.17.8.100 -
ea7309c6... 172.17.8.102 -
$ nse deis-store-monitor
groups: cannot find name for group ID 11
root@deis-1:/# ceph -s
cluster 82a7a99c-797a-44a4-8ffa-e3dd27a5daf4
health HEALTH_WARN 677 pgs degraded; 667 pgs down; 667 pgs peering; 667 pgs stuck inactive; 1344 pgs stuck unclean; 17 requests are blocked > 32 sec; recovery 328/903 objects degraded (36.323%)
monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 8, quorum 0,1,2 deis-1,deis-2,deis-3
mdsmap e4: 1/1/1 up {0=deis-2=up:creating}, 2 up:standby
osdmap e34: 3 osds: 1 up, 1 in
pgmap v252: 1344 pgs, 12 pools, 340 MB data, 301 objects
9175 MB used, 6944 MB / 16402 MB avail
328/903 objects degraded (36.323%)
677 active+degraded
667 down+peering
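Since the osdmap above shows only 1 of 3 OSDs up, a few standard Ceph commands from inside the store-monitor container will show which daemons are down and where (a diagnostic sketch, not output from this cluster):
root@deis-1:/# ceph health detail     # lists exactly which pgs are degraded/stuck and why
root@deis-1:/# ceph osd tree          # shows each OSD, which host it maps to, and whether it's up or down
root@deis-1:/# ceph osd stat          # quick summary of how many OSDs are up/in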
It sounds like the larger issue here is that during an upgrade, we're either not restarting all the daemons properly or not restarting them in the right order.
That seems likely, since this was easy to reproduce on upgrade but I haven't seen it on a clean install.
@mboersma Does restarting the downed deis-store-daemon component recover things?
@carmstrong Yes indeed, it does recover store-volume.
Excellent. I'm digging into this now - it's definitely not an ideal upgrade experience, but that workaround is pretty straightforward.
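For anyone hitting this after an upgrade, the workaround discussed above boils down to something like this (a sketch, assuming the v0.15.x component names):
$ nse deis-store-monitor              # on any CoreOS host in the cluster
root@deis-1:/# ceph -s                # confirm the cluster reports a down OSD or MDS
root@deis-1:/# exit
$ deisctl restart store-daemon        # restart the downed store-daemon; store-volume should then mount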
It is possible I hit a non-deterministic error and that this doesn't actually have to do with upgrading. It's just that all my fresh v0.15.0 provisions have gone smoothly, and this was the first upgrade I tried since the release; my experience matched that of at least one IRC user, so I assumed upgrading was the key.
Yeah, definitely appropriate to raise a red flag. I'm seeing if I can reproduce this on an upgrade.
@mboersma The bad news is that I'm seeing it too. The good news is that I'm seeing this:
root@deis-1:/# ceph -s
cluster c50a4ca5-36ba-4796-a7ae-c8fa173350fe
health HEALTH_WARN 444 pgs peering; 444 pgs stuck inactive; 444 pgs stuck unclean; 9 requests are blocked > 32 sec
monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 10, quorum 0,1,2 deis-1,deis-2,deis-3
mdsmap e4: 1/1/1 up {0=deis-3=up:creating}, 2 up:standby
osdmap e33: 3 osds: 3 up, 3 in
pgmap v154: 1344 pgs, 12 pools, 340 MB data, 292 objects
21617 MB used, 26563 MB / 49206 MB avail
444 peering
900 active+clean
@mboersma In your case, you had a downed OSD. In my case, I have a downed MDS:
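If it's the MDS rather than an OSD that's unhealthy, the same approach applies; these are standard Ceph commands run from inside a store-monitor container (illustrative, not from the original thread):
root@deis-1:/# ceph mds stat          # shows whether the active MDS is up:active, up:creating, etc.
root@deis-1:/# ceph health detail     # lists which pgs are peering or stuck and which daemons they involve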
I'm going to make a change to perform an ordered stop/uninstall with deisctl and see if that remedies this.
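Done by hand, the ordering being described looks roughly like this - a sketch of the idea, using the store component names deisctl exposed at the time; the exact ordering that landed in the PR may differ:
$ deisctl restart store-monitor       # monitors first, so there is a quorum
$ deisctl restart store-daemon        # then the OSDs
$ deisctl restart store-gateway       # then the S3 gateway
$ deisctl restart store-volume        # finally re-mount the CephFS volume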
Doing |
FWIW, I'm seeing the same death loop on |
@EvanK :( That means there's still a race condition somewhere. Can you log into one of your machines, run nse deis-store-monitor, and then ceph -s?
@carmstrong I got errors for both: https://gist.github.com/EvanK/9c5bea91d9e47367c076
@EvanK Can you |
@carmstrong That's what it looks like:
This happened right out the gate from a |
@EvanK That's really odd - a state of "inactive/dead" implies that the services were stopped, not failed. Can you log into one of your CoreOS hosts and check the journal for deis-store-monitor to see why it stopped? But without your store-monitors, none of the store components will function properly. In either event, you'll need to get the store-monitors running again before the rest of the store components will recover.
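A rough way to check and recover the monitors in that situation (a sketch, assuming the standard global units and that deisctl can address store-monitor directly):
$ fleetctl list-units | grep deis-store           # are the store units running, inactive, or failed?
$ journalctl -u deis-store-monitor --no-pager     # on that host: why did the monitor stop?
$ deisctl start store-monitor                     # bring the monitors back before the other store components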
@EvanK I also deployed a fresh Deis 3 times with the deisctl built in #2307 and haven't run into an issue with store. Are you in a position where you can easily reprovision the cluster? I'd like to have some confidence that the fixes in that branch are an improvement on the current experience, and I haven't seen any store issues locally. If you wouldn't mind starting over, I could build you a deisctl binary with those fixes to try.
@carmstrong I've stopped and restarted the platform a few times now since the CoreOS install. Here's the output of the latest run. The relevant lines for the latest stop/start are from 17:47 on. I am going to tear down these machines and completely recreate them, following the same bare-metal docs.
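Side note: journalctl can narrow its output to just that window, which makes these gists easier to read (standard systemd options):
$ journalctl -u deis-store-volume --since "17:47" --no-pager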
I've just hit this issue installing fresh 0.15.0 on a private OpenStack. After provisioning the cluster, I installed the units and then started the platform.
I tried some of the advice in this thread:
If you want me to try anything else, please ask.
@adrianmoya I suspect your issue is also solved by #2307. If you're using OS X locally, can you try using this deisctl instead of v0.15.0: https://s3-us-west-2.amazonaws.com/opdemand/deisctl-pr2307-darwin
@carmstrong I'm on linux.
I've just looked at #2307 and wanted to add that in my case this is a fresh install, not an upgrade. It hung on my first run of deisctl start platform.
Right, I saw that too and tested in that PR. Commit 59f2bb5 should help on start.
@adrianmoya If you have a Go environment set up locally, you should be able to build deisctl from the master branch.
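Roughly, building from source looks like this - a sketch only; the actual build steps live in the deisctl directory's own README, so treat the make target here as an assumption:
$ git clone https://github.com/deis/deis.git
$ cd deis/deisctl
$ make build     # assumed target; follow the deisctl README if it differs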
@carmstrong ok, I built deisctl from master, stopped/uninstalled what was done, and installed/started with this new deisctl, but the result seems to be the same:
@adrianmoya Can you post a gist of the ceph -s output from inside a store-monitor container?
@adrianmoya And |
There's a warning:
Ok. I suspect that the cluster will recover on its own once the placement groups have migrated - and if it doesn't, you should be able to stop and uninstall the platform, then install and start it again.
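Spelled out, that recovery path is (standard deisctl commands):
$ deisctl stop platform
$ deisctl uninstall platform
$ deisctl install platform
$ deisctl start platform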
Yes, I stopped, then uninstalled, and then installed/started again. I've just noticed that the installation continued and it's currently installing the control plane...
Good, that's what I was hoping. It looks like your cluster just took a little longer to recover.
After starting from scratch, the only component that didn't start was the builder:
Journal for builder: https://gist.github.com/adrianmoya/d71f43e66dd76159cdb9
@adrianmoya That's a known issue, at least - I think #2214. Try restarting the builder.
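i.e., restart the unit and, if it still fails, grab the journal again from the host where the builder is scheduled:
$ deisctl restart builder
$ journalctl -u deis-builder --no-pager     # run on the CoreOS host running deis-builder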
@carmstrong No luck restarting the builder. Is there anything else I can check?
Hello guys, I'm starting out with Deis on my notebook:
And I'm stuck on the same issue:
You're likely going to run into issues deploying a 3-node Deis cluster with so little RAM. However, that shouldn't be the cause of the store-volume failure.
@carmstrong Are there minimum viable specs published anywhere? I couldn't find anything in the documentation beyond a reference to "large" EC2 instances...
I had to give up on 0.15.0. I'll wait for 0.16.0 to try the installation again and see if the "component X got stuck during installation" errors get better. :(
We only allude to the memory requirements in the main README. Stating hard requirements is difficult because Deis is a distributed system, but ideally you want all components to be able to run on a single host if you're running in a degraded state. So I'd say 8GB of RAM per node to give you enough breathing room for that. I'm working on putting together "Troubleshooting Deis" documentation which will detail commonly encountered issues.
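A quick way to sanity-check memory headroom across the nodes (generic CoreOS/fleet commands, nothing Deis-specific; the machine ID is the one from the fleetctl list-machines output above):
$ fleetctl list-machines
$ fleetctl ssh 11de358a free -m      # repeat per machine ID; watch for heavy swap use and low free memory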
I'm really sorry you're running into issues :( We may cut a v0.15.1 with some fixes for both the confd loop and the start order for the deis-store components so that they're more reliable.
@carmstrong You're right. First, I tried to run 3 instances and the result was a lot of memory paging. Then, I changed the configuration. In both cases, the result was the same. Then I tried restarting the nodes, the daemon, the gateway, and so on. None of it worked. Anyway, this project is very, very promising. Please, don't give up! :D I'll be here testing and reporting what I can, because I don't know Go (I'm a Python/Android programmer), but I really want to test it on my Django/Android project.
You'll be happy to hear that our controller is written in Python, using Django as the web framework of choice :)
Following the upgrade instructions didn't work for me on a 3-node Vagrant cluster.