This repository has been archived by the owner on Sep 4, 2021. It is now read-only.

wait postgres-wait error: discoverd: timed out waiting for instances #4301

Closed
artemave opened this issue Dec 8, 2017 · 9 comments


@artemave

artemave commented Dec 8, 2017

Hello,

I am trying to create a Flynn cluster on a fresh AWS setup and I can't get past bootstrapping. Here is the error I am getting (I tried a couple of times):

ubuntu@ip-10-1-1-17:~$ sudo CLUSTER_DOMAIN=prs.dev.travcorpservices.com flynn-host bootstrap --min-hosts 3 --discovery https://discovery.flynn.io/clusters/baf27702-8341-4255-af2a-7f04b4b905d3
16:53:45.060950 check online-hosts
16:53:45.598492 require-env require-env
16:53:45.598508 resource-check resource-check
16:53:45.600570 run-app discoverd
16:53:54.069550 run-app flannel
16:53:55.469882 wait-hosts wait-hosts
16:53:59.489003 gen-random pg-password
16:53:59.489044 gen-random pg-password 53e53b62d679f540d83b9e2e9ee43f45
16:53:59.489049 run-app postgres
16:54:04.639654 gen-random controller-key
16:54:04.639722 gen-random controller-key 0be1c11b9daf7be7895f882505295491
16:54:04.639726 gen-random bootstrap-id
16:54:04.639741 gen-random bootstrap-id f08ccd2c-a589-40d5-a0c7-eb7d43165395
16:54:04.639743 gen-random dashboard-session-secret
16:54:04.639757 gen-random dashboard-session-secret 1508f7666f367673d3770d49edc5bd28
16:54:04.639759 gen-random dashboard-login-token
16:54:04.639772 gen-random dashboard-login-token 8e38038b205291479116f58e57d5c4ad
16:54:04.639774 gen-random name-seed
16:54:04.639785 gen-random name-seed 17e148bd0e27e3bc05ac
16:54:04.639787 gen-random router-sticky-key
16:54:04.639812 gen-random router-sticky-key Egw9ws8S9JWPJZ9VH33CzbVNUplLMVv7bsbg8jOiMpw=
16:54:04.639814 wait postgres-wait

16:59:04.640803 wait postgres-wait error: discoverd: timed out waiting for instances

Logs: https://gist.github.com/anonymous/91b928292cd38d0b0c509b2f05c73ff2

Also, how do I redo the bootstrap?

I tried sudo flynn-host stop ID (I got the IDs from sudo flynn-host list), but that gave me:

invalid ID
17:43:55.642135 host.go:166: could not stop all jobs

Then I stopped the service on all machines (sudo service flynn-host stop) and ran sudo flynn-host destroy-volumes, which seemed to work. But when I tried to bootstrap after that, I got:

ubuntu@ip-10-1-1-17:~$ sudo CLUSTER_DOMAIN=prs.dev.travcorpservices.com flynn-host bootstrap --min-hosts 3 --discovery https://discovery.flynn.io/clusters/baf27702-8341-4255-af2a-7f04b4b905d3
17:48:56.705302 check online-hosts
17:48:57.136492 require-env require-env
17:48:57.136509 resource-check resource-check
17:48:57.138881 resource-check resource-check error: conflicts detected!

The following hosts have conflicting services listening on ports Flynn is configured to use:
10.1.1.17: tcp:5002 tcp:1111
10.1.1.27: tcp:5002 tcp:1111
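
For readers hitting the same conflict report, here is a minimal sketch of how one might find out what is still holding those ports. It assumes the ss utility from iproute2 is installed on the host; the function names conflict_ports and show_listeners are hypothetical helpers, not part of Flynn. The listeners may simply be processes left over from the earlier bootstrap attempt, but that is a guess the output would have to confirm.

```shell
# Hypothetical helper: extract the port numbers from a line of the
# resource-check conflict report, e.g. "10.1.1.17: tcp:5002 tcp:1111".
# Prints one port per line.
conflict_ports() {
  printf '%s\n' "$1" | grep -oE 'tcp:[0-9]+' | cut -d: -f2
}

# Hypothetical helper: show which process listens on each conflicting port.
# Run this on the affected host.
# ss flags: -l listening sockets, -t TCP, -n numeric ports, -p owning process.
show_listeners() {
  local port
  for port in $(conflict_ports "$1"); do
    sudo ss -ltnp "sport = :$port"
  done
}

# Example (not executed here):
#   show_listeners '10.1.1.17: tcp:5002 tcp:1111'
```

If the listeners turn out to be leftovers, the clean reinstall suggested in the reply below is probably the simpler route.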

Thank you!

@lmars
Contributor

lmars commented Dec 9, 2017

If you want to restart an installation from a clean slate, re-run the install script but set the --clean flag:

$ curl -fsSLo install-flynn https://dl.flynn.io/install-flynn
$ sudo bash install-flynn --clean

Feel free to inspect the install script before running it.

As for the bootstrapping issues, are you sure that all traffic is being permitted between the hosts in the cluster?

@artemave
Author

artemave commented Dec 9, 2017

"are you sure that all traffic is being permitted between the hosts in the cluster?"

To the best of my knowledge it is (sg-ffff8284 is the security group that all cluster nodes share):

image

@lmars
Contributor

lmars commented Dec 9, 2017

@artemave there are errors in the PostgreSQL logs like:

dial tcp 100.100.86.2:5433: i/o timeout
dial tcp 100.100.57.2:5433: i/o timeout

which suggest there is an issue with the network. Are you sure all nodes are in the same security group?
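
A minimal sketch of how the timeouts above could be reproduced by hand from one of the hosts, assuming bash and the coreutils timeout command are available there. The helper names check_tcp and report_tcp are hypothetical; the addresses and port in the example call are the ones from the log lines quoted above.

```shell
# Hypothetical helper: try to open a TCP connection using bash's /dev/tcp
# redirection. Returns 0 if the connection is established within the
# timeout (default 3 seconds), non-zero if it is refused or times out.
check_tcp() {
  local host="$1" port="$2" secs="${3:-3}"
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Hypothetical helper: probe one port on several addresses and print
# a one-line verdict per address.
report_tcp() {
  local port="$1" addr
  shift
  for addr in "$@"; do
    if check_tcp "$addr" "$port"; then
      echo "$addr:$port reachable"
    else
      echo "$addr:$port unreachable"
    fi
  done
}

# Example (not executed here):
#   report_tcp 5433 100.100.86.2 100.100.57.2
```

An "unreachable" result for an address that lives on another host, while local addresses answer, would point at traffic between the hosts being dropped rather than at PostgreSQL itself.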

@artemave
Author

artemave commented Dec 9, 2017

It is a network issue. And all nodes are in the same security group.

I have another cluster on another aws account (created differently) where I installed flynn without a problem. One difference between the two that I noticed is that I can ping 100.100.*.* addresses (an overlay network created by flynn?) from within the working cluster, but in the broken one those pings fail.

Which bit of aws setup could prevent this?

@titanous
Contributor

titanous commented Dec 9, 2017

Do you have a ufw/iptables firewall? The security group config you posted should be fine, as long as all of the hosts are in the group.

@artemave
Author

artemave commented Dec 9, 2017

All nodes are in the same security groups:

image

@artemave
Author

artemave commented Dec 9, 2017

I didn't specifically set any firewall rules. Here is the iptables output I found in flynn logs (link above):

image

@artemave
Author

"If you run into a problem while following these instructions, ensure that network traffic is flowing unimpeded through the flannel.1, flynnbr0, and veth*"

How can I test that? (I'm a bit out of my depth here.)

@artemave
Author

Managed to fix it. The outbound security group rules weren't permissive enough.

Once I changed this
image

into this
image

the bootstrap went through.

Thank you for your help. Flynn is awesome!
