Add CONSUL_RETRY_JOIN_WAN env #48

Conversation

@tjcelaya (Contributor) commented Dec 8, 2017

Resolves #47. Will need to be rebased once #46 is merged.

Wanted some feedback about how best to show the multi-DC example. Options I'm considering:

  • a new multi-datacenter-compose.yml defining two Consul services
  • a new partial test-dc2-compose.yml which has only consul-dc2 and demonstrates passing -f multiple times (sketched below), e.g.

```
docker-compose -f local-compose.yml -f test-dc2-compose.yml up --scale consul=3 --scale consul-dc2=3
```

The naming of the second-datacenter consul service definitely needs some improvement though.
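
A sketch of what that partial override file might look like (everything here is hypothetical, not part of the PR: the consul-dc2 name comes from the option above, and the environment values are placeholders):

```
# test-dc2-compose.yml (sketch)
version: '2.1'

services:
  consul-dc2:
    image: autopilotpattern/consul:${TAG:-latest}
    environment:
      # placeholder: name for the second datacenter
      - CONSUL_DATACENTER_NAME=dc2
      # placeholder: WAN-join against the first cluster's service name
      - CONSUL_RETRY_JOIN_WAN=consul
```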

@tjcelaya changed the title to Add CONSUL_RETRY_JOIN_WAN env Dec 8, 2017
@jwreagor (Contributor) left a comment

During my upgrade PR I didn't break out the compose files into separate examples directories because I wanted to focus on the upgrade itself.

Other blueprints are starting to organize with an examples/ directory and I feel that is prudent now. We could have examples/compose/, examples/triton/, and examples/multidc/. We could also combine multi-DC with the Triton deployment example, if that fits.

There are differences between running locally and running in a staging/production-type environment, just as there is a difference between running single- versus multi-DC.

Thoughts? Is that too much re-organization for what you wanted to add in this PR?

Also, I think the usability of the multi-compose-file behavior could work out well. Maybe we add a separate multi-DC compose file as an option alongside our Triton example?
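
As a sketch, the layout proposed above (directory names taken from the comment, with the split still under discussion) would look something like:

```
examples/
├── compose/    # local docker-compose example
├── triton/     # Triton deployment example
└── multidc/    # multi-DC example, possibly folded into triton/
```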

README.md Outdated
```
==> Error parsing /etc/consul/consul.hcl: ... unexpected token while parsing list: IDENT
```

Contributor:
👍 for pointing out the error message.

@jwreagor (Contributor)

I need to try this out myself, hoping to get to that before we finalize.

@tjcelaya (Contributor, Author) commented Dec 13, 2017

So the setup script works fine; I've successfully gotten clusters in us-sw-1 and us-east-1 talking to each other, but...

While consul members -wan lists all nodes and both datacenters are listed in the web UI, KV operations and attempts to navigate to the remote datacenter from the UI always fail with 500: No path to datacenter or some other RPC error.

@tjcelaya (Contributor, Author) commented Dec 13, 2017

Finally sorted this all out. The issue comes from the way the RPC and Serf servers are decoupled.
If you specify all of the following:

  • bind_addr = private
  • advertise_addr = private
  • advertise_wan_addr = public
  • serf_lan(_bind) = private
  • serf_wan(_bind) = public
  • translate_wan_addrs = true
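
Spelled out as a consul.hcl fragment, that combination looks roughly like the following (a sketch with placeholder addresses; key names follow Consul's config-file options, e.g. advertise_addr_wan rather than the shorthand above):

```
bind_addr           = "10.0.0.5"     # private
advertise_addr      = "10.0.0.5"     # private
advertise_addr_wan  = "203.0.113.5"  # public
serf_lan            = "10.0.0.5"     # private
serf_wan            = "203.0.113.5"  # public
translate_wan_addrs = true
```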

You'll get the following very helpful error when doing anything across datacenters:

```
rpc error getting client: failed to get conn: dial tcp $PRIVATE:0->$PUBLIC:8300: i/o timeout
```

Logging onto the remote node and running netstat -ltun reveals that the RPC server is listening on the bind_addr, so it will reject connection attempts that reach it via its public address (actual IPs below have been replaced with the names of the config fields they came from):

```
bash-4.3# netstat -ltun
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 bind_addr:8300          0.0.0.0:*               LISTEN
tcp        0      0 serf_wan:8302           0.0.0.0:*               LISTEN
tcp        0      0 serf_lan:8301           0.0.0.0:*               LISTEN
tcp        0      0 :::53                   :::*                    LISTEN
tcp        0      0 :::8500                 :::*                    LISTEN
udp        0      0 serf_wan:8302           0.0.0.0:*
udp        0      0 serf_lan:8301           0.0.0.0:*
udp        0      0 :::53                   :::*
```

Not sure how we could improve this other than making users aware of this pitfall and recommending they either use tunnelling or add firewall rules that only allow communication from the other cluster (sketched below). The addition of a cross-datacenter-but-not-internet-public network would let us avoid setting bind_addr = 0.0.0.0, but I don't think that is available in Triton Cloud yet.
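
A minimal sketch of the firewall-rule idea (hypothetical: 198.51.100.0/24 stands in for the other cluster's public range, and a real setup would scope these rules to the WAN-facing interface):

```
# Accept Consul server RPC (8300) and Serf WAN (8302) only from the
# remote datacenter's addresses; drop everything else from the internet.
iptables -A INPUT -p tcp --dport 8300 -s 198.51.100.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 8302 -s 198.51.100.0/24 -j ACCEPT
iptables -A INPUT -p udp --dport 8302 -s 198.51.100.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 8300 -j DROP
iptables -A INPUT -p tcp --dport 8302 -j DROP
iptables -A INPUT -p udp --dport 8302 -j DROP
```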

@tjcelaya (Contributor, Author)

One remaining question (unless I remember the other question):

  • What to do about the other parameters we need to set to get all of this wired up. I've structured preStart to allow users to set those environment variables if desired, but it might make more sense to move them into the if [ -n "$CONSUL_RETRY_JOIN_WAN" ] block so they only appear in consul.hcl if the user has said they want to join other nodes over the WAN (sketched below). CONSUL_BIND_ADDR is particularly problematic given the above caveats: it needs to be set to 0.0.0.0 unless the user has specified it explicitly.
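
A rough sketch of that conditional (illustrative only, not the actual preStart code; it assumes CONSUL_RETRY_JOIN_WAN holds a single address):

```
# Only emit WAN settings when the user opts in, and default bind_addr
# to 0.0.0.0 unless it was set explicitly.
if [ -n "$CONSUL_RETRY_JOIN_WAN" ]; then
    CONSUL_BIND_ADDR="${CONSUL_BIND_ADDR:-0.0.0.0}"
    # retry_join_wan must render as an HCL list of quoted strings,
    # otherwise Consul fails with the IDENT parse error shown above.
    echo "retry_join_wan = [\"$CONSUL_RETRY_JOIN_WAN\"]" >> /etc/consul/consul.hcl
    echo "bind_addr = \"$CONSUL_BIND_ADDR\"" >> /etc/consul/consul.hcl
fi
```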

@tjcelaya (Contributor, Author)

Admittedly, the setup-multi-datacenter.sh script could be simplified into instructions for cloning the single-DC docker-compose.yml + _env, but I'm leaving that up to review.

@jwreagor (Contributor)

This looks ready in terms of CR. Monday I can run all the scripts and fire up a multi-datacenter Consul cluster, then we can get the functional bits approved.

@jwreagor (Contributor) left a comment

Nits and a few other things I noticed when I started working with this branch. `make build` definitely needs to be fixed (not a huge issue though).

README.md Outdated
```
- If not specified, automatic detection of the datacenter will be attempted. See [issue #23](https://github.com/autopilotpattern/consul/issues/23) for more details.
- Consul's default of "dc1" will be used if none of the above apply.

- `CONSUL_BIND_ADDR`: Explicitly set the corresponding Consul configuration. This value will be set to `0.0.0.0` if not specified and `CONSUL_RETRY_JOIN_WAN` is provided. Be aware of
```
Contributor:
Sentence ends at `Be aware of`.

```
@@ -0,0 +1,37 @@
version: '2.1'
```
Contributor:
Personally, I don't see a good reason to run multi-DC on the same local host and would rather not have this file here, unless you see a reason to iterate on multi-DC locally and think that is something many users of this pattern might do.

```
 consul:
-  build: .
+  build: ../../
+  image: autopilotpattern/consul:${TAG:-latest}
```
Contributor:
What is the point of both build and image? This file was more helpful without image, just building the container locally. I'm not sure what that is buying us.

Contributor (Author):
I was using the docker-compose file to build the image as well. I'll remove it.

```
@@ -0,0 +1,23 @@
version: '2.1'
```
Contributor:
I'd rather have this file within examples/multi-dc/docker-compose.yml.tmpl and not under examples/triton, so that users can understand how to run Consul on Triton without getting multi-DC involved. examples/multi-dc can also host all the templates you are rendering.

```
@@ -42,9 +42,11 @@ check() {
 # make sure Docker client is pointed to the same place as the Triton client
```
Contributor:
Thoughts on moving setup.sh into examples/triton? It doesn't seem like a requirement for examples/compose or examples/multi-dc.

Contributor (Author):
The multi-dc example's setup script calls setup.sh once per datacenter, so there is a dependency between the two. Nothing prevents us from referencing it as ../triton/setup.sh, but it would lead to some confusion.

If the multi-dc example were named triton-multi-dc, or lived in examples/triton/multi-dc, it might make the script dependency more obvious. Thoughts?

Contributor:
Since this isn't a code library, you can copy the check() function into your own shell script. Your script supersedes that setup anyway, so owning it might help extend things later, like automating setup across multiple profiles instead of leaving that up to the end user (removing those extra export steps for each DC). The latter isn't something we need to be concerned with at the moment, though.

Contributor (Author):
`check` was copied into ./setup-multi-dc.sh and renamed to `generate_env`.

```
@@ -0,0 +1,103 @@
#!/bin/bash
```
Contributor:
Move this into examples/multi-dc/setup.sh so we keep all example assets together.

makefile Outdated

```
# ------------------------------------------------
# Multi-datacenter usage
clean/triton:
```
Contributor:
This line isn't an issue, but `make build` is broken due to a file path issue in test/Dockerfile. We'll also want to make sure `make test` still works; local tests passed for me once I patched it.

@tjcelaya (Contributor, Author)

Just realized we don't actually need to use `triton profile set` at all; we could instead eval `TRITON_PROFILE=$profile triton env -d` within the script so the user doesn't have to worry about their own environment variables. I'll be fixing this as well.
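
The command substitution is the important part here; a sketch of the idea ($profile being the script's loop variable):

```
# Evaluate the profile's Docker environment for this shell only, without
# changing the user's persistent `triton profile set` selection.
eval "$(TRITON_PROFILE=$profile triton env -d)"
```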

@tjcelaya (Contributor, Author)

Discovered that the command for getting the current account ID wasn't picking up the Triton profile (since it doesn't accept one as a parameter), but it can be fixed by passing the profile as an environment variable on the triton account get invocation (sketched below). That was preventing every Consul instance I created from joining any other, since the UUID from my "current" profile's account was being used instead. It should now be possible for the script to generate correct _env files again.
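
A sketch of that fix (assuming the json CLI is available; $profile is a hypothetical variable holding the target profile name):

```
# `triton account get` has no profile flag, so scope the profile to this
# one invocation via the environment.
account_id=$(TRITON_PROFILE=$profile triton account get -j | json id)
```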

@tjcelaya mentioned this pull request Dec 19, 2017
@tjcelaya (Contributor, Author)

This PR should be ready to merge now that all the documentation typos have been fixed and #50 is nearing completion.

@jwreagor (Contributor) commented Jan 4, 2018

@tjcelaya Feel free to squash and merge. I'm ok with supporting this.

@tjcelaya merged commit e4f9031 into autopilotpattern:master Jan 5, 2018
@tjcelaya deleted the enhancement/47-retry-join-wan-in-config-file branch January 5, 2018 23:10