Gorouter connects to NATS-TLS even though it shouldn't #185
Comments
- this caused non-tls clients to try to connect to nats-tls [185](cloudfoundry/routing-release#185)
Turns out this is a feature of nats. You can optionally turn it off by setting this:
I tested out adding this:
See branch here: cloudfoundry/nats-release@00dcd0d
I suspect this is also related to this issue: cloudfoundry/nats-release#25
Hi @ameowlia, thanks for taking a look at this! Makes a lot of sense to me. Yes, we also suspected the CPU spikes in cloudfoundry/nats-release#25 to be caused by lots of (failed) TLS handshakes.
Come to think of it: there might be ramifications for changing this on the NATS side. If e.g. route-emitter connects to nats.service.cf.internal on the TLS port, it probably wants a list of NATS servers advertised (because it does not use a fixed list like gorouter). So if we turn off the advertisement on NATS-TLS and route-emitter needs to reconnect, what will happen? Since gorouter is the main issue, as it still uses non-TLS as well as a static list of NATS servers from config instead of an advertised one, can't we fix this on the gorouter side instead?
Hi @domdom82, I hear your concern. I will be working on finding the best solution over the next few days and I will make sure that re-connecting will still work with whatever solution we decide on.
* When advertising is on, clients that are incompatible with nats-tls job will try to connect to nats-tls if nats is unavailable
* Clients wanting to connect via TLS get downgraded when connecting to nats if nats-tls is unavailable
[#25](#25) [cloudfoundry/routing-release#185](cloudfoundry/routing-release#185)
* Currently some clients in CF are not configured to properly use nats-tls and rely on being able to downgrade to nats.
* We are keeping this ability to downgrade to nats for backwards compatibility.
* We plan to remove the ability to downgrade to nats once we fix all the clients.
[#25](#25) [cloudfoundry/routing-release#185](cloudfoundry/routing-release#185)
@domdom82, this has been fixed in versions v35 -> v38. It took many releases to get it just right 😅. This release will become available in cf-deployment very soon. All clients are either getting a list of IPs (like gorouter, I think) or are using bosh-dns aliases (like route emitter, I think). In both cases, the clients are able to find out about all the VMs that they care about, and when one goes down they connect to the correct process. @domdom82, feel free to re-open if you still see issues. Thanks for being such a great community member!
Hi @ameowlia, thanks a lot for the fast support and solution on this issue!
Issue
With the introduction of NATS-TLS we noticed a surge in "TLS handshake error: remote error: tls: bad certificate" messages in NATS logs during an outage RCA where all routes got dropped. While inspecting the errors we saw in the logs, I tried to reproduce this and managed to do so by intentionally closing the Gorouter-(non-TLS)NATS connection and observing the reconnect.
Affected Versions
NATS-release: v34
routing-release: 0.206.0
Context
Here is my gorouter.yml describing NATS:
So it should only talk to that NATS server on port 4222 (which is non-TLS).
Next I used a tool called tcpkill to sever the connection from gorouter to NATS:
(10.0.65.11 is my gorouter ip)
I then did a tail -f on the nats-tls job:
However, gorouter logs only show connections to regular NATS on port 4222:
When I do a netstat on the gorouter I can see there are closed connections to NATS-TLS on port 4224.
Afterwards the connection is back to the non-TLS NATS:
Steps to Reproduce
See above
Expected result
Current result
Possible Fix
I suspect this to be a quirk of how the NATS protocol and the Go NATS client work. I think I have read that whenever a client connects to a NATS server, the server passes the entire cluster down to the client, so the client will try any of the servers in the cluster, regardless of its initial configuration. This is not just misleading (the user expects to connect only to the NATS servers she configured) but can also be a source of errors if gorouter keeps retrying and failing on the wrong NATS-TLS servers while the dreaded "route staleness" timers keep ticking. In bad situations, we observed that routes were pruned because gorouter did not manage to get an update for them in time.
If at all possible, gorouter should only ever connect to NATS servers as configured.
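A rough sketch of the suspected mechanism and of the behaviour I would prefer, using only the Go standard library. It relies on the `connect_urls` field that a NATS server can include in its INFO message; everything else (the helper names `parseInfo`, `mergePool`, `configuredOnly` and the addresses) is hypothetical and not how the Go NATS client or gorouter is actually implemented.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// serverInfo holds the single INFO field this sketch cares about.
type serverInfo struct {
	ConnectURLs []string `json:"connect_urls"`
}

// parseInfo extracts the advertised URLs from a raw "INFO {...}" line.
func parseInfo(line string) ([]string, error) {
	payload := strings.TrimPrefix(strings.TrimSpace(line), "INFO ")
	var info serverInfo
	if err := json.Unmarshal([]byte(payload), &info); err != nil {
		return nil, err
	}
	return info.ConnectURLs, nil
}

// mergePool mimics a client that adds every advertised server it has not
// seen yet to its reconnect pool (assumed behaviour, for illustration).
func mergePool(configured, advertised []string) []string {
	seen := make(map[string]bool)
	pool := []string{}
	for _, s := range append(append([]string{}, configured...), advertised...) {
		if !seen[s] {
			seen[s] = true
			pool = append(pool, s)
		}
	}
	return pool
}

// configuredOnly drops every pool entry that was not explicitly configured.
func configuredOnly(pool, configured []string) []string {
	allowed := make(map[string]bool)
	for _, s := range configured {
		allowed[s] = true
	}
	kept := []string{}
	for _, s := range pool {
		if allowed[s] {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// Hypothetical addresses: 4222 is the non-TLS port, 4224 the TLS port.
	configured := []string{"10.0.0.1:4222"}
	line := `INFO {"server_id":"example","connect_urls":["10.0.0.1:4222","10.0.0.2:4224"]}`

	advertised, err := parseInfo(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}

	pool := mergePool(configured, advertised)
	fmt.Println("pool after INFO:", pool)
	fmt.Println("configured only:", configuredOnly(pool, configured))
}
```

With a filter like `configuredOnly`, a non-TLS client would never attempt the TLS port on reconnect, at the cost of not learning about servers added later.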
Additional Context
With a little help on how gorouter uses the go NATS client, I would like to volunteer for a PR, just ping me back.