Gorouter connects to NATS-TLS even though it shouldn't #185

Closed
domdom82 opened this issue Oct 5, 2020 · 6 comments

Comments

@domdom82
Contributor

domdom82 commented Oct 5, 2020

Issue

With the introduction of NATS-TLS, we noticed a surge of "TLS handshake error: remote error: tls: bad certificate" messages in the NATS logs during an outage RCA in which all routes were dropped. While inspecting the errors in the logs, I managed to reproduce them by intentionally closing the Gorouter-to-NATS (non-TLS) connection and observing the reconnect.

Affected Versions

NATS-release: v34
routing-release: 0.206.0

Context

Here is my gorouter.yml describing NATS:

---
(...)

nats:
  - host: 10.0.65.3
    port: 4222
    user: nats
    pass: <redacted>

(...)

So it should only talk to that NATS server on port 4222 (which is non-TLS).

Next I used a tool called tcpkill to sever the connection from gorouter to NATS:

tcpkill host 10.0.65.3 and port 4222
tcpkill: listening on eth0 [host 10.0.65.3 and port 4222]
10.0.65.3:4222 > 10.0.65.11:14796: R 3664461152:3664461152(0) win 0
10.0.65.3:4222 > 10.0.65.11:14796: R 3664461640:3664461640(0) win 0
10.0.65.3:4222 > 10.0.65.11:14796: R 3664462616:3664462616(0) win 0
10.0.65.11:14796 > 10.0.65.3:4222: R 2724051054:2724051054(0) win 0
10.0.65.11:14796 > 10.0.65.3:4222: R 2724051497:2724051497(0) win 0
10.0.65.11:14796 > 10.0.65.3:4222: R 2724052383:2724052383(0) win 0

(10.0.65.11 is my gorouter ip)

I then did a tail -f on the nats-tls job:

/var/vcap/sys/log/nats-tls# tail -f nats-tls.stderr.log
10.0.65.3[7] 2020/10/05 09:49:08.002081 [ERR] 10.0.65.11:15122 - cid:31 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:49:10.004523 [ERR] 10.0.65.11:15126 - cid:32 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:49:12.004538 [ERR] 10.0.65.11:15130 - cid:33 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:49:14.003927 [ERR] 10.0.65.11:15136 - cid:34 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:49:16.005003 [ERR] 10.0.65.11:15140 - cid:35 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:49:18.003562 [ERR] 10.0.65.11:15148 - cid:36 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:49:20.003168 [ERR] 10.0.65.11:15166 - cid:37 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:49:22.004106 [ERR] 10.0.65.11:15170 - cid:38 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:59:08.513947 [ERR] 10.0.65.11:16116 - cid:39 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:59:10.515169 [ERR] 10.0.65.11:16122 - cid:40 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:59:12.517650 [ERR] 10.0.65.11:16126 - cid:41 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:59:14.517437 [ERR] 10.0.65.11:16132 - cid:42 - TLS handshake error: remote error: tls: bad certificate
[7] 2020/10/05 09:59:16.515532 [ERR] 10.0.65.11:16136 - cid:43 - TLS handshake error: remote error: tls: bad certificate

However, gorouter logs only show connections to regular NATS on port 4222:

{"log_level":1,"timestamp":"2020-10-05T09:59:08.483204195Z","message":"nats-connection-disconnected","source":"vcap.gorouter.nats","data":{"nats-host":"10.0.65.3:4222"}}
{"log_level":1,"timestamp":"2020-10-05T09:59:08.514498419Z","message":"nats-connection-reconnected","source":"vcap.gorouter.nats","data":{"nats-host":"10.0.65.3:4222"}}
{"log_level":1,"timestamp":"2020-10-05T09:59:09.027743122Z","message":"nats-connection-disconnected","source":"vcap.gorouter.nats","data":{"nats-host":"10.0.65.3:4222"}}
{"log_level":1,"timestamp":"2020-10-05T09:59:10.516309034Z","message":"nats-connection-reconnected","source":"vcap.gorouter.nats","data":{"nats-host":"10.0.65.3:4222"}}
{"log_level":1,"timestamp":"2020-10-05T09:59:10.659578378Z","message":"nats-connection-disconnected","source":"vcap.gorouter.nats","data":{"nats-host":"10.0.65.3:4222"}}
{"log_level":1,"timestamp":"2020-10-05T09:59:12.518167961Z","message":"nats-connection-reconnected","source":"vcap.gorouter.nats","data":{"nats-host":"10.0.65.3:4222"}}
{"log_level":1,"timestamp":"2020-10-05T09:59:12.835702916Z","message":"nats-connection-disconnected","source":"vcap.gorouter.nats","data":{"nats-host":"10.0.65.3:4222"}}
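To make the flapping in these logs easier to quantify, here is a minimal Go sketch that measures how long each re-established connection survived before the next disconnect. It relies only on the three JSON fields visible in the lines above (timestamp, message, data.nats-host); the type and function names are made up for illustration and are not part of gorouter.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// logEntry mirrors only the fields visible in the gorouter log lines above.
type logEntry struct {
	Timestamp time.Time `json:"timestamp"`
	Message   string    `json:"message"`
	Data      struct {
		NATSHost string `json:"nats-host"`
	} `json:"data"`
}

// connectionLifetimes returns, for every "reconnected" event that is followed
// by a "disconnected" event, the time the connection stayed up in between.
func connectionLifetimes(logs string) ([]time.Duration, error) {
	var lifetimes []time.Duration
	var up time.Time
	connected := false

	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var e logEntry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			return nil, err
		}
		switch e.Message {
		case "nats-connection-reconnected":
			up, connected = e.Timestamp, true
		case "nats-connection-disconnected":
			if connected {
				lifetimes = append(lifetimes, e.Timestamp.Sub(up))
				connected = false
			}
		}
	}
	return lifetimes, sc.Err()
}

func main() {
	// Two of the log lines from this issue, copied verbatim.
	logs := `{"log_level":1,"timestamp":"2020-10-05T09:59:08.514498419Z","message":"nats-connection-reconnected","source":"vcap.gorouter.nats","data":{"nats-host":"10.0.65.3:4222"}}
{"log_level":1,"timestamp":"2020-10-05T09:59:09.027743122Z","message":"nats-connection-disconnected","source":"vcap.gorouter.nats","data":{"nats-host":"10.0.65.3:4222"}}`

	lifetimes, err := connectionLifetimes(logs)
	if err != nil {
		panic(err)
	}
	fmt.Println(lifetimes)
}
```

For the two lines used in main, the connection lasted roughly half a second, which matches the pattern of short-lived reconnects while tcpkill was running.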

When I run netstat on the gorouter VM, I can see connections to NATS-TLS on port 4224 in CLOSE_WAIT state:

netstat -pant | grep 10.0.65.3
tcp        1      0 10.0.65.11:21686        10.0.65.3:4224          CLOSE_WAIT  7716/gorouter
tcp        0      0 10.0.65.11:13534        10.0.65.3:4224          ESTABLISHED 8069/discovery-regi

Afterwards the connection is back to the non-TLS NATS:

netstat -pant | grep 10.0.65.3
tcp        0      0 10.0.65.11:21318        10.0.65.3:4222          ESTABLISHED 7716/gorouter
tcp        0      0 10.0.65.11:13534        10.0.65.3:4224          ESTABLISHED 8069/discovery-regi

Steps to Reproduce

See above

Expected result

  • Gorouter should only open NATS connections as configured.
  • Gorouter should log all connections and connection attempts, including NATS-tls ones.

Current result

  • Gorouter opens connections to NATS servers outside the configuration, and without success (since mTLS is not yet fully working).
  • Gorouter logs only connections to non-TLS NATS but not TLS-NATS.

Possible Fix

I suspect this is a quirk of how the NATS protocol and the Go NATS client work. I think I have read that whenever a client connects to a NATS server, the server passes the list of all servers in the cluster down to the client. The client will then try any of the servers in the cluster, regardless of its initial configuration. This is not just misleading (the user expects to connect only to the NATS servers she configured), it can also be a source of errors if gorouter keeps retrying and failing on the wrong NATS-TLS servers while the dreaded "route staleness" timers keep ticking. In bad situations, we observed that routes were pruned because gorouter did not manage to get an update for them in time.

If at all possible, gorouter should only ever connect to NATS servers as configured.
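To illustrate the idea, here is a minimal Go sketch using only the standard library. It assumes the server list arrives in the connect_urls field of the NATS INFO protocol message and keeps only the entries that also appear in the configured list. The INFO line in main is made up for this example, and parseInfo and restrictToConfigured are hypothetical helpers, not functions of gorouter or of the Go NATS client (which handles INFO messages internally).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// serverInfo holds the one field of the INFO payload this sketch needs.
type serverInfo struct {
	ConnectURLs []string `json:"connect_urls"`
}

// parseInfo extracts the advertised server list from a raw "INFO {...}" line.
func parseInfo(line string) ([]string, error) {
	line = strings.TrimSpace(line)
	if !strings.HasPrefix(line, "INFO ") {
		return nil, fmt.Errorf("not an INFO line: %q", line)
	}
	var info serverInfo
	if err := json.Unmarshal([]byte(strings.TrimPrefix(line, "INFO ")), &info); err != nil {
		return nil, err
	}
	return info.ConnectURLs, nil
}

// restrictToConfigured keeps only the advertised servers that the operator
// also listed in the configuration, preserving the advertised order.
func restrictToConfigured(advertised, configured []string) []string {
	allowed := make(map[string]bool, len(configured))
	for _, c := range configured {
		allowed[c] = true
	}
	var kept []string
	for _, a := range advertised {
		if allowed[a] {
			kept = append(kept, a)
		}
	}
	return kept
}

func main() {
	// Hypothetical INFO line: the non-TLS server advertises both its own
	// client port (4222) and the NATS-TLS client port (4224).
	line := `INFO {"server_id":"example","connect_urls":["10.0.65.3:4222","10.0.65.3:4224"]}`

	advertised, err := parseInfo(line)
	if err != nil {
		panic(err)
	}
	// Only 10.0.65.3:4222 is in the configuration from this issue.
	fmt.Println(restrictToConfigured(advertised, []string{"10.0.65.3:4222"}))
	// prints [10.0.65.3:4222]
}
```

Whether such a filter can be hooked into the client that gorouter uses is an open question here; the sketch only shows the intended behaviour.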

Additional Context

With a little help on how gorouter uses the go NATS client, I would like to volunteer for a PR, just ping me back.

ameowlia added a commit to cloudfoundry/nats-release that referenced this issue Oct 9, 2020
- this caused non-tls clients to try to connect to nats-tls
[185](cloudfoundry/routing-release#185)
@ameowlia
Member

ameowlia commented Oct 9, 2020

Turns out this is a feature of NATS. You can optionally turn it off by setting the no_advertise property. https://docs.nats.io/nats-server/configuration/clustering/cluster_config

I tested adding no_advertise: true to the nats-tls config, and it stopped this behavior (non-TLS clients attempting to connect to nats-tls) from happening. It also stops the CPU on the NATS VM from spiking, which I think was because nats-tls was trying to negotiate so many TLS connections that continually failed.
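For reference, a rough sketch of where the property sits in a NATS server configuration file. This is a fragment, not the actual nats-tls job template; the comment stands in for whatever cluster settings the job already has.

```
cluster {
  # existing cluster settings (listen address, routes, TLS, ...) stay as they are

  # do not advertise this server's client URLs to clients of other cluster members
  no_advertise: true
}
```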

See branch here: cloudfoundry/nats-release@00dcd0d

I suspect this is also related to this issue: cloudfoundry/nats-release#25

@domdom82
Contributor Author

Hi @ameowlia, thanks for taking a look at this! That makes a lot of sense to me. Yes, we also suspected that the CPU spikes in cloudfoundry/nats-release#25 were caused by lots of (failed) TLS handshakes.

@domdom82
Contributor Author

Come to think of it, there might be ramifications to changing this on the NATS side. If, for example, route-emitter connects to nats.service.cf.internal on the TLS port, it probably wants a list of NATS servers advertised (because it does not use a fixed list like gorouter). So if we turn off advertisement on NATS-TLS and route-emitter needs to reconnect, what will happen?

Since gorouter is the main issue, as it still uses non-TLS and a static list of NATS servers from its config rather than the advertised ones, can't we fix this on the gorouter side instead?

@ameowlia
Member

Hi @domdom82, I hear your concern. I will be working on finding the best solution over the next few days and I will make sure that re-connecting will still work with whatever solution we decide on.

ameowlia added a commit to cloudfoundry/nats-release that referenced this issue Oct 12, 2020
* When advertising is on, clients that are incompatible with nats-tls
job will try to connect to nats-tls if nats is unavailable
* Clients wanting to connect via TLS get downgraded when connecting to
nats if nats-tls is unavailable

[#25](#25)
[cloudfoundry/routing-release#185](cloudfoundry/routing-release#185)
ameowlia added a commit to cloudfoundry/nats-release that referenced this issue Oct 13, 2020
* currently some clients in CF are not configured to properly use
nats-tls and rely on being able to downgrade to nats.
* we are keeping this ability to downgrade to nats for backwards
compatibility.
* we plan removing the ability to downgrade to nats once we fix all the
clients

[#25](#25)
[cloudfoundry/routing-release#185](cloudfoundry/routing-release#185)
psycofdj pushed a commit to orange-cloudfoundry/nats-release that referenced this issue Oct 14, 2020
* When advertising is on, clients that are incompatible with nats-tls
job will try to connect to nats-tls if nats is unavailable
* Clients wanting to connect via TLS get downgraded when connecting to
nats if nats-tls is unavailable

[cloudfoundry#25](cloudfoundry#25)
[cloudfoundry/routing-release#185](cloudfoundry/routing-release#185)
@ameowlia
Member

@domdom82, this has been fixed in versions v35 -> v38. It took many releases to get it just right 😅 . This release will become available in cf-deployment very soon.

All clients are either getting a list of IPs (like gorouter, I think) or are using bosh-dns aliases (like route emitter, I think). In both cases, the clients are able to find out about all the VMs that they care about and when one goes down they connect to the correct process.

@domdom82, feel free to re-open if you still see issues. Thanks for being such a great community member!

@plowin
Contributor

plowin commented Oct 16, 2020

Hi @ameowlia, thanks a lot for the fast support and solution on this issue!
I was wondering about the vision for the NATS cluster in the future. AFAIK, gorouter is the only component that is not capable of encrypting traffic to NATS. I cannot judge the effort needed to enable this, but if it could be achieved, a feature toggle could switch off the intermediate duplicated NATS cluster nodes and stop deploying the non-TLS nats jobs. Or are there other dependencies?
