
Nats outage due to "TLS handshake error: remote error: tls: bad certificate" errors #25

Closed
stefanlay opened this issue Sep 9, 2020 · 24 comments

Comments

@stefanlay
Member

stefanlay commented Sep 9, 2020

We have been using cf-deployment v13.8.0 since Aug 13 and v13.12.0 for the last two weeks.

On two different installations of cf we had severe issues with NATS.

The most recent one, on Sep 9, 2020, led to a complete loss of all gorouter routes.

We use two NATS VMs in our deployment.

At the same time, the CPU on both NATS VMs went up to nearly 100%. The gorouter pruned all routes. The nats logs show many "TLS handshake error: remote error: tls: bad certificate" errors for different target servers, see
nats_bad_certificate_eu10.txt

The situation did not change until we restarted one VM. Then the routes recovered. When we restarted the other VM, the routes went down again. Then we restarted the first one again, and the metrics looked OK, see
Screen Shot 2020-09-09 at 17 44 27

But we still see that one VM is not fully functional. On the AWS dashboard we see that network traffic mainly goes via one VM:
Screen Shot 2020-09-09 at 17 51 53

The other issue happened on Sep 4, 2020.

Here only the CPU of one NATS VM went up, and only to 30%. About 35% of the routes were pruned by the gorouter. NATS recovered after about 3 minutes. But it seems that the instance where the CPU spiked does not have much network traffic, as we can see on the AWS console and on the VM itself using netstat and tcpdump.

The errors on the NATS VMs were different at the beginning compared to the issue above. They started with "TLS handshake error" from bosh-dns and "Slow Consumer" from nats, both only on 10.0.65.3.

About 4 seconds later (Sep 4, 2020 @ 07:07:58.250), errors like the following started on both VMs:

10.0.129.3 nats-tls [6] 2020/09/04 07:07:57.937454 [ERR] 10.0.73.85:40008 - cid:581 - TLS handshake error: remote error: tls: bad certificate

See
nats.txt.

We would like to understand:

  • What is the root cause of the issue and how can we prevent it?
  • How can we recover from the current state in which only one of two NATS VMs is really functional?
@mcwumbly

mcwumbly commented Sep 9, 2020

Hey @stefanlay, @jrussett and I are reviewing this issue together today. Here are some of our initial thoughts and questions:

We are using cf-deployment v13.8.0 since Aug 13 and v13.12.0 since two weeks.

On two different installations of cf we had severe issues with NATS.

Both versions of cf-deployment use the same version of nats-release (v34). And both enable nats-tls by default.

❓ Which version of cf-deployment and nats-release were you using before your recent upgrades? Was nats-tls enabled before your upgrade 2 weeks ago?

We use two NATS VMs in our deployment

❓ Is that one per AZ? Do you have two availability zones?

The situation did not change until we restarted one VM. Then the routes recovered. When we restarted the other VM, the routes went down again. Then we restarted the first one again, and the metrics looked OK, see
But still we see that one VM is not fully functional. On AWS dashboard we see that network traffic mainly goes via one VM

🤔 Here's one hypothesis that may explain why one VM is getting all the traffic: When a NATS VM (let's call it nats/0) is restarted, any clients (diego cell route emitters, gorouters) that were connected to it may re-connect to the other nats-server that is still running (let's call it nats/1). When nats/0 comes back up, nothing causes the clients to disconnect from nats/1, so they all continue to stay connected to it, and it continues to receive all the traffic. In a typical deployment, the clients roll after the nats servers so load is more evenly distributed as they come up and connect to one or the other more randomly.
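
For what it's worth, here is a minimal sketch of that failover behavior using the nats.go client (illustrative only, not the actual route-emitter or gorouter code), with the two server addresses from this deployment:

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Both NATS servers are listed. On a disconnect the client simply fails over to
	// whichever server in the pool accepts the connection, and nothing ever moves it
	// back once the restarted server comes up again.
	nc, err := nats.Connect(
		"nats://10.0.65.3:4222,nats://10.0.129.3:4222",
		nats.MaxReconnects(-1),
		nats.ReconnectWait(time.Second),
		nats.ReconnectHandler(func(c *nats.Conn) {
			log.Printf("reconnected to %s", c.ConnectedUrl())
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
	select {} // block forever; the client stays on whichever server it reconnected to
}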

How can we recover from the current state that only one of two NATS VMs is really functional

Perhaps rolling the diego cells and gorouters will rebalance the load. Or, you could restart the route-emitter jobs on each cell and/or gorouter jobs on each router.

At the same time both NATS cpu's went up to nearly 100%. The gorouter pruned all routes. The nats logs show many errors "TLS handshake error: remote error: tls: bad certificate" for different target server, see
nats_bad_certificate_eu10.txt

What is the root cause of the issue and how can we prevent it

It's hard to say which is cause and effect here. In the other log file you attached we also saw many i/o timeout errors in addition to the certificate errors.

There was another recent issue where network latency turned out to be the root cause, but that resulted in "Authorization Violation" errors and some increased CPU load. Perhaps something similar is going on here.

@plowin

plowin commented Sep 9, 2020

Hi, thanks for the quick reply!
We will continue investigations tomorrow, CET. Just a quick reply to the questions:

Which version of cf-deployment and nats-release were you using before your recent upgrades? Was nats-tls enabled before your upgrade 2 weeks ago?

We upgraded from cf-deployment v12.45.0 which uses nats v34 as well but does not seem to enable nats-tls.

Is that one per AZ? Do you have two availability zones?

Yes, CF is deployed in 2 AZs, with 1 NATS VM per AZ.

Here is also the config - $nats/73bee12e-66ee-4d95-bc8f-e2f8bef7cbc8:~# cat /var/vcap/jobs/nats/config/nats.conf:

net: "10.0.65.3"
port: 4222
prof_port: 0
monitor_port: 0

debug: false
trace: false
logtime: true

authorization {
  user: "<value-redacted>"
  password: "<value-redacted>"
  timeout: 15
}

cluster {
  host: "10.0.65.3"
  port: 4223

  authorization {
    user: "<value-redacted>"
    password: "<value-redacted>"
    timeout: 15
  }


  tls {
    ca_file: "/var/vcap/jobs/nats/config/internal_tls/ca.pem"
    cert_file: "/var/vcap/jobs/nats/config/internal_tls/certificate.pem"
    key_file: "/var/vcap/jobs/nats/config/internal_tls/private_key.pem"
    cipher_suites: [
      "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
    ]
    curve_preferences: [
      "CurveP384"
    ]
    timeout: 5 # seconds
    verify: true
  }


  routes = [

    nats-route://<value-redacted>:<value-redacted>@73bee12e-66ee-4d95-bc8f-e2f8bef7cbc8.nats.service.cf.internal:4223

    nats-route://<value-redacted>:<value-redacted>@8ce5aa44-6e97-48f4-a3f8-2484c7a46202.nats.service.cf.internal:4223

    nats-route://<value-redacted>:<value-redacted>@73bee12e-66ee-4d95-bc8f-e2f8bef7cbc8.nats.service.cf.internal:4225

    nats-route://<value-redacted>:<value-redacted>@8ce5aa44-6e97-48f4-a3f8-2484c7a46202.nats.service.cf.internal:4225

  ]
}

@stefanlay
Member Author

Hi @mcwumbly, also from me many thanks for your quick reply!

We have looked into the logs in more depth.

We checked how the issue started. On both NATS VMs the errors "TLS handshake error: remote error: tls: bad certificate" started at around the same time and increased in number similarly at 11:44:05:
Screen Shot 2020-09-10 at 14 05 46
Screen Shot 2020-09-10 at 14 06 27

We do not see any other logs on the NATS VMs which may indicate a root cause.

At the same time we see an increase in CPU on both NATS VMs:
Screen Shot 2020-09-10 at 14 10 30

So we think that the increase in CPU is caused by the TLS issues.

On diego we see errors in route-emitter only later:
Screen Shot 2020-09-10 at 14 06 47

We are not very familiar with the message flow in CF NATS. We wonder whether the connections for which the NATS VMs report TLS issues are opened by the NATS VMs themselves. We thought that the route-emitter calls NATS as a message bus, and not vice versa. Why, then, do we only see such errors on the NATS VMs, but not on the Diego cells?

What puzzles us even more is that at the time we restarted the VM in zone 1 (10.0.65.3), the error messages also vanished on the other VM:

Exact times from BOSH log:
Task 17283817 | 12:45:56 | L stopping jobs: nats/73bee12e-66ee-4d95-bc8f-e2f8bef7cbc8 (0) (canary)
Task 17283817 | 12:46:15 | L starting jobs: nats/73bee12e-66ee-4d95-bc8f-e2f8bef7cbc8 (0) (canary)

Logs from stopped VM:
Screen Shot 2020-09-10 at 14 07 09

Logs from the other VM:
Screen Shot 2020-09-10 at 14 07 17

Why are the TLS errors gone on the VM which was not touched by us?

We would be very happy if you could give us some ideas about where to look to analyse this issue further.

@stefanlay
Member Author

stefanlay commented Sep 10, 2020

Here, for completeness, is also the config for nats-tls:

nats/73bee12e-66ee-4d95-bc8f-e2f8bef7cbc8:/var/vcap/jobs/nats-tls/config# cat nats-tls.conf

port: 4224
prof_port: 0
monitor_port: 0

debug: false
trace: false
logtime: true

authorization {
  user: "<value-redacted>"
  password: "<value-redacted>"
  timeout: 15
}

tls {
  ca_file: "/var/vcap/jobs/nats-tls/config/external_tls/ca.pem"
  cert_file: "/var/vcap/jobs/nats-tls/config/external_tls/certificate.pem"
  key_file: "/var/vcap/jobs/nats-tls/config/external_tls/private_key.pem"
  cipher_suites: [
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
  ]
  curve_preferences: [
    "CurveP384"
  ]
  timeout: 5 # seconds
  verify: true
}

cluster {
  host: "10.0.65.3"
  port: 4225

  authorization {
    user: "<value-redacted>"
    password: "<value-redacted>"
    timeout: 15
  }


  tls {
    ca_file: "/var/vcap/jobs/nats-tls/config/internal_tls/ca.pem"
    cert_file: "/var/vcap/jobs/nats-tls/config/internal_tls/certificate.pem"
    key_file: "/var/vcap/jobs/nats-tls/config/internal_tls/private_key.pem"
    cipher_suites: [
      "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
    ]
    curve_preferences: [
      "CurveP384"
    ]
    timeout: 5 # seconds
    verify: true
  }


  routes = [

    nats-route://<value-redacted>:<value-redacted>@73bee12e-66ee-4d95-bc8f-e2f8bef7cbc8.nats.service.cf.internal:4223

    nats-route://<value-redacted>:<value-redacted>@8ce5aa44-6e97-48f4-a3f8-2484c7a46202.nats.service.cf.internal:4223

    nats-route://<value-redacted>:<value-redacted>@73bee12e-66ee-4d95-bc8f-e2f8bef7cbc8.nats.service.cf.internal:4225

    nats-route://<value-redacted>:<value-redacted>@8ce5aa44-6e97-48f4-a3f8-2484c7a46202.nats.service.cf.internal:4225

  ]
}

@mcwumbly

We are not very familiar with the message flow in CF NATS. We wonder if the connections where the NATS VMs report TLS issues are opened by the NATS VMs. We thought that the route-emitter calls NATS as a message bus, and not vice versa. Why do we then only see such errors on the NATS VM, but not on the diego cells?

Here are the main interactions with NATS within CF:

  1. NATS <-> NATS
    NATS instances connect to each other in order to share the messages they each receive, so that subscribers can connect to any instance and receive all messages. These connections use TLS by default now in cf-deployment.
  2. Diego Cell (route emitter) -> NATS
    Route emitters on the Diego Cells connect to nats and publish messages about the endpoints that exist on the cell for a given route. They currently connect without TLS.
  3. Gorouter <- NATS
    Gorouter instances on the router VMs connect to nats to subscribe to messages sent by the route emitters. They use these messages to keep their in-memory routing tables up to date. They currently connect without TLS.
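
To make flows (2) and (3) concrete, here is a minimal sketch of the publish/subscribe pattern on the router.register subject using the nats.go client (illustrative only; the address and payload fields are simplified and not the exact messages the real components send):

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Illustrative address; the real components get the NATS endpoints via BOSH links.
	nc, err := nats.Connect("nats://nats.service.cf.internal:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Gorouter-style: subscribe to route registrations to keep an in-memory routing table fresh.
	nc.Subscribe("router.register", func(m *nats.Msg) {
		log.Printf("route update: %s", m.Data)
	})

	// Route-emitter-style: publish the endpoints that exist on a cell for a given route.
	nc.Publish("router.register", []byte(`{"host":"10.0.1.12","port":61012,"uris":["myapp.example.com"]}`))
	nc.Flush()
}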

Why are the TLS errors gone on the VM which was not touched by us?

Since the TLS connectivity is between the two NATS VMs, and there are only two of them in your current deployment, when you take one down, the other has no TLS connections to establish.

Regarding this error specifically:

07:07:57.937454 [ERR] 10.0.73.85:40008 - cid:581 - TLS handshake error: remote error: tls: bad certificate

It looks like it's either coming from github.com/nats-io/nats-server/server/leafnode.go or github.com/nats-io/nats-server/server/server.go, and then further down, from within the go standard library, either in github.com/golang/go/src/crypto/tls/handshake_server.go or in a handful of places in github.com/golang/go/src/crypto/tls/handshake_client.go

Not sure why the certificate would become "bad". If you see this issue again, I might do the following:

  1. Hop onto one of the NATS VMs and save off the certs listed in the config you posted above for later inspection (one way to capture a diffable fingerprint is sketched below).
  2. monit restart nats-tls
    Does this fix it? If so, it's not the certs themselves that are the issue.
  3. Restart the VM (as you have been)
    Presumably this fixes the issue as it has been...
  4. Grab the certs after the restart
    Diff these against the ones saved off in step (1). Are they the same? If so, it's not the certs themselves that are the issue. (I would think they would be, unless the VMs are being re-created, but perhaps I'm forgetting something about how BOSH works...)
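
For steps (1) and (4), something like this (a small Go sketch, using the cert path from the nats-tls config posted above) captures a diffable fingerprint of each cert:

package main

import (
	"crypto/sha256"
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

func main() {
	// Path taken from the nats-tls config above; repeat for the CA and internal_tls files.
	raw, err := os.ReadFile("/var/vcap/jobs/nats-tls/config/external_tls/certificate.pem")
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(raw)
	if block == nil {
		panic("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}
	fmt.Printf("sha256 fingerprint: %x\n", sha256.Sum256(cert.Raw))
	fmt.Println("subject:", cert.Subject.CommonName, "SANs:", cert.DNSNames, cert.IPAddresses)
	fmt.Println("valid:", cert.NotBefore, "->", cert.NotAfter)
}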

@stefanlay
Member Author

Diego Cell (route emitter) -> NATS
Route emitters on the Diego Cells connect to nats and publish messages about the endpoints that exists on the cell for a given route. They currently connect without TLS.
Gorouter <- NATS
Gorouter instances on the router VMs connect to nats to subscribe to messages sent by the route emitters. They use these messages to keep their in-memory routing tables up to date. They currently connect without TLS.

We could confirm that in our case the router.register messages go over the unencrypted port 4222 opened by the nats job.
However, we also observe that there are many connections open on port 4224 (from the nats-tls job), also to Diego cells and gorouters:

nats/73bee12e-66ee-4d95-bc8f-e2f8bef7cbc8:~# netstat -tupne | grep gnatsd | grep 4222 | wc -l
372
nats/73bee12e-66ee-4d95-bc8f-e2f8bef7cbc8:~# netstat -tupne | grep gnatsd | grep 4224 | wc -l
320

If the traffic goes via port 4222, what are these connections used for? They do not seem to have a lot of traffic. But establishing these connections leads to the errors we see:

07:07:57.937454 [ERR] 10.0.73.85:40008 - cid:581 - TLS handshake error: remote error: tls: bad certificate

We also wonder why in https://github.com/cloudfoundry/cf-deployment/releases/tag/v13.5.0 there is the statement "Enables TLS between route-emitter and NATS."

Not sure why the certificate would become "bad". If you see this issue again, I might do the following:

We now think that the certificate itself is not the issue. We see missing registry updates already before the first TLS handshake error occurs. We think that the error "TLS handshake error: remote error: tls: bad certificate" can also appear when the handshake is cancelled due to, e.g., a timeout. cf-deployment sets timeout: 5 # seconds. In our logs we see that for a given Diego cell the error is logged very regularly, every 6 or 7 seconds, which may fit a retry after 1 second on top of the 5-second timeout:
Screen Shot 2020-09-11 at 18 25 35

We made an observation which leads us to assume that the issue actually started earlier. About two minutes before the routes went down to zero, routes were pruned, but one group of gorouters had a different number of routes than the other group:
One group has 34396 routes, the other has 36047 routes (before all had 40099):
Screen Shot 2020-09-11 at 17 44 59

It looks like the two NATS instances have different numbers, which means they cannot exchange router.register messages anymore.

If that is the case, how could such an issue lead to the complete loss of routes later? We did not observe network issues on other VMs; why should two NATS VMs in two AZs have network issues at the same time?

@mcwumbly

We also wonder why in https://github.com/cloudfoundry/cf-deployment/releases/tag/v13.5.0 there is the statement "Enables TLS between route-emitter and NATS."

Huh, whelp, you're right. Looks like cf-d uses tls between route-emitter and NATS.

In our logs wee see that for a given diego cell the error is logged very regularly every 6 or 7 seconds, which may fit to a retry logic after 1 second:

Now I'm curious how many VMs you see this error for. In the screenshot, it looks like a single Diego Cell and a single NATS VM, but that's probably just because you're filtering things down to understand the timing.

During that time, do you see the error from other Diego Cells as well? Do all the Diego Cells have an issue with a single NATS VM? Or do some of them have the same issue with the other NATS VM?

@stefanlay
Member Author

Huh, whelp, you're right. Looks like cf-d uses tls between route-emitter and NATS.

We are using cf-deployment, but nevertheless we observe that the router.register messages go over the unencrypted port 4222 opened by the nats job. We see it with tcpdump.

These are our job definitions in the manifests:

- name: nats
    properties:
      nats:
        hostname: nats.service.cf.internal
        internal:
          tls:
            ca: "((nats_internal_cert.ca))"
            certificate: "((nats_internal_cert.certificate))"
            enabled: true
            private_key: "((nats_internal_cert.private_key))"
        password: "((nats_password))"
        user: nats
    provides:
      nats:
        as: nats
        shared: true
    release: nats
  - custom_provider_definitions:
    - name: nats-tls-address
      type: address
    name: nats-tls
    properties:
      nats:
        external:
          tls:
            ca: "((nats_client_cert.ca))"
            certificate: "((nats_server_cert.certificate))"
            private_key: "((nats_server_cert.private_key))"
        hostname: nats.service.cf.internal
        internal:
          tls:
            ca: "((nats_internal_cert.ca))"
            certificate: "((nats_internal_cert.certificate))"
            enabled: true
            private_key: "((nats_internal_cert.private_key))"
        password: "((nats_password))"
        user: nats
    provides:
      nats-tls:
        as: nats-tls
        shared: true
    release: nats
  name: nats
  networks:
  - name: cf
    static_ips:
    - 10.0.65.3
    - 10.0.129.3
  stemcell: default
  vm_type: vm_2cpu_4gb
  - consumes:
      nats:
        deployment: cf
        from: nats
      nats-tls:
        deployment: cf
        from: nats-tls
    name: route_emitter
    properties:
      bpm:
        enabled: true
      diego:
        route_emitter:
          bbs:
            ca_cert: ((!/bosh-hcp-live-eu10/cf/diego_bbs_client.ca))
            client_cert: ((!/bosh-hcp-live-eu10/cf/diego_bbs_client.certificate))
            client_key: ((!/bosh-hcp-live-eu10/cf/diego_bbs_client.private_key))
          local_mode: true
          log_level: error
          nats:
            tls:
              client_cert: ((!/bosh-hcp-live-eu10/cf/nats_client_cert.certificate))
              client_key: ((!/bosh-hcp-live-eu10/cf/nats_client_cert.private_key))
              enabled: true
      internal_routes:
        enabled: true
      logging:
        format:
          timestamp: rfc3339
      loggregator:
        ca_cert: ((!/bosh-hcp-live-eu10/cf/loggregator_tls_agent.ca))
        cert: ((!/bosh-hcp-live-eu10/cf/loggregator_tls_agent.certificate))
        key: ((!/bosh-hcp-live-eu10/cf/loggregator_tls_agent.private_key))
        use_v2_api: true
      tcp:
        enabled: false
      uaa:
        ca_cert: ((!/bosh-hcp-live-eu10/cf/uaa_ssl.ca))
    release: diego
  - name: gorouter
    properties:
      router:
        backends:
          cert_chain: "((gorouter_backend_tls.certificate))"
          private_key: "((gorouter_backend_tls.private_key))"
        ca_certs: |
          ((diego_instance_identity_ca.ca))
          ((cc_tls.ca))
          ((uaa_ssl.ca))
          ((network_policy_server_external.ca))
        disable_log_forwarded_for: true
        disable_log_source_ip: true
        enable_ssl: true
        extra_headers_to_log:
        - X-CorrelationID
        - tenantid
        - X-SCP-Request-Id
        - X-Cf-App-Instance
        logging_level: info
        max_idle_connections: 49500
        route_services_secret: "((router_route_services_secret))"
        secure_cookies: true
        ssl_skip_validation: false
        status:
          password: "((router_status_password))"
          port: 8080
          user: router-status
        tls_pem:
        - cert_chain: "((router_ssl.certificate))"
          private_key: "((router_ssl.private_key))"
        tracing:
          enable_zipkin: true
      uaa:
        ca_cert: "((uaa_ssl.ca))"
        clients:
          gorouter:
            secret: "((uaa_clients_gorouter_secret))"
        ssl:
          port: 8443
    provides:
      gorouter:
        as: gorouter_link
        shared: true
    release: routing

Now I'm curious how many VMs you see this error for. In the screenshot, it looks like a single Diego Cell and a single NATS VM, but that's probably just because you're filtering things down to understanding timing.
During that time, do you see the error from other Diego Cells as well? Do all the Diego Cells have an issue with a single NATS VM? Or do some of them have the same issue with the other NATS VM?

This screenshot was filtered for one NATS VM and one diego cell.
Without the filter for the Diego cell IP we see the errors on both NATS VMs for all Diego cells, starting slowly with basically the same Diego cells:


  | Time | syslog_hostname | syslog_message
-- | -- | -- | --
  | Sep 9, 2020 @ 11:44:11.599 | 10.0.65.3 | [6] 2020/09/09 11:44:11.148101 [ERR] 10.0.138.123:59060 - cid:1858 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.599 | 10.0.65.3 | [6] 2020/09/09 11:44:11.546592 [ERR] 10.0.65.173:33842 - cid:1866 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.599 | 10.0.65.3 | [6] 2020/09/09 11:44:11.501206 [ERR] 10.0.138.9:39040 - cid:1865 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.599 | 10.0.65.3 | [6] 2020/09/09 11:44:11.473439 [ERR] 10.0.73.113:36016 - cid:1864 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.599 | 10.0.65.3 | [6] 2020/09/09 11:44:11.222588 [ERR] 10.0.65.130:42656 - cid:1860 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.599 | 10.0.65.3 | [6] 2020/09/09 11:44:11.246858 [ERR] 10.0.74.242:52450 - cid:1861 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.599 | 10.0.65.3 | [6] 2020/09/09 11:44:11.342535 [ERR] 10.0.73.182:48596 - cid:1862 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.599 | 10.0.65.3 | [6] 2020/09/09 11:44:11.445484 [ERR] 10.0.137.166:58246 - cid:1863 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.599 | 10.0.65.3 | [6] 2020/09/09 11:44:11.177199 [ERR] 10.0.75.9:56096 - cid:1859 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.598 | 10.0.65.3 | [6] 2020/09/09 11:44:11.105175 [ERR] 10.0.138.221:50230 - cid:1857 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.598 | 10.0.65.3 | [6] 2020/09/09 11:44:10.797854 [ERR] 10.0.129.160:41116 - cid:1854 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.598 | 10.0.65.3 | [6] 2020/09/09 11:44:10.936417 [ERR] 10.0.74.177:53814 - cid:1856 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.598 | 10.0.65.3 | [6] 2020/09/09 11:44:10.710114 [ERR] 10.0.73.121:55862 - cid:1853 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.598 | 10.0.65.3 | [6] 2020/09/09 11:44:10.606294 [ERR] 10.0.65.165:53840 - cid:1852 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.598 | 10.0.65.3 | [6] 2020/09/09 11:44:10.858325 [ERR] 10.0.73.159:33884 - cid:1855 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.598 | 10.0.65.3 | [6] 2020/09/09 11:44:10.068292 [ERR] 10.0.129.172:59612 - cid:1847 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.598 | 10.0.65.3 | [6] 2020/09/09 11:44:10.550608 [ERR] 10.0.73.198:43966 - cid:1851 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.598 | 10.0.65.3 | [6] 2020/09/09 11:44:10.314923 [ERR] 10.0.129.179:47570 - cid:1848 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.598 | 10.0.65.3 | [6] 2020/09/09 11:44:10.373926 [ERR] 10.0.137.191:39952 - cid:1849 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.598 | 10.0.65.3 | [6] 2020/09/09 11:44:10.552418 [ERR] 10.0.129.151:43796 - cid:1850 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.598 | 10.0.65.3 | [6] 2020/09/09 11:44:09.949956 [ERR] 10.0.74.255:33522 - cid:1845 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.598 | 10.0.65.3 | [6] 2020/09/09 11:44:10.034569 [ERR] 10.0.65.177:34912 - cid:1846 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.598 | 10.0.65.3 | [6] 2020/09/09 11:44:09.897605 [ERR] 10.0.137.146:45270 - cid:1844 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.598 | 10.0.65.3 | [6] 2020/09/09 11:44:09.508946 [ERR] 10.0.138.237:40810 - cid:1841 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.598 | 10.0.65.3 | [6] 2020/09/09 11:44:09.109498 [ERR] 10.0.129.149:35194 - cid:1838 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.598 | 10.0.65.3 | [6] 2020/09/09 11:44:09.288424 [ERR] 10.0.73.169:52726 - cid:1839 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.598 | 10.0.65.3 | [6] 2020/09/09 11:44:08.878961 [ERR] 10.0.65.132:60910 - cid:1835 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.598 | 10.0.65.3 | [6] 2020/09/09 11:44:08.648058 [ERR] 10.0.65.178:44210 - cid:1834 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.598 | 10.0.65.3 | [6] 2020/09/09 11:44:09.303251 [ERR] 10.0.74.170:50196 - cid:1840 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.598 | 10.0.65.3 | [6] 2020/09/09 11:44:09.005268 [ERR] 10.0.65.167:54718 - cid:1836 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.598 | 10.0.65.3 | [6] 2020/09/09 11:44:09.569452 [ERR] 10.0.129.119:37666 - cid:1842 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.598 | 10.0.65.3 | [6] 2020/09/09 11:44:09.579891 [ERR] 10.0.65.80:52536 - cid:1843 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.598 | 10.0.65.3 | [6] 2020/09/09 11:44:09.075789 [ERR] 10.0.65.166:48110 - cid:1837 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.598 | 10.0.65.3 | [6] 2020/09/09 11:44:08.443735 [ERR] 10.0.65.175:49190 - cid:1833 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.598 | 10.0.65.3 | [6] 2020/09/09 11:44:07.910095 [ERR] 10.0.138.181:52570 - cid:1828 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.598 | 10.0.65.3 | [6] 2020/09/09 11:44:07.888715 [ERR] 10.0.65.193:52232 - cid:1826 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.598 | 10.0.65.3 | [6] 2020/09/09 11:44:08.363453 [ERR] 10.0.73.145:46152 - cid:1832 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.598 | 10.0.65.3 | [6] 2020/09/09 11:44:08.123696 [ERR] 10.0.75.3:37964 - cid:1830 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.598 | 10.0.65.3 | [6] 2020/09/09 11:44:08.276974 [ERR] 10.0.73.197:36044 - cid:1831 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.598 | 10.0.65.3 | [6] 2020/09/09 11:44:07.958496 [ERR] 10.0.65.192:54046 - cid:1827 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.598 | 10.0.65.3 | [6] 2020/09/09 11:44:07.924265 [ERR] 10.0.138.112:55180 - cid:1829 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:07.598 | 10.0.65.3 | [6] 2020/09/09 11:44:07.080042 [ERR] 10.0.138.221:50110 - cid:1822 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:07.598 | 10.0.65.3 | [6] 2020/09/09 11:44:07.127090 [ERR] 10.0.138.123:58656 - cid:1823 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:07.598 | 10.0.65.3 | [6] 2020/09/09 11:44:07.155570 [ERR] 10.0.75.9:55954 - cid:1824 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:07.598 | 10.0.65.3 | [6] 2020/09/09 11:44:07.442391 [ERR] 10.0.73.113:35944 - cid:1825 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:06.598 | 10.0.65.3 | [6] 2020/09/09 11:44:06.520897 [ERR] 10.0.73.198:43548 - cid:1821 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:06.598 | 10.0.65.3 | [6] 2020/09/09 11:44:05.867984 [ERR] 10.0.137.146:45172 - cid:1820 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:05.597 | 10.0.65.3 | [6] 2020/09/09 11:44:05.475402 [ERR] 10.0.138.237:40700 - cid:1819 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:05.597 | 10.0.65.3 | [6] 2020/09/09 11:44:05.326644 [ERR] 10.0.73.149:33918 - cid:1818 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:05.597 | 10.0.65.3 | [6] 2020/09/09 11:44:05.221778 [ERR] 10.0.74.205:39164 - cid:1817 - TLS handshake error: remote error: tls: bad certificate

  | Time | syslog_hostname | syslog_message
-- | -- | -- | --
  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.929194 [ERR] 10.0.138.181:46890 - cid:953 - TLS handshake error: remote error: tls: bad certificate
  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.756611 [ERR] 10.0.73.185:53044 - cid:952 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.473114 [ERR] 10.0.137.166:34978 - cid:948 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.515925 [ERR] 10.0.65.173:54998 - cid:947 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.482491 [ERR] 10.0.138.9:49774 - cid:949 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:12.204362 [ERR] 10.0.65.66:37696 - cid:954 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.687101 [ERR] 10.0.138.254:56992 - cid:951 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:12.160692 [ERR] 10.0.75.3:36110 - cid:958 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.485875 [ERR] 10.0.73.113:34146 - cid:950 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:12.148864 [ERR] 10.0.75.10:57950 - cid:957 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.334133 [ERR] 10.0.73.182:39756 - cid:946 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:12.266 | 10.0.129.3 | [6] 2020/09/09 11:44:12.008478 [ERR] 10.0.138.112:45620 - cid:955 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.188478 [ERR] 10.0.65.130:35244 - cid:942 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.191575 [ERR] 10.0.75.9:60366 - cid:944 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.266 | 10.0.129.3 | [6] 2020/09/09 11:44:10.914024 [ERR] 10.0.65.132:54724 - cid:939 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.238365 [ERR] 10.0.74.242:54232 - cid:945 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.094700 [ERR] 10.0.138.221:44150 - cid:941 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.266 | 10.0.129.3 | [6] 2020/09/09 11:44:10.948434 [ERR] 10.0.74.177:33944 - cid:940 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.266 | 10.0.129.3 | [6] 2020/09/09 11:44:11.137959 [ERR] 10.0.138.123:57882 - cid:943 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.265 | 10.0.129.3 | [6] 2020/09/09 11:44:10.767755 [ERR] 10.0.129.160:37108 - cid:937 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.265 | 10.0.129.3 | [6] 2020/09/09 11:44:10.570042 [ERR] 10.0.65.165:51636 - cid:934 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.265 | 10.0.129.3 | [6] 2020/09/09 11:44:10.690861 [ERR] 10.0.73.121:58842 - cid:936 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.265 | 10.0.129.3 | [6] 2020/09/09 11:44:10.845952 [ERR] 10.0.73.159:53412 - cid:938 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.265 | 10.0.129.3 | [6] 2020/09/09 11:44:10.542238 [ERR] 10.0.73.198:37998 - cid:935 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:11.265 | 10.0.129.3 | [6] 2020/09/09 11:44:10.382709 [ERR] 10.0.137.191:54544 - cid:933 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.265 | 10.0.129.3 | [6] 2020/09/09 11:44:09.299820 [ERR] 10.0.73.169:60562 - cid:923 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.265 | 10.0.129.3 | [6] 2020/09/09 11:44:09.920351 [ERR] 10.0.65.193:33270 - cid:928 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.265 | 10.0.129.3 | [6] 2020/09/09 11:44:10.029356 [ERR] 10.0.129.172:49020 - cid:930 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.265 | 10.0.129.3 | [6] 2020/09/09 11:44:09.530868 [ERR] 10.0.129.119:51518 - cid:924 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.265 | 10.0.129.3 | [6] 2020/09/09 11:44:09.524674 [ERR] 10.0.138.237:54762 - cid:926 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.265 | 10.0.129.3 | [6] 2020/09/09 11:44:10.003625 [ERR] 10.0.65.177:49794 - cid:929 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.265 | 10.0.129.3 | [6] 2020/09/09 11:44:09.290979 [ERR] 10.0.74.170:52174 - cid:922 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.265 | 10.0.129.3 | [6] 2020/09/09 11:44:09.964725 [ERR] 10.0.74.255:59862 - cid:931 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.265 | 10.0.129.3 | [6] 2020/09/09 11:44:09.886873 [ERR] 10.0.137.146:44804 - cid:927 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.265 | 10.0.129.3 | [6] 2020/09/09 11:44:09.535475 [ERR] 10.0.65.80:54340 - cid:925 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:10.265 | 10.0.129.3 | [6] 2020/09/09 11:44:10.004630 [ERR] 10.0.65.192:39552 - cid:932 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.265 | 10.0.129.3 | [6] 2020/09/09 11:44:08.397363 [ERR] 10.0.65.175:46244 - cid:915 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.265 | 10.0.129.3 | [6] 2020/09/09 11:44:09.036713 [ERR] 10.0.65.166:59932 - cid:920 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.265 | 10.0.129.3 | [6] 2020/09/09 11:44:08.354454 [ERR] 10.0.73.145:51136 - cid:917 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.265 | 10.0.129.3 | [6] 2020/09/09 11:44:08.617024 [ERR] 10.0.65.178:45952 - cid:918 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.265 | 10.0.129.3 | [6] 2020/09/09 11:44:08.310603 [ERR] 10.0.73.197:42604 - cid:916 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.265 | 10.0.129.3 | [6] 2020/09/09 11:44:08.973381 [ERR] 10.0.65.167:36454 - cid:919 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:09.265 | 10.0.129.3 | [6] 2020/09/09 11:44:09.080280 [ERR] 10.0.129.149:48980 - cid:921 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.265 | 10.0.129.3 | [6] 2020/09/09 11:44:08.114736 [ERR] 10.0.75.3:36078 - cid:914 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.265 | 10.0.129.3 | [6] 2020/09/09 11:44:07.933143 [ERR] 10.0.138.112:44860 - cid:913 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.265 | 10.0.129.3 | [6] 2020/09/09 11:44:07.463878 [ERR] 10.0.73.113:34074 - cid:911 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:08.265 | 10.0.129.3 | [6] 2020/09/09 11:44:07.889462 [ERR] 10.0.138.181:46832 - cid:912 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:07.265 | 10.0.129.3 | [6] 2020/09/09 11:44:07.069087 [ERR] 10.0.138.221:44030 - cid:908 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:07.265 | 10.0.129.3 | [6] 2020/09/09 11:44:06.512403 [ERR] 10.0.73.198:37580 - cid:907 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:07.265 | 10.0.129.3 | [6] 2020/09/09 11:44:07.116756 [ERR] 10.0.138.123:57488 - cid:909 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:07.265 | 10.0.129.3 | [6] 2020/09/09 11:44:07.167517 [ERR] 10.0.75.9:60222 - cid:910 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:06.264 | 10.0.129.3 | [6] 2020/09/09 11:44:05.338333 [ERR] 10.0.73.149:46298 - cid:904 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:06.264 | 10.0.129.3 | [6] 2020/09/09 11:44:05.485023 [ERR] 10.0.138.237:54652 - cid:905 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:06.264 | 10.0.129.3 | [6] 2020/09/09 11:44:05.850054 [ERR] 10.0.137.146:44708 - cid:906 - TLS handshake error: remote error: tls: bad certificate

  | Sep 9, 2020 @ 11:44:05.264 | 10.0.129.3 | [6] 2020/09/09 11:44:05.235283 [ERR] 10.0.74.205:40652 - cid:903 - TLS handshake error: remote error: tls: bad certificate

In the end we see errors for all cells. We have 451 cells, and for each cell we see a log entry about every 6 seconds, leading to a picture like this for one NATS VM; on the other it looks the same (note that errors are logged every second):
Screen Shot 2020-09-11 at 21 29 18

@stefanlay
Member Author

We found some indication that the TLS issue is not the root cause but an effect of an unhealthy cluster.

We found log messages indicating that one NATS VM is considered a "slow consumer" by the other NATS VM on port 4223, which is used for updating the other instance about new messages:

  | Sep 9, 2020 @ 11:43:09.585 | 10.0.65.3 | nats | [6] 2020/09/09 11:43:09.004747 [INF] 10.0.129.3:44466 - rid:2279 - Route connection created
  | Sep 9, 2020 @ 11:43:09.585 | 10.0.65.3 | nats | [6] 2020/09/09 11:43:09.036429 [INF] 10.0.129.3:4223 - rid:2280 - Route connection created
  | Sep 9, 2020 @ 11:43:09.252 | 10.0.129.3 | nats | [6] 2020/09/09 11:43:09.003813 [INF] 10.0.65.3:4223 - rid:467 - Route connection created
  | Sep 9, 2020 @ 11:43:09.252 | 10.0.129.3 | nats | [6] 2020/09/09 11:43:09.037182 [INF] 10.0.65.3:57280 - rid:468 - Route connection created
  | Sep 9, 2020 @ 11:43:08.585 | 10.0.65.3 | nats | [6] 2020/09/09 11:43:07.883684 [INF] 10.0.129.3:4223 - rid:2278 - Route connection created
  | Sep 9, 2020 @ 11:43:08.585 | 10.0.65.3 | nats | [6] 2020/09/09 11:43:07.895264 [INF] 10.0.129.3:44378 - rid:2277 - Route connection created
  | Sep 9, 2020 @ 11:43:08.252 | 10.0.129.3 | nats | [6] 2020/09/09 11:43:07.883814 [INF] 10.0.65.3:57278 - rid:466 - Route connection created
  | Sep 9, 2020 @ 11:43:08.252 | 10.0.129.3 | nats | [6] 2020/09/09 11:43:07.876120 [INF] 10.0.65.3:4223 - rid:465 - Route connection created
  | Sep 9, 2020 @ 11:43:07.585 | 10.0.65.3 | nats | [6] 2020/09/09 11:43:06.712144 [INF] 10.0.129.3:4223 - rid:2275 - Slow Consumer Detected: WriteDeadline of 2s Exceeded
  | Sep 9, 2020 @ 11:42:48.582 | 10.0.65.3 | nats | [6] 2020/09/09 11:42:47.711481 [INF] 10.0.129.3:44266 - rid:2276 - Route connection created
  | Sep 9, 2020 @ 11:42:48.248 | 10.0.129.3 | nats | [6] 2020/09/09 11:42:47.710787 [INF] 10.0.65.3:4223 - rid:464 - Route connection created
  | Sep 9, 2020 @ 11:42:48.248 | 10.0.129.3 | nats | [6] 2020/09/09 11:42:47.571641 [INF] 10.0.65.3:57166 - rid:463 - Route connection created
  | Sep 9, 2020 @ 11:42:47.582 | 10.0.65.3 | nats | [6] 2020/09/09 11:42:47.570987 [INF] 10.0.129.3:4223 - rid:2275 - Route connection created
  | Sep 9, 2020 @ 11:42:46.582 | 10.0.65.3 | nats | [6] 2020/09/09 11:42:46.469634 [INF] 10.0.129.3:4223 - rid:2273 - Slow Consumer Detected: WriteDeadline of 2s Exceeded
  | Sep 9, 2020 @ 11:42:27.578 | 10.0.65.3 | nats | [6] 2020/09/09 11:42:26.620253 [INF] 10.0.129.3:44142 - rid:2274 - Route connection created
  | Sep 9, 2020 @ 11:42:27.244 | 10.0.129.3 | nats | [6] 2020/09/09 11:42:26.619556 [INF] 10.0.65.3:4223 - rid:462 - Route connection created
  | Sep 9, 2020 @ 11:42:27.244 | 10.0.129.3 | nats | [6] 2020/09/09 11:42:26.555847 [INF] 10.0.65.3:57042 - rid:461 - Route connection created
  | Sep 9, 2020 @ 11:42:26.578 | 10.0.65.3 | nats | [6] 2020/09/09 11:42:26.555190 [INF] 10.0.129.3:4223 - rid:2273 - Route connection created
  | Sep 9, 2020 @ 11:42:25.578 | 10.0.65.3 | nats | [6] 2020/09/09 11:42:25.486874 [INF] 10.0.129.3:4223 - rid:2271 - Slow Consumer Detected: WriteDeadline of 2s Exceeded
  | Sep 9, 2020 @ 11:42:07.575 | 10.0.65.3 | nats | [6] 2020/09/09 11:42:07.186278 [INF] 10.0.129.3:4223 - rid:2271 - Route connection created
  | Sep 9, 2020 @ 11:42:07.575 | 10.0.65.3 | nats | [6] 2020/09/09 11:42:07.232365 [INF] 10.0.129.3:44032 - rid:2272 - Route connection created
  | Sep 9, 2020 @ 11:42:07.239 | 10.0.129.3 | nats | [6] 2020/09/09 11:42:07.231454 [INF] 10.0.65.3:4223 - rid:460 - Route connection created
  | Sep 9, 2020 @ 11:42:07.239 | 10.0.129.3 | nats | [6] 2020/09/09 11:42:07.187467 [INF] 10.0.65.3:56932 - rid:459 - Route connection created
  | Sep 9, 2020 @ 11:42:06.575 | 10.0.65.3 | nats | [6] 2020/09/09 11:42:06.109442 [INF] 10.0.129.3:43920 - rid:2269 - Slow Consumer Detected: WriteDeadline of 2s Exceeded
  | Sep 9, 2020 @ 11:41:47.571 | 10.0.65.3 | nats | [6] 2020/09/09 11:41:46.620497 [INF] 10.0.129.3:4223 - rid:2270 - Route connection created
  | Sep 9, 2020 @ 11:41:47.235 | 10.0.129.3 | nats | [6] 2020/09/09 11:41:46.621173 [INF] 10.0.65.3:56820 - rid:458 - Route connection created
  | Sep 9, 2020 @ 11:41:47.235 | 10.0.129.3 | nats | [6] 2020/09/09 11:41:46.518866 [INF] 10.0.65.3:4223 - rid:457 - Route connection created
  | Sep 9, 2020 @ 11:41:46.571 | 10.0.65.3 | nats | [6] 2020/09/09 11:41:46.519564 [INF] 10.0.129.3:43920 - rid:2269 - Route connection created
  | Sep 9, 2020 @ 11:41:45.571 | 10.0.65.3 | nats | [6] 2020/09/09 11:41:45.375855 [INF] 10.0.129.3:43896 - rid:2267 - Slow Consumer Detected: WriteDeadline of 2s Exceeded
  | Sep 9, 2020 @ 11:41:28.231 | 10.0.129.3 | nats | [6] 2020/09/09 11:41:27.451784 [INF] 10.0.65.3:4223 - rid:455 - Route connection created
  | Sep 9, 2020 @ 11:41:28.231 | 10.0.129.3 | nats | [6] 2020/09/09 11:41:27.506137 [INF] 10.0.65.3:56696 - rid:456 - Route connection created
  | Sep 9, 2020 @ 11:41:27.567 | 10.0.65.3 | nats | [6] 2020/09/09 11:41:27.505537 [INF] 10.0.129.3:4223 - rid:2268 - Route connection created
  | Sep 9, 2020 @ 11:41:27.567 | 10.0.65.3 | nats | [6] 2020/09/09 11:41:27.452753 [INF] 10.0.129.3:43896 - rid:2267 - Route connection created
  | Sep 9, 2020 @ 11:41:26.567 | 10.0.65.3 | nats | [6] 2020/09/09 11:41:26.386851 [INF] 10.0.129.3:4223 - rid:1767 - Slow Consumer Detected: WriteDeadline of 2s Exceeded
  | Sep 9, 2020 @ 11:41:16.565 | 10.0.65.3 | nats | [6] 2020/09/09 11:41:16.355473 [INF] 10.0.129.23:30582 - cid:1980 - Slow Consumer Detected: WriteDeadline of 2s Exceeded

Nearly 2 minutes after the first such log entry, the gorouters start pruning some routes, see https://user-images.githubusercontent.com/358527/92950779-8b094d80-f45d-11ea-85a0-b95156f42a91.png

Then, after some time, suddenly all routes are dropped, and only later do the certificate errors occur for the connection attempts on port 4224. By the way, we also saw the same "TLS handshake error: remote error: tls: bad certificate" messages during a BOSH update when the NATS job is stopped on another NATS VM, and this error message is not logged on the Diego side.

We now have some questions:

Do you think that such communication issue between the two NATS instances can lead to such a complete loss of routes and such certificate errors?

What would be the best recommendation to analyse such slow consumer issues?

Would it be sufficient to scale up to bigger VMs with more network bandwidth? Would you also recommend scaling out to more than two VMs?

@wallyqs

wallyqs commented Sep 15, 2020

Hello from the NATS team 👋. I'd like to gather some more info on the current setup if possible. I notice that in the release the old version of the NATS Server is being used:

go build -o "${BOSH_INSTALL_TARGET}/bin/gnatsd" .

But is the version of the NATS Server still v1? Or is the latest NATS Server v2 being used? I'm asking because the Slow Consumer behavior between routes is improved in the NATS Server v2 series in some scenarios, especially when there is a very large number of subscriptions.

@rhall-pivotal
Contributor

Hi @stefanlay & @wallyqs,

I'm looking into this issue today with @mcwumbly. We've reviewed this discussion, along with the discussion over at nats-io/nats-server#1597.

First, we have responses to a few of the questions we've seen.

Do you think that such communication issue between the two NATS instances can lead to such a complete loss of routes and such certificate errors?

A: We now suspect this is due to the additional overhead required by all of the connections to NATS that now use TLS. Notably, the 450+ Diego cells, combined with the fact that cf-deployment turns on TLS between the route-emitters and NATS, could explain why we are only now seeing issues like this.

What would be the best recommendation to analyse such slow consumer issues?

A: We're happy that you opened the other issue on NATS.io, as it sounds like they have had some good insights into what you're seeing. We will subscribe to that issue as well.
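
For context, the "Slow Consumer Detected: WriteDeadline of 2s Exceeded" lines correspond roughly to the following server-side behavior (a simplified sketch, not the actual nats-server code): before flushing pending data to a route or client connection, the server arms a write deadline, and if the peer does not drain the data in time the connection is dropped and later re-created, which matches the "Route connection created" lines that follow each slow-consumer entry in your logs:

package natsillustration

import (
	"net"
	"time"
)

// flushWithDeadline is a simplified illustration of a write-deadline check on a
// route or client connection, not the nats-server implementation.
func flushWithDeadline(conn net.Conn, pending []byte, writeDeadline time.Duration) error {
	conn.SetWriteDeadline(time.Now().Add(writeDeadline)) // 2s in this deployment (write_deadline)
	if _, err := conn.Write(pending); err != nil {
		conn.Close() // the peer is treated as a slow consumer
		return err
	}
	return nil
}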

Would it be sufficient to scale out to bigger VMs with more network bandwidth? Would you also recommend to scale out to more than two VMs?

A: We think that scaling up the size of the NATS VMs or scaling out the number of instances of the NATS VMs could both be beneficial in this case.

With more instances, the load from the TLS connections should be distributed better, reducing the overhead on each VM overall. Bigger VMs with more CPU available to them, and perhaps faster disk and RAM, could also help with what @kozlovic said in the NATS.io issue.

But is the version of the NATS Server still v1? Or is latest NATS Server v2 being used?

A: @wallyqs as you discovered above, the version of NATS is still v1, specifically v1.4.1. We hear your recommendations clearly and we have created an issue to track that here: #26


@stefanlay we think it is worth exploring ways to resolve this issue independently of the upgrade of NATS v2 as that could take some time to accomplish. One approach as you suggested is to scale up the NATS VMs. Another thing to consider may be to use an ops file to turn off TLS between the route-emitters and the NATS VMs.

@Infra-Red

@stefanlay @plowin Could you please describe the steps that were taken to solve the problem? At the moment my team is using a cf-deployment version from before nats-tls was introduced, and we are trying to understand what the impact of upgrading our installation could be. The total number of active routes is not as large as in your case (we have ~5k active routes), but we are still concerned about the problem described above.

@ameowlia
Member

ameowlia commented Oct 8, 2020

Hi all,

We saw another instance of a bug that looked remarkably like this yesterday.

The problem that we saw yesterday:

  • in the deployment route emitter does NOT use nats-tls
  • NATS CPU spiked to 100% followed by "TLS handshake error: remote error: tls: bad certificate" errors
  • the traffic that was causing this error was coming from the diego cells
  • the gorouter did not drop any of its routes because pruning was turned off
  • however the service discovery controller (the component that resolves internal routes) lost all of its routes because of pruning
  • restarting the nats vms fixed the issue (at least, momentarily)

Something new I learned:

  • the metrics-discovery-registrar job DOES use nats-tls. This job is located on the diego cell. This is likely the source of the traffic that is having problems connecting to the nats-tls job. I suspect that this is a symptom of the problem, not the cause.

Another thought:
@stefanlay You mentioned that gorouter dropped all of your routes as a result of this bug. I find this shocking and deeply disturbing. Based on your original issue you are on cf-deployment 13.8.0, which uses routing release 0.203.0. There used to be a gorouter config property that allowed users to turn on TTL-based pruning for stale routes. However, we got rid of this flag in routing release 0.195.0, for precisely this reason: we don't want any problems with NATS to cause issues for those trying to route to apps.

@stefanlay can you confirm what version of routing-release you have deployed?

@stefanlay
Member Author

Hi @Infra-Red,

we tried to reproduce the scenario and have a strong indication that the slow consumer issues between the two NATS nodes were causing the outage. As a mitigation of the slow consumer issue we increased the NATS VM size, and therefore also other parameters like network bandwidth. Within the last month we only saw two slow consumer errors between NATS nodes, which were gone after the route connections were recreated. We are still investigating which metrics we can use to better predict a slow consumer issue. Based on our observations, NATS CPU load is not a good metric.

Hi @ameowlia

We also see that the metrics-discovery-registrar job uses nats-tls, and the TLS errors may also come from these connections when the CPU load was at 100%.

We have however another explanation why in our case the CPU load was increasing:

What we now understand is that, due to a bug in Diego, the route-emitters could not use TLS
although TLS was configured. The bug was fixed with cloudfoundry/diego-release@44260f2. Because the TLS certificate contained the DNS name, but the client only knew the IP address, the TLS handshake failed. We saw that route-emitters tried to use TLS on port 4224 on the first connection attempt but fell back to non-TLS on port 4222 (it is a feature of NATS to advertise all cluster nodes to the clients). When the nats process on 4222 then cannot accept new connections, the route-emitter tries again on 4224. When this is done from many Diego cells, CPU load increases due to the many TLS handshakes.
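
As an illustration of why that handshake fails (a minimal Go sketch, not the Diego or NATS client code; it assumes the server certificate only carries the DNS name nats.service.cf.internal and that a copy of the NATS CA is available locally):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
)

func main() {
	caPEM, err := os.ReadFile("nats-ca.pem") // hypothetical local copy of the NATS CA
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	// Dialing by IP: hostname verification compares the IP against the cert's SANs and
	// fails if the cert only contains a DNS name. The client then aborts the handshake,
	// which the server logs as "remote error: tls: bad certificate".
	_, err = tls.Dial("tcp", "10.0.65.3:4224", &tls.Config{RootCAs: pool})
	log.Println("dial by IP:", err)

	// Verifying against the DNS name in the cert (while still dialing the IP) succeeds.
	conn, err := tls.Dial("tcp", "10.0.65.3:4224", &tls.Config{
		RootCAs:    pool,
		ServerName: "nats.service.cf.internal",
	})
	log.Println("dial with ServerName:", err)
	if err == nil {
		conn.Close()
	}
}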

I do not understand the purpose of the gorouter config property that allowed users to turn on TTL-based pruning for stale routes. To our understanding, gorouter prunes all stale routes after 120s (https://github.com/cloudfoundry/routing-release/blob/abd514ad8b78886dd2487309dd19c5fc1c92cf4b/jobs/gorouter/templates/gorouter.yml.erb#L83). When did this property have an effect? Note that we do not use route integrity/backend TLS (router.backends.enable_tls: false).

You mentioned that you have turned pruning off in your case. How did you do that? With suspend_pruning_if_nats_unavailable?

We were on cf-deployment 13.12.0 but already used routing release 0.206.0 because of cloudfoundry/routing-release#178

Now we are on cf-deployment 13.17.0 and routing release 0.206.0.

ameowlia added a commit that referenced this issue Oct 12, 2020
* When advertising is on, clients that are incompatible with nats-tls
job will try to connect to nats-tls if nats is unavailable
* Clients wanting to connect via TLS get downgraded when connecting to
nats if nats-tls is unavailable

[#25](#25)
[cloudfoundry/routing-release#185](cloudfoundry/routing-release#185)
@ameowlia
Member

@stefanlay - I agree with your analysis about why the CPU is increasing. This is consistent with the issue I encountered as well.

🌹 Re: Gorouter and pruning

  • As of routing release 0.195.0, we got rid of the flag to prune TLS routes based on TTLs
  • Here is the code that tests that gorouter does not prune tls routes

🐛 What to do to mitigate this bug

  • Use TLS routes. This is also sometimes called "TLS to backends" or "route integrity". This will prevent a mass pruning of routes from gorouter.
  • Increase the internal route staleness threshold. The service discovery controller does prune based on TTL. I recommend increasing this to 1hr. This will give you enough time to discover this nats problem and restart the nats server.

📈 Fixing Bug Status

  • We are currently testing using no_advertise on our nats clusters. This prevents clients that should be talking only to nats from finding out about nats-tls. However, some jobs in CF have not been configured to use nats-tls properly and have been relying on the ability to downgrade to nats. This makes a seamless upgrade difficult and we might have to resort to continuing to allow downgrades for the time being.
  • We are working on getting off golang 1.14.1 for nats-release. This version of golang has a deadlock bug that could cause problems.

ameowlia added a commit that referenced this issue Oct 13, 2020
* currently some clients in CF are not configured to properly use
nats-tls and rely on being able to downgrade to nats.
* we are keeping this ability to downgrade to nats for backwards
compatability.
* we plan removing the ability to downgrade to nats once we fix all the
clients

[#25](#25)
[cloudfoundry/routing-release#185](cloudfoundry/routing-release#185)
psycofdj pushed a commit to orange-cloudfoundry/nats-release that referenced this issue Oct 14, 2020
* When advertising is on, clients that are incompatible with nats-tls
job will try to connect to nats-tls if nats is unavailable
* Clients wanting to connect via TLS get downgraded when connecting to
nats if nats-tls is unavailable

[cloudfoundry#25](cloudfoundry#25)
[cloudfoundry/routing-release#185](cloudfoundry/routing-release#185)
@plowin

plowin commented Oct 14, 2020

Hi @ameowlia, thanks for the analysis and efforts on this issue!
One question regarding your last comment:

Use TLS routes. This is also sometimes called "TLS to backends" or "route integrity". This will prevent a mass pruning of routes from gorouter.

Could you please help me understand how route-integrity and the mass pruning are related? In our case, NATS did not provide updates to the routers and after 120s the routes got pruned. We did not assume that route-integrity or Diego had an impact here.

@domdom82
Contributor

@plowin Since TLS ensures you never accidentally talk to the wrong endpoint, you no longer need TTL-based pruning. You just check the cert issuer and subject. If they match, you route forward; if they don't, you need to update your route table because the app has moved somewhere else.

Just tested it myself, deployed with route-integrity and stopped NATS. Never got pruned. So this would protect us against a NATS outage as well as misrouting.
(we are still investigating a statement on slack where it is claimed you can't do cf ssh anymore with route integrity enabled. However I managed to do it just fine, so I wonder what's up with that...)

@ameowlia Do you have any insights into the cf ssh vs. route integrity thing?

@ameowlia
Copy link
Member

Hi @plowin,

Could you please help me understand how route-integrity and the mass pruning are related?

When route integrity is enabled, the gorouter sends traffic to apps over TLS. The sidecar envoy presents the app's certificate to the gorouter during this handshake. The certificate contains the app's instance guid. The gorouter not only checks that the cert is valid, but also checks that the app instance guid in the cert matches the app instance guid it expects to be talking to. Essentially, this is checking that the route is not stale. If the guids don't match, the gorouter assumes this is a stale route, prunes it from its routing table, and then retries another entry. This action is logged as something like "pruned-failed-backend-endpoint".

Because we now rely on route-integrity to know if a route is stale, we turned off pruning on a TTL for TLS routes.
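
For illustration only (this is not gorouter's actual code): a minimal Go sketch of that stale-route check. It assumes the expected instance GUID comes from the route registration and that the instance-identity certificate carries that GUID in its CN or as a DNS SAN; check routing-release/diego-release for the exact fields gorouter uses.

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
)

// staleRouteCheck returns a VerifyPeerCertificate callback that checks whether
// the backend's certificate carries the app instance GUID that the routing
// table expects. A mismatch means the cell port was probably reused by a
// different app instance, i.e. the route is stale and should be pruned
// instead of retried forever.
func staleRouteCheck(expectedInstanceGUID string) func([][]byte, [][]*x509.Certificate) error {
	return func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
		if len(rawCerts) == 0 {
			return fmt.Errorf("backend presented no certificate")
		}
		cert, err := x509.ParseCertificate(rawCerts[0])
		if err != nil {
			return err
		}
		if cert.Subject.CommonName == expectedInstanceGUID {
			return nil
		}
		for _, san := range cert.DNSNames {
			if san == expectedInstanceGUID {
				return nil
			}
		}
		// The caller would log "pruned-failed-backend-endpoint", drop this
		// endpoint from the routing table and retry another one.
		return fmt.Errorf("instance GUID mismatch: treating route as stale")
	}
}

func main() {
	// Instance GUID taken from the cfdot output further down in this thread.
	cfg := &tls.Config{
		VerifyPeerCertificate: staleRouteCheck("4c3104bb-f789-4501-478c-dd94"),
	}
	_ = cfg // would be plugged into the transport that dials the backend
}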

😅 @domdom82 you replied 1 minute before me!

@ameowlia
Copy link
Member

@domdom82,

Do you have any insights into the cf ssh vs. route integrity thing?

What I have always heard/read in docs:
There are two versions of route integrity: tls and mtls. Sshing works with tls route integrity, but not mtls route integrity.

Long Answer:
In the "normal" case, when tls route integrity is turned on there are 4 ports exposed on each container (see the example actual-lrp below for the description of all of the ports. The
✨ emoji represents the ports related to this convo). I believe that the ssh agent connects to the DIEGO_CELL_SSH_PORT.

When mtls route integrity is turned on only 2 of those ports are exposed: the DIEGO_CELL_ENVOY_PORT and CONTAINER_ENVOY_SSH_PORT. I'm not sure if the ssh agent can be configured to connect with the CONTAINER_ENVOY_SSH_PORT. Or maybe it already does, but the user needs to configure something special?

$ cfdot actual-lrps | jq .

{
  "process_guid": "ab2bd185-9d9a-4628-9cd8-626649ec5432-cb50adac-6861-4f03-92e4-9fcc1a204a1e",
  "index": 0,
  "cell_id": "d8d4f5fe-36f2-4f50-8c4a-8df293f6bc5b",
  "address": "10.0.1.12",                  <------ The cell's IP address where this app instance is running, also sometimes called the host IP. Let's call this DIEGO_CELL_IP.
    "ports": [
      {
        "container_port": 8080,            <------ The port the app is listening on inside of its container. 8080 is the default value. Let's call this CONTAINER_APP_PORT.
        "host_port": 61012,                <------ ✨ The port on the Diego Cell where traffic to your app is sent to before it is forwarded to the overlay address and the container_port. Let's call this DIEGO_CELL_APP_PORT.
        "container_tls_proxy_port": 61001, <------ The port inside of the app container that envoy is listening on for HTTPS traffic. This is the default value (currently unchangeable). Let's call this CONTAINER_ENVOY_PORT.
        "host_tls_proxy_port": 61014,      <------ ✨ The port on the Diego Cell where traffic to your app's envoy sidecar is sent to before it is forwarded to the overlay address and the container_tls_proxy_port. Let's call this DIEGO_CELL_ENVOY_PORT
      },
      {
        "container_port": 2222,            <------ The port exposed on the app container for sshing onto the app container
        "host_port": 61013,                <------ ✨ The port on the Diego Cell where ssh traffic to your app container is sent to before it is forwarded to the overlay address and the ssh container_port. Let's call this DIEGO_CELL_SSH_PORT
        "container_tls_proxy_port": 61002, <------ The ssh port inside of the app container that envoy is listening on for ssh traffic. This is the default value (currently unchangeable).
        "host_tls_proxy_port": 61015       <------ ✨ The port on the Diego Cell where ssh traffic to your app's envoy sidecar is sent to before it is forwarded to the overlay address and the ssh container_tls_proxy_port. Let's call this DIEGO_CELL_ENVOY_SSH_PORT.
      }
    ],
  "instance_address": "10.255.116.6",      <------ The overlay IP address of this app instance, let's call this the OVERLAY_IP
  "state": "RUNNING",
   ...
}

@domdom82
Copy link
Contributor

domdom82 commented Oct 15, 2020

@ameowlia hmm this is weird.
I think I have configured everything so that actual mTLS is happening, yet still I am able to do a cf ssh into my app.

This is what my app looks like on cfdot:

{
  "process_guid": "f339d890-47dc-48fb-892f-b1781ac8e963-9cd719b4-8eae-4ea2-9ee1-db2594216456",
  "index": 0,
  "domain": "cf-apps",
  "instance_guid": "4c3104bb-f789-4501-478c-dd94",
  "cell_id": "3f7599e5-c105-449a-9665-e745bc299bb2",
  "address": "10.0.73.1",
  "ports": [
    {
      "container_port": 8080,
      "host_port": 61012,
      "container_tls_proxy_port": 61001,
      "host_tls_proxy_port": 61014
    },
    {
      "container_port": 2222,
      "host_port": 61013,
      "container_tls_proxy_port": 61002,
      "host_tls_proxy_port": 61015
    }
  ],
  "instance_address": "10.141.12.193",
  "preferred_address": "HOST",
  "crash_count": 0,
  "state": "RUNNING",
  "since": 1602738314291106300,
  "modification_tag": {
    "epoch": "05461369-1851-4fba-490c-e9f3461ec149",
    "index": 2
  },
  "presence": "ORDINARY"
}

So far so good. Here is my diego-cell config of cf-deployment:

name: diego-cell
jobs:
(...)
- name: rep
  properties:
    (...)
    containers:
      proxy:
        enabled: true
        require_and_verify_client_certificates: true
        trusted_ca_certificates:
        - ((gorouter_backend_tls.ca))
        - ((ssh_proxy_backends_tls.ca))
        verify_subject_alt_name:
        - gorouter.service.cf.internal
        - ssh-proxy.service.cf.internal
      trusted_ca_certificates:
      - ((diego_instance_identity_ca.ca))
      - ((credhub_tls.ca))
      - ((uaa_ssl.ca))
(...)
  release: diego
(...)

So to me this looks like mTLS should be enabled, right? (require_and_verify_client_certificates: true)
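
For what it's worth, at the TLS layer `require_and_verify_client_certificates: true` corresponds to the listener demanding and verifying a client certificate during the handshake, roughly like the Go sketch below. Envoy is of course configured through its own envoy.yaml rather than Go, and the CA path here is a placeholder, so treat this only as an illustration of the semantics:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"
	"log"
)

func main() {
	// CAs that client certificates (gorouter, ssh-proxy) must chain up to.
	// The file path is a placeholder for this sketch.
	caPEM, err := ioutil.ReadFile("/path/to/gorouter-and-ssh-proxy-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	clientCAs := x509.NewCertPool()
	if !clientCAs.AppendCertsFromPEM(caPEM) {
		log.Fatal("no CA certificates found in PEM file")
	}

	cfg := &tls.Config{
		// This is the mutual-TLS part: the handshake fails unless the client
		// presents a certificate signed by one of the trusted CAs. The
		// verify_subject_alt_name entries in the rep config add a further
		// check on the client cert's SAN on top of this.
		ClientAuth: tls.RequireAndVerifyClientCert,
		ClientCAs:  clientCAs,
	}
	_ = cfg // a real server would pass this to tls.NewListener / http.Server
}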

I also took a tcpdump when I connected the ssh session:

No.     Time           Time Delta     Source                Destination           Protocol Length Calculated window size Info
     39 6.299363       4.100878       10.0.65.71            10.158.16.8           TCP      74     62727                  47948 → 61002 [SYN] Seq=2857877295 Win=62727 Len=0 MSS=8961 SACK_PERM=1 TSval=1867992293 TSecr=0 WS=128
     40 6.299389       0.000026       10.158.16.8           10.0.65.71            TCP      74     62293                  61002 → 47948 [SYN, ACK] Seq=4288799740 Ack=2857877296 Win=62293 Len=0 MSS=8911 SACK_PERM=1 TSval=577200733 TSecr=1867992293 WS=128
     41 6.299504       0.000115       10.0.65.71            10.158.16.8           TCP      66     62848                  47948 → 61002 [ACK] Seq=2857877296 Ack=4288799741 Win=62848 Len=0 TSval=1867992293 TSecr=577200733
     42 6.299598       0.000094       10.0.65.71            10.158.16.8           TLSv1.2  264    62848                  Client Hello
     43 6.299606       0.000008       10.158.16.8           10.0.65.71            TCP      66     62208                  61002 → 47948 [ACK] Seq=4288799741 Ack=2857877494 Win=62208 Len=0 TSval=577200733 TSecr=1867992294
     44 6.302096       0.002490       10.158.16.8           10.0.65.71            TLSv1.2  2378   62208                  Server Hello, Certificate, Server Key Exchange, Certificate Request, Server Hello Done
     45 6.302281       0.000185       10.0.65.71            10.158.16.8           TCP      66     60544                  47948 → 61002 [ACK] Seq=2857877494 Ack=4288802053 Win=60544 Len=0 TSval=1867992296 TSecr=577200735
     46 6.306147       0.003866       10.0.65.71            10.158.16.8           TLSv1.2  1312   60544                  Certificate, Client Key Exchange, Certificate Verify, Change Cipher Spec, Encrypted Handshake Message
     47 6.307105       0.000958       10.158.16.8           10.0.65.71            TLSv1.2  117    61056                  Change Cipher Spec, Encrypted Handshake Message
     48 6.307214       0.000109       10.158.16.8           10.0.65.71            TLSv1.2  115    61056                  Application Data

So the ssh-proxy does connect to the envoy listener on port 61002 and there is an mTLS handshake happening (both parties present a certificate, see packet no. 46). After that there is regular application data going over the wire. (I could log in, got a shell, did some echo hello, etc.)

Here is a ps output of my SSH session:

vcap@4c3104bb-f789-4501-478c-dd94:~$ ps -xaf
    PID TTY      STAT   TIME COMMAND
    110 ?        Ssl    0:00 /etc/cf-assets/healthcheck/healthcheck -port=8080 -timeout=1000ms -liveness-interval=30
     22 ?        Ss     0:00 sh -c trap 'kill -9 0' TERM; /etc/cf-assets/envoy/envoy -c /etc/cf-assets/envoy_config/
     62 ?        Sl     0:29  \_ /etc/cf-assets/envoy/envoy -c /etc/cf-assets/envoy_config/envoy.yaml --drain-time-s
     17 ?        Ssl    0:00 npm
    102 ?        S      0:00  \_ sh -c node app.js
    103 ?        Sl     0:00      \_ node app.js
     11 ?        Ssl    0:00 /tmp/lifecycle/diego-sshd --allowedKeyExchanges= --address=0.0.0.0:2222 --allowUnauthen
    132 pts/0    Ss     0:00  \_ /bin/bash
    142 pts/0    R+     0:00      \_ ps -xaf
      1 ?        Ss     0:00 /tmp/garden-init

So there is the envoy running, my node app and the diego-sshd that forked off my shell.

Here is a list of connections on the app container:

vcap@4c3104bb-f789-4501-478c-dd94:~$ netstat -ant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:61001           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:61002           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:61003         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:2222            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN
tcp        0      0 10.141.12.193:40308     10.141.12.193:8080      TIME_WAIT
tcp      180      0 10.141.12.193:52974     10.141.12.193:2222      ESTABLISHED
tcp        0    180 10.141.12.193:2222      10.141.12.193:52974     ESTABLISHED
tcp        0      0 10.141.12.193:61002     10.0.65.71:55904        ESTABLISHED
tcp        0      0 10.141.12.193:40180     10.141.12.193:8080      TIME_WAIT

The connection between ssh-proxy and my container on port 61002 can be seen. Not sure about the local connection on port 2222. I guess this is ssh being forwarded by envoy to the sshd? But it all looks like it is working just fine.

@ameowlia
Copy link
Member

@domdom82 maybe it has been fixed 🤷 If you want to talk about sshing with mtls further, let's move the conversation elsewhere.

Related to the original issue @stefanlay, this has been fixed across versions v35 through v38. It took many releases to get it just right 😅. These releases will become available in cf-deployment very soon.

I am going to consider this fixed. @stefanlay, if you find that this does not fix your issue then please reopen.

@domdom82
Copy link
Contributor

Thanks @ameowlia! I have opened a docs issue for the cf-ssh thing, for lack of a better place.

@stefanlay
Copy link
Member Author

stefanlay commented Oct 16, 2020

Hi @ameowlia, thanks for all your input, that helps a lot.

I have a question about your fixes to NATS (cloudfoundry/routing-release#185 (comment)). With v37 nats.no_advertise is switched on for the nats job but switched off for the nats-tls job. What will happen to the connections from the route-emitters when the nats vms are updated?

Assume we have two nats vms, NATS1 and NATS2. In the route-emitter configuration nats is configured with "nats.service.cf.internal:4224".

The NATS docs are inconsistent about what will happen:

in [1] (the docs for the config option) it states "When set to true, do not advertise this server to clients."
Then only NATS1:4222 and NATS2:4222 are advertised. When NATS1 is stopped, the route-emitter will always fall back to NATS2:4222.

in [2] (the docs for the command-line flag), however, it states: "Do not advertise known cluster information to clients"
Then the nats-tls job would not advertise at all. When NATS1 is stopped, the route-emitter will retry until DNS resolution by chance returns the IP address of NATS2, and then connect to 4224.

What is the actual behaviour?

I also think that switching off advertising for the nats-tls job makes sense, because currently reconnecting to an advertised nats-tls would not work: the nats nodes are advertised with their IP addresses, but the certificate contains the DNS name. So even now the route-emitter would fall back to the nats job on port 4222.
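
To make the IP-vs-DNS-name point concrete, here is a sketch (using the nats.go client, not route-emitter's actual code) of how a client can pin the TLS ServerName to the configured DNS name, so that certificate verification would still succeed even when it reconnects to an advertised IP address. The CA path is a placeholder; whether route-emitter does anything like this has to be checked in its source:

package main

import (
	"crypto/tls"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Verify the server certificate against the DNS name from the client's
	// configuration, even if the actual dial target is an advertised IP.
	tlsConf := &tls.Config{
		ServerName: "nats.service.cf.internal",
	}

	nc, err := nats.Connect(
		"tls://nats.service.cf.internal:4224",
		nats.Secure(tlsConf),
		nats.RootCAs("/path/to/nats-ca.pem"), // placeholder path
		nats.MaxReconnects(-1),               // keep retrying through an outage
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Without the pinned ServerName, a reconnect to an advertised URL such as
	// tls://10.0.129.3:4224 would fail hostname verification, because the
	// server certificate only contains the DNS name.
}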

I think the TLS issues we had were not caused by route-emitters or other clients not using tls like you mentioned in #25 (comment), but by a bug which was fixed by cloudfoundry/diego-release@44260f2

On the other hand, when the diego-VMs are updated after NATS, the route-emitters will again connect on 4224, so in that case that's ok.

In general, I think we still haven't found the root cause of all the issues, as the TLS errors only came later, after all routes were dropped (#25 (comment)). But we have now improved our nats monitoring in order to better predict slow consumer issues, so we are ok with having this ticket closed.

[1] https://docs.nats.io/nats-server/configuration/clustering/cluster_config
[2] https://docs.nats.io/nats-server/flags

@ameowlia
Copy link
Member

Hi @stefanlay, you are correct. Clients of nats-tls may fall back to nats with the default values in the release. This is noted in the release notes as:

⚠️ Warning⚠️ This change in property defaults means that you may tell your clients to use nats_tls, but they can still use nats. This means that all the traffic you think is encrypted might not be. We plan on changing this value back to "true" in the future once we fix our clients.

If all of your nats-tls clients can successfully connect to nats-tls, then you can turn on no_advertise: true for nats as well, which will prevent this downgrading.
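
For anyone wondering what the option maps to inside nats-server itself: no_advertise is a cluster-level setting, and in Go terms (embedded server, nats-server v2) it looks roughly like the sketch below. The ports are placeholders, and the comment reflects my reading of the NATS docs linked above, so double-check against the version nats-release actually bundles:

package main

import (
	"log"

	"github.com/nats-io/nats-server/v2/server"
)

func main() {
	opts := &server.Options{
		Host: "0.0.0.0",
		Port: 4222, // client port (placeholder)
		Cluster: server.ClusterOpts{
			Host: "0.0.0.0",
			Port: 6222, // cluster port (placeholder, NATS default)
			// With NoAdvertise set, this server's client URL is left out of
			// the connect_urls gossip sent to clients, so clients connected
			// to the other job never learn about it and never fail over to it.
			NoAdvertise: true,
		},
	}

	ns, err := server.NewServer(opts)
	if err != nil {
		log.Fatal(err)
	}
	go ns.Start()
	defer ns.Shutdown()
	// ...the server now gossips cluster info, minus its own client URL.
}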
