Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Health-Monitor fails to start because of NATS? #2524

Closed
benjaminguttmann-avtq opened this issue May 24, 2024 · 2 comments
Closed

Health-Monitor fails to start because of NATS? #2524

benjaminguttmann-avtq opened this issue May 24, 2024 · 2 comments

Comments

@benjaminguttmann-avtq
Copy link

Describe the bug
During a bosh create-env (using bosh-deployment) the director is not able to start, it fails during step

Waiting for instance 'bosh/0' to be running... Failed (00:05:02)

When checking the VM the health-monitor state is flapping between not running and connection failed.
After a monit restart all, all jobs are able to start without any issues.

The environment that is currently showing the issues is running around 1300 deployments on the director.

To Reproduce
Unfortunately currently I do not know general approach to reproduce the issue

Expected behavior
Health-Monitor should be able to start without a monit restart all.

Logs
nats.logs

[7] 2024/05/22 11:45:50.195360 [INF] Starting nats-server
[7] 2024/05/22 11:45:50.195404 [INF]   Version:  2.10.14
[7] 2024/05/22 11:45:50.195407 [INF]   Git:      [31af767]
[7] 2024/05/22 11:45:50.195409 [INF]   Name:     NCCJDIEA5H3JBTKNNBC2KDV7M2W4CCAEMVLZNK5RJTLCAUUPAJEHHLIB
[7] 2024/05/22 11:45:50.195411 [INF]   ID:       NCCJDIEA5H3JBTKNNBC2KDV7M2W4CCAEMVLZNK5RJTLCAUUPAJEHHLIB
[7] 2024/05/22 11:45:50.195423 [INF] Using configuration file: /var/vcap/jobs/nats/config/nats.cfg
[7] 2024/05/22 11:45:50.199173 [INF] Listening for client connections on 0.0.0.0:4222
[7] 2024/05/22 11:45:50.199183 [INF] TLS required for client connections
[7] 2024/05/22 11:45:50.199339 [INF] Server is ready
[7] 2024/05/22 11:45:50.218014 [ERR] 10.125.3.15:39418 - cid:5 - authentication error
[7] 2024/05/22 11:45:50.237945 [ERR] 10.125.4.23:50476 - cid:6 - authentication error
[7] 2024/05/22 11:45:50.238401 [ERR] 10.125.2.57:33846 - cid:7 - authentication error
[7] 2024/05/22 11:45:50.243279 [ERR] 10.125.1.112:39118 - cid:9 - authentication error
[7] 2024/05/22 11:45:50.249160 [ERR] 10.125.0.165:41304 - cid:10 - authentication error
[7] 2024/05/22 11:45:50.251315 [ERR] 10.125.6.202:53978 - cid:8 - authentication error
[7] 2024/05/22 11:45:50.253802 [ERR] 10.125.0.124:40872 - cid:11 - authentication error
[7] 2024/05/22 11:45:50.258471 [ERR] 10.125.1.139:37004 - cid:12 - authentication error
[7] 2024/05/22 11:45:50.267431 [ERR] 10.125.2.231:42702 - cid:14 - authentication error
[7] 2024/05/22 11:45:50.267469 [ERR] 10.125.4.241:57278 - cid:13 - authentication error
[7] 2024/05/22 11:45:50.273000 [ERR] 10.125.2.6:52366 - cid:15 - authentication error
[7] 2024/05/22 11:45:50.287764 [ERR] 10.125.4.190:53216 - cid:16 - authentication error
[7] 2024/05/22 11:45:50.287956 [ERR] 10.125.7.5:41334 - cid:18 - authentication error
[7] 2024/05/22 11:45:50.288862 [ERR] 10.125.5.102:48380 - cid:19 - authentication error
[7] 2024/05/22 11:45:50.289510 [ERR] 10.125.0.112:49074 - cid:17 - authentication error
[7] 2024/05/22 11:45:50.293872 [ERR] 10.125.5.130:48124 - cid:20 - authentication error
[7] 2024/05/22 11:45:50.300604 [ERR] 10.125.1.104:39970 - cid:21 - authentication error
[7] 2024/05/22 11:45:50.324253 [ERR] 10.125.7.10:39478 - cid:22 - authentication error
[7] 2024/05/22 11:45:50.335404 [ERR] 10.125.0.156:37566 - cid:23 - authentication error

After monit restart the output of nats changes

[7] 2024/05/22 11:54:38.023891 [INF] Initiating Shutdown...
[7] 2024/05/22 11:54:38.047632 [INF] Server Exiting..
[7] 2024/05/22 11:54:39.157471 [INF] Starting nats-server
[7] 2024/05/22 11:54:39.157519 [INF]   Version:  2.10.14
[7] 2024/05/22 11:54:39.157525 [INF]   Git:      [31af767]
[7] 2024/05/22 11:54:39.157528 [INF]   Name:     NBBIXIY572PPZ6GGUHOU5BNQYESQAE2XFHZERVYBE4RONOKF2K4LG4QA
[7] 2024/05/22 11:54:39.157533 [INF]   ID:       NBBIXIY572PPZ6GGUHOU5BNQYESQAE2XFHZERVYBE4RONOKF2K4LG4QA
[7] 2024/05/22 11:54:39.157621 [INF] Using configuration file: /var/vcap/jobs/nats/config/nats.cfg
[7] 2024/05/22 11:54:39.161438 [INF] Listening for client connections on 0.0.0.0:4222
[7] 2024/05/22 11:54:39.161451 [INF] TLS required for client connections
[7] 2024/05/22 11:54:39.161628 [INF] Server is ready
[7] 2024/05/22 11:54:42.413573 [WRN] 10.125.6.214:43474 - cid:35 - Readloop processing time: 2.340787842s
[7] 2024/05/22 11:54:42.429530 [WRN] 10.125.7.167:40738 - cid:13 - Readloop processing time: 2.3593845s
[7] 2024/05/22 11:54:42.459741 [WRN] 10.125.7.173:49162 - cid:15 - Readloop processing time: 2.387748484s
[7] 2024/05/22 11:54:42.467671 [WRN] 10.125.5.15:39862 - cid:32 - Readloop processing time: 2.396247696s
[7] 2024/05/22 11:54:42.484340 [WRN] 10.125.1.110:36454 - cid:31 - Readloop processing time: 2.411959355s
[7] 2024/05/22 11:54:42.504861 [WRN] 10.125.0.228:33652 - cid:37 - Readloop processing time: 2.4357948s
[7] 2024/05/22 11:54:42.504897 [WRN] 10.125.4.73:38858 - cid:8 - Readloop processing time: 2.45904949s
[7] 2024/05/22 11:54:42.504978 [WRN] 10.125.5.1:58244 - cid:21 - Readloop processing time: 2.431191166s
[7] 2024/05/22 11:54:42.504877 [WRN] 10.125.4.145:42174 - cid:9 - Readloop processing time: 2.430899421s
[7] 2024/05/22 11:54:42.505251 [WRN] 10.125.5.168:41470 - cid:33 - Readloop processing time: 2.434002723s
[7] 2024/05/22 11:54:42.505297 [WRN] 10.125.4.113:40706 - cid:30 - Readloop processing time: 2.43348799s
[7] 2024/05/22 11:54:42.830067 [WRN] 10.125.5.19:40656 - cid:22 - Readloop processing time: 2.741601959s
[7] 2024/05/22 11:54:42.830155 [WRN] 10.125.1.48:39772 - cid:25 - Readloop processing time: 2.751031236s
[7] 2024/05/22 11:54:42.830161 [WRN] 10.125.1.208:40868 - cid:28 - Readloop processing time: 2.750756741s
[7] 2024/05/22 11:54:42.830208 [WRN] 10.125.0.49:58068 - cid:18 - Readloop processing time: 2.741677726s
[7] 2024/05/22 11:54:42.830288 [WRN] 10.125.1.33:58282 - cid:44 - Readloop processing time: 2.736221575s
[7] 2024/05/22 11:54:42.830365 [WRN] 10.125.5.76:58558 - cid:46 - Readloop processing time: 2.72323477s

healthmonitor.log

F, [2024-05-22T11:47:54.412550 #7] FATAL : NATS client error: 'Authorization Violation'
I, [2024-05-22T11:47:54.412628 #7]  INFO : HealthMonitor shutting down...
I, [2024-05-22T11:47:54.412856 #7]  INFO : HealthMonitor exiting!
I, [2024-05-22T11:48:33.748637 #7]  INFO : HealthMonitor starting...
I, [2024-05-22T11:48:33.769915 #7]  INFO : Connected to NATS at 'nats://10.130.5.60:4222'
I, [2024-05-22T11:48:33.770408 #7]  INFO : Logging delivery agent is running...
I, [2024-05-22T11:48:33.770691 #7]  INFO : Event logger is running...
I, [2024-05-22T11:48:33.771060 #7]  INFO : Resurrector is running...
I, [2024-05-22T11:48:34.494305 #7]  INFO : Found deployment 'pg'
I, [2024-05-22T11:48:34.524124 #7]  INFO : Adding agent 87e8d727-9b53-47cd-be4a-e8c304f26611 (pg/3273aa72-4dfe-43a1-8904-9ce897d55cef) to pg...
I, [2024-05-22T11:48:34.524233 #7]  INFO : Adding agent 6fbc8361-b201-4715-a698-60b43b03d211 (pg/134feec0-d756-4bf1-9b65-1a1cc3c90720) to pg...
I, [2024-05-22T11:48:34.524268 #7]  INFO : Adding agent bc90e8f1-280a-4da2-87c5-760794d7d983 (pg/0c52aca9-d951-4702-ad3a-a3f267ce05f5) to pg...
I, [2024-05-22T11:48:34.544785 #7]  INFO : Found deployment 'consul'
I, [2024-05-22T11:48:34.578424 #7]  INFO : Adding agent 77f533b3-e87b-47e8-9c8c-6cb3392fe0ae (consul/646bc279-d852-4e3b-b0e4-8f90e79ba5a0) to consul...
I, [2024-05-22T11:48:34.578514 #7]  INFO : Adding agent abbb535b-b1f5-429f-bbdd-43ddcbda6ccd (consul/d9a14091-f2cb-4da3-87e4-a7518550d415) to consul...

after restart of nats

I, [2024-05-22T11:54:22.451287 #7]  INFO : HealthMonitor exiting!
I, [2024-05-22T11:54:33.132542 #7]  INFO : HealthMonitor starting...
I, [2024-05-22T11:54:33.154141 #7]  INFO : Connected to NATS at 'nats://10.130.5.60:4222'
I, [2024-05-22T11:54:33.154695 #7]  INFO : Logging delivery agent is running...
I, [2024-05-22T11:54:33.155026 #7]  INFO : Event logger is running...
I, [2024-05-22T11:54:33.155387 #7]  INFO : Resurrector is running...
W, [2024-05-22T11:54:33.161064 #7]  WARN : Received heartbeat from unmanaged agent: 20854688-4677-4225-b9c7-df49b0e4e558
W, [2024-05-22T11:54:33.165453 #7]  WARN : Received heartbeat from unmanaged agent: a4c34ef8-63fc-4b18-a236-070d407ec7ce
W, [2024-05-22T11:54:33.204422 #7]  WARN : Received heartbeat from unmanaged agent: f6c9e71c-0aa7-47a8-bb45-7e2e88e26785
W, [2024-05-22T11:54:33.228886 #7]  WARN : Received heartbeat from unmanaged agent: dad550ff-575e-47c0-8bce-1c4f3ab054d9
W, [2024-05-22T11:54:33.243193 #7]  WARN : Received heartbeat from unmanaged agent: 528dff5d-add4-459e-ad75-2416811cdf63
W, [2024-05-22T11:54:33.271042 #7]  WARN : Received heartbeat from unmanaged agent: 8c55eb47-d2f9-4767-851c-5b2cabdc0bb8
W, [2024-05-22T11:54:33.276194 #7]  WARN : Received heartbeat from unmanaged agent: e15a4557-8085-4afa-9cad-698ea12d1088
W, [2024-05-22T11:54:33.300816 #7]  WARN : Received heartbeat from unmanaged agent: c5af3e3d-57fc-4283-8c68-623742248b82
W, [2024-05-22T11:54:33.328242 #7]  WARN : Received heartbeat from unmanaged agent: d870456f-f1b9-4bed-b35d-7bbdf37a1ed5
W, [2024-05-22T11:54:33.355038 #7]  WARN : Received heartbeat from unmanaged agent: 0dbd71ac-3cb1-40f8-a7b6-e46c6bee7e9b
W, [2024-05-22T11:54:33.355502 #7]  WARN : Received heartbeat from unmanaged agent: 0ace22ed-08d6-4633-bdb5-25ffe5639c3d
W, [2024-05-22T11:54:33.368710 #7]  WARN : Received heartbeat from unmanaged agent: b33e61cc-f330-41f1-9f52-a744daa711e8
W, [2024-05-22T11:54:33.371640 #7]  WARN : Received heartbeat from unmanaged agent: f42d37b9-2a91-4fe2-ae0e-2335f4f6df7e
W, [2024-05-22T11:54:33.381485 #7]  WARN : Received heartbeat from unmanaged agent: 86babe86-0325-42ec-859d-f558e20210a2
W, [2024-05-22T11:54:33.397191 #7]  WARN : Received heartbeat from unmanaged agent: 94425005-522e-4b45-a417-8be51a5d8540
W, [2024-05-22T11:54:33.406330 #7]  WARN : Received heartbeat from unmanaged agent: cb0d4086-8134-496e-ac43-d7d73549dd51
W, [2024-05-22T11:54:33.421594 #7]  WARN : Received heartbeat from unmanaged agent: 63ea185d-6964-4870-b4bb-9215ce37cafe
I, [2024-05-22T11:54:33.744708 #7]  INFO : Found deployment 'pg'
W, [2024-05-22T11:54:33.745173 #7]  WARN : Received heartbeat from unmanaged agent: a1a21768-2321-4528-96f4-dab59038f47b
W, [2024-05-22T11:54:33.750820 #7]  WARN : Received heartbeat from unmanaged agent: 72af4e4f-78dc-46b7-bc05-3c5c0035808f
I, [2024-05-22T11:54:33.759072 #7]  INFO : Adding agent 87e8d727-9b53-47cd-be4a-e8c304f26611 (pg/3273aa72-4dfe-43a1-8904-9ce897d55cef) to pg...
I, [2024-05-22T11:54:33.759142 #7]  INFO : Adding agent 6fbc8361-b201-4715-a698-60b43b03d211 (pg/134feec0-d756-4bf1-9b65-1a1cc3c90720) to pg...
I, [2024-05-22T11:54:33.759164 #7]  INFO : Adding agent bc90e8f1-280a-4da2-87c5-760794d7d983 (pg/0c52aca9-d951-4702-ad3a-a3f267ce05f5) to pg...
I, [2024-05-22T11:54:33.779356 #7]  INFO : Found deployment 'consul'
W, [2024-05-22T11:54:33.790751 #7]  WARN : Received heartbeat from unmanaged agent: 1e541c40-abc0-4948-bbdb-36c743d107f3
I, [2024-05-22T11:54:33.793873 #7]  INFO : Adding agent 77f533b3-e87b-47e8-9c8c-6cb3392fe0ae (consul/646bc279-d852-4e3b-b0e4-8f90e79ba5a0) to consul...
I, [2024-05-22T11:54:33.793957 #7]  INFO : Adding agent abbb535b-b1f5-429f-bbdd-43ddcbda6ccd (consul/d9a14091-f2cb-4da3-87e4-a7518550d415) to consul...
I, [2024-05-22T11:54:33.793993 #7]  INFO : Adding agent b23214ce-3a41-4ebb-b96f-d7962b311644 (dnsmasq/6626cef3-2748-41ce-a787-1f6c3ae17553) to consul...

Versions (please complete the following information):

  • Infrastructure: vSphere
  • BOSH version 280.0.23
  • BOSH CLI version version 7.5.7
  • Stemcell version ubuntu-jammy 1.423
  • ... other versions of releases being used (BOSH DNS, Credhub, UAA, BPM, etc)

Deployment info:
If possible, share your (redacted) manifest and any ops files used to deploy
BOSH or any other releases on top of BOSH.

If you used any deployment strategy it'd be helpful to point it out and share as
much about it as possible (e.g. bosh-deployment, PCF, genesis, spiff, etc)

Additional context
https://cloudfoundry.slack.com/archives/C02HPPYQ2/p1712235399314399

@scult
Copy link

scult commented May 27, 2024

Set log level for health_monitor to debug, but that barely did anything and also didn't show any errors.

I, [2024-05-27T11:01:13.255097 #7]  INFO : HealthMonitor starting...
I, [2024-05-27T11:01:13.268576 #7]  INFO : Connected to NATS at 'nats://172.16.2.20:4222'
I, [2024-05-27T11:01:13.269138 #7]  INFO : Logging delivery agent is running...
I, [2024-05-27T11:01:13.269473 #7]  INFO : Event logger is running...
I, [2024-05-27T11:01:13.269972 #7]  INFO : Resurrector is running...
I, [2024-05-27T11:01:20.954334 #7]  INFO : HealthMonitor exiting!
D, [2024-05-27T11:01:20.954452 #7] DEBUG : Pruned 0 old events
I, [2024-05-27T11:01:22.805199 #7]  INFO : HealthMonitor starting...
I, [2024-05-27T11:01:22.819265 #7]  INFO : Connected to NATS at 'nats://172.16.2.20:4222'
I, [2024-05-27T11:01:22.819825 #7]  INFO : Logging delivery agent is running...
I, [2024-05-27T11:01:22.820175 #7]  INFO : Event logger is running...
I, [2024-05-27T11:01:22.820635 #7]  INFO : Resurrector is running...
I, [2024-05-27T11:01:35.107824 #7]  INFO : HealthMonitor exiting!
D, [2024-05-27T11:01:35.107925 #7] DEBUG : Pruned 0 old events
I, [2024-05-27T11:01:46.886601 #7]  INFO : HealthMonitor starting...
I, [2024-05-27T11:01:46.900608 #7]  INFO : Connected to NATS at 'nats://172.16.2.20:4222'
I, [2024-05-27T11:01:46.901222 #7]  INFO : Logging delivery agent is running...
I, [2024-05-27T11:01:46.901588 #7]  INFO : Event logger is running...
I, [2024-05-27T11:01:46.902176 #7]  INFO : Resurrector is running...
I, [2024-05-27T11:01:58.260747 #7]  INFO : HealthMonitor exiting!
D, [2024-05-27T11:01:58.260855 #7] DEBUG : Pruned 0 old events
I, [2024-05-27T11:02:09.994893 #7]  INFO : HealthMonitor starting...
I, [2024-05-27T11:02:10.009937 #7]  INFO : Connected to NATS at 'nats://172.16.2.20:4222'
I, [2024-05-27T11:02:10.010796 #7]  INFO : Logging delivery agent is running...
I, [2024-05-27T11:02:10.011028 #7]  INFO : Event logger is running...
D, [2024-05-27T11:02:10.011111 #7] DEBUG : Pruned 0 old events
D, [2024-05-27T11:02:10.011287 #7] DEBUG : Total 0 events
I, [2024-05-27T11:02:10.011715 #7]  INFO : Resurrector is running...
I, [2024-05-27T11:02:21.423847 #7]  INFO : HealthMonitor exiting!
I, [2024-05-27T11:02:33.212746 #7]  INFO : HealthMonitor starting...
I, [2024-05-27T11:02:33.227436 #7]  INFO : Connected to NATS at 'nats://172.16.2.20:4222'
I, [2024-05-27T11:02:33.228089 #7]  INFO : Logging delivery agent is running...
I, [2024-05-27T11:02:33.228511 #7]  INFO : Event logger is running...
I, [2024-05-27T11:02:33.229181 #7]  INFO : Resurrector is running...
I, [2024-05-27T11:02:44.568336 #7]  INFO : HealthMonitor exiting!
D, [2024-05-27T11:02:44.568438 #7] DEBUG : Pruned 0 old events
I, [2024-05-27T11:02:56.344392 #7]  INFO : HealthMonitor starting...
I, [2024-05-27T11:02:56.358388 #7]  INFO : Connected to NATS at 'nats://172.16.2.20:4222'
I, [2024-05-27T11:02:56.359174 #7]  INFO : Logging delivery agent is running...
I, [2024-05-27T11:02:56.359510 #7]  INFO : Event logger is running...
I, [2024-05-27T11:02:56.359951 #7]  INFO : Resurrector is running...
I, [2024-05-27T11:03:07.734247 #7]  INFO : HealthMonitor exiting!
I, [2024-05-27T11:03:19.518843 #7]  INFO : HealthMonitor starting...
I, [2024-05-27T11:03:19.535007 #7]  INFO : Connected to NATS at 'nats://172.16.2.20:4222'
I, [2024-05-27T11:03:19.535549 #7]  INFO : Logging delivery agent is running...
I, [2024-05-27T11:03:19.535902 #7]  INFO : Event logger is running...
I, [2024-05-27T11:03:19.536354 #7]  INFO : Resurrector is running...
I, [2024-05-27T11:03:30.903580 #7]  INFO : HealthMonitor exiting!
I, [2024-05-27T11:03:42.831800 #7]  INFO : HealthMonitor starting...
I, [2024-05-27T11:03:42.847691 #7]  INFO : Connected to NATS at 'nats://172.16.2.20:4222'
I, [2024-05-27T11:03:42.848254 #7]  INFO : Logging delivery agent is running...
I, [2024-05-27T11:03:42.848938 #7]  INFO : Event logger is running...
I, [2024-05-27T11:03:42.849384 #7]  INFO : Resurrector is running...

@benjaminguttmann-avtq
Copy link
Author

After scaling the BOSH director to a bigger vm_type the issue got resolved. It seems like the director was just 'too slow' to come up in time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

2 participants