
Unreachable NATS leads to unreasonable amount of HTTP metadata service requests #118

Closed
voelzmo opened this issue Feb 14, 2017 · 8 comments

Comments

@voelzmo
Contributor

voelzmo commented Feb 14, 2017

When the BOSH NATS server isn't available for some reason (Director update, network problems, etc.), the agent exits and restarts every few seconds, because heartbeats cannot be sent to the Health Monitor (HM).
Here is an example from the agent logs that we see every few seconds:

```
2017-02-14_13:15:41.94560 [main] 2017/02/14 13:15:41 DEBUG - Starting agent
2017-02-14_13:15:41.94565 [File System] 2017/02/14 13:15:41 DEBUG - Reading file /var/vcap/bosh/agent.json
2017-02-14_13:15:41.94566 [File System] 2017/02/14 13:15:41 DEBUG - Read content
2017-02-14_13:15:41.94566 ********************
--
2017-02-14_13:15:55.06415 [Cmd Runner] 2017/02/14 13:15:55 DEBUG - Stdout:
2017-02-14_13:15:55.06415 [Cmd Runner] 2017/02/14 13:15:55 DEBUG - Stderr: SIOCDARP(dontpub): Network is unreachable
2017-02-14_13:15:55.06416 [Cmd Runner] 2017/02/14 13:15:55 DEBUG - Successful: false (255)
2017-02-14_13:15:55.06416 [NATS Handler] 2017/02/14 13:15:55 ERROR - Cleaning ip-mac address cache for: 192.168.1.11
2017-02-14_13:15:55.06617 [main] 2017/02/14 13:15:55 ERROR - App run Running agent: Message Bus Handler: Starting nats handler: Connecting: dial tcp 192.168.1.11:4222: getsockopt: connection refused
2017-02-14_13:15:55.06618 [main] 2017/02/14 13:15:55 ERROR - Agent exited with error: Running agent: Message Bus Handler: Starting nats handler: Connecting: dial tcp 192.168.1.11:4222: getsockopt: connection refused
```

Each agent startup makes 4 (or 5?) calls to the metadata service, which adds up to a pretty large number of requests for huge CF installations using the HTTP metadata service: multiply that by the number of VMs and by the Director downtime during a bosh-init update.

I'm open to suggestions on how to approach this issue. Possible workarounds are:

  • Using config-drive instead of the HTTP metadata service (disk reads are probably less of a problem than HTTP access). However, there might be other reasons why people chose the HTTP metadata service over config-drive.
  • Installing NATS separately from the Director to reduce downtime. Possibly clustered? I'm not sure what the state is here.
  • Some kind of exponential backoff in the agent instead of just exiting when the NATS handler cannot connect. This would remove the frequent metadata access caused by NATS being unavailable.
  • Something else?

I'd prefer the exponential backoff solution. What do you think?
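For illustration, here's a rough Go sketch of the kind of retry loop I have in mind; the function name, attempt limit, and delays are made up for this example, not taken from the actual agent code:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// connectWithBackoff retries a connect function with exponential backoff and
// jitter instead of letting the agent exit after the first failure.
// The attempt limit and delays are placeholders, not values from the real agent.
func connectWithBackoff(connect func() error, maxAttempts int) error {
	delay := time.Second
	const maxDelay = 2 * time.Minute

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := connect()
		if err == nil {
			return nil
		}

		// Jitter keeps a fleet of agents from retrying in lockstep after a Director update.
		jitter := time.Duration(rand.Int63n(int64(delay)))
		fmt.Printf("attempt %d failed (%v), retrying in %v\n", attempt, err, delay+jitter)
		time.Sleep(delay + jitter)

		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("giving up after %d attempts", maxAttempts)
}

func main() {
	// Simulate a NATS endpoint that only becomes reachable on the fourth attempt.
	attempts := 0
	err := connectWithBackoff(func() error {
		attempts++
		if attempts < 4 {
			return errors.New("dial tcp 192.168.1.11:4222: connection refused")
		}
		return nil
	}, 10)
	fmt.Println("result:", err)
}
```

The point is that the agent would keep its process alive and back off, instead of re-running its bootstrap (including the metadata service calls) every few seconds while NATS is down.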

@voelzmo
Contributor Author

voelzmo commented Feb 26, 2017

@cppforlife care to elaborate on the "planned-enhancement" part? Which part are you planning for? :)

@aashah
Contributor

aashah commented Mar 19, 2019

Hello,

It looks like we have not responded to this in a reasonable time, and unfortunately over a year has passed. Given this, we are going to close the issue, but please feel free to re-open it. Doing so will generate a new Pivotal Tracker story, and we can revisit this.

Thanks again for submitting this, and apologies for closing without a proper resolution at this moment.

BOSH Systems Team - @aashah @luan @jaresty

@aashah aashah closed this as completed Mar 19, 2019
@voelzmo
Contributor Author

voelzmo commented Mar 25, 2019

Hey @aashah,

Thanks for replying! While the original issue still exists, we have found a workaround in our environments to keep the Director downtime during updates as small as possible, thereby mitigating this.

I'll leave it up to you whether you want to keep this issue open, either to document this behavior in case someone analyzes their network traffic or to look at one of the possible solutions suggested above – or whether it should remain closed.

@aashah
Contributor

aashah commented Mar 27, 2019

Seems reasonable to keep this open given the original issue remains.

Side-question: What was your workaround?

@aashah aashah reopened this Mar 27, 2019
@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/164934963

The labels on this github issue will be updated when the story is started.

@voelzmo
Contributor Author

voelzmo commented Apr 9, 2019

We create-env an 'outer director' which is used only to install an 'inner director'. As with other regular BOSH deployments, the inner director is then only down during stemcell updates, not on every configuration and release change as with create-env. That was good enough for us at the time.

@edwardstudy
Contributor

Hi, any update on this improvement? I just saw that Tracker removed this story.

h4xnoodle pushed a commit that referenced this issue Nov 6, 2019
…lable

This change introduces exponential backoff and jitter to that initial
connection logic to NATS. It also increases the raw timeout and retries
when connecting.

This change also introduces an extended and randomized timeout when
publishing messages to the nats client. This prevents all of the agents
from exiting at the same time when the director is being deployed.

[#164934963](https://www.pivotaltracker.com/story/show/164934963)

Fixes #118

Co-authored-by: Charles Hansen <chansen@pivotal.io>
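As an illustrative aside, here is a minimal Go sketch of the publish-timeout part of that change; the function name and durations are assumptions for this example, not the actual bosh-agent implementation:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomizedTimeout stretches a base publish timeout by a random extra interval
// so that many agents talking to NATS don't all give up and exit at the same
// moment while the Director is being redeployed.
// The base and spread values used below are illustrative only.
func randomizedTimeout(base, spread time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(spread)))
}

func main() {
	for i := 0; i < 3; i++ {
		timeout := randomizedTimeout(45*time.Second, 30*time.Second)
		fmt.Printf("agent %d would wait %v before treating the publish as failed\n", i, timeout)
	}
}
```

Staggering the timeouts this way means agents that lose NATS fail (and hit the metadata service again on restart) at different times rather than all at once.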
@bosh-admin-bot

This issue was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days, it will automatically be closed. To prevent this from happening, remove the Stale label or comment below.
