
Unreachable NATS leads to unreasonable amount of HTTP metadata service requests #118

Closed
voelzmo opened this issue Feb 14, 2017 · 8 comments

Comments

@voelzmo
Contributor

voelzmo commented Feb 14, 2017

When the BOSH NATS server isn't available for some reason (Director update, network problems, etc.), the agent exits and restarts every few seconds, because heartbeats cannot be sent to the Health Monitor (HM).
Here is an example from the agent logs that we see every few seconds:

```
2017-02-14_13:15:41.94560 [main] 2017/02/14 13:15:41 DEBUG - Starting agent
2017-02-14_13:15:41.94565 [File System] 2017/02/14 13:15:41 DEBUG - Reading file /var/vcap/bosh/agent.json
2017-02-14_13:15:41.94566 [File System] 2017/02/14 13:15:41 DEBUG - Read content
2017-02-14_13:15:41.94566 ********************
--
2017-02-14_13:15:55.06415 [Cmd Runner] 2017/02/14 13:15:55 DEBUG - Stdout:
2017-02-14_13:15:55.06415 [Cmd Runner] 2017/02/14 13:15:55 DEBUG - Stderr: SIOCDARP(dontpub): Network is unreachable
2017-02-14_13:15:55.06416 [Cmd Runner] 2017/02/14 13:15:55 DEBUG - Successful: false (255)
2017-02-14_13:15:55.06416 [NATS Handler] 2017/02/14 13:15:55 ERROR - Cleaning ip-mac address cache for: 192.168.1.11
2017-02-14_13:15:55.06617 [main] 2017/02/14 13:15:55 ERROR - App run Running agent: Message Bus Handler: Starting nats handler: Connecting: dial tcp 192.168.1.11:4222: getsockopt: connection refused
2017-02-14_13:15:55.06618 [main] 2017/02/14 13:15:55 ERROR - Agent exited with error: Running agent: Message Bus Handler: Starting nats handler: Connecting: dial tcp 192.168.1.11:4222: getsockopt: connection refused
```

Each agent startup makes 4 (or 5?) calls to the metadata service, which adds up to a pretty large number of requests for huge CF installations using the HTTP metadata service: multiply that by the number of VMs and by the Director downtime during a bosh-init update.

I'm open to suggestions on how to approach this issue. Possible workarounds are:

  • Using config-drive instead of the HTTP metadata service (disk reads are probably less of a problem than HTTP access). However, there might be other reasons why people chose the HTTP metadata service over config-drive.
  • Installing NATS separately from the Director to reduce downtime. Possibly clustered? I'm not sure what the state is here.
  • Some kind of exponential backoff in the agent instead of just exiting when the NATS handler cannot connect. This would remove the frequent metadata access caused by NATS being unavailable.
  • Something else?

I'd prefer the exponential backoff solution. What do you think?
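For illustration, here's a rough Go sketch of the kind of retry loop I have in mind; the function name, attempt limit, and delays are made up for this example, not taken from the actual agent code:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// connectWithBackoff retries a connect function with exponential backoff and
// jitter instead of letting the agent exit after the first failure.
// The attempt limit and delays are placeholders, not values from the real agent.
func connectWithBackoff(connect func() error, maxAttempts int) error {
	delay := time.Second
	const maxDelay = 2 * time.Minute

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := connect()
		if err == nil {
			return nil
		}

		// Jitter keeps a fleet of agents from retrying in lockstep after a Director update.
		jitter := time.Duration(rand.Int63n(int64(delay)))
		fmt.Printf("attempt %d failed (%v), retrying in %v\n", attempt, err, delay+jitter)
		time.Sleep(delay + jitter)

		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("giving up after %d attempts", maxAttempts)
}

func main() {
	// Simulate a NATS endpoint that only becomes reachable on the fourth attempt.
	attempts := 0
	err := connectWithBackoff(func() error {
		attempts++
		if attempts < 4 {
			return errors.New("dial tcp 192.168.1.11:4222: connection refused")
		}
		return nil
	}, 10)
	fmt.Println("result:", err)
}
```

The point is that the agent would keep its process alive and back off, instead of re-running its bootstrap (including the metadata service calls) every few seconds while NATS is down.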

@voelzmo
Contributor Author

voelzmo commented Feb 26, 2017

@cppforlife care to elaborate on the "planned-enhancement" part? Which part are you planning for? :)

@aashah
Contributor

aashah commented Mar 19, 2019

Hello,

It looks like we have not responded to this in a reasonable time, and unfortunately over a year has passed. Given this, we are going to close the issue, but please feel free to re-open it. Doing so will generate a new Pivotal Tracker story, and we can revisit this.

Thanks again for submitting this, and apologies for closing without a proper resolution at this moment.

BOSH Systems Team - @aashah @luan @jaresty

@aashah aashah closed this as completed Mar 19, 2019
@voelzmo
Contributor Author

voelzmo commented Mar 25, 2019

Hey @aashah,

Thanks for replying! While the original issue still exists, we have found a workaround in our environments to keep the Director downtime during updates as small as possible, thereby mitigating this.

I'll leave it up to you whether you want to keep this issue open, either to document this behavior in case someone analyzes their network traffic or to look at one of the possible solutions suggested above – or whether it should remain closed.

@aashah
Contributor

aashah commented Mar 27, 2019

Seems reasonable to keep this open given the original issue remains.

Side-question: What was your workaround?

@aashah aashah reopened this Mar 27, 2019
@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/164934963

The labels on this github issue will be updated when the story is started.

@voelzmo
Contributor Author

voelzmo commented Apr 9, 2019

We create-env an 'outer director' which is used only to install an 'inner director'. As with other regular BOSH deployments, the inner director is then only down during stemcell updates, not on every configuration and release change as with create-env. That was good enough for us at the time.

@edwardstudy
Contributor

Hi, any update on this improvement? I just saw that Tracker removed this story.

h4xnoodle pushed a commit that referenced this issue Nov 6, 2019
…lable

This change introduces exponential backoff and jitter to that initial
connection logic to NATS. It also increases the raw timeout and retries
when connecting.

This change also introduces an extended and randomized timeout when
publishing messages to the nats client. This prevents all of the agents
from exiting at the same time when the director is being deployed.

[#164934963](https://www.pivotaltracker.com/story/show/164934963)

Fixes #118

Co-authored-by: Charles Hansen <chansen@pivotal.io>
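As an illustrative aside, here is a minimal Go sketch of the publish-timeout part of that change; the function name and durations are assumptions for this example, not the actual bosh-agent implementation:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomizedTimeout stretches a base publish timeout by a random extra interval
// so that many agents talking to NATS don't all give up and exit at the same
// moment while the Director is being redeployed.
// The base and spread values used below are illustrative only.
func randomizedTimeout(base, spread time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(spread)))
}

func main() {
	for i := 0; i < 3; i++ {
		timeout := randomizedTimeout(45*time.Second, 30*time.Second)
		fmt.Printf("agent %d would wait %v before treating the publish as failed\n", i, timeout)
	}
}
```

Staggering the timeouts this way means agents that lose NATS fail (and hit the metadata service again on restart) at different times rather than all at once.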
@bosh-admin-bot

This issue was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days, it will automatically be closed. To prevent this from happening, remove the Stale label or comment below.
