
Set NATs ping interval to 30 seconds #12

Merged
merged 1 commit into from
Jul 19, 2016

Conversation

swetharepakula
Contributor

[#126149627]

Signed-off-by: Jonathan Berkhahn <jaberkha@us.ibm.com>
@cfdreddbot

Hey swetharepakula!

Thanks for submitting this pull request! I'm here to inform the recipients of the pull request that you and the commit authors have already signed the CLA.

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/126331001

The labels on this GitHub issue will be updated when the story is started.

@crhino
Contributor

crhino commented Jul 15, 2016

@swetharepakula Can you give us a description of the issue/use case this PR is attempting to address and how to accurately test this change?

@swetharepakula
Contributor Author

Hi @crhino,

Currently the NATS clients have a default ping interval of 2 minutes and will wait for 2 or 3 missed pings (depending on which nats-io version is used) before switching NATS servers. So if a NATS server goes down, it can take 4-6 minutes to reconnect to NATS. We are trying to reduce that turnaround time to something more reasonable.

We looked at three use cases:

  1. When one NATS server goes down, how long before the NATS client switches to another server.
  2. When all NATS servers are down, how long before the NATS client is able to connect back to a server.
  3. When the connection is severed/disconnected and the NATS client is unaware of the broken connection, how long before it switches to the next server.

We outlined how to test those scenarios, along with our results for the route_registrar changes, in our story.
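To make the failover timing concrete, here is a back-of-the-envelope sketch (not from the PR; the 2-minute default interval and the 2-3 allowed ping misses are taken from the description above):

```python
def worst_case_detection(ping_interval_s, max_pings_out):
    """Time before a client notices a dead server: it must miss
    max_pings_out consecutive pings, sent ping_interval_s apart."""
    return ping_interval_s * max_pings_out

# Old default: 2-minute interval, 2-3 allowed misses -> 4-6 minutes
print([worst_case_detection(120, n) / 60 for n in (2, 3)])  # -> [4.0, 6.0]

# With this PR: 30-second interval, 2 allowed misses -> 1 minute
print(worst_case_detection(30, 2) / 60)  # -> 1.0
```

This is only the detection window; actual reconnect time adds whatever the connection attempt itself takes.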

Thanks
@swetharepakula

@shashwathi
Contributor

Hi @swetharepakula,

I have a question about the ping interval value chosen in this PR (i.e., 30 sec). Could you please explain how you came up with that number?

I have a test case in which a 30-second ping interval won't solve the problem. Let's assume your deployment has 5 NATS servers, the NATS ping interval is 30 seconds, and the default number of pings is 2. If more than 3 NATS servers are unavailable, all the routes would be pruned by the router.

Hardcoding the ping interval value might not be the best way to approach this problem. There are a couple of factors you will need to consider when calculating the ping interval value:

  • Number of NATS servers
  • Pruning interval of the router

I came up with a mathematical equation which could potentially solve this problem:
ping_interval = pruning_interval_for_router / (NATS_default_number_of_pings * number_of_nats_servers)

This ensures all the NATS servers receive the default number of pings (i.e., 2) before all the routes are pruned. I think this is a much cleaner solution, but I would love to know your thoughts.
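As a sanity check on the formula, a quick sketch with purely hypothetical numbers (the 120-second pruning interval below is illustrative, not taken from any real deployment):

```python
def proposed_ping_interval(pruning_interval_s, default_num_pings, num_nats_servers):
    # Spread the router's pruning window so every NATS server can receive
    # its full allowance of pings before the routes are pruned.
    return pruning_interval_s / (default_num_pings * num_nats_servers)

# Hypothetical: 120s pruning interval, 2 pings per server, 5 NATS servers
print(proposed_ping_interval(120, 2, 5))  # -> 12.0 seconds
```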

Regards
Shash

@swetharepakula
Contributor Author

Hi @shashwathi,

Ping intervals ensure that an established connection to a server remains valid, so with this PR the client checks the connection every 30 seconds.

After the allowed number of pings is missed, the NATS client will try to connect to a new valid NATS server. In your example above, if the client tries to connect to one of the other down servers, the attempt fails and the NATS client immediately tries the next server. So a ping interval of 30 seconds means we should reconnect to a valid NATS server within ~1 minute of a server failure.

Thanks,
Swetha

@shashwathi shashwathi closed this Jul 19, 2016
@shashwathi shashwathi reopened this Jul 19, 2016
@shashwathi shashwathi merged commit e378286 into cloudfoundry:master Jul 19, 2016