
Add exponential increase on timeout in #sshable? for services that do not respond in 8s #214

Merged

geemus merged 7 commits into fog:master from keylimetoolbox:exponential_timeout on Jul 31, 2017

Conversation

jeremywadsack
Contributor

jeremywadsack commented Jul 19, 2017

See issue fog/fog-aws#372

For AWS Spot Requests, instances never complete the #setup process, eventually timing out through the #wait_for block, because the default 8s is apparently not long enough to establish the ssh connection. In testing, 11s seemed to work, but this is likely highly variable across regions, instance types, images, and providers.

This change implements a 1.5× increase in the timeout each time #sshable? is called, capped at 60s. If a successful connection is made, the timeout is reset to the initial value.
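Roughly, the approach is the following (a simplified sketch rather than the exact patch; it leans on the server model's existing #ready? and #ssh helpers, and rescues broadly for illustration):

require "timeout"

# Sketch: grow the timeout 1.5x after each failed attempt, cap it at 60s,
# and reset it to the initial 8s once a call succeeds.
def sshable?(options = {})
  result = ready? && !!Timeout.timeout(sshable_timeout) { ssh("pwd", options) }
  @sshable_timeout = nil if result # success: next call starts back at 8s
  result
rescue Timeout::Error, StandardError
  @sshable_timeout = [sshable_timeout * 1.5, 60].min
  false
end

def sshable_timeout
  @sshable_timeout ||= 8
end

Starting from 8s, successive failures would wait 12s, 18s, 27s, 40.5s, and then stay at the 60s cap.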

I tried to keep this as simple as possible. It solves the problem for AWS Spot Requests and does not adversely affect regular AWS instances. I have not tested this with other providers, but from a quick check it didn't seem like it would be a problem.

The only concern I have is that there's no way to reset the @sshable_timeout value, but I can't think of a case where #sshable? would be called repeatedly on an instance, continue to fail, and then be called on that instance again where you'd want to start back at the lower timeout.

One other question is whether it's appropriate to increase the timeout for all failures, or only when a Timeout::Error is raised.


Note that there were no specs for Fog::Compute::Server, so I implemented specs for the existing behavior of #sshable? (leaving specs for the remaining methods to others to complete coverage).

I'm happy to squash commits, but for readability, I thought it would help to see the separation of the two parts.

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 73.76% when pulling 8ea03d6 on keylimetoolbox:exponential_timeout into 71513c5 on fog:master.

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 73.76% when pulling dd90fc4 on keylimetoolbox:exponential_timeout into 71513c5 on fog:master.


describe "#sshable?" do
before do
# I'm not sure why #sshable? depends on a method that's not defined except in implementing classes
Member
Good question. Maybe we should define it in the base class also, but just have it always return false? That would clear up this confusion/oddity, hopefully. Thoughts?

Contributor Author
How about a method that raises NotImplementedError? That's what we usually do for abstract interfaces on my team. It basically maintains the same behavior as currently without hidden side effects.

OTOH, http://chrisstump.online/2016/03/23/stop-abusing-notimplementederror/. Maybe we should just document that it must be implemented?
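For concreteness, the NotImplementedError flavour of an abstract #ready? would look something like this (a generic sketch, not proposed code verbatim):

def ready?
  raise NotImplementedError, "#{self.class} must implement #ready?"
end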

@geemus
Member

geemus commented Jul 25, 2017

Looking good overall, one small inline comment above and one small comment below.

The question of which errors to retry on is an interesting one. I guess my inclination would be to start conservatively, with the fewest errors resulting in a retry. This is the smallest expansion in scope/change in behavior, which limits impact. But more importantly, it should be easier to add more things to the retry list later without breaking things for anyone than it would be to remove them later. Does that sound reasonable?

Thanks!

@jeremywadsack
Contributor Author

Agreed on retrying with limited errors. I think increasing the timeout should only happen on Timeout::Error. If authentication fails, that shouldn't affect the timeout (in fact, perhaps that should reset the timeout, because at that point you've connected).

@jeremywadsack
Contributor Author

@geemus 2c1d839 updates the logic to only increment the timeout period in the case of Timeout::Error and to reset it for either of the two errors for which a connection would have been made. I opted not to do anything for SystemCallError because that seems like it could encompass a whole slew of errors that occur before a connection is made (socket issues?).
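For illustration, the refined handling amounts to something along these lines (a sketch; apart from Timeout::Error, the rescued error classes here are assumptions, not necessarily the ones in 2c1d839):

def sshable?(options = {})
  result = ready? && !!Timeout.timeout(sshable_timeout) { ssh("pwd", options) }
  @sshable_timeout = nil if result
  result
rescue Net::SSH::AuthenticationFailed, Net::SSH::Disconnect
  @sshable_timeout = nil # a connection was made, so the current timeout is long enough
  false
rescue Timeout::Error
  @sshable_timeout = [sshable_timeout * 1.5, 60].min # no response: back off, capped at 60s
  false
end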

I also cleaned up some of the deprecation warnings from minitest. I hope you don't mind tacking that onto this PR. There are still some ruby warnings, but I didn't want to get too deep into cleanup on code I wasn't working on.

Let me know your preference on how to handle #ready?. I can return false, raise NotImplementedError, or leave it as is as suggested by the post I linked.

fog-core.gemspec Outdated
@@ -2,6 +2,7 @@
lib = File.expand_path("../lib", __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require "fog/core/version"
require 'english'
Member
I don't think english is supported by older versions of ruby which fog-core still supports. Also, double quotes.

Contributor Author
@icco Sorry about that. I was just trying to get rid of the warning that $INPUT_RECORD_SEPARATOR isn't defined. I'll back that out.

@coveralls

Coverage Status

Coverage decreased (-0.6%) to 73.868% when pulling 927a629 on keylimetoolbox:exponential_timeout into 71513c5 on fog:master.

@geemus
Member

geemus commented Jul 26, 2017

Thanks for continuing to discuss and iterate. Looks good. As for the abstract interface part, I don't think I have a good default answer ready, and there are many pros/cons, as you suggest. I guess in this particular case I would still lean toward having the base class just return false: the abstract server is in fact not ready, and if something hasn't defined that yet, also not being ready seems reasonable. As long as we have a comment to that effect in case people come back to it later and wonder. Does that seem reasonable to you? Thanks!
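In code, that would be roughly (sketch):

# An abstract server is never ready; provider subclasses override this
# with a real status check.
def ready?
  false
end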

@coveralls

Coverage Status

Coverage decreased (-0.6%) to 73.836% when pulling 2eff6cc on keylimetoolbox:exponential_timeout into 71513c5 on fog:master.

@geemus
Member

geemus commented Jul 26, 2017

@jeremywadsack seems good to me. Just to double check, is this complete and ready from your perspective?

@jeremywadsack
Contributor Author

@geemus Yes, I believe I've addressed all the issues and Travis is passing.

Do you need me to squash/rebase?

@geemus
Member

geemus commented Jul 31, 2017

@jeremywadsack thanks for the offer, but I'm not worried about it. Thanks!

@geemus geemus merged commit 57bdecf into fog:master Jul 31, 2017
@jeremywadsack
Contributor Author

Thanks @geemus. I just noticed that the documentation comment I wrote for #ready? is missing something: it should say "Returns false by default".

geemus added a commit that referenced this pull request Aug 1, 2017
@geemus
Member

geemus commented Aug 1, 2017

Good catch, I think that ^ should help. I'll see about cutting a release in the next couple days as well.

@geemus
Member

geemus commented Aug 1, 2017

Released in v1.45.0, thanks again!
