
Add exponential increase on timeout in #sshable? for services that do not respond in 8s #214

Merged

geemus merged 7 commits into fog:master from keylimetoolbox:exponential_timeout on Jul 31, 2017

Conversation

jeremywadsack
Contributor

jeremywadsack commented Jul 19, 2017

See issue fog/fog-aws#372

For AWS Spot Requests, instances never complete the #setup process, eventually timing out through the #wait_for block, because the default 8s is apparently not long enough to establish the ssh connection. In testing, 11s seemed to work, but this is likely highly variable across regions, instance types, images, and providers.

This change implements a 1.5× increase in the timeout each time #sshable? is called, capped at 60s. If a successful connection is made, the timeout is reset to the initial value.
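Roughly, the approach is the following (a simplified sketch rather than the exact patch; it leans on the server model's existing #ready? and #ssh helpers, and rescues broadly for illustration):

require "timeout"

# Sketch: grow the timeout 1.5x after each failed attempt, cap it at 60s,
# and reset it to the initial 8s once a call succeeds.
def sshable?(options = {})
  result = ready? && !!Timeout.timeout(sshable_timeout) { ssh("pwd", options) }
  @sshable_timeout = nil if result # success: next call starts back at 8s
  result
rescue Timeout::Error, StandardError
  @sshable_timeout = [sshable_timeout * 1.5, 60].min
  false
end

def sshable_timeout
  @sshable_timeout ||= 8
end

Starting from 8s, successive failures would wait 12s, 18s, 27s, 40.5s, and then stay at the 60s cap.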

I tried to keep this as simple as possible. It solves the problem for AWS Spot Requests and does not adversely affect regular AWS instances. I have not tested this with other providers, but from a quick check it didn't seem like it would be a problem.

The only concern I have is that there's no way to reset the @sshable_timeout value, but I can't think of a case where #sshable? would be called repeatedly on an instance, continue to fail, and then be called on that instance again where you'd want to start back at the lower timeout.

One other question is whether it's appropriate to increase the timeout for all failures, or only when a Timeout::Error is raised.


Note that there were no specs for Fog::Compute::Server, so I implemented specs for the existing behavior of #sshable? (leaving specs for the remaining methods to others to complete coverage).

I'm happy to squash commits, but for readability, I thought it would help to see the separation of the two parts.

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 73.76% when pulling 8ea03d6 on keylimetoolbox:exponential_timeout into 71513c5 on fog:master.

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 73.76% when pulling dd90fc4 on keylimetoolbox:exponential_timeout into 71513c5 on fog:master.


describe "#sshable?" do
before do
# I'm not sure why #sshable? depends on a method that's not defined except in implementing classes
Member
Good question. Maybe we should define it in the base class also, but just have it always return false? That would clear up this confusion/oddity, hopefully. Thoughts?

Contributor Author
How about a method that raises NotImplementedError? That's what we usually do for abstract interfaces on my team. It basically maintains the same behavior as currently without hidden side effects.

OTOH, http://chrisstump.online/2016/03/23/stop-abusing-notimplementederror/. Maybe we should just document that it must be implemented?
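For concreteness, the NotImplementedError flavour of an abstract #ready? would look something like this (a generic sketch, not proposed code verbatim):

def ready?
  raise NotImplementedError, "#{self.class} must implement #ready?"
end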

@geemus
Member

geemus commented Jul 25, 2017

Looking good overall, one small inline comment above and one small comment below.

The question of which errors to retry on is an interesting one. I guess my inclination would be to start conservatively, with the fewest errors resulting in a retry. This is the smallest expansion in scope/change in behavior, which limits impact. But more importantly, it should be easier to add more things to the retry list later without breaking things for anyone than it would be to remove them later. Does that sound reasonable?

Thanks!

@jeremywadsack
Contributor Author

Agreed on retrying with limited errors. I think increasing the timeout should only happen on Timeout::Error. If authentication fails, that shouldn't affect the timeout (in fact, perhaps that should reset the timeout, because at that point you've connected).

@jeremywadsack
Contributor Author

@geemus 2c1d839 updates the logic to only increment the timeout period in the case of Timeout::Error and to reset it for either of the two errors for which a connection would have been made. I opted not to do anything for SystemCallError because that seems like it could encompass a whole slew of errors that occur before a connection is made (socket issues?).
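For illustration, the refined handling amounts to something along these lines (a sketch; apart from Timeout::Error, the rescued error classes here are assumptions, not necessarily the ones in 2c1d839):

def sshable?(options = {})
  result = ready? && !!Timeout.timeout(sshable_timeout) { ssh("pwd", options) }
  @sshable_timeout = nil if result
  result
rescue Net::SSH::AuthenticationFailed, Net::SSH::Disconnect
  @sshable_timeout = nil # a connection was made, so the current timeout is long enough
  false
rescue Timeout::Error
  @sshable_timeout = [sshable_timeout * 1.5, 60].min # no response: back off, capped at 60s
  false
end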

I also cleaned up some of the deprecation warnings from minitest. I hope you don't mind tacking that onto this PR. There are still some ruby warnings, but I didn't want to get too deep into cleanup on code I wasn't working on.

Let me know your preference on how to handle #ready?. I can return false, raise NotImplementedError, or leave it as is as suggested by the post I linked.

fog-core.gemspec Outdated
@@ -2,6 +2,7 @@
lib = File.expand_path("../lib", __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require "fog/core/version"
require 'english'
Member
I don't think english is supported by older versions of ruby which fog-core still supports. Also, double quotes.

Contributor Author
@icco Sorry about that. I was just trying to get rid of the warning that $INPUT_RECORD_SEPARATOR isn't defined. I'll back that out.

@coveralls

Coverage Status

Coverage decreased (-0.6%) to 73.868% when pulling 927a629 on keylimetoolbox:exponential_timeout into 71513c5 on fog:master.

@geemus
Member

geemus commented Jul 26, 2017

Thanks for continuing to discuss and iterate. Looks good. As for the abstract interface part, I don't think I have a good default answer ready, and there are many pros/cons, as you suggest. I guess in this particular case I would still lean toward having the base class just return false: the abstract server is in fact not ready, and if something hasn't defined that yet, also not being ready seems reasonable. As long as we have a comment to that effect in case people come back to it later and wonder. Does that seem reasonable to you? Thanks!
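In code, that would be roughly (sketch):

# An abstract server is never ready; provider subclasses override this
# with a real status check.
def ready?
  false
end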

@coveralls

Coverage Status

Coverage decreased (-0.6%) to 73.836% when pulling 2eff6cc on keylimetoolbox:exponential_timeout into 71513c5 on fog:master.

@geemus
Member

geemus commented Jul 26, 2017

@jeremywadsack seems good to me. Just to double check, is this complete and ready from your perspective?

@jeremywadsack
Contributor Author

@geemus Yes, I believe I've addressed all the issues and Travis is passing.

Do you need me to squash/rebase?

@geemus
Member

geemus commented Jul 31, 2017

@jeremywadsack thanks for the offer, but I'm not worried about it. Thanks!

@geemus geemus merged commit 57bdecf into fog:master Jul 31, 2017
@jeremywadsack
Contributor Author

Thanks @geemus. I just noticed that the documentation comment I wrote for #ready? is missing something: it should say "Returns false by default".

geemus added a commit that referenced this pull request Aug 1, 2017
@geemus
Member

geemus commented Aug 1, 2017

Good catch, I think that ^ should help. I'll see about cutting a release in the next couple days as well.

@geemus
Member

geemus commented Aug 1, 2017

Released in v1.45.0, thanks again!
