Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a delay to SSH connection to fix the deploy_node race condition #331

Closed
wants to merge 2 commits into from
Closed

Add a delay to SSH connection to fix the deploy_node race condition #331

wants to merge 2 commits into from

Conversation

dtgay
Copy link
Contributor

@dtgay dtgay commented Jun 28, 2014

This patch might look a little "hackish", but it has solved the terrible deploy_node race condition for me 100%. I've been using libcloud with this patch for a few days with a 100% success rate. It seems that the timeout argument for _ssh_client_connect is insufficient. In fact, it's set to 300 seconds by default, but the entire operation doesn't take nearly that long to fail, so that timeout must not be the proper thing to fix the deploy_node race condition. This fix, however, resolves the issue. 60 seconds is more than enough time to get the SSH key installed onto the node, even with the recent addition of ssh_alternate_usernames, which we suspect to be the culprit of this new race condition.

Please, let me know if there's anything I can do to improve upon this patch and get it merged in. This is really a critical bug that needs to be resolved quickly.

connect_tries = 6
connect_delay = 10

while connect_tries > 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like exactly the same logic is already implemented in _ssh_client_connect. You just are just using number of tries instead of a timeout, but you should be able to achieve the same thing by passing wait_period=10 and timeout=10*6 to _ssh_client_connect

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, which exceptions are you seeing?

I'm asking because _ssh_client_connect catches all the exceptions which are related to a timeout / connection refused.

The problem might be that _ssh_client_connection is not catching some exception which also indicates a timeout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, I'm not a big fan of double timeout / retry thing. The existing code is sadly already quite convoluted and this makes it even more convoluted and harder to understand.

We should try to find the root cause and solve that and modify _ssh_client_connect if needed instead of just putting "another patch on top".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I'll update this pull request if I can get to a more root problem today. I wanted to submit this for now in case anyone could make use of it for the time being. I'll play with it a bit more. Let me know if you discover anything.

Also, the exception that happens is the one that results in Paramiko printing "invalid DSA key". So this code waits a bit a few times if that pops up, which resolves the issue. I did play with the timeout argument a bit to no avail. But I can certainly do some more searching.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It looks like it's an exception which we currently don't catch inside _ssh_client_connect method.

Currently we don't retry if this exception is throw which would explain the failure. A solution would be to modify _ssh_client_connect to also retry on this exception, although, I would really like to come to the root cause and figure out why this exception is being thrown in the first place.

Afaik, this exception has nothing to do with the connection and is thrown when parsing the key. This means that retrying shouldn't really fix it since key doesn't change during the life-time of deploy_node method and if this exception is throw it means the key is invalid and it doesn't make sense to retry. Unless, this is covering up a some other issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit - Is this getting thrown as an exception or just printed in the debug log?

If it's just printed in the paramiko's debug log, than it's normal and not actually an error. It simply means you are not using an DSA key (probably using RSA). If you are, in fact using DSA key, then it means that the key is corrupted and retrying won't change or fix anything.

Similar debug message is also printed when connecting with openssh-client in debug mode (ssh -vvv).

@dtgay
Copy link
Contributor Author

dtgay commented Jun 29, 2014

  1. Catching the exception in _ssh_client_connect could be a solution.
  2. The key isn't invalid. I've been debugging this issue for a while and I'm sure this isn't it. The problem is that the failed SSH connection is caused by the key not being installed on the target node yet. It tries to SSH before the key has been installed, and by extension, Paramiko fails to connect, which results in it trying to use the key as DSA since RSA has failed, and that also fails of course. I've come to this conclusion through debugging with @ralphbean. The key parsing isn't the issue -- I'm 100% sure of this. As you guessed, this error is covering up another issue: the SSH key not being installed yet.
  3. This is an exception that is thrown, not debug log output. An exception is thrown that libcloud covers up, which is related to an exception that paramiko covers up (ugh). So we used import traceback; traceback.print_exc() and another traceback function that escapes me at the moment to get the full exceptions to actually print out. This is how we figured out that the issue was not with the key parsing but with the key not being installed yet.

I'm gonna pop in to the code again and see if I can use the timeout arguments to fix the problem, but my issue is this: the argument is written as timeout=300, so the timeout is already set to a large value, but the code fails way faster than 300 seconds, so I had to assume that the timeout argument wasn't helping the right things. I'll try it now, though.

@dtgay
Copy link
Contributor Author

dtgay commented Jun 29, 2014

Yeah, that timeout=60 and wait_period=10 definitely does not solve the problem. I'm all for not further convoluting the codebase with the solution I submitted, but I'm not sure where to poke at next to find a better solution.

@Kami
Copy link
Member

Kami commented Jun 29, 2014

Thanks for additional clarification.

And yeah, if I understand it correctly from your previous comments, the problem is that _ssh_client_connect doesn't retry on invalid key exception and not the actual timeout values.

As noted above, the solution would be to modify _ssh_client_connect to also retry on this exception. I'd modify the code to also explicitly retry on this exception instead of doing "catch all". Catch all might work, but it's nasty and might mask other exceptions which should, in fact, get propagated.

@dtgay
Copy link
Contributor Author

dtgay commented Jun 29, 2014

@Kami seems like this solution works, and is probably better. ^

@Kami
Copy link
Member

Kami commented Jun 30, 2014

Thanks, I've merged the patch into trunk with a minor change - I fixed it to also work if paramiko is not available / installed.

The reason for that is that paramiko is an optional dependency and everything else except deploy_node should still work if paramiko is not installed.

@asfgit asfgit closed this in 907cdce Jun 30, 2014
@dtgay
Copy link
Contributor Author

dtgay commented Jun 30, 2014

Awesome, good catch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants