Add a delay to SSH connection to fix the deploy_node race condition #331

dtgay · 2014-06-28T18:47:12Z

This patch might look a little "hackish", but it has solved the terrible deploy_node race condition for me 100%. I've been using libcloud with this patch for a few days with a 100% success rate. It seems that the timeout argument for _ssh_client_connect is insufficient. In fact, it's set to 300 seconds by default, but the entire operation doesn't take nearly that long to fail, so that timeout must not be the proper thing to fix the deploy_node race condition. This fix, however, resolves the issue. 60 seconds is more than enough time to get the SSH key installed onto the node, even with the recent addition of ssh_alternate_usernames, which we suspect to be the culprit of this new race condition.

Please, let me know if there's anything I can do to improve upon this patch and get it merged in. This is really a critical bug that needs to be resolved quickly.

Kami · 2014-06-29T13:30:01Z

libcloud/compute/base.py

+        connect_tries = 6
+        connect_delay = 10
+
+        while connect_tries > 0:


It seems like exactly the same logic is already implemented in _ssh_client_connect. You just are just using number of tries instead of a timeout, but you should be able to achieve the same thing by passing wait_period=10 and timeout=10*6 to _ssh_client_connect

Also, which exceptions are you seeing?

I'm asking because _ssh_client_connect catches all the exceptions which are related to a timeout / connection refused.

The problem might be that _ssh_client_connection is not catching some exception which also indicates a timeout.

In any case, I'm not a big fan of double timeout / retry thing. The existing code is sadly already quite convoluted and this makes it even more convoluted and harder to understand.

We should try to find the root cause and solve that and modify _ssh_client_connect if needed instead of just putting "another patch on top".

You're right. I'll update this pull request if I can get to a more root problem today. I wanted to submit this for now in case anyone could make use of it for the time being. I'll play with it a bit more. Let me know if you discover anything.

Also, the exception that happens is the one that results in Paramiko printing "invalid DSA key". So this code waits a bit a few times if that pops up, which resolves the issue. I did play with the timeout argument a bit to no avail. But I can certainly do some more searching.

I see. It looks like it's an exception which we currently don't catch inside _ssh_client_connect method.

Currently we don't retry if this exception is throw which would explain the failure. A solution would be to modify _ssh_client_connect to also retry on this exception, although, I would really like to come to the root cause and figure out why this exception is being thrown in the first place.

Afaik, this exception has nothing to do with the connection and is thrown when parsing the key. This means that retrying shouldn't really fix it since key doesn't change during the life-time of deploy_node method and if this exception is throw it means the key is invalid and it doesn't make sense to retry. Unless, this is covering up a some other issue.

Edit - Is this getting thrown as an exception or just printed in the debug log?

If it's just printed in the paramiko's debug log, than it's normal and not actually an error. It simply means you are not using an DSA key (probably using RSA). If you are, in fact using DSA key, then it means that the key is corrupted and retrying won't change or fix anything.

Similar debug message is also printed when connecting with openssh-client in debug mode (ssh -vvv).

dtgay · 2014-06-29T15:39:47Z

Catching the exception in _ssh_client_connect could be a solution.
The key isn't invalid. I've been debugging this issue for a while and I'm sure this isn't it. The problem is that the failed SSH connection is caused by the key not being installed on the target node yet. It tries to SSH before the key has been installed, and by extension, Paramiko fails to connect, which results in it trying to use the key as DSA since RSA has failed, and that also fails of course. I've come to this conclusion through debugging with @ralphbean. The key parsing isn't the issue -- I'm 100% sure of this. As you guessed, this error is covering up another issue: the SSH key not being installed yet.
This is an exception that is thrown, not debug log output. An exception is thrown that libcloud covers up, which is related to an exception that paramiko covers up (ugh). So we used import traceback; traceback.print_exc() and another traceback function that escapes me at the moment to get the full exceptions to actually print out. This is how we figured out that the issue was not with the key parsing but with the key not being installed yet.

I'm gonna pop in to the code again and see if I can use the timeout arguments to fix the problem, but my issue is this: the argument is written as timeout=300, so the timeout is already set to a large value, but the code fails way faster than 300 seconds, so I had to assume that the timeout argument wasn't helping the right things. I'll try it now, though.

dtgay · 2014-06-29T15:52:13Z

Yeah, that timeout=60 and wait_period=10 definitely does not solve the problem. I'm all for not further convoluting the codebase with the solution I submitted, but I'm not sure where to poke at next to find a better solution.

Kami · 2014-06-29T15:57:56Z

Thanks for additional clarification.

And yeah, if I understand it correctly from your previous comments, the problem is that _ssh_client_connect doesn't retry on invalid key exception and not the actual timeout values.

As noted above, the solution would be to modify _ssh_client_connect to also retry on this exception. I'd modify the code to also explicitly retry on this exception instead of doing "catch all". Catch all might work, but it's nasty and might mask other exceptions which should, in fact, get propagated.

dtgay · 2014-06-29T20:35:00Z

@Kami seems like this solution works, and is probably better. ^

Kami · 2014-06-30T11:06:45Z

Thanks, I've merged the patch into trunk with a minor change - I fixed it to also work if paramiko is not available / installed.

The reason for that is that paramiko is an optional dependency and everything else except deploy_node should still work if paramiko is not installed.

dtgay · 2014-06-30T17:28:32Z

Awesome, good catch!

Add a delay to SSH connection to fix the deploy_node race condition

05c8462

Kami reviewed Jun 29, 2014
View reviewed changes

Probably a better way to handle the deploy_node race condition

8929c66

asfgit closed this in 907cdce Jun 30, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a delay to SSH connection to fix the deploy_node race condition #331

Add a delay to SSH connection to fix the deploy_node race condition #331

dtgay commented Jun 28, 2014

Kami Jun 29, 2014

Kami Jun 29, 2014

Kami Jun 29, 2014

dtgay Jun 29, 2014

Kami Jun 29, 2014

Kami Jun 29, 2014

dtgay commented Jun 29, 2014

dtgay commented Jun 29, 2014

Kami commented Jun 29, 2014

dtgay commented Jun 29, 2014

Kami commented Jun 30, 2014

dtgay commented Jun 30, 2014

Add a delay to SSH connection to fix the deploy_node race condition #331

Add a delay to SSH connection to fix the deploy_node race condition #331

Conversation

dtgay commented Jun 28, 2014

Kami Jun 29, 2014

Choose a reason for hiding this comment

Kami Jun 29, 2014

Choose a reason for hiding this comment

Kami Jun 29, 2014

Choose a reason for hiding this comment

dtgay Jun 29, 2014

Choose a reason for hiding this comment

Kami Jun 29, 2014

Choose a reason for hiding this comment

Kami Jun 29, 2014

Choose a reason for hiding this comment

dtgay commented Jun 29, 2014

dtgay commented Jun 29, 2014

Kami commented Jun 29, 2014

dtgay commented Jun 29, 2014

Kami commented Jun 30, 2014

dtgay commented Jun 30, 2014