New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a delay to SSH connection to fix the deploy_node race condition #331
Conversation
connect_tries = 6 | ||
connect_delay = 10 | ||
|
||
while connect_tries > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like exactly the same logic is already implemented in _ssh_client_connect
. You just are just using number of tries instead of a timeout, but you should be able to achieve the same thing by passing wait_period=10
and timeout=10*6
to _ssh_client_connect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, which exceptions are you seeing?
I'm asking because _ssh_client_connect
catches all the exceptions which are related to a timeout / connection refused.
The problem might be that _ssh_client_connection
is not catching some exception which also indicates a timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, I'm not a big fan of double timeout / retry thing. The existing code is sadly already quite convoluted and this makes it even more convoluted and harder to understand.
We should try to find the root cause and solve that and modify _ssh_client_connect
if needed instead of just putting "another patch on top".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I'll update this pull request if I can get to a more root problem today. I wanted to submit this for now in case anyone could make use of it for the time being. I'll play with it a bit more. Let me know if you discover anything.
Also, the exception that happens is the one that results in Paramiko printing "invalid DSA key". So this code waits a bit a few times if that pops up, which resolves the issue. I did play with the timeout
argument a bit to no avail. But I can certainly do some more searching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. It looks like it's an exception which we currently don't catch inside _ssh_client_connect
method.
Currently we don't retry if this exception is throw which would explain the failure. A solution would be to modify _ssh_client_connect
to also retry on this exception, although, I would really like to come to the root cause and figure out why this exception is being thrown in the first place.
Afaik, this exception has nothing to do with the connection and is thrown when parsing the key. This means that retrying shouldn't really fix it since key doesn't change during the life-time of deploy_node
method and if this exception is throw it means the key is invalid and it doesn't make sense to retry. Unless, this is covering up a some other issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit - Is this getting thrown as an exception or just printed in the debug log?
If it's just printed in the paramiko's debug log, than it's normal and not actually an error. It simply means you are not using an DSA key (probably using RSA). If you are, in fact using DSA key, then it means that the key is corrupted and retrying won't change or fix anything.
Similar debug message is also printed when connecting with openssh-client in debug mode (ssh -vvv
).
I'm gonna pop in to the code again and see if I can use the |
Yeah, that |
Thanks for additional clarification. And yeah, if I understand it correctly from your previous comments, the problem is that As noted above, the solution would be to modify |
@Kami seems like this solution works, and is probably better. ^ |
Thanks, I've merged the patch into trunk with a minor change - I fixed it to also work if paramiko is not available / installed. The reason for that is that paramiko is an optional dependency and everything else except |
Awesome, good catch! |
This patch might look a little "hackish", but it has solved the terrible
deploy_node
race condition for me 100%. I've been using libcloud with this patch for a few days with a 100% success rate. It seems that thetimeout
argument for_ssh_client_connect
is insufficient. In fact, it's set to 300 seconds by default, but the entire operation doesn't take nearly that long to fail, so that timeout must not be the proper thing to fix thedeploy_node
race condition. This fix, however, resolves the issue. 60 seconds is more than enough time to get the SSH key installed onto the node, even with the recent addition ofssh_alternate_usernames
, which we suspect to be the culprit of this new race condition.Please, let me know if there's anything I can do to improve upon this patch and get it merged in. This is really a critical bug that needs to be resolved quickly.