
Too many cases of ssh failure and process paused after exponential backoff #4442

Closed
cpignedoli opened this issue Oct 15, 2020 · 14 comments

@cpignedoli

self._client = paramiko.SSHClient()

Dear all, I am not sure this is the correct portion of code.
I noticed that many of my workchains go into the paused state due to ssh problems.
I suspect this originates from the fact that on Daint at CSCS, connecting via ssh
normally takes "some seconds".
It is rare that a workchain completes without the need for verdi process play,
and with increased throughput this could become problematic.
I think it would be very useful to add the possibility to change the parameters of the exponential backoff (or at least to set them so that they
cover a time span of the order of a few hours),
BUT this would probably not be effective in my case if it is not possible to avoid the socket.errors related to the "long waiting time" in establishing the ssh connection.
I attach the typical report I get from a process that was paused.
report.txt
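
(For reference, the timeout being discussed here is the one paramiko accepts when opening the connection; the following is only a minimal sketch, with placeholder hostname, username and values, not the AiiDA configuration:)

import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    'daint.cscs.ch',     # placeholder hostname
    username='myuser',   # placeholder username
    timeout=60,          # seconds to wait for the TCP connection to be established
    banner_timeout=60,   # seconds to wait for the SSH banner from the server
)
client.close()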

@greschd
Member

greschd commented Oct 15, 2020

Hi Carlo,
The problem you're facing is not that daint has an intermittent issue where it is slow for a while, but that it is generally slow and the transport has some (roughly fixed) probability of failing, is that right?

In that case, extending the exponential backoff indeed probably isn't the best solution. It may somewhat reduce the probability of stuck processes, but at the cost of waiting much longer than really necessary.

I think we could:

  • make the timeout used by paramiko (the SSH library) configurable
  • add "immediate retries" in addition to the exponential backoff
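
As an illustration of the second point, a minimal sketch of what immediate retries around the connection attempt could look like (this is not AiiDA's actual implementation; the retry count and delay are arbitrary):

import time
import paramiko

def connect_with_retries(hostname, retries=3, delay=5, **kwargs):
    # Try a few times in quick succession before giving up, as opposed to
    # the exponential backoff, which waits increasingly long between attempts.
    last_exception = None
    for _ in range(retries):
        client = paramiko.SSHClient()
        client.load_system_host_keys()
        try:
            client.connect(hostname, **kwargs)
            return client
        except (paramiko.SSHException, OSError) as exception:
            last_exception = exception
            time.sleep(delay)
    raise last_exception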

Not sure who the expert on this part of the code is, @sphuber @ltalirz @giovannipizzi?

@cpignedoli
Author

Ciao Dominik,
correct, this is what I suspect: ssh to Daint "always" takes some seconds, and this creates a "high" probability
for the transport to fail. If I am right about this, then the only effective fix I can imagine is
increasing the timeout in paramiko (at the moment I did not see a timeout limit specified).
Thanks a lot

Ciao

Carlo

@sphuber
Contributor

sphuber commented Oct 15, 2020

Thanks for the report Carlo. The timeout of the SSH connection is already configurable. By default it is 60 seconds, which should be sufficient. You can check by running verdi computer configure ssh <COMPUTERNAME>. The relevant key is called timeout.
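
For example (the --timeout flag name here is an assumption, mirroring the key name; the non-interactive pattern is the same as for the safe interval below):

verdi computer configure show COMPUTERNAME               # shows the current value of 'timeout'
verdi computer configure ssh COMPUTERNAME --timeout 120 -n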

That being said, I have recently also been experiencing the same problem you describe when running high-throughput on Daint. The problem is that it is impossible to determine from the exception raised by paramiko what the real underlying cause of the connection failure is. It doesn't necessarily have to be because the connection timed out. It can also be due to the host not being reachable, credentials being incorrect, and probably a whole host of other problems. I have also not been able to get more information out of the stack trace. I will contact CSCS to see if they see any out of the ordinary activity from our two accounts at the times of these exceptions and if they are actively doing something to deny those connections, through connection throttling or temporary IP banning. If that is the case, maybe we are simply connecting too often, despite the active connection throttling in AiiDA core.

Final questions: how many daemon workers do you have active, and what are the connection and job update intervals? The former you can get from verdi computer configure show COMPUTER (the key is called safe_interval) and the latter from the computer in the shell:

computer = load_computer(COMPUTER_NAME)
computer.get_minimum_job_poll_interval()
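# the value is returned in seconds; the default is 10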

@cpignedoli
Author

cpignedoli commented Oct 15, 2020 via email

@sphuber
Contributor

sphuber commented Oct 15, 2020

It is a bit weird that the value does not seem to be defined for safe_interval. The only thing I can think of is that you configured this computer on an older version of aiida-core that didn't have that value yet. Anyway, you can set it now using

verdi computer configure ssh daint-gpu-s904 --safe-interval 60 -n
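verdi computer configure show daint-gpu-s904   # verify that the safe_interval key now reads 60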

Note that this safe interval is respected per daemon worker. So if you are running with more workers, at some point you might want to increase it. With how many workers do you typically run?
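
To make the scaling concrete, a back-of-the-envelope example in the shell (the numbers are hypothetical):

n_workers = 4                        # hypothetical number of daemon workers
safe_interval = 60                   # seconds, enforced per daemon worker
n_workers * 3600 / safe_interval     # at most 240 connection openings per hour in total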

If you are running high-throughput, I think the 10-second minimum poll interval is also relatively low, especially if you are running jobs that can take one or more hours. You may want to consider setting this to a higher number as well. I personally run with 300 seconds for both the safe interval and the poll interval when I run high-throughput on Daint.
You can set it in the shell with:

computer = load_computer(COMPUTER_NAME)
computer.set_minimum_job_poll_interval(300)
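computer.get_minimum_job_poll_interval()  # should now return 300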

@cpignedoli
Author

cpignedoli commented Oct 15, 2020 via email

@mbercx
Member

mbercx commented Nov 21, 2020

I've followed up on this issue a bit, and noticed that the connection issues were also happening when I just tried to ssh into daint.cscs.ch from theossrv5. In my case the issue was definitely intermittent, since I was able to connect just fine for some time periods, but then unable to connect for minutes at a time. I've written a little script that tries to connect every 30 s; this is the log. It seems there are certain periods (up to 10 minutes) where I am unable to connect to daint.cscs.ch. The error that is returned is the following:

$ ssh daint-mbx
ssh_exchange_identification: read: Connection reset by peer
ssh_exchange_identification: Connection closed by remote host
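
(Not the actual script, but a minimal sketch of that kind of periodic check, assuming the daint-mbx ssh alias above is set up for key-based access:)

import subprocess
import time
from datetime import datetime

while True:
    # BatchMode avoids hanging on a password prompt; 'true' just exits immediately
    result = subprocess.run(
        ['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=20', 'daint-mbx', 'true'],
        capture_output=True,
        text=True,
    )
    status = 'OK' if result.returncode == 0 else 'FAILED: ' + result.stderr.strip()
    print(datetime.now().isoformat(), status, flush=True)
    time.sleep(30)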

I don't run into any issues connecting at all from my workstation, however, and the ssh config is set up exactly the same for both machines. I'll update CSCS with this information and ask if they have any idea why there is such a difference in connectivity.

I've also opened a PR (#4583) that adds the transport.task_retry_initial_interval and transport.task_maximum_attempts options so these can be configured for the exponential backoff mechanism. I'm going to try increasing the maximum number of attempts a bit and see if this helps.
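
With that branch, the two options could then be bumped for the profile, e.g. (the exact verdi config syntax depends on the aiida-core version, so treat this as an assumption):

verdi config transport.task_maximum_attempts 10
verdi config transport.task_retry_initial_interval 20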

Note that this safe interval is respected per daemon worker. So if you are running with more workers, at some point you might want to increase it. With how many workers do you typically run?

@sphuber: Just to note here: I've also had the pausing issue on theossrv5 while running with a single worker on the 3dd project.

@cpignedoli
Author

cpignedoli commented Nov 21, 2020 via email

@giovannipizzi
Member

Indeed, CSCS support checking their logs would help us - maybe they have a maximum number of connections (per time interval) from the same IP, so since there are many users on theossrv5, the limit is reached faster than on your workstation? Their logs should say. This is very important, as the same issue will then occur for AiiDAlab, with its many users (that is what Carlo is experiencing, I think).

@sphuber
Contributor

sphuber commented Nov 23, 2020

Just to note here: I've also had the pausing issue on theossrv5 while running with a single worker on the 3dd project.

I am more and more convinced that this is not actually due to a problem on AiiDA's side but with Daint itself.

@mbercx
Member

mbercx commented Nov 23, 2020

I am more and more convinced that this is not actually due to a problem on AiiDA's side but with Daint itself

I'm inclined to agree! 😉 I've added you in cc to the communication with CSCS; we'll see what they respond. That said, @flavianojs and I have been testing the EBM with an increased number of max_attempts (using this branch), and so far it seems to help quite a bit, since the connection issues are intermittent (i.e. at some point you get lucky 😅).

@mbercx
Member

mbercx commented Jan 27, 2021

After a meeting with CSCS together with @cpignedoli and @yakutovicha, it seems the issue was indeed on their end. Our IPs were getting temporarily blocked when they shouldn't have been (I don't remember the exact explanation from CSCS, maybe @cpignedoli or @yakutovicha can comment?). They have now whitelisted our IPs, and since then I have not found any more connection-related errors in the calculation job logs.

Since the problem was not really caused by AiiDA, and the user can now configure the EBM settings (#4583), I'm closing this issue.

@mbercx mbercx closed this as completed Jan 27, 2021
@yakutovicha
Contributor

yakutovicha commented Jan 27, 2021

I don't remember the exact explanation from CSCS, maybe @cpignedoli or @yakutovicha can comment?

I think the problem was that after the IP address was temporarily blocked on the ela machine, it wasn't unblocked afterwards (this should happen after 10 minutes). And this is what they fixed.

To make things worse: the temporary blacklist isn't shared among the ela machines. This means that if AiiDA connected to, say, ela1 (where access was blocked), the connection wouldn't go through, while the other machines (ela2..5) would still allow AiiDA to set up a connection. This is what made it so difficult to understand the origin of the problem.

@cpignedoli
Author

cpignedoli commented Jan 27, 2021 via email
