
Too many cases of ssh failure and process paused after exponential backoff #4442

Closed
cpignedoli opened this issue Oct 15, 2020 · 14 comments

@cpignedoli

self._client = paramiko.SSHClient()

Dear all, I am not sure this is the correct portion of code.
I noticed that many of my workchains go into the paused state due to ssh problems.
I suspect this originates from the fact that on Daint at CSCS, connecting via ssh
normally takes "some seconds".
It is rare that a workchain completes without the need for verdi process play,
and with increased throughput this could become problematic.
I think it would be very useful to add the possibility to change the parameters of the exponential backoff (or at least to set them so that they
cover a time span of the order of a few hours),
BUT this would probably not be effective in my case if it is not possible to avoid the socket.errors related to the "long waiting time" in establishing the ssh connection.
I attach the typical report I get from a process that was paused.
report.txt
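
(For reference, the timeout being discussed here is the one paramiko accepts when opening the connection; the following is only a minimal sketch, with placeholder hostname, username and values, not the AiiDA configuration:)

import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    'daint.cscs.ch',     # placeholder hostname
    username='myuser',   # placeholder username
    timeout=60,          # seconds to wait for the TCP connection to be established
    banner_timeout=60,   # seconds to wait for the SSH banner from the server
)
client.close()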

@greschd
Member

greschd commented Oct 15, 2020

Hi Carlo,
The problem you're facing is not that daint has an intermittent issue where it is slow for a while, but that it is generally slow and the transport has some (roughly fixed) probability of failing, is that right?

In that case, extending the exponential backoff indeed probably isn't the best solution. It may somewhat reduce the probability of stuck processes, but at the cost of waiting much longer than really necessary.

I think we could:

  • make the timeout used by paramiko (the SSH library) configurable
  • add "immediate retries" in addition to the exponential backoff
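
As an illustration of the second point, a minimal sketch of what immediate retries around the connection attempt could look like (this is not AiiDA's actual implementation; the retry count and delay are arbitrary):

import time
import paramiko

def connect_with_retries(hostname, retries=3, delay=5, **kwargs):
    # Try a few times in quick succession before giving up, as opposed to
    # the exponential backoff, which waits increasingly long between attempts.
    last_exception = None
    for _ in range(retries):
        client = paramiko.SSHClient()
        client.load_system_host_keys()
        try:
            client.connect(hostname, **kwargs)
            return client
        except (paramiko.SSHException, OSError) as exception:
            last_exception = exception
            time.sleep(delay)
    raise last_exception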

Not sure who the expert on this part of the code is, @sphuber @ltalirz @giovannipizzi?

@cpignedoli
Author

Ciao Dominik,
correct, this is what I suspect: ssh to Daint "always" takes some seconds, and this creates a "high" probability
for the transport to fail. If I am right about this, then the only effective fix I can imagine is
increasing the timeout in paramiko (at the moment I did not see a timeout limit specified).
Thanks a lot

Ciao

Carlo

@sphuber
Contributor

sphuber commented Oct 15, 2020

Thanks for the report Carlo. The timeout of the SSH connection is already configurable. By default it is 60 seconds, which should be sufficient. You can check by running verdi computer configure ssh <COMPUTERNAME>. The relevant key is called timeout.
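
For example (the --timeout flag name here is an assumption, mirroring the key name; the non-interactive pattern is the same as for the safe interval below):

verdi computer configure show COMPUTERNAME               # shows the current value of 'timeout'
verdi computer configure ssh COMPUTERNAME --timeout 120 -n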

That being said, I have recently also been experiencing the same problem you describe when running high-throughput on Daint. The problem is that it is impossible to determine from the exception raised by paramiko what the real underlying cause of the connection failure is. It doesn't necessarily have to be because the connection timed out. It can also be due to the host not being reachable, credentials being incorrect, and probably a whole host of other problems. I have also not been able to get more information out of the stack trace. I will contact CSCS to see if they see any out of the ordinary activity from our two accounts at the times of these exceptions and if they are actively doing something to deny those connections, through connection throttling or temporary IP banning. If that is the case, maybe we are simply connecting too often, despite the active connection throttling in AiiDA core.

Final questions: how many daemon workers do you have active, and what are the connection and job update intervals? The former you can get from verdi computer configure show COMPUTER (the key is called safe_interval) and the latter from the computer in the shell:

computer = load_computer(COMPUTER_NAME)
computer.get_minimum_job_poll_interval()
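# the value is returned in seconds; the default is 10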

@cpignedoli
Author

cpignedoli commented Oct 15, 2020 via email

@sphuber
Contributor

sphuber commented Oct 15, 2020

It is a bit weird that the value does not seem to be defined for safe_interval. The only thing I can think of is that you configured this computer on an older version of aiida-core that didn't have that value yet. Anyway, you can set it now using

verdi computer configure ssh daint-gpu-s904 --safe-interval 60 -n
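verdi computer configure show daint-gpu-s904   # verify that the safe_interval key now reads 60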

Note that this safe interval is respected per daemon worker. So if you are running with more workers, at some point you might want to increase it. With how many workers do you typically run?
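
To make the scaling concrete, a back-of-the-envelope example in the shell (the numbers are hypothetical):

n_workers = 4                        # hypothetical number of daemon workers
safe_interval = 60                   # seconds, enforced per daemon worker
n_workers * 3600 / safe_interval     # at most 240 connection openings per hour in total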

If you are running high-throughput, I think the 10-second minimum poll interval is also relatively low, especially if you are running jobs that can take one or more hours. You may want to consider setting this to a higher number as well. I personally run with 300 seconds for both the safe interval and the poll interval when I run high-throughput on Daint.
You can set it in the shell with:

computer = load_computer(COMPUTER_NAME)
computer.set_minimum_job_poll_interval(300)
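computer.get_minimum_job_poll_interval()  # should now return 300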

@cpignedoli
Author

cpignedoli commented Oct 15, 2020 via email

@mbercx
Member

mbercx commented Nov 21, 2020

I've followed up on this issue a bit, and noticed that the connection issues were also happening when I just tried to ssh into daint.cscs.ch from theossrv5. In my case the issue was definitely intermittent, since I was able to connect just fine for some time periods, but then unable to connect for minutes at a time. I've written a little script that tries to connect every 30 s; this is the log. It seems there are certain periods (up to 10 minutes) where I am unable to connect to daint.cscs.ch. The error that is returned is the following:

$ ssh daint-mbx
ssh_exchange_identification: read: Connection reset by peer
ssh_exchange_identification: Connection closed by remote host
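
(Not the actual script, but a minimal sketch of that kind of periodic check, assuming the daint-mbx ssh alias above is set up for key-based access:)

import subprocess
import time
from datetime import datetime

while True:
    # BatchMode avoids hanging on a password prompt; 'true' just exits immediately
    result = subprocess.run(
        ['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=20', 'daint-mbx', 'true'],
        capture_output=True,
        text=True,
    )
    status = 'OK' if result.returncode == 0 else 'FAILED: ' + result.stderr.strip()
    print(datetime.now().isoformat(), status, flush=True)
    time.sleep(30)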

I don't run into any issues connecting at all from my workstation, however, and the ssh config is set up exactly the same for both machines. I'll update CSCS with this information and ask if they have any idea why there is such a difference in connectivity.

I've also opened a PR (#4583) that adds the transport.task_retry_initial_interval and transport.task_maximum_attempts options so these can be configured for the exponential backoff mechanism. I'm going to try increasing the maximum number of attempts a bit and see if this helps.
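
With that branch, the two options could then be bumped for the profile, e.g. (the exact verdi config syntax depends on the aiida-core version, so treat this as an assumption):

verdi config transport.task_maximum_attempts 10
verdi config transport.task_retry_initial_interval 20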

Note that this safe interval is respected per daemon worker. So if you are running with more workers, at some point you might want to increase it. With how many workers do you typically run?

@sphuber: Just to note here: I've also had the pausing issue on theossrv5 while running with a single worker on the 3dd project.

@cpignedoli
Author

cpignedoli commented Nov 21, 2020 via email

@giovannipizzi
Member

Indeed, CSCS support checking their logs would help us - maybe they have a maximum number of connections (per time interval) from the same IP, so since there are many users on theossrv5, the limit is reached faster than on your workstation? Their logs should say. This is very important, as the same issue will then occur for AiiDAlab, with its many users (that is what Carlo is experiencing, I think).

@sphuber
Contributor

sphuber commented Nov 23, 2020

Just to note here: I've also had the pausing issue on theossrv5 while running with a single worker on the 3dd project.

I am more and more convinced that this is not actually due to a problem on AiiDA's side but with Daint itself.

@mbercx
Member

mbercx commented Nov 23, 2020

I am more and more convinced that this is not actually due to a problem on AiiDA's side but with Daint itself

I'm inclined to agree! 😉 I've added you in cc to the communication with CSCS; we'll see what they respond. That said, @flavianojs and I have been testing the EBM with an increased number of max_attempts (using this branch), and so far it seems to help quite a bit, since the connection issues are intermittent (i.e. at some point you get lucky 😅).

@mbercx
Member

mbercx commented Jan 27, 2021

After a meeting with CSCS together with @cpignedoli and @yakutovicha, it seems the issue was indeed on their end. Our IPs were getting temporarily blocked when they shouldn't have been (I don't remember the exact explanation from CSCS, maybe @cpignedoli or @yakutovicha can comment?). They have now whitelisted our IPs, and since then I have not found any more connection-related errors in the calculation job logs.

Since the problem was not really caused by AiiDA, and the user can now configure the EBM settings (#4583), I'm closing this issue.

@mbercx mbercx closed this as completed Jan 27, 2021
@yakutovicha
Contributor

yakutovicha commented Jan 27, 2021

I don't remember the exact explanation from CSCS, maybe @cpignedoli or @yakutovicha can comment?

I think the problem was that after the IP address was temporarily blocked on the ela machine, it wasn't unblocked afterwards (this should happen after 10 minutes). And this is what they fixed.

To make things worse: the temporary blacklist isn't shared among the ela machines. This means that if AiiDA connected to, say, ela1 (where access was blocked), the connection wouldn't go through, while the other machines (ela2..5) would still allow AiiDA to set up a connection. This is what made it so difficult to understand the origin of the problem.

@cpignedoli
Author

cpignedoli commented Jan 27, 2021 via email
