Too many cases of ssh failure and process paused after exponential backoff #4442
Hi Carlo, indeed extending the exponential backoff is probably not the best solution then. It may somewhat reduce the probability of stuck processes, but at the cost of waiting much longer than really necessary. I think we could:
Not sure who the expert on this part of the code is, @sphuber @ltalirz @giovannipizzi?
Ciao Dominik, ciao Carlo
Thanks for the report Carlo. The timeout of the SSH connection is already configurable. By default it is 60 seconds, which should be sufficient. You can check by running verdi computer configure ssh <COMPUTERNAME>; the relevant key is called timeout.
That being said, I have recently also been experiencing the same problem you describe when running high-throughput on Daint. The problem is that it is impossible to determine from the exception raised by paramiko what the real underlying cause of the connection failure is. It doesn't necessarily have to be that the connection timed out: it can also be due to the host not being reachable, credentials being incorrect, and probably a whole host of other problems. I have also not been able to get more information out of the stack trace. I will contact CSCS to see if they see any out-of-the-ordinary activity from our two accounts at the times of these exceptions, and whether they are actively doing something to deny those connections, through connection throttling or temporary IP banning. If that is the case, maybe we are simply connecting too often, despite the active connection throttling in AiiDA core.
Final questions: how many daemon workers do you have active, and what are the connection and job update intervals? The former you get from verdi computer configure show COMPUTER (the key is called safe_interval) and the latter from the computer in the shell:
computer = load_computer(COMPUTER_NAME)
computer.get_minimum_job_poll_interval()
Dear Sebastiaan,
you are right, it is already set to 60, and this is definitely safe I would say. This morning I also opened a ticket with CSCS on this; I have different accounts where this is happening.
What should I have as "safe_interval" instead of '-'? I have 10 as the minimum poll interval and '-' as safe_interval:
In [1]: c=load_computer('daint-gpu-s904')
In [2]: c.get_minimum_job_poll_interval()
Out[2]: 10.0
In [4]: quit
(base) aiida@42418e53f722:~$ verdi computer configure show daint-gpu-s904
* username cpi
* port 22
* look_for_keys -
* key_filename -
* timeout 60
* allow_agent -
* proxy_command ssh cpi@ela.cscs.ch netcat daint.cscs.ch 22
* compress True
* gss_auth False
* gss_kex False
* gss_deleg_creds False
* gss_host daint.cscs.ch
* load_system_host_keys True
* key_policy WarningPolicy
* use_login_shell -
* safe_interval -
It is a bit weird that the value does not seem to be defined for safe_interval. The only thing I can think of is that you configured this computer on an older version of aiida-core that didn't have that value yet. Anyway, you can set it now using
verdi computer configure ssh daint-gpu-s904 --safe-interval 60 -n
Note that this safe interval is respected per daemon worker, so if you are running with more workers, at some point you might want to increase it. With how many workers do you typically run? If you are running high-throughput, I think the 10 seconds minimum poll interval is also relatively low, especially if you are running jobs that can take one or more hours. You may want to consider setting this to a higher number as well. I personally run with 300 seconds for both the safe interval and the poll interval when I run high-throughput on Daint. You can set it in the shell with:
computer = load_computer(COMPUTER_NAME)
computer.set_minimum_job_poll_interval(300)
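Since the safe interval is enforced per worker, the aggregate connection pressure on the remote grows linearly with the number of daemon workers. A minimal sketch of that relation (the function name is illustrative, not part of the AiiDA API):

```python
def connections_per_minute(num_workers: int, safe_interval_s: float) -> float:
    """Upper bound on SSH connection attempts per minute.

    Each daemon worker respects the safe interval independently, so each
    may open at most one new connection per ``safe_interval_s`` seconds.
    """
    return num_workers * 60.0 / safe_interval_s

# One worker with a 30 s safe interval allows at most 2 connections/minute;
# four workers at the same setting already allow up to 8/minute.
```

This is why a safe interval that is fine for one worker can still trip a per-IP rate limit once several workers (or several users on the same machine) share one outgoing IP.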
Thanks a lot Sebastiaan.
The '-' in safe_interval probably derives from the fact that the computer was set up with an app in AiiDAlab that is probably not up to date; I will cross-check with Sasha.
I will switch to the safer values you suggested and see how things evolve. At the moment I do not do high throughput, and I am using one worker.
I've followed up on this issue a bit, and noticed that the connection issues were also happening when I just tried to ssh into daint.cscs.ch from theossrv5. In my case the issue was definitely intermittent, since I was able to connect just fine for some time periods, but then unable to connect for minutes at a time. I've written a little script that tries to connect every 30 s, and it seems there are certain periods (up to 10 minutes) where I am unable to connect to daint.cscs.ch. The error that is returned is the following:
$ ssh daint-mbx
ssh_exchange_identification: read: Connection reset by peer
ssh_exchange_identification: Connection closed by remote host
I don't run into any issues connecting at all from my workstation, however, and the ssh config is set up exactly the same for both machines. I'll update CSCS with this information and ask if they have any idea why there is such a difference in connectivity. I've also opened a PR (#4583) that adds the transport.task_retry_initial_interval and transport.task_maximum_attempts options, so these can be configured for the exponential backoff mechanism. I'm going to try increasing the maximum number of attempts a bit and see if this helps.
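The periodic connectivity probe described above could be sketched as follows. This is a hypothetical reconstruction, not the actual script; the probe interval and the plain TCP check (rather than a full SSH handshake) are assumptions:

```python
import socket
import time
from datetime import datetime


def tcp_probe(host: str, port: int = 22, timeout: float = 10.0) -> bool:
    """Return True if a plain TCP connection to the SSH port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_loop(host, attempts, interval=30, probe=tcp_probe):
    """Probe the host `attempts` times, logging one line per attempt."""
    results = []
    for _ in range(attempts):
        ok = probe(host)
        stamp = datetime.now().isoformat(timespec='seconds')
        print(f"{stamp} {'OK' if ok else 'FAILED'} {host}")
        results.append(ok)
        if interval:
            time.sleep(interval)
    return results
```

Running something like `probe_loop('daint.cscs.ch', attempts=20)` and grepping the log for FAILED runs would reveal the kind of multi-minute outage windows described above.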
@sphuber: Just to note here: I've also had the pausing issue on theossrv5 while running with a single worker on the 3dd project.
Thanks a lot Marnik for the update.
Let's hope this will help CSCS dig deeper.
Cheers,
Carlo
Indeed CSCS support could help us by checking their logs. Maybe they have a maximum number of connections (per time interval) from the same IP, so since theossrv5 has many users, the limit is reached faster than on your workstation? Their logs should say. This is very important, as the same issue will then occur for AiiDAlab with its many users (that is what Carlo is experiencing, I think).
I am more and more convinced that this is not actually due to a problem on AiiDA's side, but to one with Daint itself.
I'm inclined to agree! 😉 I've added you in cc to the communication with CSCS, we'll see what they respond. That said, @flavianojs and I have been testing the EBM with an increased number of max_attempts (using this branch) and so far it seems to help quite a bit, since the connection issues are intermittent (i.e. at some point you get lucky 😅).
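For a sense of scale, the total time the exponential backoff mechanism waits before pausing a process can be estimated as below. This is a sketch that assumes the interval doubles after each failed attempt; the values of 20 s and 5 attempts are assumed defaults for the options configurable via #4583, not confirmed here:

```python
def total_backoff_wait(initial_interval_s: float, max_attempts: int) -> float:
    """Total seconds spent waiting between retries before giving up.

    After each failed attempt the wait doubles: initial, 2x, 4x, ...
    With max_attempts tries there are (max_attempts - 1) waits in between.
    """
    return sum(initial_interval_s * 2 ** k for k in range(max_attempts - 1))

# Assuming a 20 s initial interval and 5 attempts:
# 20 + 40 + 80 + 160 = 300 s, so any outage longer than roughly
# 5 minutes exhausts the backoff and the process is paused.
```

Under those assumptions, the 10-minute blocking windows observed above comfortably outlast the default backoff, which matches the observation that raising max_attempts helps.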
After a meeting with CSCS together with @cpignedoli and @yakutovicha, it seems the issue was indeed on their end. Our IPs were getting temporarily blocked when they shouldn't have been (I don't remember the exact explanation from CSCS; maybe @cpignedoli or @yakutovicha can comment?). They have now whitelisted our IPs, and since then I have not found any more connection-related errors in the calculation job logs. Since the problem was not really caused by AiiDA, and the user can now configure the EBM settings (#4583), I'm closing this issue.
I think the problem was that after an IP address was temporarily blocked on an Ela machine, it wasn't unblocked afterwards (this should happen after 10 minutes), and this is what they fixed.
To make things worse: the temporary blacklist isn't shared among the Ela machines. This means that if AiiDA connected to, say, ela1 (where the access was blocked) it wouldn't get through, while the other machines (ela2..5) would allow AiiDA to set up the connection. This made it very difficult to understand the origin of the problem.
The only thing I can add is that CSCS suggested, to make it easier to trace back future related problems, not to use the generic ela.cscs.ch but rather one of the direct IPs:
148.187.1.(16,17,18,19,20,21)
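In practice that would mean pinning the jump host in the proxy command to one specific address. A hypothetical sketch, based on the proxy_command shown in the configuration earlier in this thread (the host alias and the choice of .16 are illustrative, not a CSCS recommendation):

```
# ~/.ssh/config -- pin the jump host to one of the direct Ela IPs
Host daint-direct
    HostName daint.cscs.ch
    User cpi
    ProxyCommand ssh cpi@148.187.1.16 netcat daint.cscs.ch 22
```

Pinning one IP makes any future block reproducible and attributable to a single Ela machine, at the cost of losing the round-robin failover of the generic hostname.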
aiida-core/aiida/transports/plugins/ssh.py
Line 387 in bd6903d
Dear all, I am not sure this is the correct portion of code. I noticed that many of my workchains go into a paused state due to ssh problems. I suspect that this originates from the fact that on Daint at CSCS, connecting via ssh normally requires "some seconds". It is basically rare that a workchain completes without the need of verdi process play, and this, in view of increased throughput, could become problematic.
I think it would be very useful to add the possibility to change the parameters of the exponential backoff (or at least to set them to reach a time span of the order of a few hours). BUT this would probably not be effective in my case if it is not possible to avoid socket errors related to the "long waiting time" in establishing an ssh connection.
I attach the typical report I get from a process that was paused.
report.txt