`verdi computer test`: Add test for login shell being slow #5845

sphuber · 2022-12-19T13:05:14Z

By default computers are configured to use a login shell, setting the use_login_shell configuration option to True, which ensures that the Transport always uses the -l option for bash when executing a command. This can be important for certain compute environments where the login script contains commands that are necessary for calculation jobs to execute successfully.

However, this can incur significant slowdowns if the login script loading is slow. This will affect the throughput of calculation jobs significantly the source of which can be difficult to track down. That is why a test is added to verdi computer test. It will execute whoami over the transport with and without a login shell and if the ratio exceeds 2, a warning is emitted, suggesting to turn off the option unless strictly necessary.

By default computers are configured to use a login shell, setting the `use_login_shell` configuration option to `True`, which ensures that the `Transport` always uses the `-l` option for bash when executing a command. This can be important for certain compute environments where the login script contains commands that are necessary for calculation jobs to execute successfully. However, this can incur significant slowdowns if the login script loading is slow. This will affect the throughput of calculation jobs significantly the source of which can be difficult to track down. That is why a test is added to `verdi computer test`. It will execute `whoami` over the transport with and without a login shell and if the ratio exceeds 2, a warning is emitted, suggesting to turn off the option unless strictly necessary.

ltalirz

thanks @sphuber, very useful!

some minor comments

ltalirz · 2022-12-19T13:49:31Z

aiida/cmdline/commands/cmd_computer.py

+    finally:
+        authinfo.set_auth_params(auth_params)
+
+    if timing_true > (timing_false * factor):


You may want to consider introducing an absolute threshold for the difference as well (say, 100ms).

If the timing in both cases is very short (e.g. it goes to the local machine, and there are no "slow" commands in the .bashrc), the current approach might be brittle.

Done. But I have questions about the implementation. My first instinct was to use math.isclose however this does not work as I expected. If I set relative tolerance of 0.5 (factor of two), it would still mark two very small numbers, say 0.01 and 0.011, as close even for a negligible absolute tolerance of 0.000000001. The same goes for numpy.isclose. They effectively consider both tolerances, and as long as one is respected, the numbers are considered close. I think in this case, we want it to fail as soon as either difference (abs or relative) exceeds the tolerance, by itself. Is this correct?

The idea of the absolute threshold was to avoid false positives when the absolute time delta is very small and could be dominated by "noise".
So I would have raised the warning only if both thresholds are exceeded.

Come to think of it, I guess we mostly care about absolute time differences here anyhow, so as long as we have a somewhat reliable timing (via the average of the 3 attempts), one could also go with an absolute time threshold only (as in the verdi load tests).

Ok, simplified to just checking the difference is within 100 ms

aiida/cmdline/commands/cmd_computer.py

sphuber · 2022-12-19T19:24:28Z

@ltalirz Thanks for the review. I have addressed the comments. Just not sure about my implementation of the addition of absolute tolerance. And I haven't addressed the comment about including the computer label in the error message as it would require a refactor. If you think this is worth it, happy to do it.

ltalirz

thanks @sphuber !

looks good to me

mbercx · 2023-04-27T12:59:31Z

Just encountered this in practise for the first time:

* Checking for possible delay from using login shell... [Failed]: computer is configured to use a login shell, which is slower compared to a normal shell (Command execution time of 3.693061113357544 s versus 3.409668763478597, respectively).
Unless this setting is really necessary, consider disabling it with: verdi computer configure core.local COMPUTER_NAME -n --no-use-login-shell
Warning: 1 out of 6 tests failed

I understand this is useful, and at least it made me thing of whether I really need a login shell or not, but new users already struggle quite a bit with setting up/configuring computers. So perhaps:

Use a relative threshold for the time difference by default, falling back on the absolute as a minimum absolute threshold for triggering the warning. I'm not sure it's worth causing this extra hurdle for 10% speed gain. In fact, I'm also left wondering: do I really need my login shell or not? So I'm sure a new use will be equally confused and this makes it even more difficult to finally get started.
In the same vein, maybe make this test's response more positive. Now it says the computer configuration failed. That's not really true, the configuration works fine, you could just make it faster. So I would instead let this test leave a note that "the performance can be increased by not using the login shell".
Maybe add a text in the documentation on "to login shell or not to login shell", with some ideas on what to check for.
Put the verdi computer configure core.local COMPUTER_NAME -n --no-use-login-shell on a separate line so it can be copied more easily and depending on how difficult the refactor is: I'd still think it would be nice if the label was there instead of COMPUTER_NAME. 🙃

mbercx · 2023-04-27T16:47:56Z

Also, after trying the instructions:

❯ verdi computer test eiger
Report: Testing computer<eiger> for user<marnik.bercx@epfl.ch>...
* Opening connection... [OK]
* Checking for spurious output... [OK]
* Getting number of jobs from scheduler... [OK]: 532 jobs found in the queue
* Determining remote user name... [OK]: mbercx
* Creating and deleting temporary file... [OK]
* Checking for possible delay from using login shell... [Failed]: computer is configured to use a login shell, which is slower compared to a normal shell (Command execution time of 4.225524981816609 s versus 3.783755381902059, respectively).
Unless this setting is really necessary, consider disabling it with: verdi computer configure core.local COMPUTER_NAME -n --no-use-login-shell
Warning: 1 out of 6 tests failed
❯ verdi computer configure core.local eiger -n --no-use-login-shell
Critical: Computer eiger has transport of type "core.ssh", not core.local!
❯ verdi computer configure core.ssh eiger -n --no-use-login-shell
Usage: verdi computer configure core.ssh [OPTIONS] COMPUTER
Try 'verdi computer configure core.ssh --help' for help.

Error: Missing option '--safe-interval'.

This raises more questions on the interface of the various setup/configure commands, but will try to find the time to address these more holistically, see #4981

sphuber · 2023-04-28T13:07:37Z

The core.local is actually hard-coded. That is clearly a bug and should be fixed. For the --safe-interval being prompted for, that is also incorrect. When --non-interactive is passed, it shouldn't prompt for it as it should have a default. I think this is still happening because the default is not defined on the option itself, but inside the class. Will have a look into fixing this.

mbercx · 2023-05-13T10:44:30Z

@sphuber just ran into this again and opened some issues so we don't forget these!

sphuber requested a review from ltalirz December 19, 2022 13:36

ltalirz reviewed Dec 19, 2022

View reviewed changes

Address PR review comments

00ecfd9

sphuber requested a review from ltalirz December 19, 2022 19:23

Use simple absolute time tolerance of 100 ms

32e9a1a

ltalirz approved these changes Dec 20, 2022

View reviewed changes

sphuber merged commit 41b28cb into aiidateam:main Dec 20, 2022

sphuber deleted the feature/5844/verdi-computer-test-use-login-shell branch December 20, 2022 10:30

This was referenced May 13, 2023

🐛 CLI: Instructions for login delay test are incorrect #6013

Closed

🐛 CLI: Set default for --safe-interval #6014

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`verdi computer test`: Add test for login shell being slow #5845

`verdi computer test`: Add test for login shell being slow #5845

sphuber commented Dec 19, 2022

ltalirz left a comment

ltalirz Dec 19, 2022 •

edited

sphuber Dec 19, 2022

ltalirz Dec 19, 2022 •

edited

sphuber Dec 20, 2022

sphuber commented Dec 19, 2022

ltalirz left a comment

mbercx commented Apr 27, 2023

mbercx commented Apr 27, 2023

sphuber commented Apr 28, 2023

mbercx commented May 13, 2023

verdi computer test: Add test for login shell being slow #5845

verdi computer test: Add test for login shell being slow #5845

Conversation

sphuber commented Dec 19, 2022

ltalirz left a comment

Choose a reason for hiding this comment

ltalirz Dec 19, 2022 • edited

Choose a reason for hiding this comment

sphuber Dec 19, 2022

Choose a reason for hiding this comment

ltalirz Dec 19, 2022 • edited

Choose a reason for hiding this comment

sphuber Dec 20, 2022

Choose a reason for hiding this comment

sphuber commented Dec 19, 2022

ltalirz left a comment

Choose a reason for hiding this comment

mbercx commented Apr 27, 2023

mbercx commented Apr 27, 2023

sphuber commented Apr 28, 2023

mbercx commented May 13, 2023

`verdi computer test`: Add test for login shell being slow #5845

`verdi computer test`: Add test for login shell being slow #5845

ltalirz Dec 19, 2022 •

edited

ltalirz Dec 19, 2022 •

edited