Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

verdi computer test: Add test for login shell being slow #5845

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Dec 19, 2022

Fixes #5844

By default computers are configured to use a login shell, setting the use_login_shell configuration option to True, which ensures that the Transport always uses the -l option for bash when executing a command. This can be important for certain compute environments where the login script contains commands that are necessary for calculation jobs to execute successfully.

However, this can incur significant slowdowns if the login script loading is slow. This will affect the throughput of calculation jobs significantly the source of which can be difficult to track down. That is why a test is added to verdi computer test. It will execute whoami over the transport with and without a login shell and if the ratio exceeds 2, a warning is emitted, suggesting to turn off the option unless strictly necessary.

By default computers are configured to use a login shell, setting the
`use_login_shell` configuration option to `True`, which ensures that the
`Transport` always uses the `-l` option for bash when executing a
command. This can be important for certain compute environments where
the login script contains commands that are necessary for calculation
jobs to execute successfully.

However, this can incur significant slowdowns if the login script loading
is slow. This will affect the throughput of calculation jobs
significantly the source of which can be difficult to track down. That
is why a test is added to `verdi computer test`. It will execute
`whoami` over the transport with and without a login shell and if the
ratio exceeds 2, a warning is emitted, suggesting to turn off the option
unless strictly necessary.
@sphuber sphuber requested a review from ltalirz December 19, 2022 13:36
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sphuber, very useful!

some minor comments

finally:
authinfo.set_auth_params(auth_params)

if timing_true > (timing_false * factor):
Copy link
Member

@ltalirz ltalirz Dec 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to consider introducing an absolute threshold for the difference as well (say, 100ms).

If the timing in both cases is very short (e.g. it goes to the local machine, and there are no "slow" commands in the .bashrc), the current approach might be brittle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. But I have questions about the implementation. My first instinct was to use math.isclose however this does not work as I expected. If I set relative tolerance of 0.5 (factor of two), it would still mark two very small numbers, say 0.01 and 0.011, as close even for a negligible absolute tolerance of 0.000000001. The same goes for numpy.isclose. They effectively consider both tolerances, and as long as one is respected, the numbers are considered close. I think in this case, we want it to fail as soon as either difference (abs or relative) exceeds the tolerance, by itself. Is this correct?

Copy link
Member

@ltalirz ltalirz Dec 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of the absolute threshold was to avoid false positives when the absolute time delta is very small and could be dominated by "noise".
So I would have raised the warning only if both thresholds are exceeded.

Come to think of it, I guess we mostly care about absolute time differences here anyhow, so as long as we have a somewhat reliable timing (via the average of the 3 attempts), one could also go with an absolute time threshold only (as in the verdi load tests).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, simplified to just checking the difference is within 100 ms

aiida/cmdline/commands/cmd_computer.py Outdated Show resolved Hide resolved
aiida/cmdline/commands/cmd_computer.py Outdated Show resolved Hide resolved
aiida/cmdline/commands/cmd_computer.py Outdated Show resolved Hide resolved
aiida/cmdline/commands/cmd_computer.py Show resolved Hide resolved
aiida/cmdline/commands/cmd_computer.py Outdated Show resolved Hide resolved
@sphuber sphuber requested a review from ltalirz December 19, 2022 19:23
@sphuber
Copy link
Contributor Author

sphuber commented Dec 19, 2022

@ltalirz Thanks for the review. I have addressed the comments. Just not sure about my implementation of the addition of absolute tolerance. And I haven't addressed the comment about including the computer label in the error message as it would require a refactor. If you think this is worth it, happy to do it.

Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sphuber !

looks good to me

@sphuber sphuber merged commit 41b28cb into aiidateam:main Dec 20, 2022
@sphuber sphuber deleted the feature/5844/verdi-computer-test-use-login-shell branch December 20, 2022 10:30
@mbercx
Copy link
Member

mbercx commented Apr 27, 2023

Just encountered this in practise for the first time:

* Checking for possible delay from using login shell... [Failed]: computer is configured to use a login shell, which is slower compared to a normal shell (Command execution time of 3.693061113357544 s versus 3.409668763478597, respectively).
Unless this setting is really necessary, consider disabling it with: verdi computer configure core.local COMPUTER_NAME -n --no-use-login-shell
Warning: 1 out of 6 tests failed

I understand this is useful, and at least it made me thing of whether I really need a login shell or not, but new users already struggle quite a bit with setting up/configuring computers. So perhaps:

  1. Use a relative threshold for the time difference by default, falling back on the absolute as a minimum absolute threshold for triggering the warning. I'm not sure it's worth causing this extra hurdle for 10% speed gain. In fact, I'm also left wondering: do I really need my login shell or not? So I'm sure a new use will be equally confused and this makes it even more difficult to finally get started.
  2. In the same vein, maybe make this test's response more positive. Now it says the computer configuration failed. That's not really true, the configuration works fine, you could just make it faster. So I would instead let this test leave a note that "the performance can be increased by not using the login shell".
  3. Maybe add a text in the documentation on "to login shell or not to login shell", with some ideas on what to check for.
  4. Put the verdi computer configure core.local COMPUTER_NAME -n --no-use-login-shell on a separate line so it can be copied more easily and depending on how difficult the refactor is: I'd still think it would be nice if the label was there instead of COMPUTER_NAME. 🙃

@mbercx
Copy link
Member

mbercx commented Apr 27, 2023

Also, after trying the instructions:

verdi computer test eiger
Report: Testing computer<eiger> for user<marnik.bercx@epfl.ch>...
* Opening connection... [OK]
* Checking for spurious output... [OK]
* Getting number of jobs from scheduler... [OK]: 532 jobs found in the queue
* Determining remote user name... [OK]: mbercx
* Creating and deleting temporary file... [OK]
* Checking for possible delay from using login shell... [Failed]: computer is configured to use a login shell, which is slower compared to a normal shell (Command execution time of 4.225524981816609 s versus 3.783755381902059, respectively).
Unless this setting is really necessary, consider disabling it with: verdi computer configure core.local COMPUTER_NAME -n --no-use-login-shell
Warning: 1 out of 6 tests failedverdi computer configure core.local eiger -n --no-use-login-shell
Critical: Computer eiger has transport of type "core.ssh", not core.local!verdi computer configure core.ssh eiger -n --no-use-login-shell
Usage: verdi computer configure core.ssh [OPTIONS] COMPUTER
Try 'verdi computer configure core.ssh --help' for help.

Error: Missing option '--safe-interval'.

This raises more questions on the interface of the various setup/configure commands, but will try to find the time to address these more holistically, see #4981

@sphuber
Copy link
Contributor Author

sphuber commented Apr 28, 2023

The core.local is actually hard-coded. That is clearly a bug and should be fixed. For the --safe-interval being prompted for, that is also incorrect. When --non-interactive is passed, it shouldn't prompt for it as it should have a default. I think this is still happening because the default is not defined on the option itself, but inside the class. Will have a look into fixing this.

@mbercx
Copy link
Member

mbercx commented May 13, 2023

@sphuber just ran into this again and opened some issues so we don't forget these!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add speed test to verdi computer test comparing with login and without login shell
3 participants