Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check presence of CWD at the end of run_shell_cmd and try to return to original working directory if non-existent #4390

Open
wants to merge 8 commits into
base: 5.0.x
Choose a base branch
from

Conversation

lexming
Copy link
Contributor

@lexming lexming commented Nov 26, 2023

OpenFOAM installation fails after a successful build because the hooks after the run command fail to determine the working directory. The reason is that the last working directory during the build gets deleted just before that step.

[...]
OMPI_CXX="g++" mpicxx -Dlinux64 -DWM_ARCH_OPTION=64 -DWM_DP -DWM_LABEL_SIZE=32 -Wall -Wextra -Wold-style-cast -Wnon-virtual-dtor -Wno-unused-parameter -Wno-invalid-offsetof -Wno-attributes -O2 -ftree-vectorize -march=native -fno-math-errno -fuse-ld=bfd
  -DNoRepository -ftemplate-depth-100 -I/vscmnt/brussel_pixiu_apps/_apps_brussel/RL8/broadwell/software/OpenFOAM/10-foss-2022a-20230119/OpenFOAM-10/src/finiteVolume/lnInclude -I/vscmnt/brussel_pixiu_apps/_apps_brussel/RL8/broadwell/software/OpenFOAM/10
-foss-2022a-20230119/OpenFOAM-10/src/dynamicMesh/lnInclude -I/vscmnt/brussel_pixiu_apps/_apps_brussel/RL8/broadwell/software/OpenFOAM/10-foss-2022a-20230119/OpenFOAM-10/src/meshTools/lnInclude -IlnInclude -I. -I/vscmnt/brussel_pixiu_apps/_apps_brussel/
RL8/broadwell/software/OpenFOAM/10-foss-2022a-20230119/OpenFOAM-10/src/OpenFOAM/lnInclude -I/vscmnt/brussel_pixiu_apps/_apps_brussel/RL8/broadwell/software/OpenFOAM/10-foss-2022a-20230119/OpenFOAM-10/src/OSspecific/POSIX/lnInclude   -fPIC -fuse-ld=bfd
-Xlinker --add-needed -Xlinker --no-as-needed /vscmnt/brussel_pixiu_apps/_apps_brussel/RL8/broadwell/software/OpenFOAM/10-foss-2022a-20230119/OpenFOAM-10/platforms/linux64GccDPInt32Opt/applications/utilities/mesh/manipulation/subsetMesh/subsetMesh.o -L
/vscmnt/brussel_pixiu_apps/_apps_brussel/RL8/broadwell/software/OpenFOAM/10-foss-2022a-20230119/OpenFOAM-10/platforms/linux64GccDPInt32Opt/lib \
    -ldynamicMesh -lgenericPatchFields -lOpenFOAM -ldl  \
     -lm -o /vscmnt/brussel_pixiu_apps/_apps_brussel/RL8/broadwell/software/OpenFOAM/10-foss-2022a-20230119/OpenFOAM-10/platforms/linux64GccDPInt32Opt/bin/subsetMesh
== 2023-11-26 03:00:19,509 build_log.py:267 INFO ... (took 54 mins 11 secs)
== ... (took 54 mins 11 secs)
== 2023-11-26 03:00:19,510 config.py:695 DEBUG software install path as specified by 'installpath' and 'subdir_software': /apps/brussel/RL8/broadwell/software
== 2023-11-26 03:00:19,510 filetools.py:2012 INFO Removing lock /apps/brussel/RL8/broadwell/software/.locks/_apps_brussel_RL8_broadwell_software_OpenFOAM_10-foss-2022a-20230119.lock...
== 2023-11-26 03:00:19,516 filetools.py:383 INFO Path /apps/brussel/RL8/broadwell/software/.locks/_apps_brussel_RL8_broadwell_software_OpenFOAM_10-foss-2022a-20230119.lock successfully removed.
== 2023-11-26 03:00:19,516 filetools.py:2016 INFO Lock removed: /apps/brussel/RL8/broadwell/software/.locks/_apps_brussel_RL8_broadwell_software_OpenFOAM_10-foss-2022a-20230119.lock
== 2023-11-26 03:00:19,587 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:126 in __init__): Traceback (most recent call last):
  File "/scratch/brussel/vo/000/bvo00005/vsc10001/easybuild-framework/easybuild/main.py", line 135, in build_and_install_software
    (ec_res['success'], app_log, err) = build_and_install_one(ec, init_env)
  File "/scratch/brussel/vo/000/bvo00005/vsc10001/easybuild-framework/easybuild/framework/easyblock.py", line 4256, in build_and_install_one
    result = app.run_all_steps(run_test_cases=run_test_cases)
  File "/scratch/brussel/vo/000/bvo00005/vsc10001/easybuild-framework/easybuild/framework/easyblock.py", line 4135, in run_all_steps
    self.run_step(step_name, step_methods)
  File "/scratch/brussel/vo/000/bvo00005/vsc10001/easybuild-framework/easybuild/framework/easyblock.py", line 3970, in run_step
    step_method(self)()
  File "/scratch/brussel/vo/000/bvo00005/vsc10001/easybuild-easyblocks/easybuild/easyblocks/o/openfoam.py", line 341, in build_step
    run_cmd(cmd_tmpl % cmd, log_all=True, simple=True, log_output=True)
  File "/scratch/brussel/vo/000/bvo00005/vsc10001/easybuild-framework/easybuild/tools/run.py", line 92, in cache_aware_func
    res = func(cmd, *args, **kwargs)
  File "/scratch/brussel/vo/000/bvo00005/vsc10001/easybuild-framework/easybuild/tools/run.py", line 261, in run_cmd
    regexp=regexp, stream_output=stream_output, trace=trace, with_hook=with_hooks)
  File "/scratch/brussel/vo/000/bvo00005/vsc10001/easybuild-framework/easybuild/tools/run.py", line 362, in complete_cmd
    'work_dir': os.getcwd(),
FileNotFoundError: [Errno 2] No such file or directory
 (at easybuild/main.py:174 in build_and_install_software)

I suggest to change a bit the order of execution here, to make sure that we sit in an existing working directory before we call the hooks.

@boegel boegel changed the title earlier return to original directory after executing a command earlier return to original directory after executing a command in run_cmd (via complete_cmd) Dec 6, 2023
@boegel boegel added this to the next release (4.9.0?) milestone Dec 6, 2023
@boegel
Copy link
Member

boegel commented Dec 6, 2023

@lexming I've enhanced an existing test to catch this problem, see lexming#2

The only way to trigger the problem was to specify a working directory to run_cmd via path=, which is not used in the OpenFOAM easyblock, so I'm not entirely sure whether this fix actually resolves the exact problem you're trying to fix here?

If run_cmd is directly running into a working directory that gets removed by the command that is being run (via a change_dir before run_cmd), then we're still screwed, no?

@boegel
Copy link
Member

boegel commented Dec 6, 2023

@lexming I tried rebuilding OpenFOAM-10-foss-2022a-20230119.eb with current develop, and it didn't trigger the problem you're hitting, which tells me this is somehow self-inflicted on your end.

That also makes me a bit reluctant to merge this, since we're actually changing behavior here: the post run_cmd hook will now be triggered from a different directory.

That doesn't mean there no bug here, you could argue that run_cmd should be robust against the command being run removing the working directory we were in when run_cmd was started.
That has never been the case though, so I'm still puzzled why you are hitting that problem...
My best guess would be that you have a hook in place that changes to a particular directory, without changing back before the hook function completes?

@lexming
Copy link
Contributor Author

lexming commented Dec 8, 2023

@boegel ok I'll investigate this further 👍

@boegel
Copy link
Member

boegel commented Dec 20, 2023

@lexming Any updates?

@boegel
Copy link
Member

boegel commented Apr 3, 2024

@lexming ping?

@lexming
Copy link
Contributor Author

lexming commented May 6, 2024

I finally troubleshooted this issue. The underlying cause is a not-so-reliable NFS mount in our cluster where the installation directory is located.

The combination of this flaky NFS mount with a heavy build process like the one in OpenFOAM that is done directly on the installation directory (build_in_installdir) leaves the filesystem in a very weird state. Things like listing directories and changing directory work, but CWD is missing at the end of the installation and it cannot be re-determined by usual means.

I opened #4525 to improve the error reporting in this case.

Nonetheless, I still believe that we should change directory to a well defined location (like work_dir which is the root of the installation directory), before executing the hooks. Otherwise the behaviour of hooks might be erratic depending on where the build process brings you.

I agree that this is a change of behavior, so I'm changing targets to 5.0.x

@lexming lexming changed the base branch from develop to 5.0.x May 6, 2024 10:33
@lexming lexming added the EasyBuild-5.0 EasyBuild 5.0 label May 6, 2024
@lexming lexming modified the milestones: release after 4.9.1, 5.0 May 6, 2024
lexming and others added 3 commits May 6, 2024 17:00
… is robust against command that removes the directory it's running in
…riginal workdir or jump into a new working subdirectory and remove it
@lexming
Copy link
Contributor Author

lexming commented May 6, 2024

Improved the tests to consider 2 different cases:

  1. Command that removes the original working directory defined by EasyBuild (for software installations this is the root of the installation directory for that software)
  • EasyBuild should error out on such cases, because the command unexpectedly removed a directory created by EB
  1. Command that generates new directories inside initial working directory, jumps into them and removes them; leaving the shell on non-existing working directory
  • EasyBuild can continue to work in this case, we just need to make sure that after execution of any command, we come back to known working directory.

@lexming lexming changed the title earlier return to original directory after executing a command in run_cmd (via complete_cmd) return to original working directory after executing a command in run_shell_cmd May 6, 2024
@Micket
Copy link
Contributor

Micket commented May 8, 2024

We discussed this in the EB5 meeting. I will summarize my thoughts.

I think the premise is wrong here, neither the cmd the subprocess or the cwd=work_dir parameter can change the main threads cwd.

import os
import subprocess
print(os.getcwd())
p = subprocess.Popen(['pwd'], shell=True, cwd='/tmp', stdout=subprocess.PIPE)
a, b = p.communicate()
print('subprocess cwd:', a.decode('utf8'))
print(os.getcwd())

p = subprocess.Popen(['cd Downloads; pwd'], shell=True, stdout=subprocess.PIPE)
a, b = p.communicate()
print('subprocess cwd:', a.decode('utf8'))
print(os.getcwd())

p = subprocess.Popen(['pwd'], cwd='Downloads', shell=True, stdout=subprocess.PIPE)
a, b = p.communicate()
print('subprocess cwd:', a.decode('utf8'))
print(os.getcwd())

it's still just the expected

/home/micket
# First subprocess:
subprocess cwd: /tmp
/home/micket

# Second subprocess
subprocess cwd: /home/micket/Downloads
/home/micket

# Third subprocess
subprocess cwd: /home/micket/Downloads
/home/micket

so os.getcwd() doesn't change, which is good. I really really don't think we should ever change os.chdir except in a few central places.

In some easyblocks we still do patterns like

os.chdir(some_path)
run_cmd(cmd)
os.chdir(start_dir)

but those should be replaced with just

run_shell_cmd(cmd, work_dir=some_path)

knowing that the cwd of the main process will not under any circumstance change, regardless of what some_path is or cmd does.

So this would drastically change this, ensuring that the dir is changed to some_path, which I think is the opposite of what we want.

I'm struggling a bit to understand what the tests are trying to do here. I attempted extracting the core parts so i can run it manually

import os
from easybuild.tools.filetools import mkdir
from easybuild.tools.run import run_shell_cmd
from test.framework.utilities import init_config

init_config()

test_prefix = '/tmp'
workdir = os.path.join(test_prefix, 'workdir')
sub_workdir = os.path.join(workdir, 'subworkdir')
mkdir(sub_workdir, parents=True)

cmd_workdir_rm = "echo 'Command that jumps to working subdir and removes it' && "
cmd_workdir_rm += f"cd {sub_workdir} && pwd && rm -rf {sub_workdir} && "
cmd_workdir_rm += "echo 'Working directory removed.'"

run_shell_cmd(cmd_workdir_rm, work_dir=workdir)
print(os.getcwd())

and this work just fine with EB5.0 without this PR. The output from the command is

Command that jumps to working subdir and removes it
/tmp/workdir/subworkdir
Working directory removed.

and my print(os.getcwd()) still outputs /cephyr/NOBACKUP/users/c3-builder/micke, Everything is already working as intended as far as i can tell.

@lexming
Copy link
Contributor Author

lexming commented May 10, 2024

@Micket Indeed, you are right and the working directory passed to subprocess.Popen will not change the CWD of the main Python process. HOWEVER, whatever is executed by subprocess.Popen can still delete the working directory of the main Python process.

For instance, the following is a scenario that will cause trouble:

  1. Python is running on /dir_A
  2. run_shell_cmd is called with work_dir=/dir_A/dir_B
  3. command deletes /dir_A
  4. run_shell_cmd runs the post-run hooks with work_dir=/dir_A/dir_B
  5. 💥

The original issue in this PR is effectively triggering this problem. The command executed in run_shell_cmd does not explicitly delete the working directory of EB, but it leaves the filesystem in a zombie state (because a flaky NFS mount) and the main EB process on a non-existent CWD.

So we still need to decide what to do in such a case, as any forward execution of EB in that situation is dangerous. I agree with you that always changing the working directory at the end of run_shell_cmd is bad practice. I updated this PR to first do a check on CWD and only change dir if that fails and as a last attempt to bring EB into a sane environment.

@lexming lexming changed the title return to original working directory after executing a command in run_shell_cmd check presence of CWD at the end of run_shell_cmd and try to return to original working directory if non-existent May 10, 2024
@Micket
Copy link
Contributor

Micket commented May 12, 2024

@Micket Indeed, you are right and the working directory passed to subprocess.Popen will not change the CWD of the main Python process. HOWEVER, whatever is executed by subprocess.Popen can still delete the working directory of the main Python process.

Yes, or literally any other directory. Maybe it deletes the builddir and there is of course no way out of that, except fixing the broken build/command that was executed.

For instance, the following is a scenario that will cause trouble:

1. Python is running on `/dir_A`
2. `run_shell_cmd` is called with `work_dir=/dir_A/dir_B`
3. command deletes `/dir_A`
4. `run_shell_cmd` runs the post-run hooks with `work_dir=/dir_A/dir_B`
5. 💥

I don't understand the issue with that. This should crash/be a problem, and it is? Especially if entire /dir_A is deleted, there is no way possible directory to jump to a different dir even if one wanted to attempt some ad-hoc recovery.
Even if only the subdirectory dir_B was removed, even then the subsequent step would fail there anyway if it incorrectly expects that directory to exist.
There is really only so much (i.e. nothing in my opinion) we can do when the commands we run are doing shenanigans.

So we still need to decide what to do in such a case, as any forward execution of EB in that situation is dangerous. I agree with you that always changing the working directory at the end of run_shell_cmd is bad practice. I updated this PR to first do a check on CWD and only change dir if that fails and as a last attempt to bring EB into a sane environment.

I think the most important thing is that subprocess/run_shell_cmd never changes anything for the main process, be it working directory or environment variables.
I really really don't like idea that it would be doing this conditionally, which i think is an even bigger source of hard to find bugs..

If anything, i think all the sprinkled os.chdir we have today is probably the culprit of any issues. Apart from framework chdir'ing into the builddir at the start of the config step, i really don't think there should be any os.chdir in the code, i.e. we should do this easybuilders/easybuild-easyblocks#3327 "everywhere"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants