I work on an HPC cluster where access to individual /proc entries is highly restricted. A software package I'm using (horovod) relies on psutil to find the child processes spawned by a supervisor process. Process.children() in turn uses ppid_map() to find those child processes.
Here lies the crux: ppid_map() iterates over ALL processes, including those of other users and of root. This causes a PermissionError when it tries to read information about those processes (one might say rightfully so). On a multi-user system, psutil cannot expect to have access to information about every running process. Note that the original task does not need that additional information at all; information about our own processes is enough.
The end result is that horovod fails to terminate its children and simply enters a zombie state, together with its worker processes. I then have to kill those processes by hand!
Here is the full stack trace (sorry for the duplicated lines; the output of multiple processes is interleaved):
Traceback (most recent call last):
File "[...]/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Exception in thread Thread-1 (fn):
Traceback (most recent call last):
File "[...]/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "[...]/lib/python3.10/threading.py", line 953, in run
self.run()
File "[...]/lib/python3.10/threading.py", line 953, in run
self.run()
File "[...]/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "[...]/lib/python3.10/site-packages/horovod/runner/util/threads.py", line 157, in fn
self._target(*self._args, **self._kwargs)
File "[...]/lib/python3.10/site-packages/horovod/runner/util/threads.py", line 157, in fn
self._target(*self._args, **self._kwargs)
File "[...]/lib/python3.10/site-packages/horovod/runner/util/threads.py", line 157, in fn
func(*args)
File "[...]/lib/python3.10/site-packages/horovod/runner/common/util/safe_shell_exec.py", line 41, in terminate_executor_shell_and_children
func(*args)
File "[...]/lib/python3.10/site-packages/horovod/runner/common/util/safe_shell_exec.py", line 41, in terminate_executor_shell_and_children
func(*args)
File "[...]/lib/python3.10/site-packages/horovod/runner/common/util/safe_shell_exec.py", line 41, in terminate_executor_shell_and_children
for child in p.children():
File "[...]/lib/python3.10/site-packages/psutil/__init__.py", line 278, in wrapper
for child in p.children():
File "[...]/lib/python3.10/site-packages/psutil/__init__.py", line 278, in wrapper
for child in p.children():
File "[...]/lib/python3.10/site-packages/psutil/__init__.py", line 278, in wrapper
return fun(self, *args, **kwargs)
File "[...]/lib/python3.10/site-packages/psutil/__init__.py", line 906, in children
return fun(self, *args, **kwargs)
File "[...]/lib/python3.10/site-packages/psutil/__init__.py", line 906, in children
return fun(self, *args, **kwargs)
File "[...]/lib/python3.10/site-packages/psutil/__init__.py", line 906, in children
ppid_map = _ppid_map()
File "[...]/lib/python3.10/site-packages/psutil/_pslinux.py", line 1640, in ppid_map
ppid_map = _ppid_map()
File "[...]/lib/python3.10/site-packages/psutil/_pslinux.py", line 1640, in ppid_map
ppid_map = _ppid_map()
File "[...]/lib/python3.10/site-packages/psutil/_pslinux.py", line 1640, in ppid_map
with open_binary("%s/%s/stat" % (procfs_path, pid)) as f:
File "[...]/lib/python3.10/site-packages/psutil/_common.py", line 711, in open_binary
with open_binary("%s/%s/stat" % (procfs_path, pid)) as f:
File "[...]/lib/python3.10/site-packages/psutil/_common.py", line 711, in open_binary
with open_binary("%s/%s/stat" % (procfs_path, pid)) as f:
File "[...]/lib/python3.10/site-packages/psutil/_common.py", line 711, in open_binary
return open(fname, "rb", **kwargs)
PermissionError: [Errno 1] Operation not permitted: '/proc/1/stat'
return open(fname, "rb", **kwargs)
PermissionError: [Errno 1] Operation not permitted: '/proc/1/stat'
return open(fname, "rb", **kwargs)
PermissionError: [Errno 1] Operation not permitted: '/proc/1/stat'
Proposed Solution
Ignore any processes that cannot be accessed. Perhaps print a warning once to indicate that this behavior might (but need not) cause issues down the line, and introduce an option to disable that warning. In my opinion this is a much better solution than failing completely, which, as my description shows, also causes a lot of trouble.
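To make the proposal concrete, here is a minimal sketch of a ppid map that skips inaccessible entries and warns only once. It is not psutil's implementation: the names read_stat, parse_ppid, tolerant_ppid_map and children_of, as well as the reader and warn parameters, are hypothetical and exist only for illustration.

```python
import os
import warnings

_warned_once = False  # module-level flag so the warning is emitted only once


def read_stat(pid, procfs_path="/proc"):
    """Return the raw bytes of <procfs_path>/<pid>/stat."""
    with open(os.path.join(procfs_path, str(pid), "stat"), "rb") as f:
        return f.read()


def parse_ppid(stat_bytes):
    """Extract the parent PID from the contents of a stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so only split what follows the last ')'.
    rpar = stat_bytes.rfind(b")")
    fields = stat_bytes[rpar + 2:].split()
    return int(fields[1])  # fields[0] is the state, fields[1] the ppid


def tolerant_ppid_map(pids, reader=read_stat, warn=True):
    """Map pid -> ppid, skipping processes we are not allowed to inspect."""
    global _warned_once
    ppid_map = {}
    denied = 0
    for pid in pids:
        try:
            ppid_map[pid] = parse_ppid(reader(pid))
        except PermissionError:
            denied += 1  # e.g. another user's process under hidepid
        except (FileNotFoundError, ProcessLookupError):
            pass  # the process exited while we were iterating
    if denied and warn and not _warned_once:
        _warned_once = True
        warnings.warn(
            f"skipped {denied} inaccessible process(es); result may be incomplete",
            RuntimeWarning,
        )
    return ppid_map


def children_of(parent_pid, ppid_map):
    """Return the sorted PIDs whose parent is parent_pid."""
    return sorted(pid for pid, ppid in ppid_map.items() if ppid == parent_pid)
```

On a real system the PID list could come from something like [int(n) for n in os.listdir("/proc") if n.isdigit()]. Passing warn=False corresponds to the suggested option for silencing the warning.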
I agree that some APIs that return multiple values, such as Process.children(), open_files(), connections(), etc., could take a new ignore_ad=bool parameter to ignore AccessDenied (AD) internally. It should default to False, meaning raise AD by default. There was already a proposal for this in another ticket.
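As a rough sketch of the semantics described above (psutil has no such parameter at the time of this comment), the helper below shows how an ignore_ad flag defaulting to False could behave. The collect function is hypothetical, and the AccessDenied class is a stand-in for psutil.AccessDenied so the example runs without psutil installed.

```python
class AccessDenied(Exception):
    """Stand-in for psutil.AccessDenied so the sketch runs without psutil."""


def collect(items, getter, ignore_ad=False):
    """Apply getter to every item and return the results as a list.

    With ignore_ad=False (the default) the first AccessDenied propagates
    to the caller; with ignore_ad=True the offending item is skipped.
    """
    results = []
    for item in items:
        try:
            results.append(getter(item))
        except AccessDenied:
            if not ignore_ad:
                raise
    return results
```

With this shape, existing callers keep today's behavior, and a caller has to opt in explicitly to get a possibly incomplete result.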
It should default to False, meaning raise AD by default.
The lack of read permission is caused by hidepid=1, a standard mount option provided by Linux. The problem with defaulting to False is that almost any package that uses psutil will then fail under hidepid=1, because, let's be realistic, how many packages are going to change that default?
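For anyone who wants to check whether their system is affected, the sketch below looks for a hidepid option in mount-table text such as the contents of /proc/self/mounts. The function name proc_hidepid is hypothetical, and the parsing assumes the usual whitespace-separated layout (device, mount point, filesystem type, options); depending on the kernel version the value may be a number or a name.

```python
def proc_hidepid(mounts_text, mount_point="/proc"):
    """Return the hidepid value of the procfs at mount_point, or None."""
    for line in mounts_text.splitlines():
        parts = line.split()
        # Expected fields: device, mount point, fs type, options, ...
        if len(parts) < 4 or parts[1] != mount_point or parts[2] != "proc":
            continue
        for option in parts[3].split(","):
            if option.startswith("hidepid="):
                return option.split("=", 1)[1]
    return None
```

A possible use is proc_hidepid(open("/proc/self/mounts").read()); a return value of None means no hidepid option was found on that mount.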