When the Pilot creates its DIRACOS environment, it directly calls `DIRACOS-Linux-machine.sh`, which eventually invokes `mamba`. Mamba assumes that it can use all cores of the machine (see mamba-org/mamba#2463), which isn't realistic for Pilot environments. This leads to excessive process creation, which can negatively affect the pilot, the user, or even the entire compute resource.
As far as I can tell, the templates from which DIRACOS is generated do not provide a feasible way to limit this internally. The Pilot thus seems like the best place to do so, since it is aware of resource restrictions.
A solution would be to set `MAMBA_EXTRACT_THREADS` when installing DIRACOS, either to a conservative `1` or to `pp.maxNumberOfProcessors`.
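A rough sketch of what this could look like on the Pilot side is below. It is not the Pilot's actual code: the helper name `installer_env` is hypothetical, and falling back to `1` for a missing or non-positive value is an assumption about what "conservative" should mean here.

```python
import os


def installer_env(max_processors=None, base_env=None):
    """Return a copy of the environment with MAMBA_EXTRACT_THREADS capped.

    max_processors is expected to carry something like
    pp.maxNumberOfProcessors; anything missing, non-numeric, or below 1
    falls back to a single extraction thread (an assumed policy).
    """
    # Copy so that the caller's (or the process's) environment is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    try:
        threads = int(max_processors)
    except (TypeError, ValueError):
        threads = 1
    env["MAMBA_EXTRACT_THREADS"] = str(max(1, threads))
    return env
```

The resulting mapping could then be passed as the `env` argument of whatever `subprocess` call runs the installer script, so that the limit applies only to the DIRACOS installation and not to the rest of the pilot.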
For a sense of scale, we caught this on a WLCG Tier 1 WN with 256 cores that got allocated mostly to one VO. Each of the single-core pilots tried to use 256 child processes; each pilot quickly ground to a halt due to resource limits and fork-bomb protection, which caused each new pilot to also immediately get stuck on nproc limits and similar safeguards.