Improve robustness of `PipInstall` plugin #7102
Labels
enhancement
Improve existing functionality or make things work better
The `PipInstall` plugin currently relies on rather brittle and untested logic to determine restarts, which may result in infinite restart loops (see #7037). In addition, it is not safe in scenarios where multiple workers share a file system and attempt to install packages simultaneously. In this scenario, some workers might not restart if another worker has already installed the packages, even though they have not updated their Python environment accordingly (see distributed/distributed/diagnostics/plugin.py, lines 257 to 259 in f2a517f).

To avoid this, we should track on which hosts packages have been installed and which workers have already restarted. This can be done by using `Client.{get|set}_metadata`. By using a `distributed.Lock`, the `PipInstall` plugin already instantiates a client that we can reuse. This way, we can ensure that all workers restart on a machine on which packages have been installed, and that they restart only once. This will remove the brittle string matching logic.

To further improve the `PipInstall` plugin, we should replace the use of the `distributed.Lock` with a `distributed.Semaphore` to avoid deadlocks when workers fail unexpectedly.
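
A rough sketch of the host and worker tracking proposed above might look like the following. It is not the plugin's actual implementation: `InMemoryMetadata` is a hypothetical stand-in that only mimics the shape of `Client.get_metadata` / `Client.set_metadata` so the example runs without a cluster, and the `PLUGIN` namespace, the key layout, and all helper names are assumptions made for illustration.

```python
from typing import Any


class InMemoryMetadata:
    """Hypothetical stand-in mimicking Client.get_metadata / Client.set_metadata."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get_metadata(self, keys: list, default: Any = None) -> Any:
        # Walk the nested dict; fall back to `default` if any key is missing.
        node: Any = self._data
        for key in keys:
            if not isinstance(node, dict) or key not in node:
                return default
            node = node[key]
        return node

    def set_metadata(self, key: list, value: Any) -> None:
        # Create intermediate dicts as needed, then store the value.
        node = self._data
        for part in key[:-1]:
            node = node.setdefault(part, {})
        node[key[-1]] = value


# Assumed metadata namespace; a real plugin would pick its own name.
PLUGIN = "pip-install-demo"


def needs_install(store, host: str) -> bool:
    """True if no worker has recorded an installation on this host yet."""
    return not store.get_metadata([PLUGIN, host, "installed"], default=False)


def record_install(store, host: str) -> None:
    store.set_metadata([PLUGIN, host, "installed"], True)


def needs_restart(store, host: str, worker: str) -> bool:
    """True if packages were installed on the host and this worker has not restarted."""
    installed = store.get_metadata([PLUGIN, host, "installed"], default=False)
    restarted = store.get_metadata([PLUGIN, host, "restarted"], default=[])
    return installed and worker not in restarted


def record_restart(store, host: str, worker: str) -> None:
    # Read-modify-write: in a real cluster this would need to run while
    # holding the per-host lock (or semaphore) to avoid lost updates.
    restarted = store.get_metadata([PLUGIN, host, "restarted"], default=[])
    store.set_metadata([PLUGIN, host, "restarted"], [*restarted, worker])
```

In this sketch, the first worker on a host would call `needs_install`, install, and then `record_install`; every worker on that host would then consult `needs_restart` and call `record_restart` before restarting, so that it does not restart a second time when the plugin runs again. How workers are identified across restarts is left open here.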