
RuntimeError: can't start new thread #1395

Closed
scintilla9 opened this issue May 23, 2024 · 9 comments

Comments

@scintilla9

Hi,

I'm trying to use the latest version of Cactus (2.8.2) in Docker.
At first I hit a NumPy error, which was solved by setting export OMP_NUM_THREADS=1 (suggestion from bcgsc/mavis#185).
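For anyone hitting the same NumPy error, a minimal sketch of that workaround; the OPENBLAS/MKL variables are an assumption covering other common BLAS backends, beyond what the mavis thread suggests:

```shell
# Limit the implicit thread pools that NumPy's numeric backends spawn.
# OMP_NUM_THREADS covers OpenMP; the other two are assumptions for
# OpenBLAS and MKL builds, which read their own variables.
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
```

These must be exported in the environment before the Python process (and hence NumPy) starts, or they have no effect.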

Then another error came up:

Traceback (most recent call last):
  File "/home/cactus/cactus_env/bin/cactus", line 8, in <module>
    sys.exit(main())
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/progressive/cactus_progressive.py", line 454, in main
    hal_id = toil.start(Job.wrapJobFn(progressive_workflow, options, config_node, mc_tree, og_map, input_seq_id_map))
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 894, in start
    self._batchSystem = self.createBatchSystem(self.config)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1043, in createBatchSystem
    return batch_system(**kwargs)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/singleMachine.py", line 198, in __init__
    self.daddyThread.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

My command is:
cactus ./js/ cactus.txt cactus.hal --defaultCores 40 --maxCores 40 --defaultMemory 512G --maxMemory 700G --defaultDisk 100G --maxDisk 500G --lastzCore 40 --lastzMemory 256G

Any suggestions are appreciated

@glennhickey
Collaborator

I think we've seen this issue before. @adamnovak does this seem familiar?

@glennhickey
Copy link
Collaborator

I think the issue I'm thinking of is DataBiosphere/toil#3573 and #462

@adamnovak
Collaborator

adamnovak commented May 23, 2024

Toil has been passing an OMP_NUM_THREADS value to each job individually since 5.5.0, so if the Toil here is newer than that, we shouldn't have the old problem of all the single-machine jobs thinking they can have one thread per core on the machine.

@scintilla9 what is your ulimit -u value (which would be the maximum number of threads you are allowed)? And how does that compare to what nproc says for the number of cores that are in the system?

It looks like Toil is failing to start one of its internal threads before it even gets around to making jobs that use threads. Are you running anything else on this machine that could be eating into your thread limit? Did you perhaps start a previous Toil run and leave its processes running?
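A quick way to gather those numbers in one place (sketch; the `ps` line is only an approximation of the threads currently counted against your user):

```shell
# Per-user limit on processes/threads vs. the machine's core count.
# If ulimit -u is close to (or below) nproc, per-core thread pools
# can exhaust it quickly.
ulimit -u
nproc

# System-wide ceiling on threads across all users.
cat /proc/sys/kernel/threads-max

# Approximate count of threads already owned by this user.
ps -L -u "$(id -un)" --no-headers | wc -l
```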

@scintilla9
Author

Here's the information:
ulimit -u = 3095605, nproc = 48

and ulimit -a:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 3095605
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 200000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 3095605
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

I am running another Cactus (an older version) locally on the machine, but I've tried stopping that job and running a new Cactus (latest version) in Docker, and the error still happened. In fact, the older version doesn't occupy much of the resources: it only takes 1 thread while running, even when I set --defaultCores 40 and --maxCores 40. That is why I want to switch to the latest version, but I am not sure whether that affects the thread limit.

Do I need to increase max user processes? It already seems like a huge value.

@adamnovak
Collaborator

Yeah, that looks big enough. Apparently there is also a system-wide limit you can check with cat /proc/sys/kernel/threads-max, but I don't think that's your problem.

Do you happen to be using Docker 20.10.9 (or older)? That version causes problems when newer containers try to start threads because it doesn't know about and thus forbids some of the syscalls they try to use, and the Cactus Docker images are on Ubuntu 22.04 so they would presumably be new enough to hit that bug.
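A quick way to check whether a daemon predates the fix; this sketch assumes the fix landed in 20.10.10 (the release after the "20.10.9 or older" versions described above, whose default seccomp profile blocked syscalls like clone3 that newer glibc uses):

```shell
# Returns success (0) if the given Docker version is 20.10.9 or older,
# i.e. predates 20.10.10 (assumed fix release for the seccomp/clone3
# issue described above).
docker_too_old() {
  # sort -V orders version strings; if $1 sorts before 20.10.10 and
  # is not equal to it, the daemon predates the fix.
  [ "$1" != "20.10.10" ] && \
    [ "$(printf '%s\n' "$1" 20.10.10 | sort -V | head -n1)" = "$1" ]
}

# Example: 18.09 (the reporter's version) is too old.
docker_too_old 18.09 && echo "upgrade Docker, or threads may fail to start"
```

Get the installed version with `docker --version`; the real fix is upgrading the Docker daemon rather than working around its seccomp profile.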

@scintilla9
Author

Hi @adamnovak

Thanks for the reply.
cat /proc/sys/kernel/threads-max shows 6191210.
And yes, my Docker version is 18.09, so this might be the reason.
I've now installed Cactus 2.8.2 from the pre-compiled binaries, and it has run without error so far.

BTW, multi-core lastz only works when a GPU is available, right?

@adamnovak
Collaborator

adamnovak commented May 28, 2024

BTW, multi-core lastz only works when a GPU is available, right?

I feel like multiple cores and using GPUs are independent features, but @glennhickey would know for sure.

@glennhickey
Collaborator

BTW, multi-core lastz only works when a GPU is available, right?

yes

  --lastzCores LASTZCORES
                        Number of cores for each lastz/segalign job, only
                        relevant when running with --gpu

@scintilla9
Author

Thanks for clarifying.
