Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BrokenProcessPool #46

Closed
oschwengers opened this issue Oct 16, 2023 · 6 comments
Closed

BrokenProcessPool #46

oschwengers opened this issue Oct 16, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@oschwengers
Copy link

Hi @althonos,
thanks for the recent 3.0 version and all its improvements. I'm currently working on a patch to bring this into Bakta.

Primary, I wanted to implement a multi-threaded version of this as suggested in https://github.com/althonos/pyrodigal#-thread-safety. However, in meta mode, I always run into a concurrent.futures.process.BrokenProcessPool. I cannot 100% rule out that maybe there is something wrong on our site, but after playing around and running a minimal example, I wanted to let you know - just in case there might be bug lurking in Pyrodigal.

So, using a small plasmid as input, the following minimal example is working as expected:

import sys
import concurrent.futures as cf
import pyrodigal
from Bio import SeqIO

sequences = []
with open(str(sys.argv[1])) as fh:
    for record in SeqIO.parse(fh, 'fasta'):
        sequences.append(str(record.seq).upper())

gene_finder = pyrodigal.GeneFinder(meta=True, metagenomic_bins=None, closed=False, mask=True)
for genes in map(gene_finder.find_genes, sequences):
    for gene in genes:
        print(gene)

However, If i switch to a parallel setup, this is not working anymore:

import sys
import concurrent.futures as cf
import pyrodigal
from Bio import SeqIO

sequences = []
with open(str(sys.argv[1])) as fh:
    for record in SeqIO.parse(fh, 'fasta'):
        sequences.append(str(record.seq).upper())

gene_finder = pyrodigal.GeneFinder(meta=True, metagenomic_bins=None, closed=False, mask=True)
with cf.ProcessPoolExecutor(max_workers=4) as ppe:
    for genes in ppe.map(gene_finder.find_genes, sequences):
        for gene in genes:
            print(gene)

Giving this stacktrace:

Traceback (most recent call last):
  File "/home/oliver/software/bakta/bakta/bakta/pyrodigal-test.py", line 13, in <module>
    for genes in ppe.map(gene_finder.find_genes, sequences):
  File "/home/oliver/tmp/conda-envs/bakta-test/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/oliver/tmp/conda-envs/bakta-test/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/home/oliver/tmp/conda-envs/bakta-test/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/home/oliver/tmp/conda-envs/bakta-test/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/oliver/tmp/conda-envs/bakta-test/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

So this only occurs in meta mode; the parallel implementation works fine in parallel on larger sequences running in default non-meta mode. Any ideas why this is not working? Thanks a lot in advance!

Best regards
Oliver

@althonos
Copy link
Owner

Hi @oschwengers,

I haven't used future-based concurrency that much, but I notice that you're using ProcessPoolExecutor; using a ThreadPoolExecutor should work fine because most of the Pyrodigal code is nogil so it should run with true parallelism even within a single process. Tell me if you have errors there -- otherwise, I guess the problem may come from inter-process communication and I'll have a look.

Cheers 😄

@oschwengers
Copy link
Author

Thanks @althonos,
indeed using a ThreadPoolExecutor seems to work. This is very interesting, since I always thought that CPU bound tasks could not effectively parallelized by Python threads. I guess some Cython magic?

Anyways, thanks a lot for this! I'll test this a bit and maybe re-active the parallel gene prediction in Bakta which indeed saves a few tens of seconds in meta mode on a draft genome.

@althonos
Copy link
Owner

Indeed, Cython lets you declare code that runs in no-GIL mode, but for that you need to have code that doesn't interact with the Python interpreter in any way during these sections -- this is the case in Pyrodigal because the whole computation uses C data structures (from the Prodigal code) and only wraps the results at the very end of the computation 🙂

I'll try to have a look at the process pool eventually but I don't have ideas as to what could be wrong right now !

@oschwengers
Copy link
Author

Ah, I see. Thanks for the explanation. So far, everything seem to work using the ThreadPool which - of course - is faster on its own by avoiding all the back-and-forth copying of data, external process overhead, etc.
Thanks again. I'll close this for now.

@althonos
Copy link
Owner

althonos commented Nov 6, 2023

By the way, it looks like the error with ProcessPool was indeed caused by a pickling bug; I'll make a patch just in case, but it's quite likely that a ThreadPool is still faster.

@oschwengers
Copy link
Author

Thanks a lot for the confirmation and quick fix. Indeed, the ThreadPool is much faster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants