Use Pyrodigal instead of Prodigal for ORF prediction #200
Hi @raphenya!
This PR proposes replacing Prodigal with Pyrodigal for the ORF prediction stage. Pyrodigal is a Python library binding to Prodigal with additional performance enhancements. I'm the author of Pyrodigal, so of course this is not a completely neutral list, but it has several advantages over Prodigal that I'll try to list:
Single-threaded speed
Pyrodigal comes with a SIMD pre-filter to skip score computation for invalid gene pairs. This typically saves around half of the runtime for processing a genome in single mode (and more than that in metagenomic mode) on platforms with supported CPU features (SSE or NEON). I did a small writeup about this in the paper.
I ran some benchmarks on a single closed genome (NC_004129) to compare the runtime (still using BLAST for the downstream analysis):
Multi-threading
Pyrodigal supports re-entrant multithreading, so you can use multi-threaded ORF prediction even when running in single mode, contrary to what the code is currently doing with Prodigal, where you only run multi-threaded prediction in `--low_quality` mode. This improves the runtime even more on fragmented genomes (e.g. 548.SAMN21245456):
Simpler installation
Contrary to Prodigal, Pyrodigal can be `pip install`ed, so it's one less dependency to worry about for people who don't use conda. Otherwise it's also in Bioconda.
Same results
Despite the faster speed, Pyrodigal and Prodigal produce exactly[1] the same output.
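For reference, the multi-threaded single-mode prediction described under "Multi-threading" could be wired up roughly as follows. This is a minimal sketch, not the code in this PR: `predict_orfs_parallel` and `predict_with_pyrodigal` are hypothetical helper names, and the second helper assumes Pyrodigal 3 or later, where `GeneFinder.train` accepts several sequences.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_orfs_parallel(find_genes, contigs, threads=4):
    """Apply `find_genes` to every contig using a pool of worker threads.

    `find_genes` is any callable taking a single sequence. Results are
    returned in the same order as `contigs`.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(find_genes, contigs))


def predict_with_pyrodigal(contigs, threads=4):
    """Single-mode prediction: train once, then predict contigs in parallel.

    Assumption: Pyrodigal >= 3 is installed and `contigs` is a non-empty
    list of `bytes` sequences with enough total length for training.
    """
    import pyrodigal  # imported lazily so the helper above has no hard dependency

    gene_finder = pyrodigal.GeneFinder()
    # Single mode learns its parameters from the whole genome first.
    gene_finder.train(*contigs)
    # The trained finder is then shared by all worker threads.
    return predict_orfs_parallel(gene_finder.find_genes, contigs, threads)
```

Because the gene finder is passed in as a plain callable, the thread-pool helper can be exercised with any stand-in function, without Pyrodigal installed.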
Footnotes
[1] Well, almost. During the refactor I found a bug in Prodigal that caused all genes on the reverse strand to be penalized. It was fixed upstream, but Prodigal never got a new release, so unless you recompile the code yourself you're still getting a buggy version. Pyrodigal, on the other hand, contains the fix. So the "recompiled/fixed" Prodigal and Pyrodigal predict exactly the same thing (this is tested for), but the buggy Prodigal and Pyrodigal may occasionally diverge.