Use Pyrodigal instead of Prodigal for ORF prediction #200

Merged: raphenya merged 5 commits into arpcard:master on Dec 7, 2022

althonos (Contributor) commented Oct 18, 2022

Hi @raphenya !

This PR proposes to replace Prodigal with Pyrodigal for running the ORF prediction stage. Pyrodigal is a Python library binding to Prodigal with additional performance enhancements. I'm the author of Pyrodigal, so ofc this is not a completely neutral list, but there are several advantages over Prodigal that I'll try to list down:

Single-threaded speed

Pyrodigal comes with a SIMD pre-filter to skip score computation for invalid gene pairs. This typically saves around half of the runtime for processing a genome in single mode (and more than that in metagenomic mode) on platforms with supported CPU features (SSE or NEON). I did a small writeup about this in the paper.

I ran some benchmarks on a single closed genome (NC_004129) to compare the runtime (still using BLAST for the downstream analysis):

| Mode | RGI w/ Prodigal | RGI w/ Pyrodigal |
|---|---|---|
| Default | 245s | 205s |
| Low quality | 340s | 272s |
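
To put the NC_004129 timings in relative terms, here is a small piece of arithmetic on the two rows quoted above. It only restates those numbers; the function name is made up for illustration.

```python
def runtime_reduction(before_s: float, after_s: float) -> float:
    """Return the relative wall-time reduction as a percentage."""
    return 100.0 * (before_s - after_s) / before_s


# End-to-end timings in seconds for NC_004129: (Prodigal, Pyrodigal).
timings = {"Default": (245, 205), "Low quality": (340, 272)}

for mode, (prodigal_s, pyrodigal_s) in timings.items():
    saved = runtime_reduction(prodigal_s, pyrodigal_s)
    print(f"{mode}: {saved:.1f}% less wall time")
```

This works out to roughly 16% and 20% for the whole pipeline. That is smaller than the saving quoted for gene prediction alone, presumably because the BLAST stage is unchanged and still accounts for much of the total.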

Multi-threading

Pyrodigal supports re-entrant multithreading, so you can use multi-threaded ORF prediction even when running in single mode, contrary to what the code is currently doing with Prodigal where you only run multi-threaded prediction in --low_quality mode. This improves the runtime even more on fragmented genomes (e.g. 548.SAMN21245456):

| Mode | RGI w/ Prodigal | RGI w/ Pyrodigal |
|---|---|---|
| Default | 231s | 153s |
| Low quality | 241s | 165s |
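
A minimal sketch of the per-contig threading pattern described above, not the code in this PR. `predict_contigs` and the stand-in predictor are hypothetical names; with Pyrodigal one would pass the `find_genes` method of a gene finder object as the callable (I believe the class is `pyrodigal.OrfFinder` in 2.x, but treat that as an assumption).

```python
from multiprocessing.pool import ThreadPool
from typing import Callable, Dict, List, Tuple


def predict_contigs(
    contigs: List[Tuple[str, bytes]],
    find_genes: Callable[[bytes], object],
    threads: int = 4,
) -> Dict[str, object]:
    """Run `find_genes` on every contig in a thread pool, keyed by contig name."""
    names = [name for name, _ in contigs]
    sequences = [seq for _, seq in contigs]
    with ThreadPool(threads) as pool:
        # `map` preserves input order, so results line up with `names`.
        results = pool.map(find_genes, sequences)
    return dict(zip(names, results))


def count_start_codons(seq: bytes) -> int:
    """Stand-in predictor so the sketch runs without Pyrodigal installed."""
    return seq.count(b"ATG")


demo = [("contig_1", b"ATGAAATGA"), ("contig_2", b"CCCATGCCC")]
print(predict_contigs(demo, count_start_codons, threads=2))
```

The pure-Python stand-in gains nothing from threads. The pattern only pays off when the callable is safe to call concurrently and does its work outside the interpreter lock, which is what the re-entrancy mentioned above is about. It also explains why fragmented assemblies with many contigs benefit the most.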

Simpler installation

Contrary to Prodigal, Pyrodigal can be pip installed, so it's one less dependency to worry about for people who don't use conda. Otherwise it's also in Bioconda.

Same results

Despite the faster speed, Pyrodigal and Prodigal produce exactly[^1] the same output.

[^1]: Well, almost. During the refactor I found a bug in Prodigal that got all genes on the reverse strand to be penalized. It was fixed here but Prodigal never got a new release, so unless you recompile the code yourself you're still getting a buggy version. On the contrary, Pyrodigal contains the fix. So the "recompiled/fixed" Prodigal and Pyrodigal predict exactly the same thing (this is tested for), but the buggy Prodigal and Pyrodigal may occasionally diverge.

raphenya (Collaborator) commented

@althonos Thank you, Martin! This looks awesome. I will review the code, but I think the best way is to offer the ORF tools (i.e., Prodigal and Pyrodigal) as an option. That way it will be easy to compare them, and it also makes sense in light of the anticipated Prodigal 3 release.

althonos (Contributor, Author) commented

Fine by me! I updated the code so the ORF finder is selected from the CLI, like for the aligner tool.
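
For readers skimming the thread, a rough sketch of what a CLI-controlled ORF finder could look like. This is not the code from the PR: the flag name, the choices, the default, and both function names are assumptions made for illustration.

```python
import argparse

# Hypothetical choices; the option actually added to RGI may differ.
ORF_FINDERS = ("PYRODIGAL", "PRODIGAL")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="rgi-sketch")
    parser.add_argument(
        "--orf_finder",
        type=str.upper,  # accept lower-case input before checking choices
        choices=ORF_FINDERS,
        default="PYRODIGAL",  # arbitrary default for this sketch
        help="ORF prediction backend (default: %(default)s)",
    )
    return parser


def select_orf_finder(name: str) -> str:
    """Map the CLI choice to a description of the backend that would run."""
    if name == "PYRODIGAL":
        return "in-process call to the Pyrodigal library"
    if name == "PRODIGAL":
        return "subprocess call to the prodigal binary"
    raise ValueError(f"unknown ORF finder: {name}")


args = build_parser().parse_args(["--orf_finder", "prodigal"])
print(select_orf_finder(args.orf_finder))
```

Keeping both backends behind a single option makes it straightforward to run the same input through each and compare the predicted ORFs.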

althonos (Contributor, Author) commented

Please don't merge yet, I'm making some breaking API changes regarding output formatting in Pyrodigal, so I'll update the PR later to use Pyrodigal v2 after it's properly released.

althonos (Contributor, Author) commented Nov 3, 2022

Just updated to v2.0, which has been verified to produce exactly the same results as Prodigal.

nickp60 commented Nov 16, 2022

Excited for this!

raphenya (Collaborator) commented Dec 7, 2022

@althonos Thank you, I will merge away!

@raphenya raphenya merged commit ed0d289 into arpcard:master Dec 7, 2022
althonos (Contributor, Author) commented Dec 7, 2022

Yay, thank you!
