Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement SIMD code to quickly compute skippable connections #6

Merged
merged 17 commits into from Sep 22, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
17 commits
Select commit Hold shift + click to select a range
a9b7c30
Add experimental SIMD code to compute which node connections can be s…
althonos Sep 20, 2021
9b35360
Use aligned loads ands stores in AVX and SSE implementations
althonos Sep 20, 2021
177a826
Rewrite connection scoring with a dedicated class to support runtime …
althonos Sep 20, 2021
c05cb34
Move `Prodigal` submodule to the `vendor` folder
althonos Sep 20, 2021
ce3e60e
Vendor the `cpu_features` library to do runtime detection of CPU feat…
althonos Sep 20, 2021
acd8a09
Refactor `setup.py` and `_pyrodigal.pyx` to support conditional compi…
althonos Sep 20, 2021
3b0a74e
Feature-gate functions in `avx.c` and `sse.c` so they can be compiled…
althonos Sep 20, 2021
1592675
Add tests checking the connection scorer gets the same results
althonos Sep 20, 2021
3a8d8bb
Fix outdated import in `setup.py`
althonos Sep 20, 2021
f55e93e
Refactor `setup.py` to check for required OSX functions when building…
althonos Sep 20, 2021
745059d
Fix useless `memset` and optimize tight loops in `ConnectionScorer`
althonos Sep 20, 2021
4deefec
Fix dead code in `_pyrodigal.pyx` when compiling without SIMD support
althonos Sep 20, 2021
2a3b734
Fix typo in `pyrodigal.tests.test_connection_scorer`
althonos Sep 20, 2021
aacdaa0
Fix extension of temporary executables used in `setup.py` to check fo…
althonos Sep 20, 2021
dcad34a
Add missing `intergenic_mod` function to `node.pxd`
althonos Sep 21, 2021
da1be90
Update `README.md` with latest changes to implementation
althonos Sep 22, 2021
71fb662
Add a very limited CLI that can be invoked with `python -m pyrodigal`
althonos Sep 22, 2021
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
5 changes: 4 additions & 1 deletion .gitmodules
@@ -1,3 +1,6 @@
[submodule "Prodigal"]
path = Prodigal
path = vendor/Prodigal
url = https://github.com/hyattpd/Prodigal
[submodule "vendor/cpu_features"]
path = vendor/cpu_features
url = https://github.com/google/cpu_features/
5 changes: 3 additions & 2 deletions MANIFEST.in
Expand Up @@ -6,6 +6,7 @@ include README.md
recursive-include pyrodigal/tests *.py *.txt
recursive-include pyrodigal/tests/data *

recursive-include Prodigal *.c *.h *.md VERSION LICENSE CHANGES Makefile
recursive-include vendor/Prodigal *.c *.h *.md VERSION LICENSE CHANGES Makefile

recursive-include pyrodigal *.py *.pyi *.pyx *.pxd *.h py.typed
recursive-exclude pyrodigal *.c
recursive-exclude pyrodigal *.c *.html
49 changes: 29 additions & 20 deletions README.md
@@ -1,7 +1,7 @@
# 🔥 Pyrodigal [![Stars](https://img.shields.io/github/stars/althonos/pyrodigal.svg?style=social&maxAge=3600&label=Star)](https://github.com/althonos/pyrodigal/stargazers)

*Cython bindings and Python interface to [Prodigal](https://github.com/hyattpd/Prodigal/), an ORF
finder for genomes and metagenomes.*
finder for genomes and metagenomes. **Now with SIMD!***

[![Actions](https://img.shields.io/github/workflow/status/althonos/pyrodigal/Test/master?logo=github&style=flat-square&maxAge=300)](https://github.com/althonos/pyrodigal/actions)
[![Coverage](https://img.shields.io/codecov/c/gh/althonos/pyrodigal?style=flat-square&maxAge=3600)](https://codecov.io/gh/althonos/pyrodigal/)
Expand All @@ -27,14 +27,17 @@ internals, which has the following advantages:
- **single dependency**: Pyrodigal is distributed as a Python package, so you
can add it as a dependency to your project, and stop worrying about the
Prodigal binary being present on the end-user machine.
- **no intermediate files**: everything happens in memory, in a Python object
- **no intermediate files**: Everything happens in memory, in a Python object
you fully control, so you don't have to invoke the Prodigal CLI using a
sub-process and temporary files.
- **no input formatting**: sequences are manipulated directly as strings, which
leverages the issue of formatting your input to FASTA for Prodigal.
sub-process and temporary files. Sequences can be passed directly as
strings or bytes, which avoids the issue of formatting your input to
FASTA for Prodigal.
- **lower memory usage**: Pyrodigal is slightly more conservative when it comes
to using memory, which can help process very large sequences. It also lets
you save some more memory when running several *meta*-mode analyses
- **better performance**: Pyrodigal uses *SIMD* instructions to compute which
dynamic programming nodes can be ignored when scoring connections. This can
save up to half the runtime on some sequences.

### 📋 Features

Expand All @@ -50,23 +53,28 @@ metagenomic mode. It is still missing some features of the CLI:

### 🐏 Memory

Contrary to the Prodigal command line, Pyrodigal attempts to be more conservative
about memory usage. This means that most of the allocations will be lazy, and
that some functions will reallocate their results to exact-sized arrays when
it's possible. This leads to Pyrodigal using about 30% less memory, but with
a little bit more overhead to compute the size of buffers in advance.
Pyrodigal makes two changes compared to the origina Prodigal command line:

* Sequences are stored as raw bytes instead of compressed bitmaps. This means
that the sequence itself takes 3/8th more space, but since the memory used
for storing the sequence is often negligible compared to the memory used to
store dynamic programming nodes, this is an acceptable trade-off for better
performance when finding the start and stop nodes.
* Node arrays are dynamically allocated and grow exponentially instead of
being pre-allocated with a large size. On small sequences, this leads to
Pyrodigal using about 30% less memory.


### 🧶 Thread-safety

`pyrodigal.Pyrodigal` instances are thread-safe, and use an internal lock to
prevent parallel calls to their methods from overwriting the internal buffers.
However, a better solution to process sequences in parallel is to use a
consumer/worker pattern, and have one `Pyrodigal` instance in each worker.
Using a pool spawning `Pyrodigal` instances on the fly is also fine,
but prevents recycling memory:
`pyrodigal.Pyrodigal` instances are thread-safe. In addition, the `find_genes`
method is re-entrant. This means you can train a `Pyrodigal` instance once,
and then use a pool to process sequences in parallel:
```python
p = Pyrodigal()
p.train(training_sequence)
with multiprocessing.pool.ThreadPool() as pool:
pool.map(lambda s: Pyrodigal(meta=True).find_genes(s), sequences)
predictions = pool.map(p.find_genes, sequences)
```

## 🔧 Installing
Expand Down Expand Up @@ -150,9 +158,10 @@ for more details.

## ⚖️ License

This library is provided under the [GNU General Public License v3.0](https://choosealicense.com/licenses/gpl-3.0/). The Prodigal code was written by [Doug Hyatt](https://github.com/hyattpd)
and is distributed under the terms of the GPLv3 as well. See `Prodigal/LICENSE`
for more information.
This library is provided under the [GNU General Public License v3.0](https://choosealicense.com/licenses/gpl-3.0/).
The Prodigal code was written by [Doug Hyatt](https://github.com/hyattpd) and is distributed under the
terms of the GPLv3 as well. See `vendor/Prodigal/LICENSE` for more information. The `cpu_features` library is
licensed under the terms of the Apache license. See `vendor/cpu_features/LICENSE` for more information.

*This project is in no way not affiliated, sponsored, or otherwise endorsed
by the [original Prodigal authors](https://github.com/hyattpd). It was developed
Expand Down
44 changes: 44 additions & 0 deletions pyrodigal/__main__.py
@@ -0,0 +1,44 @@
import argparse

import coloredlogs
coloredlogs.install(level='DEBUG')

from . import __name__, __author__, __version__, Pyrodigal
from .tests.fasta import parse

parser = argparse.ArgumentParser(prog=__name__)
# parser.add_argument("-a", required=False, help="Write protein translations to the selected file.")
parser.add_argument("-c", required=False, action="store_true", help="Closed ends. Do not allow genes to run off edges.", default=False)
# parser.add_argument("-d", required=False, help="Write nucleotide sequences of genes to the selected file.")
# parser.add_argument("-f", required=False, help="Select output format.", choices={"gbk", "gff", "sco"}, default="gbk")
# parser.add_argument("-g", required=False, type=int, choices=_TRANSLATION_TABLES, help="Specify a translation table to use.", default=11)
parser.add_argument("-i", required=True, help="Specify FASTA input file.")
parser.add_argument("-p", required=False, help="Select procedure.", choices={"single", "meta"}, default="single")

try:
import argcomplete
argcomplete.autocomplete(parser)
except ImportError:
pass

args = parser.parse_args()

pyrodigal = Pyrodigal(meta=args.p == "meta", closed=args.c)


for i, seq in enumerate(parse(args.i)):
if args.p == "single" and i == 0:
pyrodigal.train(seq.seq)
for pred in predictions:
print(
seq.id,
"{}_v{}".format(__name__, __version__),
"CDS",
pred.begin,
pred.end,
"{:.1f}".format(pred.sscore + pred.cscore),
"+" if pred.strand > 0 else "-",
"0",
";".join([pred._gene_data, pred._score_data]),
sep="\t",
)
32 changes: 28 additions & 4 deletions pyrodigal/_pyrodigal.pxd
Expand Up @@ -3,7 +3,7 @@

# ----------------------------------------------------------------------------

from libc.stdint cimport uint8_t
from libc.stdint cimport int8_t, uint8_t

from pyrodigal.prodigal.bitmap cimport bitmap_t
from pyrodigal.prodigal.gene cimport _gene
Expand Down Expand Up @@ -41,6 +41,29 @@ cdef class Sequence:
cdef int _mer_ndx(self, int i, int length, int strand=*) nogil
cdef char _amino(self, int i, int tt, int strand=*, bint is_init=*) nogil

# --- Connection Scorer ------------------------------------------------------

cdef class ConnectionScorer:
# which SIMD backend to use
cdef uint8_t backend
# capacity of bypassing buffers
cdef size_t capacity
# connection skip lookup table
cdef uint8_t* skip_connection
cdef uint8_t* skip_connection_raw
# aligned storage of node types
cdef uint8_t* node_types
cdef uint8_t* node_types_raw
# aligned storage of node strands
cdef int8_t* node_strands
cdef int8_t* node_strands_raw
# aligned storage of node frame
cdef uint8_t* node_frames
cdef uint8_t* node_frames_raw

cdef int _index(self, Nodes nodes) nogil except 1
cdef int _compute_skippable(self, int min, int i) nogil except 1
cdef int _score_connections(self, Nodes nodes, int min, int i, TrainingInfo tinf, bint final=*) nogil except 1

# --- Nodes ------------------------------------------------------------------

Expand All @@ -53,6 +76,7 @@ cdef class Node:
cdef _node* node

cdef class Nodes:
# contiguous array of nodes, with capacity and length
cdef _node* nodes
cdef size_t capacity
cdef size_t length
Expand All @@ -66,6 +90,7 @@ cdef class Nodes:
const bint edge,
) nogil except NULL

cpdef Nodes copy(self)
cdef int _clear(self) nogil except 1
cdef int _sort(self) nogil except 1

Expand Down Expand Up @@ -143,6 +168,7 @@ cpdef int add_genes(Genes genes, Nodes nodes, int ipath) nogil except -1
cpdef void calc_dicodon_gene(TrainingInfo tinf, Sequence sequence, Nodes nodes, int ipath) nogil
cdef int* calc_most_gc_frame(Sequence seq) nogil except NULL
cpdef int calc_orf_gc(Nodes nodes, Sequence seq, TrainingInfo tinf) nogil except -1
cpdef int dynamic_programming(Nodes nodes, TrainingInfo tinf, ConnectionScorer score, bint final=*) nogil
cpdef int find_best_upstream_motif(Nodes nodes, int ni, Sequence seq, TrainingInfo tinf, int stage) nogil except -1
cpdef void raw_coding_score(Nodes nodes, Sequence seq, TrainingInfo tinf) nogil
cpdef void rbs_score(Nodes nodes, Sequence seq, TrainingInfo tinf) nogil
Expand All @@ -151,17 +177,15 @@ cpdef void score_upstream_composition(Nodes nodes, int ni, Sequence seq, Trainin
cpdef int shine_dalgarno_exact(Sequence seq, int pos, int start, TrainingInfo tinf, int strand=*) nogil
cpdef int shine_dalgarno_mm(Sequence seq, int pos, int start, TrainingInfo tinf, int strand=*) nogil
cpdef void train_starts_nonsd(Nodes nodes, Sequence sequence, TrainingInfo tinf) nogil
cpdef void train_starts_sd(Nodes nodes, Sequence sequence, TrainingInfo tinf) nogil

# --- Wrappers ---------------------------------------------------------------

cpdef void reset_node_scores(Nodes nodes) nogil
cpdef void record_overlapping_starts(Nodes nodes, TrainingInfo tinf, bint is_meta=*) nogil
cpdef int dynamic_programming(Nodes nodes, TrainingInfo tinf, bint final=*) nogil
cpdef void eliminate_bad_genes(Nodes nodes, int ipath, TrainingInfo tinf) nogil
cpdef void tweak_final_starts(Genes genes, Nodes nodes, TrainingInfo tinf) nogil
cpdef void record_gene_data(Genes genes, Nodes nodes, TrainingInfo tinf, int sequence_index) nogil
cpdef void calc_dicodon_gene(TrainingInfo tinf, Sequence sequence, Nodes nodes, int ipath) nogil
cpdef void train_starts_sd(Nodes nodes, Sequence sequence, TrainingInfo tinf) nogil
cpdef void determine_sd_usage(TrainingInfo tinf) nogil

# --- Main functions ---------------------------------------------------------
Expand Down