RolyPoly

RolyPoly is an RNA virus analysis toolkit, meant to be a "swiss-army knife" for RNA virus discovery and characterization. It includes a variety of commands, wrappers, parsers, automations, and some "quality of life" features for many stages of a virus investigation process (from raw read processing to genome annotation). While it includes an "end-2-end" command that runs an entire pipeline, the main goals of rolypoly are:

Help non-computational researchers take a deep dive into their data without having to give up on non-techie-friendly tools.
Help (software) developers of virus analysis pipelines "plug" holes in their framework, by using specific RolyPoly commands to add features to their existing code base.

Note - Rolypoly is still under development (contributions welcome!)

RolyPoly is an open, still in progress project - I aim to summarise the main functionality into a manuscript ~early 2026. Pull requests and contributions are welcome and will be considered (see CONTRIBUTING.md).
This also means that there are bugs, verbose logging even in non-debug mode, and some placeholders and TODOs here and there.

Installation

Quick and Easy - One Conda/Mamba Environment

Recommended for most users who want a "just works" solution and primarily intend to use rolypoly as a CLI tool in an independent environment.

We hope to have rolypoly available from bioconda in the near future.
In the meantime, it can be installed with the quick_setup.sh script, which will also fetch the pre-generated data rolypoly requires.

curl -O https://code.jgi.doe.gov/rolypoly/rolypoly/-/raw/main/src/setup/quick_setup.sh && \
bash quick_setup.sh

Quick Setup - Additional Options

You can specify custom paths for the code, databases, and conda environment location:

bash quick_setup.sh /path/to/conda/env /path/to/install/rolypoly_code /path/to/store/databases /path/to/logfile

By default, if no positional arguments are supplied, rolypoly is installed into the current working directory (the path quick_setup.sh is called from):

database in ./rolypoly/data/
code in ./rolypoly/code/
conda environment in ./rolypoly/env/
log file in ./RolyPoly_quick_setup.log

Modular / Dev - Command-Specific Pixi Environments

For software developers looking to try or make use of specific rolypoly features with minimal risk of dependency conflicts. This approach should allow you to install only the tools you need for specific functionality. Note: dependencies from pip are always installed; conda/bioconda dependencies are the modular ones.

# Install pixi first (if not already installed)
curl -fsSL https://pixi.sh/install.sh | bash

# Clone the repository
git clone https://code.jgi.doe.gov/rolypoly/rolypoly.git
cd rolypoly

# Install for specific functionality (examples):
pixi install -e reads-only        # Just read processing tools
pixi install -e assembly-only     # Just assembly tools  
pixi install -e basic-analysis    # Reads + assembly + identification
pixi install -e complete          # All tools (equivalent to legacy install)

# Run commands in the appropriate environment
pixi run -e reads-only rolypoly filter-reads --help
# or load the environment
pixi shell -e reads-only
rolypoly filter-reads --help

For detailed modular installation options, see the installation documentation.

Usage

RolyPoly is a command-line tool with subcommands grouped by analysis stage. Use rolypoly --help or rolypoly <command> --help for the most up-to-date details. Some additional information is in the docs.

rolypoly  <COMMAND> [ARGS]...
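
For developers who want to call RolyPoly from an existing Python pipeline, one option is to shell out to the CLI. The sketch below relies only on the `rolypoly --help` and `rolypoly <command> --help` forms mentioned above. The helper names `help_command` and `show_help` are hypothetical and are not part of RolyPoly.

```python
import shutil
import subprocess


def help_command(subcommand=None):
    """Build the argv list for `rolypoly --help` or `rolypoly <command> --help`."""
    argv = ["rolypoly"]
    if subcommand:
        argv.append(subcommand)
    argv.append("--help")
    return argv


def show_help(subcommand=None):
    """Return the CLI help text, or None if rolypoly is not on PATH."""
    if shutil.which("rolypoly") is None:
        return None
    result = subprocess.run(
        help_command(subcommand), capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    # Prints the argv list only; nothing is executed here.
    print(help_command("filter-reads"))
```

Checking `shutil.which` first lets a host pipeline degrade gracefully when rolypoly is not installed in the active environment.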

Commands and Project Status

Active development. Command groups and current implementation status are summarized below.

Legend:

✅ - Available (on pypi and has tests). Command default parameters are unlikely to change much.
🧪 - Experimental, might not be on pypi / have tests. Default parameters may change. Code might be in a separate dev branch.
🚧 - Under active development.
🤔/TBD - Planned / under consideration.

Data

✅ get-data — Download/setup required data
✅ version — Show code and data version info

Raw-Reads

✅ filter-reads — Host/rRNA/adapters/artifact filtering and QC (bbmap, falco, etc.)
✅ shrink-reads — Downsample or subsample reads. Useful for testing or normalizing coverage across samples.
✅ mask-dna — Mask DNA regions in RNA-seq reads (bbmap, seqkit). Useful for avoiding mis-filtering of RNA virus reads because of potential matches to EVEs.
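
To illustrate the idea behind downsampling reads, here is a minimal sketch of reservoir sampling over FASTQ records. It is a generic technique shown for intuition and is not taken from the shrink-reads implementation. The function names and the fixed seed are assumptions made for the example.

```python
import random


def read_fastq(lines):
    """Yield FASTQ records as (header, sequence, plus, quality) tuples.

    Assumes well-formed 4-line records; an incomplete trailing record is dropped.
    """
    it = iter(lines)
    for record in zip(it, it, it, it):
        yield tuple(field.rstrip("\n") for field in record)


def reservoir_sample(records, k, seed=0):
    """Keep a uniform random sample of at most k records in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            # Replace a kept record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


if __name__ == "__main__":
    demo = []
    for n in range(5):
        demo += [f"@read{n}\n", "ACGU".replace("U", "T") + "\n", "+\n", "IIII\n"]
    for header, *_ in reservoir_sample(read_fastq(demo), 2):
        print(header)
```

A single pass with a fixed-size reservoir means the whole file never has to sit in memory, which matters for large sequencing runs.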

Annotation

✅ annotate — Genome feature annotation (wraps the rna and prot commands)
✅ annotate-rna — RNA secondary structure labelling and ribozyme detection (Infernal, ViennaRNA/linearfold, cmsearch on Rfam...)
🧪 annotate-prot — Gene calling, protein domain annotation, and functional prediction (HMMER, Pfam, custom).

Meta/Genome Assembly

✅ assemble — Assemble reads into contigs (SPAdes, MEGAHIT, penguin)
✅ filter-contigs — Filter sequences based on user-supplied host/contamination references (nucleotide and amino acid modes).

RNA Virus Identification

✅ marker-search — Search for viral markers (mainly RdRps, genomad VVs, or user-provided), using profile-based methods (HMMER / MMseqs2).
✅ virus-mapping — Map and identify viruses using nucleic acid search (MMseqs2).
✅ rdrp-motif-search — Search RdRp motifs (A/B/C/D) in nucleotide or amino acid sequences.
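
As a rough illustration of searching a short protein motif in nucleotide input, the sketch below translates a sequence in all six frames and scans each frame with a regular expression. It is a toy example and not how rdrp-motif-search works internally. It assumes the standard genetic code only, and the `GDD` pattern (a tripeptide found in motif C of many RdRps) is used purely as an example query.

```python
import re

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(nt):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3)
    )


def six_frames(nt):
    """Yield (strand, offset, protein) for the three frames on each strand."""
    nt = nt.upper().replace("U", "T")
    reverse_complement = nt.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", nt), ("-", reverse_complement)):
        for offset in range(3):
            yield strand, offset, translate(seq[offset:])


def find_motif(nt, pattern):
    """Return (strand, frame offset, amino acid position) for each match."""
    hits = []
    for strand, offset, protein in six_frames(nt):
        for match in re.finditer(pattern, protein):
            hits.append((strand, offset, match.start()))
    return hits


if __name__ == "__main__":
    print(find_motif("ATGGGTGATGATTAA", "GDD"))
```

Real motif searches typically use profiles or position-specific scoring rather than exact patterns, so treat this only as a picture of the six-frame scanning step.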

Binning / Clustering

🧪 cluster — Average nucleotide identity (ANI) based contig grouping. Supports several common backends and methods.
🧪 extend — Extend sequences by pile-up/assembly. Useful for combining assemblies of low-abundance viruses, or those with high microdiversity, at the cost of worse strain/sub-species resolution (i.e. can condense to a consensus).
🧪 termini — Shared termini grouping and motif reporting. Writes assignments + groups tables (TSV/CSV/Parquet/JSONL) and motif FASTA by default.
🧪 correlate — Group contigs based on co-occurrence, co-abundance, a minimal (Spearman's) correlation of these, or both.
🤔 binit — Combines the above commands with sample information and genome attributes (e.g. require a shared termini AND protein complementarity, like CP + RdRp). See notebooks/Exprimental/partiti_usecase/partiti_segment_workflow_experimental.ipynb for candidate workflow.
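
To make the co-abundance idea concrete, the following sketch computes Spearman's rank correlation between per-sample abundance vectors and reports contig pairs above a threshold. It is a plain-Python illustration under assumed inputs (a dict of contig name to abundance list) and does not reflect the correlate command's actual code, options, or output.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for idx in order[start:end + 1]:
            ranks[idx] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def correlated_pairs(abundance, min_rho):
    """Return (contig_a, contig_b, rho) for pairs with rho >= min_rho."""
    names = sorted(abundance)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho = spearman(abundance[a], abundance[b])
            if rho >= min_rho:
                pairs.append((a, b, rho))
    return pairs


if __name__ == "__main__":
    table = {"c1": [1, 5, 10, 50], "c2": [2, 4, 20, 40], "c3": [9, 7, 3, 1]}
    print(correlated_pairs(table, 0.9))
```

Rank-based correlation is less sensitive to the large dynamic range of abundance values than a correlation on raw counts, which is one reason it is a common choice for this kind of grouping.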

Miscellaneous

✅ roll — Run an end-to-end pipeline (before v0.7.1, named end2end).
✅ fetch-sra — Download SRA fastq files (from ENA)
✅ fastx-calc — Calculate per-sequence metrics (length, GC content, hash, ...)
✅ fastx-stats — Calculate aggregate statistics for sequences (min, max, mean, median, ...) (input is file/s)
✅ rename-seqs — Rename sequences (add a prefix, suffix, hash, running number, etc.)
🚧 quick-taxonomy — Quick taxonomy assignment. Candidate workflows are github.com/UriNeri/ictv-mmseqs2-protein-database and github.com/apcamargo/ictv-mmseqs2-protein-database
🤔 support for genotate for gene prediction.
🤔 Genome refinement / strain de-entanglement / variant calling?
🤔 Virus feature prediction (+/-ssRNA/dsRNA, circular/linear, mono/poly-segmented, capsid type, etc.)
🤔 Host prediction
🤔 protein structural prediction support (and reseek search xyz dbs)
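
For readers curious what per-sequence metrics and aggregate statistics look like in practice, here is a small self-contained sketch. It is not the fastx-calc or fastx-stats implementation. The field names, the choice of SHA-256 as the hash, and the helper names are all assumptions for illustration.

```python
import hashlib
import statistics


def parse_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sequence_metrics(name, seq):
    """Per-sequence metrics: length, GC fraction, and a content hash."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return {
        "name": name,
        "length": len(seq),
        "gc_fraction": gc / len(seq) if seq else 0.0,
        "sha256": hashlib.sha256(seq.encode("ascii")).hexdigest(),
    }


def length_summary(metrics):
    """Aggregate statistics over sequence lengths."""
    lengths = [m["length"] for m in metrics]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


if __name__ == "__main__":
    fasta = [">s1 example", "ACGT", "GGCC", ">s2", "AAAA"]
    rows = [sequence_metrics(n, s) for n, s in parse_fasta(fasta)]
    print(rows)
    print(length_summary(rows))
```

Hashing the uppercased sequence gives identical sequences the same identifier regardless of their original names, which is handy for spotting duplicates across samples.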

If you have suggestions for additional commands or features, or want to implement some of these - please let us know, and consider contributing :-)

Dependencies

📦 Modular Installation Available: RolyPoly supports both quick setup (one environment with all dependencies for all commands) and modular installation (command-specific environments). The modular approach is particularly useful for software developers who want to integrate specific rolypoly features with minimal dependency conflicts. See the installation documentation for details.

Not all 3rd party software is used by all the different commands. RolyPoly includes a "citation reminder" that will try to list all the external software used by a command. The "reminded citations" are pretty printed to console (stdout) and to a logfile. To shut off the terminal citation reminder printing, set ROLYPOLY_REMIND_CITATIONS to false in your rpconfig.json file.
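
If you prefer to change the setting programmatically, a sketch along these lines could work. The location of rpconfig.json is not stated here, so the path is left as a parameter. Whether RolyPoly expects a JSON boolean or the string "false" is also an assumption; check your existing file before relying on this. The function name `set_citation_reminder` is hypothetical.

```python
import json
from pathlib import Path


def set_citation_reminder(config_path, enabled):
    """Set ROLYPOLY_REMIND_CITATIONS in a JSON config, keeping other keys."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    # Assumption: the value is stored as a JSON boolean.
    config["ROLYPOLY_REMIND_CITATIONS"] = enabled
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    import tempfile

    # Demonstrate on a throwaway file rather than a real config.
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = Path(tmp) / "rpconfig.json"
        demo_path.write_text('{"example_key": 1}')
        print(set_citation_reminder(demo_path, False))
```

Reading the existing file first and rewriting only one key avoids clobbering any other settings already stored there.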


Non-Python

Python Libraries

Databases used by rolypoly

RolyPoly will try to remind you to cite these too based on the commands you run. For more details, see the citation_reminder.py script and all_used_tools_dbs_citations


NCBI RefSeq rRNAs - Reference RNA sequences from NCBI RefSeq
NCBI RefSeq viruses - Reference viral sequences from NCBI RefSeq
pfam_A_38 - RdRp and RT profiles from Pfam-A version 38
RVMT - RNA Virus Meta-Transcriptomes database
SILVA_138 - High-quality ribosomal RNA database
NeoRdRp_v2.1 - Collection of RdRp profiles
RdRp-Scan - RdRp profile database incorporating PALMdb
TSA_2018 - RNA virus profiles from transcriptome assemblies
Rfam - Database of RNA families (structural/catalytic/both)
VFAM - Viral protein family database (part of vog/vogdb).
UniRef50 - UniProt Reference Clusters at 50% sequence identity

Motivation

There are many good virus analysis tools out there*. Many of them are custom made for specific virus groups, some are generalists, but most require complete control over the analysis process (i.e. one or two points of entry for data). Apart from input requirements, these pipelines vary in implementation (language, workflow management system (snakemake, nextflow...), dependencies), methodologies (tool choice for a similar step such as assembly), and goals (e.g. specific pathogen analysis vs whole virome analysis). These differences affect design and tooling choices (such as selecting a fast nucleotide-based sequence search method limited to high identity, over a slower but more sensitive profile- or structure-based (amino acid) search method). This has created some "lock in" (IMO), and I often find myself asked "what do you recommend for xyz" or "which pipeline should I use". Most people have limited time to invest in custom analysis pipeline design and so end up opting for an existing, off-the-shelf option, potentially compromising or having to align their goals with what the given software offers (if they are already aligned - great!).

* Check out awesome-rna-virus-tools for an awesome list of RNA virus (and related) software.

Reporting Issues

Please report any bugs you find on the Issues page.

Contribution

All forms of contributions are welcome - please see the CONTRIBUTING.md file for more details.

Authors (partial list, TBD update)


Uri Neri
Antônio Pedro Castello Branco Rocha Camargo
Dimitris Karapliafis
Brian Bushnell
Andrei Stecca Steindorff
Clement Coclet
Frederik Schulz
David Parker
Simon Roux
And more!
Your name here? Open a PR :)

Related projects

RdRp-CATCH - if you are interested in profile-based marker searches, benchmarking, and threshold setting.
suvtk - if you are looking to expedite NCBI submission (among other tasks).
gff2parquet - if you are looking for a fast GFF parser and converter to parquet format (note, also WIP).
pyrodigal-rv - if you are looking for an RNA virus specific Prodigal fork (incl. newly trained models for exotic genetic codes!).
hoodini - if you are interested in large-scale gene neighborhood analyses and visualization.

Acknowledgments

Thanks to the DOE Joint Genome Institute for infrastructure support. Special thanks to all contributors who have offered insights and improvements.

Copyright Notice

RolyPoly (rp) Copyright (c) 2024, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.

If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Intellectual Property Office at IPO@lbl.gov.

NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.

License Agreement

GPL v3 License

RolyPoly (rp) Copyright (c) 2024, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
