VirBinn improves viral genome binning from metagenomic Hi-C through graph diffusion.
VirBinn is designed to run efficiently on standard computers or high-performance servers.
Operating System: VirBinn v1.0.0 has been tested and validated on Linux and macOS environments.
Dependencies: The pipeline relies on the Python scientific stack:
- numpy=1.20.1
- scipy=1.6.0
- pandas=1.2.2
- biopython=1.78
- pysam=0.15.3
- scikit-learn=0.24.1
- leidenalg=0.8.3
- python-igraph=0.8.3
- tqdm=4.58.0
To ensure all C-based dependencies (such as pysam and leidenalg) interact correctly, we recommend managing the installation via Conda.
Clone the repository using git:
git clone https://github.com/dyxstat/VirBinn.git
cd VirBinn
We provide configuration files for Linux and macOS to install the packages.
For Linux Users:
conda env create -f virbinn_linux_env.yaml
For macOS Users:
conda env create -f virbinn_osx_env.yaml
Once the environment creation is finished, activate it to start using the tool:
conda activate virbinn_env
We provide a built-in test command to verify that VirBinn is correctly installed and that all library dependencies are loaded properly.
python virbinn.py test
- Success: The command will run silently and exit without errors. A directory named Test/out_test containing a log file will be created.
- Failure: If the environment is not set up correctly, Python will raise an error.
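If the test command fails, it can help to see which package is missing or installed at a different version than the one pinned in the dependency list. The following is a minimal sketch, not part of VirBinn: it assumes Python 3.8 or newer (for `importlib.metadata`) and that the installed packages register under the same distribution names as in the list above. The helper name `check_versions` is hypothetical.

```python
from importlib import metadata

# Pinned versions copied from the dependency list in this README.
PINNED = {
    "numpy": "1.20.1",
    "scipy": "1.6.0",
    "pandas": "1.2.2",
    "biopython": "1.78",
    "pysam": "0.15.3",
    "scikit-learn": "0.24.1",
    "leidenalg": "0.8.3",
    "python-igraph": "0.8.3",
    "tqdm": "4.58.0",
}


def check_versions(pinned=PINNED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    problems = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            problems[name] = (expected, installed)
    return problems


if __name__ == "__main__":
    for name, (expected, installed) in check_versions().items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every listed package is present at the pinned version. A mismatch is not necessarily an error, since other versions may also work.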
To help you get started quickly, the Test/ directory includes a small simulated dataset:
Test/final.contigs.fa
Test/MAP_SORTED.bam
Test/viral_contigs.txt
You can verify the software by running it on this data.
Once the environment is set up, you can run VirBinn. The pipeline is designed as a modular process consisting of three main stages: raw contact matrix construction, imputation, and binning. These steps should be run sequentially, as the output of one step is the input for the next.
You can view the overall workflow of VirBinn below:
Before starting the pipeline, ensure your input data meets the following criteria:
- Assembly (FASTA): Your metagenomic assembly.
- Viral List (TXT): A text file containing the headers of identified viral contigs (one per line, no header).
- Hi-C Alignment (BAM): Hi-C reads aligned to your assembly. The BAM file must be sorted by query name to ensure paired-end reads are processed correctly.
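A mismatch between the viral list and the assembly is easy to catch before running anything. The sketch below is an illustration rather than VirBinn code. It assumes that each line of the viral list equals the first whitespace-delimited token of a FASTA header (without the leading `>`), which may not match how VirBinn itself compares names. Both function names are hypothetical.

```python
def fasta_names(fasta_path):
    """Collect contig names: the first whitespace-delimited token after '>'."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def missing_viral_contigs(fasta_path, viral_list_path):
    """Return viral list entries that do not appear in the assembly."""
    known = fasta_names(fasta_path)
    missing = []
    with open(viral_list_path) as handle:
        for line in handle:
            name = line.strip()
            if name and name not in known:  # blank lines are skipped
                missing.append(name)
    return missing
```

Calling `missing_viral_contigs("Test/final.contigs.fa", "Test/viral_contigs.txt")` on the bundled test data should return an empty list if the assumption about naming holds.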
# Sort the alignment file by name using samtools
samtools sort -n -o sorted_by_name.bam coordinate_sorted.bam
The first module calculates physical contact frequencies between contigs based on restriction enzyme cutting sites.
python virbinn.py raw [OPTIONS] FASTA_FILE BAM_FILE OUTDIR
Required Arguments:
- FASTA_FILE: Path to the reference assembly.
- BAM_FILE: Path to the name-sorted Hi-C BAM file.
- OUTDIR: Directory for outputting intermediate matrices.
- -e, --enzyme: The restriction enzyme used in the Hi-C library.
Optional Parameters:
- --min-len: Filter out contigs shorter than this threshold (default: 1000).
- --min-mapq: Minimum mapping quality for a read to be considered (default: 30).
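To make the notion of restriction enzyme cutting sites concrete, the sketch below counts occurrences of a recognition sequence in each contig. It is a generic illustration, not VirBinn's implementation: the function names are hypothetical, and the example uses GATC (the MboI recognition site) only as a familiar case. Because the site is palindromic, scanning one strand is enough. Which enzyme names the `-e` option accepts is not covered here.

```python
def count_sites(sequence, site):
    """Count occurrences of a recognition site, including overlapping ones."""
    sequence = sequence.upper()
    site = site.upper()
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        # Advance by one base so overlapping matches are also found.
        start = sequence.find(site, start + 1)
    return count


def site_counts(contigs, site):
    """Map each contig name to its number of recognition sites."""
    return {name: count_sites(seq, site) for name, seq in contigs.items()}


if __name__ == "__main__":
    toy_contigs = {"contig_a": "AAGATCTTGATC", "contig_b": "ACGTACGT"}
    print(site_counts(toy_contigs, "GATC"))
```

Contigs with very few sites can produce few Hi-C contacts regardless of biology, which is one reason a per-contig site count is a useful summary metric.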
The second module refines the sparse contact matrix using a Random Walk with Restart (RWR) algorithm to recover missing edges.
python virbinn.py impute [OPTIONS] VIRAL_LIST OUTDIR
Required Arguments:
- VIRAL_LIST: Text file listing viral contig names.
- OUTDIR: The output directory (must be the same directory used in Step 1).
Optional Parameters:
- --discard-viral: Threshold for retaining viral edges (default: 50).
- --discard-host: Threshold for retaining host edges (default: 80).
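The idea behind Random Walk with Restart can be shown on a toy contact matrix. The sketch below is a textbook RWR in pure Python, not VirBinn's code: the row normalization, the restart probability of 0.5, the convergence tolerance, and the function names are all assumptions chosen for illustration, and it does not model the `--discard-viral` or `--discard-host` thresholds.

```python
def row_normalize(matrix):
    """Turn raw contact counts into row-stochastic transition probabilities."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def rwr(matrix, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that restarts at `seed`."""
    n = len(matrix)
    transition = row_normalize(matrix)
    start = [0.0] * n
    start[seed] = 1.0
    p = start[:]
    for _ in range(max_iter):
        # Follow an edge with probability (1 - restart), jump back to the seed otherwise.
        nxt = [
            (1 - restart) * sum(p[i] * transition[i][j] for i in range(n))
            + restart * start[j]
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


if __name__ == "__main__":
    # Contigs 0 and 2 share no direct contacts but are both linked to contig 1.
    contacts = [[0, 5, 0], [5, 0, 3], [0, 3, 0]]
    print(rwr(contacts, seed=0))
```

In this example contig 2 receives a non-zero score from seed 0 even though the raw matrix has no 0-2 contact, which is the sense in which diffusion can recover missing edges in a sparse matrix.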
The final module performs clustering using the Leiden algorithm to group viral contigs into genomic bins.
python virbinn.py bin [OPTIONS] FASTA_FILE OUTDIR
Required Arguments:
- FASTA_FILE: The original reference assembly.
- OUTDIR: The output directory (must be the same directory used in Step 1 & 2).
Optional Parameters:
- --output-prefix: A string prefix for the resulting cluster files (default: virbinn).
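Because each stage reads what the previous one wrote to OUTDIR, the three commands can be chained from a small driver script. The sketch below builds the command lines exactly as documented above and runs them in order. It is a convenience wrapper, not part of VirBinn: the function names are hypothetical, the enzyme value is whatever your Hi-C library used, and stopping on failure relies on the assumption that each stage exits with a non-zero status when it fails.

```python
import subprocess
import sys


def build_commands(fasta, bam, viral_list, outdir, enzyme, script="virbinn.py"):
    """Return the three stage commands as argument lists, in run order."""
    python = sys.executable  # use the interpreter of the active environment
    return [
        [python, script, "raw", "-e", enzyme, fasta, bam, outdir],
        [python, script, "impute", viral_list, outdir],
        [python, script, "bin", fasta, outdir],
    ]


def run_pipeline(commands):
    """Run the stages sequentially and stop at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a later stage never runs on incomplete input.
        subprocess.run(command, check=True)
```

Passing the same `outdir` to all three commands is what ties the stages together, so the wrapper takes it once rather than per stage.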
Upon completion, the OUTDIR will contain several key files:
| Folder / File | Description |
|---|---|
| [PREFIX]_VIRAL_BIN/ | A directory containing individual FASTA files for each recovered viral bin. |
| [PREFIX]_clusters.txt | Mapping of each viral contig to its assigned bin. |
| contig_info.csv | Summary metrics (site count, contig length) for all contigs. |
| viral_contig_info.csv | Summary metrics specifically for viral contigs. |
| VirBinn.log | Detailed runtime logs. |
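A quick way to inspect the result is to tabulate how many contigs and bases each bin contains. The sketch below is an illustration with hypothetical function names. It assumes the bin directory holds only uncompressed FASTA files, and makes no assumption about their file extension or naming.

```python
from pathlib import Path


def fasta_stats(path):
    """Return (number of sequences, total bases) for one FASTA file."""
    n_seqs = 0
    n_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                n_seqs += 1
            else:
                n_bases += len(line)
    return n_seqs, n_bases


def summarize_bins(bin_dir):
    """Map each file name in bin_dir to its (contig count, total length)."""
    return {
        path.name: fasta_stats(path)
        for path in sorted(Path(bin_dir).iterdir())
        if path.is_file()
    }
```

With the default prefix, the directory to pass would be the `virbinn_VIRAL_BIN` folder inside OUTDIR, following the naming pattern in the table above.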
