Skip to content

filiperomero2/ViralUnity

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

47 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ViralUnity

ViralUnity is yet another simple tool to perform data quality control and viral genome reference assembly from Illumina paired-end reads. It is designed to be effective and easy to use. It runs on *nix systems and is able to process complete Illumina MiSeq runs in a couple of hours on a regular laptop.

Installation

ViralUnity is distributed as single bash script. Nevertheless, it relies on diverse dependencies and all of them have been conveniently documented on a conda environment YAML file. From this file one can easily install required softwares and run the pipeline.

To enable ViralUnity, clone the repo and create the environment:

$ git clone https://github.com/filiperomero2/ViralUnity.git
$ cd ViralUnity/
$ conda env create --file ENV/viralunity.yml

Alternatively, in case of compatibility issues between OS and versions specified in the yaml file, one may create the environment and download dependencies manually from the bioconda channel:

$ conda create --name viralunity
$ conda activate viralunity
$ conda install -c bioconda fastqc multiqc trimmomatic minimap2 samtools bcftools bedtools -y

To make the script available throughout the system, one may add it to the path:

$ cd SCRIPT
$ my_local=(pwd)
$ export PATH="$my_local:$PATH" # to make this permanent, one needs to edit the ~/.bashrc file

Keep in mind this pipeline has been developed and tested with the following dependencies versions:

  • fastqc v0.11.9
  • multiqc v1.9
  • trimmomatic v0.39
  • minimap2 v2.17
  • samtools v1.11
  • bcftools v1.11
  • bedtools v2.30.0

Usage

Arguments and data organization requirements

ViralUnity was designed to be as simple as possible, with the objective of making users go from raw reads to processed consensus genome sequences for entire batches of samples with a single command line. The script may take ten arguments, being only the first four strictly required:

--input - The absolute path for samples root directory;
--output - The absolute path for output directory;
--reference - The absolute path for a reference genome in fasta format;
--adapters - The absolute path for trimmomatic adapters fasta file;
--minimum_depth - The minimum sequencing depth necessary to incorporate a base into the consensus sequence (default = 100);
--minimum_length - Minimum read length (default = 50);
--trim - Number of bases to trim from the start of the read, useful for primer sequences removal (Odefault = 30);
--threads - Number of threads available for processing (default = 1);
--version - exhibits the software verion;
--help - exhibits a help message.

The first argument may demand clarification. To be able to analyze entire sequencing runs from a single command line, ViralUnity needs data to be stored in a well specified structure of directories. The path in argument 1 refers to a directory that harbors samples' directories, each containing two fastq files (R1 and R2 reads), like in the example bellow:

(base) username@DESKTOP:~/Desktop/pilot/DATA$ tree
.
├── SEQ_176_L001_ds.5c9c29d7d8824301ae5a4ddee48ee059
│   ├── SEQ-176_S74_L001_R1_001.fastq.gz
│   └── SEQ-176_S74_L001_R2_001.fastq.gz
└── SEQ_180_L001_ds.b269a1119df645ad8c2876cd312620a3
    ├── SEQ-180_S35_L001_R1_001.fastq.gz
    └── SEQ-180_S35_L001_R2_001.fastq.gz

2 directories, 4 files

Importantly, in case this directory organization is not strictly available, the pipeline will fail. Any directory outside these specifications on the root directory will halt execution. Each sample directory is expected to only have two (R1 and R2) fastq files (with .fastq or .fastq.gz extensions). Sample names should not include the underline character (_).

If the structure is correct, the script iterates over these directories, performing QC (trimmomatic), read mapping (minimap2), variant calling (bcftools) and consensus sequence inferences (bcftoools/bedtools). QC reports are generated with fastQC and multiQC. The results from all these analysis are stored in the specified output directory.

Run

To run the pipeline, just activate the conda environment and launch the analysis:

$ conda activate viralunity
$ cd ViralUnity/SCRIPT/
$ ./viralunity.sh --input ~/LIBRARIES/RUN_1/ --output ~/ANALYSIS/RUN1/ --reference ~/REFERENCE_GENOMES/reference.fasta --adapters ~/trimmomatic/adapter.fa

One may also specify other parameters:

$ ./viralunity.sh --input ~/LIBRARIES/RUN_1/ --output ~/ANALYSIS/RUN1/ --reference ~/REFERENCE_GENOMES/reference.fasta --adapters ~/trimmomatic/adapter.fa --minimum_depth 200 --minimum_length 40 --trim 20 --threads 6

The output directory contains 2 report files, one with assembly statistics and other with a timestamp for each sample processing. In addition, three directories are also created, comprehending QC reports for raw and filtered data, mapping and variants associated files and consensus sequences.

Notes

Note on segmented viruses

Even though the pipeline has been originally designed to handle non-segmented viruses, it can be naively used to assemble segmented genomes. One just needs to specify one genomic segment as reference at a time. This will automatically create output directory for each segment, which can be analyzed in downstream workflows.

Note on primer sequences removal

The removal of primer associated SNPs is a mandatory step in processing sequences generated from targetted sequencing approaches. While this pipeline does not use any tool that specifically query for primer sequences, its default mode trim out the initial 30 bp of all reads with trimmomatic. As primer sequences hardly extends for more than 30 bp, this step avoids the introduction of artefactual SNPs. Users analyzing data generated under alternative protocols (e.g., metagenomics) may want to change this behavior by setting the value of --HEADCROP to 0 (zero).

Citation

If you use this pipeline, please cite this github repo and also:

Trimmomatic: Anthony M. Bolger, Marc Lohse, Bjoern Usadel. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014; 30(15):2114–212.

Fastqc

Multiqc: Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016; 32(19):3047–8.

Minimap2: Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018; 34:3094-3100.

Samtools: Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25(16):2078–9.

BCFtools: Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011; 27(21):2987–2993.

BEDtools: Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; 26(6):841–2.

For visualization of fasta and bam files, we recommend Aliview and Tablet, respectively.

About

ViralUnity is a pipeline for quality control and assembly of viral genomes from Illumina sequencing data.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages