Generate metrics and summary statistics for paired-end Illumina bacterial whole-genome sequencing (WGS) FASTQ data. The pipeline was originally developed on the Galaxy platform, and the Galaxy workflow is also available.
Install the following tools from the Galaxy Tool Shed:
- trimmomatic
- microrunqc
Import the MicroRunQC workflow. The workflow is intended for paired collections of FASTQ files.
- SKESA
  - Strategic k-mer extension for scrupulous assemblies
- mlst
  - Scan contig files against traditional PubMLST typing schemes
- trimmomatic
  - A flexible read trimming tool for Illumina NGS data
- bwa
  - BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome
- fastq-scan
  - Reads a FASTQ file and outputs summary statistics (read lengths, per-read qualities, per-base qualities)
Create a conda environment.
% conda create --name microrunqc
% conda activate microrunqc
Install the dependencies using Conda and Bioconda.
% conda install -c conda-forge -c bioconda -c defaults mlst skesa trimmomatic bwa fastq-scan
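After installation, it can be useful to confirm that the dependencies are visible from the activated environment. The following is a minimal sketch, not part of MicroRunQC; it assumes each executable has the same name as its conda package, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumption: each executable is named after the conda package that provides it.
REQUIRED_TOOLS = ["mlst", "skesa", "trimmomatic", "bwa", "fastq-scan"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

Run it inside the `microrunqc` environment; any tool it lists probably needs to be reinstalled.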
% cd $HOME
% git clone https://github.com/estrain/MicroRunQC.git
% export PATH=$PATH:$HOME/MicroRunQC/bin
% chmod a+x $HOME/MicroRunQC/bin/*
% microrunqc.py --help
% microrunqc.py --forward forward.fastq.gz --reverse reverse.fastq.gz --cores 12 --output example
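To process several samples, the command above can be wrapped in a small driver script. This is a sketch under stated assumptions, not something shipped with MicroRunQC: it assumes read files are named with `_R1` and `_R2` tags and end in `.fastq.gz`, and the helper names (`find_pairs`, `build_command`, `run_all`) are hypothetical. It uses only the flags shown in the example invocation.

```python
import subprocess
from pathlib import Path


def find_pairs(directory, r1_tag="_R1", r2_tag="_R2", suffix=".fastq.gz"):
    """Yield (sample, forward, reverse) for each R1 file with a matching R2 file.

    The naming convention (_R1/_R2 tags) is an assumption; adjust it to your data.
    """
    for forward in sorted(Path(directory).glob(f"*{r1_tag}*{suffix}")):
        reverse = forward.with_name(forward.name.replace(r1_tag, r2_tag, 1))
        if reverse.exists():
            sample = forward.name.split(r1_tag)[0]
            yield sample, forward, reverse


def build_command(forward, reverse, output, cores=12):
    """Build the microrunqc.py argument list from the flags shown above."""
    return [
        "microrunqc.py",
        "--forward", str(forward),
        "--reverse", str(reverse),
        "--cores", str(cores),
        "--output", str(output),
    ]


def run_all(directory, cores=12):
    """Run microrunqc.py once per read pair, stopping at the first failure."""
    for sample, forward, reverse in find_pairs(directory):
        subprocess.run(build_command(forward, reverse, sample, cores), check=True)
```

Calling `run_all("reads/")` would run one sample at a time, using the sample name as the `--output` value; unpaired files are skipped.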
The output is a tab-delimited file with the following columns.
Column | Description |
---|---|
File | Input filename passed to SKESA, taken from the forward read file. |
Contigs | Number of contigs in the de novo SKESA assembly. Contigs smaller than 200 base pairs (bp) are not counted. |
Length | Total length of all contigs > 200 bp. This should approximate the size of the genome for the target organism. |
EstCov | Mean coverage for contigs in the assembly as reported by SKESA. |
N50 | Length of the shortest contig in the set of longest contigs that together cover at least 50% of the total assembly length. |
MedianInsert | Median insert size, i.e. the distance spanned by the forward and reverse reads of a pair. Calculated by mapping reads to the SKESA assembly using bwa. |
MeanLength_R1 | Mean length of the forward reads. |
MeanLength_R2 | Mean length of the reverse reads. |
MeanQ_R1 | Mean Q-score of the forward reads. |
MeanQ_R2 | Mean Q-score of the reverse reads. |
Scheme | PubMLST database scheme (e.g. senterica for Salmonella enterica). |
ST | Sequence Type |
Loci | gene (allele number) – for example aroC(118) |
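
The N50 column can be made concrete with a short worked example. The sketch below follows the standard N50 definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly length. It is not taken from the MicroRunQC source; the function name `n50` and the choice to apply the 200 bp cutoff before computing N50 are assumptions.

```python
def n50(contig_lengths, min_length=200):
    """Return the N50 of the contigs that are at least `min_length` bp long.

    Assumption: the 200 bp cutoff described for the Contigs column is
    applied before N50 is computed.
    """
    lengths = sorted(
        (length for length in contig_lengths if length >= min_length),
        reverse=True,
    )
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    return 0  # no contigs passed the length filter


# Contigs of 500, 400, 300 and 200 bp are kept (1400 bp in total); the 100 bp
# contig is dropped. The running total reaches 700 bp at the 400 bp contig.
print(n50([500, 400, 300, 200, 100]))  # 400
```

A larger N50 for a similar total length generally indicates a less fragmented assembly.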