
ayoraind/assembly


GHRU SPAdes Assembly workflow

This pipeline was cloned from the CGPS assembly pipeline written by Anthony Underwood. I added the option of running it with the conda profile, rather than with docker or singularity alone, because I don't have root access to the server I currently use (2023). This required small changes to the code in the modules directory and to some lines in the config files.

Before running the pipeline, download the confindr database into a directory of your choice, then decompress and untar it. Example commands are shown below:

curl https://gitlab.com/cgps/ghru/pipelines/dsl2/pipelines/assembly.git/gitlab-lfs/objects/87baec8d61603b511c8f26a92ed67b95974c2ade68bd0198dbe0cfe42c426b48 -o confindr_database.tar.gz

tar xfz /path/to/confindr_database.tar.gz && rm /path/to/confindr_database.tar.gz && chmod -R o+w /path/to/confindr_database
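The long hexadecimal string in the download URL looks like a Git LFS object ID, which is the SHA-256 digest of the file content, so it can probably double as a checksum for the download. The sketch below rests on that assumption. verify_sha256 is a hypothetical helper, not part of the pipeline, and sha256sum is assumed to be installed (GNU coreutils; on macOS, shasum -a 256 is the usual equivalent).

```shell
# Hypothetical helper: succeed only if FILE's SHA-256 digest equals EXPECTED.
verify_sha256() {
    file="$1"
    expected="$2"
    # sha256sum prints "<digest>  <file>"; keep only the digest
    actual="$(sha256sum "$file" | awk '{print $1}')"
    [ "$actual" = "$expected" ]
}

# Example use after the curl command above (not run here):
# verify_sha256 confindr_database.tar.gz \
#     87baec8d61603b511c8f26a92ed67b95974c2ade68bd0198dbe0cfe42c426b48 \
#     && tar xfz confindr_database.tar.gz
```

Unpacking only when the check succeeds avoids extracting a truncated or corrupted archive.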

This Nextflow workflow processes short-read fastq files through an assembly pipeline using the SPAdes assembler. Alongside this, it QCs the reads before and after trimming and QCs the final assembled scaffolds file using Quast. The pipeline is based on Shovill (thanks to Torsten Seemann @torstenseemann): https://github.com/tseemann/shovill

Authors

Anthony Underwood @bioinformant au3@sanger.ac.uk
Varun Shammana @varunshamanna varunshamanna4@gmail.com
Ayorinde Afolayan @Ayorinde_Afo afolayanayorinde@gmail.com
Erkison Odih @bioinfo_erkison erkisonodih@gmail.com
Angela Sofia Garcia as.garciav@uniandes.edu.co
Felipe Delgadillo Barrera f.delgadillo2628@uniandes.edu.co
Oscar Gabriel Beltran
Johan Fabian Bernal johan.bernal.morales@gmail.com

Instructions

The dependencies are provided in a Docker image

docker pull registry.gitlab.com/cgps/ghru/pipelines/dsl2/pipelines/assembly:latest

Typically the workflow should be run as follows

nextflow run main.nf [options] -resume 

To run the test sets either of the following commands will work

  • Using paired end reads and no down sampling
    nextflow run main.nf --input_dir test_input --output_dir test_output --fastq_pattern "*{R,_}{1,2}*.fastq.gz" --adapter_file adapters.fas --qc_conditions qc_conditions_nextera_relaxed.yml 
    
  • Using single end reads and down sampling to a depth cutoff of 50
    nextflow run main.nf  --input_dir test_input --output_dir test_output --fastq_pattern "*{R,_}*.fastq.gz" --adapter_file adapters.fas --qc_conditions qc_conditions_nextera_relaxed.yml --depth_cutoff 50 --single_end -resume
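
The --fastq_pattern value is a glob, so it can be worth previewing which file names it would pick up before starting a run. The sketch below approximates the paired end pattern "*{R,_}{1,2}*.fastq.gz" by spelling out its four brace alternatives in a POSIX case statement. It assumes that Nextflow treats {R,_}{1,2} as alternation. matches_paired_pattern is a hypothetical helper and not part of the pipeline.

```shell
# Hypothetical helper: succeed if a file name matches one of the four
# alternatives of "*{R,_}{1,2}*.fastq.gz" (R1, R2, _1 or _2 before the suffix).
matches_paired_pattern() {
    case "$1" in
        *R1*.fastq.gz|*R2*.fastq.gz|*_1*.fastq.gz|*_2*.fastq.gz) return 0 ;;
        *) return 1 ;;
    esac
}

# Preview the files in a directory that would be picked up (not run here):
# for f in test_input/*; do
#     matches_paired_pattern "$(basename "$f")" && echo "$f"
# done
```

A broad pattern can match more than intended: a name such as ERR123.fastq.gz contains "R1" and so also matches.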
    

The mandatory options that should be supplied are

  • A source of the fastq files, specified as follows
    • local files on disk using the --input_dir and --fastq_pattern arguments
  • The output from the pipeline will be written to the directory specified by the --output_dir argument
  • The path to a fasta file containing adapter sequences to trim from reads, specified by the --adapter_file argument

Optional arguments include

  • --single_end There is only one read file per sample
  • --kmer_min_copy Define a hard cutoff for the minimum number of copies of a kmer to be included in mash sketch
  • --depth_cutoff Downsample each sample to an approximate depth of the value supplied, e.g. 50 means downsample to 50x depth of coverage. If not specified, no downsampling will occur
  • --careful Turn on the SPAdes careful option which improves assembly by mapping the reads back to the contigs
  • --minimum_scaffold_length The minimum length of a scaffold to keep. Others will be filtered out. Default 500
  • --minimum_scaffold_depth The minimum depth of coverage a scaffold must have in order to be kept. Others will be filtered out. Default 3
  • --confindr_db_path The path to the confindr database. If not set, the path is assumed to be "${baseDir}/Docker/confindr_database". Here ${baseDir} is the file path to the nextflow pipeline.
  • --qc_conditions Path to a YAML file containing pass/warning/fail conditions used by QualiFyr. Example files are provided in the repository, including one more suitable for reads generated from a Nextera library preparation
  • --prescreen_genome_size_check Size in bp of the maximum estimated genome to assemble. Without this any size genome assembly will be attempted
  • --prescreen_file_size_check Minimum size in Mb for the input fastq files. Without this any size of file will be attempted (this and prescreen_genome_size_check are mutually exclusive)
  • --skip_quast_summary Large numbers of assemblies may cause quast summary to hang. Use this parameter to skip this step
  • --full_output Output pre_trimming fastqc reports, merged_fastqs and corrected_fastqs. These take up significant disk space and so are not copied to the output_dir by default
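
To see how several of these optional arguments combine on one command line, the sketch below prints a candidate command without running it. build_assembly_cmd is a hypothetical wrapper, and the directory names reads and results are placeholders. The flags are the ones listed above, with the scaffold filters set to their stated defaults.

```shell
# Hypothetical wrapper: print (do not run) a nextflow command line that
# combines a few of the optional arguments described above.
build_assembly_cmd() {
    input_dir="$1"
    output_dir="$2"
    depth="$3"
    # Each argument is printed followed by a single space
    printf '%s ' nextflow run main.nf \
        --input_dir "$input_dir" \
        --output_dir "$output_dir" \
        --fastq_pattern "'*{R,_}{1,2}*.fastq.gz'" \
        --adapter_file adapters.fas \
        --depth_cutoff "$depth" \
        --minimum_scaffold_length 500 \
        --minimum_scaffold_depth 3 \
        --careful
    printf '%s\n' -resume
}

build_assembly_cmd reads results 50
```

Printing the command first makes it easy to inspect or log the exact options before a long run is started.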

Workflow process

The workflow consists of the following steps

  1. QC reads using FastQC before trimming
  2. Trim reads using trimmomatic (dynamic MIN_LEN based on 30% of the read length) and Cutadapt (if --cutadapt specified)
  3. QC reads using FastQC after trimming
  4. Correct reads using lighter
  5. Check for contamination using confindr
  6. Count number of reads and estimate genome size using Mash
  7. Downsample reads if the --depth_cutoff argument was specified
  8. Merge reads using Flash where the insert size is small
  9. Assemble reads using SPAdes (by default the --careful option is turned off)
  10. Assess species identification using bactinspector
  11. Assess assembly quality using Quast
  12. Summarise all assembly QCs using Quast
  13. (Optional, if a QualiFyr qc conditions YAML file is supplied) Filter assemblies into three directories: pass, warning and failure, based on QC metrics
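
Two of the steps above involve simple arithmetic, sketched below. Both functions are hypothetical illustrations rather than the pipeline's own code. The first assumes that the dynamic MIN_LEN is 30% of the read length rounded down. The second assumes the Shovill-style approach, in which depth is estimated as total bases divided by genome size and reads are subsampled by cutoff/depth only when the depth exceeds the cutoff.

```shell
# Assumed reading of step 2: minimum read length kept after trimming is
# 30% of the read length, rounded down by integer arithmetic.
min_len() {
    read_length="$1"
    echo $(( read_length * 30 / 100 ))
}

# Assumed reading of step 7: depth = total bases / genome size; subsample
# by cutoff / depth only when the estimated depth exceeds the cutoff.
downsample_fraction() {
    awk -v bases="$1" -v genome="$2" -v cutoff="$3" 'BEGIN {
        depth = bases / genome
        if (depth > cutoff) printf "%.3f\n", cutoff / depth
        else printf "%.3f\n", 1
    }'
}

min_len 150                                 # prints 45
downsample_fraction 500000000 5000000 50    # prints 0.500
```

In the second call, 500 Mb of sequence over a 5 Mb genome gives an estimated depth of 100x, so a cutoff of 50 keeps roughly half of the reads.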

A summary of this process is shown below

pipeline diagram

A full DAG of the processes generated when running Nextflow can be seen below

pipeline dag

Workflow outputs

These will be found in the directory specified by the --output_dir argument

  • A directory called fastqc/post_trimming that contains the FastQC reports for each fastq in html format after trimming
  • A directory called assemblies containing the final assembled scaffold files named as <SAMPLE NAME>_scaffolds.fasta. If the qc_conditions argument was given there will be subdirectories named pass, warning and failure where the appropriately QCed scaffolds and failure reasons will be stored.
  • A directory called quast containing
    • A summary quast report named combined_quast_report.tsv
    • A transposed summary quast report with the samples as rows so that they can be sorted based on the quast metric in columns named combined_quast_report.tsv
  • A directory called quality_reports containing html reports
    • MultiQC summary reports combining QC results for all samples from
      • FastQC: fastqc_multiqc_report.html
      • Quast: quast_multiqc_report.html
    • QualiFyr reports. If a qc_conditions.yml file was supplied reports will be generated that contain a summary of the overall pass/fail status of each sample.
  • if the --full_output parameter is given then the following will also be available in the output directory
    • A directory called fastqc/pre_trimming that contains the FastQC reports for each fastq in html format prior to trimming
    • A directory called corrected_fastqs that contains the fastq files that have been trimmed with Trimmomatic and corrected using Lighter
    • If using paired end reads, a directory called merged_fastqs that contains the fastq files that have been merged using Flash. There will be files called
      • <SAMPLE NAME>.extendedFrags.fastq.gz merged reads
      • <SAMPLE NAME>.notCombined_1.fastq.gz unmerged read 1 reads
      • <SAMPLE NAME>.notCombined_2.fastq.gz unmerged read 2 reads
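
When --qc_conditions is supplied, a quick tally of how many assemblies landed in each QC subdirectory can be useful after a run. The sketch below assumes the layout described above (assemblies/pass, assemblies/warning and assemblies/failure, each holding <SAMPLE NAME>_scaffolds.fasta files). summarise_qc_dirs is a hypothetical helper and not part of the pipeline.

```shell
# Hypothetical helper: count scaffold files per QC outcome under OUTPUT_DIR.
# A missing subdirectory is reported as a count of 0.
summarise_qc_dirs() {
    out_dir="$1"
    for qc in pass warning failure; do
        n=$(find "$out_dir/assemblies/$qc" -name '*_scaffolds.fasta' 2>/dev/null \
            | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$qc" "$n"
    done
}

# Example (not run here): summarise_qc_dirs test_output
```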

Software used within the workflow

  • FastQC A quality control tool for high throughput sequence data.
  • Trimmomatic A flexible read trimming tool for Illumina NGS data.
  • Cutadapt Finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from high-throughput sequencing reads
  • mash Fast genome and metagenome distance estimation using MinHash.
  • lighter Fast and memory-efficient sequencing error corrector.
  • seqtk A fast and lightweight tool for processing sequences in the FASTA or FASTQ format.
  • FLASH (Fast Length Adjustment of SHort reads) A very fast and accurate software tool to merge paired-end reads from next-generation sequencing experiments.
  • SPAdes A genome assembly algorithm designed for single-cell and multi-cell bacterial data sets.
  • contig-tools A utility Python package to parse multi fasta files resulting from de novo assembly.
  • Quast A tool to evaluate the quality of genome assemblies.
  • ConFindr Software that can detect contamination in bacterial NGS data, both between and within species.
  • QualiFyr Software to give an overall QC status for a sample based on multiple QC metric files
  • MultiQC Aggregate results from bioinformatics analyses across many samples into a single report
  • KAT The K-mer Analysis Toolkit (KAT) contains a number of tools that analyse and compare K-mer spectra
  • BactInspector Software using an updated refseq mash database to predict species

Test that the pipeline and Docker dependency are installed correctly

Test command for paired end reads

nextflow run main.nf --input_dir small_test_input --output_dir test_output --fastq_pattern '*{R,_}{1,2}.fastq.gz' --adapter_file adapters.fas  --qc_conditions qc_conditions_nextera.yml --full_output --cutadapt -resume

Test command for single end reads

nextflow run main.nf --input_dir small_test_input --output_dir test_output --fastq_pattern '*{R,_}1.fastq.gz' --adapter_file adapters.fas --qc_conditions qc_conditions_nextera.yml --full_output  --cutadapt --single_end -resume
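
Before running either test command, it may help to confirm that the required executables are on the PATH. The sketch below is a minimal pre-flight check. missing_tools is a hypothetical helper, and the tool names passed to it (nextflow and docker in the commented example) are assumptions that depend on which profile is used.

```shell
# Hypothetical pre-flight check: print each required tool that is missing
# from PATH and return non-zero if at least one is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example (not run here): missing_tools nextflow docker || exit 1
```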
