wfi (WhoFlu IRMA)

Snakemake pipeline for running IRMA. Designed for Influenza and RSV Illumina Sequencing.

Note: This pipeline is not ready for general use. It generally works, but it is very brittle.

Requirements:

  • Linux distribution (or another Unix-like system such as macOS)
  • Conda
  • Snakemake
  • Cutadapt
  • IRMA
  • R
  • R Packages: ggplot2 dplyr stringr tidyr cowplot gridExtra furrr

Installation - short

  1. Install miniconda: https://docs.conda.io/en/latest/miniconda.html
  2. Install the latest mamba:
    conda install -n base -c conda-forge mamba
    
  3. Git clone this repo:
    git clone --depth 1 https://github.com/ammaraziz/wfi
    
  4. Install dependencies:
    cd wfi
    mamba env create -f conda.yaml
    
  5. Activate wfi environment:
    conda activate wfi
    

Installation - long

  1. Install miniconda: https://docs.conda.io/en/latest/miniconda.html
  2. Install snakemake: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
    conda install -n base -c conda-forge mamba
    mamba install -c bioconda -c conda-forge snakemake-minimal python=3.9
    
  3. Install cutadapt, biopython and bbmap:
    mamba install -c bioconda -c conda-forge cutadapt biopython bbmap
    
  4. Install R (>3.6 should work) and R packages:
    mamba install -c conda-forge r-base 
    mamba install -c r r-ggplot2 r-dplyr r-tidyr r-cowplot r-gridExtra r-optparse r-furrr
    

Note: Installing R packages through conda is troublesome for some users. If so, install them manually in R.

  5. Install the custom version of IRMA, which contains the RSV module:
    mamba install -c ammaraziz irma
    
  6. Finally, download this repository and store it in your /bin/

Usage

To use the pipeline, follow these steps:

  1. Navigate to config.yaml and modify as appropriate:

| Params | Values | Information |
|---|---|---|
| input_dir | path | Input directory: location of the raw fastq files for input |
| output_dir | path | Output directory: location to output results, same dir where the config sits |
| second_assembly | True/False | If you suspect mixtures, set to True. It will increase run time substantially |
| subset | True/False | If you are only sequencing HA/NA/MP, set this to True, else leave as False |
| trim_prog | standard/tile | Trimming program to use: tile (bbduk) or standard (cutadapt) |
| trim_org | h1/h3 | Influenza only: flu subtype |
| technology | illumina/ont/pgm | Sequencing technology used; this changes the module used by IRMA |

  2. Check that snakemake is installed. If an error is produced, snakemake was not found or is not installed.
    % snakemake --version
    5.10.0
  3. Test the pipeline with a dry run. This will output all the commands that will be run. Look for errors (red).
    % snakemake -nq
  4. Run the pipeline, using the -j option to specify the number of cores to use.
    % snakemake -j 8
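
Before running the dry run, the values entered in config.yaml can be sanity-checked. The following is a minimal sketch, not part of wfi: `check_config` and `ALLOWED_VALUES` are hypothetical names, the allowed values are copied from the parameter table above, and the trailing '/' rule comes from the troubleshooting notes. It takes the config as a plain dict, so loading the YAML file is left to the reader.

```python
# Hypothetical helper, not part of the wfi repository.
# Allowed values are copied from the config.yaml parameter table.
ALLOWED_VALUES = {
    "second_assembly": (True, False),
    "subset": (True, False),
    "trim_prog": ("standard", "tile"),
    "trim_org": ("h1", "h3"),
    "technology": ("illumina", "ont", "pgm"),
}


def check_config(config):
    """Return a list of problems found in a dict of config.yaml values."""
    problems = []
    # Both directories are needed; the troubleshooting notes ask for a trailing '/'.
    for key in ("input_dir", "output_dir"):
        value = config.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"{key}: missing or not a path")
        elif not value.endswith("/"):
            problems.append(f"{key}: should end with '/'")
    # Assumption: only keys that are present are checked, because the README
    # does not say which of these parameters are optional.
    for key, allowed in ALLOWED_VALUES.items():
        if key in config and config[key] not in allowed:
            problems.append(f"{key}: {config[key]!r} is not one of {allowed}")
    return problems


# Example values are made up for illustration.
example = {
    "input_dir": "/data/run1/raw",
    "output_dir": "/data/run1/",
    "second_assembly": False,
    "subset": False,
    "trim_prog": "tile",
    "trim_org": "h3",
    "technology": "illumina",
}
print(check_config(example))  # ["input_dir: should end with '/'"]
```

An empty list means none of these checks failed. It does not guarantee that the pipeline will run.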

Output structure:

  1. Pipeline will output correctly formatted names located in:

{output_dir}/assemblies/rename/

  2. Sorted by subtype - most likely the desired output:

{output_dir}/assemblies/rename/type/FLU{A|B}

  3. IRMA assembly specific files, see: https://wonder.cdc.gov/amd/flu/irma/output.html

{output_dir}/assemblies/{sampleID}/

  4. Files for depth and summary info located in:

{output_dir}/assemblies/{sampleID}/figures/
{output_dir}/assemblies/{sampleID}/tables/
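
The layout above can be expressed as a small path builder, which may be useful when collecting results in a script. This is a sketch only: `output_locations` is a hypothetical name, and reading `FLU{A|B}` as the two folders `FLUA` and `FLUB` is an assumption.

```python
from pathlib import PurePosixPath


def output_locations(output_dir, sample_id):
    """Build the output paths listed above for one sample (hypothetical helper)."""
    base = PurePosixPath(output_dir) / "assemblies"
    return {
        # Correctly named output files
        "renamed": base / "rename",
        # Assumption: FLU{A|B} expands to the two folders FLUA and FLUB
        "by_type": {t: base / "rename" / "type" / t for t in ("FLUA", "FLUB")},
        # IRMA assembly files for this sample
        "irma": base / sample_id,
        # Depth and summary info
        "figures": base / sample_id / "figures",
        "tables": base / sample_id / "tables",
    }


# "/data/run1/" and "sample01" are made-up example values.
locations = output_locations("/data/run1/", "sample01")
print(locations["tables"])  # /data/run1/assemblies/sample01/tables
```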

Dependencies

  • BLAT for the match step
  • LABEL, which also packages certain resources used by IRMA:
    • Sequence Alignment and Modeling System (SAM) for both the rough align and sort steps
    • Shogun Toolbox, which is an essential part of LABEL, is used in the sort step
  • SSW for the final assembly step, download our minor modifications to SSW
  • samtools for BAM-SAM conversion as well as BAM sorting and indexing
  • GNU Parallel for single node parallelization
  • R and these R packages: optparse, ggplot2, dplyr, tidyr, stringr, cowplot, gridExtra

Troubleshooting problems:

1. Error regarding path directories
   Check that the input and output directories you specified end with a '/'.

2. Error: Nothing to be done
   Check the config file and ensure you have changed the input/output directories.

3. A job crashed. What do I do?
   There are two options. Either delete the output directory so snakemake can run everything again,
   or find out where it crashed and delete only that sample's folder.
   For example, IRMA sometimes produces errors: find the sample that crashed,
   go to assemblies, delete the corresponding {sampleID} folder, then rerun snakemake.

4. I'm very confused, I need more help, or I've screwed something up badly!
   Shoot me an email.
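
The second option in item 3 (deleting only the crashed sample's folder) can be scripted. The following is a minimal sketch, assuming the `{output_dir}/assemblies/{sampleID}/` layout described under Output structure. `reset_sample` is a hypothetical name and is not part of wfi. The deletion is permanent, so check the path before using anything like this.

```python
import shutil
from pathlib import Path


def reset_sample(output_dir, sample_id):
    """Delete {output_dir}/assemblies/{sample_id}/ so snakemake can rebuild it.

    Hypothetical helper. Returns True if a folder was removed, False if none existed.
    """
    # Refuse anything that is not a single folder name, to avoid deleting
    # a parent directory by accident.
    if sample_id in ("", ".", "..") or Path(sample_id).name != sample_id:
        raise ValueError("sample_id must be a single folder name")
    target = Path(output_dir) / "assemblies" / sample_id
    if not target.is_dir():
        return False
    shutil.rmtree(target)  # permanent: removes the folder and all its contents
    return True
```

After removing the folder, rerun snakemake as described under Usage.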

For any issues, please submit a GitHub issue.