Bioinformatics pipelines, SLURM friendly

Set of high throughput sequencing analysis scripts to quickly generate and queue jobs on SLURM-based HPC clusters, such as Stanford's Sherlock🕵🏻‍♂️️

Most scripts include a failsafe: if a job fails, it is requeued once. This helps recover from unexpected node failures.
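
The exact mechanism varies by script, but the idea can be sketched as a SLURM batch script that resubmits itself at most once, using the SLURM_RESTART_COUNT environment variable; the payload name below is hypothetical.

  #!/bin/bash
  #SBATCH --requeue                 # allow SLURM to requeue this job

  # Hypothetical pipeline step; the real scripts generate their own job files.
  if ! ./run_pipeline_step.sh; then
      # SLURM_RESTART_COUNT is unset on the first attempt and incremented
      # on every requeue, so this resubmits the job at most once.
      if [ "${SLURM_RESTART_COUNT:-0}" -lt 1 ]; then
          scontrol requeue "${SLURM_JOB_ID}"
      fi
  fi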

Currently available pipelines:

  • Whole Exome Sequencing
  • RNA Sequencing
  • CRISPR screens

sh_WES.sh

This script will process fastq(.gz) files and align them to a reference genome using bowtie2. It will then use Picard and GATK, following the June 2016 best practices workflow. It is currently only compatible with GATK version 3.X and has been used extensively with the latest available version, GATK 3.8.1. SNPs will then be annotated using ANNOVAR.

See the WES.ini configuration file for all available options and settings.

Options:
  • --help : Display help message.
  • --version : Display version number.
Dependencies:
  • bowtie2 should be installed on your system in a location included in your $PATH for alignment.
  • Samtools should be installed on your system in a location included in your $PATH.
  • Picard should be installed on your system.
  • GATK 3.X should be installed on your system for variant calling.
  • (optional) Trimmomatic should be installed on your system for read trimming.
  • (optional) FastQC should be installed on your system in a location included in your $PATH for quality control.

Usage

sh_WES.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
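
For orientation only, and not a substitute for the script itself, the core of this workflow roughly corresponds to the commands below. Reference, index and sample names (hg38, sample_R1/R2) are hypothetical, and details such as read groups, indel realignment and base recalibration are omitted; the script also handles SLURM job submission for you.

  # Align reads and sort the output (hypothetical file names).
  bowtie2 -x hg38 -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz | samtools sort -o sample.sorted.bam -
  samtools index sample.sorted.bam

  # Mark duplicates with Picard.
  java -jar picard.jar MarkDuplicates I=sample.sorted.bam O=sample.dedup.bam M=sample.dup_metrics.txt

  # Call variants with GATK 3.X.
  java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg38.fa -I sample.dedup.bam -o sample.vcf

  # Annotate the resulting SNPs with ANNOVAR.
  table_annovar.pl sample.vcf humandb/ -buildver hg38 -out sample -protocol refGene -operation g -vcfinput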

sh_RNAseq.sh

This script will process fastq(.gz) files and align them to a reference genome using either STAR (recommended), hisat2, or tophat2. If STAR is used, RSEM will also be run and differential expression will be analyzed using DESeq2. Differential expression can also be computed using Cufflinks (now largely deprecated; avoid it unless trying to reproduce old results). Local splicing variation can be computed using MAJIQ and/or LeafCutter. If a 4th '[OutputDirName]' argument is provided, only the secondary analyses selected in the config file will be queued, reusing the already aligned and processed files from a previous run, and results will be saved in a '_OutputDirName' directory.

See the RNAseq.ini configuration file for all available options and settings.

Options:
  • --help : Display help message.
  • --version : Display version number.
Dependencies:
  • STAR (recommended) or tophat2 / hisat2 should be installed on your system in a location included in your $PATH for alignment.
  • RSEM should be installed on your system in a location included in your $PATH for transcript quantification.
  • R should be installed on your system in a location included in your $PATH.
  • DESeq2 (recommended) or Cufflinks (deprecated) for differential expression analysis.
  • MAJIQ and/or LeafCutter for local splicing variation detection.
    • This pipeline was created when the LeafCutter documentation (and code) was at commit 249fc26. Since then, the documentation and potentially the code have changed considerably and may not be backward compatible, despite keeping the same 0.2.9 version number.
  • (optional) Trimmomatic should be installed on your system for read trimming.
  • (optional) FastQC should be installed on your system in a location included in your $PATH for quality control.
  • (optional) DESeqAnalysis for advanced graph options downstream of DESeq2 analysis.
  • (optional) cummeRbund for advanced graph options downstream of Cufflinks. It is not run by the pipeline itself, but Cufflinks results will be compatible with cummeRbund.

Usage

sh_RNAseq.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini] [OutputDirName]
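
As an illustration, a first run and a later secondary-analysis-only run might look like this (folder and output names are hypothetical); the STAR and RSEM calls sketched underneath are roughly what the pipeline automates when STAR is selected.

  # Full run: align, quantify and launch the selected secondary analyses.
  sh_RNAseq.sh /scratch/myproject/fastq /scratch/myproject/results RNAseq.ini

  # Re-run only the secondary analyses selected in RNAseq.ini, reusing the
  # existing alignments; results go to a '_DESeq2only' directory.
  sh_RNAseq.sh /scratch/myproject/fastq /scratch/myproject/results RNAseq.ini DESeq2only

  # Roughly equivalent core steps when STAR is selected (hypothetical names):
  STAR --genomeDir star_index --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
       --readFilesCommand zcat --quantMode TranscriptomeSAM --outSAMtype BAM SortedByCoordinate
  rsem-calculate-expression --bam --paired-end Aligned.toTranscriptome.out.bam rsem_ref sample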

sh_CRISPR.sh

This script will process the fastq(.gz) files generated in a typical CRISPR screen using either casTLE or MAGeCK.

  • If using casTLE, a reference file of all the indices will be automatically created using bowtie (NOT bowtie2). It will then analyze the screen and generate basic graphs.
  • If using MAGeCK, count, test, mle, and pathway analyses will be performed. The R package "MAGeCKFlute" will also be run and, in all cases, basic graphs will be generated.

See the CRISPR.ini configuration file for all available options and settings.

Options:
  • --help : Display help message.
  • --version : Display version number.
Dependencies:
  • casTLE and/or MAGeCK.
  • R should be installed on your system in a location included in your $PATH.
  • MAGeCKFlute for downstream analysis if using MAGeCK.
  • csvkit should be installed on your system in a location included in your $PATH.
  • pathos should be installed on your system; it provides ppft, a fork of Parallel Python that works with both Python 2.7 and Python 3.6.

Usage

sh_CRISPR.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
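
For reference, the MAGeCK side of the analysis corresponds roughly to commands like the following (library file, sample names and labels are hypothetical); the script generates and queues these steps for you.

  # Count sgRNA reads per sample (hypothetical library and fastq names).
  mageck count --list-seq library.csv --sample-label control,treatment \
      --fastq control.fastq.gz treatment.fastq.gz -n myscreen

  # Compare treatment vs control.
  mageck test -k myscreen.count.txt -t treatment -c control -n myscreen

  # Maximum-likelihood estimation with a design matrix.
  mageck mle -k myscreen.count.txt -d designmatrix.txt -n myscreen_mle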

Python 3.6 compatibility

For easy integration alongside MAGeCK, or any other modern tool, a Python 3.6+ compatible version of casTLE is included. It is based on casTLE commit 981d6d8 and may not be up to date. You still need to download the whole casTLE repository, even if you end up replacing the scripts with their Python 3.6+ compatible versions. If you previously used casTLE with Python 2.7 and already have Parallel Python installed, you will need to uninstall it before installing pathos.
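
A typical swap, assuming Parallel Python was installed with pip under its usual package name 'pp':

  # Remove the old Parallel Python package, then install pathos (which pulls in ppft).
  pip uninstall pp
  pip install pathos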

sh_md5alldir.sh

This script will process all sub-directories of the input folders and for each of them will create a <directory_name>.md5 file if it does not exist yet, or check files against the existing <directory_name>.md5 file.

Options:
  • -f or --force : even if there is already a .md5 file, it will be replaced by a new .md5 file.
  • --help : Display help message.
  • --version : Display version number.

Usage

sh_md5alldir.sh [OPTIONS] </path/to/dir/>
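
For example, on a results directory (hypothetical paths). Assuming the generated .md5 files use the standard md5sum format, they can also be verified manually:

  # Create (or check against) a <directory_name>.md5 file in each sub-directory.
  sh_md5alldir.sh /scratch/myproject/results

  # Manual spot check of one sub-directory, assuming standard md5sum format:
  cd /scratch/myproject/results/run1 && md5sum -c run1.md5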

sh_sha1alldir.sh

This script will process all sub-directories of the input folders and for each of them will create a <directory_name>.sha1 file if it does not exist yet, or check files against the existing <directory_name>.sha1 file.

Options:
  • -f or --force : even if there is already a .sha1 file, it will be replaced by a new .sha1 file.
  • --help : Display help message.
  • --version : Display version number.

Usage

sh_sha1alldir.sh [OPTIONS] </path/to/dir/>

sh_ACMGfilter.sh

This script will look for an ANNOVAR .snps.exome_summary.csv file and generate a list of all SNPs found in the ACMG guidelines in a new ACMG_genes.csv file. This file can be sent directly to a clinician for incidental findings reports, if required.

Options:
  • --help : Display help message.
  • --version : Display version number.

Usage

sh_ACMGfilter.sh </path/to/.csv/containing/folder> [/path/to/destination/folder]
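
For example (hypothetical paths):

  # Scan the WES results folder for the ANNOVAR summary file and write
  # ACMG_genes.csv to a folder that can be shared with the clinician.
  sh_ACMGfilter.sh /scratch/myproject/results/annovar /scratch/myproject/reports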

sh_mergeFastQ.sh

Simple script to consolidate fragmented .fastq files from different sequencing lanes. Original files will be backed up in a FastQbackup folder.

Options:
  • --help : Display help message.
  • --version : Display version number.

Usage

sh_mergeFastQ.sh </path/to/fastq(.gz)/folder>
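
As an illustration of what consolidating means here (not necessarily the script's exact implementation), per-lane gzipped fastq files can simply be concatenated, since concatenated gzip streams remain a valid gzip file; the script does this for every sample in the folder and keeps the originals in a FastQbackup folder.

  # Hypothetical per-lane files for one sample, merged into a single file.
  cat sample1_L001_R1.fastq.gz sample1_L002_R1.fastq.gz \
      sample1_L003_R1.fastq.gz sample1_L004_R1.fastq.gz > sample1_R1.fastq.gz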

Author(s) contributions

👤 Julien Couthouis

Initial work and releases

👤 Rosa Ma

Local Splicing Variation

Show your support

Give a GitHub star if this project helped you!

License

Copyright © 2019 Julien Couthouis.

This project is EUPL-1.2 licensed.