Skip to content

ccb-hms/MutantHuntWGS

 
 

Repository files navigation

MutantHuntWGS

A bioinformatics pipeline for identification and characterization of mutations in Saccharomyces cerevisiae. MutantHuntWGS compares data, input in FASTQ format, from a mutant strain to a wild-type strain to identify high confidence sequence variants present only in the mutant. This pipeline was designed to be as user friendly as possible.

picture alt

MutantHuntWGS_O2 available (4/5/23) modified MutantHuntWGS_v1.1 pipeline for use on HMS O2 cluster

Why v1.1?

1. One of our users identified an error coming from VCFtools that (simply put) was causing the pipeline to fail if there were not variants on all chromosomes in the wild type. This has now been corrected. We would like to thank Payal Arora, a graduate student in the Kaplan lab here at UPitt, for here help identifying this issue and for graciously offering us her solution to the problem so that we could update MutantHuntWGS to ensure others did not experience this same issue.

2. The aligner has the capability to use multiple processors and was set to the maximum number under the assumption that it would default to the max that was available, but we believe that under these conditions it is actually defaulting to the mininmum. Therefore, we created an option for number of processors/threads -t that can be set to an integer value to specify how many processors to use. This should speed up the analysis. On your average laptop you will only be able to set it to 4 at max and we reccomend only setting it to 3.

How do I run v1.1?

Like this:

MutantHuntWGS_v1.1.sh \
    -n wttoy \
    -g /Main/MutantHuntWGS/S_cerevisiae_Bowtie2_Index_and_FASTA/genome \
    -f /Main/MutantHuntWGS/S_cerevisiae_Bowtie2_Index_and_FASTA/genome.fa \
    -r single \
    -s 0 \
    -p /Main/MutantHuntWGS/S_cerevisiae_Bowtie2_Index_and_FASTA/ploidy_n1.txt \
    -d /Main/MutantHuntWGS/FASTQ_test \
    -o /Main/Analysis_Directory/test_output \
    -a YES \
    -t 3

Setup and Run on O2

1. Create a directory (a folder) in a directory of your choice (on your Desktop is fine) named ./Analysis_Directory. Within ./Analysis_Directory create a file directory named ./FASTQ inside of ./Analysis_Directory and place all FASTQ files into it (full path: ./PATH_TO_DESKTOP/Analysis_Directory/FASTQ).

2. Ensure that FASTQ files are gzipped (can run gzip FILENAME.fastq to generate FILENAME.fastq.gz) and adhere to the naming convention described below and that all gzipped FASTQ files (fastq.gz) are placed into the FASTQ folder that you created. THIS IS REALLY IMPORTANT.

  • Single end sequencing FASTQ file should be named: FILENAME.fastq.gz

  • Paired end sequencing FASTQ files should be named: FILENAME_R1.fastq.gz and FILENAME_R2.fastq.gz

  • For this naming example "FILENAME" will be used as the input for the -n option below.

  • "FILENAME" should not have any spaces or punctuation, not even underscores.

3. Modify the driver script SLURM_MutantHunt_run.sh with the paths to your input FASTQ files and desired output directory, see the example below

#!/bin/bash
#SBATCH --partition=short   #choose a partition to run on
#SBATCH -c 16 #choose how many cores
#SBATCH --time=0-5:00:00 #estimated run time
#SBATCH --mem=16G #choose how much ram RAM 
#SBATCH --mail-type=ALL #ALL email types
#SBATCH --mail-user=<YOURNAMEHERE>@hms.harvard.edu #where to send email updates

#this line loads the software the pipeline needs to run
module load gcc/6.2.0 bowtie2/2.2.9 samtools/1.3.1  bcftools/1.3.1 vcftools/0.1.15 java/jdk-1.8u112 snpEff/4.3g qualimap/2.2.1 bedtools/2.27.1

#this line runs the script
<PATH TO THIS REPO>/MutantHuntWGS_o2.sh \
    -n sample-# \
    -g <PATH TO THIS REPO>/S_cerevisiae_Bowtie2_Index_and_FASTA/genome \
    -f <PATH TO THIS REPO>/S_cerevisiae_Bowtie2_Index_and_FASTA/genome.fa \
    -r single \
    -s 10 \
    -p <PATH TO THIS REPO>/S_cerevisiae_Bowtie2_Index_and_FASTA/ploidy_n1.txt \
    -d <PATH TO YOUR OUTPUT>/Analysis_Directory/FASTQ \
    -o <PATH TO YOUR OUTPUT>/Analysis_Directory/<NAME OF EXPERIMENT> \
    -a YES \
    -t 16

date

5. Submit the pipeline to the SLURM scheduler

sbatch SLURM_MutantHunt_run.sh

MutantHuntWGS Options

All options are required

-n

The -n option takes the prefix of the FASTQ file name for the wild-type strain. For the example of FILENAME.fastq or FILENAME_R1.fastq this prefix would simply be "FILENAME".

-g

The -g option takes the file PATH to the bowtie index files and the file prefix (genome). Use exactly what is shown above for this command.

-f

The -f option takes the file PATH and file name of the genome FASTA file (genome.fa) Use exactly what is shown above for this command.

-r

The -r option specifies whether the input data contains paired-end or single-end reads and can take values of "paired" or "single".

-s

The -s option takes a score cutoff for the variant scores. This score is calculated by the following formula: -10 * log10(P) where P is the probability that the variant call (ALT) in the VCF file is wrong.

So a score of:

10 means a P of 0.1 and a 10% chance the ALT is wrong

20 means a P of 0.01 and a 1% chance the ALT is wrong

30 means a P of 0.001 and a 0.1% chance the ALT is wrong

40 means a P of 0.0001 and a 0.01% chance the ALT is wrong

50 means a P of 0.00001 and a 0.001% chance the ALT is wrong

100 means a P of 0.000000001 and a 0.0000001% chance the ALT is wrong

-p

The -p option takes the file PATH and file name of the ploidy file (genome.fa) Use exactly what is shown above for this command.

Ploidy files are available for haploid (ploidy_n1.txt) and diploid (ploidy_n2.txt) and are in the following format:

chr start end sex ploidy

ploidy_n1.txt:

chrI	1	230218	M	1
chrII	1	813184	M	1
chrIII	1	316620	M	1

ploidy_n2.txt

chrI	1	230218	M	2
chrII	1	813184	M	2
chrIII	1	316620	M	2

so if you wanted to account for a ploidy of 2 on chromosome II in an otherwise haploid yeast strain you should be able to edit the file like this:

chrI	1	230218	M	1
chrII	1	813184	M	2
chrIII	1	316620	M	1

A sex of M or male was arbitrarily chosen and the MutantHuntWGS program is expecting that so it cannot be changed without editing MutantHuntWGS.sh.

-d

Directory containing your FASTQ files. If you set things up in the way that the instructions outline above this should stay the same as the example: /PATH_TO_DESKTOP/Analysis_Directory/FASTQ. Use exactly what is shown above for this command.

-o

This allows you to specify a folder for your data to output to. This should be structured like the example /Main/Analysis_Directory/NAME_YOUR_OUTPUT_FOLDER except you will come up with a descriptive name to replace the NAME_YOUR_OUTPUT_FOLDER part of the file PATH.

-a

This allows you to turn on and off the alignment and calling step. So if you have already aligned reads and called variants and all that you want to do is reanalyze with a different score cuttoff then you can set this to "NO", but if you are starting from FASTQ files that have not gone throught this process yet you set this to "YES"

-t (version 1.1 only)

This allows you to set a number of concurrent threads that will be used when running bowtie2. This is equivolent to setting -p/--threads in bowtie2.

Understanding the MutantHuntWGS output

Directory organization

The image below shows an example of the directory I set up on my Desktop when testing the software. I named the directory ./Analysis_Directory.tmp and placed my test FASTQ files into two folders ./FASTQ_toy_paired and ./FASTQ_toy_single which each have examples of proper naming for FASTQ files from paired-end and single-end reads. The rest of the folders and files you see are the output of the MutantHuntWGS pipeline. I have expanded the directory for the test of single end read containing FASTQ files to show the full output of the program.

picture alt

Folder and File contents

params.txt

A .txt file containing the input parameters passed to the script for that particular run.

Alignment_Stats

Contains TXT files with the stats that Bowtie2 normally would print to the screen. This tells you how many reads you had in your input file and how well they aligned to the reference genome. One file will be generated per sample.

BAM

Contains alignment results in BAM format (.bam) as well as index files for each BAM (.bai). One file will be generated per sample.

VCF

Contains variant calles in VCF format. Files labled SAMPLE_variants.vcf contain all variants called. One of these will be generated for every sample. SAMPLE_variants_filtered.vcf contains the variants for each mutant after comparing to the wild-type. SAMPLE_variants_filtered_and_scored.vcf contains variants from SAMPLE_variants_filtered.vcf that passed the user assigned quality score.

SnpEff_Output

Contains a folder for each mutant within which you will find VCF, TXT and HTML files with the output of the SnpEff program. This program will map variants to protein coding genes and regions upstream and downstream of those genes. The HTML is a high level summary of the data. The TXT file is a table of genes with counts of the number of variants within each gene that fall into a number of catagories (given in the column names). The VCF file contains all the variants input into SnpEff but now includes additional information following ANN= in the info field. The Events.log contains inforamation on the the run such as errors etc.

SIFT_Output

Contains a folder for each mutant within which you will find an XLS and VCF file. This program will map variants within protein coding genes only and score them based off how badly they are predicted to alter protein function. The XLS file contains information on all variants and provides the SIFT scores (SIFT_SCORE column) and interpretation of those scores (SIFT_PREDICTION column) as well as additional information about each variant. The VCF file contains all the variants input into SIFT, but now includes additional information following SIFTINFO= in the info field.

QualiMap

Contains a folder containing all the QualiMap reports as PDF files for every mutant ran in the pipeline. A QC report (plots & statistics) is produced from each BAM file generated. The coverage plot on page 6 is adjusted so that a high proportion of reads aligning to just a small fraction of the genome.

zero-read_Coverage

bedtools genomecovcoverage reports for zero-read regions. Reported as a single output line describing the chromosome, the start and end coordinate of the interval, and followed by the coverage level itself (which should be gaps or 0 coverage).

About

Bioinformatics pipeline for identification and characterization of mutations in Saccharomyces cerevisiae. Modified to run on SLUM clusters,

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Shell 100.0%