## BWA

BWA is the aligner recommended by the GATK Best Practices pipeline.

First, the reference genome needs to be indexed (bwa index).
An indexed hg19 genome is already prepared on erisone in the following directory.

In [4]:
! ls -lt /pub/genome_references/UCSC/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/

total 416
drwxrwxr-x. 2 526 datamgr 4096 Apr  2  2014 version0.5.x
drwxrwxr-x. 2 526 datamgr 4096 Apr  2  2014 version0.6.0
lrwxrwxrwx. 1 526 datamgr   26 Jun 23  2013 genome.fa.amb -> version0.6.0/genome.fa.amb
lrwxrwxrwx. 1 526 datamgr   26 Jun 23  2013 genome.fa.pac -> version0.6.0/genome.fa.pac
lrwxrwxrwx. 1 526 datamgr   22 Jun 23  2013 genome.fa -> version0.6.0/genome.fa
lrwxrwxrwx. 1 526 datamgr   26 Jun 23  2013 genome.fa.bwt -> version0.6.0/genome.fa.bwt
lrwxrwxrwx. 1 526 datamgr   26 Jun 23  2013 genome.fa.ann -> version0.6.0/genome.fa.ann
lrwxrwxrwx. 1 526 datamgr   25 Jun 23  2013 genome.fa.sa -> version0.6.0/genome.fa.sa
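
Before pointing bwa at a reference, it can help to confirm that all five index files from the listing above (.amb, .ann, .bwt, .pac, .sa) sit next to the FASTA. The following is a minimal sketch; `check_bwa_index` is a hypothetical helper name, not part of BWA.

```shell
#!/bin/bash
# check_bwa_index PREFIX
# Hypothetical helper: returns 0 if PREFIX.amb/.ann/.bwt/.pac/.sa all exist,
# otherwise prints the missing files to stderr and returns 1.
check_bwa_index() {
    local prefix="$1"
    local ext
    local missing=0
    for ext in amb ann bwt pac sa
    do
        if [ ! -e "${prefix}.${ext}" ]; then
            echo "missing: ${prefix}.${ext}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the directory listed above:
# check_bwa_index /pub/genome_references/UCSC/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa
```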


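Let's also index it with samtools and Picard for downstream analysis. samtools faidx writes genome.fa.fai and CreateSequenceDictionary writes genome.dict; GATK expects both next to the FASTA file.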

In [None]:
%%bash

module load picard

cd /data/humgen/burook/ref
# copy the FASTA and its BWA index files in one step
# (cp follows the symlinks, so the real files are copied; the version* directories are skipped)
cp /pub/genome_references/UCSC/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa* .

samtools faidx genome.fa

java -jar /apps/software/picard/2.6.0-Java-1.8.0_161/picard.jar CreateSequenceDictionary R=genome.fa O=genome.dict


Let's also copy the known-variant resource files (VCFs from the GATK bundle) so we can decompress them; we could not get some downstream analyses to work with the compressed files.

In [None]:
%%bash

cd /data/humgen/burook/ref
refloc2=/pub/genome_references/gatk-bundle/1.5/hg19

cp ${refloc2}/1000G_omni2.5.hg19.vcf.gz .
cp ${refloc2}/1000G_omni2.5.hg19.vcf.idx.gz .
gunzip 1000G_omni2.5.hg19.vcf.idx.gz 
gunzip 1000G_omni2.5.hg19.vcf.gz

cp ${refloc2}/dbsnp_135.hg19.vcf.gz .
cp ${refloc2}/dbsnp_135.hg19.vcf.idx.gz .
gunzip dbsnp_135.hg19.vcf.idx.gz 
gunzip dbsnp_135.hg19.vcf.gz

cp ${refloc2}/hapmap_3.3.hg19.vcf.gz .
cp ${refloc2}/hapmap_3.3.hg19.vcf.idx.gz .
gunzip hapmap_3.3.hg19.vcf.idx.gz 
gunzip hapmap_3.3.hg19.vcf.gz

cp ${refloc2}/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz .
cp ${refloc2}/Mills_and_1000G_gold_standard.indels.hg19.vcf.idx.gz .
gunzip Mills_and_1000G_gold_standard.indels.hg19.vcf.idx.gz 
gunzip Mills_and_1000G_gold_standard.indels.hg19.vcf.gz
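
The four copy-and-decompress stanzas above differ only in the resource name, so they can be folded into a loop. A minimal sketch, assuming every resource follows the same NAME.vcf.gz / NAME.vcf.idx.gz naming; `fetch_bundle` is a hypothetical helper, and `gunzip -f` (unlike the plain `gunzip` above) overwrites an existing decompressed copy.

```shell
#!/bin/bash
# fetch_bundle SRC DEST NAME...
# Hypothetical helper: copies NAME.vcf.gz and NAME.vcf.idx.gz from SRC to DEST
# and decompresses both, for every NAME given.
fetch_bundle() {
    local src="$1"
    local dest="$2"
    shift 2
    local name
    for name in "$@"
    do
        cp "${src}/${name}.vcf.gz" "${src}/${name}.vcf.idx.gz" "${dest}/" || return 1
        gunzip -f "${dest}/${name}.vcf.idx.gz" "${dest}/${name}.vcf.gz" || return 1
    done
}

# Example call with the four resources used in this section:
# fetch_bundle /pub/genome_references/gatk-bundle/1.5/hg19 /data/humgen/burook/ref \
#     1000G_omni2.5.hg19 dbsnp_135.hg19 hapmap_3.3.hg19 \
#     Mills_and_1000G_gold_standard.indels.hg19
```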



### Map read to reference genome (bwa mem)

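For a pair of paired-end FASTQ files, the following script maps the reads to a given reference genome and writes a BAM file. It takes three arguments: the forward-read file, the reverse-read file, and the output prefix. (Save it to a file named bwa_map.sh.)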

In [None]:
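#!/bin/bash
# usage: bwa_map.sh <R1.fastq.gz> <R2.fastq.gz> <output prefix>

# build the read group from the first FASTQ header line
header=$(zcat "$1" | head -n 1)
# ID: the first four colon-separated header fields, joined with underscores
id=$(echo "$header" | cut -f 1-4 -d ":" | sed 's/@//' | sed 's/:/_/g')
# SM: ID plus the index sequence found at the end of the header
sm=$(echo "$header" | grep -Eo "[ATGCN]+$")
rg="@RG\tID:${id}\tSM:${id}_${sm}\tLB:${id}_${sm}\tPL:ILLUMINA"
echo "Read Group $rg"

refloc1=/data/humgen/burook/ref

# -M marks shorter split hits as secondary (for Picard compatibility); -t sets the thread count
bwa mem \
    -M \
    -t 8 \
    -R "$rg" \
    ${refloc1}/genome.fa \
    "$1" "$2" | samtools view -Sb - > "$3.bam"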
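
To see what read group the script derives, the header parsing can be tried on a single line without running bwa. A minimal sketch: `read_group_from_header` is a hypothetical wrapper around the same cut/sed/grep steps, and the header below is a made-up example in the Illumina instrument:run:flowcell:lane:... layout, not taken from the actual data.

```shell
#!/bin/bash
# read_group_from_header HEADER
# Hypothetical wrapper around the parsing used in bwa_map.sh: prints the
# @RG string (with literal \t separators, as bwa mem -R expects).
read_group_from_header() {
    local header="$1"
    local id
    local sm
    id=$(printf '%s\n' "$header" | cut -f 1-4 -d ":" | sed 's/@//' | sed 's/:/_/g')
    sm=$(printf '%s\n' "$header" | grep -Eo "[ATGCN]+$")
    printf '%s\n' "@RG\tID:${id}\tSM:${id}_${sm}\tLB:${id}_${sm}\tPL:ILLUMINA"
}

# made-up header, for illustration only
read_group_from_header "@M00123:45:000000000-ABCDE:1:1101:15589:1332 1:N:0:ATCACG"
```

For this header the ID becomes M00123_45_000000000-ABCDE_1 and the sample and library names become M00123_45_000000000-ABCDE_1_ATCACG.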


We use this script to map every pair of FASTQ files in the Trimmomatic results folder, submitting one LSF job per sample. Note that the jobs are sent to our lab's node (hna001).

In [None]:
%%bash
# Now submit it as follows

dir3=/data/humgen/burook/sysbio_exome/trimmomatic_results1
dir4=/data/humgen/burook/sysbio_exome/bwa_mapped

for f1 in ${dir3}/*R1_001.pe.fastq.gz
do
    # f1 and f2 are the input files for forward and reverse reads
    f2=${dir3}/$(basename $f1 _R1_001.pe.fastq.gz)_R2_001.pe.fastq.gz
    fout=${dir4}/$(basename $f1 _R1_001.pe.fastq.gz)_mapped_bwa
    
    bsub -m "hna001" -n 8 -M 8000 -o "tmp1.out" -e "tmp1.err" "sh bwa_map.sh $f1 $f2 $fout"

    sleep 1
done
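
The file naming in the loop above can be checked without submitting anything. A minimal sketch: `paired_paths` is a hypothetical helper that repeats the two basename steps and prints the reverse-read file and the output prefix for a given forward-read file.

```shell
#!/bin/bash
# paired_paths R1_FILE OUTDIR
# Hypothetical helper: prints the matching R2 file (line 1) and the
# output prefix (line 2), using the same suffix rules as the loop above.
paired_paths() {
    local f1="$1"
    local outdir="$2"
    local indir
    local sample
    indir=$(dirname "$f1")
    sample=$(basename "$f1" _R1_001.pe.fastq.gz)
    printf '%s\n' "${indir}/${sample}_R2_001.pe.fastq.gz" "${outdir}/${sample}_mapped_bwa"
}

# Example (the sample name is made up):
# paired_paths /data/humgen/burook/sysbio_exome/trimmomatic_results1/S1_L001_R1_001.pe.fastq.gz \
#     /data/humgen/burook/sysbio_exome/bwa_mapped
```

Running this for each file before the real loop is a cheap way to spot a sample whose R2 file is missing.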


### Mark duplicates (using Picard)

(save the following in a file named mark_duplicates.sh)

In [None]:
#!/bin/bash

# input bam file
f1=$1
# prefix of the output file
f2=$2
# directory where input files are saved
dir1=$3
# directory where output files will be saved
dir2=$4

module load picard
# on erisone the picard jar is located at $EBROOTPICARD
# (its full path is /apps/software/picard/2.6.0-Java-1.8.0_161 )

# Sorting the mapped file
# The aligned reads need to be sorted for the next steps. This can be done with Picard.

java -jar /apps/software/picard/2.6.0-Java-1.8.0_161/picard.jar SortSam \
     INPUT=${f1} \
     OUTPUT=${dir2}/${f2}_mapped_sorted.bam \
     SORT_ORDER=coordinate

# Mark duplicates
java -jar /apps/software/picard/2.6.0-Java-1.8.0_161/picard.jar MarkDuplicates \
      INPUT=${dir2}/${f2}_mapped_sorted.bam \
      OUTPUT=${dir2}/${f2}_marked_duplicates.bam \
      METRICS_FILE=${dir2}/${f2}_marked_dup_metrics.txt \
      CREATE_INDEX=TRUE \
      ASSUME_SORTED=TRUE


Now call this script for each BAM file.

In [None]:
%%bash

# input data directory
dir1=/data/humgen/burook/sysbio_exome/bwa_mapped
# output data directory
dir2=/data/humgen/burook/sysbio_exome/picard_mdupl/

for f1 in ${dir1}/*.bam
do
    # f1 is the input BAM file
    f2=$(basename $f1 _mapped_bwa.bam)
    
    bsub -m "hna001" -n 8 -M 8000 -o "tmp2.out" -e "tmp2.err" "sh mark_duplicates.sh $f1 $f2 $dir1 $dir2"

    sleep 1
done


In [None]:

# let's see some stats for some samples
samtools flagstat test_marked_duplicates.bam
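
The metrics file written by MarkDuplicates also reports the duplication rate per library. A minimal sketch for pulling it out, assuming the usual Picard metrics layout (a line starting with `## METRICS CLASS`, then a tab-separated header row, then one row per library, then a blank line); `dup_fraction` is a hypothetical helper name.

```shell
#!/bin/bash
# dup_fraction METRICS_FILE
# Hypothetical helper: prints "<library><TAB><PERCENT_DUPLICATION>" for each
# library row. The column is located by name, so its position may vary.
dup_fraction() {
    awk -F '\t' '
        /^## METRICS CLASS/ { state = 1; next }        # header row comes next
        state == 1 {
            for (i = 1; i <= NF; i++)
                if ($i == "PERCENT_DUPLICATION") col = i
            state = 2
            next
        }
        state == 2 && NF == 0 { exit }                 # blank line ends the table
        state == 2 && col > 0 { print $1 "\t" $col }
    ' "$1"
}

# Example call (file name follows the pattern used by mark_duplicates.sh):
# dup_fraction test_marked_dup_metrics.txt
```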



### Base Quality Score Recalibration (BQSR)
(save the following in a file named base_recalibration.sh)

In [None]:
#!/bin/bash

# input bam file
f1=$1
# prefix of the output file
f2=$2
# directory where input files are saved
dir1=$3
# directory where output files will be saved
dir2=$4

cd ${dir2}

module load GATK
# on erisone the GATK jar is located at $EBROOTGATK
# (its full path is /apps/software/GATK/3.8-0-Java-1.8.0_161 )

refloc1=/data/humgen/burook/ref
refloc2=/pub/genome_references/gatk-bundle/1.5/hg19

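# Base recalibration is done in two passes: the first pass builds the recalibration table,
# the second pass builds a post-recalibration table used for the before/after plots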

# Base recalibration (first pass)

java -jar /apps/software/GATK/3.8-0-Java-1.8.0_161/GenomeAnalysisTK.jar -T BaseRecalibrator \
   -I ${f1} \
   -R ${refloc1}/genome.fa \
   -knownSites ${refloc1}/1000G_omni2.5.hg19.vcf \
   -knownSites ${refloc1}/dbsnp_135.hg19.vcf \
   -knownSites ${refloc1}/hapmap_3.3.hg19.vcf \
   -knownSites ${refloc1}/Mills_and_1000G_gold_standard.indels.hg19.vcf \
   --out ${f2}_recal_data.table

# Base recalibration (second pass)

java -jar /apps/software/GATK/3.8-0-Java-1.8.0_161/GenomeAnalysisTK.jar -T BaseRecalibrator \
   -I ${f1} \
   -R ${refloc1}/genome.fa \
   -knownSites ${refloc1}/1000G_omni2.5.hg19.vcf \
   -knownSites ${refloc1}/dbsnp_135.hg19.vcf \
   -knownSites ${refloc1}/hapmap_3.3.hg19.vcf \
   -knownSites ${refloc1}/Mills_and_1000G_gold_standard.indels.hg19.vcf \
   -BQSR ${f2}_recal_data.table \
   --out ${f2}_recal_data2.table

# Print reads

java -jar /apps/software/GATK/3.8-0-Java-1.8.0_161/GenomeAnalysisTK.jar -T PrintReads \
   -R ${refloc1}/genome.fa \
   -I ${f1} \
   -BQSR ${f2}_recal_data.table \
   --out ${f2}_recal3.bam

# Analyze covariates

module load R

java -jar /apps/software/GATK/3.8-0-Java-1.8.0_161/GenomeAnalysisTK.jar -T AnalyzeCovariates \
   -R ${refloc1}/genome.fa \
   -before ${f2}_recal_data.table \
   -after ${f2}_recal_data2.table \
   -plots ${f2}_AnalyzeCovariates.pdf \
   -csv ${f2}_AnalyzeCovariates.csv
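
Both BaseRecalibrator passes repeat the same four -knownSites arguments. One way to keep them in sync is to generate the list once. A minimal sketch, assuming the paths contain no spaces; `known_sites_args` is a hypothetical helper, and the file names are the ones decompressed earlier in this section.

```shell
#!/bin/bash
# known_sites_args REFDIR
# Hypothetical helper: prints one "-knownSites REFDIR/<vcf>" pair per line
# for the four known-variant files used in base_recalibration.sh.
known_sites_args() {
    local refdir="$1"
    local vcf
    for vcf in 1000G_omni2.5.hg19.vcf \
               dbsnp_135.hg19.vcf \
               hapmap_3.3.hg19.vcf \
               Mills_and_1000G_gold_standard.indels.hg19.vcf
    do
        printf '%s %s/%s\n' -knownSites "$refdir" "$vcf"
    done
}

# Usage inside the script (left unquoted on purpose so the words are split):
# java -jar .../GenomeAnalysisTK.jar -T BaseRecalibrator \
#    -I ${f1} -R ${refloc1}/genome.fa \
#    $(known_sites_args ${refloc1}) \
#    --out ${f2}_recal_data.table
```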
   

Run this script for each BAM file. The output directory must already exist, because the script changes into it before writing its results.

In [None]:
%%bash

# input data directory
dir1=/data/humgen/burook/sysbio_exome/picard_mdupl/
# output data directory
dir2=/data/humgen/burook/sysbio_exome/gatk_BQSR

# mark_duplicates.sh writes its output as <sample>_marked_duplicates.bam
for f1 in ${dir1}/*_marked_duplicates.bam
do
    # f1 is the input BAM file
    f2=$(basename $f1 _marked_duplicates.bam)
    
    bsub -m "hna001" -n 8 -M 8000 -o "tmp3.out" -e "tmp3.err" "sh base_recalibration.sh $f1 $f2 $dir1 $dir2"

    sleep 1
done
