# Assignment 2: Comparing aligners

## Question 1: Please list the considerations you have to pick an aligner for your experiment (Make sure you mentioned the different consideration when your sample is genomic DNA vs RNA) 

Briefly, the first thing to consider is the nature of the sample: DNA or RNA. In eukaryotes, reads from a cDNA library often do not align contiguously to the reference genome because introns have been spliced out of the mature transcript, so a single read can span an exon-exon junction. Consequently, RNA-Seq data should be aligned with an aligner that handles splice junctions (a splicing-aware aligner).

If the sample is DNA (e.g., whole-genome sequencing, ChIP-Seq, ATAC-Seq, or MNase-Seq), a splicing-unaware aligner is preferred, because splicing-aware and splicing-unaware mappers are designed differently. A splicing-aware aligner will usually still align genomic DNA acceptably, but quantities such as the mapping quality score (MAPQ) are defined differently; STAR, for example, assigns MAPQ 255 to uniquely mapped reads by default. These differences affect downstream analysis: GATK, for example, uses MAPQ when calling variants.
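One way to see how differently aligners define MAPQ is to tabulate the MAPQ column (field 5 of each SAM record) for each alignment. The following is a minimal sketch; `mapq_hist` is a hypothetical helper name, and the file paths in the usage comments assume the outputs produced later in this assignment.

```shell
# Tabulate the MAPQ column (field 5) of SAM records read from stdin.
# Header lines (starting with "@") are skipped.
# Output: one "MAPQ<TAB>count" line per distinct value, sorted numerically.
mapq_hist() {
    awk -F '\t' '
        !/^@/ { count[$5]++ }
        END   { for (q in count) printf "%s\t%d\n", q, count[q] }
    ' | sort -n
}

# Assumed usage (requires samtools and the BAM files from the jobs below):
#   samtools view result/STAR/Aligned.sortedByCoord.out.bam | mapq_hist
#   samtools view result/BWA/aligned_bwa_sorted.bam | mapq_hist
```

Comparing the two tables side by side should make it clear why a MAPQ threshold tuned for one aligner cannot be reused blindly for another.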

In short, use a splicing-aware aligner when you expect spliced reads, and a splicing-unaware aligner in all other cases.

Other considerations include the features the aligner supports (e.g., seed length, mismatch tolerance, soft-clipping), the alignment rate, and the alignment speed.

For those interested, there are many benchmarking studies of bioinformatics tools, such as [this one](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-184).

## Question 2: Align the RNA-seq (read_1.fastq and read_2.fastq) with HISAT2 + two other aligners of your choice and submit them as sbatch scripts, and compare the alignment results. Alignment results should be converted to BAM and sorted (upload the sbatch scripts for the alignment jobs and the output of samtools flagstats, 25% each).

- For a reference on sbatch scripts, please see [this wiki page](https://devwikis.nyu.edu/display/NYUHPC/Slurm+Tutorial). Briefly, arrange your commands as a bash script and prepend the Slurm options to it. (Splicing-aware aligners require more resources, especially when indexing the reference genome, so `--mem=8GB` is recommended.)
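Since the question asks for sorted BAM files, it can be worth confirming the sort order once a job finishes. A minimal sketch, assuming the aligner or `samtools sort` records the order in the `SO:` tag of the `@HD` header line; `sam_sort_order` is a hypothetical helper name.

```shell
# Print the sort order declared in a SAM header read from stdin
# (the SO: tag of the @HD line), or "unknown" if no such tag is present.
sam_sort_order() {
    awk -F '\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1; exit }
        }
        END { if (!found) print "unknown" }
    '
}

# Assumed usage (requires samtools; the path is one of the outputs below):
#   samtools view -H result/BWA/aligned_bwa_sorted.bam | sam_sort_order
```

A coordinate-sorted file is expected to report `coordinate`. Note that this only checks what the header declares, not the actual order of the records.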

## Using STAR (With gene model information)

In [3]:
# Show the sbatch script
cat star_align.sbatch

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0:30:00
#SBATCH --mem=16GB
#SBATCH --job-name=star
#SBATCH --mail-type=END
#SBATCH --account=class
#SBATCH --mail-user=ycc520@nyu.edu
#SBATCH --output=slurm_%j.out
module purge
module load star/intel/2.7.6a
module load samtools/intel/1.11

# Index the reference genome for STAR
STAR --runThreadN 4 \
--runMode genomeGenerate \
--genomeSAindexNbases 11 \
--genomeDir star_ref \
--genomeFastaFiles chr17.fa \
--sjdbGTFfile chr17.gtf \
--sjdbOverhang 99

# Align
STAR --genomeDir star_ref \
--runThreadN 4 \
--readFilesIn read_1.fastq read_2.fastq \
--outFileNamePrefix ./result/STAR/ \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard


In [4]:
sbatch star_align.sbatch

Submitted batch job 4163945


In [5]:
module load samtools/intel/1.11

In [9]:
samtools flagstat result/STAR/Aligned.sortedByCoord.out.bam

751688 + 0 in total (QC-passed reads + QC-failed reads)
15048 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
751420 + 0 mapped (99.96% : N/A)
736640 + 0 paired in sequencing
368320 + 0 read1
368320 + 0 read2
736370 + 0 properly paired (99.96% : N/A)
736370 + 0 with itself and mate mapped
2 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


## Using Bowtie2

In [22]:
# Show the sbatch script
cat bowtie2_align.sbatch

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0:30:00
#SBATCH --mem=8GB
#SBATCH --job-name=bowtie2
#SBATCH --mail-type=END
#SBATCH --account=class
#SBATCH --mail-user=ycc520@nyu.edu
#SBATCH --output=slurm_%j.out
module purge
module load bowtie2/2.4.2
module load samtools/intel/1.11

# Index the reference genome for Bowtie2
bowtie2-build chr17.fa chr17_bowtie2

# Align
bowtie2 -x chr17_bowtie2 \
	-1 read_1.fastq \
	-2 read_2.fastq \
	-S result/bowtie2/aligned_bowtie2.sam

cd result/bowtie2

# Convert to BAM
samtools view -b aligned_bowtie2.sam > aligned_bowtie2.bam
# Sort by coordinate (sort the BAM produced above rather than the SAM)
samtools sort -o aligned_bowtie2_sorted.bam aligned_bowtie2.bam
# Index the BAM
samtools index aligned_bowtie2_sorted.bam


In [23]:
mkdir result/bowtie2
sbatch bowtie2_align.sbatch

mkdir: cannot create directory ‘result/bowtie2’: File exists
Submitted batch job 4166105


In [29]:
samtools flagstat result/bowtie2/aligned_bowtie2_sorted.bam

736640 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
600718 + 0 mapped (81.55% : N/A)
736640 + 0 paired in sequencing
368320 + 0 read1
368320 + 0 read2
488146 + 0 properly paired (66.27% : N/A)
527108 + 0 with itself and mate mapped
73610 + 0 singletons (9.99% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


## Using BWA-MEM

In [24]:
# Show the sbatch script
cat bwa_align.sbatch

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0:30:00
#SBATCH --mem=8GB
#SBATCH --job-name=bwa
#SBATCH --mail-type=END
#SBATCH --account=class
#SBATCH --mail-user=ycc520@nyu.edu
#SBATCH --output=slurm_%j.out
module purge
module load bwa/intel/0.7.17
module load samtools/intel/1.11

# Index the reference genome for BWA
bwa index chr17.fa

mkdir -p result/BWA
# Align
bwa mem chr17.fa \
	read_1.fastq read_2.fastq > result/BWA/aligned_bwa.sam

cd result/BWA

# Convert to BAM
samtools view -b aligned_bwa.sam > aligned_bwa.bam
# Sort by coordinate (sort the BAM produced above rather than the SAM)
samtools sort -o aligned_bwa_sorted.bam aligned_bwa.bam
# Index the BAM
samtools index aligned_bwa_sorted.bam


In [25]:
sbatch bwa_align.sbatch

Submitted batch job 4166106


In [28]:
samtools flagstat result/BWA/aligned_bwa_sorted.bam

834803 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
98163 + 0 supplementary
0 + 0 duplicates
834489 + 0 mapped (99.96% : N/A)
736640 + 0 paired in sequencing
368320 + 0 read1
368320 + 0 read2
639544 + 0 properly paired (86.82% : N/A)
736072 + 0 with itself and mate mapped
254 + 0 singletons (0.03% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


## Using HISAT2

In [30]:
cat hisat2.sbatch

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0:30:00
#SBATCH --mem=16GB
#SBATCH --job-name=HISAT2
#SBATCH --mail-type=END
#SBATCH --account=class
#SBATCH --mail-user=ycc520@nyu.edu
#SBATCH --output=slurm_%j.out
module purge
module load hisat2/2.2.1
module load samtools/intel/1.11

# Index the reference genome for HISAT2

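## Prepare gene model information
## (Note: genome.ss and genome.exon are not passed to hisat2-build below,
## so this index does not include the gene model; hisat2-build accepts
## them through its --ss and --exon options.)
hisat2_extract_splice_sites.py chr17.gtf > genome.ss
hisat2_extract_exons.py chr17.gtf > genome.exon

## Index the genome
hisat2-build -p 4 chr17.fa hisat2_genome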

mkdir -p result/hisat2
# Align
hisat2 -x hisat2_genome \
       -1 read_1.fastq \
       -2 read_2.fastq \
       -S result/hisat2/aligned_hisat2.sam

cd result/hisat2
# Convert to BAM
samtools view -b aligned_hisat2.sam > aligned_hisat2.bam
# Sort by coordinate (sort the BAM produced above rather than the SAM)
samtools sort -o aligned_hisat2_sorted.bam aligned_hisat2.bam
# Index the BAM
samtools index aligned_hisat2_sorted.bam


In [31]:
sbatch hisat2.sbatch

Submitted batch job 4166336


In [32]:
samtools flagstat result/hisat2/aligned_hisat2_sorted.bam

748101 + 0 in total (QC-passed reads + QC-failed reads)
11461 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
729548 + 0 mapped (97.52% : N/A)
736640 + 0 paired in sequencing
368320 + 0 read1
368320 + 0 read2
709488 + 0 properly paired (96.31% : N/A)
712002 + 0 with itself and mate mapped
6085 + 0 singletons (0.83% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


## Question 3: Briefly describe the differences between the alignment results, and select an alignment tool to use if you were the author of this study. Justify your pick. (15%)

Because the sample is RNA-Seq, splicing-unaware aligners such as BWA-MEM and Bowtie2 are not suitable for the job: they cannot handle splice junctions. Note that the fraction of mapped reads does not always reflect an aligner's ability to deal with RNA splicing. BWA-MEM reports 99.96% mapped, the same as STAR, yet only 86.82% of its reads are properly paired.

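Modern aligners are often versatile and will report suboptimal alignments rather than leave a read unmapped. For example, BWA-MEM performs local alignment: when different parts of a read map to distinct regions of the genome, it reports the additional parts as supplementary alignments (98,163 in the flagstat output above), which inflates its mapped count. Bowtie2, which runs in end-to-end mode by default, does not split reads this way and maps only 81.55% of the reads.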
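The mapped percentage printed by `samtools flagstat` counts secondary and supplementary records as well as primary ones. A rough sketch for recomputing the rate over primary records only, by subtracting the secondary and supplementary counts from both the mapped and the total counts; `primary_mapped_rate` is a hypothetical helper name, and the parsing assumes the flagstat text layout shown in the outputs above.

```shell
# Read `samtools flagstat` text from stdin and print, tab-separated:
#   primary mapped records, primary records in total, primary mapped rate (%).
# Secondary and supplementary records are always mapped, so they are
# subtracted from both the mapped count and the total.
primary_mapped_rate() {
    awk '
        $4 == "in" && $5 == "total" { total  = $1 }
        $4 == "secondary"           { sec    = $1 }
        $4 == "supplementary"       { sup    = $1 }
        $4 == "mapped"              { mapped = $1 }
        END {
            primary = total - sec - sup
            if (primary > 0)
                printf "%d\t%d\t%.2f\n", mapped - sec - sup, primary,
                       100 * (mapped - sec - sup) / primary
        }
    '
}

# Assumed usage (requires samtools and the BAM files produced above):
#   samtools flagstat result/hisat2/aligned_hisat2_sorted.bam | primary_mapped_rate
```

On the numbers reported above, this leaves STAR, BWA-MEM, and Bowtie2 essentially unchanged and lowers HISAT2 slightly, so it does not by itself separate BWA-MEM from STAR; the properly paired rate and the supplementary count remain the more telling figures.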

On the other hand, HISAT2 and STAR are both suitable for the job and both popular in the field. While a tool's popularity does not guarantee its quality, a larger user base makes it more likely that issues get spotted. It is also easier to find tutorials and troubleshooting discussions for popular tools.

Both HISAT2 and STAR are suitable. Since we are not doing SNP analysis, which of the two only HISAT2 is designed to handle, I will go with the one that maps more reads: STAR (99.96% mapped and 99.96% properly paired, versus 97.52% and 96.31% for HISAT2).

To check whether HISAT2 is correctly leaving reads unmapped that STAR might be mapping incorrectly, it would be useful to export the unmapped reads from the HISAT2 alignment with `samtools view -f 4` and see which reads are left unmapped by HISAT2 but mapped by STAR.
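The comparison described above could be sketched as follows. Both helper names (`unmapped_names`, `mapped_records_for`) are hypothetical, and both work on SAM text so that they can be fed from `samtools view`; the FLAG test uses plain arithmetic for bit 0x4 (read unmapped) to avoid relying on awk extensions.

```shell
# Print the unique names of reads flagged as unmapped (FLAG bit 0x4)
# in SAM records read from stdin.
unmapped_names() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1 { print $1 }' | sort -u
}

# Print the SAM records from stdin whose read name is listed in file $1
# (one name per line) and which are mapped (FLAG bit 0x4 unset).
mapped_records_for() {
    awk -F '\t' '
        NR == FNR { wanted[$1] = 1; next }
        !/^@/ && ($1 in wanted) && int($2 / 4) % 2 == 0
    ' "$1" -
}

# Assumed usage (requires samtools and the BAM files produced above):
#   samtools view -f 4 result/hisat2/aligned_hisat2_sorted.bam \
#       | unmapped_names > hisat2_unmapped.txt
#   samtools view result/STAR/Aligned.sortedByCoord.out.bam \
#       | mapped_records_for hisat2_unmapped.txt > star_only.sam
```

Because both mates of a pair share one read name, matching by name alone also returns the mate that HISAT2 did map; a per-mate comparison would additionally need the read1/read2 FLAG bits.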