# Quality control of RNA-seq alignment to a reference genome

## Alignment file types

If you look at the output data from last week, you'll notice that there are two main file types that were produced:

- STAR produced `Aligned.sortedByCoord.out.bam`
- HISAT produced `$bn.sam` where `$bn` is the SRA ID for each of the FASTQ files that were aligned.

SAM (Sequence Alignment/Map format) files and BAM (Binary Alignment/Map format) files contain exactly the same data; the difference is that a SAM file stores human-readable tab-separated values, while a BAM file is a compressed binary file and is therefore much smaller. BAM files are also convenient because they can be indexed (BAI files), which allows fast retrieval of alignments from specific regions of the genome without having to start the search from the beginning of the first chromosome. The specification for these files is defined by the SAM/BAM Format Specification Working Group and can be found at the following link: [(SAM specification link)](https://samtools.github.io/hts-specs/SAMv1.pdf). The specification is easy to read and understand, and you are encouraged to reference it often.

### SAM specification

A full specification can be found at the above link, but we'll go through the relevant sections now using the SAM we generated with HISAT last week. The BAM from STAR will be very similar, but it's a bit trickier to interact with because it's a binary file.

#### Header

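A SAM file may contain a header, whose lines each start with the `@` symbol. You can use `grep` to show the header lines (the `^` anchors the match to the start of the line, so alignment lines that merely contain an `@` in their quality string are not printed):

In [None]:
!grep '^@' ../6_alignment/alignment/hisat/SRR26691082.sam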

In this case, the header includes the SAM format version number (`VN:1.0`), the sort order (`SO:unsorted`), information about the reference genome (the `@SQ` lines, where `SN` is the sequence name), the read group (`@RG`, which we added in the HISAT command), and information about the program that was used during alignment (`@PG`). The definitions for each of these tags can be found in the SAM format spec.
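
The header structure described above can also be read in Python. The following is a minimal sketch, not part of the original workflow: the function name `parse_sam_header` and the example lines are invented for illustration, and it assumes every non-comment header field has the `TAG:VALUE` form described in the spec.

```python
def parse_sam_header(lines):
    """Group SAM header lines by record type (e.g. 'HD', 'SQ', 'RG', 'PG')."""
    header = {}
    for line in lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0][1:]  # drop the leading '@'
        if record_type == "CO":
            # @CO lines are free-text comments, not TAG:VALUE pairs
            header.setdefault("CO", []).append("\t".join(fields[1:]))
            continue
        # Split each field on the first ':' only, since values may contain ':'
        tags = dict(field.split(":", 1) for field in fields[1:])
        header.setdefault(record_type, []).append(tags)
    return header


# Hypothetical header lines, invented for illustration
example_lines = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:500",
    "@RG\tID:SRR26691082",
    "read1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
header = parse_sam_header(example_lines)
print(header["HD"][0]["SO"])
print([sq["SN"] for sq in header["SQ"]])
```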

### Alignment

After the header, the rest of the SAM file consists of alignments. According to the spec, each alignment contains at least 11 tab-separated fields. Section 1.4 of the spec includes the following:

|Col | Field | Type   | Regexp/Range | Brief description|
|----|-------|--------|--------------|------------------|
|1   | QNAME | String | `[!-?A-~]{1,254}` | Query template NAME |
|2   | FLAG  | Int    | [0, 2<sup>16</sup> − 1] | bitwise FLAG |
|3   | RNAME | String | `\*\|[:rname:^*=][:rname:]*` | Reference sequence NAME |
|4   | POS   | Int    | [0, 2<sup>31</sup> − 1] | 1-based leftmost mapping POSition |
|5   | MAPQ  | Int    | [0, 2<sup>8</sup> − 1] | MAPping Quality |
|6   | CIGAR | String | `\*\|([0-9]+[MIDNSHP=X])+` | CIGAR string |
|7   | RNEXT | String | `\*\|=\|[:rname:^*=][:rname:]*` | Reference name of the mate/next read |
|8   | PNEXT | Int    | [0, 2<sup>31</sup> − 1] | Position of the mate/next read |
|9   | TLEN  | Int    | [−2<sup>31</sup> + 1, 2<sup>31</sup> − 1] | observed Template LENgth |
|10  | SEQ   | String | `\*\|[A-Za-z=.]+` | segment SEQuence |
|11  | QUAL  | String | `[!-~]+` | ASCII of Phred-scaled base QUALity+33 |

You can see this format in action by using `head` to look at the first 100 lines:

In [None]:
!head -100 ../6_alignment/alignment/hisat/SRR26691082.sam

The SAM format contains many encodings that take some getting used to. The most important for our purposes are FLAG, mapping quality (MAPQ) and the CIGAR string.
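
To work with these fields programmatically, each alignment line can be split on tabs into the 11 mandatory columns. The sketch below is only an illustration: the `Alignment` tuple, the `parse_alignment` helper, and the example record are hypothetical and not taken from the HISAT output.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """The 11 mandatory SAM fields, plus any optional TAG:TYPE:VALUE fields."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: tuple


def parse_alignment(line):
    """Split one SAM alignment line into named, typed fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError(f"expected at least 11 fields, found {len(f)}")
    return Alignment(
        f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
        f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]),
    )


# Hypothetical alignment record, invented for illustration
record = parse_alignment("read1\t16\tchr3\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1")
print(record.qname, record.flag, record.mapq, record.cigar)
```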

#### FLAG

The FLAG is a combination of bits that represent certain features of the alignment. The meaning of each bit is provided in the spec:

|Bit | Description|
|----|------------|
|1 0x1 | template having multiple segments in sequencing|
|2 0x2 | each segment properly aligned according to the aligner|
|4 0x4 | segment unmapped|
|8 0x8 | next segment in the template unmapped|
|16 0x10 | SEQ being reverse complemented|
|32 0x20 | SEQ of the next segment in the template being reverse complemented|
|64 0x40 | the first segment in the template|
|128 0x80 | the last segment in the template|
|256 0x100 | secondary alignment|
|512 0x200 | not passing filters, such as platform/vendor quality controls|
|1024 0x400 | PCR or optical duplicate|
|2048 0x800 | supplementary alignment|

These are very useful for filtering out alignment types that you aren't interested in. For instance, take a look at the `QNAME` `SRR26691082.4`:

In [None]:
!grep -P '^SRR26691082\.4\t' ../6_alignment/alignment/hisat/SRR26691082.sam

This read aligned to 10 different locations in the genome, all on chromosome 3. The first alignment has the `FLAG` set to 16, which indicates the read has been reverse-complemented during alignment. The second one has a `FLAG` of 256, which indicates a secondary alignment. The 3rd-8th alignments have a `FLAG` of 272, which marks them as secondary alignments (256) that are *also* reverse complemented (16), because 256 + 16 = 272.
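
Because each feature occupies its own bit, a `FLAG` can be decoded with a bitwise AND against each value in the table above. This is a small sketch of that idea; the names `FLAG_BITS` and `decode_flag` are made up for this example.

```python
# Bit values and descriptions copied from the FLAG table in the SAM spec
FLAG_BITS = {
    0x1: "template having multiple segments in sequencing",
    0x2: "each segment properly aligned according to the aligner",
    0x4: "segment unmapped",
    0x8: "next segment in the template unmapped",
    0x10: "SEQ being reverse complemented",
    0x20: "SEQ of the next segment in the template being reverse complemented",
    0x40: "the first segment in the template",
    0x80: "the last segment in the template",
    0x100: "secondary alignment",
    0x200: "not passing filters, such as platform/vendor quality controls",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the description of every bit that is set in a SAM FLAG."""
    return [description for bit, description in FLAG_BITS.items() if flag & bit]


# The three FLAG values discussed above
for flag in (16, 256, 272):
    print(flag, decode_flag(flag))
```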

The `FLAG` is useful for filtering out alignments that probably don't represent the *true* or *best* alignment. The filtering strategy someone uses can have drastic effects on downstream analysis of differential expression.

#### MAPQ

The SAM spec defines the mapping quality as "−10 log<sub>10</sub> Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available." This formula should remind you of the Phred quality...because it's the same. The difference is that we don't need to represent it in a single character, so we don't add 33 and convert to ASCII. Each alignment tool can calculate the `MAPQ` differently, so you need to be sure that you're using the right scale. For instance, the STAR manual states:

>The mapping quality MAPQ (column 5) is 255 for uniquely mapping reads, and int(-10*log10(1-1/Nmap)) for multi-mapping reads. This scheme is same as the one used by TopHat and is compatible with Cufflinks. The default MAPQ=255 for the unique mappers maybe changed with --outSAMmapqUnique parameter (integer 0 to 255) to ensure compatibility with downstream tools such as GATK.
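
To make the two scales concrete, here is a short sketch that inverts the spec's definition and evaluates STAR's multi-mapper scheme. It assumes the multi-mapper formula is `int(-10*log10(1 - 1/Nmap))`; the function names are invented for this example.

```python
import math


def mapq_to_error_probability(mapq):
    """Invert MAPQ = -10 * log10(p), where p = Pr{mapping position is wrong}.

    Not meaningful for 255, which the spec reserves for 'not available'.
    """
    return 10 ** (-mapq / 10)


def star_multimapper_mapq(nmap):
    """MAPQ assigned to a read that maps to nmap > 1 loci (assumed STAR scheme)."""
    if nmap < 2:
        raise ValueError("the formula only applies to multi-mapping reads")
    return int(-10 * math.log10(1 - 1 / nmap))


# Nmap = 2, 3, 4, 5 give MAPQ 3, 1, 1, 0
for nmap in (2, 3, 4, 5):
    print(nmap, star_multimapper_mapq(nmap))

# A MAPQ of 60 corresponds to a one-in-a-million chance of a wrong position
print(mapq_to_error_probability(60))
```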

It is not obvious how HISAT calculates the `MAPQ` score (you have to look at the C++ source code). The most important thing to know is that uniquely mapping reads are given a score of 60. Knowing that, we can drop the header lines with `grep -v`, extract the `MAPQ` column with `cut`, and count the alignments whose `MAPQ` is exactly 60:

In [None]:
!grep -v '^@' ../6_alignment/alignment/hisat/SRR26691082.sam | cut -f 5 | grep -cx '60'

These values are also reported when you run the alignment and are stored in the associated log file.
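
The same tally can be sketched in Python, which makes it easy to see every `MAPQ` value at once instead of only 60. The helper `count_mapq` and the example records below are hypothetical; on a real file you would pass an open file handle in place of the list.

```python
from collections import Counter


def count_mapq(lines):
    """Tally the MAPQ column (field 5) over all alignment lines, skipping the header."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        counts[int(fields[4])] += 1
    return counts


# Hypothetical SAM content, invented for illustration
example_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@PG\tID:hisat2\tPN:hisat2",
    "read1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tchr1\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tchr2\t5\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t256\tchr3\t7\t1\t4M\t*\t0\t0\tACGT\tIIII",
]
mapq_counts = count_mapq(example_sam)
print(mapq_counts)
print(mapq_counts[60] / sum(mapq_counts.values()))
```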

## Samtools

The SAM format is designed to be easy to parse with command-line tools like `cut`, `uniq`, and `grep`. There are more convenient tools for these operations, though, and `samtools` is the most popular. Check out the manual:

In [None]:
!samtools --help

One of the first things you might want to do on a new alignment is to use `samtools flagstat` to generate some summary statistics. For instance:

In [None]:
!samtools flagstat ../6_alignment/alignment/hisat/SRR26691082.sam

Samtools also enables a variety of other operations. Recall that when we aligned with STAR, we were able to generate a single output file, which you can find at `6_alignment/alignment/Aligned.sortedByCoord.out.bam`. When doing the two-pass alignment, we provided STAR with `manifest.tsv`, which assigned a read group (`RG`) to each read depending on the file that it came from. This wasn't an option with HISAT; instead, we used a loop to align each FASTQ individually and used the `--rg-id $rg` option to add the `RG`.

We want to compare the alignments derived from the two different aligners, which means we need to produce comparable output files. We will use BAM files to save space and speed things up. The STAR output is already `.bam`, but the HISAT output is `.sam`. Let's use Samtools to fix that.

In order, we must:

1. Convert the SAM files to BAM
2. Sort the BAM files
3. Merge the BAM files into a single file
4. Generate the index

Samtools commands can be piped together, but we'll also need a loop to process each SAM file:

In [None]:
!for sam in ../6_alignment/alignment/hisat/*.sam; do \
    bn=$(basename "$sam" .sam); \
    echo "Converting $sam to $bn.bam and sorting."; \
    samtools view -@ 16 -b "$sam" | samtools sort -@ 16 > "../6_alignment/alignment/hisat/$bn.bam"; \
    done

***9 minutes time to completion***

You can see that these BAM files are about 10% of the size of the SAM files:

In [None]:
!ls -lh ../6_alignment/alignment/hisat/ | grep -P 'bam|sam'

Now we merge them into a single file and generate the index:

In [None]:
!samtools merge -@ 16 ../6_alignment/alignment/hisat/merged.bam  ../6_alignment/alignment/hisat/*.bam 
!samtools index -@ 16 ../6_alignment/alignment/hisat/merged.bam 

***4 minutes time to completion***

When storage space is a concern, it would be wise to remove the old SAM/BAM files. But they might be useful to us later, so we'll keep them.

## Alignment QC

There is a range of metrics we can look at to compare the quality of alignments. The simplest is `samtools flagstat`, which simply counts the alignments in each `FLAG` category (for example mapped, secondary, and properly paired), as we saw earlier. `samtools stats` gives a more full-featured report. An index is only required by `samtools stats` when you restrict it to specific regions, but we will create one for the STAR output anyway so that both BAM files are indexed:

In [None]:
!echo "HISAT mapping stats"
!samtools stats  -@ 16 ../6_alignment/alignment/hisat/merged.bam > hisat_stats.txt
!echo "STAR index"
!samtools index -@ 16 ../6_alignment/alignment/Aligned.sortedByCoord.out.bam
!echo "STAR mapping stats"
!samtools stats -@ 16 ../6_alignment/alignment/Aligned.sortedByCoord.out.bam > star_stats.txt

***4 minutes time to completion***
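
The summary section of a `samtools stats` report can also be read directly. This minimal sketch assumes the summary lines start with `SN`, followed by a tab-separated key and value and an optional `#` comment; the helper name `parse_summary_numbers` and all numbers in the example are invented for illustration.

```python
def parse_summary_numbers(lines):
    """Collect the 'SN' (summary numbers) lines of a samtools stats report into a dict."""
    summary = {}
    for line in lines:
        if not line.startswith("SN\t"):
            continue  # skip comments and the other report sections
        fields = line.rstrip("\n").split("\t")
        key = fields[1].rstrip(":")
        try:
            summary[key] = float(fields[2])
        except ValueError:
            summary[key] = fields[2]  # keep any non-numeric value as text
    return summary


# Hypothetical report excerpt; the values are made up, not real results
example_report = [
    "# This file was produced by samtools stats",
    "SN\traw total sequences:\t1000",
    "SN\treads mapped:\t950",
    "SN\terror rate:\t2.5e-03\t# mismatches / bases mapped (cigar)",
    "FFQ\t1\t0\t0",
]
summary = parse_summary_numbers(example_report)
print(summary["reads mapped"] / summary["raw total sequences"])
```

On the real reports, the same function could be applied to the lines of `hisat_stats.txt` and `star_stats.txt` to compare the two aligners side by side.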

The output stats file contains quite a bit of information. We can use `multiqc` to summarize and plot it for us:

In [4]:
!multiqc --force -d ../4_fastq/fastq/qc/ ../4_fastq/trimmed/qc/ ../6_alignment .


  [91m///[0m ]8;id=363173;https://multiqc.info\[1mMultiQC[0m]8;;\ 🔍 [2m| v1.17[0m

[34m|           multiqc[0m | Prepending directory to sample names
[34m|           multiqc[0m | Search path : /data/users/wheelenj/BIOL343/4_fastq/fastq/qc
[34m|           multiqc[0m | Search path : /data/users/wheelenj/BIOL343/4_fastq/trimmed/qc
[34m|           multiqc[0m | Search path : /data/users/wheelenj/BIOL343/6_alignment
[34m|           multiqc[0m | Search path : /data/users/wheelenj/BIOL343/7_alignment_qc
[2K[34m|[0m         [34msearching[0m | [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [32m108/108[0m  05/108[0m [2m./temp.txt[0m
[?25h[34m|          samtools[0m | Found 2 stats reports
[34m|              star[0m | Found 2 reports
[34m|            hisat2[0m | Found 12 reports
[34m|            fastqc[0m | Found 24 reports
[34m|           multiqc[0m | Report      : multiqc_report.html   (overwritten)
[34m|           multiqc[0m | Data     

You can see from the output that `multiqc` detects the old FastQC reports, the two `samtools stats` reports, the STAR logs (2, one for the first test and one for the full alignment), and the HISAT logs (12, one for each individual alignment).

Note that MultiQC can also detect log output in IPython Notebooks, so if you don't clear the output from old notebooks then those will be included as well.