# Quality control of RNA-seq alignment to a reference genome

## Alignment file types

If you look at the output data from last week, you'll notice that there are two main file types that were produced:

- STAR produced `Aligned.sorted.bam`
- HISAT produced `$bn.sam` where `$bn` is the SRA ID for each of the FASTQ files that were aligned. We merged and sorted those into `merged.bam`.

SAM (Sequence Alignment/Map format) files and BAM (Binary Alignment/Map format) files hold exactly the same information. The difference is that a SAM file stores it as human-readable, tab-separated text, while a BAM file is a compressed binary encoding and is therefore much smaller. BAM files are also convenient because, once sorted by coordinate, they can be indexed (BAI files), which allows fast retrieval of alignments from a specific region of the genome without having to start the search from the beginning of the first chromosome. The format is defined by the SAM/BAM Format Specification Working Group, and the specification can be found at the following link: [(SAM specification link)](https://samtools.github.io/hts-specs/SAMv1.pdf). The specification is easy to read and understand, and you are encouraged to reference it often.

### SAM specification

A full specification can be found at the above link, but we'll go through the relevant sections now using the SAM we generated with HISAT last week. The BAM from STAR will be very similar, but it's a bit trickier to interact with because it's a binary file.

#### Header

A SAM file may contain a header, with header lines starting with the `@` symbol. You can use `grep` to show the header lines:

In [1]:
!head -100 ../6_alignment/alignment/hisat/SRR10776762.sam | grep '^@' 

@HD	VN:1.0	SO:unsorted
@SQ	SN:SM_V10_1	LN:87984048
@SQ	SN:SM_V10_Z	LN:86692428
@SQ	SN:SM_V10_3	LN:49793710
@SQ	SN:SM_V10_4	LN:46471573
@SQ	SN:SM_V10_2	LN:45716244
@SQ	SN:SM_V10_6	LN:24696076
@SQ	SN:SM_V10_5	LN:24128858
@SQ	SN:SM_V10_7	LN:19942475
@SQ	SN:SM_V10_WSR	LN:5969970
@SQ	SN:SM_V10_MITO	LN:26917
@RG	ID:ADULT_PZQ_E	SM:ADULT_PZQ_E
@PG	ID:hisat2	PN:hisat2	VN:2.2.1	CL:"/data/users/zhouz6436/.conda/envs/biol343/bin/hisat2-align-s --wrapper basic-0 /data/classes/2025/fall/biol343/course_files/hisat/genome -p 16 --rg-id ADULT_PZQ_E --rg SM:ADULT_PZQ_E --summary-file alignment/hisat/SRR10776762.log --new-summary --read-lengths 51 -1 /tmp/621982.inpipe1 -2 /tmp/621982.inpipe2"


In this case, the header includes the SAM format version number (`VN:1.0`), the sort order (`SO:unsorted`), the name and length of each reference sequence (`@SQ` lines, with `SN` and `LN`), the read group (`@RG`, which we added in the HISAT command), and information about the program that was used for alignment (`@PG`). The definitions of these record types and tags can be found in the SAM format spec.
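
Because header records are just tab-separated `TAG:VALUE` pairs, they are easy to parse programmatically. The following is a minimal sketch (the function name `parse_sam_header` is made up for illustration and is not part of any library) that collects the reference names and lengths from `@SQ` lines, using a few of the header lines shown above:

```python
def parse_sam_header(lines):
    """Collect @SQ reference names/lengths and keep the other header records."""
    references = {}
    other = []
    for line in lines:
        if not line.startswith("@"):
            break  # header lines always come before the alignments
        record_type, *fields = line.rstrip("\n").split("\t")
        if record_type == "@SQ":
            # Each field looks like TAG:VALUE; split on the first colon only.
            tags = dict(field.split(":", 1) for field in fields)
            references[tags["SN"]] = int(tags["LN"])
        else:
            other.append((record_type, fields))
    return references, other


# A subset of the header lines printed above.
header = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:SM_V10_1\tLN:87984048",
    "@SQ\tSN:SM_V10_MITO\tLN:26917",
    "@RG\tID:ADULT_PZQ_E\tSM:ADULT_PZQ_E",
]
references, other = parse_sam_header(header)
print(references)
```

To run it on a real file, pass an open file handle instead of the list, since the function stops reading at the first alignment line.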

### Alignment

After the header, the rest of the SAM file consists of alignments. According to the spec, each alignment contains at least 11 tab-separated fields. Section 1.4 of the spec includes the following:

|Col | Field | Type   | Regexp/Range | Brief description|
|----|-------|--------|--------------|------------------|
|1   | QNAME | String | `[!-?A-~]{1,254}` | Query template NAME |
|2   | FLAG  | Int    | [0, 2<sup>16</sup> − 1] | bitwise FLAG |
|3   | RNAME | String | `\*\|[:rname:^*=][:rname:]*` | Reference sequence NAME |
|4   | POS   | Int    | [0, 2<sup>31</sup> − 1] | 1-based leftmost mapping POSition |
|5   | MAPQ  | Int    | [0, 2<sup>8</sup> − 1] | MAPping Quality |
|6   | CIGAR | String | `\*\|([0-9]+[MIDNSHP=X])+` | CIGAR string |
|7   | RNEXT | String | `\*\|=\|[:rname:^*=][:rname:]*` | Reference name of the mate/next read |
|8   | PNEXT | Int    | [0, 2<sup>31</sup> − 1] | Position of the mate/next read |
|9   | TLEN  | Int    | [−2<sup>31</sup> + 1, 2<sup>31</sup> − 1] | observed Template LENgth |
|10  | SEQ   | String | `\*\|[A-Za-z=.]+` | segment SEQuence |
|11  | QUAL  | String | `[!-~]+` | ASCII of Phred-scaled base QUALity+33 |
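
To make the column layout concrete, here is a small sketch that splits one alignment line into the 11 mandatory fields plus the optional `TAG:TYPE:VALUE` fields. The `Alignment` class and `parse_alignment` function are hypothetical names used only for this example. The input is the first alignment from our HISAT SAM file, with the optional fields abridged to three tags:

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def parse_alignment(line):
    """Split a SAM alignment line into typed mandatory fields and optional tags."""
    fields = line.rstrip("\n").split("\t")
    m, optional = fields[:11], fields[11:]
    tags = {}
    for item in optional:
        tag, value_type, value = item.split(":", 2)
        # Type 'i' is an integer; everything else is kept as a string here.
        tags[tag] = int(value) if value_type == "i" else value
    return Alignment(m[0], int(m[1]), m[2], int(m[3]), int(m[4]), m[5],
                     m[6], int(m[7]), int(m[8]), m[9], m[10], tags)


line = (
    "SRR10776762.12\t83\tSM_V10_MITO\t23569\t1\t51M\t=\t15838\t-7782\t"
    "GTTATAGCCCATACTCCTTTAGTCTTTTAGTATTATCGTCTATAGTCCCTT\t"
    "11E:0:01DF<1CEGGGGGGGEGGGGGDGEGGGGGGGGGGGGEGGGBBB=B\t"
    "NM:i:0\tMD:Z:51\tNH:i:10"
)
aln = parse_alignment(line)
print(aln.qname, aln.flag, aln.rname, aln.pos, aln.mapq, aln.cigar)
print(aln.tags)
```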

You can see this format in action by using `head` to look at the first 25 lines:

In [2]:
!head -25 ../6_alignment/alignment/hisat/SRR10776762.sam

@HD	VN:1.0	SO:unsorted
@SQ	SN:SM_V10_1	LN:87984048
@SQ	SN:SM_V10_Z	LN:86692428
@SQ	SN:SM_V10_3	LN:49793710
@SQ	SN:SM_V10_4	LN:46471573
@SQ	SN:SM_V10_2	LN:45716244
@SQ	SN:SM_V10_6	LN:24696076
@SQ	SN:SM_V10_5	LN:24128858
@SQ	SN:SM_V10_7	LN:19942475
@SQ	SN:SM_V10_WSR	LN:5969970
@SQ	SN:SM_V10_MITO	LN:26917
@RG	ID:ADULT_PZQ_E	SM:ADULT_PZQ_E
@PG	ID:hisat2	PN:hisat2	VN:2.2.1	CL:"/data/users/zhouz6436/.conda/envs/biol343/bin/hisat2-align-s --wrapper basic-0 /data/classes/2025/fall/biol343/course_files/hisat/genome -p 16 --rg-id ADULT_PZQ_E --rg SM:ADULT_PZQ_E --summary-file alignment/hisat/SRR10776762.log --new-summary --read-lengths 51 -1 /tmp/621982.inpipe1 -2 /tmp/621982.inpipe2"
SRR10776762.12	83	SM_V10_MITO	23569	1	51M	=	15838	-7782	GTTATAGCCCATACTCCTTTAGTCTTTTAGTATTATCGTCTATAGTCCCTT	11E:0:01DF<1CEGGGGGGGEGGGGGDGEGGGGGGGGGGGGEGGGBBB=B	AS:i:0	ZS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:51	YS:i:0	YT:Z:CP	RG:Z:ADULT_PZQ_E	NH:i:10
SRR10776762.12	163	SM_V10_MITO	15838	1	51M	=	23569	7782	GTT

The SAM format contains many encodings that take some getting used to. The most important for our purposes are FLAG, mapping quality (MAPQ) and the CIGAR string.

#### FLAG

The FLAG is a combination of bits that represent certain features of the alignment. The meaning of each bit is provided in the spec:

|Bit value|Hex  |Description|
|---------|-----|-----------|
|1        |0x1  |template having multiple segments in sequencing|
|2        |0x2  |each segment properly aligned according to the aligner|
|4        |0x4  |segment unmapped|
|8        |0x8  |next segment in the template unmapped|
|16       |0x10 |SEQ being reverse complemented|
|32       |0x20 |SEQ of the next segment in the template being reverse complemented|
|64       |0x40 |the first segment in the template|
|128      |0x80 |the last segment in the template|
|256      |0x100|secondary alignment|
|512      |0x200|not passing filters, such as platform/vendor quality controls|
|1024     |0x400|PCR or optical duplicate|
|2048     |0x800|supplementary alignment|

These are very useful for filtering out alignment types that you aren't interested in. For instance, take a look at the `QNAME` `SRR10776762.2`:

In [6]:
!head -25 ../6_alignment/alignment/hisat/SRR10776762.sam

@HD	VN:1.0	SO:unsorted
@SQ	SN:SM_V10_1	LN:87984048
@SQ	SN:SM_V10_Z	LN:86692428
@SQ	SN:SM_V10_3	LN:49793710
@SQ	SN:SM_V10_4	LN:46471573
@SQ	SN:SM_V10_2	LN:45716244
@SQ	SN:SM_V10_6	LN:24696076
@SQ	SN:SM_V10_5	LN:24128858
@SQ	SN:SM_V10_7	LN:19942475
@SQ	SN:SM_V10_WSR	LN:5969970
@SQ	SN:SM_V10_MITO	LN:26917
@RG	ID:ADULT_PZQ_E	SM:ADULT_PZQ_E
@PG	ID:hisat2	PN:hisat2	VN:2.2.1	CL:"/data/users/zhouz6436/.conda/envs/biol343/bin/hisat2-align-s --wrapper basic-0 /data/classes/2025/fall/biol343/course_files/hisat/genome -p 16 --rg-id ADULT_PZQ_E --rg SM:ADULT_PZQ_E --summary-file alignment/hisat/SRR10776762.log --new-summary --read-lengths 51 -1 /tmp/621982.inpipe1 -2 /tmp/621982.inpipe2"
SRR10776762.12	83	SM_V10_MITO	23569	1	51M	=	15838	-7782	GTTATAGCCCATACTCCTTTAGTCTTTTAGTATTATCGTCTATAGTCCCTT	11E:0:01DF<1CEGGGGGGGEGGGGGDGEGGGGGGGGGGGGEGGGBBB=B	AS:i:0	ZS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:51	YS:i:0	YT:Z:CP	RG:Z:ADULT_PZQ_E	NH:i:10
SRR10776762.12	163	SM_V10_MITO	15838	1	51M	=	23569	7782	GTT

This read has two lines, one for each mate in the pair (remember, these are paired-end reads). The FLAG for the first mate is 99, and it's 147 for the second mate. Here are the descriptions of those values:

|Bit value|Hex |Mnemonic           |Meaning                                                     |
|---------|----|-------------------|------------------------------------------------------------|
|1        |0x1 |Read paired        |Read originates from a paired-end sequencing experiment.    |
|2        |0x2 |Proper pair        |Read and mate map with expected orientation and insert size.|
|32       |0x20|Mate reverse strand|The mate aligns to the reverse strand of the reference.     |
|64       |0x40|First in pair      |This is the first mate (R1) in the paired-end read.         |

1 + 2 + 32 + 64 = 99

|Bit value|Hex |Mnemonic           |Meaning                                                              |
|---------|----|-------------------|---------------------------------------------------------------------|
|1        |0x1 |Read paired        |Read originates from a paired-end sequencing experiment.             |
|2        |0x2 |Proper pair        |Read and mate map with expected orientation and insert size.         |
|16       |0x10|Read reverse strand|This read aligns to the reverse (complement) strand of the reference.|
|128      |0x80|Second in pair     |This is the second mate (R2) in the paired-end read.                 |

1 + 2 + 16 + 128 = 147
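
These sums can be checked by testing each bit of the FLAG with a bitwise AND. The sketch below uses short mnemonics in the style of the tables above (`FLAG_BITS` and `decode_flag` are illustrative names, not a library API). It decodes 99 and 147, along with 83 and 163, the FLAGs of read `SRR10776762.12` in the output shown earlier:

```python
# Bit value -> short mnemonic, following the SAM spec table.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "not passing filters",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the mnemonics of every bit that is set in the FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


for flag in (99, 147, 83, 163):
    print(flag, decode_flag(flag))
```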

The `FLAG` is useful for filtering out alignments that probably don't represent the *true* or *best* alignment. The filtering strategy someone uses can have drastic effects on downstream analysis of differential expression.
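
As one example of such a strategy, the sketch below keeps only primary, mapped alignments by rejecting any FLAG with the unmapped, secondary, or supplementary bit set. The mask and the function name `keep_alignment` are illustrative choices, not a recommendation. The same mask could be passed to `samtools view -F`, which excludes alignments that have any of the given bits set:

```python
# Bits to exclude: unmapped (0x4), secondary (0x100), supplementary (0x800).
EXCLUDE = 0x4 | 0x100 | 0x800


def keep_alignment(flag, exclude=EXCLUDE):
    """True if none of the excluded bits are set in the FLAG."""
    return (flag & exclude) == 0


print(EXCLUDE)                     # decimal form of the mask
print(keep_alignment(99))          # properly paired primary alignment
print(keep_alignment(99 | 0x100))  # same alignment marked as secondary
```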

#### MAPQ

The SAM spec defines the mapping quality as "−10 log<sub>10</sub> Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available." This formula should remind you of the Phred quality score, because it's the same. The difference is that we don't need to represent it as a single character, so we don't add 33 and convert to ASCII. Each alignment tool can calculate `MAPQ` differently, so you need to be sure that you're interpreting it on the right scale. For instance, the STAR manual states:

>The mapping quality MAPQ (column 5) is 255 for uniquely mapping reads, and int(-10*log10(1-1/Nmap)) for multi-mapping reads. This scheme is same as the one used by TopHat and is compatible with Cufflinks. The default MAPQ=255 for the unique mappers maybe changed with --outSAMmapqUnique parameter (integer 0 to 255) to ensure compatibility with downstream tools such as GATK.
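
A short sketch makes both scales concrete. It assumes the STAR expression reads `int(-10*log10(1 - 1/Nmap))`, where `Nmap` is the number of loci a read maps to. The function names are made up for this example:

```python
import math


def mapq_to_error_probability(mapq):
    """Invert the SAM definition: MAPQ = -10 * log10(P) gives P = 10 ** (-MAPQ / 10)."""
    return 10 ** (-mapq / 10)


def star_mapq(n_map):
    """MAPQ under the STAR scheme, assuming the default of 255 for unique mappers."""
    if n_map == 1:
        return 255
    return int(-10 * math.log10(1 - 1 / n_map))


for mapq in (1, 20, 60):
    print(mapq, mapq_to_error_probability(mapq))

for n_map in (1, 2, 3, 4, 5, 10):
    print(n_map, star_mapq(n_map))
```

Under this reading, a read that maps to two loci gets a MAPQ of 3, three or four loci gives 1, and five or more gives 0, so the STAR values only distinguish a few classes of multi-mapper rather than estimating a true error probability.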

It is not obvious how HISAT calculates the `MAPQ` score (you have to look at the C++ source code). The most important thing to know is that uniquely mapping reads are given a score of 60. Knowing that, we can use `cut` and `grep` to count how many alignments are uniquely mapped:

In [7]:
!grep -v '^@' ../6_alignment/alignment/hisat/SRR10776762.sam | cut -f 5 | grep -c -x '60'

1229605
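
A count of one value is more informative alongside the full distribution. The following sketch tallies every MAPQ value while skipping header lines. The function name `mapq_counts` and the three toy alignment lines are invented for illustration:

```python
from collections import Counter


def mapq_counts(lines):
    """Tally column 5 (MAPQ) over all alignment lines, ignoring the header."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):
            continue  # header line, not an alignment
        fields = line.rstrip("\n").split("\t")
        counts[int(fields[4])] += 1
    return counts


# Toy input: one header line and three made-up alignments.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t99\tSM_V10_1\t100\t60\t51M\t=\t300\t251\t*\t*",
    "read1\t147\tSM_V10_1\t300\t60\t51M\t=\t100\t-251\t*\t*",
    "read2\t0\tSM_V10_MITO\t500\t1\t51M\t*\t0\t0\t*\t*",
]
counts = mapq_counts(toy_sam)
print(counts)
```

For a real file, pass an open handle, for example `with open(path) as handle: mapq_counts(handle)`, so the file is streamed rather than loaded into memory.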


## Samtools

The SAM format is designed to be easy to parse with command-line tools like `cut`, `uniq`, and `grep`. There are purpose-built tools that make these operations simpler, though, and `samtools` is the most popular. Check out the manual:

In [8]:
!samtools --help


Program: samtools (Tools for alignments in the SAM format)
Version: 1.6 (using htslib 1.6)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          con

One of the first things you might want to do on a new alignment is to use `samtools flagstat` to generate some summary statistics. For instance:

In [9]:
!samtools flagstat ../6_alignment/alignment/hisat/SRR10776762.sam

2201437 + 0 in total (QC-passed reads + QC-failed reads)
782107 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
2174345 + 0 mapped (98.77% : N/A)
1419330 + 0 paired in sequencing
709665 + 0 read1
709665 + 0 read2
1368826 + 0 properly paired (96.44% : N/A)
1374916 + 0 with itself and mate mapped
17322 + 0 singletons (1.22% : N/A)
4150 + 0 with mate mapped to a different chr
2628 + 0 with mate mapped to a different chr (mapQ>=5)


***1 minute run time***
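
It helps to see how these numbers relate to each other. The arithmetic below uses the values from the report above. It shows that the total counts alignment records rather than reads (primary plus secondary), and that the reported percentages are consistent with `mapped / total` for the mapped line and with `paired in sequencing` as the denominator for the other two:

```python
# Values copied from the flagstat report above.
total = 2201437
secondary = 782107
supplementary = 0
mapped = 2174345
paired = 1419330
properly_paired = 1368826
singletons = 17322

# Removing secondary and supplementary records leaves one record per read.
primary = total - secondary - supplementary
print(primary, primary == paired)

print(f"mapped:          {100 * mapped / total:.2f}%")
print(f"properly paired: {100 * properly_paired / paired:.2f}%")
print(f"singletons:      {100 * singletons / paired:.2f}%")
```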

Samtools also enables a variety of other operations. Recall that when we aligned with STAR, we were able to generate a single output file, which you can find at `6_alignment/alignment/Aligned.sorted.bam`. When doing the two-pass alignment, we provided it with `manifest.tsv` which assigned a read group (`RG`) to each read depending on the file that it came from. This wasn't an option with HISAT; instead, we used a loop to align each FASTQ individually and used the `--rg-id $rg` option to add the `RG`. We want to compare the alignments derived from the two different aligners, which means we need to produce a comparable output file. We will use BAM files to save space and speed things up. The STAR output is already `.bam`, but the HISAT output is `.sam`. Let's use Samtools to fix that.

Samtools commands can be piped together. In order, we must:

1. Convert the SAM files to BAM
2. Sort the BAM files
3. Merge the BAM files into a single file
4. Generate the index

The first two steps can be piped, but we'll also need a loop to process each file:

In [None]:
!for sam in ../6_alignment/alignment/hisat/*.sam; do \
    bn=$(basename "$sam" .sam); \
    echo "Converting $sam to $bn.bam and sorting."; \
    samtools view -@ 32 -b "$sam" | samtools sort -@ 32 > "../6_alignment/alignment/hisat/$bn.bam"; \
    done

***9 minute run time***

You can see that these BAM files are about 10% of the size of the SAM files:

In [None]:
!ls -lh ../6_alignment/alignment/hisat/ | grep -P 'bam|sam'

Now we merge them into a single file and generate the index:

In [None]:
!samtools merge -@ 32 ../6_alignment/alignment/hisat/merged.bam  ../6_alignment/alignment/hisat/*.bam 
!samtools index -@ 32 ../6_alignment/alignment/hisat/merged.bam 

***4 minutes time to completion***
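
The whole convert, sort, merge, and index sequence can also be planned from Python before running anything. The sketch below only builds the command lines and does not execute them. It is a variation on the cells above, not the same commands: it assumes `samtools sort` can read a SAM file directly and write BAM with `-o`, which replaces the `view | sort` pipe, and it lists the per-sample BAMs explicitly so that an existing `merged.bam` is never picked up as an input by the `*.bam` glob. The function name and the second file name are hypothetical:

```python
import shlex
from pathlib import Path


def plan_bam_commands(sam_paths, merged_name="merged.bam", threads=32):
    """Build (but do not run) the samtools commands for sort, merge, and index."""
    sam_paths = [Path(p) for p in sam_paths]
    sorted_bams = [p.with_suffix(".bam") for p in sam_paths]
    t = str(threads)

    # Steps 1-2: one coordinate-sorted BAM per SAM file.
    commands = [
        ["samtools", "sort", "-@", t, "-o", str(bam), str(sam)]
        for sam, bam in zip(sam_paths, sorted_bams)
    ]
    # Step 3: merge only the per-sample BAMs listed here.
    merged = sam_paths[0].parent / merged_name
    commands.append(["samtools", "merge", "-@", t, str(merged), *map(str, sorted_bams)])
    # Step 4: index the merged file.
    commands.append(["samtools", "index", "-@", t, str(merged)])
    return commands


commands = plan_bam_commands(["hisat/SRR10776762.sam", "hisat/EXAMPLE_SAMPLE.sam"])
for command in commands:
    print(shlex.join(command))
```

Each list could then be handed to `subprocess.run(command, check=True)` so that a failed step stops the pipeline.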

When storage space is a concern, it would be wise to remove the old SAM/BAM files. But they might be useful to us later, so we'll keep them.

## Alignment QC

### Samtools

There is a range of metrics we can use to compare the quality of alignments. The simplest is `samtools flagstat`, which simply counts the alignments in each FLAG category (mapped, secondary, properly paired, and so on), as we saw earlier. `samtools stats` gives a more full-featured report. Note that an index is required, which we will create for the STAR output:

In [11]:
!echo "HISAT mapping stats"
!samtools stats  -@ 32 /data/classes/2025/fall/biol343/course_files/alignments/hisat/merged.bam > hisat_stats.txt

#!echo "STAR mapping stats"
#!samtools stats -@ 32 /data/classes/2025/fall/biol343/course_files/alignments/star/Aligned.sorted.bam > star_stats.txt

HISAT mapping stats


***15 minutes time to completion***

The output stats file contains quite a bit of information. We can use `multiqc` to summarize and plot it for us (we'll do this later).
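
If you want a quick look before running MultiQC, the summary section of the stats file is easy to read directly. This sketch assumes the usual `samtools stats` layout, in which summary lines start with `SN`, followed by a tab, a key ending in a colon, a tab, and the value (optionally followed by a `#` comment). The function name and the example values are made up for illustration:

```python
def parse_summary_numbers(lines):
    """Read the SN (summary numbers) lines of a samtools stats report into a dict."""
    summary = {}
    for line in lines:
        if not line.startswith("SN\t"):
            continue  # comments and other report sections
        _, key, value, *_ = line.rstrip("\n").split("\t")
        key = key.rstrip(":")
        try:
            summary[key] = int(value)
        except ValueError:
            try:
                summary[key] = float(value)
            except ValueError:
                summary[key] = value  # keep anything non-numeric as text
    return summary


# Made-up example lines in the assumed layout.
example = [
    "# This file was produced by samtools stats",
    "SN\traw total sequences:\t1000",
    "SN\treads mapped:\t950",
    "SN\terror rate:\t2.5e-03\t# mismatches / bases mapped (cigar)",
]
summary = parse_summary_numbers(example)
print(summary)
```

With the real report you would call it as `with open("hisat_stats.txt") as handle: parse_summary_numbers(handle)`.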

### Qualimap

[Qualimap](http://qualimap.conesalab.org/#what-is-qualimap) is a tool built specifically for QC of RNA-seq data, not just general sequencing data. You will have to install it with conda before being able to use it ([here's the link to the Anaconda package](https://anaconda.org/bioconda/qualimap) and [here's the link to documentation](http://qualimap.conesalab.org/doc_html/index.html)). Use `mamba install qualimap` to install it to the `biol343` conda environment.

Qualimap usage is pretty simple. There are three modules of interest: BAM QC, which is general for all BAMs regardless of the type of sequencing; Count QC; and RNA-seq QC. We'll run BAM QC and RNA-seq QC on each alignment. RNA-seq QC works best if we tell it whether our library prep protocol was strand-specific or not:

In [None]:
!qualimap bamqc -nt 32 --java-mem-size=32G -outdir qualimap/star/bam -bam ../6_alignment/alignment/star/Aligned.sorted.bam --feature-file ../2_genome_exploration/genome/annotations.gtf
!qualimap rnaseq --java-mem-size=32G -outdir qualimap/star/rnaseq -bam ../6_alignment/alignment/star/Aligned.sorted.bam -gtf ../2_genome_exploration/genome/annotations.gtf
!qualimap bamqc -nt 32 -outdir qualimap/hisat/bam -bam ../6_alignment/alignment/hisat/merged.bam --feature-file ../2_genome_exploration/genome/annotations.gtf
!qualimap rnaseq -outdir qualimap/hisat/rnaseq -bam ../6_alignment/alignment/hisat/merged.bam -gtf ../2_genome_exploration/genome/annotations.gtf 

***27 minute run time***

We can summarize these with MultiQC, as we did previously. Note that MultiQC can also detect log output in IPython notebooks, so if you don't clear the output from old notebooks, it will be included as well.

In [None]:
!multiqc --force -d ../5_fastq/fastq/qc/ ../5_fastq/trimmed/qc/ ../6_alignment .

## Alignment browsing

After performing QC with command-line tools, the next step is to actually look at the alignments. We'll use JBrowse2 to do so. The slides can be found at [7_alignment_qc.pdf](../lectures/7_alignment_qc.pdf).