# Alignment of RNA-seq reads to a reference genome

## STAR

Remember how STAR works...it is a splice-aware aligner that searches read prefixes against a suffix array (SA). It is ultra-fast and can soft-clip, which means trimming isn't as necessary as with other aligners. The use of the SA index makes it fast, but it also requires a lot of RAM to build the index and load it into memory for searches.

As we use STAR to align reads, we will be using the [STAR manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf) heavily. I recommend opening it in a separate tab/window.

### Build the SA

An SA is a lexicographically sorted array of *all* the suffixes of an entire genome. STAR usage is generally `STAR --runMode {mode} --argument option`. To build the SA and give the aligner some information about where to expect splice junctions, here are the relevant details provided in the manual:

```
--runThreadN NumberOfThreads
--runMode genomeGenerate
--genomeDir /path/to/genomeDir
--genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 ...
--sjdbGTFfile /path/to/annotations.gtf
--sjdbOverhang ReadLength-1
```
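As a quick aside on what is being built here: the suffix array of a sequence is just its suffixes in sorted order, recorded by start position. The toy sketch below uses a made-up 7-base "genome" (`GATTACA`, not from our data) and plain `awk`/`sort`. STAR's real construction is far more elaborate, so treat this only as an illustration of the definition.

```shell
# Toy "genome" (made up for illustration).
genome="GATTACA"

# Print every suffix with its 0-based start position, then sort the
# suffixes lexicographically (byte order, via LC_ALL=C).
suffix_array=$(
  printf '%s\n' "$genome" |
    awk '{ for (i = 1; i <= length($0); i++) printf "%s\t%d\n", substr($0, i), i - 1 }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1
)

# Column 1: suffix; column 2: where that suffix starts in the genome.
printf '%s\n' "$suffix_array"
```

Reading the second column from top to bottom gives the suffix array itself: `6 4 1 5 0 3 2`.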

Most of these are self-explanatory, but `--sjdbOverhang` requires special consideration. Here's the description:

> `--sjdbOverhang` specifies the length of the genomic sequence around the annotated junction
to be used in constructing the splice junctions database. Ideally, this length should be equal
to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina
2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the
ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as
well as the ideal value.

Using this description and what we know about our read lengths from our QC, choose the best value. Edit and run the code block below.
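If you want to double-check the read lengths before editing the cell, one possible approach is to scan the sequence lines of a FASTQ and take `max(ReadLength) - 1`, as the manual describes. This is a minimal sketch on a made-up three-read FASTQ. The variable names (`fq`, `max_len`, `overhang`) are only for illustration.

```shell
# Made-up FASTQ with reads of length 5, 8, and 6.
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
ACGTA
+
IIIII
@read2
ACGTACGT
+
IIIIIIII
@read3
ACGTAC
+
IIIIII
EOF

# In a 4-line FASTQ record, line 2 is the sequence; track the longest one.
max_len=$(awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) } END { print max + 0 }' "$fq")

# The manual's ideal value is max(ReadLength) - 1.
overhang=$((max_len - 1))
echo "max read length: $max_len; suggested --sjdbOverhang: $overhang"

# For a real gzipped file, the same awk program could be fed from zcat, e.g.:
#   zcat ../5_fastq/trimmed/SRR26691082.fastq.gz | awk 'NR % 4 == 2 { ... }'
```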

In [1]:
%mkdir -p alignment/star/

!STAR \
--runThreadN 32 \
--runMode genomeGenerate \
--genomeDir ../2_genome_exploration/genome/star \
--genomeFastaFiles ../2_genome_exploration/genome/genome.fa \
--sjdbGTFfile ../2_genome_exploration/genome/annotations.gtf \
--sjdbOverhang 68 \
--genomeSAindexNbases 13

	STAR --runThreadN 32 --runMode genomeGenerate --genomeDir ../2_genome_exploration/genome/star --genomeFastaFiles ../2_genome_exploration/genome/genome.fa --sjdbGTFfile ../2_genome_exploration/genome/annotations.gtf --sjdbOverhang 68 --genomeSAindexNbases 13
	STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Oct 15 15:44:01 ..... started STAR run
Oct 15 15:44:01 ... starting to generate Genome files
Oct 15 15:44:07 ..... processing annotations GTF
Oct 15 15:44:08 ... starting to sort Suffix Array. This may take a long time...
Oct 15 15:44:10 ... sorting Suffix Array chunks and saving them to disk...
Oct 15 15:44:56 ... loading chunks from disk, packing SA...
Oct 15 15:45:04 ... finished generating suffix array
Oct 15 15:45:04 ... generating Suffix Array index
Oct 15 15:45:45 ... completed Suffix Array index
Oct 15 15:45:45 ..... inserting junctions in

***3 minute time to completion***
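The command above also sets `--genomeSAindexNbases`, which is not in the option list quoted earlier. The STAR manual's note on small genomes suggests scaling this value to `min(14, log2(GenomeLength)/2 - 1)`. The sketch below applies that formula to a made-up 16-base FASTA. How the result is rounded to an integer is a judgment call, and the source does not say how the value of 13 was chosen.

```shell
# Made-up two-sequence FASTA (10 + 6 = 16 bases).
fa=$(mktemp)
cat > "$fa" <<'EOF'
>chrA
ACGTACGTAC
>chrB
GGCCTT
EOF

# Genome length = total bases on all non-header lines.
genome_length=$(awk '!/^>/ { n += length($0) } END { print n + 0 }' "$fa")

# min(14, log2(GenomeLength)/2 - 1); awk's log() is the natural log.
nbases=$(awk -v n="$genome_length" 'BEGIN {
  v = log(n) / log(2) / 2 - 1
  if (v > 14) v = 14
  printf "%.2f", v
}')
echo "genome length: $genome_length; formula value: $nbases"
```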

You will now have many new files in the `BIOL343/2_genome_exploration/genome/star` directory. Many of these are `.txt` files that are clearly named. Click through them to check out their contents. Crucially, you also have `SA` and `SAindex`, which hold the suffix array and its index that will be used during alignment. Using OnDemand, you can see that the SA is several GB in size, which explains why a lot of RAM is required to use it.

Each STAR command also generates a `Log.out` file. Each bioinformatics tool handles logs in its own way, but you should grow comfortable viewing logs and diagnosing potential problems.

### Aligning
Here are the manual instructions for the aligning steps:

>The basic options to run a mapping job are as follows:  
`--runThreadN` *NumberOfThreads*  
`--genomeDir` */path/to/genomeDir*  
`--readFilesIn` */path/to/read1 [/path/to/read2 ]*  
`--genomeDir` specifies path to the genome directory where genome indices were generated
(see Section 2. Generating genome indexes).  
`--readFilesIn` name(s) (with path) of the files containing the sequences to be mapped (e.g.
RNA-seq FASTQ files). If using Illumina paired-end reads, the read1 and read2 files have to
be supplied. STAR can process both FASTA and FASTQ files. Multi-line (i.e. sequence split
in multiple lines) FASTA (but not FASTQ) files are supported.  
If the read files are compressed, use the `--readFilesCommand` *UncompressionCommand* option,
where *UncompressionCommand* is the un-compression command that takes the file name as
input parameter, and sends the uncompressed output to stdout. For example, for gzipped
files (\*.gz) use `--readFilesCommand` *zcat* OR `--readFilesCommand` *gunzip -c*. For bzip2-compressed files, use `--readFilesCommand` *bunzip2 -c*.

Run the alignment step using one FASTQ in the code block below. Remember to use the trimmed reads. By default, this command creates a SAM file, which we'll talk more about next week. A few options relate to this file: we'll set options to create a BAM instead of a SAM, to sort the file by coordinate, and to keep unmapped reads in the output. Finally, we also need to set `--outFileNamePrefix` to make sure each output file gets a distinct name. We will make a separate directory to organize the output files.

In [2]:
!STAR \
--runThreadN 32 \
--runMode alignReads \
--genomeDir ../2_genome_exploration/genome/star \
--readFilesIn ../5_fastq/trimmed/SRR26691082.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outFileNamePrefix alignment/star/SRR26691082/

	STAR --runThreadN 32 --runMode alignReads --genomeDir ../2_genome_exploration/genome/star --readFilesIn ../5_fastq/trimmed/SRR26691082.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outFileNamePrefix alignment/star/SRR26691082/
	STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Oct 15 15:48:28 ..... started STAR run
Oct 15 15:48:28 ..... loading genome
Oct 15 15:48:30 ..... started mapping
Oct 15 15:49:01 ..... finished mapping
Oct 15 15:49:01 ..... started sorting BAM
Oct 15 15:49:10 ..... finished successfully


***1 minute time to completion***

After completion, you will see a number of new files in `BIOL343/6_alignment/alignment/star/SRR26691082/`, including `Log.out`. The results that we care about are stored in `BIOL343/6_alignment/alignment/star/SRR26691082/Aligned.sortedByCoord.out.bam`. Take a look at `Log.out`, then delete the entire `SRR26691082` directory; we're going to realign it in a more sophisticated way...

#### Two-pass alignment

Section 9 of the STAR manual provides guidance for running so-called 2-pass mapping. In this scheme, the first pass maps to the known splice junctions provided in the GTF, while the second pass re-maps to known and novel junctions (the junctions detected in the first pass are output in `SJ.out.tab`). However, section 9.1 recommends that we include the junctions from ***all*** samples, so let's rerun the alignment, this time aligning all of the trimmed FASTQ files in the dataset.
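To get a feel for what `SJ.out.tab` holds, here is a small sketch that counts junctions by annotation status. Per the STAR manual, column 6 is `1` for junctions already in the annotation and `0` otherwise, and column 7 is the number of uniquely mapping reads crossing the junction. The three rows below are made up for illustration, and the read-count cutoff of 3 is an arbitrary assumption, not a recommendation from the source.

```shell
# Made-up rows in SJ.out.tab layout (tab-separated):
# chrom, start, end, strand, motif, annotated, unique reads, multi reads, max overhang
sj=$(mktemp)
printf 'chr1\t100\t200\t1\t1\t1\t25\t0\t30\n'  > "$sj"
printf 'chr1\t500\t900\t2\t2\t0\t4\t1\t22\n'  >> "$sj"
printf 'chr2\t50\t80\t1\t1\t0\t1\t0\t12\n'    >> "$sj"

# Junctions not present in the annotation (column 6 == 0).
novel=$(awk -F'\t' '$6 == 0 { n++ } END { print n + 0 }' "$sj")

# Unannotated junctions supported by at least 3 uniquely mapped reads.
supported=$(awk -F'\t' '$6 == 0 && $7 >= 3 { n++ } END { print n + 0 }' "$sj")

echo "unannotated junctions: $novel (with >= 3 unique reads: $supported)"
```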

There are a few different ways to use multiple FASTQ files as input, which are described in section 3.2. Take some time to read through that section...

So, we're going to align all the FASTQ files and generate a single output file (BAM or SAM). However, down the road we will want to differentiate reads from each FASTQ, because we'll want to know if a given transcript was differentially expressed between tissues/treatment groups. To help differentiate, each aligned read in the output file will carry an RG (read group) tag, and the read groups themselves are declared in the file header (more on that later). The important thing to know now is that we need to provide STAR with an RG ID for each FASTQ. We can do that by creating a tab-separated file (TSV) called the "manifest" and directing STAR to it with `--readFilesManifest /path/to/manifest.tsv`. For single-end reads, that manifest should have 3 columns: `read1-file-name tab - tab read-group-line`, e.g.:

|read1-file-name| - | read-group-line |
|---------------|----|-----------------|
| /data/users/wheelenj/BIOL343/5_fastq/trimmed/SRR26691082.fastq.gz | - | LIV_ma1| 
| /data/users/wheelenj/BIOL343/5_fastq/trimmed/SRR26691083.fastq.gz | - | LIV_im3|

The `read-group-line` field will be populated with sample details from `5_fastq/SraRunTable.txt`. You could create the manifest using command line tools and the run table, but I've provided the TSV for you at `6_alignment/manifest.tsv`.
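For reference, a manifest in this layout could be generated along the following lines. This is only a sketch: it starts from a hypothetical two-column table (run accession, sample label) rather than the real `SraRunTable.txt`, whose column positions are not shown here, so the `awk` field numbers would need adjusting.

```shell
# Hypothetical two-column input: run accession, sample label.
runs=$(mktemp)
printf 'SRR26691082,LIV_ma1\nSRR26691083,LIV_im3\n' > "$runs"

fastq_dir="/data/users/wheelenj/BIOL343/5_fastq/trimmed"
manifest=$(mktemp)

# Column 1: FASTQ path; column 2: "-" (no read2 file for single-end data);
# column 3: the read group label.
awk -F',' -v dir="$fastq_dir" -v OFS='\t' \
  '{ print dir "/" $1 ".fastq.gz", "-", $2 }' "$runs" > "$manifest"

cat "$manifest"
```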

In [3]:
!STAR \
--runThreadN 32 \
--runMode alignReads \
--genomeDir ../2_genome_exploration/genome/star \
--readFilesManifest manifest.tsv \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outFileNamePrefix alignment/star/

	STAR --runThreadN 32 --runMode alignReads --genomeDir ../2_genome_exploration/genome/star --readFilesManifest manifest.tsv --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outFileNamePrefix alignment/star/
	STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Oct 15 16:00:32 ..... started STAR run
Oct 15 16:00:32 ..... loading genome
Oct 15 16:00:33 ..... started mapping
Oct 15 16:08:29 ..... finished mapping
Oct 15 16:08:30 ..... started sorting BAM
Oct 15 16:10:43 ..... finished successfully


***9 minute time to completion***

You can track the progress by viewing `6_alignment/alignment/star/Log.progress.out`. Mapping statistics can be found in `6_alignment/alignment/star/Log.final.out`. Let's copy that file under a new name so it doesn't get overwritten in the next section:

In [4]:
!cp alignment/star/Log.final.out alignment/star/first-pass.final.out


Now that we're finished with the first pass, which has given us new information regarding potential splice junctions in `6_alignment/alignment/star/SJ.out.tab`, we can run the second pass:

In [6]:
!STAR \
--runThreadN 32 \
--runMode alignReads \
--genomeDir ../2_genome_exploration/genome/star \
--readFilesManifest manifest.tsv \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outFileNamePrefix alignment/star/ \
--sjdbFileChrStartEnd alignment/star/SJ.out.tab

!cp alignment/star/Log.final.out alignment/star/second-pass.final.out

	STAR --runThreadN 32 --runMode alignReads --genomeDir ../2_genome_exploration/genome/star --readFilesManifest manifest.tsv --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outFileNamePrefix alignment/star/ --sjdbFileChrStartEnd alignment/star/SJ.out.tab
	STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Oct 15 16:13:15 ..... started STAR run
Oct 15 16:13:16 ..... loading genome
Oct 15 16:13:17 ..... inserting junctions into the genome indices
Oct 15 16:13:34 ..... started mapping
Oct 15 16:22:45 ..... finished mapping
Oct 15 16:22:47 ..... started sorting BAM
Oct 15 16:25:45 ..... finished successfully


***15 minute time to completion***

Open `first-pass.final.out` and `second-pass.final.out`. What differences stick out? How can you explain those differences?
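If you would rather pull individual numbers out of the two files than read them side by side, a small helper like the one below may be handy. It is a sketch that assumes the `Metric name |<TAB>value` layout of `Log.final.out`. The function name `get_stat` is made up, and the toy file uses placeholder values, not real results.

```shell
# Print the value of one metric from a Log.final.out-style file.
# $1 = file, $2 = exact metric name (text to the left of the "|").
get_stat() {
  awk -F'|' -v key="$2" '
    { name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name) }
    name == key { val = $2; gsub(/^[ \t]+|[ \t]+$/, "", val); print val }
  ' "$1"
}

# Toy file with placeholder values, only to show the helper working.
toy_log=$(mktemp)
printf '        Uniquely mapped reads %% |\t12.34%%\n' > "$toy_log"
printf '       Number of splices: Total |\t5\n'       >> "$toy_log"

uniq_pct=$(get_stat "$toy_log" "Uniquely mapped reads %")
echo "Uniquely mapped reads %: $uniq_pct"

# With the real logs, one could call, for example:
#   get_stat alignment/star/first-pass.final.out  "Number of splices: Total"
#   get_stat alignment/star/second-pass.final.out "Number of splices: Total"
```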

## HISAT

Now it's time to use HISAT. Remember how HISAT works...it is a splice-aware aligner that transforms the genome with the Burrows-Wheeler transform (BWT) and searches it using an FM-index. It is very fast and can soft-clip, which means trimming isn't as necessary as with other aligners. It's slightly slower than STAR because it doesn't use an SA, but the tradeoff is that it uses much less memory to align. However, a lot of memory is needed for a one-time generation of the genome index files. Our genome has been indexed already and can be found at `2_genome_exploration/genome/hisat`. A summary of the commands used to generate the index and the associated logs can be found at `6_alignment/hisat_index.txt`.
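As a reminder of what the BWT is, the sketch below computes it the naive way for a made-up string: list every rotation of the text (with a `$` end marker), sort the rotations, and read off the last character of each. This is the textbook definition only. HISAT builds its index very differently, and the FM-index adds further lookup tables on top of the transformed text.

```shell
# Made-up text with a "$" end-of-string marker.
text='GATTACA$'

bwt=$(
  printf '%s\n' "$text" |
    # Emit all rotations: suffix starting at i, followed by the prefix before i.
    awk '{ n = length($0); for (i = 1; i <= n; i++) print substr($0, i) substr($0, 1, i - 1) }' |
    # Sort rotations in byte order ("$" sorts before the letters).
    LC_ALL=C sort |
    # The BWT is the last column of the sorted rotations.
    awk '{ printf "%s", substr($0, length($0)) } END { print "" }'
)

echo "BWT of $text is $bwt"
```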

As we use HISAT to align reads, we will be using the [HISAT manual](https://daehwankimlab.github.io/hisat2/manual/) heavily. I recommend opening it in a separate tab/window.

Unfortunately, HISAT cannot assign different RG tags based on the input FASTQ files, so we won't be able to align all the data in a single command. Instead, we'll use a Bash loop, like in `5_fastq`. We will also use the `--new-summary` flag to write out logs that are compatible with [MultiQC](https://multiqc.info/modules/hisat2/).

In [2]:
%mkdir -p alignment/hisat/

!while read -r line; do \
    fq=$(echo "$line" | awk '{print $1}'); \
    bn=$(basename "$fq" .fastq.gz); \
    rg=$(echo "$line" | awk '{print $3}' | sed 's/ID://' ); \
    echo "Aligning $fq with $rg"; \
    hisat2 -x /data/users/willetse0745/BIOL343/2_genome_exploration/genome/hisat/genome -p 16 -U "$fq" --rg-id "$rg" --rg "SM:$rg" --summary-file "alignment/hisat/$bn.log" --new-summary > "alignment/hisat/$bn.sam"; \
    done < manifest.tsv

Aligning /data/users/willetse0745/BIOL343/5_fastq/trimmed/SRR26691082.fastq.gz with LIV_ma1
Could not locate a HISAT2 index corresponding to basename "/data/users/willetse0745/BIOL343/2_genome_exploration/genome/hisat/genome"
Error: Encountered internal HISAT2 exception (#1)
Command: /data/users/willetse0745/.conda/envs/biol343/bin/hisat2-align-s --wrapper basic-0 -p 16 --rg-id LIV_ma1 --rg SM:LIV_ma1 --summary-file alignment/hisat/SRR26691082.log --new-summary --read-lengths 69,68,66,64,65,67,63,62,61,60,59,47,56,57,58,55,48,52,53,51,54,50,49,46,43,44,45,42,41,40,38,37,36,39,35,27,22,34,33,31,24,20,32,28,26 -U /tmp/1601489.unp /data/users/willetse0745/BIOL343/2_genome_exploration/genome/hisat/genome 
(ERR): hisat2-align exited with value 1
Aligning /data/users/willetse0745/BIOL343/5_fastq/trimmed/SRR26691083.fastq.gz with LIV_im3
Could not locate a HISAT2 index corresponding to basename "/data/users/willetse0745/BIOL343/2_

***14 minute completion time***

While you're waiting, do some research to figure out how the above `while` loop works. You'll need to read about `awk` and `basename`, as well as `while read -r line; do ... done < file.txt`. You should also look into the `sed` command and how it is used to find/replace text.
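If you get stuck, it can help to run a stripped-down version of the loop that only prints what each variable holds. The sketch below uses a made-up two-line manifest (the paths and the `ID:` prefix in the third column are assumptions for illustration) and replaces the `hisat2` call with `echo`.

```shell
# Made-up manifest: path <TAB> - <TAB> read group line.
toy=$(mktemp)
printf '/tmp/reads/sampleA.fastq.gz\t-\tID:groupA\n'  > "$toy"
printf '/tmp/reads/sampleB.fastq.gz\t-\tID:groupB\n' >> "$toy"

summary=""
# "read -r line" stores one line of the file per iteration;
# "done < file" feeds the file to the loop.
while read -r line; do
  fq=$(echo "$line" | awk '{print $1}')                  # column 1: FASTQ path
  bn=$(basename "$fq" .fastq.gz)                         # strip directory and suffix
  rg=$(echo "$line" | awk '{print $3}' | sed 's/ID://')  # column 3, minus "ID:"
  echo "fq=$fq bn=$bn rg=$rg"
  summary="${summary}${bn}:${rg} "
done < "$toy"
```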

We will talk more about alignment formats (SAM/BAM/CRAM) next week. Until then, take a look at the logs/summaries for STAR and HISAT alignments. What differences do you notice?