# Mapping Reads to a Reference Genome

## Shell Variables

In [1]:
# Source the config script
source bioinf_intro_config.sh

ls $CUROUT

count_out  igv     qc_output  stuff_for_igv_shorter_intron.tgz  trimmed_fastqs
genome     myinfo  star_out   stuff_for_igv.tgz


## Mapping with STAR


In [2]:
STAR \
    --runMode alignReads \
    --twopassMode None \
    --genomeDir $GENOME_DIR \
    --readFilesIn $TRIMMED/21_2019_P_M1_S21_L002_R1_001.trim.fastq.gz \
    --readFilesCommand gunzip -c \
    --outFileNamePrefix ${STAR_OUT}/21_2019_P_M1_S21_L002_R1_ \
    --quantMode GeneCounts \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped Within \
    --runThreadN 2

Jun 26 15:10:01 ..... started STAR run
Jun 26 15:10:01 ..... loading genome
Jun 26 15:10:05 ..... started mapping
Jun 26 15:11:40 ..... finished successfully


We will start with these parameters, but the [STAR Manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf) documents an extensive list of command-line options. It is a good idea to read through the manual and try to understand all of them. We will discuss some more later.

* `--runMode alignReads` : map reads
* `--twopassMode` : run one pass or two? With `None`, as here, STAR runs a single pass. If two-pass mode is on, STAR tries to discover novel junctions in the first pass, then reruns mapping with these added to the annotation
* `--genomeDir` : directory containing the genome index
* `--readFilesIn` : input FASTQ
* `--readFilesCommand gunzip -c` : use "gunzip -c" to uncompress the FASTQ on the fly, since it is gzipped
* `--outFileNamePrefix` : prefix (and path) to use for all output files
* `--quantMode GeneCounts` : output a table of read counts per gene
* `--outSAMtype BAM Unsorted` : output an unsorted BAM file
* `--outSAMunmapped Within` : include unmapped reads in the BAM file
* `--runThreadN` : tells STAR to run using multiple cores. I am using it so we don't have to wait too long for this to run during class. It is OK to use multiple cores, but before you do, be sure that the server is not busy, and even then use a reasonable number of cores. Abusing multi-threading is inconsiderate of other users and could crash the server.
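
As a rough guide to choosing `--runThreadN` considerately, the sketch below caps the thread count at half of the cores on the machine. The half-the-cores rule and the variable names `REQUESTED`, `LIMIT`, and `THREADS` are my own assumptions for illustration, not a STAR recommendation.

```shell
# Hypothetical helper: pick a polite thread count for --runThreadN.
REQUESTED=8                                               # threads we would like to use
CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)  # cores on this machine
LIMIT=$(( CORES / 2 ))                                    # assumed courtesy cap: half the cores
if [ "$LIMIT" -lt 1 ]; then LIMIT=1; fi                   # always allow at least one thread

if [ "$REQUESTED" -gt "$LIMIT" ]; then
    THREADS=$LIMIT
else
    THREADS=$REQUESTED
fi
echo "Using --runThreadN $THREADS (machine has $CORES cores)"
# Before launching STAR, also run `uptime` or `top` to see how busy the server is.
```

The resulting value could then be passed to STAR as `--runThreadN $THREADS`.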


### STAR Output
So what happened? Let's take a look . . .

In [3]:
ls ${STAR_OUT}

21_2019_P_M1_S21_L001_R1_short_introns_Aligned.sortedByCoord.out.bam
21_2019_P_M1_S21_L001_R1_short_introns_Aligned.sortedByCoord.out.bam.bai
21_2019_P_M1_S21_L001_R1_short_introns_Log.final.out
21_2019_P_M1_S21_L001_R1_short_introns_Log.out
21_2019_P_M1_S21_L001_R1_short_introns_Log.progress.out
21_2019_P_M1_S21_L001_R1_short_introns_ReadsPerGene.out.tab
21_2019_P_M1_S21_L001_R1_short_introns_SJ.out.tab
21_2019_P_M1_S21_L002_R1_Aligned.out.bam
21_2019_P_M1_S21_L002_R1_Log.final.out
21_2019_P_M1_S21_L002_R1_Log.out
21_2019_P_M1_S21_L002_R1_Log.progress.out
21_2019_P_M1_S21_L002_R1_ReadsPerGene.out.tab
21_2019_P_M1_S21_L002_R1_short_introns_Aligned.sortedByCoord.out.bam
21_2019_P_M1_S21_L002_R1_short_introns_Aligned.sortedByCoord.out.bam.bai
21_2019_P_M1_S21_L002_R1_short_introns_Log.final.out
21_2019_P_M1_S21_L002_R1_short_introns_Log.out
21_2019_P_M1_S21_L002_R1_short_introns_Log.progress.out
21_2019_P_M1_S21_L002_R1_short_introns_ReadsPerGene.out.tab
21_2019_P_M1_S21_L002_R1_short_in

In [4]:
head ${STAR_OUT}/21_2019_P_M1_S21_L002_R1*

==> /home/jovyan/work/scratch/bioinf_intro/star_out/21_2019_P_M1_S21_L002_R1_Aligned.out.bam <==
(binary BAM data; not human-readable)

==> /home/jovyan/work/scratch/bioinf_intro/star_out/21_2019_P_M1_S21_L002_R1_short_introns_Log.final.out <==
                                 Started job on |	Jun 24 09:50:39
                             Started mapping on |	Jun 24 09:50:39
                                    Finished on |	Jun 24 09:51:17
       Mapping speed, Million of reads per hour |	230.76

                          Number of input reads |	2435761
                      Average input read length |	75
                                    UNIQUE READS:
                   Uniquely mapped reads number |	2355964
                        Uniquely mapped reads % |	96.72%

==> /home/jovyan/work/scratch/bioinf_intro/star_out/21_2019_P_M1_S21_L002_R1_short_introns_Log.out <==
STAR version=STAR_2.5.2b
STAR compilation time,server,dir=<not set in

STAR generates several files for each FASTQ:
* Aligned.out.bam : the alignments in BAM format (binary, which is why `head` printed gibberish for it above)
* Log.out : lots of details of the run, including all parameters used
* Log.final.out : important summary statistics
* ReadsPerGene.out.tab : count table, the main thing we are interested in
* SJ.out.tab : all splice junctions, including ones from the GTF and novel junctions discovered by STAR
* Log.progress.out : run statistics updated during the run, not so interesting at the end
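
For orientation, `ReadsPerGene.out.tab` is a tab-delimited file: four summary rows (`N_unmapped`, `N_multimapping`, `N_noFeature`, `N_ambiguous`) followed by one row per gene, with column 2 holding counts for unstranded data and columns 3 and 4 holding counts for the two strand orientations. The sketch below totals the column 2 counts. The gene IDs and numbers in it are invented for illustration; on real data you would point `awk` at the `ReadsPerGene.out.tab` file in `${STAR_OUT}` instead.

```shell
# Toy stand-in for a ReadsPerGene.out.tab file (all values are invented).
COUNTS=$(mktemp)
printf 'N_unmapped\t10\t10\t10\n'   >  "$COUNTS"
printf 'N_multimapping\t5\t5\t5\n'  >> "$COUNTS"
printf 'N_noFeature\t7\t20\t9\n'    >> "$COUNTS"
printf 'N_ambiguous\t1\t0\t0\n'     >> "$COUNTS"
printf 'geneA\t100\t2\t98\n'        >> "$COUNTS"
printf 'geneB\t50\t1\t49\n'         >> "$COUNTS"

# Skip the N_* summary rows, then sum the unstranded counts (column 2).
GENE_TOTAL=$(awk -F'\t' '$1 !~ /^N_/ { sum += $2 } END { print sum + 0 }' "$COUNTS")
echo "Reads assigned to genes (unstranded): $GENE_TOTAL"
rm -f "$COUNTS"
```

Which count column is appropriate depends on the strandedness of the library, so it is worth comparing the three columns before picking one.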

Let's take a closer look at Log.final.out

In [5]:
cat ${STAR_OUT}/21_2019_P_M1_S21_L002_R1_Log.final.out

                                 Started job on |	Jun 26 15:10:01
                             Started mapping on |	Jun 26 15:10:05
                                    Finished on |	Jun 26 15:11:40
       Mapping speed, Million of reads per hour |	92.30

                          Number of input reads |	2435761
                      Average input read length |	75
                                    UNIQUE READS:
                   Uniquely mapped reads number |	2354695
                        Uniquely mapped reads % |	96.67%
                          Average mapped length |	75.36
                       Number of splices: Total |	610596
            Number of splices: Annotated (sjdb) |	596312
                       Number of splices: GT/AG |	603894
                       Number of splices: GC/AG |	5846
                       Number of splices: AT/AC |	112
               Number of splices: Non-canonical |	744
                      Mismatch rate per base, % |	0.16%
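
The percentages in this report can be recomputed from the raw counts, which is also a handy way to pull numbers out of `Log.final.out` in a script. A minimal sketch, using a temporary file that holds two lines copied from the output above (on real data, point `awk` at the `Log.final.out` file itself); the function and variable names are my own:

```shell
# Two lines copied from the Log.final.out shown above ("label |<TAB>value").
LOG=$(mktemp)
printf '%s |\t%s\n' \
    '                          Number of input reads' '2435761' \
    '                   Uniquely mapped reads number' '2354695' > "$LOG"

# Split each line on "|", trim whitespace, and look a value up by its label.
get_stat() {
    awk -F'|' -v key="$1" '
        { gsub(/^[ \t]+|[ \t]+$/, "", $1); gsub(/^[ \t]+|[ \t]+$/, "", $2) }
        $1 == key { print $2 }' "$LOG"
}

INPUT=$(get_stat "Number of input reads")
UNIQUE=$(get_stat "Uniquely mapped reads number")
PCT=$(awk -v u="$UNIQUE" -v n="$INPUT" 'BEGIN { printf "%.2f", 100 * u / n }')
echo "Uniquely mapped: $UNIQUE of $INPUT reads ($PCT%)"
rm -f "$LOG"
```

The recomputed value agrees with the 96.67% that STAR reports.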
                       

#### Sanity Check: Number of Reads
Is the number of input reads what we expect? Let's look at how many reads are in the input FASTQ. Each FASTQ record takes up four lines, so the number of reads is the number of lines divided by 4.

In [6]:
zcat $TRIMMED/21_2019_P_M1_S21_L002_R1_001.trim.fastq.gz | awk '{s++}END{print s/4}' 

2435761
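
The same check can be scripted so that a mismatch is flagged automatically. The sketch below builds a two-read toy file (invented reads, hypothetical file name) so that it runs anywhere, and uses `gzip -dc` in place of `zcat` for portability. For real data, `EXPECTED` would be the "Number of input reads" value from `Log.final.out`.

```shell
# Build a tiny gzipped FASTQ with two invented reads.
TMPDIR_FQ=$(mktemp -d)
FQ="$TMPDIR_FQ/toy.fastq.gz"
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip -c > "$FQ"

# Each FASTQ record is 4 lines, so reads = lines / 4.
N_READS=$(gzip -dc "$FQ" | awk 'END { print NR / 4 }')

# Compare with the number we expect (2 for the toy file).
EXPECTED=2
if [ "$N_READS" -eq "$EXPECTED" ]; then
    echo "OK: $N_READS reads"
else
    echo "MISMATCH: FASTQ has $N_READS reads, expected $EXPECTED"
fi
rm -rf "$TMPDIR_FQ"
```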


#### Sanity Check: Unmapped Reads
It is always a good idea to examine a sample of unmapped reads to figure out what they are. The easiest way to do this is with BLAST. In the past, BLASTing unmapped reads showed me that an experiment was contaminated with a different species. In that case there was a large number of unmapped reads, which raised my suspicions.

Even with a high rate of mapped reads, it is worth spending a few minutes to check the unmapped ones. Because we ran STAR with `--outSAMunmapped Within`, the unmapped reads are in the same BAM file as the alignments. The simplest thing to do is use `samtools` to generate a FASTA of the unmapped reads (`-f 4` keeps only reads with the "unmapped" flag set), grab a few of these sequences, and then [BLAST them against the nucleotide collection (nr/nt) database](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome).

In [7]:
samtools fasta -f 4 ${STAR_OUT}/21_2019_P_M1_S21_L002_R1_Aligned.out.bam | head -n20

>NB501800:327:HF27FBGXB:2:11202:18025:8102
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAA
>NB501800:327:HF27FBGXB:2:11202:20718:8137
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTAAAAGTGGGGGTTGTTTTTTATTTTTTTGTAGATTTAAAAA
>NB501800:327:HF27FBGXB:2:11202:9226:8227
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAA
>NB501800:327:HF27FBGXB:2:11202:17164:8437
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAA
>NB501800:327:HF27FBGXB:2:11202:25989:8488
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTAACTCGTATGCCGTCTTATGCTTGAAAAAAA
>NB501800:327:HF27FBGXB:2:11202:11951:8497
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAA
>NB501800:327:HF27FBGXB:2:11202:11505:8555
ATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAA
>NB501800:327:HF27FBGXB:2:11202:21516:8563
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTTGAAAAAAATAAG
>NB501800:327:HF27FBGXB:2:11202:8434:8590
GATCGGAAGAGCACACGT
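
To decide which unmapped reads to BLAST, it can help to tally the most common sequence prefixes first, since many near-identical reads only need to be checked once. A sketch on a few made-up FASTA records, assuming one sequence line per record; the 12-base prefix length is an arbitrary choice of mine. On real data, replace the toy file with the output of the `samtools fasta -f 4` command above.

```shell
# Toy FASTA standing in for the samtools output (sequences are invented).
FA=$(mktemp)
printf '>r1\nGATCGGAAGAGCACACGTCT\n'  >  "$FA"
printf '>r2\nGATCGGAAGAGCTTTTTTTT\n'  >> "$FA"
printf '>r3\nTTTTTTTTTTTTTTTTGTAA\n'  >> "$FA"
printf '>r4\nGATCGGAAGAGCACACGTAA\n'  >> "$FA"

# Keep sequence lines only, cut to the first 12 bases, and count duplicates,
# most frequent first. The final awk normalizes the spacing of `uniq -c`.
TOP=$(grep -v '^>' "$FA" | cut -c1-12 | sort | uniq -c | sort -rn |
      awk '{ print $1, $2 }' | head -n 5)
echo "$TOP"
rm -f "$FA"
```

One representative read from each of the top groups is then usually enough to paste into BLAST.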

### MultiQC
MultiQC also works with STAR reports, so let's try it!

In [8]:
multiqc ${STAR_OUT} --outdir ${STAR_OUT}

[INFO   ]         multiqc : This is MultiQC v1.7
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '/home/jovyan/work/scratch/bioinf_intro/star_out'
Searching 21 files..  [####################################]  100%
[INFO   ]            star : Found 3 reports and 3 gene count files
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : ../../scratch/bioinf_intro/star_out/multiqc_report.html
[INFO   ]         multiqc : Data        : ../../scratch/bioinf_intro/star_out/multiqc_data
[INFO   ]         multiqc : MultiQC complete


Once `multiqc` has finished running, we can view the results by finding the output in the Jupyter file browser. It should be a file named `multiqc_report.html` in:

In [9]:
echo ${STAR_OUT}

/home/jovyan/work/scratch/bioinf_intro/star_out
