# In this notebook, we will sort and index BAM files for downstream analyses

**Introduction to samtools**

We will next use samtools to work with our aligned sequencing data, which is currently in SAM (Sequence Alignment/Map) format. Samtools is a suite of tools for manipulating alignments stored in SAM and BAM (the binary version of SAM) files. Because BAM files are binary, they cannot be read as plain text. Samtools lets us view the contents of BAM files and perform various manipulations on them.

Check out the samtools [documentation](http://www.htslib.org/doc/samtools.html).

Some of these commands might not mean much right now, so let's take a look at our aligned .sam files.

You will notice that the format is all over the place. You can structure the file a bit more by using:
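
As a minimal sketch of one way to do this (an assumption, and certainly not the only option), the snippet below writes a tiny made-up SAM file, skips the header lines that start with `@`, and prints the first six tab-separated columns in fixed-width form with `awk`. The file name `toy.sam` and its two reads are hypothetical.

```shell
# Build a tiny, made-up SAM file so the example is self-contained.
tmpdir=$(mktemp -d)
printf '@HD\tVN:1.6\tSO:unsorted\n' > "$tmpdir/toy.sam"
printf '@SQ\tSN:chr1\tLN:1000\n' >> "$tmpdir/toy.sam"
printf 'read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "$tmpdir/toy.sam"
printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGGCC\tIIIIIIIIII\n' >> "$tmpdir/toy.sam"

# Skip header lines (they start with "@") and print columns 1-6 aligned.
structured=$(awk -F'\t' '!/^@/ { printf "%-8s %-5s %-6s %-6s %-5s %s\n", $1, $2, $3, $4, $5, $6 }' "$tmpdir/toy.sam")
echo "$structured"

rm -r "$tmpdir"
```

With a real alignment file, the `awk` line can be pointed at your own .sam file and piped into `head` to keep the output short.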

This file is not easy to interpret because a lot of information has to fit within it. The full SAM format specification is available [here](http://samtools.github.io/hts-specs/SAMv1.pdf). It conveys a lot of information, and not all of it is useful to us. Here are a few pointers:

The header lines: 

@SQ: Reference sequence dictionary. Lists the name and length of each sequence in the reference genome that the reads were aligned to. 

@PG: Program record. Includes the command used for the alignment 

Column 1: read name 

Column 2: bitwise flag. Most importantly '4' means unmapped 

Column 3: Technically reference sequence name (from header section). Essentially the chromosome to which it was mapped. 

Column 4: leftmost mapping position 

Column 5: mapping quality, 255 means quality not available 

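Column 6: CIGAR string, specifies how the read aligns to the reference. S (soft clipped), M (alignment match, which can be a sequence match or a mismatch), N (skipped region from reference) 

Column 7-9: info about the mate (the other read of the pair) and the template length 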

Column 10: nucleotide sequence 

Column 11: base quality 

Column 12+: additional information, nM (number of mismatches)
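
Because column 2 is a bitwise flag, a single number can encode several properties at once, and each property is tested with a bitwise AND. The sketch below checks the "unmapped" bit (value 4) using shell arithmetic. The function name `is_unmapped` and the example flag values are illustrative.

```shell
# Return success (exit status 0) when the "read unmapped" bit (4) is set.
is_unmapped() {
    [ $(( $1 & 4 )) -ne 0 ]
}

# 77 = 1 + 4 + 8 + 64 contains the 4 bit; 99 = 1 + 2 + 32 + 64 does not.
for flag in 4 77 99; do
    if is_unmapped "$flag"; then
        echo "flag $flag: unmapped"
    else
        echo "flag $flag: mapped"
    fi
done
```

If samtools is available, `samtools flags 77` spells out every bit that is set in a flag value.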


### Q:
Remember we downloaded samtools at the beginning of class. How do you check if it downloaded and what version you are running?
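
One way to check, sketched here on the assumption that samtools ends up on your `PATH` when the installation works, is to ask the shell where the program lives and then ask samtools to report its version. The variable name `samtools_status` is illustrative.

```shell
# Check whether the shell can find samtools, then report its version.
if command -v samtools >/dev/null 2>&1; then
    samtools_status="installed"
    command -v samtools               # path to the executable
    samtools --version | head -n 1    # first line names the version
else
    samtools_status="missing"
    echo "samtools was not found on PATH"
fi
```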
    

Now we need to use samtools to sort and index BAM files for downstream analysis. First, work with your partner to figure out how you would sort a BAM/SAM file and save it to a new file with the extension .sorted.bam


```
samtools sort -O BAM -o example.sorted.bam example.sam
```
  

*Q: What was the file format of the output of our alignment? Will this work (and if so, how do you know)?*


We also need a BAI index of the sorted BAM file that we just created. Again, work with your partner to determine what that command would look like.


    samtools index example.sorted.bam
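
To see both steps end to end, the sketch below runs them on a tiny made-up SAM file (`toy.sam`, hypothetical) in a temporary directory and then checks the results: a coordinate-sorted file records `SO:coordinate` in its `@HD` header line, and the index lets samtools retrieve only the reads in a given region. The block is skipped if samtools is not installed.

```shell
if command -v samtools >/dev/null 2>&1; then
    tmpdir=$(mktemp -d)
    # Two reads written out of coordinate order on purpose.
    printf '@HD\tVN:1.6\tSO:unsorted\n' > "$tmpdir/toy.sam"
    printf '@SQ\tSN:chr1\tLN:1000\n' >> "$tmpdir/toy.sam"
    printf 'read2\t0\tchr1\t200\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "$tmpdir/toy.sam"
    printf 'read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "$tmpdir/toy.sam"

    samtools sort -O BAM -o "$tmpdir/toy.sorted.bam" "$tmpdir/toy.sam"
    samtools index "$tmpdir/toy.sorted.bam"

    # The header should now say the file is coordinate-sorted.
    samtools view -H "$tmpdir/toy.sorted.bam" | grep '^@HD'
    # With the index in place, samtools can fetch a single region.
    first_read=$(samtools view "$tmpdir/toy.sorted.bam" chr1:1-150 | cut -f 1)
    echo "read overlapping chr1:1-150: $first_read"
    ls "$tmpdir"    # toy.sam, toy.sorted.bam and the .bai index

    run_status="ran"
    rm -r "$tmpdir"
else
    run_status="skipped"
    echo "samtools not found; skipping"
fi
```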
    

**Looking at bam files with samtools**

On the resulting BAM files, try out some of the samtools commands:

    samtools view interesting_file.bam    # or interesting_file.sam
    
This will print all the alignments in the provided alignment file in SAM format. This leads to A LOT of text being displayed on the screen at once. Use control-c to end the command. How could you use samtools to view your BAM file of interest but only display a small amount of text at one time? 
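
A few common ways to keep the output manageable, shown as a sketch that reuses the placeholder file name from above: pipe into `head` to see only the first records, print just the header with `-H`, or count the records with `-c` instead of printing them. The block is skipped when samtools or the file is missing.

```shell
bam="interesting_file.bam"   # placeholder name; substitute your own file

if command -v samtools >/dev/null 2>&1 && [ -f "$bam" ]; then
    samtools view "$bam" | head -n 5       # first five alignments only
    samtools view -H "$bam" | head -n 5    # first five header lines
    samtools view -c "$bam"                # number of alignment records
    preview_status="shown"
else
    preview_status="skipped"
    echo "samtools or $bam not available; skipping"
fi
```

In an interactive terminal, `samtools view "$bam" | less -S` pages through the file one screen at a time without wrapping long lines.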
    
    samtools flagstat interesting_file.bam
    
This will calculate and give you quality control (QC) statistics for your alignment file.

As an example:

    70395066 + 3443258 in total (QC-passed reads + QC-failed reads)
    2960624 + 181960 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    70395066 + 3443258 mapped (100.00% : 100.00%)
    67434442 + 3261298 paired in sequencing
    33717221 + 1630649 read1
    33717221 + 1630649 read2
    67434442 + 3261298 properly paired (100.00% : 100.00%)
    67434442 + 3261298 with itself and mate mapped
    0 + 0 singletons (0.00% : 0.00%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
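
When many samples are involved, it can help to pull a single number out of this report. As a sketch, the snippet below feeds three lines of the example output into `awk` and recomputes the QC-passed mapping rate. With a real file, the here-document would be replaced by the output of `samtools flagstat` on that file. The variable names `report` and `mapped_pct` are illustrative.

```shell
# Three lines copied from the example flagstat report.
report=$(cat <<'EOF'
70395066 + 3443258 in total (QC-passed reads + QC-failed reads)
2960624 + 181960 secondary
70395066 + 3443258 mapped (100.00% : 100.00%)
EOF
)

# Field 1 is the QC-passed count; field 4 onward names the category.
mapped_pct=$(echo "$report" | awk '
    $4 == "in" && $5 == "total" { total = $1 }
    $4 == "mapped"              { mapped = $1 }
    END                         { printf "%.2f", 100 * mapped / total }')
echo "QC-passed mapping rate: ${mapped_pct}%"
```

Matching on the fourth field rather than on the word "mapped" anywhere in the line avoids picking up the "with itself and mate mapped" lines by mistake.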


*What else can samtools do? Read the documentation to find out what other tools are available to you through samtools.*

We will use the resulting sorted BAM files to run featureCounts, which will summarize our reads and map them back to genomic features (in our case, genes) so we can perform differential gene expression analysis.