# In this notebook, we will sort and index BAM files for downstream analyses and view the alignments in IGV

**Introduction to samtools**

Check out the samtools [documentation](http://www.htslib.org/doc/samtools.html). BAM files are binary (compressed), so they are not human-readable the way SAM files are. Samtools lets us view the contents of BAM files and perform various manipulations on them.

*Q: Remember we downloaded samtools on day 1. How do you check if it downloaded and what version you are running?*

Try out some of the samtools commands:

    samtools view interesting_file.bam    # a SAM file (interesting_file.sam) also works
    
    samtools flagstat interesting_file.bam
    

Now we need to use samtools to sort and index the bam file. Work with your partner to figure out how you would sort a bam file and save it to a new file with the extension .sorted.bam

    samtools sort -@ 8 -o interesting_file.sorted.bam interesting_file.bam 
    
We also need a bai index of the sorted bam file. Again, work with your partner to determine what that command would look like:

    samtools index interesting_file.sorted.bam

Now that you have figured out what the commands should be, write a submitter script with both of those commands that you can submit for the file you made. The sort command uses 8 threads (`-@ 8`), so we need to submit a job that requests 8 processors. Keep in mind that you can include two commands in the same script. Just put one below the other, and the second one will run after the first one is finished.
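As a rough starting point, a submitter script could look like the sketch below. It assumes a PBS/Torque-style scheduler; the job name, walltime, and the `sort_index.sh` file name are placeholders chosen for illustration, so adapt them to the settings you used for earlier jobs.

```shell
# Write a hypothetical submitter script. The #PBS lines assume a PBS/Torque
# scheduler and are placeholders to adapt to your own cluster settings.
cat > sort_index.sh <<'EOF'
#!/bin/bash
#PBS -N sort_index
#PBS -l nodes=1:ppn=8
#PBS -l walltime=01:00:00

# Start in the directory the job was submitted from
cd "$PBS_O_WORKDIR"

# Stop at the first error so indexing never runs on a failed sort
set -e

samtools sort -@ 8 -o interesting_file.sorted.bam interesting_file.bam
samtools index interesting_file.sorted.bam
EOF

# Check the script for syntax errors without running it
bash -n sort_index.sh && echo "sort_index.sh looks OK"

# Submit with (assumed PBS command): qsub sort_index.sh
```

The `set -e` line is optional; it is there so that a failed sort stops the job instead of letting `samtools index` run on a missing or incomplete file.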

This is just for practice on the bam file you made. I have already done this for the bam files provided, so you can view them in IGV.
    

**Download IGV to view your alignments**

Check out the IGV [website](https://www.broadinstitute.org/igv/) to download the application.

On the downloads page, scroll down to "Binary Distribution" and click "Download Binary Distribution". You are going to run IGV on your own computer, NOT on TSCC. Make a meaningful directory for IGV and move the downloaded file into it.

    mkdir -p ~/Desktop/BIOM200/module1/IGV
    mv ~/Downloads/IGV_2.3.79.zip ~/Desktop/BIOM200/module1/IGV
    
Unzip the file into that directory.

    unzip ~/Desktop/BIOM200/module1/IGV/IGV_2.3.79.zip -d ~/Desktop/BIOM200/module1/IGV
    
This will make a new folder in that directory called IGV_2.3.79. Go into that folder to find the igv.sh script and open IGV with bash. Remember, all of this is done on your local machine, NOT TSCC.

    cd ~/Desktop/BIOM200/module1/IGV/IGV_2.3.79
    bash igv.sh
    
This will open the application.

To view alignments, IGV needs access to both the bam files and their bai index files. You could download them from TSCC to your desktop and load them from there, but since the files are big, I have uploaded them to an external server (not TSCC) for you to view. There are four files in total, two for each condition (control and knockdown).

After you complete the download, open IGV (as described above). 

Select your genome with Genomes - Load Genome From Server. Choose hg19.

Load the bam files with File - Load from URL

The URL links are:

    https://s3-us-west-2.amazonaws.com/mstp-bioinformatics-2016/lin28b_ctrl_rep1.bam
    
    https://s3-us-west-2.amazonaws.com/mstp-bioinformatics-2016/lin28b_ctrl_rep2.bam
    
    https://s3-us-west-2.amazonaws.com/mstp-bioinformatics-2016/lin28b_kd_rep1.bam
    
    https://s3-us-west-2.amazonaws.com/mstp-bioinformatics-2016/lin28b_kd_rep2.bam

You can leave the index field blank. By default the program searches in the same location for another file by the same name with the extension .bai. I have uploaded these files as well, so IGV will find them by default.
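To make the index lookup concrete, the sketch below prints each of the four URLs listed above next to the index URL IGV would look for. It assumes the index follows the `<name>.bam.bai` naming that `samtools index` produces by default; the exact index file names on the server are not stated here, so treat them as an assumption.

```shell
# Base URL and sample names are taken from the list of URLs above.
base="https://s3-us-west-2.amazonaws.com/mstp-bioinformatics-2016"

for sample in lin28b_ctrl_rep1 lin28b_ctrl_rep2 lin28b_kd_rep1 lin28b_kd_rep2; do
    bam_url="${base}/${sample}.bam"
    # Assumption: the index sits next to the BAM as <name>.bam.bai
    bai_url="${bam_url}.bai"
    echo "${bam_url}"
    echo "    index: ${bai_url}"
done
```

If IGV ever reports that it cannot find an index, pasting the matching index URL into the index field explicitly is a reasonable first thing to try.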

Once you have uploaded all 4 files, play around by viewing different genes or chromosome locations. Can you see genes that clearly have fewer reads in the knockdown vs control datasets? What does LIN28B look like? What about let7?

We will come back to these files later as a first-pass check that our most highly differentially expressed genes show visible differences in read coverage.

When you are ready to quit IGV, you can save the session with File - Save session. Next time you open IGV you can open your saved session without having to reload the BAM files. 

**Make softlinks to the rest of the bam files we are interested in**

Since we only processed one dataset together, I will provide the bam files for the rest of the datasets we are interested in. They are located in:

    /projects/ps-yeolab/biom200_module1_2016/
    
Go into this folder and check out what is there to see which files you want to link. You need 8 in total: 4 bam files and 4 index files.
    
Remember the softlink syntax we learned before?

    ln -s sourcefilename destinationfilename
    
Use this syntax to make a softlink for each of the files provided and put the links in your ~/projects/lin28b_shrna/all_bams/ directory (you will need to make a directory called all_bams first).
    
Remember to use the full path for both the source and the destination when you make each softlink, and link the four index files as well as the four bam files.
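If typing eight `ln -s` commands feels repetitive, a loop can do the same job. The sketch below runs in a throwaway directory so nothing real is touched: `source_bams` stands in for the shared course folder, `all_bams` for your own directory, and the `example.bam` file names are made up for the demo.

```shell
# Demo in a temporary directory; all names below are hypothetical stand-ins.
demo=$(mktemp -d)
mkdir -p "$demo/source_bams" "$demo/all_bams"
touch "$demo/source_bams/example.bam" "$demo/source_bams/example.bam.bai"

# Link every BAM and its index, using full paths for source and destination
for f in "$demo"/source_bams/*.bam "$demo"/source_bams/*.bam.bai; do
    ln -s "$f" "$demo/all_bams/$(basename "$f")"
done

ls "$demo/all_bams"
```

For the real files, you would replace the two demo directories with the full paths given above and check that the glob patterns match the actual file names in the shared folder.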

*Q: How do you check that your softlink was made properly?*
