# Here we will use STAR to align reads to the genome

**Here we will be writing a submitter, or shell, script to submit a job to the TSCC cluster**

In your home directory, we will make a fake submitter script to serve as a template for all of our other submission jobs. Since you will likely use a variation of the same PBS commands for all your processing needs, you can copy this script and modify it for each job as necessary. These are Bash scripts, so they have the file extension .sh

    cd ~/
    vi fake_script.sh
    i
    #!/bin/bash
    #PBS -q hotel
    #PBS -N jobname
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=1:00:00
    #PBS -o outputfile
    #PBS -e errorfile
    esc
    :wq
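
If you would rather not type the template inside vi, the same file can be written in one step with a here-document. This is only a sketch of an alternative, not a required step; the `TEMPLATE` variable is an assumption added for illustration and defaults to the `~/fake_script.sh` path used above.

```shell
# Where to write the template. TEMPLATE is a hypothetical override added for this sketch.
# Note: this overwrites the file if it already exists.
TEMPLATE="${TEMPLATE:-$HOME/fake_script.sh}"

# The quoted 'EOF' stops the shell from expanding anything inside the block.
cat > "$TEMPLATE" <<'EOF'
#!/bin/bash
#PBS -q hotel
#PBS -N jobname
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
#PBS -o outputfile
#PBS -e errorfile
EOF

# Print the file back to confirm what was written.
cat "$TEMPLATE"
```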
    
Remember you can learn more about which submission parameters to include and what they do [here](http://www.sdsc.edu/support/user_guides/tscc-quick-start.html)

**Build STAR genome index**

Open the STAR user [manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf). We will go through this briefly together to get an understanding of how to read documentation. 

Open UCSC genome [browser](https://genome.ucsc.edu/). The link to the specific annotations we will use is provided below, but first take a look through the website to see all the available annotations and features. We will go through this together.

We will use UCSC to download the chromosome fasta files that are needed to build the STAR index. Use the same wget command followed by a copy of the web link address that we used previously to download the files to TSCC. The annotations are located [here](http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/). Scroll to the bottom of the page and get the link for chromFa.tar.gz. We are going to first make a directory and do this processing in scratch because we will need a lot of space. Once you have made the folder, move into it so your annotations will land in the proper place.

    mkdir -p ~/scratch/annotations/hg19/
    cd ~/scratch/annotations/hg19/
    wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

This will download a compressed tar archive. You can then extract it with:

    tar -xvf chromFa.tar.gz
    
Unfortunately this extracts EVERYTHING, including a lot of files that we do not need; we only want the chr#.fa files. We will remove the rest of the stuff in the directory with the rm command that we learned before. rm can remove several files in one call; the -r flag (recursive) additionally lets it remove directories and everything inside them.
You can use the star character to match all names that contain common characters. For example:

    rm -r *random*
    rm -r *chrUn*
    rm -r *hap*
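
Because rm does not ask for confirmation, it can help to preview what a pattern matches before deleting anything. The sketch below does this in a throwaway directory with empty stand-in files; the `demo` directory is an assumption for illustration, and the file names only mimic the kind of names found in the hg19 download.

```shell
# Throwaway directory with empty stand-in files (not the real download).
demo="${TMPDIR:-/tmp}/glob_demo"
rm -rf "$demo" && mkdir -p "$demo"
cd "$demo"
touch chr1.fa chr2.fa chr1_gl000191_random.fa chrUn_gl000211.fa chr6_apd_hap1.fa

# Preview: ls receives exactly the same expanded names that rm would.
ls *random* *chrUn* *hap*

# Remove the matches, then list what is left.
rm *random* *chrUn* *hap*
ls

# Return to the directory we started in.
cd - > /dev/null
```

In your real annotations folder, run only the `ls` line first and read the list before you run `rm`.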
    
Once the folder is clean and only contains one fasta file per chromosome (and the original tar.gz file) you can merge them all together using cat (concatenate) and assigning the output to a new file called allchrom.fa using >. This is the chromosome fasta file that you will need to use to generate the genome index.

    cat *.fa > allchrom.fa
    
*NOTE - the > character saves the result of your command to a new file. In this case, we want to save the result of concatenating together all of the individual chromosome files into a giant one called allchrom.fa*
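
One way to check the merged file is to count its sequence headers, since every sequence in a fasta file starts with a line beginning with >. The sketch below shows the idea on two tiny made-up sequences in a throwaway directory (the `demo` path and the sequences are assumptions for illustration only).

```shell
# Throwaway directory with two tiny stand-in fasta files.
demo="${TMPDIR:-/tmp}/cat_demo"
rm -rf "$demo" && mkdir -p "$demo"
printf '>chr1\nACGT\n' > "$demo/chr1.fa"
printf '>chr2\nTTGA\n' > "$demo/chr2.fa"

# Concatenate the per-chromosome files into one fasta file.
cat "$demo"/chr*.fa > "$demo/allchrom.fa"

# Count header lines: one per sequence in the merged file.
grep -c '^>' "$demo/allchrom.fa"
```

On the real data, `grep '^>' allchrom.fa` lists the headers themselves, so you can see exactly which sequences ended up in the file.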
    
*Q: Why did we have to clean up the folder before running the cat command?*
    
Download the gtf annotation from gencode that can be found [here](http://www.gencodegenes.org/releases/19.html)

In addition to the chromosomal sequence information that we got from our fasta files, we will also need gene annotations to make our index. We will use the most current gencode release (19) for genome build GRCh37 (hg19). We want the gtf file of the comprehensive gencode annotation for chromosomes. Right click on the link to get the link address and download to your annotations folder with wget:

    cd ~/scratch/annotations/hg19/
    wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
    
Unzip the file with gunzip (REMEMBER TO USE TAB-COMPLETION TO AVOID TYPOS!):

    gunzip gencode.v19.annotation.gtf.gz
    
AKA

    gunzip genc<tab>
    
*Note - this unzip command is different from the one above because that file was a .tar.gz archive, which required tar -xvf to extract. This one is only .gz, so it can be unzipped with gunzip.*
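
The difference between the two formats can be seen on small stand-in files. This is a sketch in a throwaway directory; the `demo` path and file names are assumptions for illustration.

```shell
# Throwaway directory with two small text files.
demo="${TMPDIR:-/tmp}/gz_demo"
rm -rf "$demo" && mkdir -p "$demo"
printf 'line one\n' > "$demo/single.txt"
printf 'line two\n' > "$demo/member.txt"

# .gz: a single compressed file. gzip replaces single.txt with single.txt.gz.
gzip "$demo/single.txt"

# .tar.gz: an archive (it can hold many files) that is then compressed.
tar -czf "$demo/bundle.tar.gz" -C "$demo" member.txt

# List what is inside the archive without extracting it.
tar -tzf "$demo/bundle.tar.gz"

# Print the contents of a .gz file without unpacking it on disk.
gunzip -c "$demo/single.txt.gz"
```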

**Generate the STAR index**

Refer to the STAR manual for a description of this step. What flags do you need to include? Work with your partner to decide what will be important given the information in the manual. 

Since STAR requires a lot of processing power, we are going to submit this command as a job to the cluster. Remember that handy fake submission script we made? Let's use it here by copying it and updating the necessary parameters:

Let's make a directory in our fto_shrna folder to keep all the scripts that we will run for this project.

    mkdir -p ~/projects/fto_shrna/scripts
    
Now copy our fake script into that directory with a new, meaningful name such as star_genome_generate.sh

    cp ~/fake_script.sh ~/projects/fto_shrna/scripts/star_genome_generate.sh
    
*Q: What do you need to change in the PBS flags for this script?*

*Q: I want to receive an email if this script aborts for any reason. How do I get it to do this?*

*HINT - Remember you can read about submission parameters [here](http://www.sdsc.edu/support/user_guides/tscc-quick-start.html)*

For this script, we will use a walltime of 3 hours, 1 node, and 16 processors.
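
You can make these changes by hand in vi. As an alternative, here is a sketch that rewrites the PBS lines with sed. It works on a stand-in copy of the template in a throwaway directory (the `demo` path is an assumption for illustration), so it does not touch your real files.

```shell
# Stand-in for ~/fake_script.sh so the sketch runs anywhere.
demo="${TMPDIR:-/tmp}/pbs_demo"
rm -rf "$demo" && mkdir -p "$demo"
cat > "$demo/fake_script.sh" <<'EOF'
#!/bin/bash
#PBS -q hotel
#PBS -N jobname
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
#PBS -o outputfile
#PBS -e errorfile
EOF

# Rewrite each PBS line that differs for the genome generate job.
sed -e 's/^#PBS -N .*/#PBS -N star_genome_generate/' \
    -e 's/^#PBS -l nodes=.*/#PBS -l nodes=1:ppn=16/' \
    -e 's/^#PBS -l walltime=.*/#PBS -l walltime=3:00:00/' \
    -e 's/^#PBS -o .*/#PBS -o star_genome_generate.out/' \
    -e 's/^#PBS -e .*/#PBS -e star_genome_generate.err/' \
    "$demo/fake_script.sh" > "$demo/star_genome_generate.sh"

# Show the result.
cat "$demo/star_genome_generate.sh"
```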

Once you have decided on what your STAR command should look like, add it to your .sh file below all the PBS flags (that you have already modified above to make unique for this script). PAY CLOSE ATTENTION TO FULL PATHS OF FILES. You have downloaded the necessary annotations already, make sure the paths to those files are correct in your command. I recommend using tab-complete to make the full path and then copying and pasting them directly into your script. Remember you can display the path with pwd. 
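
A quick way to catch a wrong path before the job ever reaches the queue is to test each one from the login node. The helper below is a hypothetical sketch (`check_paths` is a name made up for this example, not a cluster or STAR command); it reports which of the paths you pass to it exist.

```shell
# Report whether each path given as an argument exists.
# Returns non-zero if at least one path is missing.
check_paths() {
    rc=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            echo "ok       $f"
        else
            echo "MISSING  $f"
            rc=1
        fi
    done
    return "$rc"
}

# Check the two input files used by the genome generate command.
check_paths \
    "$HOME/scratch/annotations/hg19/allchrom.fa" \
    "$HOME/scratch/annotations/hg19/gencode.v19.annotation.gtf" \
    || echo "Fix the paths above before running qsub."
```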

What did you learn about the --genomeDir flag from the documentation? It looks like you need to make a folder where the output can go before we run the script. Let's make that now and make sure we have the path correct in our script before running.

    mkdir ~/scratch/annotations/hg19/star

    STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ~/scratch/annotations/hg19/star --genomeFastaFiles ~/scratch/annotations/hg19/allchrom.fa --sjdbGTFfile ~/scratch/annotations/hg19/gencode.v19.annotation.gtf
    
As an example, my complete script looks like this:

    #!/bin/bash
    #PBS -q hotel
    #PBS -N star_genome_generate
    #PBS -l nodes=1:ppn=16
    #PBS -l walltime=3:00:00
    #PBS -o star_genome_generate.out
    #PBS -e star_genome_generate.err


    STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ~/scratch/annotations/hg19/star --genomeFastaFiles ~/scratch/annotations/hg19/allchrom.fa --sjdbGTFfile ~/scratch/annotations/hg19/gencode.v19.annotation.gtf
    
Submit your script to the cluster from your scripts directory with:

    qsub star_genome_generate.sh

**How to check the status of your job**

    qstat -u username
    
AKA:

    qstat -u ucsd-train01
    
Take a look at the status (the column labeled S). Q means your job is in the queue and has not started yet. R means your job is running (you will see the time update according to how long it has been running). C means your job is complete.

Once your job has been running for ~5-10 minutes without aborting, you likely are okay and it will run to completion. But this takes some time. So in the meantime, read up on the STAR mapping steps described below and write your script for mapping. However, you will have to wait until the generate genome step is complete before you submit your mapping job.

**How to delete your job**

If you realize after you submitted your script that you made a mistake and would like to delete your job, you can do that with:

    qdel jobid##
    
You can get the jobID# from the output of:

    qstat -u username

**Making Aliases**

If there are particular commands that you use a lot but are lengthy to type each time, you can make a handy shortcut for yourself by defining an alias in your ~/.bashrc file. For example, 

    qstat -u ucsd-train01
    
is really annoying to type and I use it a lot. I want to make a shortcut so I can just use:

    qme
    
To do this, open your bashrc and add the following line to the bottom of your file, BELOW the line that says #user specific aliases and functions:

    vi ~/.bashrc
    i
    alias qme="qstat -u ucsd-train##"
    esc
    :wq
    
*NOTE - Don't forget to substitute your specific number for the ##*

Now try your new alias!

    qme
    
What happened? Why did you get this error?

    -bash: qme: command not found


Your ~/.bashrc is only read when a new shell starts, for example when you log in. Here, we changed our ~/.bashrc, but the shell we are typing in started before the change. To activate the changes without logging out and back in, you will need to source your ~/.bashrc:

    source ~/.bashrc
    
Now try it again:

    qme
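
The same behavior can be reproduced without touching your real ~/.bashrc. This sketch uses a stand-in file in a temporary location (the `rc` path is an assumption for illustration) and only inspects the alias definition rather than running qstat.

```shell
# Stand-in for ~/.bashrc so the sketch leaves your real file alone.
rc="${TMPDIR:-/tmp}/demo_bashrc"
echo 'alias qme="qstat -u ucsd-train01"' > "$rc"

# Writing the file alone changes nothing in the current shell...
alias qme 2>/dev/null || echo "qme is not defined yet"

# ...until the shell reads it. `.` is the portable spelling of `source`.
. "$rc"

# Now the shell knows the alias and prints its definition.
alias qme
```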

**Dealing with Errors**

Your job is done now. Let's first take a look at the .err and .out files. For instance, let's do:

    less star_genome_generate.err


Example error that was reported in the .err file:

    EXITING: FATAL INPUT ERROR: unrecoginzed parameter name "sjdbGTFFile" in input "Command-Line-Initial"
    SOLUTION: use correct parameter name (check the manual)

    Jul 21 14:19:02 ...... FATAL ERROR, exiting
    
Solution: go back and check the argument that takes the GTF filename. It looks like there was a typo: the second F should not be capitalized. The corrected command is:


    STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ~/scratch/annotations/hg19/star --genomeFastaFiles ~/scratch/annotations/hg19/allchrom.fa --sjdbGTFfile ~/scratch/annotations/hg19/gencode.v19.annotation.gtf
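
When an .err file is long, grep can pull out just the lines that matter instead of paging through it with less. The sketch below writes the error message quoted above into a stand-in file (the `err` path is an assumption for illustration) and searches it.

```shell
# Stand-in .err file holding the message quoted above, so the sketch runs anywhere.
err="${TMPDIR:-/tmp}/star_genome_generate.err"
cat > "$err" <<'EOF'
EXITING: FATAL INPUT ERROR: unrecoginzed parameter name "sjdbGTFFile" in input "Command-Line-Initial"
SOLUTION: use correct parameter name (check the manual)

Jul 21 14:19:02 ...... FATAL ERROR, exiting
EOF

# -n prints line numbers; -E lets the pattern match either word.
grep -n -E 'FATAL|SOLUTION' "$err"
```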

**Temporary STAR genome folder**

If your job is still sitting in the queue, or is running but has not errored out, you can move on using a prebuilt copy of the genome generate output, which is located here:

    /oasis/tscc/scratch/biom200/star/
    
Don't forget to go back and run the mapping later using the output from your genome generate step to make sure that it ran properly.

**Map reads to the genome**

Once your job is complete you can move onto the next step of mapping your reads to the genome. Once again, copy your fake .sh script and make the necessary changes for this particular job submission.

    cp ~/fake_script.sh ~/projects/fto_shrna/scripts/star_mapping.sh

Information on this step can be found in the STAR manual under "Running mapping jobs", in the basic options.

Again, we need to make a directory for the output, so we will do that first:

    mkdir ~/projects/fto_shrna/star_alignment

Edit the script that you have just copied into your scripts directory to have the mapping command and update your PBS flags.


    #!/bin/bash
    #PBS -q hotel
    #PBS -N star_mapping
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=1:00:00
    #PBS -o star_mapping.out
    #PBS -e star_mapping.err

    STAR --runThreadN 8 --genomeDir ~/scratch/annotations/hg19/star --readFilesIn ~/biom200/fastqs/k562_FTO_shRNA_rep1_R1.fastq.gz ~/biom200/fastqs/k562_FTO_shRNA_rep1_R2.fastq.gz --readFilesCommand zcat --genomeLoad LoadAndRemove --outFilterType BySJout --outFilterMultimapNmax 10 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 4 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 100000 --outFileNamePrefix ~/projects/fto_shrna/star_alignment/k562_FTO_shRNA_rep1_

**Take a look at the Log file to determine mapping quality**

In section 4 Output Files of the STAR manual, take a look at the different output files to expect and view each one with less to see how your run went. 

Remember you specified the path of where these files would end up with your STAR submission script above.
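
Before opening the files one by one, you can check which of them were actually written. The helper below is a hypothetical sketch (`list_star_outputs` is a name made up for this example); the suffixes are the files STAR writes by default for a mapping run, and the prefix is the one passed to --outFileNamePrefix above.

```shell
# For a given output prefix, report which default STAR output files exist and are non-empty.
list_star_outputs() {
    for suffix in Log.out Log.progress.out Log.final.out Aligned.out.sam SJ.out.tab; do
        if [ -s "$1$suffix" ]; then
            echo "found    $1$suffix"
        else
            echo "missing  $1$suffix"
        fi
    done
}

# Prefix taken from the mapping command in this tutorial.
list_star_outputs "$HOME/projects/fto_shrna/star_alignment/k562_FTO_shRNA_rep1_"
```

The mapping summary is in the file ending in Log.final.out, so that is a good one to open first with less.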

*Q: What is the difference between a sam and a bam file?*

**On your own outside of class**

When you are happy with the output from mapping one sample, repeat the procedure to map the other 3 samples to the same STAR index.
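
One possible way to avoid editing the script three times by hand is to generate a copy per sample with sed. This is only a sketch: it runs on a shortened stand-in for star_mapping.sh in a throwaway directory, and `sampleA`, `sampleB` and `sampleC` are hypothetical placeholders that you would replace with the real fastq prefixes.

```shell
# Shortened stand-in for star_mapping.sh; the real script carries the full STAR command.
demo="${TMPDIR:-/tmp}/mapping_demo"
rm -rf "$demo" && mkdir -p "$demo"
cat > "$demo/star_mapping.sh" <<'EOF'
#!/bin/bash
#PBS -N star_mapping
STAR --readFilesIn k562_FTO_shRNA_rep1_R1.fastq.gz k562_FTO_shRNA_rep1_R2.fastq.gz --outFileNamePrefix k562_FTO_shRNA_rep1_
EOF

# Placeholder sample names (hypothetical); substitute the real fastq prefixes.
for sample in sampleA sampleB sampleC; do
    # Replace every occurrence of the first sample's name with the new one.
    sed "s/k562_FTO_shRNA_rep1/$sample/g" "$demo/star_mapping.sh" \
        > "$demo/star_mapping_$sample.sh"
done

# One script per sample, plus the original.
ls "$demo"
```

Each generated script still shares the same job name and .out/.err file names, so you may want to make those unique per sample before submitting.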