# Module 1 : Unix Basics, Data Access

### Class Resources
* [Course Website](http://en.wikipedia.org "Course Website")  
* [Syllabus](http://en.wikipedia.org "Syllabus")  
* [Contact the TAs (Olga and Jamison)](mailto:jmccorri@eng.ucsd.edu,obotvinn@ucsd.edu "Contact the TAs")  

### Table of Contents
* [**Day 1: January 5, 2016**](#Day-1-:-January-5,-2016)
    * [Unix Basics](#Unix-Basics)
    * [TSCC User Guide (UCSD Server Access)](#TSCC-User-Guide-(UCSD-Server-Access))
    * [File Download and Processing Example](#File-Download-and-Processing-Example)
    * [Homework 1](#Homework-1)
* [**Day 2: January 7, 2016**](#Day-2-:-January-7,-2016)
    * [Git and Github](#Git-and-Github)
    * [Downloading Data](#Downloading-Data)
    * [Jupyter Notebook](#Jupyter-Notebook)
    * [Python](#Python)
    * [Additional Resources](#Additional-Resources)


* * *

# Day 1 : January 5, 2016

* * *

# Unix Basics

The command line lets you type commands that create folders, delete and copy files, and extract information from files.

## Opening the command line

* **From Mac OS**
    * Applications folder, open Utilities and launch Terminal  
* **From Linux machine**
    * Applications, Accessories and launch Terminal  
* **From Windows**
    * Determine whether you have a 32-bit or 64-bit version of Windows:
        * https://support.microsoft.com/en-us/kb/827218
    * Download the matching Cygwin installer:
        * http://cygwin.com/install.html
    * Run setup-x86.exe if you have 32-bit Windows; otherwise run setup-x86_64.exe.
    * Install Cygwin, then double-click on Cygwin Terminal.


## Getting Started: Navigating your folders and files
You start any terminal session in your "home area".  View your "present working directory"
* `pwd`

Your default home area is represented by the character alias `~` (tilde)
* `echo ~`

Change directory  
* `cd ~/Desktop`  

List all the files in the present working directory using  
* `ls`  
* `ls .`
    
Look up the arguments (options) of a Unix command in its manual page  
* `man ls`  
    
Creating a folder  
* `mkdir data`  
* `mkdir software`  
    
Change directory into data or software (press Tab to auto-complete a name; use the Up and Down arrows to recall earlier commands)
* `cd da[TAB]`  

Move up one level, to the parent directory, from any subdirectory:
* `cd ..`  

Create an empty file
* `touch emptyfile.txt`  

Copy a file
* `cp emptyfile.txt anotheremptyfile.txt`  

Move a file
* `mv anotheremptyfile.txt deleteme.txt`  

Delete a file
* `rm deleteme.txt`  

Create a pointer (symlink) to a file: the existing file comes first, the name of the new link second
* `ln -s emptyfile.txt pointer`  
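
The commands in this section can be tried together in a throwaway folder. The following is only a sketch: the temporary directory created by `mktemp -d` is an assumption added here so that nothing in your real home area is touched, and the file names are the ones used above.

```shell
# Work inside a throwaway directory (an assumption of this sketch).
workdir=$(mktemp -d)
cd "$workdir"

mkdir data software                     # create two folders
cd data                                 # move into one of them
touch emptyfile.txt                     # create an empty file
cp emptyfile.txt anotheremptyfile.txt   # copy it
mv anotheremptyfile.txt deleteme.txt    # rename (move) the copy
rm deleteme.txt                         # delete the renamed copy
ln -s emptyfile.txt pointer             # symlink: existing file first, link name second

ls -l                                   # the listing shows pointer -> emptyfile.txt
cd ..                                   # go up one level, back into $workdir
pwd
```

Running `ls -l` before and after each step is a quick way to see what every command changed.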

## File Manipulation: Getting some data from UCSC's Table Browser
Go to the UCSC genome browser and download the knownGene table (save it as knownGene.txt)  
* http://hgdownload.cse.ucsc.edu/downloads.html

Move knownGene.txt to Desktop
* `mv Downloads/knownGene.txt Desktop`  
    
(*optional*) Secure copy knownGene.txt to TSCC.
* `scp Desktop/knownGene.txt  ewyeo@tscc-login2.sdsc.edu:.`  
    
What's in the file?
* `less knownGene.txt`  
* `cat knownGene.txt`  

How many lines are in the file?
* `wc -l knownGene.txt`  
    
See what's in the first n lines
* `head -n 10 knownGene.txt`  
    
Check that it is indeed n lines by piping (`|`) the output of `head` into `wc`
* `head -n 10 knownGene.txt | wc -l`  
    
What's in the last n lines?
* `tail -n 10 knownGene.txt`  
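
To see how `wc`, `head`, `tail` and the pipe fit together without downloading anything, here is a small sketch. The 25-line file it builds with `seq` is a hypothetical stand-in for knownGene.txt, which is far larger.

```shell
# Build a 25-line stand-in file (hypothetical; not the real knownGene.txt).
demo=$(mktemp)
seq 1 25 > "$demo"

total=$(wc -l < "$demo")               # count every line in the file
first=$(head -n 10 "$demo" | wc -l)    # pipe the output of head into wc
last_line=$(tail -n 1 "$demo")         # the final line of the file

echo "total=$total first=$first last=$last_line"
rm "$demo"
```

The pipe hands the output of the command on its left to the command on its right, so `wc -l` counts only the lines that `head` let through.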
    
Extract specific columns
* `cut -f 2,8 knownGene.txt | head` (prints columns 2 and 8; `cut` splits on tabs by default)  
* `paste` **TBA - UPDATE REQ'D**

How many genes have 3 exons?
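* `cut -f 8 knownGene.txt | grep -c '^3$'` (general form: `grep -c 'REGEXSEARCHTERM' target.txt`; column 8 of knownGene.txt is the exon count)  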
    
How many genes have 1...max # exons?
* `cut -f 8 knownGene.txt | sort -n | uniq -c` (column 8 is the exon count)  
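
The two exon questions above can be sketched on a tiny made-up table. The four gene rows below are invented for illustration; only the column layout (exon count in column 8) is meant to mirror knownGene.txt.

```shell
genes=$(mktemp)
# Invented rows: name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount
printf 'geneA\tchr1\t+\t100\t500\t120\t480\t3\n'   >  "$genes"
printf 'geneB\tchr1\t-\t900\t1500\t950\t1400\t1\n' >> "$genes"
printf 'geneC\tchr2\t+\t200\t800\t250\t700\t3\n'   >> "$genes"
printf 'geneD\tchr2\t+\t50\t90\t60\t80\t13\n'      >> "$genes"

# Count rows whose exon count is exactly 3; the ^ and $ anchors stop 13 from matching.
three_exons=$(cut -f 8 "$genes" | grep -c '^3$')

# Tabulate how many genes have each exon count (uniq needs sorted input).
histogram=$(cut -f 8 "$genes" | sort -n | uniq -c)

echo "genes with 3 exons: $three_exons"
echo "$histogram"
rm "$genes"
```

Each line of the histogram is a count followed by the exon number it belongs to.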
    
Output to a file and open in excel (make bar chart)
* **TBA - UPDATE REQ'D**

## Deleting files and file permissions
* **TBA - UPDATES REQ'D BELOW**
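* `whoami`: print your username
* `groups`: list the groups you belong to
* `ls -lrt`: long listing (permissions, owner, size, date), oldest files first
* `chmod -R 777 <dir>`: recursively give everyone read, write and execute access to `<dir>`
* `chmod +x <file>`: make `<file>` executable
* `chmod -R o-rwx ~/`: remove all access to your home directory for users outside your group
* Scratch maintenance occurs every 90 days. To refresh the timestamps of everything in a scratch directory you still need:
    * `cd important_scratch_dir`
    * `find . -exec touch {} +`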
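
A quick way to get a feel for `chmod` is to change the permissions of a scratch file and compare the `ls -l` permission string before and after. This is a sketch only: the temporary file is an assumption, and the exact group bits after `chmod +x` depend on your `umask`.

```shell
script=$(mktemp)                      # scratch file standing in for your own script
printf '#!/bin/bash\necho "it runs"\n' > "$script"

chmod u=rw,go= "$script"              # owner may read/write; group and others get nothing
before=$(ls -l "$script" | cut -c 1-10)

chmod +x "$script"                    # add execute permission
chmod o-rwx "$script"                 # strip every permission from "other" users
after=$(ls -l "$script" | cut -c 1-10)

echo "before: $before"
echo "after:  $after"
rm "$script"
```

In the ten-character string, characters 2-4 describe the owner, 5-7 the group and 8-10 everyone else.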

## Introduction to `awk`

**TBA - UPDATES REQ'D BELOW : BRING IN INTERACTIVE RUN OF GREP VS. TEXT IN WINDOW**

Another way to extract all lines
* `awk -F "\t" '{print;}' knownGene.txt`  
 
What if we only wanted one column
* `awk -F "\t" '{print $8;}' knownGene.txt  | head`  

What if we wanted the length of genes?
* `awk -F "\t" '{ len = $5-$4;} {print len;}' knownGene.txt | head`  

Length of all genes summed?
* `awk -F "\t" '{ len = $5-$4;} {tot = tot + len;} END {print tot;}' knownGene.txt`  

Don't process the header line (introduction to conditionals)
* `awk -F "\t" '{ if (FNR == 1) { next }; tot = tot + $5 - $4 } END { print tot }' knownGene.txt`
 
What if you only want the total length of genes in chromosome 1?
* `awk -F "\t" '{ if (FNR == 1) { next }; chr = $2; if (chr == "chr1") { tot = tot + $5 - $4 } } END { print tot }' knownGene.txt`
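
The chromosome 1 total can be checked by hand on a tiny table. The sketch below uses three invented genes and a header line; it also swaps the explicit `if` statements for awk's pattern-action form (`condition { action }`), which is an alternative style rather than the one used above.

```shell
genes=$(mktemp)
# Header line plus three invented records: name, chrom, strand, txStart, txEnd
printf '#name\tchrom\tstrand\ttxStart\ttxEnd\n' >  "$genes"
printf 'geneA\tchr1\t+\t100\t500\n'             >> "$genes"
printf 'geneB\tchr2\t-\t900\t1500\n'            >> "$genes"
printf 'geneC\tchr1\t+\t200\t800\n'             >> "$genes"

# Skip the header (FNR == 1), then add up txEnd - txStart for chr1 only.
chr1_total=$(awk -F "\t" '
    FNR == 1      { next }
    $2 == "chr1"  { tot += $5 - $4 }
    END           { print tot }
' "$genes")

echo "chr1 total length: $chr1_total"
rm "$genes"
```

Here geneA contributes 400 and geneC contributes 600, so the script should report 1000 for chr1.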


* * *

# TSCC User Guide (UCSD Server Access)

## What is TSCC?

TSCC (the Triton Shared Computing Cluster) houses a 640-core supercomputer as part of a resource-sharing system that gives researchers extra computing power for calculations and experiments when they need it.  
  
* TSCC user guide: http://rci.ucsd.edu/computing/index.html
* The main contact for questions about TSCC is the TSCC users mailing list. The main contact for problems with TSCC is [Jim Hayes](mailto:jhayes@sdsc.edu). 
    * TSCC users: tscc-l@mailman.ucsd.edu
    * Jim Hayes: jhayes@sdsc.edu

## My First Supercomputer Login Session

Your first login session will include some of the following commands, which will familiarize you with the cluster, teach you how to do some useful tasks on the queue, and help you set up a common directory structure shared by everyone in the lab.


### 1. Log in to TSCC

In your terminal, type the following (you'll need to replace username with your actual username)
* `ssh username@tscc.sdsc.edu`

### 2. Organize your home directory
Create the base storage location for your code development (or just use your home area):
* `mkdir code`
* `mkdir notebooks`
* `mkdir data`
* `ln -s /oasis/tscc/scratch/$USER $HOME/scratch`

### 3. Environment Variables and your Bash Profile
Unix commands are interpreted by a shell called "BASH".  

Set a BASH environment variable
* `export STR="hello world"`  

Access a variable
* `echo $STR`

The most important environment variable is `$PATH`.  Folders in this path are automatically searched when looking for executable tools via auto-complete or `which`
* `echo $PATH`
* `which programname.sh`
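
As a rough illustration of how `$PATH` is searched, the sketch below puts a small script in a new folder and adds that folder to the front of `$PATH`. The folder and the `mytool.sh` script are hypothetical names made up for this example; `command -v` is the shell's built-in equivalent of `which`.

```shell
export STR="hello world"        # set and export a variable
echo "$STR"                     # read it back

# Create a (hypothetical) personal tools folder holding one executable script.
tools=$(mktemp -d)
printf '#!/bin/bash\necho "from mytool"\n' > "$tools/mytool.sh"
chmod +x "$tools/mytool.sh"

export PATH="$tools:$PATH"      # folders listed first are searched first

found=$(command -v mytool.sh)   # reports where the shell found the tool
echo "$found"
```

Putting the `export PATH=...` line in `~/.bashrc` would make the change apply to every new session.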

Customize your BASH profile by editing your `~/.bashrc` file.  The file is read each time you log in to TSCC; to apply your edits to the current session, run:
* `source ~/.bashrc`

( *optional* ) Additional details on BASH profile customization
* https://wiki.archlinux.org/index.php/Bash

### 4. Shell Scripting

If you have a bunch of commands you want to run at once, you can submit them all to the queue with a single script. 

In the next example, `commands.sh` is a file that lists the commands you want to run, one command per line.
* `java -Xms512m -Xmx512m -jar /home/yeo-lab/software/gatk/dist/Queue.jar \`  
`-S ~/gscripts/qscripts/do_stuff.scala --input commands.sh -run -qsub \`  
`-jobQueue <queue> -jobLimit <n> --ncores <n> --jobname <name> -startFromScratch`

This runs a Scala job that submits sub-jobs to the PBS queue under the name you supply in place of the `<name>` placeholder.
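
For a sense of what `commands.sh` might look like, here is a sketch with three made-up commands. The output folder and the `echo` lines are assumptions chosen so that the file can be tested on any machine by running it with `bash`, without involving the queue.

```shell
outdir=$(mktemp -d)

# A hypothetical commands.sh: each line is one complete, independent command.
cat > "$outdir/commands.sh" <<EOF
echo "sample A" > "$outdir/a.txt"
echo "sample B" > "$outdir/b.txt"
echo "sample C" > "$outdir/c.txt"
EOF

n_commands=$(wc -l < "$outdir/commands.sh")   # one job per line
bash "$outdir/commands.sh"                    # run the lines one after another
n_outputs=$(ls "$outdir"/*.txt | wc -l)       # count the files they produced

echo "$n_commands commands produced $n_outputs files"
```

Because every line stands alone, the lines can be run in any order or in parallel, which is what makes the file suitable for splitting into sub-jobs.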

### 5. Executing Tasks on a TSCC Server

When you log in to TSCC, you are connected to a "login node".  When executing a task, you should always use an "execution node".  
* More details in the TSCC user guide: http://rci.ucsd.edu/computing/index.html

To submit a script that you wrote, in this case called myscript.sh, to TSCC, do:
* `qsub -q home-yeo -l nodes=1:ppn=2 -l walltime=0:30:00 myscript.sh`
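
A job script is an ordinary shell script. As a sketch, the hypothetical `myscript.sh` below carries its resource requests in `#PBS` lines, which `qsub` reads as default options and `bash` treats as comments; the job name and the `echo` line are made up for illustration. Running it with `bash` first is a cheap check before submitting.

```shell
jobdir=$(mktemp -d)

# A hypothetical myscript.sh; the #PBS lines mirror the qsub options shown above.
cat > "$jobdir/myscript.sh" <<'EOF'
#!/bin/bash
#PBS -N hello_tscc
#PBS -l nodes=1:ppn=2
#PBS -l walltime=0:30:00
echo "running on $(uname -n)"
EOF

# Dry run on the current machine; on TSCC the file would be handed to qsub instead.
output=$(bash "$jobdir/myscript.sh")
echo "$output"
```

Options given on the `qsub` command line take precedence over the `#PBS` lines in the script.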

To submit interactive jobs, do:
* `qsub -I -q home-yeo -l nodes=1:ppn=2 -l walltime=0:30:00`

To submit to the home-scrm queue, add -W group_list=scrm-group to your qsub command:
* `qsub -I -l walltime=0:30:00 -q home-scrm -W group_list=scrm-group`

Check the status of your jobs:
* `qstat`

To check the status of your array jobs, specify ``-t`` to see the status of the individual array pieces. 
* `qstat -t`

Kill a job
* `qdel 2006527`

Kill an array job
* `qdel 2006527[]`

Kill all your jobs
* `qdel $(qselect -u $USER)`

### 6. Which queue do I submit to? (check status of queues)

**Check the status of the queue** (so you know which queues to NOT submit to!)
* `qstat -q`

Example output is:  

    server: tscc-mgr.local

    Queue            Memory CPU Time Walltime Node  Run Que Lm  State
    ---------------- ------ -------- -------- ----  --- --- --  -----
    home-dkeres        --      --       --      --    2   0 --   E R
    home-komunjer      --      --       --      --    0   0 --   E R
    home-ong           --      --       --      --    2   0 --   E R
    home-tg            --      --       --      --    0   0 --   E R
    home-yeo           --      --       --      --    3   1 --   E R
    home-visres        --      --       --      --    0   0 --   E R
    home-mccammon      --      --       --      --   15  29 --   E R
    home-scrm          --      --       --      --    1   0 --   E R
    hotel              --      --    168:00:0   --  232  26 --   E R
    home-k4zhang       --      --       --      --    0   0 --   E R
    home-kkey          --      --       --      --    0   0 --   E R
    home-kyang         --      --       --      --    2   1 --   E R
    home-jsebat        --      --       --      --    1   0 --   E R
    pdafm              --      --    72:00:00   --    1   0 --   E R
    condo              --      --    08:00:00   --   18   6 --   E R
    gpu-hotel          --      --    336:00:0   --    0   0 --   E R
    glean              --      --       --      --   24  75 --   E R
    gpu-condo          --      --    08:00:00   --   16  36 --   E R
    home-fpaesani      --      --       --      --    4   2 --   E R
    home-builder       --      --       --      --    0   0 --   E R
    home               --      --       --      --    0   0 --   E R
    home-mgilson       --      --       --      --    0   4 --   E R
    home-eallen        --      --       --      --    0   0 --   E R
                                                   ----- -----
                                                     321   180

So right now is not a good time to submit to the ``hotel`` queue, since it has a bunch of both running and queued jobs!
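
Reading the table by eye works, but the same ranking can be sketched with `awk` and `sort`. The four rows below are copied from the example output above and stand in for live `qstat -q` output; on the real output you would also have to skip the header and total lines, which this sketch does not attempt.

```shell
# A few rows from the example output, standing in for live `qstat -q`.
sample=$(cat <<'EOF'
home-yeo           --      --       --      --    3   1 --   E R
home-scrm          --      --       --      --    1   0 --   E R
hotel              --      --    168:00:0   --  232  26 --   E R
condo              --      --    08:00:00   --   18   6 --   E R
EOF
)

# Field 6 is running jobs and field 7 is queued jobs; list the busiest queue first.
ranked=$(echo "$sample" | awk '{ print $7, $6, $1 }' | sort -k1,1nr -k2,2nr)
busiest=$(echo "$ranked" | head -n 1 | awk '{ print $3 }')

echo "$ranked"
echo "busiest queue: $busiest"
```

Each ranked line reads: queued jobs, running jobs, queue name.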

**List the available Service Units** (1 SU = 1 core-hour) for a quick ego boost. Also note that our supercomputer is separated in two, yeo-group and scrm-group; in the example output below their combined balance is about 5.30 million SU, just enough to secure us the top honors :-)
* `gbalance | sort -nrk 3 | head`

Example output is:

    Id Name                 Amount  Reserved Balance CreditLimit Available
    -- -------------------- ------- -------- ------- ----------- ---------
    19 tideker-group        5211035    27922 5183113           0   5183113
    82 yeo-group            3262925        0 3262925           0   3262925
    81 scrm-group           2039328        0 2039328           0   2039328
    14 mgilson-group         663095   208000  455095           0    455095
    73 nanosprings-ucm       650000        0  650000           0    650000
    17 kkey-group            635056     7104  627952           0    627952
    16 k4zhang-group         534430        0  534430           0    534430

List the available TORQUE queues, for a quick boost in motivation!

* `qstat -q`

Example output is:

    Queue            Memory CPU Time Walltime Node  Run Que Lm  State
    ---------------- ------ -------- -------- ----  --- --- --  -----
    home-tideker       --      --       --       16   1   0 --   E R
    home-visres        --      --       --        1   0   0 --   E R
    hotel              --      --    72:00:00   --   25  18 --   E R
    home-k4zhang       --      --       --        4  21   0 --   E R
    home-kkey          --      --       --        5   0   0 --   E R
    pdafm              --      --    72:00:00   --    0   0 --   E R
    condo              --      --    08:00:00   --    0   0 --   E R
    glean              --      --       --      --    0   0 --   E R
    home-builder       --      --       --        8   0   0 --   E R
    home               --      --       --      --    0   0 --   E R
    home-ewyeo         --      --       --       15   0   0 --   E R
    home-mgilson       --      --       --        8   0   0 --   E R
                                           ----- -----
                                              47    18

### Show available processors

To show available processors
* `showbf`

Show specs of all nodes
* `pbsnodes -a`

# File Download and Processing Example

## Getting some data from Ensembl biomart
* **TBA - UPDATE REQ'D**


# Homework 1

* **TBA - UPDATE REQ'D**


* * *

# Day 2 : January 7, 2016
    
* * *

# Git and Github

* **TBA - UPDATE REQ'D**


# Downloading Data

* **TBA - UPDATE REQ'D**


# Jupyter Notebook

### My First Jupyter Notebook

Check your python version:  
* `python -V`

Install the appropriate version of Anaconda, based on the python version:  
* https://www.continuum.io/downloads
        
Install jupyter:  
* `conda install jupyter`
    
Start jupyter notebook server:
* `jupyter notebook`

Connect to the jupyter notebook server:
* http://localhost:8888/
    
Start a new notebook using the dropdown menu in the top right of the screen:
![New doc image reference](newdoc.png "New doc image reference")

### Cell Types

#### Markdown and Heading:
- Formatted text using markdown language
    
#### Code:
- Code that is sent to the kernel for execution; the output appears below the cell
    
#### Raw NBConvert (Raw):
- No input/output or markdown processing.  Unprocessed text.
    
**To edit the type of any cell**, select it, then use the dropdown menu at the top of the screen.
![New doc image reference](types.png "New doc image reference")

**To insert a new cell**, use the Insert option in the toolbar.
![New doc image reference](insert.png "New doc image reference")

**To edit any cell,** double-click on it.

**To execute the contents of any cell (or visualize markdown language),** hit the "execute" button in the toolbar (play/pause symbol):
![New doc image reference](exe.png "New doc image reference")

### Markdown Language Basics

#### Full details in the jupyter notebook user guide:
    
- http://jupyter.cs.brynmawr.edu/hub/dblank/public/Jupyter%20Notebook%20Users%20Manual.ipynb#4.-Using-Markdown-Cells-for-Writing

#### Quick Guide to Markdown Syntax

##### Headers:
- Prepend text with "#" or "##" depending on size of desired header text (up to header size 6 = "######").  
- The two headers directly above this list are examples of level 4 and level 5 headers.
    
##### Formatting:
- Markdown does not treat a single carriage return as a line break.
    - Insert your own break by ending the line with two spaces and then typing Return.
- *Italics* = 1 "*" or "_"
- **Bold** = 2 "**" or "__"
- `monospace` text (for code) is wrapped in '\`' (backtick) characters
- Use "\" or a preceding tab to escape characters that markdown would otherwise treat as syntax.
    - The '\`' monospace character also works.
    
    \\- This is not a list  
    \\-- When I do this

##### Quotes
- Quotes are prefixed with ">", repeated to match the quote depth.
- \>example
- \>\>subexample
    
> example
>> subexample
    
##### Lists
- Lists use a "-", "*", or "+".  Use tabs to modify list depth.
- Outside of a list, a leading tab turns the line into raw (preformatted) text, so use lists when you need indentation!  

##### Links
    Normal link example:
    [Class Website](http://en.wikipedia.org "Class Website")
 
[Class Website](http://en.wikipedia.org "Class Website")

##### Tables
    Minimal example:
    SampleID|GeneID|ExpressionValue
    -|-|-
    A|ACTB|40
    B|ACTB|9500
    C|ACTB|0
    
SampleID|GeneID|ExpressionValue
-|-|-
A|ACTB|40
B|ACTB|9500
C|ACTB|0

##### LaTeX
- Use "\$" signs to indicate LaTeX formulas:
- \${a \choose a_1,a_2}\$

${a \choose a_1,a_2}$

##### Images
- `![Example image](http://icons.iconarchive.com/icons/icons-land/medical/256/Body-DNA-icon.png "Example image")`

![Example image](http://icons.iconarchive.com/icons/icons-land/medical/256/Body-DNA-icon.png "Example image")

### Intro to Dynamic Code Execution

Jupyter Notebook can execute code in over 50 programming languages within the browser.  See the full list of supported tools here:
https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
- Python is the default language
- Use "%%" (magic commands) to delineate alternative languages (see below).
- Output comes directly from the kernel, and each cell is launched individually (i.e. asynchronous execution).

##### The obligatory (python) example:
- Click on the cell below.
- Hit the "execute" button (play/pause symbol) or CTRL+Enter to launch the code
![New doc image reference](exe.png "New doc image reference")

In [2]:

    a = "Hello World"
    print(a)

Output:

    Hello World


##### Now let's try again using bash:

In [1]:

    %%bash
    echo "Hello World"

Output:

    Hello World


##### Or even perl:

In [3]:

    %%perl
    use strict;
    use warnings;
    my $a = "Hello World";
    print "$a\n";

Output:

    Hello World


### Advanced Processing Methods

##### Magic commands
* Otherwise known as "meta commands", these allow for code execution independent of the kernel you are using.
* Above, the `%%bash` and `%%perl` magics were used to run those cells with bash and perl instead of Python.

##### Common magic commands
* `?` : help command
    * Example : `? hat` or `?? hat`
* `!` : run as system shell
    * Example: `! pwd` (prints present working directory)
    * Similar to `%%bash`
* `%magic`
    * Lists all available magic commands 
* Example execution of magic command `pastebin` ( [Source](http://ipython.org/ipython-doc/dev/interactive/tutorial.html#magics-explained "Source") )
    * `%pastebin 3 18-20 ~1/1-5`
  
### Advanced Processing Example with R

##### Install module
* TBA
    
##### Generate randomized data set based on normal distribution
* TBA

##### Render plot to Notebook
* TBA


# Python

* **TBA - UPDATE REQ'D**


# Additional Resources

* **TBA - UPDATE REQ'D**
