#  File Download and Processing Example

* * *

## In-Class Project: Downloading exons with splice sites and bedtools

By the end of this notebook you will...

* Know what a BED file is
* Know how to use the UCSC Genome Browser to download BED files
* Have copied data to TSCC
* Have loaded `biotools` on TSCC to get access to the `bedtools` program
* Have used `bedtools` to get sequences

### What is an annotation?
An annotation is a label that attaches a characteristic to a particular base pair or stretch of base pairs in a DNA sequence. The most common type of annotation assigns chromosomal coordinates to a gene ID (and its associated metadata). Genetic sequence is most commonly stored as strings of 'A', 'C', 'G', and 'T' (FASTA format). Annotations are sometimes bundled with the sequence (GenBank files, `.gbk`) or downloaded separately in compact plain-text formats like GTF and BED.

### BED Format: stuff in the genome

BED stands for "Browser Extensible Data" (informative I know .....) and is a standard format in bioinformatics for describing locations of stuff in the genome. The basic format is described [here](https://genome.ucsc.edu/FAQ/FAQformat.html#format1), and here's a minimal example (from [`pybedtools`](https://github.com/daler/pybedtools/blob/master/pybedtools/test/data/a.bed) tests):

```
chr1	1	100	feature1	0	+
chr1	100	200	feature2	0	+
chr1	150	500	feature3	0	-
chr1	900	950	feature4	0	+
```

As you can see, the format is:

1. Chromosome name
2. Start of the feature (0-based, and the start position itself is included)
3. Stop of the feature (not included in the feature). This is for computer science reasons: most programming languages count from zero, so the "0th" item is the first thing and the "100th" item is actually the 101st element. To avoid "off by one" errors (a common problem in bioinformatics), BED uses 0-based, half-open intervals. `feature1` above covers 0-based positions 1 through 99 (the 2nd through 100th bases of the chromosome), so its length is simply stop minus start: 100 - 1 = 99.
4. Name of the feature
5. Score of the feature, which is some integer from 0 to 1000. Some BED files come with a dot/period "`.`" here instead if a score doesn't make sense for them, but some programs will then complain that your `bed` file is improperly formatted, so I end up using `awk` or something to fill that column with 1000 for every row.
6. Strand of the feature. If a strand doesn't make sense (e.g. for DNA methylation) then a dot/period "`.`" is here.
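
To make the stop-minus-start arithmetic concrete, here is a small sketch. It writes the four example records to a file (the name `example.bed` is just a placeholder, and the score of `feature3` is changed to `.` purely for illustration), prints each feature's length, and then shows one possible way to fill a missing score with 1000 using `awk`.

```shell
# Write the example records, tab-separated (example.bed is a placeholder name).
# feature3 gets a "." score here only to demonstrate the fix further down.
printf 'chr1\t1\t100\tfeature1\t0\t+\n'   >  example.bed
printf 'chr1\t100\t200\tfeature2\t0\t+\n' >> example.bed
printf 'chr1\t150\t500\tfeature3\t.\t-\n' >> example.bed
printf 'chr1\t900\t950\tfeature4\t0\t+\n' >> example.bed

# Length of a BED feature is stop - start, because the stop is not included
awk 'BEGIN { FS = OFS = "\t" } { print $4, $3 - $2 }' example.bed > lengths.txt
cat lengths.txt

# Replace a "." score (column 5) with 1000 for programs that insist on a number
awk 'BEGIN { FS = OFS = "\t" } $5 == "." { $5 = 1000 } { print }' example.bed > example_scored.bed
cat example_scored.bed
```

The lengths come out as 99, 100, 350 and 50 for `feature1` through `feature4`.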

### Get BED files from the UCSC Table Browser

You used the Table Browser briefly to get `knownGene.txt` and do some file manipulations. Now we'll use the Table Browser to get a BED file.

1. Go to the [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables)
2. Pick your favorite chromosome in "position" (e.g. "chr22")
3. Instead of "output format: all fields from selected table", do "output format: BED - browser extensible data".
4. Enter "`knownGene_exons.bed`" as the output file name, so the result is saved to a file instead of shown in the browser.
5. Click "get output".
6. On the next page, create one BED record per exon, plus 10 bases at each end. This gets us the 3' and 5' splice sites (the splice sites are described relative to the intron that is cut out)
7. Click "get BED"
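
The "plus 10 bases at each end" option simply widens every exon interval. As a rough sketch of that arithmetic, the `awk` one-liner below pads two made-up exon records (the file names and coordinates are hypothetical, not real knownGene entries). It is only meant to show what happens to the coordinates; the Table Browser does this for you.

```shell
# Two made-up exon records; real ones come from the Table Browser
printf 'chr22\t1000\t1200\texonA\t0\t+\n' >  padding_demo.bed
printf 'chr22\t5\t80\texonB\t0\t-\n'      >> padding_demo.bed

# Subtract 10 from the start and add 10 to the stop; never let the start go below 0.
# (This sketch does not check the stop against the chromosome length.)
awk 'BEGIN { FS = OFS = "\t" } { $2 = ($2 < 10 ? 0 : $2 - 10); $3 = $3 + 10; print }' \
    padding_demo.bed > padded_demo.bed
cat padded_demo.bed
```

Here `exonA` becomes 990 to 1210 and `exonB` becomes 0 to 90, so each record now reaches 10 bases into the flanking sequence on both sides (less where it hits the start of the chromosome).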

### Copy the BED file to TSCC

Copy the BED file from your laptop to the home folder of your TSCC account. Hint: `scp`
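
One possible shape for the copy command, run from a terminal on your laptop (not on TSCC). Both `your_username` and the host name `tscc-login.sdsc.edu` are assumptions for illustration, so substitute whatever you normally use to `ssh` into TSCC. The sketch only prints the command so you can check it first; paste the printed line into your terminal to actually copy the file.

```shell
# Placeholders: replace with your own TSCC username and login host
TSCC_USER="your_username"
TSCC_HOST="tscc-login.sdsc.edu"   # assumed host name; use the one you ssh into

# The source is the file on your laptop; "~/" after the colon is your home folder on TSCC
cmd="scp knownGene_exons.bed ${TSCC_USER}@${TSCC_HOST}:~/"
echo "$cmd"
```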



### Use Bedtools to get the sequences for the exons

Bedtools is a "swiss-army knife of tools for a wide-range of genomics analysis tasks". In particular, `bedtools` ***excels*** at "genome algebra": adding, subtracting, and intersecting sets of genomic regions. Take a look at the diagrams for [`bedtools intersect`](http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html) and think about how it can answer these questions:

* Given locations of genome methylation, which genes do they overlap?
* Given locations of RBP binding, which exons do they overlap?
* Given two ChIP-Seq experiments, which peaks are consistent between them?
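
As a toy illustration of this kind of question, the sketch below asks which of the four example features from the BED section overlap a single made-up "peak". The file names and the peak coordinates are placeholders invented for this example, and the `bedtools` call is skipped if the program is not on your `PATH` yet.

```shell
# Toy inputs: the four example features, plus one made-up peak
printf 'chr1\t1\t100\tfeature1\t0\t+\n'   >  features.bed
printf 'chr1\t100\t200\tfeature2\t0\t+\n' >> features.bed
printf 'chr1\t150\t500\tfeature3\t0\t-\n' >> features.bed
printf 'chr1\t900\t950\tfeature4\t0\t+\n' >> features.bed
printf 'chr1\t120\t160\tpeak1\t0\t+\n'    >  peaks.bed

if command -v bedtools >/dev/null 2>&1; then
    # -u reports each record in -a once if it overlaps anything in -b
    bedtools intersect -a features.bed -b peaks.bed -u > overlapping.bed
    cat overlapping.bed
else
    echo "bedtools is not on your PATH yet" >&2
fi
```

Only `feature2` (100 to 200) and `feature3` (150 to 500) share bases with the peak at 120 to 160, so those are the two records reported. Strand is ignored unless you ask for it.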

#### Load `biotools` on TSCC which includes Bedtools

First, try running `bedtools`. You should see this:

```
$ bedtools
bash: bedtools: command not found
```

This is because your shell doesn't know anything about the command `bedtools`. To load `bedtools` and other bioinformatics tools to your TSCC account, do

```
module load biotools
```

This command has no output. To make sure we have `bedtools` available, use `which` to see the full path to `bedtools`:

```
$ which bedtools
/opt/biotools/bedtools/bin/bedtools
```

(This is a TSCC-specific thing that the nice system administrators have set up for us, not something you can do on every server. Other nice sysadmins on other clusters may do the same, but it is not guaranteed.)

To see all available modules to load, do

```
module avail
```

Example output:

```
[obotvinnik@tscc-login2 ~]$ module avail

-------------------------- /opt/modulefiles/applications/.gnu ---------------------------
atlas/3.10.2(default)     hdf5/1.8.14(default)      scalapack/2.0.2(default)
boost/1.55.0(default)     lapack/3.5.0(default)     slepc/3.5.3(default)
fftw/2.1.5                netcdf/3.6.2              sprng/2.0b(default)
fftw/3.3.4(default)       netcdf/4.3.2(default)     sundials/2.5.0(default)
gsl/1.16(default)         parmetis/4.0.3(default)   superlu/3.3(default)
hdf4/2.10(default)        petsc/3.5.2(default)      trilinos/11.12.1(default)

------------------------------- /opt/modulefiles/mpi/.gnu -------------------------------
mvapich2_ib/2.1rc2(default) openmpi_ib/1.8.4(default)

------------------------- /opt/modulefiles/applications/.intel --------------------------
atlas/3.10.2(default)     lapack/3.5.0(default)     scalapack/2.0.2(default)
boost/1.55.0(default)     mxml/2.9(default)         slepc/3.5.3(default)
fftw/2.1.5                netcdf/3.6.2              sprng/2.0b(default)
fftw/3.3.4(default)       netcdf/4.3.2(default)     sundials/2.5.0(default)
gsl/1.16(default)         papi/5.4.1(default)       superlu/3.3(default)
hdf4/2.10(default)        parmetis/4.0.3(default)   tau/2.23(default)
hdf5/1.8.14(default)      pdt/3.20(default)         trilinos/11.12.1(default)
ipm/2.0.3(default)        petsc/3.5.2(default)

------------------------------ /opt/modulefiles/mpi/.intel ------------------------------
mvapich2_ib/2.1rc2(default) openmpi_ib/1.8.4(default)

---------------------------- /usr/share/Modules/modulefiles -----------------------------
dot              module-info      null             rocks-openmpi_ib
module-git       modules          rocks-openmpi    use.own

----------------------------------- /etc/modulefiles ------------------------------------
openmpi-x86_64

------------------------------ /opt/modulefiles/compilers -------------------------------
cilk/5.4.6(default)           intel/2015.2.164
cmake/3.2.1(default)          mono/3.12.0(default)
gnu/4.9.2(default)            pgi/14.9(default)
guile/2.0.11(default)         python/1(default)
intel/2013_sp1.2.144(default) upc/2.20.0(default)

----------------------------- /opt/modulefiles/applications -----------------------------
abyss/1.5.2(default)        fsa/1.15.9(default)         mpi4py/1.3.1(default)
amber/14(default)           gamess/2014.12(default)     namd/2.10(default)
apbs/1.3(default)           gaussian/09.D.01(default)   namd/2.9
bbcp/14.09.02.00.0(default) globus/5.2.5                nwchem/6.5(default)
bbftp/3.2.1(default)        gmp/6.0.0a(default)         octave/3.8.2(default)
beagle/2.1(default)         gnutools/2.69(default)      polymake/2.13.1(default)
beast/1.8.0                 gromacs/5.0.4(default)      R/3.2.1(default)
beast/1.8.1(default)        idl/8.4(default)            rapidminer/6.1.0(default)
beast2/2.1.3(default)       jags/3.4.0(default)         scipy/2.7(default)
bioroll/6.2(default)        lammps/20141209(default)    siesta/3.2.5(default)
biotools/1(default)         matlab/2013a                stata/13.1(default)
blcr/0.8.5(default)         matlab/2013b                vasp/4.6
cp2k/2.5.1(default)         matlab/2014a                vasp/5.2.12
cpmd/3.17.1(default)        matlab/2014b(default)       vasp/5.2.12.gamma
cuda/6.5.19(default)        mkl/11.1.2.144(default)     vasp/5.3.5(default)
ddt/4.2.2(default)          mpc/1.0.3(default)          vtk/6.1.0(default)
eigen/3.2.3(default)        mpfr/3.1.2(default)         weka/3.7.12(default)
```

Hint: `module load biotools` may be a useful thing to add to your `~/.bashrc` :)
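
A minimal sketch of one way to do that, assuming your login shell is `bash` and reads `~/.bashrc`. It appends the line only if it is not already there, so running it twice does no harm.

```shell
# Append the line only if an identical line is not already in ~/.bashrc
grep -qxF 'module load biotools' ~/.bashrc 2>/dev/null \
    || echo 'module load biotools' >> ~/.bashrc

# Confirm the line is present (prints it with its line number)
grep -n 'module load biotools' ~/.bashrc
```

The change takes effect the next time you log in, or right away if you run `source ~/.bashrc`.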

To see everything that `biotools` provides, look in `/opt/biotools/`:

```
$ ls /opt/biotools/
bamtools   blat       cufflinks  GenomeAnalysisTK  miRDeep2  randfold    spades       trinity
bedtools   bowtie     dendropy   gmap_gsnap        miso      rseqc       squid        velvet
biopython  bowtie2    edena      htseq             picard    samtools    stacks       ViennaRNA
bismark    bwa        fastqc     idba-ud           plink     soapdenovo  tophat
blast      bx-python  fastx      matt              pysam     SOAPsnp     trimmomatic
```

#### Use `bedtools getfasta` to extract sequences

Read the documentation for [`bedtools getfasta`](http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html) and figure out how to request the sequences in `fasta` format for the exons. Something to consider: Does strand specificity matter for exons, if we're interested in splice sites?

You will need a "fasta in" ("`-fi`") file for the genome. Since it's a 3 gigabyte download we've provided one for you: `/home/obotvinnik/biom262/hg19/all.fa`.

Save the output as `knownGene_exons.fasta`.
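
If the flags are hard to picture, here is a toy sketch on a ten-base "genome" rather than the real files. Every file name and coordinate in it is made up for illustration; adapting it to `knownGene_exons.bed` and the provided genome is left to you. It uses `-fi`, `-bed`, `-fo` and `-s` from the `bedtools getfasta` documentation.

```shell
# A ten-base toy genome and two made-up features (all names are placeholders)
printf '>chr1\nACGTACGTAC\n' > toy_genome.fa
printf 'chr1\t2\t6\tplus_feature\t0\t+\n'  >  toy_features.bed
printf 'chr1\t1\t5\tminus_feature\t0\t-\n' >> toy_features.bed

if command -v bedtools >/dev/null 2>&1; then
    # -fi: genome FASTA in, -bed: regions to extract, -fo: FASTA out
    # -s: reverse-complement the features that are on the "-" strand
    bedtools getfasta -s -fi toy_genome.fa -bed toy_features.bed -fo toy_features.fasta
    cat toy_features.fasta
else
    echo "bedtools not found; run 'module load biotools' first" >&2
fi
```

The plus-strand feature covers 0-based positions 2 through 5 and comes out as `GTAC`. The minus-strand feature covers positions 1 through 4 (`CGTA` on the forward strand), and with `-s` it is reported as its reverse complement, `TACG`. Without `-s`, both would be read off the forward strand.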


#### That's it!

Hold on tight, we'll use those sequences later to build motifs of splice sites.