# BIOM262: ChIP-Seq workshop

**Introduction:**

This workshop will walk you through an example of ChIP-seq analysis. We will focus on running tools from the command line and in simple bash scripts; I recommend keeping a cheat-sheet like this [one](http://cheatsheetworld.com/programming/unix-linux-cheat-sheet/) at hand.

We will use some common tools such as: 
* **bowtie2** for alignment (<a href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml" target="_blank">http://bowtie-bio.sourceforge.net/bowtie2/index.shtml</a>),
* **IGV** for visualization (<a href="http://software.broadinstitute.org/software/igv/home" target="_blank">http://software.broadinstitute.org/software/igv/home</a>) 
* **HOMER** for most of the workshop (e.g., QC of the data, peak calling, etc.; <a href="http://homer.ucsd.edu/homer/" target="_blank">http://homer.ucsd.edu/homer/</a>). HOMER was created by Chris Benner at UCSD, and I love its documentation, its tutorials, and the humor threaded through them. To install HOMER, follow <a href="https://github.com/biom262/cmm262-2020/blob/master/Module_5/Notebooks/Install_Homer.ipynb" target="_blank">these instructions</a>.

During the workshop, and in general, it is good practice to type a command without any arguments to see its usage notes and options. For example, typing `bowtie2` yields this output (truncated after a few lines):

```
[ucsd-trainXX@tscc-login1 ~]$  bowtie2
No index, query, or output file specified!  
Bowtie 2 version 2.3.0 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)   
Usage: bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
```
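
Before going further, it can help to confirm which of the workshop tools are reachable from your shell. The following is a minimal sketch: `check_tool` is a hypothetical helper name, and the tool list is simply the programs named in this workshop. If a tool is reported as missing, you may need to call it by its full path, as the notebook cells do.

```bash
# Report whether a program can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Check the programs used throughout this workshop.
for tool in bowtie2 samtools makeTagDirectory makeUCSCfile; do
    check_tool "$tool"
done
```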



In [5]:
/opt/conda/envs/r-bio/bin/bowtie2


## Part 0

### 1. Organize directories

Before we begin, we will create directories to organize our analysis:




In [6]:
# -p creates parent directories as needed and does not fail if they already exist
mkdir -p chipseq_workshop
mkdir -p chipseq_workshop/aligned
mkdir -p chipseq_workshop/tagdirs



### 2. Generate symbolic links
Let's generate a couple of [symbolic links](https://linuxize.com/post/how-to-create-symbolic-links-in-linux-using-the-ln-command/) to make it easier to type file paths. You can think of them as shortcuts.
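
As a small illustration of how a symbolic link behaves, here is a sketch that works entirely in a throwaway temporary directory (the names `real_data` and `shortcut` are made up for this example and are not part of the workshop data):

```bash
# Work in a temporary directory so nothing in the workshop folder is touched.
demo=$(mktemp -d)
mkdir "$demo/real_data"
echo "hello" > "$demo/real_data/file.txt"

# Create the link: ln -s <target> <link name>
ln -s "$demo/real_data" "$demo/shortcut"

ls -l "$demo"                    # the listing shows where the link points
readlink "$demo/shortcut"        # prints the target of the link
cat "$demo/shortcut/file.txt"    # files are reachable through the link
```

Deleting the link afterwards (`rm "$demo/shortcut"`) removes only the shortcut, not the directory it points to.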

In [9]:
ln -s /datasets/cm262-wi21-A00-public/chipseq/fastqs chipseq_workshop/fastqs
ln -s /datasets/cm262-wi21-A00-public/mm9 chipseq_workshop/mm9
ln -s /datasets/cm262-wi21-A00-public/mm9/mm9.fa.fai chipseq_workshop/fai


---

## Part I
We will start with FASTQ files and perform many of the basic analysis tasks that one might normally do when analyzing ChIP-seq data. 


### **1.** Align FASTQ reads using bowtie2.
The FASTQ files are at `/datasets/cm262-wi21-A00-public/chipseq/fastqs`.

Because we made a symbolic link, we can also access them at `chipseq_workshop/fastqs`.

These files come from the following study, which investigated the roles that reprogramming factors play when transforming MEFs (mouse embryonic fibroblasts) into embryonic stem cells.
[Chronis et al. Cooperative Binding of Transcription Factors Orchestrates Reprogramming](https://www.ncbi.nlm.nih.gov/pubmed/28111071)
Sequencing Data: [GSE90893](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90893)

For this tutorial we extracted the ChIP-seq experiments for several transcription factors and histone modifications performed on ESCs (embryonic stem cells). To reduce runtimes, only reads that mapped to chr17 (and chr17_random) are included. The random chromosomes are explained at http://genome.ucsc.edu/FAQ/FAQdownloads#download10.

**Question:** how would you generate such a file with only one chromosome?

To align the reads we will use bowtie2. It is always good practice to read the manual of each tool you use, so you get an idea of its options and documentation.
To get a feel for the command, we will first run it on a single file as follows
(note that a very long command can be split across lines by ending each line with a "\\"):

In [10]:
/opt/conda/envs/r-bio/bin/bowtie2 -p 8 -x chipseq_workshop/mm9/mm9 \
    -U chipseq_workshop/fastqs/oct4-esc.chr17.2m.fastq | \
    /opt/conda/envs/r-bio/bin/samtools view -bS -t chipseq_workshop/mm9/mm9.fa.fai > \
    chipseq_workshop/aligned/oct4-esc.chr17.2m.bam


To do it properly, we will use a for loop to get BAMs from all the fastqs in the directory:
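
The loop relies on `basename` to turn each FASTQ path into an output name. As a small illustration of that step on its own, using one of the workshop file names:

```bash
# A path shaped like the workshop FASTQ files.
f=chipseq_workshop/fastqs/oct4-esc.chr17.2m.fastq

# basename drops the directory and, when a suffix is given, the trailing suffix.
fname=$(basename "$f" .fastq)

echo "$fname"                                # oct4-esc.chr17.2m
echo "chipseq_workshop/aligned/$fname.bam"   # the BAM path the loop writes to
```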

In [12]:
for f in chipseq_workshop/fastqs/*fastq; do
    # Strip the directory and the .fastq extension to name the output BAM
    fname=$(basename "$f" .fastq)
    /opt/conda/envs/r-bio/bin/bowtie2 -p 8 -x chipseq_workshop/mm9/mm9 -U "$f" | \
    /opt/conda/envs/r-bio/bin/samtools view -bS -t chipseq_workshop/fai > \
    "chipseq_workshop/aligned/$fname.bam"
done


This will produce BAM files for the 6 datasets. HOMER analyzes SAM files; if it receives BAM files it converts them to SAM, so samtools has to be available (you can check by typing “samtools” on the command line).

It is good practice to double-check datasets before you start analyzing them. For instance, use samtools to view the files.

In [None]:
/opt/conda/envs/r-bio/bin/samtools view chipseq_workshop/aligned/input-esc.chr17.2m.bam | head -n10

Then validate that the files are what they should be (e.g., aligned to chr17 and containing 2M reads).
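
One way to check the chromosome is to tally the reference name (the third SAM column) of every alignment. The sketch below assumes nothing beyond `awk` and `sort`; `count_by_chrom` is a hypothetical helper name, and the three records fed to it are synthetic stand-ins for real `samtools view` output.

```bash
# Tally alignments per reference name (SAM column 3); "*" means unmapped.
count_by_chrom() {
    awk -F '\t' '!/^@/ { n[$3]++ } END { for (c in n) print c "\t" n[c] }' | sort
}

# Synthetic records for illustration only (two on chr17, one unmapped).
printf 'r1\t0\tchr17\t100\t42\nr2\t16\tchr17\t250\t42\nr3\t4\t*\t0\t0\n' | count_by_chrom
```

On the real data you would pipe the BAM contents into the same function, e.g. `samtools view chipseq_workshop/aligned/input-esc.chr17.2m.bam | count_by_chrom`.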

To count the number of reads, run:


In [1]:
/opt/conda/envs/r-bio/bin/samtools view chipseq_workshop/aligned/input-esc.chr17.2m.bam | wc -l



Alternatively, an even better option here is `samtools flagstat`, which prints summary counts for the alignments in the file:

In [4]:
/opt/conda/envs/r-bio/bin/samtools flagstat chipseq_workshop/aligned/input-esc.chr17.2m.bam


If you want to better understand how SAM files are organized, see section 1.4 of <a href="https://samtools.github.io/hts-specs/SAMv1.pdf" target="_blank">https://samtools.github.io/hts-specs/SAMv1.pdf</a>.
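
To make the column layout concrete, here is a small sketch that prints the 11 mandatory SAM fields of one alignment line next to their names. `label_sam_fields` is a hypothetical helper, and the record is synthetic, not taken from the workshop data.

```bash
# Print the 11 mandatory SAM fields of one alignment line with their names.
label_sam_fields() {
    printf '%s\n' "$1" | awk -F '\t' '{
        split("QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL", name, " ")
        for (i = 1; i <= 11; i++) print name[i] "=" $i
    }'
}

# Synthetic record for illustration only.
record=$(printf 'read1\t0\tchr17\t35000000\t42\t5M\t*\t0\t0\tACGTA\tIIIII')
label_sam_fields "$record"
```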


--- 
### **2.** Create a “tag directory” 

**These commands should be run directly in the terminal**

We will create a tag directory for the example Oct4 ChIP-seq experiment using the makeTagDirectory command. Start by typing makeTagDirectory (without any options) on your command line; it will print the usage, some information about the command, and a full list of program options. As mentioned above, I highly recommend doing this whenever you use a new tool or a new command.

Tag directories are analogous to sorted BAM files and are the starting point for most HOMER operations, such as finding peaks, creating visualization files, or calculating read densities. The command also performs several quality control and parameter estimation calculations. The command has the following form:
    


```
makeTagDirectory <Output Tag Directory> [options] <input SAM/BAM file1> [input SAM/BAM file2] ...
``` 


To create a tag directory for the Oct4 experiment, open the terminal, make sure you are in the module-6-chipseq folder, and run the following command with recommended options:
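
A minimal sketch of such a command, assuming the directory layout from Part 0 and assuming the mm9 FASTA file sits next to its index at `chipseq_workshop/mm9/mm9.fa` (that path is a guess; adjust it to your setup). The `-genome` and `-checkGC` options ask HOMER to compute the GC-content statistics; treat this option set as an illustration rather than the workshop's official command.

```bash
# Paths follow the layout from Part 0; the FASTA location is an assumption.
bam=chipseq_workshop/aligned/oct4-esc.chr17.2m.bam
tagdir=chipseq_workshop/tagdirs/oct4-esc
genome_fa=chipseq_workshop/mm9/mm9.fa

if command -v makeTagDirectory >/dev/null 2>&1; then
    # Build the tag directory and collect GC-content statistics.
    makeTagDirectory "$tagdir" -genome "$genome_fa" -checkGC "$bam"
else
    echo "makeTagDirectory not found; install HOMER first" >&2
fi
```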


The command will take several seconds to run. It performs the following steps:

* parses the BAM file and removes reads that do not align to a unique position in the genome;
* separates reads by chromosome and sorts them by position;
* calculates how often reads appear at the same position to estimate the clonality (i.e., PCR duplication);
* calculates the distribution of reads relative to one another to estimate the ChIP-fragment length;
* calculates sequence properties and GC-content of the reads;
* performs a simple enrichment calculation to check whether the experiment looks like a ChIP-seq experiment (vs. an RNA-seq experiment).

The command creates a new directory, in this case named **oct4-esc**. Inside the directory are several text files that contain various QC results. 

Try looking at the following using the "head" command:

> * **tagInfo.txt** - summary information from the experiment, including read totals.
> * **tagFreqUniq.txt** - nucleotide frequencies relative to the 5’ end of the sequencing reads.
> * **genomeGCcontent.txt** - distribution of ChIP-fragment GC%
> * **tagAutocorrelation.txt** - relative distribution of reads found on the same strand vs. different strands.
> * **tagCountDistribution.txt** - number of reads appearing at the same positions.


In [2]:
# Use this box to look at the various files


--- 
### **3.** Create “tag directories” for all samples

Repeat the previous step for every sample using a ‘for loop’, again pasting the command into the terminal. This process will take about 8-10 minutes.
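
A sketch of what such a loop could look like, under the same assumptions as for the single Oct4 sample (FASTA at `chipseq_workshop/mm9/mm9.fa`, which is a guess) and assuming every BAM file ends in `.chr17.2m.bam`. `tagdir_for` is a hypothetical helper that derives the tag directory name from a BAM path.

```bash
# Map a BAM path to its tag directory, e.g. .../oct4-esc.chr17.2m.bam -> .../oct4-esc
tagdir_for() {
    echo "chipseq_workshop/tagdirs/$(basename "$1" .chr17.2m.bam)"
}

if command -v makeTagDirectory >/dev/null 2>&1; then
    for bam in chipseq_workshop/aligned/*.bam; do
        [ -e "$bam" ] || continue      # skip if the glob matched nothing
        makeTagDirectory "$(tagdir_for "$bam")" \
            -genome chipseq_workshop/mm9/mm9.fa -checkGC "$bam"
    done
else
    echo "makeTagDirectory not found; install HOMER first" >&2
fi
```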


At this point you should have 6 tag directories. Look through the QC stats of each one.

---
### **4.** Next we will visualize the ChIP-seq experiments.

We will do this by creating bedGraph files from the tag directories and viewing the results in the IGV genome browser. The bedGraph files are created with the makeUCSCfile command. For most ChIP-seq experiments, all you need to do is specify the tag directory and “-o auto”, which makes the command save the bedGraph file inside the tag directory automatically:

```
makeUCSCfile <tag directory> -o auto
```

For a specific dataset, e.g. Oct4, the command would be:

```
makeUCSCfile chipseq_workshop/tagdirs/oct4-esc/ -o auto
```

This creates the file “oct4-esc/oct4-esc.ucsc.bedGraph.gz”. This file format specifies the normalized read depth at variable intervals along the genome (use zmore and the filename to view the file format for yourself). 
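
If you have not worked with compressed bedGraph files before, the sketch below builds a tiny synthetic one (the coordinates and values are made up) and prints its first lines. The data columns are chromosome, start, end, and signal value.

```bash
# Build a tiny synthetic bedGraph in a temporary directory and compress it.
demo_dir=$(mktemp -d)
demo_bg="$demo_dir/demo.bedGraph.gz"
printf 'chr17\t3000000\t3000050\t1.5\nchr17\t3000050\t3000120\t4.0\n' | gzip > "$demo_bg"

# gzip -dc streams the decompressed text without writing an uncompressed copy.
gzip -dc "$demo_bg" | head -n 5
```

The same `gzip -dc <file> | head` pattern can be pointed at the bedGraph file inside a tag directory.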

Now make these for all samples:
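
A sketch of one way to do that, assuming all tag directories sit directly under `chipseq_workshop/tagdirs` as in the Oct4 example. The counter `n` is only there to report how many directories were processed.

```bash
n=0
if command -v makeUCSCfile >/dev/null 2>&1; then
    for d in chipseq_workshop/tagdirs/*/; do
        [ -d "$d" ] || continue        # skip if no tag directories exist yet
        makeUCSCfile "$d" -o auto      # writes the bedGraph inside the tag directory
        n=$((n + 1))
    done
else
    echo "makeUCSCfile not found; install HOMER first" >&2
fi
echo "bedGraph files requested for $n tag directories"
```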

### To view the file in the genome browser, do the following:

Download the files to your computer.

**Open IGV.** Make sure you select the right genome (mm9; checking the genome build is a good habit!) and drag the file into the center window (or select File -> Load from File).

The read pileups will display the relative density of ChIP-seq reads at each position in the genome. We only have data for chr17 in this example, so we can stick to that chromosome.

---
### **5.** See if there are any interesting patterns in the data that catch your eye.

Try visiting the Pou5f1 locus (the gene for Oct4) by typing the gene name into the search bar at the top. Once at the Pou5f1 locus, zoom out (alt+click or the scale on the top right) to see if there are any nearby sites that might resemble enhancers.

Each dataset was created with a different antibody, and the datasets can be divided into three types: transcription factors (TFs), histone modifications (HMs), and global input. Since we will need to treat each type differently, I recommend making a directory for each (input, tfs, and hms) and moving the tag directories into the relevant one (e.g., tfs/oct4-esc/).
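
A sketch of that reorganization, assuming the tag directories live under `chipseq_workshop/tagdirs`. `move_tagdir` is a hypothetical helper, and only the two sample names that appear in this workshop text (`input-esc` and `oct4-esc`) are filled in; the remaining samples need to be assigned by you.

```bash
# Base directory holding the tag directories (override by setting TAGDIRS).
TAGDIRS=${TAGDIRS:-chipseq_workshop/tagdirs}
mkdir -p "$TAGDIRS/input" "$TAGDIRS/tfs" "$TAGDIRS/hms"

# Move one tag directory into a group, skipping names that do not exist.
move_tagdir() {
    if [ -d "$TAGDIRS/$1" ]; then
        mv "$TAGDIRS/$1" "$TAGDIRS/$2/"
    fi
}

move_tagdir input-esc input
move_tagdir oct4-esc tfs
# Repeat for the remaining samples, sending histone modifications to hms.
```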
