# CLIP-seq Analysis of Multi-mapped reads

<a id='section0'></a>
## Table of Contents
1. [Introduction](#section1)
2. [Installation](#section2)
3. [Input](#section3)
4. [Usage](#section4)<br>
    4.1 [preprocessor](#section4_1)<br>
    4.2 [realigner](#section4_2)<br>
    4.3 [peakcaller](#section4_3)<br>
    4.4 [permutation_callpeak](#section4_4)<br>
    4.5 [peak_annotator](#section4_5)<br>
5. [Output](#section5)
6. [An example](#section6)<br>
    6.1 [Sample dataset](#section6_1)<br>
    6.2 [Preprossing of sample dataset](#section6_2)<br>
    6.3 [Read mapping](#section6_3)<br>
    6.4 [Cleaning](#section6_3_5)<br>
    6.5 [CLAM pre-processing](#section6_5)<br>
    6.6 [Realign](#section6_6)<br>
    6.7 [Peak calling](#section6_7)<br>
    6.8 [Annotate](#section6_8)

<a id='section1'></a>
## Introduction
CLAM is a general toolkit for re-aligning multi-mapped reads in CLIP/RIP-seq data and calling peaks.
For details, please read our [NAR paper](https://academic.oup.com/nar/article/45/16/9260/4077049).
CLAM has recently been updated to v1.2.0, which adds the following features:
1. New strandedness tagging for anti-sense reads.
2. Multi-replicate mode for peak calling and benchmarking.
3. Annotation of peaks to genomic regions.
4. Installation through PyPI.

[TOC](#section0)

<a id='section2'></a>
## Installation
CLAM v1.2 works under Python 2 and 3. Download the latest version from the releases. After unzipping the file, type

in your terminal; this will install CLAM into the Python environment you are currently using.

CLAM requires `pysam`, which you should already have installed for your Python interpreter using pip or conda. If not, check the detailed requirements in the file `requirements.txt`, or type

to install the requirements manually.

Alternatively, a simpler way to install CLAM is through PyPI:

[TOC](#section0)

<a id='section3'></a>
## Input
The input for CLAM is a sorted or unsorted BAM file of CLIP-seq alignments and a gene annotation file in GTF format.<br>
For RIP-seq or eCLIP, a BAM file from the IP experiment and a BAM file from the control/input experiment are taken together as input.<br><br>
Note: In the released version v1.1, the `read_gtf` function had a bug that required a Gencode-format GTF (i.e. the last column of the GTF matches gene_id "(xx)") for peak calling to proceed. This bug has been fixed in the GitHub repository and in later releases/patches.



[TOC](#section0)

<a id='section4'></a>
## Usage
CLAM is run by issuing subcommands. Currently there are six subcommands available: preprocessor, realigner, peakcaller, permutation_callpeak, peak_annotator and data_downloader.


[TOC](#section0)

<a id='section4_1'></a>
### CLAM preprocessor
This subcommand (new since v1.1) prepares the input files for the CLAM pipeline. It selects reads passing QC, splits the input BAM file into `unique.sorted.bam` and `multi.sorted.bam`, and adds an additional tag "RT" (short for Read Tag) to each alignment based on the read tagger function the user supplied.

Note that you can also run `CLAM realigner` directly; it will automatically determine whether `preprocessor` has already been run in the output folder and call it if not.

If you don't want to run `realigner`, you can also run peakcaller directly after preprocessor.

Example run:

<a id='section4_2'></a>
### CLAM realigner
This subcommand runs expectation-maximization (EM) to assign the multi-mapped reads in a probabilistic framework. 
More details about the EM model are described in our NAR paper.
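
To give a feel for how an EM procedure can distribute multi-mapped reads, here is a minimal toy sketch. It is not the model from the NAR paper: the function name, the locus-level abundance model, the pseudocount and the input layout are all assumptions made for illustration.

```python
from collections import defaultdict


def em_assign_multireads(unique_counts, multireads, n_iter=50, pseudocount=1e-3):
    """Toy EM (illustrative only): split each multi-mapped read across loci.

    unique_counts: dict locus -> number of uniquely mapped reads
    multireads:    dict read_id -> list of distinct candidate loci
    Returns dict read_id -> dict locus -> weight (weights sum to 1 per read).
    """
    # Start from a uniform assignment over each read's candidate loci.
    weights = {
        read: {locus: 1.0 / len(loci) for locus in loci}
        for read, loci in multireads.items()
    }
    for _ in range(n_iter):
        # M-step: locus abundance = unique reads + fractional multi-mapped reads.
        abundance = defaultdict(float)
        for locus, count in unique_counts.items():
            abundance[locus] += count
        for assignment in weights.values():
            for locus, w in assignment.items():
                abundance[locus] += w
        # E-step: re-weight each read in proportion to the locus abundances.
        for read, loci in multireads.items():
            scores = {locus: abundance[locus] + pseudocount for locus in loci}
            total = sum(scores.values())
            weights[read] = {locus: s / total for locus, s in scores.items()}
    return weights


# Hypothetical input: locusA has far more unique support than locusB.
weights = em_assign_multireads(
    unique_counts={"locusA": 9, "locusB": 1},
    multireads={"read1": ["locusA", "locusB"]},
)
```

In this toy setting `read1` ends up weighted mostly towards `locusA`, because that locus has more uniquely mapped support.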

Note when `--retag` is specified, `realigner` will re-run `preprocessor` regardless; otherwise, it will use 
the prepared files in `outdir` if available.

Example run:

[TOC](#section0)

<a id='section4_3'></a>
### CLAM peakcaller
This subcommand (new since v1.1) calls peaks by looking for bins enriched with IP reads over control, using a 
negative binomial model of the observed read counts.

Note that we can specify both `unique.sorted.bam` (from `preprocessor`) and `realigned.sorted.bam` (from `realigner`), 
separating the two file paths by a space, to call peaks using the combination of uniquely- and multi-mapped reads.

Alternatively, we can input only `unique.sorted.bam`; this will make CLAM call peaks using only uniquely-mapped reads.

As a new feature in version 1.2.0, we implemented a multi-replicate mode for peakcaller. Simply use commas to separate the BAM files of the replicates (here, we assume there are two replicates named rep1 and rep2).
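
The separator conventions described above can be sketched as a small helper that builds the BAM path values. This is only an illustration: the function name and directory names are hypothetical, and the way replicates and file types are combined (one comma-joined group per file type, groups separated by spaces) is an assumption inferred from the two rules above, so check `CLAM peakcaller --help` before relying on it.

```python
def peakcaller_bam_values(replicate_dirs, include_realigned=True):
    """Build BAM path values: commas between replicates, one value per file type.

    Assumption (not confirmed by the text above): each file type forms one
    comma-joined group, and the groups are passed as space-separated values.
    """
    values = [",".join(d + "/unique.sorted.bam" for d in replicate_dirs)]
    if include_realigned:
        values.append(
            ",".join(d + "/realigned.sorted.bam" for d in replicate_dirs)
        )
    return values


# Hypothetical output directories for two replicates.
values = peakcaller_bam_values(["rep1", "rep2"])
print(" ".join(values))
```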

Example run:

<a id='section4_4'></a>
### CLAM permutation_callpeak
This subcommand calls peaks by permutation, randomly placing reads along the gene.
More details about the permutation procedure are described in our NAR paper.
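
The idea of randomly placing reads along a gene can be illustrated with a short sketch that estimates a coverage-height threshold from random placements. This is a generic permutation scheme written for illustration, not CLAM's implementation: the function names, the uniform placement and the use of maximum coverage height as the test statistic are all assumptions.

```python
import random


def max_coverage(starts, read_len, gene_len):
    """Maximum per-base coverage for reads placed at the given start positions."""
    diff = [0] * (gene_len + 1)
    for s in starts:
        diff[s] += 1
        diff[min(s + read_len, gene_len)] -= 1
    height = best = 0
    for delta in diff[:gene_len]:
        height += delta
        best = max(best, height)
    return best


def permutation_height_threshold(n_reads, read_len, gene_len,
                                 n_perm=1000, alpha=0.05, seed=0):
    """Height h such that the empirical P(max random height >= h) <= alpha."""
    rng = random.Random(seed)
    last_start = max(gene_len - read_len, 0)
    null_heights = sorted(
        max_coverage([rng.randint(0, last_start) for _ in range(n_reads)],
                     read_len, gene_len)
        for _ in range(n_perm)
    )
    # Take the (1 - alpha) quantile of the null maxima and require exceeding it.
    quantile_index = min(int((1 - alpha) * n_perm), n_perm - 1)
    return null_heights[quantile_index] + 1


# Hypothetical gene: 20 reads of 10 nt placed on a 1000 nt gene.
threshold = permutation_height_threshold(20, 10, 1000, n_perm=200)
```

Positions whose observed coverage reaches `threshold` would then be candidate peaks under this toy null model.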

Example run:

[TOC](#section0)

<a id='section4_5'></a>
### CLAM peak_annotator
This subcommand annotates peaks to genomic regions. It requires the genomic region files to exist locally; if they do not, `peak_annotator` will call `data_downloader` automatically. Currently, annotation is supported for human (hg19 and hg38) and mouse (mm10).

Example run:

[TOC](#section0)

<a id='section4_6'></a>
### CLAM data_downloader
This subcommand downloads the prepared genomic annotation files to the local system. It is usually called automatically by `peak_annotator`, but you can also run it manually.

Example run:

[TOC](#section0)

<a id='section5'></a>
## Output

The output of the re-aligner is "realigned.sorted.bam" (previously "assigned_multimapped_reads.bam" in v1.0), 
a customized BAM file that follows the SAM format. 
Note that the re-aligned weights are stored in the "AS:" tag, so please do not change or omit it.
The re-aligner output can also be seen as an intermediate file of the CLAM pipeline.
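
As a rough illustration of where the weight lives, the sketch below reads the optional `TAG:TYPE:VALUE` fields from one SAM-formatted text record (for example, a line printed by `samtools view`). The record is made up, and the assumption that the "AS" value is written as a float is ours; with pysam, the equivalent lookup on an alignment would be `read.get_tag("AS")`.

```python
def parse_sam_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as a dict."""
    converters = {"i": int, "f": float}
    tags = {}
    # The first 11 tab-separated columns of a SAM record are mandatory fields.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = converters.get(type_code, str)(value)
    return tags


# Hypothetical record, made up for illustration only.
example = "read1\t0\tchr1\t100\t255\t5M\t*\t0\t0\tACGTA\tIIIII\tNH:i:2\tAS:f:0.75"
tags = parse_sam_tags(example)
weight = tags["AS"]
```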

The output of the peak caller is a BED file in NarrowPeak format, a 10-column [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) format file. 
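
For downstream scripting, a 10-column NarrowPeak line can be read with a few lines of Python. The sketch below follows the standard ENCODE narrowPeak column order; the record type, the function name and the example line are hypothetical, and it assumes the signal, p-value, q-value and summit columns are numeric. The fifth column is kept as text because its content may not be a plain score.

```python
from collections import namedtuple

# Column order of the standard narrowPeak (BED6+4) layout.
NarrowPeak = namedtuple(
    "NarrowPeak",
    "chrom start end name score strand signal_value p_value q_value summit",
)


def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak line into a NarrowPeak record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError("expected 10 columns, got %d" % len(fields))
    return NarrowPeak(
        chrom=fields[0], start=int(fields[1]), end=int(fields[2]),
        name=fields[3], score=fields[4], strand=fields[5],
        signal_value=float(fields[6]), p_value=float(fields[7]),
        q_value=float(fields[8]), summit=int(fields[9]),
    )


# Hypothetical peak line, made up for illustration only.
peak = parse_narrowpeak_line("chr1\t1000\t1100\tgeneX\t1000\t+\t3.2\t5.1\t4.0\t50")
peak_width = peak.end - peak.start
```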

If you run the permutation peak caller (as in v1.0), there will be only one output file, called "narrow_peak.permutation.bed".
In this file, the fifth column tells how each peak was called: a peak with "combined" but no "unique" is a rescued peak; a peak with both "unique" and 
"combined" is a common peak; any other peak is a lost peak.
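
The rescued/common/lost rule can be written down as a small helper. This is a sketch under one assumption: that the labels appear as plain substrings of the fifth-column text. The exact encoding and delimiter are not specified here, and the function name is hypothetical.

```python
def classify_permutation_peak(label_field):
    """Classify a peak from the label text found in its fifth column."""
    has_unique = "unique" in label_field
    has_combined = "combined" in label_field
    if has_combined and not has_unique:
        return "rescued"  # only found after adding multi-mapped reads
    if has_combined and has_unique:
        return "common"   # found with and without multi-mapped reads
    return "lost"         # not found in the combined set


# Hypothetical label strings, made up for illustration only.
labels = ["combined", "unique,combined", "unique"]
classes = [classify_permutation_peak(label) for label in labels]
```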

If you run the model-based peak caller (since v1.1), the output depends on whether you turned on `--unique-only`: 
it is either "narrow_peak.unique.bed", for peaks called using only uniquely-mapped reads, or 
"narrow_peak.combined.bed", for peaks called after adding realigned multi-mapped reads.

[TOC](#section0)

<a id='section6'></a>
## An example

<a id='section6_1'></a>
### Sample dataset

A typical application of CLAM is to call peaks from CLIP-seq data. Here, we take an eCLIP dataset as an example. (A full Snakemake pipeline for analysing eCLIP data can be found [here](https://github.com/zj-zhang/CLAM_ENCODE_Snakemake).)

In this demo, we focus on RBFOX2 eCLIP data from the K562 cell line provided by Van Nostrand <em>et al.</em> 

First, download the raw data from ENCODE:

A detailed list of the files:

| File Name | Cell Line | Bait | Group | Run type | ENCODE Accession | Platform |
|---|---|---|---|---|---|---|
| ENCFF495WQA | K562 | RBFOX2 | Control-Read 1 | PE55nt | ENCSR051IXX | Illumina HiSeq 4000 |
| ENCFF492QZU | K562 | RBFOX2 | Control-Read 2 | PE46nt | ENCSR051IXX | Illumina HiSeq 4000 |
| ENCFF930TLO | K562 | RBFOX2 | Replicate 1-Read 1 | PE45nt | ENCSR756CKJ | Illumina HiSeq 4000 |
| ENCFF462CMF | K562 | RBFOX2 | Replicate 1-Read 2 | PE46nt | ENCSR756CKJ | Illumina HiSeq 4000 |
| ENCFF163QEA | K562 | RBFOX2 | Replicate 2-Read 1 | PE45nt | ENCSR756CKJ | Illumina HiSeq 4000 |
| ENCFF942TPA | K562 | RBFOX2 | Replicate 2-Read 2 | PE46nt | ENCSR756CKJ | Illumina HiSeq 4000 |

[TOC](#section0)

<a id='section6_2'></a>
### Preprocessing of sample dataset

Once all raw reads have been downloaded, use cutadapt to remove the adaptor sequences. Before doing so, unzip all files and rename them:

To remove adaptors, each set of paired-end reads should go through two rounds of processing. Take the control as an example:

[TOC](#section0)

<a id='section6_3'></a>
### Read mapping

After the reads have been processed, map them to the genome. Here, we use STAR to map reads to the human genome, version GRCh37.75. For a more comprehensive introduction to STAR, click [here](https://github.com/alexdobin/STAR).<br>
Again, we take the control as an example:

[TOC](#section0)

<a id='section6_3_5'></a>
### Cleaning

As rRNAs and other repetitive RNAs may bias peak calling, the mapped reads need to be cleaned.<br> Use [BEDTools](https://bedtools.readthedocs.io/en/latest/) to remove rRNA reads (the rRNA annotation can be exported from the [UCSC Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables)).

In CLIP data analysis, it is important to remove PCR duplicates. Here we use our in-house code to remove them (the code can be found [here](https://raw.githubusercontent.com/zj-zhang/CLAM_ENCODE_Snakemake/master/scripts/collapse_pcr/collapse_duplicates.py)).

[TOC](#section0)

<a id='section6_5'></a>
### CLAM preprocessing

One of the major features of CLAM is that multi-mapped reads are rescued by an EM procedure for peak calling (see our [paper](https://academic.oup.com/nar/article/45/16/9260/4077049) for more detail). Before peak calling, CLAM separates multi-mapped reads from uniquely mapped reads. This step can be omitted; if so, CLAM will still call the preprocessing module when it cannot find the separated BAM files.

```bash
$ CLAM preprocessor -i star/K562_RBFOX2_Inp/Aligned.out.mask_rRNA.dup_removed.bam -o clam/K562_RBFOX2_Inp --read-tagger-method start
```

[TOC](#section0)

<a id='section6_6'></a>
### Realign

This step realigns multi-mapped reads, using EM to assign them probabilistically among their candidate genomic locations.

[TOC](#section0)

<a id='section6_7'></a>
### Peak calling

This step calls peaks using CLAM's multi-replicate mode.

[TOC](#section0)

<a id='section6_8'></a>
### Annotate

Once peaks have been called, the genomic regions of the peaks can be annotated with `peak_annotator`. If the location of the genomic region annotation files is not set in the system environment, or CLAM cannot find the files for the specified genome version in the dedicated location, it will call `data_downloader` automatically.

```bash
$ CLAM peak_annotator -i clam/peaks/narrow.peaks.bed -g hg19 -o clam/peaks/annotate.txt
```

The header of the result file is:

[TOC](#section0)