# Basic input_normalization script runner for Eric's normalization wrapper
- attempts to document what happens in each step, for reference
- original file: /home/bay001/projects/codebase/from_others/Running_input_normalization.ipynb

```perl
FULL_PIPELINE_WRAPPER.pl [filelist] [output_folder]
```
- requires Statistics::R module to be installed

## Region level normalization script:
### REQUIRES: filelist.mapped_read_num
```perl
count_reads_OneOrTwoRepVersion.wrapper_organized_frombamfi_PE.pl [filelist] [output_folder]
```
```perl
    count_reads_broadfeatures_frombamfi_PEmap.pl [bam] > [*.reads_by_loc.csv]
```
        - Counts reads in regions from bam file (CDS/3'UTR/Introns/etc.)
        - Uses a priority for each transcript: (intron, 5utr, CDS, 3utr, nc_intron, nc_exon)
        - Combines transcripts into genes:
        - Uses a priority for each gene: (CDS, 5UTR, 3UTR, intron, nc_exon, nc_intron)
        - 3utr+5utr features are reads that map to both 5'UTR and 3'UTR
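
The priority rule described above can be sketched in Python. This is only an illustration of the idea, not the Perl script's actual code: `GENE_PRIORITY` and `assign_region` are hypothetical names, the order is copied from the gene-level note above, and the combined 5'UTR/3'UTR label is not modelled.

```python
# Gene-level priority order, copied from the notes above.
GENE_PRIORITY = ["CDS", "5UTR", "3UTR", "intron", "nc_exon", "nc_intron"]


def assign_region(overlapping_features, priority=GENE_PRIORITY):
    """Return the highest-priority feature among those a read overlaps."""
    for feature in priority:
        if feature in overlapping_features:
            return feature
    return None  # the read overlaps none of the listed features


# A read that overlaps both an intron and a 3'UTR is counted as 3'UTR.
print(assign_region({"intron", "3UTR"}))
```

Passing the transcript-level order as `priority` would give the per-transcript assignment instead.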

```perl
    combine_ReadsByLoc_files.pl [Rep1 (Rep2) Input.reads_by_loc.csv] > [UID_RBP_ReadsByLoc_combined.csv]
```
        - Combines paired replicates + input into one file for an experiment
```perl
    convert_ReadsByLoc_combined_significancecalls.pl [UID_RBP_ReadsByLoc_combined.csv] [mapped_read_num] > [UID*.csv.l2fcwithpval_enr.csv], [UID*.csv.l2fc.csv]
```
        - Calculates log2 fold-enrichment and chisq/fisher p-values for regions above input
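
As a rough sketch of what this step computes, the snippet below derives a log2 fold-enrichment from read counts normalized by the mapped read totals, plus a 2x2 chi-square p-value (1 degree of freedom). It is an assumption-laden illustration: the function names are hypothetical, returning NaN for empty counts is this sketch's choice, and the real script may use pseudocounts or switch to Fisher's exact test under conditions not documented here.

```python
import math


def log2_fold_enrichment(clip_reads, input_reads, clip_total, input_total):
    """log2 of (fraction of CLIP reads in region) / (fraction of input reads in region)."""
    if clip_reads == 0 or input_reads == 0:
        return float("nan")  # sketch's choice: undefined without reads on both sides
    clip_fraction = clip_reads / float(clip_total)
    input_fraction = input_reads / float(input_total)
    return math.log(clip_fraction / input_fraction, 2)


def chi_square_pvalue(clip_reads, input_reads, clip_total, input_total):
    """P-value of a 2x2 chi-square test (no continuity correction)."""
    a, b = clip_reads, clip_total - clip_reads      # CLIP: in region / elsewhere
    c, d = input_reads, input_total - input_reads   # input: in region / elsewhere
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return float("nan")
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / float(denominator)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


print(log2_fold_enrichment(40, 10, 1000, 1000))
print(chi_square_pvalue(40, 10, 1000, 1000))
```

The totals here play the role of the values in `mapped_read_num`.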
```perl
regionlevelanalysis_GOanalysisonl2fc.pl [UID*.csv.l2fcwithpval_enr.csv] > [UID*.csv.l2fcwithpval_enr.csv***??]
```
        - Performs GO term enrichment for each region for a given log2 fold change and log10pval cutoff
## Peak level normalization script:
```perl
compare_vs_input_peakbased_wrapper2_OneOrTwoRepVersion_PEbamfileversion_submit.pl
```
        - Runs input normalization on a file list of Gabe's peak calls (submit to cluster)
```perl
overlap_peakfi_with_bam_PE.pl
```
        - Normalizes CLIP signal vs input signal (or A vs B) based on peak file
```perl
compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
```
        - Compresses Gabe's peak calls by discarding an overlapping peak if it is less enriched above input
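
One way to picture the compression step is a greedy pass over peaks sorted by enrichment: keep a peak only if it does not overlap a more enriched peak that was already kept. The sketch below assumes half-open BED coordinates and a hypothetical tuple layout `(chrom, start, end, strand, l2fc)`; the Perl script's exact tie-breaking and overlap rules are not documented here.

```python
def compress_peaks(peaks):
    """Drop any peak that overlaps a more enriched peak on the same chrom and strand."""
    kept = []
    # Visit peaks from most to least enriched above input.
    for peak in sorted(peaks, key=lambda p: p[4], reverse=True):
        chrom, start, end, strand, _ = peak
        overlaps_kept = any(
            chrom == k[0] and strand == k[3] and start < k[2] and k[1] < end
            for k in kept
        )
        if not overlaps_kept:
            kept.append(peak)
    # Return the survivors in genomic order.
    return sorted(kept, key=lambda p: (p[0], p[1], p[2]))


peaks = [
    ("chr1", 100, 200, "+", 3.0),
    ("chr1", 150, 250, "+", 5.0),  # overlaps the first peak and is more enriched
    ("chr1", 300, 400, "+", 1.0),
]
print(compress_peaks(peaks))
```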
```perl
get_replicating_peaks_from_two_replicates.pl
```
        - Overlaps two replicate peak calls and outputs overlapping peaks (by log10(pvalue) and log2(foldchange) above input)
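
The replicate comparison can be sketched as: filter each replicate's peaks by enrichment and significance, then report pairs that overlap. The dictionary keys, the function names, and the default cutoffs of 3.0 are all hypothetical placeholders; check the Perl script for the thresholds it actually applies.

```python
def passes(peak, min_l2fc, min_neg_log10p):
    """True if the peak clears both the fold-change and the p-value cutoff."""
    return peak["l2fc"] >= min_l2fc and peak["neg_log10p"] >= min_neg_log10p


def replicating_peaks(rep1, rep2, min_l2fc=3.0, min_neg_log10p=3.0):
    """Return (rep1_peak, rep2_peak) pairs that overlap and both pass the cutoffs."""
    pairs = []
    for a in rep1:
        if not passes(a, min_l2fc, min_neg_log10p):
            continue
        for b in rep2:
            same_strand = a["chrom"] == b["chrom"] and a["strand"] == b["strand"]
            overlapping = a["start"] < b["end"] and b["start"] < a["end"]
            if same_strand and overlapping and passes(b, min_l2fc, min_neg_log10p):
                pairs.append((a, b))
    return pairs
```

The nested loop is quadratic, which is fine for illustration but would be replaced by a sorted sweep or an interval index for genome-scale peak lists.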
```perl
annotate_peaks_bedformat.pl
```
        - Annotates peaks
### Output files
#### \*{ACC}ReadsByLoc_combined.csv
#### \*{ACC}ReadsByLoc_combined.csv.l2fc.csv
#### \*{ACC}ReadsByLoc_combined.csv.l2fcwithpval_enr.csv : Tab-delimited file containing l2fc | -log10(pvalue)
#### \*{ACC}ReadsByLoc_combined.csv.l2fcwithpval_enr.csv.cutoff_counts.2 : We don't know what this is
#### \*{ACC}ReadsByLoc_combined.csv.l2fcwithpval_enr.csv.{ACC}_l2cutoff_2.3utr : List of genes 
### NaNs
NaNs = ? 
If a region has both an l2fc and a p-value, there were enough reads to detect a difference from input. If NaNs are present, there were not enough reads.
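
A small helper can flag cells that lack read support. This sketch assumes each cell in the `l2fcwithpval_enr.csv` file is literally `l2fc|-log10(pvalue)` with `|` as the separator and `NaN` spelled so that Python's `float` accepts it; both are assumptions based on the description above, and the function names are hypothetical.

```python
import math


def parse_cell(cell):
    """Split an 'l2fc|-log10(pvalue)' cell into two floats."""
    l2fc_text, pval_text = cell.split("|")
    return float(l2fc_text), float(pval_text)


def has_enough_reads(cell):
    """True when neither value in the cell is NaN."""
    l2fc, neg_log10_p = parse_cell(cell)
    return not (math.isnan(l2fc) or math.isnan(neg_log10_p))


print(has_enough_reads("2.5|4.1"))
print(has_enough_reads("NaN|NaN"))
```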




In [10]:
import qtools
import pandas as pd

In [20]:
# We'll need an input normalized manifest that looks like this:
input_norm_manifest = '/projects/ps-yeolab3/bay001/tbos/input_norm/rbfox2.txt'
output_dir = '/projects/ps-yeolab3/bay001/tbos/input_norm/rbfox2_input_norm/'
df = pd.read_table(input_norm_manifest)
df['INPUT'].iloc[0]  # .ix is deprecated and removed in current pandas; .iloc selects by position

'/projects/ps-yeolab3/bay001/tbos/input_norm/rbfox2/RBFOX2-204-INPUT_S2_R1.unassigned.adapterTrim.round2.rmRep.rmDup.sorted.r2.bam'

In [22]:
command = 'perl /home/elvannostrand/data/clip/CLIPseq_analysis/scripts/LABshare/FULL_PIPELINE_WRAPPER.pl {} {} hg19'.format(
    input_norm_manifest, output_dir)
print(command)

perl /home/elvannostrand/data/clip/CLIPseq_analysis/scripts/LABshare/FULL_PIPELINE_WRAPPER.pl /projects/ps-yeolab3/bay001/tbos/input_norm/rbfox2.txt /projects/ps-yeolab3/bay001/tbos/input_norm/rbfox2_input_norm/ hg19


The input manifest is a tab-separated file with the following column headers: uID, RBP, Cell line, CLIP_rep1, input

or, if you have two CLIP replicates per input: uID, RBP, Cell line, CLIP_rep1, CLIP_rep2, input

- uID = user-defined unique ID for the CLIP/input pair
- RBP = name of the RNA-binding protein
- Cell line = cell line
- CLIP_rep1 = full path to the CLIP merged bam file (suffix is .merged.r2.bam)
- CLIP_rep2 = full path to the second CLIP merged bam file, two-replicate format only (suffix is .merged.r2.bam)
- input = full path to the input bam file (suffix is .unassigned.adapterTrim.round2.rmRep.rmDup.sorted.r2.bam)
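
Before submitting a job, it can help to sanity-check the manifest against the layout described above. The sketch below is a hypothetical helper, not part of the wrapper: it compares headers case-insensitively (the notebook cell above reads an `INPUT` column, so the exact capitalization the wrapper expects is unclear) and checks the file suffixes listed above.

```python
import csv
import io

ONE_REP = ["uid", "rbp", "cell line", "clip_rep1", "input"]
TWO_REP = ["uid", "rbp", "cell line", "clip_rep1", "clip_rep2", "input"]
CLIP_SUFFIX = ".merged.r2.bam"
INPUT_SUFFIX = ".unassigned.adapterTrim.round2.rmRep.rmDup.sorted.r2.bam"


def check_manifest(handle):
    """Return a list of human-readable problems found in a manifest."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.strip().lower() for name in next(reader)]
    if header not in (ONE_REP, TWO_REP):
        return ["unexpected header: {}".format(header)]
    problems = []
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append("line {}: expected {} columns, found {}".format(
                line_number, len(header), len(row)))
            continue
        for column, path in zip(header, row):
            if column.startswith("clip_rep") and not path.endswith(CLIP_SUFFIX):
                problems.append("line {}: {} lacks suffix {}".format(
                    line_number, column, CLIP_SUFFIX))
            if column == "input" and not path.endswith(INPUT_SUFFIX):
                problems.append("line {}: input lacks suffix {}".format(
                    line_number, INPUT_SUFFIX))
    return problems


# Made-up example row; the paths are placeholders, not real files.
example = io.StringIO(
    "uID\tRBP\tCell line\tCLIP_rep1\tinput\n"
    "001\tRBFOX2\tcells\t/path/to/clip" + CLIP_SUFFIX + "\t/path/to/input" + INPUT_SUFFIX + "\n"
)
print(check_manifest(example))
```

For a real manifest, pass an open file handle instead of the `io.StringIO` example.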

NOTE - this script assumes that the CLIPper peak-call BED files are in the same directory as the bam files and that they follow Gabe's naming conventions, as produced by the GATK pipeline.

In [14]:
jobname = 'input_normalization_'
qtools.Submitter(command, jobname, array=False, nodes=1, ppn=1, walltime='2:00:00', submit=True)

Wrote commands to input_normalization_.sh.


7565811.tscc-mgr.local


Submitted script to queue home.
 Job ID: 7565811


<qtools.submitter.Submitter at 0x2b91b6f8a210>