analysis/deNovopeakcalling.Rmd

---
title: "deNovo peak callling"
author: "Briana Mittleman"
date: "6/28/2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

##Create Bedgraph

I will call peaks de novo in the combined total and nuclear fraction 3' Seq. The data is reletevely clean so I will start with regions that have continuous coverage. I will first create a bedgraph.  

```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=Tbedgraph
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=Tbedgraph.out
#SBATCH --error=Tbedgraph.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END


module load Anaconda3
source activate three-prime-env 

samtools sort -o /project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.sort.bam /project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.bam

bedtools genomecov -ibam /project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.sort.bam -bga > /project2/gilad/briana/threeprimeseq/data/bedgraph/TotalBamFiles.bedgraph

```

Next I will create the file without the 0 places in the genome. I will be able to use this for the bedtools merge function.  

```{bash, eval=F}
awk '{if ($4 != 0) print}' TotalBamFiles.bedgraph >TotalBamFiles_no0.bedgraph 
```

I can merge the regions with consequtive reads using the bedtools merge function.  

* -i input bed  

* -c colomn to act on  

* -o collapse, print deliminated list of the counts from -c call  
 
* -delim ","  

This is the mergeBedgraph.sh script. It takes in the no 0 begraph filename without the path.  
```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=merge
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=merge.out
#SBATCH --error=merge.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env 

bedgraph=$1
describer=$(echo ${bedgraph} | sed -e "s/.bedgraph$//")

bedtools merge -c 4,4,4 -o count,mean,collapse -delim "," -i /project2/gilad/briana/threeprimeseq/data/bedgraph/$1 > /project2/gilad/briana/threeprimeseq/data/bedgraph/${describer}.peaks.bed


```

Run this first on the total bedgraph, TotalBamFiles_no0.bedgraph. The file has chromosome, start, end, number of regions, mean, and a string of the values.

This is not exaclty what I want. I need to go back and do genome cov not collapsing with bedgraph.  


To evaluate this I will bring the file into R and plot some statistics about it.  

```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=Tgencov
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=Tgencov.out
#SBATCH --error=Tgencov.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END


module load Anaconda3
source activate three-prime-env 


bedtools genomecov -ibam /project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.sort.bam -d > /project2/gilad/briana/threeprimeseq/data/bedgraph/TotalBamFiles.genomecov.bed

```
I will now remove the bases with 0 coverage.  

```{bash, eval=F}
awk '{if ($3 != 0) print}' TotalBamFiles.genomecov.bed > TotalBamFiles.genomecov.no0.bed 

awk '{print $1 "\t" $2 "\t"  $2 "\t" $3}' TotalBamFiles.genomecov.no0.bed > TotalBamFiles.genomecov.no0.fixed.bed
```

I will now merge the genomecov_no0 file with mergeGencov.sh
```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=mergegc
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=mergegc.out
#SBATCH --error=mergegc.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env 

gencov=$1
describer=$(echo ${gencov} | sed -e "s/.genomecov.no0.fixed.bed$//")

bedtools merge -c 4,4,4 -o count,mean,collapse -delim "," -i /project2/gilad/briana/threeprimeseq/data/bedgraph/$1 > /project2/gilad/briana/threeprimeseq/data/bedgraph/${describer}.gencovpeaks.bed


```
This method gives us 811,637 regions. 

##Evaluate regions  

###Bedgraph results 
```{r}
library(dplyr)
library(ggplot2)
library(readr)
library(workflowr)
library(tidyr)
```


First I will look at the bedgraph file. This is not as imformative becuase it combined regions with the same counts.  

```{r}
total_bedgraph=read.table("../data/bedgraph_peaks/TotalBamFiles_no0.peaks.bed",col.names = c("chr", "start", "end", "regions", "mean", "counts"))
```

Plot the mean:  

```{r}
plot(sort(log10(total_bedgraph$mean), decreasing=T), xlab="Region", ylab="log10 of bedgraph region bin", main="Distribution of log10 region means from bedgraph")
```
I want to look at the distribution of how many bases are included in the regions.

```{r}

Tregion_bases=total_bedgraph %>% mutate(bases=end-start) %>% select(bases)

plot(sort(log10(Tregion_bases$bases), decreasing = T), xlab="Region", ylab="log10 of region size", main="Distribution of bases in regions- log10")
```

Given the reads are abotu 60bp this is probably pretty good.  

###GenomeCov results  

I am only going to look at the number of bases in region and mean coverage columns here because the file is really big. 


```{r}
total_gencov=read.table("../data/bedgraph_peaks/TotalBamFiles.gencovpeaks_noregstring.bed",col.names = c("chr", "start", "end", "regions", "mean"))
```

Plot the mean:  

```{r}
plot(sort(log10(total_gencov$mean), decreasing=T), xlab="Region", ylab="log10 of mean bin count", main="Distribution of log10 region means")
```

```{r}


plot(sort(log10(total_gencov$regions), decreasing = T), xlab="Region", ylab="log10 of region size", main="Distribution of bases in regions- log10")
```

Plot number of bases against the mean:  

```{r}


ggplot(total_gencov, aes(y=log10(regions), x=log10(mean))) +
         geom_point(na.rm = TRUE, size = 0.1) +
         geom_density2d(na.rm = TRUE, size = 1, colour = 'red') +
         ylab('Log10 Region size') +
         xlab('Log10 Mean region coverage') + 
        ggtitle("Region size vs Region Coverage: Combined Total Libraries")
```

##Troubleshooting 
###Account for split reads  

In the previous analysis I did not account for split reads in the genome coveragre step. This may explain some of the long regions that are an effect of splicing. This script is 

```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=Tgencovsplit
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=Tgencovsplit.out
#SBATCH --error=Tgencovaplit.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END


module load Anaconda3
source activate three-prime-env 


bedtools genomecov -ibam /project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.sort.bam -d -split > /project2/gilad/briana/threeprimeseq/data/bedgraph/TotalBamFiles.split.genomecov.bed
```
Now I need to remove the 0s and merge. 
```{bash, eval=F}
awk '{if ($3 != 0) print}' TotalBamFiles.split.genomecov.bed > TotalBamFiles.split.genomecov.no0.bed

awk '{print $1 "\t" $2 "\t"  $2 "\t" $3}' TotalBamFiles.split.genomecov.no0.bed > TotalBamFiles.split.genomecov.no0.fixed.bed
```


Use this file to run mergeGencov.sh.  

```{r}
total_gencov_split=read.table("../data/bedgraph_peaks/TotalBamFiles.split.gencovpeaks.noregstring.bed",col.names = c("chr", "start", "end", "regions", "mean"))
```

Plot the region size. I expect some of the long regions are gone.  

```{r}

plot(sort(log10(total_gencov_split$regions), decreasing = T), xlab="Region", ylab="log10 of region size", main="Distribution of bases in regions- log10 SPLIT")
```

Plot the region size against the mean:  


Plot number of bases against the mean:  

```{r}


splitplot=ggplot(total_gencov_split, aes(y=log10(regions), x=log10(mean))) +
         geom_point(na.rm = TRUE, size = 0.1) +
         geom_density2d(na.rm = TRUE, size = 1, colour = 'red') +
         ylab('Log10 Region size') +
         xlab('Log10 Mean region coverage') + 
     scale_y_continuous(limits = c(0, 3)) +
        ggtitle("Combined Total Libraries SPLIT")
```

###Investigate long regions

Some of the regions are long and probably represent 2 or more sites. This is evident in highly expressed genes such as actB. I will look at some of the long regions and make histograms with the strings of coverage in the region.  

First I am going to look at chr11:65266512-65268654, this is peak 580475 I will go into the otalBamFiles.split.gencovpeaks.bed  file and use:

```{bash, eval=F}
 grep -n 65266512 TotalBamFiles.split.gencovpeaks.bed | awk '{print $6}' > loc_ch11_65266512_65268654.txt
```

```{r}
loc_ch11_65266512_65268654=read.csv("../data/bedgraph_peaks/loc_ch11_65266512_65268654.txt", header=F) %>% t


loc_ch11_65266512_65268654_df= as.data.frame(loc_ch11_65266512_65268654)

loc_ch11_65266512_65268654_df$loc= seq(1:nrow(loc_ch11_65266512_65268654_df))
colnames(loc_ch11_65266512_65268654_df)= c("count", "loc")

ggplot(loc_ch11_65266512_65268654_df, aes(x=loc, y=count)) + geom_line() + labs(y="Read Count", x="Peak Location", title="Example of long region called as 1 peak \n ch11 65266512-65268654")
```


Try one more. Example. line 816811, chr:17- 79476983- 79477761
```{bash, eval=F}
 grep -n 79476983 TotalBamFiles.split.gencovpeaks.bed | awk '{print $6}' > loc_ch17_79476983_79477761.txt
```


```{r}
loc_ch17_79476983_79477761=read.csv("../data/bedgraph_peaks/loc_ch17_79476983_79477761.txt", header=F) %>% t

loc_ch17_79476983_79477761_df= as.data.frame(loc_ch17_79476983_79477761)

loc_ch17_79476983_79477761_df$loc= seq(1:nrow(loc_ch17_79476983_79477761_df))
colnames(loc_ch17_79476983_79477761_df)= c("count", "loc")

ggplot(loc_ch17_79476983_79477761_df, aes(x=loc, y=count)) + geom_line() + labs(y="Read Count", x="Peak Location", title="Example of long region called as 1 peak \n ch17 79476983:79477761")
```
This one is not multiple peaks but it does need to be trimmed.  

##Compare to adhoc method by Yang
Yang created an adhoc method to do this.  

```{python, eval=F}
def main(inFile, outFile, ctarget):
    fout = open(outFile,'w')
    mincount = 10
    ov = 20
    current_peak = []
    
    currentChrom = None
    
    for ln in open(inFile):
        chrom, pos, count = ln.split()
        if chrom != ctarget: continue
        count = float(count)

        if currentChrom == None:
            currentChrom = chrom
            
        if count == 0 or currentChrom != chrom:
            if len(current_peak) > 0:
                M = max([x[1] for x in current_peak])
                if M > mincount:
                    all_peaks = refine_peak(current_peak, M, M*0.1,M*0.05)
                    #refined_peaks = [(x[0][0],x[-1][0], np.mean([y[1] for y in x])) for x in all_peaks]  
                    rpeaks = [(int(x[0][0])-ov,int(x[-1][0])+ov, np.mean([y[1] for y in x])) for x in all_peaks]
                    if len(rpeaks) > 1:
                        for clu in cluster_intervals(rpeaks)[0]:
                            M = max([x[2] for x in clu])
                            merging = []
                            for x in clu:
                                if x[2] > M *0.5:
                                    #print x, M
                                    merging.append(x)
                            c, s,e,mean =  chrom, min([x[0] for x in merging])+ov, max([x[1] for x in merging])-ov, np.mean([x[2] for x in merging])
                            #print c,s,e,mean
                            fout.write("chr%s\t%d\t%d\t%d\t+\t.\n"%(c,s,e,mean))
                            fout.flush()
                    elif len(rpeaks) == 1:
                        s,e,mean = rpeaks[0]
                        fout.write("chr%s\t%d\t%d\t%f\t+\t.\n"%(chrom,s+ov,e-ov,mean))
                        print ("chr%s"%chrom+"\t%d\t%d\t%f\t+\t.\n"%rpeaks[0])
                    #print refined_peaks
            current_peak = []
        else:
            current_peak.append((pos,count))
        currentChrom = chrom


def refine_peak(current_peak, M, thresh, noise, minpeaksize=30):
    
    cpeak = []
    opeak = []
    allcpeaks = []
    allopeaks = []

    for pos, count in current_peak:
        if count > thresh:
            cpeak.append((pos,count))
            opeak = []
            continue
        elif count > noise: 
            opeak.append((pos,count))
        else:
            if len(opeak) > minpeaksize:
                allopeaks.append(opeak) 
            opeak = []

        if len(cpeak) > minpeaksize:
            allcpeaks.append(cpeak)
            cpeak = []
        
    if len(cpeak) > minpeaksize:
        allcpeaks.append(cpeak)
    if len(opeak) > minpeaksize:
        allopeaks.append(opeak)

    allpeaks = allcpeaks
    for opeak in allopeaks:
        M = max([x[1] for x in opeak])
        allpeaks += refine_peak(opeak, M, M*0.3, noise)

    #print [(x[0],x[-1]) for x in allcpeaks], [(x[0],x[-1]) for x in allopeaks], [(x[0],x[-1]) for x in allpeaks]
    #print '---\n'
    return allpeaks

if __name__ == "__main__":
    import numpy as np
    from misc_helper import *
    import sys

    chrom = sys.argv[1]
    inFile = "/project2/yangili1/threeprimeseq/gencov/TotalBamFiles.split.genomecov.bed"
    outFile = "APApeaks_chr%s.bed"%chrom
    main(inFile, outFile, chrom)
```

This is done by chromosome and takes in the TotalBam Split genome coverage file I made.I am going to look at the stats for these peaks.  

```{r}
YL_peaks=read.table("../data/bedgraph_peaks/APApeaks.bed", col.names = c("chr", "start", "end", "count", "strand", "score")) %>% mutate(length=end-start)
```
Plot the lengths   

```{r}
plot(sort(log10(YL_peaks$length), decreasing = T), xlab="Region", ylab="log10 of region size", main="Distribution of bases in YL regions- log10 ")
```

Plot number of bases against the mean:  

```{r}


YLplot=ggplot(YL_peaks, aes(y=log10(length), x=log10(count))) +
         geom_point(na.rm = TRUE, size = 0.1) +
         geom_density2d(na.rm = TRUE, size = 1, colour = 'red') +
         ylab('Log10 Region size') +
         xlab('Log10 Mean region coverage') + 
        scale_y_continuous(limits = c(0, 3)) + 
        ggtitle("YL Peaks Combined Total Libraries")
```

```{r}
library(cowplot)
plot_grid(splitplot, YLplot)
```


##Run this on the Nuclear Fraction Bam  

```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=Ngencov_s
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=Ngencov_s.out
#SBATCH --error=Ngencov_s.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END


module load Anaconda3
source activate three-prime-env 

samtools sort -o /project2/gilad/briana/threeprimeseq/data/macs2/NuclearBamFiles.sort.bam /project2/gilad/briana/threeprimeseq/data/macs2/NuclearBamFiles.bam 


bedtools genomecov -ibam /project2/gilad/briana/threeprimeseq/data/macs2/NuclearBamFiles.sort.bam -d -split > /project2/gilad/briana/threeprimeseq/data/bedgraph/NuclearBamFiles.split.genomecov.bed

```


I modified Yang's script to take the nuclear gencov and put the output in the data/peaks directory. I will create a wrapper to call this on chromosomes 1-22. 

```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=w_getpeakYL
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=w_getpeakYL.out
#SBATCH --error=w_getpeakYL.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env


for i in $(seq 1 22); do 
  sbatch callPeaksYL_Nuc.py $i
done

```

I can now concatenate all of these into one file:  
```{bash, eval=F}
cat * | sort -k 1,1 -k2,2n > APApeaks_nuclear_all.bed 
```


Thoughts:  

* Remove peaks outside 1kb of the genes   

* Remove peaks with low expression