analysis/deNovopeakcalling.Rmd

---
title: "deNovo peak callling"
author: "Briana Mittleman"
date: "6/28/2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

##Create Bedgraph

I will call peaks de novo in the combined total and nuclear fraction 3' Seq. The data is reletevely clean so I will start with regions that have continuous coverage. I will first create a bedgraph.  

```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=Tbedgraph
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=Tbedgraph.out
#SBATCH --error=Tbedgraph.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END


module load Anaconda3
source activate three-prime-env 

samtools sort -o /project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.sort.bam /project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.bam

bedtools genomecov -ibam /project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.sort.bam -bga > /project2/gilad/briana/threeprimeseq/data/bedgraph/TotalBamFiles.bedgraph

```

Next I will create the file without the 0 places in the genome. I will be able to use this for the bedtools merge function.  

```{bash, eval=F}
awk '{if ($4 != 0) print}' TotalBamFiles.bedgraph >TotalBamFiles_no0.bedgraph 
```

I can merge the regions with consequtive reads using the bedtools merge function.  

* -i input bed  

* -c colomn to act on  

* -o collapse, print deliminated list of the counts from -c call  
 
* -delim ","  

This is the mergeBedgraph.sh script. It takes in the no 0 begraph filename without the path.  
```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=merge
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=merge.out
#SBATCH --error=merge.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env 

bedgraph=$1
describer=$(echo ${bedgraph} | sed -e "s/.bedgraph$//")

bedtools merge -c 4,4,4 -o count,mean,collapse -delim "," -i /project2/gilad/briana/threeprimeseq/data/bedgraph/$1 > /project2/gilad/briana/threeprimeseq/data/bedgraph/${describer}.peaks.bed


```

Run this first on the total bedgraph, TotalBamFiles_no0.bedgraph. The file has chromosome, start, end, number of regions, mean, and a string of the values.

This is not exaclty what I want. I need to go back and do genome cov not collapsing with bedgraph.  


To evaluate this I will bring the file into R and plot some statistics about it.  

```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=Tgencov
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=Tgencov.out
#SBATCH --error=Tgencov.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END


module load Anaconda3
source activate three-prime-env 


bedtools genomecov -ibam /project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.sort.bam -d > /project2/gilad/briana/threeprimeseq/data/bedgraph/TotalBamFiles.genomecov.bed

```
I will now remove the bases with 0 coverage.  

```{bash, eval=F}
awk '{if ($3 != 0) print}' TotalBamFiles.genomecov.bed > TotalBamFiles.genomecov.no0.bed 

awk '{print $1 "\t" $2 "\t"  $2 "\t" $3}' TotalBamFiles.genomecov.no0.bed > TotalBamFiles.genomecov.no0.fixed.bed
```

I will now merge the genomecov_no0 file with mergeGencov.sh
```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=mergegc
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=mergegc.out
#SBATCH --error=mergegc.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env 

gencov=$1
describer=$(echo ${gencov} | sed -e "s/.genomecov.no0.fixed.bed$//")

bedtools merge -c 4,4,4 -o count,mean,collapse -delim "," -i /project2/gilad/briana/threeprimeseq/data/bedgraph/$1 > /project2/gilad/briana/threeprimeseq/data/bedgraph/${describer}.gencovpeaks.bed


```
This method gives us 811,637 regions. 

##Evaluate regions  

###Bedgraph results 
```{r}
library(dplyr)
library(ggplot2)
library(readr)
library(workflowr)
library(tidyr)
```


First I will look at the bedgraph file. This is not as imformative becuase it combined regions with the same counts.  

```{r}
total_bedgraph=read.table("../data/bedgraph_peaks/TotalBamFiles_no0.peaks.bed",col.names = c("chr", "start", "end", "regions", "mean", "counts"))
```

Plot the mean:  

```{r}
plot(sort(log10(total_bedgraph$mean), decreasing=T), xlab="Region", ylab="log10 of bedgraph region bin", main="Distribution of log10 region means from bedgraph")
```
I want to look at the distribution of how many bases are included in the regions.

```{r}

Tregion_bases=total_bedgraph %>% mutate(bases=end-start) %>% select(bases)

plot(sort(log10(Tregion_bases$bases), decreasing = T), xlab="Region", ylab="log10 of region size", main="Distribution of bases in regions- log10")
```

Given the reads are abotu 60bp this is probably pretty good.  

###GenomeCov results  

I am only going to look at the number of bases in region and mean coverage columns here because the file is really big. 


```{r}
total_gencov=read.table("../data/bedgraph_peaks/TotalBamFiles.gencovpeaks_noregstring.bed",col.names = c("chr", "start", "end", "regions", "mean"))
```

Plot the mean:  

```{r}
plot(sort(log10(total_gencov$mean), decreasing=T), xlab="Region", ylab="log10 of mean bin count", main="Distribution of log10 region means")
```

```{r}


plot(sort(log10(total_gencov$regions), decreasing = T), xlab="Region", ylab="log10 of region size", main="Distribution of bases in regions- log10")
```

Plot number of bases against the mean:  

```{r}


ggplot(total_gencov, aes(y=log10(regions), x=log10(mean))) +
         geom_point(na.rm = TRUE, size = 0.1) +
         geom_density2d(na.rm = TRUE, size = 1, colour = 'red') +
         ylab('Log10 Region size') +
         xlab('Log10 Mean region coverage') + 
        ggtitle("Region size vs Region Coverage: Combined Total Libraries")
```

##Troubleshooting 
###Account for split reads  

In the previous analysis I did not account for split reads in the genome coveragre step. This may explain some of the long regions that are an effect of splicing. This script is 

```{bash, eval=F}
#!/bin/bash

#SBATCH --job-name=Tgencovsplit
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=Tgencovsplit.out
#SBATCH --error=Tgencovaplit.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END


module load Anaconda3
source activate three-prime-env 


bedtools genomecov -ibam /project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.sort.bam -d -split > /project2/gilad/briana/threeprimeseq/data/bedgraph/TotalBamFiles.split.genomecov.bed
```
Now I need to remove the 0s and merge. 
```{bash, eval=F}
awk '{if ($3 != 0) print}' TotalBamFiles.split.genomecov.bed > TotalBamFiles.split.genomecov.no0.bed

awk '{print $1 "\t" $2 "\t"  $2 "\t" $3}' TotalBamFiles.split.genomecov.no0.bed > TotalBamFiles.split.genomecov.no0.fixed.bed
```


Use this file to run mergeGencov.sh.  

```{r}
total_gencov_split=read.table("../data/bedgraph_peaks/TotalBamFiles.split.gencovpeaks.noregstring.bed",col.names = c("chr", "start", "end", "regions", "mean"))
```

Plot the region size. I expect some of the long regions are gone.  

```{r}

plot(sort(log10(total_gencov_split$regions), decreasing = T), xlab="Region", ylab="log10 of region size", main="Distribution of bases in regions- log10 SPLIT")
```

Plot the region size against the mean:  


Plot number of bases against the mean:  

```{r}


ggplot(total_gencov_split, aes(y=log10(regions), x=log10(mean))) +
         geom_point(na.rm = TRUE, size = 0.1) +
         geom_density2d(na.rm = TRUE, size = 1, colour = 'red') +
         ylab('Log10 Region size') +
         xlab('Log10 Mean region coverage') + 
        ggtitle("Region size vs Region Coverage: Combined Total Libraries SPLIT")
```

###Investigate long regions

Some of the regions are long and probably represent 2 or more sites. This is evident in highly expressed genes such as actB. I will look at some of the long regions and make histograms with the strings of coverage in the region.  

First I am going to look at chr11:65266512-65268654, this is peak 580475 I will go into the otalBamFiles.split.gencovpeaks.bed  file and use:

```{bash, eval=F}
 grep -n 65266512 TotalBamFiles.split.gencovpeaks.bed | awk '{print $6}' > loc_ch11_65266512_65268654.txt
```

```{r}
loc_ch11_65266512_65268654=read.csv("../data/bedgraph_peaks/loc_ch11_65266512_65268654.txt", header=F) %>% t


loc_ch11_65266512_65268654_df= as.data.frame(loc_ch11_65266512_65268654)

loc_ch11_65266512_65268654_df$loc= seq(1:nrow(loc_ch11_65266512_65268654_df))
colnames(loc_ch11_65266512_65268654_df)= c("count", "loc")

ggplot(loc_ch11_65266512_65268654_df, aes(x=loc, y=count)) + geom_line() + labs(y="Read Count", x="Peak Location", title="Example of long region called as 1 peak \n ch11 65266512-65268654")
```


Try one more. Example. line 816811, chr:17- 79476983- 79477761
```{bash, eval=F}
 grep -n 79476983 TotalBamFiles.split.gencovpeaks.bed | awk '{print $6}' > loc_ch17_79476983_79477761.txt
```


```{r}
loc_ch17_79476983_79477761=read.csv("../data/bedgraph_peaks/loc_ch17_79476983_79477761.txt", header=F) %>% t

loc_ch17_79476983_79477761_df= as.data.frame(loc_ch17_79476983_79477761)

loc_ch17_79476983_79477761_df$loc= seq(1:nrow(loc_ch17_79476983_79477761_df))
colnames(loc_ch17_79476983_79477761_df)= c("count", "loc")

ggplot(loc_ch17_79476983_79477761_df, aes(x=loc, y=count)) + geom_line() + labs(y="Read Count", x="Peak Location", title="Example of long region called as 1 peak \n ch17 79476983:79477761")
```
This one is not multiple peaks but it does need to be trimmed.