analysis/splicesitestrength.Rmd

---
title: "5' Splice site strength"
author: "Briana Mittleman"
date: "2/17/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
library(workflowr)
library(tidyverse)
library(ggpubr)
```

I will assess 5' splice site strength with MaxEntScan (score5ss) to see whether it tells us anything interesting about intronic polyadenylation.  

http://hollywood.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html

How to use MaxEntScan::score5ss

Each sequence must be 9 bases long. [3 bases in exon][6 bases in intron] 
Input sequences as a FastA file with one sequence per line (no linebreaks). Non-ACGT sequences will not be processed. 

Example Fasta File 
```{bash,eval=F}
> dummy1
cagGTAAGT
> dummy2 
gagGTAAGT
> dummy3 
taaATAAGT
```


I assigned PAS to introns in https://brimittleman.github.io/apaQTL/nucintronicanalysis.html 

```{r}
pas2intron=read.table("../data/intron_analysis/IntronPeaksontoIntrons.bed",col.names = c("intronCHR", "intronStart", "intronEnd", "gene", "score", "strand", "peakCHR", "peakStart", "peakEnd", "PeakID", "meanUsage", "peakStrand")) 

#%>% mutate(PASloc=ifelse(strand=="+", peakEnd, peakStart)) %>% dplyr::select(intronStart, intronEnd, gene, strand, PeakID, PASloc ,meanUsage) %>% mutate(intronLength=intronEnd-intronStart , distance2PAS= ifelse(strand=="+", PASloc-intronStart, intronEnd-PASloc), propIntron=distance2PAS/intronLength)
```


I need a file with the PAS and the 5' splice site of its intron. For negative-strand PAS the 5' splice site is at the intron end, and for positive-strand PAS it is at the intron start.  


positive:  
start = intronStart - 3  
end = intronStart + 6  

negative:  
start = intronEnd - 6  
end = intronEnd + 3  

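
The window arithmetic above can be sketched in Python as follows. This is a minimal illustration, assuming BED-style 0-based, half-open intron coordinates; the function name and the example intron are hypothetical and are not part of the pipeline.

```python
def five_prime_ss_window(intron_start, intron_end, strand):
    """Return the (start, end) BED interval of the 9-base 5' splice site window.

    Assumes BED-style, 0-based half-open intron coordinates.
    """
    if strand == "+":
        # 5' splice site sits at the intron start: 3 exonic + 6 intronic bases
        return intron_start - 3, intron_start + 6
    if strand == "-":
        # 5' splice site sits at the intron end: 6 intronic + 3 exonic bases
        return intron_end - 6, intron_end + 3
    raise ValueError(f"unexpected strand: {strand!r}")


# Hypothetical intron at 1000-2000, evaluated on each strand
pos_window = five_prime_ss_window(1000, 2000, "+")
neg_window = five_prime_ss_window(1000, 2000, "-")
print(pos_window, neg_window)
```

Both windows are 9 bases long, which is the length MaxEntScan::score5ss requires.
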
```{bash,eval=F}
mkdir ../data/splicesite
```


```{r}
PAS_5SS_pos= pas2intron %>% filter(strand=="+") %>% mutate(start=intronStart-3, end= intronStart +6) %>% select(intronCHR, start,end, PeakID,meanUsage, strand)
PAS_5SS_neg=pas2intron %>% filter(strand=="-") %>% mutate(start=intronEnd-6, end= intronEnd +3) %>% select(intronCHR, start,end, PeakID,meanUsage, strand)
PAS_5SS_both= PAS_5SS_neg %>% bind_rows(PAS_5SS_pos)


write.table(PAS_5SS_pos, "../data/splicesite/TestPosSS.bed", col.names = F, row.names = F, quote=F, sep="\t")

write.table(PAS_5SS_neg, "../data/splicesite/TestNegSS.bed", col.names = F, row.names = F, quote=F, sep="\t")


write.table(PAS_5SS_both, "../data/splicesite/AllPASSS.bed", col.names = F, row.names = F, quote=F, sep="\t")


```


Merge and sort these to get the nucleotides:  

```{bash,eval=F}
sort -k1,1 -k2,2n ../data/splicesite/AllPASSS.bed > ../data/splicesite/AllPASSS.sort.bed


#cut chr  

sed 's/^chr//' ../data/splicesite/AllPASSS.sort.bed >  ../data/splicesite/AllPASSS.sort.noChr.bed


#bedtools nuc

bedtools nuc -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed ../data/splicesite/AllPASSS.sort.noChr.bed -seq -s > ../data/splicesite/AllPASSS.sort.Nuc.txt
```

This works, and it flips the strand. The first 3 bases are the exon and the next 6 are the intron. 
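
To make the strand flip concrete, here is a small Python sketch of reverse-complementing a window so that a negative-strand site reads exon first, then intron. The input sequence is a made-up example, and this only illustrates the idea; it is not how bedtools implements `-s`.

```python
# Translation table covering upper and lower case (the genome is soft-masked)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical plus-strand sequence extracted for a negative-strand window
plus_strand_window = "ACTTACCTG"
splice_site = reverse_complement(plus_strand_window)
print(splice_site)
```

After flipping, the first 3 bases are exonic and the GT dinucleotide sits at positions 4 and 5, as expected for a 5' splice site.
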

I need to turn this into a FASTA file, with the first 3 bases lower case and the next 6 upper case like the example.  I can do this in Python. 

For each PAS I will have the name on one line, then the bases on the next. 

```{bash,eval=F}
python splicesite2fasta.py
```
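
The script itself is not shown here. As a rough sketch of what such a conversion could look like, the function below assumes that the name is in the fourth column of the `bedtools nuc -seq` output, that the sequence is the last column, and that header lines start with `#`. The function name and the example records are hypothetical.

```python
def nuc_to_fasta(nuc_lines, name_col=3):
    """Yield FASTA lines (name, then sequence) from bedtools nuc -seq output lines."""
    for line in nuc_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[name_col], fields[-1]
        if len(seq) != 9 or set(seq.upper()) - set("ACGT"):
            continue  # MaxEntScan only scores 9-base ACGT sequences
        yield f"> {name}"
        # 3 exonic bases lower case, 6 intronic bases upper case
        yield seq[:3].lower() + seq[3:].upper()


# Hypothetical records: one valid site and one containing an N
stats = ["0.55", "0.45", "3", "1", "3", "2", "0", "0", "9"]
example = [
    "#header line\n",
    "\t".join(["15", "100", "109", "peak1", "0.3", "+"] + stats + ["CAGGTAAGT"]) + "\n",
    "\t".join(["15", "200", "209", "peak2", "0.1", "+"] + stats + ["CAGGTNAGT"]) + "\n",
]
fasta = list(nuc_to_fasta(example))
print("\n".join(fasta))
```

Any record skipped at this step would also have to be dropped from the BED file, because the scores are later joined back to the BED rows by position.
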
Score online with the MaxEntScan site, using the Maximum Entropy Model.  

Parse the result to keep every other line. Then I can join the results with the initial bed.  

```{bash,eval=F}
python parseSSres.py
```
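
The parsing script is not shown either. The sketch below illustrates the "keep every other line" step under an assumed layout: name lines start with `>`, and each score line starts with the sequence and ends with the numeric score. The layout, the function name, and the example values are all assumptions made for illustration.

```python
def parse_maxent(lines):
    """Yield (sequence, score) pairs, skipping the name lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # drop name lines, keep the score lines
        tokens = line.split()
        # Assumed layout: first token is the sequence, last token is the score
        yield tokens[0], float(tokens[-1])


# Made-up results in the assumed layout
raw = [
    "> peak1",
    "cagGTAAGT\tMAXENT: 10.86",
    "> peak2",
    "taaATAAGT\tMAXENT: -0.12",
]
parsed = list(parse_maxent(raw))
print(parsed)
```

Writing one sequence and one score per row gives a two-column table like the one read in below.
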


```{r}
res=read.table("../data/splicesite/MaxIntResParsed.txt", col.names=c("splicesite", "maxentscore"), header=F, stringsAsFactors = F)

bothSS=read.table("../data/splicesite/AllPASSS.sort.noChr.bed", header = F, col.names = c("chr", 'start','end','PAS', "NuclearUsage", 'strand'))


bothandres=bothSS %>% bind_cols(res)
```


Plot usage and score:  
```{r}
cor.test(bothandres$NuclearUsage, bothandres$maxentscore)
```

```{r}
ggplot(bothandres, aes(x=maxentscore, y=NuclearUsage)) + geom_point() + geom_density2d(col="red")
```

Filter to usage above 25% and score above 0:  

```{r}
bothandres_filt= bothandres %>% filter(NuclearUsage>0.25, maxentscore>0)

ggplot(bothandres_filt, aes(x=maxentscore, y=NuclearUsage)) + geom_point() + geom_density2d(col="red") + geom_smooth(method="lm")
```


Does not look like there is a relationship here.  


The expectation is that a stronger 5' SS means lower intronic PAS usage.  I will compare the top 10% and bottom 10% of usage.


```{r}

quantile(bothandres$NuclearUsage,probs=c(.1,.9))


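# Cutoffs are hard-coded from the quantile() output above (10% = 0.056, 90% = 0.38).
# The 0.15 threshold only separates the two retained groups.
bothandres_topbottom = bothandres %>% filter(NuclearUsage<= 0.056 | NuclearUsage >=0.38) %>% mutate(Usage=ifelse(NuclearUsage <=.15, "Low","High"))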

ggplot(bothandres_topbottom,aes(x=Usage, y=maxentscore))+ geom_boxplot()

```
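
As a sketch of the same top and bottom 10% split with computed rather than hard-coded cutoffs, the Python below uses `statistics.quantiles` with `method="inclusive"`, which interpolates the same way as R's default `quantile()`. The function name and the usage values are made up.

```python
import statistics


def label_extremes(usages):
    """Label values at or below the 10% quantile 'Low' and at or above the 90% quantile 'High'."""
    cuts = statistics.quantiles(usages, n=10, method="inclusive")
    low_cut, high_cut = cuts[0], cuts[-1]
    labels = []
    for usage in usages:
        if usage <= low_cut:
            labels.append("Low")
        elif usage >= high_cut:
            labels.append("High")
        else:
            labels.append(None)  # middle 80% is dropped from the comparison
    return labels


# Hypothetical PAS usage values
usages = [0.02, 0.05, 0.10, 0.20, 0.40, 0.60]
labels = label_extremes(usages)
print(labels)
```

Computing the cutoffs from the data avoids having to update the thresholds by hand if the input changes.
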


```{r}
bothandres_top=bothandres_topbottom %>% filter(Usage=="High")
bothandres_bottom=bothandres_topbottom %>% filter(Usage=="Low")

#x to the left of y  
wilcox.test(bothandres_top$maxentscore, bothandres_bottom$maxentscore, alternative="less")
```

The most used PAS have lower scores. This is in line with expectation.  


Compare to a random set of splice sites.  
Select 12536 introns at random.  
```{r}

chroms=c("chr1", 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22')
allIntron=read.table("/project2/gilad/briana/apaQTL/data/intron_analysis/transcriptsMinusExons.sort.bed", col.names = c("chr","start","end", 'gene', 'score','strand'),header = T, stringsAsFactors = F) %>% filter(chr %in% chroms)

#sampleIntron= allIntron %>% sample_n(12536, replace = F)
```

Get the 5' splice site for these:  

```{r}
#randPAS_5SS_pos= sampleIntron %>% filter(strand=="+") %>% mutate(newStart=start-3, newEnd= start +6) %>% select(chr, newStart,newEnd, gene,score, strand)


#randPAS_5SS_neg=sampleIntron %>% filter(strand=="-") %>% mutate(newStart=end-6, newEnd= end +3) %>% select(chr, newStart,newEnd, gene,score, strand)


#randPAS_both= randPAS_5SS_pos %>% bind_rows(randPAS_5SS_neg)


#write.table(randPAS_both,"../data/splicesite/RandomIntronSS.bed",sep="\t", col.names = F, row.names = F, quote = F)
```

```{bash,eval=F}
sort -k1,1 -k2,2n ../data/splicesite/RandomIntronSS.bed | sed 's/^chr//' > ../data/splicesite/RandomIntronSS_noChr.bed


#bedtools nuc

bedtools nuc -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed ../data/splicesite/RandomIntronSS_noChr.bed -seq -s > ../data/splicesite/RandomIntronSS_noChr.Nuc.bed


python Randomsplicesite2fasta.py

python parseRanodmSSres.py
```


Eval: 
```{r}
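# Read the sorted file that was passed to bedtools nuc so the rows line up with the parsed scores
RandomSites=read.table("../data/splicesite/RandomIntronSS_noChr.bed",col.names = c('chr','start','end','name','score','strand'))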
RandomRes= read.table("../data/splicesite/RandomSSMaxentParsed.txt", col.names = c("splicesite", "maxentscore_cont"), stringsAsFactors = F, header = F)

RandomSitewRes=RandomSites %>% bind_cols(RandomRes)


```

Compare these to the actual:  

```{r}
RealandCont=as.data.frame(cbind(Control=RandomSitewRes$maxentscore_cont, PAS=bothandres$maxentscore))
RealandContG=RealandCont %>%  gather("set", "score")

```


```{r}
ggplot(RealandContG, aes(x=set, y = score,fill=set)) + geom_boxplot() + stat_compare_means()

```

```{r}
summary(RandomSitewRes$maxentscore_cont)
summary(bothandres$maxentscore)
```

```{r}
wilcox.test(RandomSitewRes$maxentscore_cont,bothandres$maxentscore, alternative = "less")
```

Test if any of the QTLs fall in 5' splice sites.  For this I will look at the 5' site for every intron:  

```{r}
allIntron_sspos= allIntron %>% filter(strand=="+") %>% mutate(newStart=start-3, newEnd= start +6) %>% select(chr, newStart,newEnd, gene,score, strand)
allIntron_ssneg= allIntron  %>% filter(strand=="-") %>% mutate(newStart=end-6, newEnd= end +3) %>% select(chr, newStart,newEnd, gene,score, strand)

AllIntron_both=allIntron_ssneg %>% bind_rows(allIntron_sspos)

write.table(AllIntron_both, "../data/splicesite/AllIntron5primeSS.bed", col.names = F, row.names = F, quote = F, sep="\t")
```
Sort and intersect with QTL SNPs.  


```{bash,eval=F}

sort -k1,1 -k2,2n ../data/splicesite/AllIntron5primeSS.bed| sed 's/^chr//' > ../data/splicesite/AllIntron5primeSS_sort.bed

sort -k1,1 -k2,2n ../data/apaQTLs/Nuclear_apaQTLs4pc_5fdr.WITHSTRAND.bed | sed '1d' | head -n -1 > ../data/apaQTLs/Nuclear_apaQTLs4pc_5fdr.WITHSTRAND.sort.bed

bedtools intersect -wo -a ../data/splicesite/AllIntron5primeSS_sort.bed -b ../data/apaQTLs/Nuclear_apaQTLs4pc_5fdr.WITHSTRAND.sort.bed -s > ../data/splicesite/QTLin5SS.txt

```
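
To illustrate what the strand-aware intersect reports, here is a minimal Python sketch of a same-strand overlap with the overlap length, in the spirit of `-s -wo`. It uses the coordinates of the one hit listed below plus a made-up opposite-strand SNP; the function and the record names are hypothetical.

```python
def same_strand_overlaps(windows, snps):
    """Return (window_name, snp_name, overlap_bp) for same-chromosome, same-strand overlaps.

    Each record is (chrom, start, end, name, strand) in BED coordinates.
    """
    hits = []
    for w_chrom, w_start, w_end, w_name, w_strand in windows:
        for s_chrom, s_start, s_end, s_name, s_strand in snps:
            if w_chrom != s_chrom or w_strand != s_strand:
                continue  # different chromosome or opposite strand
            overlap = min(w_end, s_end) - max(w_start, s_start)
            if overlap > 0:
                hits.append((w_name, s_name, overlap))
    return hits


windows = [("15", 31229459, 31229468, "FAN1", "+")]
snps = [
    ("15", 31229462, 31229463, "FAN1:peak42822:utr3", "+"),
    ("15", 31229464, 31229465, "hypothetical_snp", "-"),  # made-up, wrong strand
]
hits = same_strand_overlaps(windows, snps)
print(hits)
```

The final value of 1 is the number of overlapping bases, matching the last column of the intersect output.
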

There is 1 example:  

15	31229459	31229468	FAN1	.	+	15	31229462	31229463	FAN1:peak42822:utr3	83	+	1  


## Introns by presence of intronic PAS  


Separate the introns by splice site strength. First run bedtools nuc on all of them.  


```{bash,eval=F}
#../data/splicesite/AllIntron5primeSS_sort.bed


#bedtools nuc

bedtools nuc -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed ../data/splicesite/AllIntron5primeSS_sort.bed -seq -s > ../data/splicesite/AllIntron5primeSS_sort_Nuc.bed


python Allsplicesite2fasta.py

python parseALLSSres.py
```


```{r}
IntronSites=read.table("../data/splicesite/AllIntron5primeSS_sort.bed", col.names =c('chr','start','end','name','score','strand'))
IntronRes=read.table("../data/splicesite/AllIntron_Parsed.txt", col.names=c("splicesite", "maxentscore") )

bothandres_g= bothandres %>% mutate(intron=paste(chr,start,end,sep=":")) %>% group_by(intron) %>% summarise(nPAS=n())

IntronSiteswRes=IntronSites %>% bind_cols(IntronRes) %>% mutate(intron=paste(chr,start,end,sep=":")) %>% full_join(bothandres_g,by="intron") %>% replace_na(list(nPAS = 0))
```

Now I need to decide the cutoffs: 

```{r}
ggplot(IntronSiteswRes, aes(x=maxentscore)) + geom_density()

summary(IntronSiteswRes$maxentscore)

quantile(IntronSiteswRes$maxentscore, seq(0,1, by=.1))
```
```{r}
IntronSiteswRes_dec= IntronSiteswRes %>% mutate(decile_rank = ntile(maxentscore,10))

IntronSiteswRes_decG= IntronSiteswRes_dec%>% group_by(decile_rank) %>% summarise(PAS=sum(nPAS))


```
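
The decile step can be sketched in Python as: rank introns by score, cut the ranking into 10 roughly equal buckets, and sum the PAS counts per bucket. This is an approximation for illustration; dplyr's `ntile()` may size buckets and break ties differently. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def decile_ranks(scores, n_buckets=10):
    """Assign each score a bucket from 1 (lowest scores) to n_buckets (highest)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for position, idx in enumerate(order):
        ranks[idx] = position * n_buckets // len(scores) + 1
    return ranks


def pas_per_decile(scores, n_pas):
    """Sum the number of PAS over the introns in each decile."""
    totals = defaultdict(int)
    for rank, count in zip(decile_ranks(scores), n_pas):
        totals[rank] += count
    return dict(sorted(totals.items()))


# Toy data: 20 introns, with PAS only in the two weakest splice sites
scores = list(range(20, 0, -1))
n_pas = [0] * 18 + [3, 2]
per_decile = pas_per_decile(scores, n_pas)
print(per_decile)
```

Decile 1 holds the introns with the lowest scores, so it corresponds to the weakest 5' splice sites.
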

```{r}
ggplot(IntronSiteswRes_decG, aes(x=decile_rank, y=PAS)) +geom_bar(stat="identity") + labs(x="5’ splice site strength (MaxEntScore) decile of introns", y= "Number of Intronic PAS", title="Number of intronic PAS by 5' Splice site strength")
```

Hypergeometric significance:  

```{r}
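x=2116 # number of PAS in decile 1 (hard-coded)
m=IntronSiteswRes_dec %>% filter(decile_rank == 1) %>% nrow() # introns in decile 1
n=IntronSiteswRes_dec %>% filter(decile_rank != 1) %>% nrow() # introns in all other deciles
k=sum(IntronSiteswRes_dec$nPAS) # total number of PAS
  

#expected (most likely count under the null)
which.max(dhyper(1:x, m, n, k))

x

# P(X >= x): phyper(q, ..., lower.tail=F) returns P(X > q), so use x-1
phyper(x-1, m, n, k,lower.tail=F)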
```
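
For reference, the upper-tail hypergeometric probability can be written out directly. The sketch below computes P(X >= x) for k draws from m "successes" and n "failures", using log-gamma so that large counts do not overflow. The function names are hypothetical and the example uses small toy numbers rather than the real counts.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_upper_tail(x, m, n, k):
    """P(X >= x) when drawing k items from m successes and n failures."""
    log_total = log_choose(m + n, k)
    lower = max(x, 0, k - n)  # smallest feasible count at or above x
    upper = min(m, k)         # largest feasible count
    return sum(
        exp(log_choose(m, i) + log_choose(n, k - i) - log_total)
        for i in range(lower, upper + 1)
    )


# Toy example: 5 successes, 5 failures, 5 draws, at least 4 successes
p_value = hypergeom_upper_tail(4, 5, 5, 5)
print(p_value)
```

For this toy case the exact answer is 26/252, since C(5,4)C(5,1) + C(5,5)C(5,0) = 26 and C(10,5) = 252.
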


This means there is significant enrichment for PAS in introns with the weakest 5' splice sites. 


Enrichment:  
(b/n) / (B/N)

Enrichment (N, B, n, b) is defined as follows:  
N - is the total number of genes  
B - is the total number of genes associated with a specific GO term  
n - is the number of genes in the top of the user's input list or in the target set when appropriate  
b - is the number of genes in the intersection  
Enrichment = (b/n) / (B/N)


Here:  
N = number of introns  
B = number of introns in the first decile (weakest 5' splice sites)  
n = number of PAS  
b = number of PAS in the first decile  

```{r}
N=nrow(IntronSiteswRes)
B= nrow(IntronSiteswRes_dec %>% filter(decile_rank==1))
n= sum(IntronSiteswRes_dec$nPAS)
b=2116

(b/n)/(B/N)
```
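
As a worked illustration of the enrichment formula, here is a short Python sketch with toy numbers (not the real counts); the function name is hypothetical.

```python
def enrichment(N, B, n, b):
    """Enrichment = (b/n) / (B/N)."""
    return (b / n) / (B / N)


# Toy numbers: 1000 introns, 100 of them in the decile of interest,
# 50 PAS in total, 10 of which fall in that decile
fold = enrichment(N=1000, B=100, n=50, b=10)
print(fold)
```

Here 20% of the PAS fall in a decile holding 10% of the introns, so the enrichment is 2. A value above 1 means the decile carries more PAS than its share of introns would predict.
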
