/
propeQTLs_explained.Rmd
171 lines (109 loc) · 6.89 KB
/
propeQTLs_explained.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
---
title: "Proportion eQTLs explained"
author: "Briana Mittleman"
date: "6/7/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library(tidyverse)
library(workflowr)
library(reshape2)
```
I need to fix the explained_FDR10.sort.txt and unexplained_FDR10.sort.txt files because right now this file has multiple genes per snp.
```{bash,eval=F}
python fixExandUnexeQTL.py ../data/Li_eQTLs/explained_FDR10.sort.txt ../data/Li_eQTLs/explained_FDR10.sort_FIXED.txt
python fixExandUnexeQTL.py ../data/Li_eQTLs/unexplained_FDR10.sort.txt ../data/Li_eQTLs/unexplained_FDR10.sort_FIXED.txt
```
There are 1195 explained and 814 unexplained eQTLs. I will next look at each of these in my apadata.
Convert nominal results to have snps rather than rsids:
```{bash,eval=F}
python convertNominal2SNPLOC.py Total
python convertNominal2SNPLOC.py Nuclear
```
```{bash,eval=F}
mkdir ../data/overlapeQTL_try2
sbatch run_getapafromeQTL.sh
```
###total
I can group the unexplained by gene and snp then I can ask if there is at least 1 significat peak for each of these.
I will use the bonforoni correction here and multiply the pvalue by the number of peaks in the gene:snp association.
```{r}
nomnames=c("peakID", 'snp','dist', 'pval', 'slope')
totalapaUnexplained=read.table("../data/overlapeQTL_try2/apaTotal_unexplainedQTLs.txt", stringsAsFactors = F, col.names = nomnames) %>% separate(peakID, into=c("chr","start","end","geneID"), sep=":") %>% separate(geneID, into=c("gene", "loc", "strand", "PASnum"), sep="_") %>% group_by(gene, snp) %>% mutate(nPeaks=n(), adjPval=pval* nPeaks) %>% slice(which.min(adjPval))
totalapaUnexplained_sig= totalapaUnexplained %>% filter(adjPval<.05)
```
Look at distribution of these pvals:
```{r}
ggplot(totalapaUnexplained, aes(x=adjPval)) + geom_histogram(bins=50)
```
Proportion explained:
```{r}
nrow(totalapaUnexplained_sig)/nrow(totalapaUnexplained)
```
I tested 528 unexplained eQTLs in the total fraction and 68 have a bonforoni corrected significant peak.
Compare to explained eQTLS:
```{r}
totalapaexplained=read.table("../data/overlapeQTL_try2/apaTotal_explainedQTLs.txt", stringsAsFactors = F, col.names = nomnames) %>% separate(peakID, into=c("chr","start","end","geneID"), sep=":") %>% separate(geneID, into=c("gene", "loc", "strand", "PASnum"), sep="_") %>% group_by(gene, snp) %>% mutate(nPeaks=n(), adjPval=pval* nPeaks) %>% slice(which.min(adjPval))
totalapaexplained_sig= totalapaexplained %>% filter(adjPval<.05)
nrow(totalapaexplained_sig)/nrow(totalapaexplained)
```
I am testing 717 explained eQTLs and of those 83 have a bonforoni corrected significant peak.
difference of proportions:
```{r}
prop.test(x=c(nrow(totalapaUnexplained_sig),nrow(totalapaexplained_sig)), n=c(nrow(totalapaUnexplained),nrow(totalapaexplained)))
```
```{r}
ggplot(totalapaUnexplained_sig,aes(x=loc)) + geom_histogram(stat="count",aes(y=..count../sum(..count..))) + labs(y="Proportion", title = "Total apaQTLs explaining eQTLs")
```
```{r}
totalapaUnexplained_sig_loc= totalapaUnexplained_sig %>% group_by(loc) %>% summarise(nLocTotalUn=n()) %>% mutate(propTotalUn=nLocTotalUn/nrow(totalapaUnexplained_sig))
totalapaexplained_sig_loc= totalapaexplained_sig %>% group_by(loc) %>% summarise(nLocTotalEx=n()) %>% mutate(propTotalEx=nLocTotalEx/nrow(totalapaexplained_sig))
BothTotalLoc=totalapaUnexplained_sig_loc %>% full_join(totalapaexplained_sig_loc,by="loc") %>% replace_na(list(propTotalUn = 0, nLocTotalUn = 0,propTotalEx=0,nLocTotalEx=0 ))
BothTotalLoc
```
###nuclear
```{r}
nuclearapaUnexplained=read.table("../data/overlapeQTL_try2/apaNuclear_unexplainedQTLs.txt", stringsAsFactors = F, col.names = nomnames) %>% separate(peakID, into=c("chr","start","end","geneID"), sep=":") %>% separate(geneID, into=c("gene", "loc", "strand", "PASnum"), sep="_") %>% group_by(gene, snp) %>% mutate(nPeaks=n(), adjPval=pval* nPeaks) %>% slice(which.min(adjPval))
nuclearapaUnexplained_sig= nuclearapaUnexplained %>% filter(adjPval<.05)
nrow(nuclearapaUnexplained_sig)/nrow(nuclearapaUnexplained)
```
I tested 536 unexplained eQTLs in the nuclear fraction and 89 have a bonforoni corrected significant peak.
```{r}
nuclearapaexplained=read.table("../data/overlapeQTL_try2/apaNuclear_explainedQTLs.txt", stringsAsFactors = F, col.names = nomnames) %>% separate(peakID, into=c("chr","start","end","geneID"), sep=":") %>% separate(geneID, into=c("gene", "loc", "strand", "PASnum"), sep="_") %>% group_by(gene, snp) %>% mutate(nPeaks=n(), adjPval=pval* nPeaks) %>% slice(which.min(adjPval))
nuclearapaexplained_sig= nuclearapaexplained %>% filter(adjPval<.05)
nrow(nuclearapaexplained_sig)/nrow(nuclearapaexplained)
```
I tested 725 explained eQTLs in the nuclear fraction and 90 have a nominally significant peak.
difference of proportions:
```{r}
prop.test(x=c(nrow(nuclearapaUnexplained_sig),nrow(nuclearapaexplained_sig)), n=c(nrow(nuclearapaUnexplained),nrow(nuclearapaexplained)))
```
Nuclear are more likely to explain unexplained than expalined.
```{r}
ggplot(nuclearapaUnexplained_sig,aes(x=loc)) + geom_histogram(stat="count",aes(y=..count../sum(..count..))) + labs(title = "Nuclear apaQTLs explaining eQTLs", y="Proportion")
```
```{r}
nuclearapaUnexplained_sig_loc= nuclearapaUnexplained_sig %>% group_by(loc) %>% summarise(nLocnuclearUn=n()) %>% mutate(propnuclearUn=nLocnuclearUn/nrow(nuclearapaUnexplained_sig))
nuclearapaexplained_sig_loc= nuclearapaexplained_sig %>% group_by(loc) %>% summarise(nLocnuclearEx=n()) %>% mutate(propnuclearEx=nLocnuclearEx/nrow(nuclearapaexplained_sig))
BothnuclearLoc=nuclearapaUnexplained_sig_loc %>% full_join(nuclearapaexplained_sig_loc,by="loc") %>% replace_na(list(propnuclearUn = 0, nLocnuclearUn = 0,propnuclearEx=0,nLocnuclearEx=0 ))
BothnuclearLoc
```
```{r}
prop.test(x=c(39,48), n=c(nrow(nuclearapaUnexplained_sig),nrow(nuclearapaexplained_sig)))
prop.test(x=c(45,31), n=c(nrow(nuclearapaUnexplained_sig),nrow(nuclearapaexplained_sig)))
```
Unexplained are more likely to be in the 3' UTR.
###total v nuclear
```{r}
prop.test(x=c(nrow(nuclearapaUnexplained_sig),nrow(totalapaUnexplained_sig)), n=c(nrow(nuclearapaUnexplained),nrow(totalapaUnexplained)))
```
Differences in proportion by location
```{r}
allLocProp=BothnuclearLoc %>% full_join(BothTotalLoc, by="loc") %>% select(loc,propnuclearUn,propnuclearEx,propTotalUn,propTotalEx )
allLocPropmelt= melt(allLocProp, id.vars = "loc") %>% mutate(Fraction=ifelse(grepl("Total", variable), "Total", "Nuclear"),eQTL=ifelse(grepl("Un", variable), "Unexplained", "Explained"))
ggplot(allLocPropmelt,aes(x=loc, fill=eQTL, y=value)) + geom_histogram(stat="identity", position = "dodge") + facet_grid(~Fraction)+ labs(y="Proportion of PAS", title="apaQTLs overlaping eQTLs by PAS location")
```
This is a very stringent test. A less stringent way to get an upper bound would be to make an informed decision about which peak to use. This will make it so I am only testing one PAS per gene.