/
a.content.Rmd
328 lines (201 loc) · 10.7 KB
/
a.content.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
---
title: "“A” content in each mapped read"
author: "Briana Mittleman"
date: "6/11/2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
The goal of this analysis is to start to understand the sequence composition of the three prime seq reads. This may help me detect misspriming at AAAAA rich regions rather than true site usage. The genomic sequence does not carry the polyadenylation signal, this means reads mapping to a genomic AAAAAA region may be false positives. Gruber et al. removed reads that consisted of more than 80% AAAA.
##Percent nucleotide in bins
One method is to measure the number of AAAAAs in my bins with bedtools nuc. I willl need a fasta file and a bed file with the bin.
```{r}
library(workflowr)
library(ggplot2)
library(dplyr)
library(cowplot)
library(tidyr)
library(reshape2)
```
```{bash, eval=F}
#!/bin/bash
#SBATCH --job-name=nuc.bin
#SBATCH --time=8:00:00
#SBATCH --output=nuc.bin.out
#SBATCH --error=nuc.bin.err
#SBATCH --partition=broadwl
#SBATCH --mem=20G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
bedtools nuc -s -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed /project2/gilad/briana/genome_anotation_data/an.int.genome_200_strandspec.bed > /project2/gilad/briana/threeprimeseq/data/bin200.nuccov.bed
```
I can now pull this file into R.
```{r}
bin_nuccov=read.table("../data/bin200.nuccov.bed")
names(bin_nuccov)=c("chr", "start", "end", "bin", "score", "strand", "gene", "pct_at", "pct_gc", "numA", "numC", "numG", "numT", "numN", "numOther", "seqlen")
perc_A_bin=bin_nuccov %>% select("chr", "start", "end","bin", "strand", "gene", "numA") %>% mutate(percA=numA/200)
```
```{r}
ggplot(perc_A_bin, aes(percA)) + geom_histogram(bins=30) + labs(title="Percent of As in the bins")
```
I will apply the same filter as I did in the cov.200bp.wind file. I will keep bins with greater than 0 reads in half of the libraries.
```{r}
cov_all=read.table("../data/ssFC200.cov.bed", header = T, stringsAsFactors = FALSE)
#remember name switch!
names=c("Geneid","Chr", "Start", "End", "Strand", "Length", "N_18486","T_18486","N_18497","T_18497","N_18500","T_18500","N_18505",'T_18505',"N_18508","T_18508","N_18853","T_18853","N_18870","T_18870","N_19128","T_19128","N_19141","T_19141","N_19193","T_19193","N_19209","T_19209","N_19223","N_19225","T_19225","T_19223","N_19238","T_19238","N_19239","T_19239","N_19257","T_19257")
colnames(cov_all)= names
cov_nums_only=cov_all[,7:38]
keep.exprs=rowSums(cov_nums_only>0) >= 16
cov_all_filt=cov_all[keep.exprs,]
cov_all_filt_bins= cov_all_filt %>% separate(col=Geneid, into=c("bin","gene"), sep=".E") %>% select(bin)
cov_all_filt_bins$bin=as.integer(cov_all_filt_bins$bin)
```
I will intersect the percA file with the bins in the filtered file.
```{r}
perc_A_bin_filt= perc_A_bin %>% semi_join(cov_all_filt_bins, by="bin")
```
I can no plot the distribution of percA in this.
```{r}
ggplot(perc_A_bin_filt, aes(percA)) + geom_histogram(bins = 30) + labs(title="Percent of As in the bins after filtering")
```
I will compare this distribution to those for other nucleotides. (C)
```{r}
perc_C_bin=bin_nuccov %>% select("chr", "start", "end","bin", "strand", "gene", "numC") %>% mutate(percC=numC/200)
ggplot(perc_C_bin, aes(percC)) + geom_histogram(bins=30) + labs(title="Percent of Cs in the bins")
perc_C_bin_filt= perc_C_bin %>% semi_join(cov_all_filt_bins, by="bin")
ggplot(perc_C_bin_filt, aes(percC)) + geom_histogram(bins = 30) + labs(title="Percent of Cs in the bins after filtering")
```
'
For G
```{r}
perc_G_bin=bin_nuccov %>% select("chr", "start", "end","bin", "strand", "gene", "numG") %>% mutate(percG=numG/200)
ggplot(perc_G_bin, aes(percG)) + geom_histogram(bins=30) + labs(title="Percent of Gs in the bins")
perc_G_bin_filt= perc_G_bin %>% semi_join(cov_all_filt_bins, by="bin")
ggplot(perc_G_bin_filt, aes(percG)) + geom_histogram(bins = 30) + labs(title="Percent of Gs in the bins after filtering")
```
for T
```{r}
perc_T_bin=bin_nuccov %>% select("chr", "start", "end","bin", "strand", "gene", "numT") %>% mutate(percT=numT/200)
ggplot(perc_T_bin, aes(percT)) + geom_histogram(bins=30) + labs(title="Percent of Ts in the bins")
perc_T_bin_filt= perc_T_bin %>% semi_join(cov_all_filt_bins, by="bin")
ggplot(perc_T_bin_filt, aes(percT)) + geom_histogram(bins = 30) + labs(title="Percent of Ts in the bins after filtering")
```
Now I will join all of the percent usage of each nucleotide in the filtered bins so I can plot them on one plot.
```{r}
percNuc= perc_A_bin_filt %>% left_join(perc_T_bin_filt, by=c("chr", "start", "end", "bin", "strand", "gene")) %>% left_join(perc_G_bin_filt, by=c("chr", "start", "end", "bin", "strand", "gene")) %>% left_join(perc_C_bin_filt, by=c("chr", "start", "end", "bin", "strand", "gene")) %>% select("bin", "percA", "percT", "percG", "percC")
percNuc_melt=melt(percNuc, id.vars = "bin")
ggplot(percNuc_melt, aes(value)) + geom_histogram(bins = 30) + facet_wrap(~variable) + labs(title="Percent each nucleotide in filtered bins")
```
##Strech of Nucleotides
Next check is if the bins have 5 As in a row. I can do this using bedtools nuc as well.
###Five A's
```{bash, eval=F}
#!/bin/bash
#SBATCH --job-name=nucA
#SBATCH --time=8:00:00
#SBATCH --output=nucA.out
#SBATCH --error=nucA.err
#SBATCH --partition=broadwl
#SBATCH --mem=20G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
bedtools nuc -s -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed /project2/gilad/briana/genome_anotation_data/an.int.genome_200_strandspec.bed -pattern "AAAAA" > /project2/gilad/briana/threeprimeseq/data/bin200.5.A.nuccov.bed
```
```{r}
bin_Anuccov=read.table("../data/bin200.Anuccov.bed")
names(bin_Anuccov)=c("chr", "start", "end", "bin", "score", "strand", "gene", "pct_at", "pct_gc", "numA", "numC", "numG", "numT", "numN", "numOther", "seqlen", "fiveA")
hist(bin_Anuccov$fiveA)
```
I will filter this the same way I filtered the other file.
```{r}
bin_Anuccov_filt = bin_Anuccov %>% semi_join(cov_all_filt_bins, by="bin") %>% select( bin, gene, fiveA)
hist(bin_Anuccov_filt$fiveA)
summary(bin_Anuccov_filt$fiveA)
```
Count the number of bins with each value for number of 5 AAAAA
```{r}
countA_regions= bin_Anuccov_filt %>% group_by(fiveA) %>% count(fiveA)
```
###Five T's
I will compare this to regions with 5 T's
(This isnt the best comparison because the probability of each of these stretches genome wide may be different)
```{bash, eval=F}
#!/bin/bash
#SBATCH --job-name=nucT
#SBATCH --time=8:00:00
#SBATCH --output=nucT.out
#SBATCH --error=nucT.err
#SBATCH --partition=broadwl
#SBATCH --mem=20G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
bedtools nuc -s -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed /project2/gilad/briana/genome_anotation_data/an.int.genome_200_strandspec.bed -pattern "TTTTT" > /project2/gilad/briana/threeprimeseq/data/bin200.5.T.nuccov.bed
```
```{r}
bin_Tnuccov=read.table("../data/bin200.5.T.nuccov.bed")
names(bin_Tnuccov)=c("chr", "start", "end", "bin", "score", "strand", "gene", "pct_at", "pct_gc", "numA", "numC", "numG", "numT", "numN", "numOther", "seqlen", "fiveT")
summary(bin_Tnuccov$fiveT)
bin_Tnuccov_filt = bin_Tnuccov %>% semi_join(cov_all_filt_bins, by="bin") %>% select( bin, gene, fiveT)
hist(bin_Tnuccov_filt$fiveT)
summary(bin_Tnuccov_filt$fiveT)
countT_regions= bin_Tnuccov_filt %>% group_by(fiveT) %>% count(fiveT)
```
The numbers are not super different.
##Correlation between count and A statistics
####Percent nucleotide
cov_all_filt
```{r}
#select bin coverage fore 18486
cov_all_filt_18486=cov_all_filt %>%separate(col=Geneid, into=c("bin","gene"), sep=".E") %>% select(bin, T_18486)
cov_all_filt_18486$bin=as.integer(cov_all_filt_18486$bin)
#join with percA
perc_A_bin_filt_cov= perc_A_bin_filt %>% select(bin, percA) %>% right_join(cov_all_filt_18486,by="bin")
#melt it
perc_A_bin_filt_cov_melt=melt(perc_A_bin_filt_cov, id.vars="bin")
#plot it
perA_18486total=ggplot(perc_A_bin_filt_cov, aes(y=T_18486, x=percA)) + geom_point()+ labs(y="Total Bin count",x="Percent A in bin") + geom_smooth(method="lm", col="red")
#select bin coverage fore 18486
cov_all_filt_18486N=cov_all_filt %>%separate(col=Geneid, into=c("bin","gene"), sep=".E") %>% select(bin, N_18486)
cov_all_filt_18486N$bin=as.integer(cov_all_filt_18486N$bin)
#join with percA
perc_A_bin_filt_covN= perc_A_bin_filt %>% select(bin, percA) %>% right_join(cov_all_filt_18486N,by="bin")
#melt it
perc_A_bin_filt_cov_melNt=melt(perc_A_bin_filt_covN, id.vars="bin")
perA_18486nuc=ggplot(perc_A_bin_filt_covN, aes(y=N_18486, x=percA)) + geom_point() + labs(y="Nuclear Bin count",x="Percent A in bin") + geom_smooth(method="lm", col="red")
title <- ggdraw() + draw_label("No relationship between bin read count and percentage \n of A nucleotides in a bin ", fontface = 'bold')
x=plot_grid(perA_18486total,perA_18486nuc)
grid.plot=plot_grid(title, x,ncol=1, rel_heights=c(0.3, 1))
ggsave( "../output/plots/perc.A.bincount.png", grid.plot, width = 10, height = 7)
```
```{r}
lm(perc_A_bin_filt_covN$percA~perc_A_bin_filt_covN$N_18486 )
lm( perc_A_bin_filt_cov$percA ~ perc_A_bin_filt_cov$T_18486)
```
Both have about a 0 correlation.
I will remove bins we did not have coverage in.
```{r}
perc_A_bin_filt_cov_no0= perc_A_bin_filt %>% select(bin, percA) %>% right_join(cov_all_filt_18486,by="bin") %>% filter(T_18486 >0 )
ggplot(perc_A_bin_filt_cov_no0, aes(y=T_18486, x=percA)) + geom_point() + geom_smooth(method="lm", col="red")
```
Check for T
```{r}
#join with percA
perc_T_bin_filt_cov= perc_T_bin_filt %>% select(bin, percT) %>% right_join(cov_all_filt_18486,by="bin")
#melt it
perc_T_bin_filt_cov_melt=melt(perc_T_bin_filt_cov, id.vars="bin")
#plot it
ggplot(perc_T_bin_filt_cov, aes(y=T_18486, x=percT)) + geom_point()
```
####Number of times we see 5 of a nucleotide
```{r}
#join with fiveA
bin_Anuccov_filt_cov= bin_Anuccov_filt %>% select(bin, fiveA) %>% right_join(cov_all_filt_18486,by="bin")
#plot it
ggplot(bin_Anuccov_filt_cov, aes(y=T_18486, x=fiveA)) + geom_point()
```
It does not look like these are drivers of the variation we see in counts.
This analysis has shown me that mispriming is not a global problem in the samples. We do not see a correlaation bertween percent As in a bin and the read count for either the total or the nuclear. I also do not see any outliers bins with high coverage and high A percentage. If this is a problem it is at a specfici locus level. We can assess this in regions we see differences between total and nuclear fractions.