/
FractionDiffProcessed.Rmd
722 lines (464 loc) · 20.8 KB
/
FractionDiffProcessed.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
---
title: "Fraction Differences with Processed data"
author: "Briana Mittleman"
date: "1/22/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library(workflowr)
library(tidyverse)
library(reshape2)
library(cowplot)
```
In this analysis https://brimittleman.github.io/threeprimeseq/PeakToGeneAssignment.html I ran an initial run of the leafcutter tool for differences between fractions. I will use the same pipeline here for the processed data.
This starts with running feature counts with all of the peaks. I will use peaks passing the filter in either the total or nucelar fraction. These peaks are in /project2/gilad/briana/threeprimeseq/data/mergedPeaks_noMP_filtered/Filtered_APApeaks_merged_allchrom_noMP.sort.named.noCHR.refseqTrans.closest2end.sm.fixed_5percCov.bed and were created using the filternamePeaks5percCov.py script. I need to make this into an SAF file for FC.
This file has chr, start, end, peakNu, cov, strand, transcript:gene, distance. For the SAF file I want
GeneID, Chr, start, end, strand. The GeneID is peak#:chr:start:end:strand:gene
bed2saf_bothFrac_Processed.py
```{bash,eval=F}
from misc_helper import *
fout = open("/project2/gilad/briana/threeprimeseq/data/mergedPeaks_noMP_filtered/Filtered_APApeaks_merged_allchrom_noMP.sort.named.noCHR.refseqTrans.closest2end.sm.fixed_5percCov.SAF",'w')
fout.write("GeneID\tChr\tStart\tEnd\tStrand\n")
for ln in open("/project2/gilad/briana/threeprimeseq/data/mergedPeaks_noMP_filtered/Filtered_APApeaks_merged_allchrom_noMP.sort.named.noCHR.refseqTrans.closest2end.sm.fixed_5percCov.bed"):
chrom, start, end, peakNum, cov, strand, trans, dist = ln.split()
gene=trans.split(":")[1]
ID = "peak%s:%s:%s:%s:%s:%s"%(peakNum,chrom,start, end,strand,gene)
fout.write("%s\t%s\t%s\t%s\t%s\n"%(ID, chrom, start, end, strand))
fout.close()
```
bothFrac_processed_FC.sh
```{bash,eval=F}
#!/bin/bash
#SBATCH --job-name=bothFrac_processed_FC
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=bothFrac_processed_FCc.out
#SBATCH --error=bothFrac_processed_FC.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
featureCounts -O -a /project2/gilad/briana/threeprimeseq/data/mergedPeaks_noMP_filtered/Filtered_APApeaks_merged_allchrom_noMP.sort.named.noCHR.refseqTrans.closest2end.sm.fixed_5percCov.SAF -F SAF -o /project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov_processed_bothFrac/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant_processed.fc /project2/gilad/briana/threeprimeseq/data/bam_NoMP_sort/*sort.bam -s 2
```
Fix headers:
fix_head_fc_procBothFrac.py
```{bash,eval=F}
#python
infile= open("/project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov_processed_bothFrac/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant_processed.fc", "r")
fout = open("/project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov_processed_bothFrac/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant_processed_fixed.fc",'w')
for line, i in enumerate(infile):
if line == 1:
i_list=i.split()
libraries = i_list[:6]
print(libraries)
for sample in i_list[6:]:
full = sample.split("/")[7]
samp= full.split("-")[2:4]
lim="_"
samp_st=lim.join(samp)
libraries.append(samp_st)
first_line= "\t".join(libraries)
fout.write(first_line + '\n')
else :
fout.write(i)
fout.close()
```
fc2leafphen_processed.py
```{bash,eval=F}
inFile= open("/project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov_processed_bothFrac/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant_processed_fixed.fc", "r")
outFile= open("/project2/gilad/briana/threeprimeseq/data/pheno_DiffIso_processed/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant_processed_forLC.fc", "w")
for num, ln in enumerate(inFile):
if num == 1:
lines=ln.split()[6:]
outFile.write(" ".join(lines)+'\n')
if num > 1:
ID=ln.split()[0]
peak=ID.split(":")[0]
chrom=ID.split(":")[1]
start=ID.split(":")[2]
start=int(start)
end=ID.split(":")[3]
end=int(end)
strand=ID.split(":")[4]
gene=ID.split(":")[5]
new_ID="chr%s:%d:%d:%s"%(chrom, start, end, gene)
pheno=ln.split()[6:]
pheno.insert(0, new_ID)
outFile.write(" ".join(pheno)+'\n')
outFile.close()
```
subset_diffisopheno_processed.py
```{bash,eval=F}
def main(inFile, outFile, target):
ifile=open(inFile, "r")
ofile=open(outFile, "w")
target=int(target)
for num, ln in enumerate(ifile):
if num == 0:
ofile.write(ln)
else:
ID=ln.split()[0]
chrom=ID.split(":")[0][3:]
print(chrom)
chrom=int(chrom)
if chrom == target:
ofile.write(ln)
if __name__ == "__main__":
import sys
target = sys.argv[1]
inFile = "/project2/gilad/briana/threeprimeseq/data/pheno_DiffIso_processed/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant_processed_forLC.fc"
outFile = "/project2/gilad/briana/threeprimeseq/data/pheno_DiffIso_processed/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant_processed_forLC_%s.txt"%(target)
main(inFile, outFile, target)
```
Run this with: run_subset_diffisopheno_processed.sh
```{bash, eval=F}
#!/bin/bash
#SBATCH --job-name=run_subset_diffisopheno_processed
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=run_subset_diffisopheno_processed.out
#SBATCH --error=run_subset_diffisopheno_processed.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
do
python subset_diffisopheno_processed.py $i
done
```
Make new sample list. I could use the old one but I want to have this pipeline work when I add individuals.
makeLCSampleList_processed.py
```{bash,eval=F}
outfile=open("/project2/gilad/briana/threeprimeseq/data/pheno_DiffIso_processed/sample_groups.txt", "w")
infile=open("/project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov_processed_bothFrac/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant_processed.fc", "r")
for line, i in enumerate(infile):
if line == 1:
i_list=i.split()
libraries=[]
for sample in i_list[6:]:
full = sample.split("/")[7]
samp= full.split("-")[2:4]
lim="_"
samp_st=lim.join(samp)
libraries.append(samp_st)
for l in libraries:
if l[-1] == "T":
outfile.write("%s\tTotal\n"%(l))
else:
outfile.write("%s\tNuclear\n"%(l))
else:
next
outfile.close()
```
run_leafcutter_ds_bychrom_processed.sh
```{bash,eval=F}
#!/bin/bash
#SBATCH --job-name=run_leafcutter_ds_bychrom_processed
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=run_leafcutter_ds_bychrom_processed.out
#SBATCH --error=run_leafcutter_ds_bychrom_processed.err
#SBATCH --partition=bigmem2
#SBATCH --mem=50G
#SBATCH --mail-type=END
module load R
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
do
Rscript /project2/gilad/briana/davidaknowles-leafcutter-c3d9474/scripts/leafcutter_ds.R --num_threads 4 /project2/gilad/briana/threeprimeseq/data/pheno_DiffIso_processed/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant_processed_forLC_${i}.txt /project2/gilad/briana/threeprimeseq/data/pheno_DiffIso_processed/sample_groups.txt -o /project2/gilad/briana/threeprimeseq/data/diff_iso_processed/TN_diff_isoform_chr${i}.txt
done
```
Cat all of the signficance files and bring them to my computer to look at here.
```{r}
diffIso=read.table("../data/diff_iso_proc/TN_diff_isoform_allChrom_clusterSig.txt", header = T,col.names = c("status", "loglr", "df", "p", "cluster", "p.adjust"),stringsAsFactors = F,sep="\t") %>% filter(status == "Success")
diffIso$p.adjust=as.numeric(as.character(diffIso$p.adjust))
qqplot(-log10(runif(nrow(diffIso))), -log10(diffIso$p.adjust),ylab="-log10 Total Adjusted Leafcutter pvalue", xlab="-log 10 Uniform expectation", main="Leafcutter differencial isoform analysis between fractions")
abline(0,1)
```
This is based on successfull results for 9,532 genes with multiple PAS and enough coverage in enough individuals.
Next, I will look at the effect sizes.
```{bash,eval=F}
awk '{if(NR>1)print}' /project2/gilad/briana/threeprimeseq/data/diff_iso_processed/TN_diff_isoform_chr*.txt_effect_sizes.txt > /project2/gilad/briana/threeprimeseq/data/diff_iso_processed/TN_diff_isoform_AllChrom.txt_effect_sizes.txt
```
```{r}
effectsize=read.table("../data/diff_iso_proc/TN_diff_isoform_AllChrom.txt_effect_sizes.txt", stringsAsFactors = F, col.names=c('intron', 'logef' ,'Nuclear', 'Total','deltapsi'))
```
```{r}
effectsize$logef=as.numeric(as.character(effectsize$logef))
plot(sort(effectsize$logef),main="Leafcutter effect Sizes", ylab="Effect size", xlab="Peak Index")
```
```{r}
effectsize$deltapsi=as.numeric(as.character(effectsize$deltapsi))
plot(sort(effectsize$deltapsi),main="Leafcutter delta PSI", ylab="Delta PSI", xlab="Peak Index")
```
I want to spot check some of these in IGV. First I will focus on the |PSI| >.4
```{r}
filterHighPSI=effectsize %>% filter(abs(deltapsi)>.4) %>% arrange(deltapsi)
head(filterHighPSI)
```
Too look at this most effectively I will merge the total and nuclear clean bam files:
mergeBamFiles_byfrac_noMP.sh
```{bash,eval=F}
#!/bin/bash
#SBATCH --job-name=mergeBamFiles_byfrac_noMP.sh
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=mergeBamFiles_byfrac_noMP.out
#SBATCH --error=mergeBamFiles_byfrac_noMP.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
samtools merge /project2/gilad/briana/threeprimeseq/data/mergedBams_NoMP/AllTotalSamples.MergedBamFiles.noMP.bam /project2/gilad/briana/threeprimeseq/data/bam_NoMP_sort/*T*.bam
samtools merge /project2/gilad/briana/threeprimeseq/data/mergedBams_NoMP/AllNuclearSamples.MergedBamFiles.noMP.bam /project2/gilad/briana/threeprimeseq/data/bam_NoMP_sort/*N*.bam
```
SortIndex_mergeBamFiles_byfrac_noMP.sh
```{bash,eval=F}
#!/bin/bash
#SBATCH --job-name=SortIndex_mergeBamFiles_byfrac_noMP.sh
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=SortIndex_mergeBamFiles_byfrac_noMP.out
#SBATCH --error=SortIndex_mergeBamFiles_byfrac_noMP.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
samtools sort /project2/gilad/briana/threeprimeseq/data/mergedBams_NoMP/AllTotalSamples.MergedBamFiles.noMP.bam > /project2/gilad/briana/threeprimeseq/data/mergedBams_NoMP/AllTotalSamples.MergedBamFiles.noMP.sort.bam
samtools index /project2/gilad/briana/threeprimeseq/data/mergedBams_NoMP/AllTotalSamples.MergedBamFiles.noMP.sort.bam
samtools sort /project2/gilad/briana/threeprimeseq/data/mergedBams_NoMP/AllNuclearSamples.MergedBamFiles.noMP.bam > /project2/gilad/briana/threeprimeseq/data/mergedBams_NoMP/AllNuclearSamples.MergedBamFiles.noMP.sort.bam
samtools index /project2/gilad/briana/threeprimeseq/data/mergedBams_NoMP/AllNuclearSamples.MergedBamFiles.noMP.sort.bam
```
Take a look at some of the top hits in IGV:
```{r}
slice(filterHighPSI,1:15)
```
It would be good to know the mean usage scores for these as well:
```{r}
names=c("chr", "start","end","gene","strand", "peak", "meanUsage")
tot_usage=read.table("../data/PeakUsage_noMP/filtered_APApeaks_merged_allchrom_refseqGenes.TranscriptNoMP_sm_quant.Total_fixed.pheno.5percPeaks.txt", stringsAsFactors = F,col.names = names)
nuc_usage=read.table("../data/PeakUsage_noMP/filtered_APApeaks_merged_allchrom_refseqGenes.TranscriptNoMP_sm_quant.Nuclear_fixed.pheno.5percPeaks.txt",stringsAsFactors = F,col.names = names)
```
* RGPD4:
Intronic Peak in the gene 64775-
Total mean usage: 0.1389744
Nuclear mean Usage: 0.7933333
The other peak in this gene is downstream of the annotated gene. 64776
Total mean Usage: 0.8610256
Nuclear mean Usage: 0.2066667
The nuclear has the truncated internal transcript.
* FAM86C1
The most used peak in this gene in the nuclear fraction is peak20610 it is chr11:71409928:71409980. This peak is upstream of the gene. This means the peak is probably not real. However this does not look like a misprimed read.
* SLC24A2
There are 2 peaks for this gene. They are downstream of the annoated in another gene but because they are on the opposite strand from the other gene, it makes sense for them to map to SLC24A2.
This means that without a strand specific protocol it would be difficult to see this effect.
Peak 118136
Total: 0.3012821
Nucelar: 0.8553846
Peak 118140
Total: 0.6987179
Nuclear: 0.1446154
The extended transcript is used in the nuclear.
* GALNTL5
This gene is close to the GALNT11 gene. They are also going in the same direction. The peak here is in the first intron of the GALNT11 gene. The other peak is way upsteream and looks like it is near the end of the PRKAG2-ASI gene chr7:151,570,858-151,585,995.
* NME2
Peak 50663
Total: 0.1615385
Nuclear:0.7335897
The other annotated peak is Peak 50662
Total:0.5564103
Nuclear:0.2407692
50663 is directly upstream of the UTR and 50662 is uspream in one of the introns.
* IFT22
This gene has 3 peaks in the filtered peaks however the one called here is upstream of the gene. The other two make a lot more sense:
109623:
Total:0.1310256
Nuclear:0.63794872
109622:
Total: 0.1866667
Nuclear: 0.08897436
109621:
Total:0.6371795
Nuclear: 0.23717949
From these results it looks like there is still differencial ussage in this gene. Peak 109621 is also significant with a delta PSI of 0.45275927.
* PCOLCE2
This peak is pretty far downstream of the gene but other genes in the region go the opposite way:
peak83037:
Total:0.2151282
Nuclear: 0.7323077
peak83045: This peak is upstream of the UTR for the gene in one of the exons:
Total: 0.7848718
Nuclear: 0.2676923
These results look like the gene has significant run through in the nuclear fraciton.
* HNF1A
The 2 called peaks for this gene are upstream of the gene. Probably not actually correct. It does not look like any peaks in the track would work for this gene.
* WISP1
The peak is very far downstream but not in the wrong direciton for other peaks in the region. All of the peaks for this gene are pretty far downstream. I am not sure how best to know if these are real or not.
* FBXO22
The peak here is downstream of the gene close to the UTR of the next gene but because it goes the other direction, I can be confident in the assignemnt.
peak41217:
Total: 0.16333333
Nuclear: 0.4917949
The other peaks in this gene are peak41206 in both fractions and low usage of peak41207 in total.
peak41206:
Total: 0.58538462
Nuclear: 0.1776923
Again this looks like a case of run on read in the nuclear
* PCDH9
The peak is downsteam of the gene but it is in a region with many LINCs. This peak is 32303. THe other peaks for the gene are 32304 and 32305. Oeak 32305 is upstream meanin it is probably not real.
peak32303:
Total:0.1761538
Nuclear: 0.6843590
Peak 32304:
Total: 0.5946154
Nuclear: 0.2479487
This is another case of run on for the nuclear transcript.
* CCM2L
The peak is downstream in on if the first exons of the HCK gene. It is difficult to know which you should assign the peak to because they are in the same orientation. All three of the called peaks for this gene are downstream in the HCK gene.
* WINT3:
The peak is downstream in the NSF gene but the NSF gene goes the other direction. Both peaks for this gene are downstream in the NSF gene.
Peak 50131
Total: 0.1087179
Nuclear: 0.5658974
Peak 50134
Total: 0.8912821
Nuclear: 0.4084615
Again here we see more run on for the nuclear fraction transcript.
* TRPC3
The peak here is in an intron of the downstreem gene BBS7. It is difficult to know which gene to assign the peak because they go the same direction.
peak89306
total: 0.1207692
nuclear:0.6010256
peak89307
total:0.8535897
nuclear: 0.3989744
Again here, if these peaks truly go with this gene, the nuclear transcript is the run on transcript.
* FHIT:
This peak is in an intron of the gene. This is also the only peak passing 5% in each fraction.
peak 79975:
Total:0.2602564
Nucelar:0.05923077
It is hard to draw further conclusions here.
I now want to look at some intermediate examples that are less likely to be outliers:
```{r}
filterMidPSI=effectsize %>% filter(abs(deltapsi)>.1, abs(deltapsi)<.2) %>% arrange(deltapsi)
slice(filterMidPSI, 1:20)
```
* Tada3
This gene has 2 annotated UTRs. This peak is in the upstream 3'UTR.
Peak 7742:
Total:0.1382051
Nuclear: 0.29974359
The total only has 1 more annoated here: peak77240. This peak is in the downstream 3' UTR
Total:0.7933333
Nucelar:0.47923077
The nuclear has 2 more upstream peaks here:
Peak77243:0.08589744
Peak77245:0.13461538
* DEPDC1B
This gene is close to PDE4D. The sig peak is in this gene. The 1 peak in the 3' UTR of DEPDC1B is peak92690.
*GOT2
The internal peak for this gene is used less than 5% in the total and 22% in the nuclear. These are the only 2 peaks in our filtered set for the gene.
Peak45274:
Total <5%
Nuclear: 0.2202564
Peak45272:
Total:0.9438462
Nuclear : 0.2202564
* OSR1
The 3 peaks for this gene are downstream of the gene with no other peaks in the region.
Peak 60538:
Total:0.3035897
Nucelar: 0.5566667
The total fraction uses the one peak downstream at about the same rate.
* ZNF397
This gene also looks like it has multiple annotated UTRs.
Peak 53787
Total: 0.10461538
Nuclear: 0.22794872
This gene has 5 peaks and they run into the next gene downstream. Peaks 53807 and 53808 should probaby be assigned to ZNF271P but it is hard to tell.
* CDH6
These peaks should probably go to the LNC RNA LINC01021.
* BHLHE41
This doesnt make sense
* WASHC5
This is a good example. peak116684 is in an intron of the gene and peak116678 is at the end in UTR. \
peak116684:
total:0.09589744
nuclear: 0.25256410
peak116678:
total: 0.72538462
nuclear: 0.50384615
The other used peak in nucelar is peak116687 with 0.06641026.
* GPR156
The peaks are downstream in GSK3B.
* LSM14A
chr19:34702824:34702894
peak57649: This peak is in the intron of the gene.
Total: <5%
Nuclear:0.16358974
In the total there are three peaks for this gene at over 5%.
peak57651
total:0.2943590
Nuclear:0.12179487
peak57655
Total:0.1356410
nuclear: 0.05461538
peak57656
total:0.2941026
nuclear: 0.10871795
Peaks 57655 adn 57656 are in the UTR.
This is a good example of internal PAS in the nuclear fraction.
* C5orf34
chr5:43482525:43482619: peak912217. This peak is in the next gene downstream in the same direction. It is hard to tell which gene this goes with.
* AHR:
Peak 106198 This peak is upstream of the gene. Probably not ream
chr7:17260895:17260936
However 106205 is an interesting peak here. It is not used at 5% in total but it used at almost 17% in nuclear.
* ARPC1B
chr7:98993610:98993692
This is peak 109344. It is downstream in PDAP1 but also looks like its still the annoataed UTR for the ARPC1B gene . But this gene is the other direction.
peak 109344:
total:0.1546154
nuclear: 0.27615385
The nuclear has 6 peaks at > 5% for this gene and total only has 2. The most used total is peak109343. But it does not look like that in IGV. It looks like it uses 109344.
* RCSD1
peak 8281
total: 0.08051282
nuclear: 0.21512821
peak8293 is the other used peak in total. It is in the annoated 3' UTR
total:0.67102564
nuclear:0.33769231
There are 2 more peaks also used at moderate levels here.
* OCM Peak is upstream. Not correct
* ZSCAN10 peak is upstream. not correct
* SGCE
peak 190201 In an intron of the gene.
total:0.10589744
nuclear: 0.2587179
The utr peak is 109198
total:0.58205128
nuclear:0.3894872
* FKRB
chr19:47378498:47378583
Looks like there must be a transcribed element here but thereis not an annotated gene. FKRB is pretty far upstream.
* HSPBAP1
peak81962
total:0.26000000
nuclear: 0.41871795
the most used total peak is 81961 it is in the 3' UTR
total: 0.64743590
nuclear:0.49000000
* TMEM40
peak77434. The peaks for this gene are all outside. It is a bit confusing. 77435 is upstream so this is probably not correct
Conclusions:
* Gene to peak annotations are difficult, especially when the genes are in the same direction close to each other
* We need to make sure peaks cannot be upstream of the gene.
* We need to add LINC to the annotation