/
PrematureTermQTL.Rmd
105 lines (66 loc) · 4.63 KB
/
PrematureTermQTL.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
---
title: "Premature Termination QTL"
author: "Briana Mittleman"
date: "7/1/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library(reshape2)
library(workflowr)
library(tidyverse)
```
Many papers have started to talk about premature termination. Premature terminated isoforms may be truncated protein or may be degraded. I am going to create a measure for this and test for genetic variation associated with it in my data. The measure will be sum of the reads in intronic PAS and the sum of the UTR reads. I will use leafcutter to put the ratios onto a normal distribution. I will then test for QTLs these ratios.
```{bash,eval=F}
mkdir ../data/PreTerm_pheno
```
##Prepare phenotype
###Total
gene start and end
```{r}
genes=read.table("/project2/gilad/briana/genome_anotation_data/RefSeq_annotations/FullTranscriptByName.bed", col.names = c("chr", "Gene_start", "Gene_end", "gene", "score", "strand"),stringsAsFactors = F) %>% select(chr,Gene_start, Gene_end, gene)
```
```{r}
totalPAS=read.table("../data/phenotype_5perc/APApeak_Phenotype_GeneLocAnno.Total.5perc.fc.gz",stringsAsFactors = F,header = T)
totalPASPheno=totalPAS %>% melt(id.vars="chrom", variable.name="Ind", value.name = "ratio") %>% separate(ratio, into=c("count", "geneCount"), sep="/") %>% separate(chrom, into=c("chr", "start", "end", "geneID"), sep=":") %>% separate(geneID, into=c("gene", "loc","strand", "PAS"), sep="_") %>% filter(loc=="utr3" | loc=="intron") %>% inner_join(genes,by=c("chr", "gene"))%>% mutate(gene=paste(chr,Gene_start, Gene_end, gene,sep=":")) %>% group_by(Ind,gene,loc) %>% summarise(SumCount=sum(as.integer(count))) %>% ungroup() %>% group_by(Ind,gene) %>% mutate(nType=n()) %>% filter(nType==2) %>% spread(loc, SumCount) %>% mutate(total=intron+utr3,PreTermInt=paste(intron,total, sep="/"),PreTermUTR=paste(utr3,total, sep="/")) %>% select(-nType, -intron,-utr3,-total)
totalPASPheno_melt= totalPASPheno %>% melt(id.vars=c("Ind", "gene"), variable.name="Type", value.name = "Value") %>% mutate(chrom=paste(gene, Type, sep="_")) %>% spread(Ind, Value) %>% select(-gene, -Type)
#write.table(totalPASPheno_melt,"../data/PreTerm_pheno/Total_preterminationPheno.txt",quote=F, row.names=F,col.names=T, sep=" ")
```
```{bash,eval=F}
#python2
gzip ../data/PreTerm_pheno/Total_preterminationPheno.txt
python prepare_phenotype_table.py ../data/PreTerm_pheno/Total_preterminationPheno.txt.gz
#activate env
sh ../data/PreTerm_pheno/Total_preterminationPheno.txt.gz_prepare.sh
#top 2 pcs
head -n 3 ../data/PreTerm_pheno/Total_preterminationPheno.txt.gz.PCs > ../data/PreTerm_pheno/Total_preterminationPheno.txt.gz.2PCs
```
###Nuclear
```{r}
nuclearPAS=read.table("../data/phenotype_5perc/APApeak_Phenotype_GeneLocAnno.Nuclear.5perc.fc.gz",stringsAsFactors = F,header = T)
nuclearPASPheno=nuclearPAS %>% melt(id.vars="chrom", variable.name="Ind", value.name = "ratio") %>% separate(ratio, into=c("count", "geneCount"), sep="/") %>% separate(chrom, into=c("chr", "start", "end", "geneID"), sep=":") %>% separate(geneID, into=c("gene", "loc","strand", "PAS"), sep="_") %>% filter(loc=="utr3" | loc=="intron") %>% inner_join(genes,by=c("chr", "gene"))%>% mutate(gene=paste(chr,Gene_start, Gene_end, gene,sep=":")) %>% group_by(Ind,gene,loc) %>% summarise(SumCount=sum(as.integer(count))) %>% ungroup() %>% group_by(Ind,gene) %>% mutate(nType=n()) %>% filter(nType==2) %>% spread(loc, SumCount) %>% mutate(total=intron+utr3,PreTermInt=paste(intron,total, sep="/"),PreTermUTR=paste(utr3,total, sep="/")) %>% select(-nType, -intron,-utr3,-total)
nuclearPASPheno_melt= nuclearPASPheno %>% melt(id.vars=c("Ind", "gene"), variable.name="Type", value.name = "Value") %>% mutate(chrom=paste(gene, Type, sep="_")) %>% spread(Ind, Value) %>% select(-gene, -Type)
#write.table(nuclearPASPheno_melt,"../data/PreTerm_pheno/Nuclear_preterminationPheno.txt",quote=F, row.names=F,col.names=T, sep=" ")
```
```{bash,eval=F}
#python2
gzip ../data/PreTerm_pheno/Nuclear_preterminationPheno.txt
python prepare_phenotype_table.py ../data/PreTerm_pheno/Nuclear_preterminationPheno.txt.gz
#env
sh ../data/PreTerm_pheno/Nuclear_preterminationPheno.txt.gz_prepare.sh
#top 2 pcs
head -n 3 ../data/PreTerm_pheno/Nuclear_preterminationPheno.txt.gz.PCs > ../data/PreTerm_pheno/Nuclear_preterminationPheno.txt.gz.2PCs
```
##Call QTLs
Sample list from previous work
```{bash,eval=F}
mkdir ../data/PrematureQTLNominal
mkdir ../data/PrematureQTLPermuted
```
```{bash,eval=F}
sbatch PrematureQTLNominal.sh
sbatch PrematureQTLPermuted.sh
```
May want to only test one number per gene but do this for now because I want to take advantage of the leafcutter normalization software.