In [1]:
library(ggplot2)
library(reshape2)
library(dplyr)
library(stringr)
library(tidyr)
theme_set(theme_bw())
options(repr.plot.width=7, repr.plot.height=4)



# Introduction

In [clade-freqs](clade-freqs.ipynb), I annotated a list of identity elements (IDEs) that appear to be conserved in over 95% of tRNAs. Some of these are reflected in the literature; others are not. I expect the well-studied IDEs that are agreed to be universal to hold true. Thus, I need to look into the tRNAs that *don't* have these IDEs.

Do these exceptions function as tRNAs? Using a suite of supposedly gold-standard IDEs, we should be able to differentiate between bona fide tRNAs and tRNA pseudogenes.

I'll get a set of tRNAs that may or may not be missing a key IDE. I'll then proceed in two branches. 
1) IDE rules. We've learned something about which IDEs are required. We now know how to choose canonical tRNAs. Filter based on these IDEs or based on suites of IDEs, regenerate frequencies, rinse and repeat.
2) Interesting exceptions to the rule. Some tRNAs are exceptional. Look deeply into a few examples where they're missing a key IDE. Are any of them functional? Are they missing all of the other IDEs?
 
Branch 2 is easier to tackle first, since it only requires isolating the tRNAs in question. First, we'll recreate the frequency table.
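Isolating the exceptions boils down to a filter: given a consensus IDE, keep the tRNAs whose base at that position is anything else. A minimal base-R sketch, assuming a toy table laid out like `identities.tsv` (one row per tRNA, one column per position). The `seqname` column, the `ide_rules` list, and every value in the table are made up for illustration.

```r
# Toy stand-in for the identities table; all values are invented.
toy = data.frame(
  seqname = c("tRNA-1", "tRNA-2", "tRNA-3", "tRNA-4"),  # hypothetical ID column
  isotype = c("Ala", "Ala", "Gly", "Gly"),
  X14 = c("A", "A", "G", "A"),
  X18 = c("G", "-", "G", "G"),
  stringsAsFactors = FALSE
)

# Hypothetical rule set: for each position, the bases that satisfy the IDE.
ide_rules = list(X14 = "A", X18 = "G")

# For each rule, flag tRNAs whose base falls outside the allowed set.
# The result is a logical matrix: one row per tRNA, one column per rule.
missing_ide = sapply(names(ide_rules),
                     function(pos) !(toy[[pos]] %in% ide_rules[[pos]]))

# Exceptions: tRNAs missing at least one IDE, with a count of how many.
toy$n_missing = rowSums(missing_ide)
exceptions = toy[toy$n_missing > 0, ]
print(exceptions)
```

Keeping the per-rule flags as a matrix makes it easy to ask the follow-up question as well: whether a tRNA that lacks one IDE also lacks the others.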

# Data wrangling

## Import alignment and bases


In [47]:
identities = read.delim('identities.tsv', sep='\t', stringsAsFactors=FALSE)
identities$quality = as.logical(identities$quality)
identities$restrict = as.logical(identities$restrict)
positions = colnames(identities)[which(str_detect(colnames(identities), "X\\d+\\.\\d+$"))]
positions = c(positions, 'X8', 'X9', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X20a', 'X21', 'X26', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X44', 'X45', 'X46', 'X47', 'X48', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X73')
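The regular expression only picks out base-pair columns (names like `X10.25`), which is why the unpaired positions are appended by hand. A quick check on a few column names of the kind used in this notebook, with base `grepl` standing in for `str_detect` (an assumption that the two behave the same for this simple pattern):

```r
# Example column names; the list is illustrative, not the full header.
cols = c("clade", "isotype", "X3.70", "X10.25", "X8", "X20a", "X73")

# Same pattern as above: "X", digits, a literal dot, digits, end of string.
is_pair = grepl("X\\d+\\.\\d+$", cols, perl = TRUE)
paired_positions = cols[is_pair]
print(paired_positions)
```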

## Get frequencies

In [48]:
euk_freqs = identities %>% select(match(positions, colnames(identities))) %>%
  gather(positions, bases) %>%
  group_by(positions, bases) %>%
  tally() %>%
  mutate(freq=n) %>%          # raw count of each base at each position
  group_by(positions) %>%
  summarize(A = sum(freq[bases == "A"]),
            C = sum(freq[bases == "C"]),
            G = sum(freq[bases == "G"]),
            U = sum(freq[bases == "U"]),
            Gap = sum(freq[bases == "-"]), 
            Purine = sum(freq[bases %in% c("A", "G")]),
            Pyrimidine = sum(freq[bases %in% c("C", "U")]),
            Weak = sum(freq[bases %in% c("A", "U")]),
            Strong = sum(freq[bases %in% c("G", "C")]),
            Amino = sum(freq[bases %in% c("A", "C")]),
            Keto = sum(freq[bases %in% c("G", "U")]),
            # IUPAC codes: B = not A, D = not C, H = not G, V = not U
            B = sum(freq[bases %in% c("C", "G", "U")]),
            D = sum(freq[bases %in% c("A", "G", "U")]),
            H = sum(freq[bases %in% c("A", "C", "U")]),
            V = sum(freq[bases %in% c("A", "C", "G")]),
            GC = sum(freq[bases == "G:C"]),
            AU = sum(freq[bases == "A:U"]),
            UA = sum(freq[bases == "U:A"]),
            CG = sum(freq[bases == "C:G"]),
            GU = sum(freq[bases == "G:U"]),
            UG = sum(freq[bases == "U:G"]),
            DoubleGap = sum(freq[bases == "-:-"]), 
            PurinePyrimidine = sum(freq[bases %in% c("A:U", "G:C")]),
            PyrimidinePurine = sum(freq[bases %in% c("U:A", "C:G")]),
            StrongPair = sum(freq[bases %in% c("G:C", "C:G")]),
            WeakPair = sum(freq[bases %in% c("A:U", "U:A")]),
            Wobble = sum(freq[bases %in% c("G:U", "U:G")]),
            Paired = sum(freq[bases %in% c("A:U", "U:A", "C:G", "G:C", "G:U", "U:G")]),
            Bulge = sum(freq[bases %in% c("A:-", "U:-", "C:-", "G:-", "-:A", "-:G", "-:C", "-:U")]),
            Mismatched = sum(freq[bases %in% c("A:A", "G:G", "C:C", "U:U", "A:G", "A:C", "C:A", "C:U", "G:A", "U:C")])
             ) %>%
  melt %>%
  mutate(freq=value/110238)  # denominator: hard-coded total number of tRNAs

Using positions as id variables


In [73]:
clade_iso_freqs = identities %>%
  select(match(c('clade', 'isotype', positions), colnames(identities))) %>%
  gather(positions, bases, -clade, -isotype) %>%
  group_by(clade, isotype, positions, bases) %>%
  tally() %>%
  mutate(freq=n) %>%          # raw count of each base per clade, isotype, and position
  group_by(clade, isotype, positions) %>%
  summarize(A = sum(freq[bases == "A"]),
            C = sum(freq[bases == "C"]),
            G = sum(freq[bases == "G"]),
            U = sum(freq[bases == "U"]),
            Deletion = sum(freq[bases %in% c("-", ".")]), 
            Purine = sum(freq[bases %in% c("A", "G")]),
            Pyrimidine = sum(freq[bases %in% c("C", "U")]),
            Weak = sum(freq[bases %in% c("A", "U")]),
            Strong = sum(freq[bases %in% c("G", "C")]),
            Amino = sum(freq[bases %in% c("A", "C")]),
            Keto = sum(freq[bases %in% c("G", "U")]),
            # IUPAC codes: B = not A, D = not C, H = not G, V = not U
            B = sum(freq[bases %in% c("C", "G", "U")]),
            D = sum(freq[bases %in% c("A", "G", "U")]),
            H = sum(freq[bases %in% c("A", "C", "U")]),
            V = sum(freq[bases %in% c("A", "C", "G")]),
            GC = sum(freq[bases == "G:C"]),
            AU = sum(freq[bases == "A:U"]),
            UA = sum(freq[bases == "U:A"]),
            CG = sum(freq[bases == "C:G"]),
            GU = sum(freq[bases == "G:U"]),
            UG = sum(freq[bases == "U:G"]),
            PairDeletion = sum(freq[bases == "-:-"]), 
            PurinePyrimidine = sum(freq[bases %in% c("A:U", "G:C")]),
            PyrimidinePurine = sum(freq[bases %in% c("U:A", "C:G")]),
            StrongPair = sum(freq[bases %in% c("G:C", "C:G")]),
            WeakPair = sum(freq[bases %in% c("A:U", "U:A")]),
            Wobble = sum(freq[bases %in% c("G:U", "U:G")]),
            Paired = sum(freq[bases %in% c("A:U", "U:A", "C:G", "G:C", "G:U", "U:G")]),
            Bulge = sum(freq[bases %in% c("A:-", "U:-", "C:-", "G:-", "-:A", "-:G", "-:C", "-:U")]),
            Mismatched = sum(freq[bases %in% c("A:A", "G:G", "C:C", "U:U", "A:G", "A:C", "C:A", "C:U", "G:A", "U:C")])
            ) %>%
  mutate(total = A + B + Deletion + Paired + Mismatched + Bulge + PairDeletion) %>%
  melt(id.vars=c("clade", "isotype", "positions", "total")) %>%
  mutate(freq=value/total)
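The `total` column works as a denominator because its terms don't overlap: `A` and `B` (C, G, or U) cover the single bases, `Deletion` covers gaps, and `Paired`, `Mismatched`, `Bulge`, and `PairDeletion` cover the pair states, so each of those observations is counted once. A small base-R check on invented observations for one unpaired and one paired position (the helper names `count_in` and `total_of` are hypothetical):

```r
# Invented observations for one single-base and one paired position.
single = c("A", "A", "G", "U", "-", "C")
paired = c("G:C", "G:C", "G:U", "A:C", "G:-", "-:-")

# Number of observations that fall in a given set of states.
count_in = function(x, set) sum(x %in% set)

# Rebuild `total` from the same non-overlapping categories used above.
total_of = function(x) {
  A = count_in(x, "A")
  B = count_in(x, c("C", "G", "U"))
  Deletion = count_in(x, c("-", "."))
  Paired = count_in(x, c("A:U", "U:A", "C:G", "G:C", "G:U", "U:G"))
  Mismatched = count_in(x, c("A:A", "G:G", "C:C", "U:U", "A:G",
                             "A:C", "C:A", "C:U", "G:A", "U:C"))
  Bulge = count_in(x, c("A:-", "U:-", "C:-", "G:-",
                        "-:A", "-:G", "-:C", "-:U"))
  PairDeletion = count_in(x, "-:-")
  A + B + Deletion + Paired + Mismatched + Bulge + PairDeletion
}

total_single = total_of(single)
total_paired = total_of(paired)

# Frequency of G:C at the paired position, as in freq = value / total.
gc_freq = count_in(paired, "G:C") / total_paired
```

Any state outside these categories (an ambiguous base, for instance) would be left out of `total`, so the frequencies are relative to the calls that were classified.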

In [75]:
# pool counts over all clades and isotypes
euk_freqs = clade_iso_freqs %>% group_by(positions, variable) %>%
  summarize(count=sum(value), freq=sum(value)/sum(total))
consensus = euk_freqs %>%
  filter(freq > 0.95) %>%
  group_by(positions) %>% # one feature per position: keep the lowest passing frequency
  filter(row_number(freq) == 1) %>%
  arrange(positions)
consensus

Unnamed: 0,positions,variable,count,freq
1,X10.25,Paired,108575,0.9849145
2,X14,A,109604,0.9942488
3,X15,Purine,109916,0.997079
4,X18,G,109042,0.9891507
5,X18.55,GU,107878,0.9785918
6,X19,G,108902,0.9878808
7,X19.56,Paired,108429,0.983599
8,X20,H,106812,0.9689218
9,X21,A,108344,0.982819
10,X26,D,107831,0.9781654
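Several features usually clear the 95% cutoff at the same position: if `A` does, so do `Purine` and `D`, since they include it. `row_number(freq) == 1` keeps the one with the lowest frequency, which tends to be the narrowest feature. A base-R sketch of the same selection on invented numbers (the `feats` table and `toy_consensus` are illustrative names, not part of the pipeline):

```r
# Invented frequencies for two positions; several features pass at each.
feats = data.frame(
  positions = c("X14", "X14", "X14", "X15", "X15"),
  variable  = c("A", "Purine", "D", "Purine", "G"),
  freq      = c(0.994, 0.998, 0.999, 0.997, 0.610),
  stringsAsFactors = FALSE
)

# Keep only features above the cutoff.
passing = feats[feats$freq > 0.95, ]

# Within each position keep the row with the smallest passing frequency,
# mirroring filter(row_number(freq) == 1) on data grouped by position.
keep = unlist(lapply(split(seq_len(nrow(passing)), passing$positions),
                     function(idx) idx[which.min(passing$freq[idx])]))
toy_consensus = passing[keep, ]
print(toy_consensus)
```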


# Analysis of eukaryotic all-tRNA consensus identity elements

## 3-70

M&G: Even numbers across all pairs, 9 mismatches. G3-U70 unique to Ala. A few other isotypes have single exceptions. Antideterminant for Thr. C3-G70 positive for iMet. Dependent on 2-71 context.

In [49]:
head(clade_iso_freqs)

Unnamed: 0,clade,isotype,positions,total,variable,value,freq
1,Fungi,Ala,X10.25,778,A,0,0.0
2,Fungi,Ala,X10.25.45,0,A,0,
3,Fungi,Ala,X11.24,778,A,0,0.0
4,Fungi,Ala,X12.23,778,A,0,0.0
5,Fungi,Ala,X13.22,778,A,0,0.0
6,Fungi,Ala,X13.22.46,0,A,0,


In [None]:
clade_iso_freqs %>% filter(positions == 'X3.70' & isotype == "Ala") %>%
  group_by(isotype, positions, variable) %>%
  # pool counts across clades before dividing
  summarize(count=sum(value), freq=sum(value)/sum(total)) %>% filter(freq>0.1)
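Pooling across clades means summing counts before dividing, so large clades weigh more than small ones; averaging the per-clade frequencies would give a different number. A sketch with invented counts for a single isotype, position, and feature (the clade names and counts are placeholders):

```r
# Invented per-clade counts for one isotype/position/variable.
per_clade = data.frame(
  clade = c("CladeA", "CladeB"),   # placeholder clade names
  value = c(90, 5),                # tRNAs showing the feature
  total = c(100, 10)               # tRNAs with any call at the position
)

# Pooled frequency: sum the counts first, then divide.
pooled_freq = sum(per_clade$value) / sum(per_clade$total)

# Unweighted mean of per-clade frequencies, for comparison.
mean_freq = mean(per_clade$value / per_clade$total)
```

Here the pooled value is 95/110 while the unweighted mean is 0.7, so the choice matters whenever clade sizes are very uneven.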

## 10-25

M&G: 10/41 GC, 31/41 GU, positive determinant for yeast Asp, negative determinant for yeast M22G on 26, interacts with 45.

This is pretty par for the course. In our data, though, GC is the more common of the two ($\approx$ 76%, versus $\approx$ 22% for GU).

In [18]:
euk_freqs %>% filter(positions == 'X10.25' & freq > 0.05)

Unnamed: 0,positions,variable,value,freq
1,X10.25,GC,83468,0.7571618
2,X10.25,GU,23788,0.2157877
3,X10.25,PurinePyrimidine,83984,0.7618426
4,X10.25,StrongPair,83774,0.7599376
5,X10.25,Wobble,23790,0.2158058
6,X10.25,Paired,108575,0.9849145


# No. tRNAs by missing IDEs