In [1]:
library(ggplot2)
library(reshape2)
library(dplyr)
library(stringr)
library(tidyr)
theme_set(theme_bw())
options(repr.plot.width=7, repr.plot.height=4)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Attaching package: ‘tidyr’

The following object is masked from ‘package:reshape2’:

    smiths



# Introduction

In [clade-freqs](clade-freqs.ipynb), I annotated a list of identity elements (IDEs) that appear to be conserved in over 95% of tRNAs. Some of these are reflected in the literature; others are not. I expect the well-studied IDEs that are agreed to be universal to hold true. Thus, I need to look into the tRNAs that *don't* have these IDEs.

Do these exceptions function as tRNAs? Using a suite of supposedly gold-standard IDEs, we would expect to be able to differentiate between bona fide tRNAs and tRNA pseudogenes.

I'll get a set of tRNAs that may or may not be missing a key IDE. I'll then proceed in two branches. 
1) IDE rules. We've learned something about which IDEs are required. We now know how to choose canonical tRNAs. Filter based on these IDEs or based on suites of IDEs, regenerate frequencies, rinse and repeat.
2) Interesting exceptions to the rule. Some tRNAs are exceptional. Look deeply into a few examples where they're missing a key IDE. Are any of them functional? Are they missing all of the other IDEs?
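
As a rough sketch of how both branches could start from the same table, the base-R snippet below counts how many required IDEs each tRNA is missing. The toy table `trnas` and the rule set `rules` (A14, G18, G19) are assumptions made for illustration only; they are not the final rules or the real data.

```r
# Hypothetical toy table: one row per tRNA, one column per position (not real data).
trnas <- data.frame(
  id  = c("t1", "t2", "t3", "t4"),
  X14 = c("A", "A", "G", "A"),
  X18 = c("G", "G", "G", "A"),
  X19 = c("G", "G", "G", "G"),
  stringsAsFactors = FALSE
)

# Assumed rule set, for illustration only.
rules <- c(X14 = "A", X18 = "G", X19 = "G")

# For each tRNA, count the positions that disagree with the required base.
missing <- rowSums(sapply(names(rules), function(p) trnas[[p]] != rules[[p]]))

canonical  <- trnas$id[missing == 0]  # branch 1: filter and regenerate frequencies
exceptions <- trnas$id[missing > 0]   # branch 2: inspect individually
print(canonical)
print(exceptions)
```

Counting misses, rather than filtering outright, keeps the option of asking whether an exceptional tRNA lacks one IDE or all of them.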
 
Branch 2 is easier to tackle first, since it only requires isolating the exceptional tRNAs. First, we'll recreate the frequency table.

# Data wrangling

## Import alignment and bases


In [2]:
identities = read.delim('identities.tsv', sep='\t', stringsAsFactors=FALSE)
identities$quality = as.logical(identities$quality)
identities$restrict = as.logical(identities$restrict)

In [3]:
positions = colnames(identities)[which(str_detect(colnames(identities), "X\\d+\\.\\d+$"))]
positions = c(positions, 'X8', 'X9', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X20a', 'X21', 'X26', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X44', 'X45', 'X46', 'X47', 'X48', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X73')
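
To illustrate what the regular expression in the cell above selects, here is a small base-R sketch on made-up column names. The names in `cols` are illustrative and may not match the real header of `identities.tsv`. Paired positions such as `X3.70` match, while single positions and metadata columns do not, which is why the single positions are appended by hand.

```r
# Hypothetical column names; the real identities.tsv header may differ.
cols <- c("clade", "isotype", "anticodon", "X1.72", "X3.70", "X10.25",
          "X8", "X20a", "X73")

# Same pattern as above: "X", digits, a literal dot, digits, end of string.
is_pair <- grepl("X\\d+\\.\\d+$", cols, perl = TRUE)
paired_positions <- cols[is_pair]
print(paired_positions)
```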

## Get frequencies

In [9]:
clade_iso_ac_freqs = identities %>%
  select(match(c('clade', 'isotype', 'anticodon', positions), colnames(identities))) %>%
  gather(positions, bases, -clade, -isotype, -anticodon) %>%
  group_by(clade, isotype, anticodon, positions, bases) %>%
  tally() %>%
  group_by(clade, isotype, anticodon, positions) %>%
  mutate(freq=n) %>% # raw count of each base (or base pair) within the group
  summarize(A = sum(freq[bases == "A"]),
            C = sum(freq[bases == "C"]),
            G = sum(freq[bases == "G"]),
            U = sum(freq[bases == "U"]),
            Deletion = sum(freq[bases %in% c("-", ".")]), 
            Purine = sum(freq[bases %in% c("A", "G")]),
            Pyrimidine = sum(freq[bases %in% c("C", "U")]),
            Weak = sum(freq[bases %in% c("A", "U")]),
            Strong = sum(freq[bases %in% c("G", "C")]),
            Amino = sum(freq[bases %in% c("A", "C")]),
            Keto = sum(freq[bases %in% c("G", "U")]),
            B = sum(freq[bases %in% c("C", "G", "U")]),
            D = sum(freq[bases %in% c("A", "G", "U")]),
            H = sum(freq[bases %in% c("A", "C", "U")]),
            V = sum(freq[bases %in% c("A", "C", "G")]),
            GC = sum(freq[bases == "G:C"]),
            AU = sum(freq[bases == "A:U"]),
            UA = sum(freq[bases == "U:A"]),
            CG = sum(freq[bases == "C:G"]),
            GU = sum(freq[bases == "G:U"]),
            UG = sum(freq[bases == "U:G"]),
            PairDeletion = sum(freq[bases == "-:-"]), 
            PurinePyrimidine = sum(freq[bases %in% c("A:U", "G:C")]),
            PyrimidinePurine = sum(freq[bases %in% c("U:A", "C:G")]),
            StrongPair = sum(freq[bases %in% c("G:C", "C:G")]),
            WeakPair = sum(freq[bases %in% c("A:U", "U:A")]),
            Wobble = sum(freq[bases %in% c("G:U", "U:G")]),
            Paired = sum(freq[bases %in% c("A:U", "U:A", "C:G", "G:C", "G:U", "U:G")]),
            Bulge = sum(freq[bases %in% c("A:-", "U:-", "C:-", "G:-", "-:A", "-:G", "-:C", "-:U")]),
            Mismatched = sum(freq[bases %in% c("A:A", "G:G", "C:C", "U:U", "A:G", "A:C", "C:A", "C:U", "G:A", "U:C")])
            ) %>%
  mutate(total = A + B + Deletion + Paired + Mismatched + Bulge + PairDeletion) %>%
  melt(id.vars=c("clade", "isotype", "anticodon", "positions", "total")) %>%
  mutate(freq=value/total)
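
The summary above assigns each base call to several overlapping classes, then divides by a total built from classes that do not overlap. The following base-R sketch shows that logic for a single unpaired position, using invented base calls rather than real data.

```r
# Toy base calls at one unpaired position (illustrative, not real data).
bases <- c("A", "A", "G", "U", "C", "-", "A", "G")

counts <- c(
  A        = sum(bases == "A"),
  Purine   = sum(bases %in% c("A", "G")),
  B        = sum(bases %in% c("C", "G", "U")),  # IUPAC "not A"
  Deletion = sum(bases %in% c("-", "."))
)

# A, B and Deletion are mutually exclusive and cover every single-base call,
# so their sum recovers the number of sequences. Purine overlaps A and B,
# which is why it is left out of the total.
total <- counts[["A"]] + counts[["B"]] + counts[["Deletion"]]
freqs <- counts / total
print(round(freqs, 3))
```

Because the classes overlap, the frequencies at a position do not sum to 1; each one should be read on its own.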

In [11]:
clade_iso_freqs = clade_iso_ac_freqs %>%
  group_by(positions, isotype, variable) %>%
  summarize(count=sum(value), freq=sum(value)/sum(total))

euk_freqs = clade_iso_ac_freqs %>%
  group_by(positions, variable) %>%
  summarize(count=sum(value), freq=sum(value)/sum(total))

consensus = euk_freqs %>%
  filter(freq > 0.95) %>%
  group_by(positions) %>% # keep one feature per position: the lowest frequency above 0.95
  filter(row_number(freq) == 1) %>%
  arrange(positions)
consensus

Unnamed: 0,positions,variable,count,freq
1,X10.25,Paired,108575,0.9849145
2,X14,A,109604,0.9942488
3,X15,Purine,109916,0.997079
4,X18,G,109042,0.9891507
5,X18.55,GU,107878,0.9785918
6,X19,G,108902,0.9878808
7,X19.56,Paired,108429,0.983599
8,X20,H,106812,0.9689218
9,X21,A,108344,0.982819
10,X26,D,107831,0.9781654
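
Since `row_number(freq)` ranks in ascending order, the consensus step keeps, for each position, the feature with the lowest frequency that still clears the 95% cutoff, which tends to be the most specific one. Below is a base-R sketch of the same selection on invented numbers; the `toy` table is hypothetical.

```r
# Toy frequencies for two positions (illustrative values only).
toy <- data.frame(
  positions = c("X14", "X14", "X14", "X15", "X15"),
  variable  = c("A", "Purine", "D", "Purine", "G"),
  freq      = c(0.994, 0.997, 0.999, 0.997, 0.600),
  stringsAsFactors = FALSE
)

# Keep features above the cutoff.
above <- toy[toy$freq > 0.95, ]

# Sort by position, then ascending frequency, and keep the first row per position.
above <- above[order(above$positions, above$freq), ]
consensus_toy <- above[!duplicated(above$positions), ]
print(consensus_toy)
```

Here `X14` resolves to `A` rather than the broader `Purine` or `D`, and `X15` resolves to `Purine` because `G` falls below the cutoff.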


# Analysis of eukaryotic all-tRNA consensus identity elements

## 3:70

M&G: Even numbers across all pairs, 9 mismatches. G3-U70 unique to Ala. A few other isotypes have single exceptions. Antideterminant for Thr. C3-G70 positive for iMet. Dependent on 2-71 context.

First, let's check the iMet frequencies. 

In [12]:
head(clade_iso_freqs)

Unnamed: 0,positions,isotype,variable,count,freq
1,X10.25,Ala,A,0,0
2,X10.25,Ala,C,0,0
3,X10.25,Ala,G,0,0
4,X10.25,Ala,U,0,0
5,X10.25,Ala,Deletion,0,0
6,X10.25,Ala,Purine,0,0


In [13]:
clade_iso_ac_freqs %>% filter(positions == 'X3.70' & isotype == "iMet") %>%
  group_by(positions, isotype, variable) %>%
  summarize(count=sum(value), freq=sum(value)/sum(total)) %>%
  filter(freq > 0.05)

Unnamed: 0,positions,isotype,variable,count,freq
1,X3.70,iMet,CG,1195,0.9991639
2,X3.70,iMet,PyrimidinePurine,1195,0.9991639
3,X3.70,iMet,StrongPair,1195,0.9991639
4,X3.70,iMet,Paired,1196,1.0


M&G's frequencies with iMet are confirmed. This is a pretty strong determinant for initiator methionine.

For alanine, previous work (e.g. [Chihade et al. 1998](http://pubs.acs.org/doi/pdf/10.1021/bi9804636)) shows that G3-U70 is a strong determinant in *C. elegans*. M&G do find that a few other tRNAs also contain G3-U70. [Beuning et al. 2002](http://rnajournal.cshlp.org/content/8/5/659.full.pdf) also show that the orientation of a 2:71 purine:pyrimidine pair is helpful for charging. Let's see if G3-U70 is specifically enriched in alanine.

In [24]:
clade_iso_ac_freqs %>% filter(positions == 'X3.70' & variable == "GU") %>%
  group_by(positions, isotype, variable) %>%
  summarize(count=sum(value), freq=sum(value)/sum(total)) %>%
  filter(freq > 0.01)

clade_iso_ac_freqs %>% filter(positions == 'X3.70' & isotype == 'Ala' & variable == "GU") %>%
  mutate(isNGC=(str_detect(anticodon, "[AGCT]GC"))) %>%
  group_by(isNGC) %>%
  summarize(count=sum(value)) %>%
  mutate(freq=count/sum(count)) %>% # share of all Ala G3-U70 tRNAs, without a hard-coded total
  filter(freq > 0.001)

clade_iso_ac_freqs %>% filter(positions == 'X3.70' & isotype == 'Ala') %>%
  summarize(sum(value))

Unnamed: 0,positions,isotype,variable,count,freq
1,X3.70,Ala,GU,6658,0.2527043
2,X3.70,Cys,GU,27,0.01237964
3,X3.70,Gly,GU,62,0.0118865


Unnamed: 0,isNGC,count,freq
1,False,590,0.0886152
2,True,6068,0.9113848


Unnamed: 0,sum(value)
1,87549


This basically confirms M&G's (non-)conclusions: GU is enriched for Ala, though no reciprocal relationship exists, except for Ala-NGC.

## U8-A14

This is known to be extremely conserved, since it stabilizes the tertiary structure. M&G found that a variety of bacteria and archaea contain a C8 variation. Our data fit the eukaryotic side of things at 97%.

## R9 and 9:23

M&G: mostly a purine here. Interacts with base 23 in class I tRNAs. 

Our data support this and go a step further for class II tRNAs, where position 9 is a G. The 9-23 interaction is not restricted to any particular base combination, which agrees with my previous [tertiary interactions analysis](../tertiary-interactions.ipynb). As for fungi, where it was first identified, it's possible that this inte

In [47]:
clade_iso_ac_freqs %>% filter(!(isotype %in% c("Ser", "Leu")) & positions %in% c('X9', 'X9.23')) %>%
  group_by(positions, variable) %>%
  summarize(count=sum(value), freq=sum(value)/sum(total)) %>%
  filter(freq > 0.9)

clade_iso_ac_freqs %>% filter(isotype %in% c("Ser", "Leu") & positions %in% c('X9', 'X9.23')) %>%
  group_by(positions, variable) %>%
  summarize(count=sum(value), freq=sum(value)/sum(total)) %>%
  filter(freq > 0.9)

clade_iso_ac_freqs %>% filter(positions == 'X9.23') %>%
  group_by(positions, clade, variable) %>%
  summarize(count=sum(value), freq=sum(value)/sum(total)) %>%
  filter(freq > 0.1)

Unnamed: 0,positions,variable,count,freq
1,X9,Purine,99885,0.9863626
2,X9,D,100207,0.9895424
3,X9,V,100875,0.9961389
4,X9.23,Mismatched,95677,0.9448087


Unnamed: 0,positions,variable,count,freq
1,X9,G,8929,0.9952073
2,X9,Purine,8957,0.9983281
3,X9,Strong,8934,0.9957646
4,X9,Keto,8938,0.9962104
5,X9,B,8943,0.9967677
6,X9,D,8966,0.9993313
7,X9,V,8962,0.9988854
8,X9.23,Mismatched,8647,0.9637762


Unnamed: 0,positions,clade,variable,count,freq
1,X9.23,Fungi,GU,1173,0.1026606
2,X9.23,Fungi,Wobble,1173,0.1026606
3,X9.23,Fungi,Paired,1683,0.1472956
4,X9.23,Fungi,Mismatched,9743,0.8527044
5,X9.23,Insecta,Mismatched,901,0.9019019
6,X9.23,Mammalia,Mismatched,50648,0.9666571
7,X9.23,Nematoda,Mismatched,4615,0.9380081
8,X9.23,Spermatophyta,Paired,210,0.1171875
9,X9.23,Spermatophyta,Mismatched,1582,0.8828125
10,X9.23,Streptophyta,Mismatched,4975,0.9012681


## 10:25

M&G: 10/41 GC, 31/41 GU, positive determinant for yeast Asp, negative determinant for yeast M22G on 26, interacts with 45.

The pair itself is as conserved as expected (about 98% paired). In our data, though, GC is the more common pair ($\approx$ 76%, versus $\approx$ 22% for GU), the reverse of M&G's proportions.

In [79]:
euk_freqs %>% filter(positions == 'X10.25' & freq > 0.05)

Unnamed: 0,positions,variable,count,freq
1,X10.25,GC,83468,0.7571618
2,X10.25,GU,23788,0.2157877
3,X10.25,PurinePyrimidine,83984,0.7618426
4,X10.25,StrongPair,83774,0.7599376
5,X10.25,Wobble,23790,0.2158058
6,X10.25,Paired,108575,0.9849145


## Non-consensus identity elements

There are plenty of examples where our frequencies confirm known rules, supplant known rules, or indicate new rules. There are also plenty of rules that weren't recapitulated above, and those are worth looking into individually.

## C1:G72

# No. tRNAs by missing IDEs