In [39]:
library(ggplot2)
library(reshape2)
library(RColorBrewer)
suppressMessages(library(dplyr))
library(stringr)
suppressMessages(library(tidyr))
theme_set(theme_bw())
library(scales)
options(repr.plot.width=7, repr.plot.height=4)
isotypes = c('Ala', 'Arg', 'Asn', 'Asp', 'Cys', 'Gln', 'Glu', 'Gly', 'His', 'Ile', 'iMet', 'Leu', 'Lys', 'Met', 'Phe', 'Pro', 'Ser', 'Thr', 'Trp', 'Tyr', 'Val')
single_positions = c('X8'='8', 'X9'='9', 'X14'='14', 'X15'='15', 'X16'='16', 'X17'='17', 'X17a'='17a', 'X17b'='17b', 'X18'='18', 'X19'='19', 'X20'='20', 'X20a'='20a', 'X20b'='20b', 'X21'='21', 'X26'='26', 'X32'='32', 'X33'='33', 'X34'='34', 'X35'='35', 'X36'='36', 'X37'='37', 'X38'='38', 'X44'='44', 'X45'='45', 'X46'='46', 'X47'='47', 'X48'='48', 'X54'='54', 'X55'='55', 'X56'='56', 'X57'='57', 'X58'='58', 'X59'='59', 'X60'='60', 'X73'='73')
paired_positions = c('X1.72'='1:72', 'X2.71'='2:71', 'X3.70'='3:70', 'X4.69'='4:69', 'X5.68'='5:68', 'X6.67'='6:67', 'X7.66'='7:66', 'X8.14'='*8:14', 'X10.25'='10:25', 'X11.24'='11:24', 'X12.23'='12:23', 'X13.22'='13:22', 'X15.48'='*15:48','X18.55'='*18:55', 'X19.56'='*19:56', 'X27.43'='27:43', 'X28.42'='28:42', 'X29.41'='29:41', 'X30.40'='30:40', 'X31.39'='31:39', 'X49.65'='49:65', 'X50.64'='50:64', 'X51.63'='51:63', 'X52.62'='52:62', 'X53.61'='53:61', 'X54.58'='*54:58')
fills = c('A'='#ffd92f', 'C'='#4daf4a', 'G'='#e41a1c', 'U'='#377eb8', 'A:U'='#93da69', 'U:A'='#93da69', 'G:C'='#c1764a', 'C:G'='#c1764a', 'G:U'='#b26cbd', 'U:G'='#b26cbd', '-'='gray60', '-:-'='gray60')
suppressWarnings(suppressMessages(library(Biostrings)))

In [3]:
paired_positions = c('X1.72'='1:72', 'X2.71'='2:71', 'X3.70'='3:70', 'X4.69'='4:69', 'X5.68'='5:68', 'X6.67'='6:67', 'X7.66'='7:66', 'X8.14'='*8:14', 'X10.25'='10:25', 'X11.24'='11:24', 'X12.23'='12:23', 'X13.22'='13:22', 'X15.48'='*15:48','X18.55'='*18:55', 'X19.56'='*19:56', 'X27.43'='27:43', 'X28.42'='28:42', 'X29.41'='29:41', 'X30.40'='30:40', 'X31.39'='31:39', 'X49.65'='49:65', 'X50.64'='50:64', 'X51.63'='51:63', 'X52.62'='52:62', 'X53.61'='53:61', 'X54.58'='*54:58')

In [4]:
load('best-freqs.RData')
load('clade-isotype-specific.RData')
load('isotype-specific.RData')
load('consensus-IDEs.RData')
load('clade-isotype-specific-freqs.RData')

In [5]:
identities = read.delim('identities.tsv', sep='\t')
identities$quality = as.logical(identities$quality)
identities$restrict = as.logical(identities$restrict)
identities = identities %>% mutate(quality=quality & !restrict)

In [6]:
genome_table = read.delim('genome_table+.txt', sep='\t', stringsAsFactors=FALSE, header=FALSE, col.names=c("species_short", "species", "species_long", "domain", "clade"))

# Introduction

Previous analysis in `euk-tRNAs` and `isotype-clade-specific` focused on processing and dicing the data with appropriate assumptions. Now, we move on to the meat of this project. For **new identity stories**, I'll employ a few strategies to find interesting stories in a systematic way. Some of these will have been covered in `first-pass-consensus`.

1. Find rejected universal sequence features. Set a cutoff at 95% presence, but fails (a) isotype check, (b) clade check, or (c) species check. Spin in previous work, if any. Bonus points if previous work _conflicts_ with our frequencies. I will select one of each for inclusion in the paper.
    
2. Find rejected isotype-specific sequence features. Slightly different from (1a) above, since this could apply to highly isotype-specific positions (e.g. N73). For the paper, find one example (choose between species or clade checks).

Per Todd's advice, I'll take a look at a few in each category.

# Strategy 1: Universal features

### Find targets

In [7]:
code_groups = c('A'=1, 'C'=1, 'G'=1, 'U'=1, 'Absent'=1, 
                'Purine'=2, 'Pyrimidine'=2,
                'Weak'=3, 'Strong'=3, 'Amino'=3, 'Keto'=3,
                'B'=4, 'D'=4, 'H'=4, 'V'=4,
                'GC'=1, 'AU'=1, 'UA'=1, 'CG'=1, 'GU'=1, 'UG'=1,
                'StrongPair'=2, 'WeakPair'=2, 'Wobble'=2,
                'PurinePyrimidine'=3, 'PyrimidinePurine'=3, 'AminoKeto'=3, 'KetoAmino'=3,
                'Paired'=4, 'Mismatched'=4, 'Bulge'=4)

cutoff_freqs = data.frame()
for (cutoff in c(0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0)) {
  df = clade_iso_ac_freqs %>%
    group_by(positions, variable) %>%
    summarize(count=sum(value), freq=sum(value)/sum(total)) %>%
    filter(freq >= cutoff) %>%
    mutate(cutoff=as.character(cutoff)) %>%
    select(positions, variable, freq, cutoff) %>%
    group_by(positions) %>%
    arrange(code_groups[variable], desc(freq)) %>%
    filter(row_number(positions) == 1)
  if (nrow(cutoff_freqs) == 0) cutoff_freqs = df
  else cutoff_freqs = rbind(cutoff_freqs, df)
}

In [8]:
cutoff_freqs %>% 
  mutate(variable = as.character(variable)) %>%
  bind_rows(consensus %>% mutate(variable = identity, freq = 0.9, cutoff = "Consensus") %>% select(positions, variable, freq, cutoff)) %>%
  filter(positions %in% c(names(single_positions), names(paired_positions))) %>%
  select(positions, cutoff, variable) %>%
  spread(cutoff, variable)

Unnamed: 0,positions,0.5,0.6,0.7,0.8,0.9,0.95,0.99,Consensus
1,X1.72,GC,GC,GC,GC,Paired,Paired,,Paired
2,X10.25,GC,GC,GC,Paired,Paired,Paired,Paired,Paired
3,X11.24,CG,CG,PyrimidinePurine,PyrimidinePurine,PyrimidinePurine,PyrimidinePurine,Paired,PyrimidinePurine
4,X12.23,PyrimidinePurine,PyrimidinePurine,Paired,Paired,Paired,Paired,,
5,X13.22,PyrimidinePurine,Paired,,,,,,
6,X14,A,A,A,A,A,A,A,A
7,X15,G,G,G,Purine,Purine,Purine,Purine,V
8,X15.48,GC,GC,GC,PurinePyrimidine,PurinePyrimidine,Paired,,
9,X16,U,U,U,Pyrimidine,Pyrimidine,B,,
10,X17,Absent,Absent,Absent,,,,,


For strategy 1A, some good candidates here: 
- R9 (95%)
- Y11:R24 (99%)
- G18, U55, and G18:U55 (99%)
- G19:C56 (99%)
- A21 and A14/A21 (99%)
- U33 (99%)
- R37 (99%)
- R46 (99%)
- G53:C61 (99%)
- U54:A58 (99%)
- U55 (99%)
- R57 (99%)
- Y60 (90%)

### R9 and R46

#### What's known

- Marck and Grosjean have a purine for iMet and no consensus for elongator, though they are mostly R or V. R9 is also conserved in initiators for archaea/bacteria.
- 9-12-23 is a tertiary interaction. [Gautheret et al. (1995)](http://dx.doi.org/10.1006/jmbi.1995.0200) has these frequencies from the Sprinzl 1991 database: ![9-12-23 frequency matrix](figures/9-12-23-gautheret.png)

- Position 9 is known to be modified with m$^1$G (along with 37) in a wide range of eukaryotes. 

#### Our data

- Here's our frequencies:

In [9]:
table(paste0(identities[identities$quality, ]$X12, ':', identities[identities$quality, ]$X23), identities[identities$quality, ]$X9)

     
         -    A    C    G    U
  A:A    0    5    1    7    0
  A:C    0   25    2  227    0
  A:G    0    2    1   14    0
  A:U    0  477  146   65    4
  C:-    0    0    0    1    0
  C:A    0    8    0   57    1
  C:C    0    1    0    4    0
  C:G    0  232   58 6648   52
  C:U    0    0    0    6    0
  G:A    0    2    0    2    0
  G:C    1 1592    7 4064    5
  G:G    0    0    0    8    0
  G:U    0   14    1   31    0
  U:A    0 6457    6 1771   33
  U:C    0    1    0    2    0
  U:G    0   23    0   68    0
  U:U    0    5    0    1    0

**Which isotypes/clades/species fail the consensus checks?**

In [10]:
## Clade/isotype check
best_freqs %>% filter(positions == 'X9') %>% group_by(clade, isotype) %>%
  summarize(status=sum((variable %in% c("G", "A", "Purine"))*count)/sum(count) == 1) %>%
  filter(!status)

## Species check
identities %>% select(clade, species, isotype, X9) %>% group_by(clade, species, isotype) %>%
  summarize(status=sum(X9 %in% c("G", "A", "Purine")/n()) >= 0.1,
            freq=sum(X9 %in% c("G", "A", "Purine")), ntRNAs=n()) %>%
  filter(!status)

Unnamed: 0,clade,isotype,status
1,Insecta,Glu,False
2,Insecta,His,False
3,Mammalia,His,False
4,Vertebrata,His,False


Unnamed: 0,clade,species,isotype,status,freq,ntRNAs
1,Fungi,crypGatt_WM276,Gly,FALSE,0,2
2,Fungi,crypNeof_VAR_GRUBII_H99,Gly,FALSE,0,2
3,Fungi,crypNeof_VAR_NEOFORMANS_B_3501,Gly,FALSE,0,2
4,Fungi,crypNeof_VAR_NEOFORMANS_JEC21,Gly,FALSE,0,2
5,Insecta,aedAeg1,His,FALSE,0,33
6,Insecta,apiMel1,His,FALSE,0,7
7,Insecta,bomTer1,His,FALSE,0,6
8,Insecta,dm6,His,FALSE,0,5
9,Insecta,dp4,His,FALSE,0,5
10,Insecta,droAna3,His,FALSE,0,5


#### What's new

The 9:12:23 ratios are a bit different. But it shows the same thing: there is some selectivity for what base triples are allowed, but enough tolerance that there are a few interactions that persist.

The default hypothesis is that the 9-23 interaction has a conserved interaction type that explains the frequencies shown. We can marginalize as such because 12:23 is always a WC pair. This is not the case - although some (R:R) have a a strong trans WC-Hoogsteen pair of hydrogen bonds, others (C:R) do not. The hoogsteen side of C is just two stable carbons unlikely to hydrogen bond.

Examining the frequencies _without_ histidine shows that the C9:U23 is conserved in His, and A9:U23 is conserved in Asp, for some clades. (See tertiary interactions figure).

In [11]:
table(identities[identities$quality, ]$X9, identities[identities$quality, ]$X23)

   
       -    A    C    G    U
  -    0    0    1    0    0
  A    0 6472 1619  257  496
  C    0    7    9   59  147
  G    1 1837 4297 6738  103
  U    0   34    5   52    4

So here's what we know.
- 12:23 is almost always WC paired, so we can marginalize it.
- 9 is typically a purine, though there are isotype- and clade-specific exceptions. For example, His and Asp deviate in 4 clades.
    - In the His case, C and U can interact via the 2-carbonyl [amino pairing](http://www.columbuslabs.org/wp-content/uploads/2008/03/basepairs.pdf).
- 9:23 is thought to be a _trans_ interaction with two hydrogen bonds.

**Conclusion 1**: Position 9 was thought to be a purine. Instead, it varies by isotype and clade. Much of the variation from R9 can be explained by looking at 9:23 as an clade/isotype discriminating tertiary interaction.

We can leave it at that. But we should also look into compensatory interactions if 9:23 is disrupted.

### Covariation for core D stem 3d pairs

#### tRNA covariation frequencies for 9:23, 22:46, and 10:45

In [12]:
df = table(paste0(identities[identities$quality, ]$X9, ':', identities[identities$quality, ]$X23),
           paste0(identities[identities$quality, ]$X46, ':', identities[identities$quality, ]$X22, ' / ', identities[identities$quality, ]$X45, ':', identities[identities$quality, ]$X10),
           identities[identities$quality, ]$isotype)

In [13]:
as.data.frame(df) %>% group_by(Var3) %>% filter(Freq > 50)

Unnamed: 0,Var1,Var2,Var3,Freq
1,A:A,G:G / G:G,Ala,1382
2,G:C,G:G / G:G,Ala,88
3,G:A,G:U / G:G,Ala,51
4,G:C,A:A / G:G,Arg,288
5,G:G,A:A / G:G,Arg,954
6,G:C,G:G / G:G,Arg,449
7,G:G,A:A / G:G,Asn,68
8,A:A,G:G / G:G,Asn,108
9,G:C,G:G / G:G,Asn,457
10,A:U,A:G / G:G,Asp,323


This is messy. For each isotype there may be a different tertiary interaction compensating for the lack of A:A. It seems that purine:purine interactions are unusually enriched among all three of these except in valine. 

![classical interaction structure](figures/3d-interactions-oliva.png)

#### Average number of purine:purine interactions by isotype and clade

In [14]:
RRs = c("A:A", "A:G", "G:G", "G:A")
identities %>% select(isotype, clade, quality, X9.23, X22.46, X10.45, X26.44) %>% 
  filter(quality) %>%
  rowwise() %>% 
  mutate(nRR=(X9.23 %in% RRs) + (X22.46 %in% RRs) + (X10.45 %in% RRs) + (X26.44 %in% RRs)) %>%
  group_by(isotype, clade) %>% 
  summarize(nRR=signif(mean(nRR), 3)) %>%
  spread(isotype, nRR)

“Grouping rowwise data frame strips rowwise nature”

Unnamed: 0,clade,Ala,Arg,Asn,Asp,Cys,Gln,Glu,Gly,His,⋯,Leu,Lys,Met,Phe,Pro,Ser,Thr,Trp,Tyr,Val
1,Fungi,2.89,3.27,3.5,3.05,3.17,1.45,1.85,1.93,2.85,⋯,1.97,3.54,3.86,3.86,2.03,1.77,3.84,3.02,3.63,2.3
2,Insecta,4.0,3.62,3.0,2.0,4.0,2.0,3.3,2.5,2.07,⋯,2.0,4.0,3.0,3.94,2.05,1.99,3.97,2.96,4.0,1.99
3,Mammalia,3.95,3.54,2.95,2.11,3.97,1.97,2.96,2.64,2.04,⋯,2.0,3.92,2.99,3.92,2.0,2.05,3.99,2.98,3.99,2.17
4,Nematoda,3.98,3.59,2.95,2.0,4.0,1.94,2.75,2.48,1.95,⋯,1.98,2.59,2.94,3.88,2.02,1.99,3.97,2.82,3.96,2.0
5,Spermatophyta,3.98,3.47,3.75,2.0,3.67,2.39,3.43,3.2,3.0,⋯,2.0,3.01,3.0,3.97,2.04,1.23,3.72,3.0,3.84,2.0
6,Vertebrata,3.96,3.45,2.93,2.0,3.92,1.98,2.96,2.34,1.99,⋯,1.97,3.94,2.96,3.72,1.99,2.06,3.95,2.99,3.93,1.97


#### Average number of purine:purine interactions by position/isotype

In [15]:
RRs = c("A:A", "A:G", "G:G", "G:A")
identities %>% select(isotype, quality, X9.23, X22.46, X10.45, X26.44) %>% 
  filter(quality) %>%
  rowwise() %>% 
  mutate(X9.23=X9.23 %in% RRs, X22.46=X22.46 %in% RRs, X10.45=X10.45 %in% RRs, X26.44=X26.44 %in% RRs) %>%
  gather(position, RR, X9.23, X22.46, X10.45, X26.44, -isotype, -quality) %>%
  group_by(isotype, position) %>%
  summarize(RR=round(mean(RR), digits=1)) %>%
  spread(isotype, RR)

Unnamed: 0,position,Ala,Arg,Asn,Asp,Cys,Gln,Glu,Gly,His,⋯,Leu,Lys,Met,Phe,Pro,Ser,Thr,Trp,Tyr,Val
1,X10.45,1.0,1.0,1.0,1.0,1.0,1.0,0.9,0.9,1.0,⋯,0,0.9,1.0,0.9,1,0.1,1,1.0,1,0.9
2,X22.46,0.9,1.0,1.0,1.0,1.0,0.8,0.9,0.9,0.9,⋯,1,1.0,1.0,1.0,0,0.9,1,1.0,1,0.1
3,X26.44,1.0,1.0,0.9,0.1,0.9,0.0,0.2,0.1,0.2,⋯,0,0.9,1.0,1.0,0,0.0,1,0.9,1,1.0
4,X9.23,0.9,0.6,0.3,0.2,1.0,0.1,0.8,0.6,0.2,⋯,1,1.0,0.1,1.0,1,0.9,1,0.1,1,0.1


#### Average number of interactions containing a purine by position/isotype

In [16]:
RRs = c("A:A", "A:G", "G:G", "G:A", "G:C", "C:G", "A:C", "C:A", "U:G", "G:U", "A:U", "U:A")
identities %>% select(isotype, quality, X9.23, X22.46, X10.45, X26.44) %>% 
  filter(quality) %>%
  rowwise() %>% 
  mutate(X9.23=X9.23 %in% RRs, X22.46=X22.46 %in% RRs, X10.45=X10.45 %in% RRs, X26.44=X26.44 %in% RRs) %>%
  gather(position, RR, X9.23, X22.46, X10.45, X26.44, -isotype, -quality) %>%
  group_by(isotype, position) %>%
  summarize(RR=round(mean(RR), digits=1)) %>%
  spread(isotype, RR)

Unnamed: 0,position,Ala,Arg,Asn,Asp,Cys,Gln,Glu,Gly,His,⋯,Leu,Lys,Met,Phe,Pro,Ser,Thr,Trp,Tyr,Val
1,X10.45,1,1,1,1.0,1,1.0,1.0,0.9,1.0,⋯,0,1,1,1,1,0.1,1,1,1,1
2,X22.46,1,1,1,1.0,1,1.0,1.0,1.0,1.0,⋯,1,1,1,1,1,1.0,1,1,1,1
3,X26.44,1,1,1,0.9,1,0.1,0.4,1.0,0.4,⋯,1,1,1,1,1,1.0,1,1,1,1
4,X9.23,1,1,1,1.0,1,1.0,1.0,1.0,0.5,⋯,1,1,1,1,1,1.0,1,1,1,1


#### Are non-purine-containing interactions enriched for some other type of interaction?

In [17]:
identities %>% select(isotype, quality, X9.23, X22.46, X10.45, X26.44) %>% 
  filter(quality) %>%
  gather(position, identity, X9.23, X22.46, X10.45, X26.44, -isotype, -quality) %>%
  filter(!(identity %in% c(RRs, "G:-", "A:-"))) %>%
  group_by(position, isotype, identity) %>%
  summarize(count=n()) %>%
  filter(count > 20) %>%
  spread(identity, count, fill=0)

“attributes are not identical across measure variables; they will be dropped”

Unnamed: 0,position,isotype,C:-,C:U,U:-,U:C,U:U
1,X10.45,Leu,48,0,0,0,0
2,X10.45,Ser,93,0,173,0,0
3,X26.44,Asp,0,0,0,0,43
4,X26.44,Gln,0,31,0,684,123
5,X26.44,Glu,0,0,0,561,102
6,X26.44,Gly,0,0,0,0,26
7,X26.44,His,0,0,0,175,0
8,X26.44,Ile,0,0,0,33,0
9,X9.23,His,0,147,0,0,0


They are not - but they are isotype-specific.

**Summary**: Tertiary interactions within this "tRNA core region" along the D stem are enriched in purines and purine-purine interactions, though the extent varies by isotype and position. Exceptions are isotype-specific.

### G18, U55, and G18:U55

#### Number of tRNAs without G18:U55

In [18]:
identities %>% select(clade, species, quality, X18, X55, X18.55) %>%
  filter(quality & X18.55 != "G:U") %>% nrow()

#### Which isotypes/clades/species fail the consensus checks?

In [19]:
## Clade/isotype check
best_freqs %>% filter(positions == 'X18.55') %>% group_by(clade, isotype) %>%
  summarize(status=sum((variable == "GU")*count)/sum(count) == 1) %>%
  filter(!status)

## Species check
identities %>% select(clade, species, isotype, quality, X18.55) %>%
  filter(quality) %>%
  group_by(clade, species, isotype) %>%
  summarize(status=sum((X18.55 == "G:U")/n()) >= 0.1,
            freq=sum(X18.55 == "G:U"), tRNAs=n()) %>%
  filter(!status)

Unnamed: 0,clade,isotype,status


Unnamed: 0,clade,species,isotype,status,freq,tRNAs
1,Fungi,flamVelu_KACC42780,Pro,0,0,1
2,Fungi,valsMali_03_8,Pro,0,0,2


In [20]:
identities %>% select(clade, quality, isotype, X18.55) %>%
  filter(quality & ((clade == "Nematoda" & isotype == "Asn") | 
                    (clade == "Vertebrata" & isotype == "Gly"))) %>%
  group_by(isotype, clade) %>%
  mutate(total=n()) %>%
  group_by(isotype, clade, total, X18.55) %>%
  summarize(count=n()) %>%
  mutate(freq=count/total) %>%
  select(-total, -count) %>%
  spread(X18.55, freq, fill=0)

Adding missing grouping variables: `total`


Unnamed: 0,total,isotype,clade,-:U,A:U,G:C,G:G,G:U
1,19,Asn,Nematoda,0.0,0.105263157894737,0.0,0.0,0.894736842105263
2,250,Gly,Vertebrata,0.004,0.0,0.136,0.004,0.856


The isotype/clade check fails for nematode Asn at 88% and vertebrate Gly at 86%. 

Earlier, we had a vertebrate check fail because of the methods used to build the quality set. I've relaxed the species check constraints to 10% to better identify IDEs even when there are massively amplified tRNAs. Curiously, in danRer10, gasAcu1, and oryLat2, the majority of the non-G:U tRNAs contain a G:C. It's pretty likely that, given the high copy number, many of these tRNAs are actually pseudogenes. In addition, there are a decent amount of G18:U55 tRNAs to actually drive translation.

The two fungi listed here are rare exceptions to the rule. _Flammulina velutipes_ is the enokitake mushroom, and none of its proline tRNAs contain G18. _Valsa mali_ is a fungus that infects apple trees. It contains 4 proline tRNAs. The first two (Pro-AGG-1, Pro-CGG-1) score at 22 and 47, contain G18:U55, and are filtered out. The other two (Pro-TGG-1) score at 58, and contain C18:U55. So we have two main possibilities here.

1. G18:U55 is required for proper folding and recognition. Pro-CGG-1 is used as the main proline tRNA.
2. G18:U55 is not required. The synthetase is able to recognize some combination of Pro-CGG and Pro-TGG.

The evidence below points to option 2. In fungi, TGG is a much more common anticodon than CGG. Pro-TGG in this species also scores better, and contains a position 47. There are only two other fungal proline tRNAs without a position 47, both of which exist in _Encephalitozoon_ species, and both of which are supplemented by other high-scoring proline tRNAs.

#### Fungal proline anticodon usage /  Fungal proline position 47 incidence

In [21]:
identities %>% select(clade, quality, isotype, anticodon) %>%
  filter(quality & isotype == "Pro" & clade == "Fungi") %>%
  group_by(clade, isotype, anticodon) %>%
  summarize(count=n())

identities %>% select(clade, quality, isotype, X47) %>%
  filter(quality & isotype == "Pro" & clade == "Fungi") %>%
  group_by(clade, isotype, X47) %>%
  summarize(count=n()) %>%
  spread(X47, count)

identities %>% select(clade, quality, species, species_long, seqname, isotype, score, X47) %>%
  filter(quality & isotype == "Pro" & clade == "Fungi" & X47 == "-")

Unnamed: 0,clade,isotype,anticodon,count
1,Fungi,Pro,AGG,65
2,Fungi,Pro,CGG,33
3,Fungi,Pro,TGG,92


Unnamed: 0,clade,isotype,-,C,U
1,Fungi,Pro,2,53,135


Unnamed: 0,clade,quality,species,species_long,seqname,isotype,score,X47
1,Fungi,1,enceHell_ATCC50504,Encephalitozoon hellem ATCC 50504,enceHell_ATCC50504_chrX.trna2-ProCGG,Pro,60.7,-
2,Fungi,1,enceInte_ATCC50506,Encephalitozoon intestinalis ATCC 50506,enceInte_ATCC50506_chrX.trna2-ProCGG,Pro,63.4,-


### G19:C56

We now look at 19:56 in context of 18:55. G19:C56 similarly hits 99% frequency.

#### Distribution of 18:55 and 19:56

In [22]:
identities %>% select(quality, X18.55, X19.56) %>%
  filter(quality) %>%
  group_by(X18.55, X19.56) %>%
  summarize(count=n()) %>%
  filter(count > 5) %>%
  spread(X19.56, count, fill=0)

Unnamed: 0,X18.55,A:C,C:C,G:A,G:C,G:G,G:U,U:C
1,A:U,0,0,0,137,0,9,0
2,C:U,0,0,0,19,0,0,0
3,G:A,0,0,0,6,0,0,0
4,G:C,0,0,0,89,0,0,0
5,G:G,0,0,0,14,0,0,0
6,G:U,74,14,34,21232,27,419,19
7,U:U,0,0,0,15,0,0,0


What may be happening here is that 18:55 and 19:56 act in concert. Both are required, but failing that, having at least one of them will weakly preserve the tertiary structure. This makes sense in context of the tertiary structure shown above - 18:55 and 19:56 are right next to each other and perform the same function, forming the tRNA elbow. 

In this case, whether it fails the species or isotype/clade check doesn't matter anymore, unless we specifically see enrichment for a clade/isotype.

The one exception is with the 9 tRNAs for A18:U55 and G19:U56. Let's take a look at both of them.

#### Is there enrichment for certain isotypes/clades for the non-GU/GC tRNAs?

This measures the percentage of tRNAs in the isotype/clade with non-GU/GC tRNAs.

In [23]:
identities %>% select(isotype, clade, quality, X18.55, X19.56) %>%
  filter(quality) %>%
  group_by(isotype, clade) %>%
  mutate(total=n()) %>%
  filter(X18.55 != 'G:U' | X19.56 != 'G:C') %>%
  summarize(freq=signif(n()/unique(total), digits=2), count=n(), total=total[1]) %>%
  filter(freq > 0.05) %>%
  arrange(desc(freq))

Unnamed: 0,isotype,clade,freq,count,total
1,Gly,Vertebrata,0.21,52,250
2,Lys,Mammalia,0.17,127,739
3,Lys,Spermatophyta,0.12,16,138
4,Met,Vertebrata,0.12,17,139
5,Thr,Vertebrata,0.12,43,370
6,Asn,Insecta,0.11,2,18
7,Asn,Nematoda,0.11,2,19
8,Lys,Vertebrata,0.11,31,294
9,Tyr,Vertebrata,0.11,20,183
10,Phe,Mammalia,0.097,22,226


In [24]:
identities %>% select(isotype, clade, quality, X18.55, X19.56) %>%
  filter(quality) %>%
  group_by(isotype, clade) %>%
  mutate(total=n()) %>%
  filter(X18.55 != 'G:U' & X19.56 != 'G:C') %>%
  summarize(freq=signif(n()/unique(total), digits=2), count=n(), total=total[1]) %>%
  filter(freq > 0.01) %>%
  arrange(desc(freq))

Unnamed: 0,isotype,clade,freq,count,total
1,Asn,Insecta,0.056,1,18
2,Met,Fungi,0.02,2,99
3,Gly,Vertebrata,0.012,3,250
4,Lys,Mammalia,0.011,8,739


We see differential enrichment in the "not GU or not GC" case, but not in this case ("not GU/GC"). Which is to be expected - it's easier to mutate one or the other, rather than both. But just in case, we'll double check vertebrate glycine and mammalian lysine.

#### Which species have which divergent identities at 18:55 and 19:56?

In [25]:
identities %>% select(isotype, clade, quality, X18.55, X19.56) %>%
  filter(quality & ((isotype == "Gly" & clade == "Vertebrata") | (isotype == "Lys" & clade == "Mammalia"))) %>%
  group_by(isotype, clade, X18.55, X19.56) %>%
  summarize(count=n()) %>%
  filter(count > 5) %>%
  spread(X19.56, count, fill=0)

identities %>% select(isotype, clade, species, quality, X18.55, X19.56) %>%
  filter(quality & isotype == "Gly" & clade == "Vertebrata") %>%
  group_by(species, clade, isotype) %>% mutate(total_tRNAs=n()) %>%
  filter(X18.55 == "G:C" & X19.56 == "G:C") %>%
  group_by(species, clade, isotype, total_tRNAs) %>% summarize(count=n())

identities %>% select(isotype, clade, species, quality, X18.55, X19.56) %>%
  filter(quality & isotype == "Lys" & clade == "Mammalia") %>%
  group_by(species, clade, isotype) %>% mutate(total_tRNAs=n()) %>%
  filter(X18.55 == "G:U" & X19.56 == "G:U") %>%
  group_by(species, clade, isotype, total_tRNAs) %>% summarize(count=n())

Unnamed: 0,isotype,clade,X18.55,A:C,G:C,G:U
1,Gly,Vertebrata,G:C,0,32,0
2,Gly,Vertebrata,G:U,0,198,14
3,Lys,Mammalia,A:U,0,48,0
4,Lys,Mammalia,G:U,11,612,47


Unnamed: 0,species,clade,isotype,total_tRNAs,count
1,danRer10,Vertebrata,Gly,50,31
2,fr3,Vertebrata,Gly,9,1


Unnamed: 0,species,clade,isotype,total_tRNAs,count
1,ailMel1,Mammalia,Lys,40,5
2,calJac3,Mammalia,Lys,8,1
3,canFam3,Mammalia,Lys,31,4
4,cavPor3,Mammalia,Lys,14,1
5,criGri1,Mammalia,Lys,39,2
6,dasNov3,Mammalia,Lys,29,2
7,eriEur2,Mammalia,Lys,50,7
8,felCat5,Mammalia,Lys,40,2
9,gorGor3,Mammalia,Lys,11,1
10,hetGla2,Mammalia,Lys,10,2


If it's the case that G:C/G:C or G:U/G:U is deficient, both vertebrate glycine and mammalian lysine are well-compensated in each species by other tRNAs, since there are so many tRNA copies.

Now to go back to the original 18:55 and 19:56 table - 

#### Is there enrichment in certain isotypes/clades/species for A18:U55 and G19:U56?

In [26]:
identities %>% select(isotype, clade, species, quality, X18.55, X19.56) %>%
  filter(quality) %>%
  group_by(isotype, clade, species) %>% mutate(total_tRNAs=n()) %>%
  filter(X18.55 == "A:U" & X19.56 == "G:U")

Unnamed: 0,isotype,clade,species,quality,X18.55,X19.56,total_tRNAs
1,Ala,Mammalia,hetGla2,1,A:U,G:U,18
2,Ala,Mammalia,speTri2,1,A:U,G:U,49
3,Gly,Mammalia,micMur1,1,A:U,G:U,8
4,Ile,Mammalia,triMan1,1,A:U,G:U,10
5,Leu,Mammalia,nomLeu3,1,A:U,G:U,18
6,Lys,Mammalia,criGri1,1,A:U,G:U,39
7,Lys,Vertebrata,melUnd1,1,A:U,G:U,9
8,Lys,Vertebrata,oreNil2,1,A:U,G:U,18
9,Lys,Mammalia,otoGar3,1,A:U,G:U,11


Nope! No enrichment. Each of these species has a lot of tRNAs that can compensate for one faulty tRNA. It's also worth noting that A:U and G:U are the most common single mutations, making it likely that they can still participate in the 3D interaction. Thus, a double mutation may be the strongest non-GU/GC interaction.

**Do the fungal proline tRNAs without G18:U55 have G19:C56?** They have to, otherwise they would've appeared in the table above!

#### What's known

G19 often contains a 2'-O-methyl modification, and in the majority of tested tRNAs, U55 is modified to $\Psi$. It's known that G18:U55 and G19:C56 are conserved.

A kinetics study showed an 80-fold decrease in translation efficiency (because of ribosome docking) without G18:U55 ([Pan et al. 2006](http://www.nature.com/nsmb/journal/v13/n4/full/nsmb1074.html)).

#### Summary

- 18:55 and 19:56 frequencies have been studied in isolation but not together.
- G18:U55 and G19:C56 are vital for forming the tertiary structure at the tRNA elbow, and exist in 96% of tRNAs.
- If it's not 100% conserved, is it a clade/isotype IDE, or is it an artifact of high-copy pseudogenes?
    - Most of the remaining 4% has one mutation. For the two most egregious isotypes/clades, each species has several more tRNAs of the same isoacceptor to compensate. These can be explained by pseudogenes getting through the quality filter.
    - There are also 9 tRNAs with two mutations. All of them are AU/GU, and like before, each species have several more tRNAs of the same isoacceptor to compensate.
- Based on this evidence, I would say that having one of G18:U55 and G19:C56 is an identity element.    

### Y11:R24

This one meets our definition of consensus, though it doesn't reach the 99% threshold. Let's see if the tRNAs themselves are enriched in any way.

#### Number of tRNAs without C11:G24 or U11:A24

In [35]:
identities %>% select(clade, species, quality, X11.24) %>%
  filter(quality & !(X11.24 %in% c("C:G", "U:A"))) %>% nrow()

#### Clade / isotype enrichment for tRNAs without Y11:R24

In [62]:
display_markdown(paste(as.character(kable(identities %>% select(clade, quality, isotype, X11.24) %>%
  filter(quality & !(X11.24 %in% c("C:G", "U:A"))) %>%
  group_by(clade, isotype) %>%
  summarize(count = n()) %>%
  spread(isotype, count, fill = 0))), collapse = "\n"))

|clade         | Ala| Arg| Asn| Asp| Cys| Gln| Glu| Gly| His| Ile| iMet| Leu| Lys| Met| Phe| Pro| Ser| Thr| Trp| Tyr| Val|
|:-------------|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|----:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
|Fungi         |   5|   1|   0|   0|   0|   2|   1|   1|   0|   0|    0|   1|   0|   0|   0|   1|   0|   1|   0|   0|  89|
|Insecta       |   0|   1|   1|   0|   0|   0|   0|   0|   0|   0|    0|   0|   0|   0|   0|   0|   0|   0|   0|   1|   0|
|Mammalia      |  10|   4|  18|   2|   9|   1|   5|   4|   0|   8|    1|   3|  25|  17|   4|   3|   5|   3|   6|   1|   5|
|Nematoda      |   3|   2|   1|   0|   0|   0|   3|   1|   1|   1|    0|   3|   1|   1|   0|   3|   0|   1|   0|   1|   0|
|Spermatophyta |   1|   2|   0|   0|   0|   0|   2|   1|   0|   1|    0|   4|   4|   3|   0|   0|   2|   0|   0|   0|   0|
|Vertebrata    |   5|   6|  18|   1|   5|   2|   6|   0|   0|   5|    2|   3|   9|   5|  11|   2|  11|   4|  11|   0|   5|

What's this strange bump in fungal valine tRNAs?

#### Frequency of non-Y11:R24 tRNAs in fungi Val

In [31]:
identities %>% select(clade, quality, isotype, X11.24) %>%
  filter(quality & clade == "Fungi" & isotype == "Val") %>%
  group_by(clade, isotype, X11.24) %>%
  summarize(count = n())

Unnamed: 0,clade,isotype,X11.24,count
1,Fungi,Val,A:U,1
2,Fungi,Val,C:G,31
3,Fungi,Val,G:C,1
4,Fungi,Val,U:A,194
5,Fungi,Val,U:G,87


There's high frequency of U11:G24 (though technically, it is also a pyrimidine:purine interaction).

#### What's known

- [Ohana et al. 1994](http://dx.doi.org/10.1006/abbi.1994.1503) studied yeast Ser, which contains C11:G24. A switch to G11:C24 decreased aminoacylation efficiency to 78%.
- C11:G24, along with G10:C25, are vital for m$^2_2$G26, according to [Edqvist et al. 1992](https://doi.org/10.1093/nar/20.24.6575). However, other eukaryotes such as _Drosophila_ contain the modification even without these identities.
- U11:A24 prevents RNase P processing in _Drosophila_ ([Levinger et al. 1995](10.1074/jbc.270.32.18903)). It's noted that the tRNA probably folds correctly, and the recognition is stalled by "local helix deformation". 
- This identity element is also important in mitochondrial tRNAs ([Takeuchi et al. 2001)](http://www.jbc.org/content/276/23/20064.full)).

#### What's new

#### Covariance of 10:25 and 11:24 for fungi vs all other clades

In [29]:
identities %>% select(clade, isotype, quality, X10.25, X11.24) %>%
  filter(isotype == "Val") %>%
  mutate(clade=ifelse(clade == "Fungi", "Fungi", "Other")) %>%
  filter(X11.24 %in% c("U:A", "C:G", "U:G")) %>%
  group_by(clade) %>%
  mutate(total=n()) %>%
  group_by(clade, total, X10.25, X11.24) %>%
  summarize(count=n()) %>%
  mutate(freq=count/total) %>%
  filter(freq > 0.01) %>%
  select(-total, -count) %>%
  spread(X10.25, freq, fill=0)

Adding missing grouping variables: `total`


Unnamed: 0,total,clade,X11.24,G:C,G:U
1,1010,Fungi,C:G,0.02871287,0.0
2,1010,Fungi,U:A,0.3683168,0.03465347
3,1010,Fungi,U:G,0.5594059,0.0
4,3415,Other,U:A,0.9385066,0.04919473


A whopping **55%** of fungal valine tRNAs contain G10:C25 and U11:G24. This is very different from other eukaryotic valines, which hovers around 94% for G10:C25 and U11:A24. This seems to be tolerated specifically in fungal valine tRNAs. This could potentially be a switch for modification or for cleavage. Is this enriched in specific species?

In [59]:
identities %>% select(species, clade, isotype, quality, X10.25, X11.24) %>%
  filter(quality, isotype == "Val", clade == "Fungi", X11.24 %in% c("U:A", "C:G", "U:G")) %>%
  mutate(identity = paste0(X10.25, ' / ', X11.24)) %>%
  group_by(species, identity) %>%
  summarize(count = n()) %>%
  spread(identity, count, fill = 0)

Unnamed: 0,species,G:C / C:G,G:C / U:A,G:C / U:G,G:U / C:G,G:U / U:A,U:U / U:A
1,ashbGoss_ATCC10895,0,5,2,0,0,0
2,aspeFumi_AF293,1,3,0,0,2,0
3,aspeNidu_FGSC_A4,0,3,0,0,1,0
4,aspeOryz_RIB40,0,3,0,0,1,0
5,botrCine_B05_10,0,2,0,0,0,0
6,candAlbi_WO_1,0,2,3,0,0,0
7,candDubl_CD36,0,2,3,0,0,0
8,candGlab_CBS_138,0,3,1,0,0,0
9,candOrth_CO_90_125,0,1,2,0,0,0
10,chaeTher_VAR_THERMOPHILUM_DSM1,0,10,0,0,0,0


G10:C25 / U11:A24 (canonical) and G10:C25 / U11:G24 (fungal valine) are broadly distributed. Seems like in fungi, both are tolerated by the synthetase or do some other conserved function.

There are a few species that are noteworthy, though:

_Flammulina velutipes_ shows up here again. _Cryptococcus_, _Penicillium chrysogenum_, _Ustilago maydis_, _Sporisorium reilianum_, and _S. pombe_ are all exceptions that contain neither the canonical set or the fungal valine set of features. Instead, they contain a variety of different features.

<img src='figures/fungi-tree.png' style="width: 700px;">

All of the aforementioned species group together. 

### R15:Y48

#### Number of tRNAs without R15:Y48

In [63]:
identities %>% select(clade, species, quality, X15.48) %>%
  filter(quality & !(X15.48 %in% c('G:C', 'A:U'))) %>% nrow()

#### Which isotypes/clades/species fail the consensus checks?

In [64]:
## Clade/isotype check
best_freqs %>% filter(positions == 'X15.48') %>% group_by(clade, isotype) %>%
  summarize(status=sum((variable %in% c('GC', 'AU', 'PurinePyrimidine'))*count)/sum(count) == 1) %>%
  filter(!status)

## Species check
identities %>% select(clade, species, isotype, quality, X15.48) %>%
  filter(quality) %>%
  group_by(clade, species, isotype) %>%
  summarize(status=sum((X15.48 %in% c('G:C', 'A:U'))/n()) >= 0.1,
            freq=sum(X15.48 %in% c('G:C', 'A:U')), tRNAs=n()) %>%
  filter(!status)

Unnamed: 0,clade,isotype,status


Unnamed: 0,clade,species,isotype,status,freq,tRNAs
1,Fungi,ashbGoss_ATCC10895,Glu,0,0,2
2,Fungi,aspeFumi_AF293,Asp,0,0,2
3,Fungi,aspeNidu_FGSC_A4,Asp,0,0,1
4,Fungi,botrCine_B05_10,Phe,0,0,4
5,Fungi,candGlab_CBS_138,Glu,0,0,2
6,Fungi,candOrth_CO_90_125,Phe,0,0,1
7,Fungi,crypGatt_WM276,Ala,0,0,1
8,Fungi,crypNeof_VAR_GRUBII_H99,Ala,0,0,1
9,Fungi,crypNeof_VAR_GRUBII_H99,Gln,0,0,1
10,Fungi,crypNeof_VAR_NEOFORMANS_B_3501,Ala,0,0,1


This one's the Levitt base pair, and has been shown in E. coli to tolerate different combinations other than R15:Y48. Despite the historical hubbub, it looks like a typical clade/isotype-specific IDE.

### A21 and 8-14-21

#### Number of tRNAs without A21 or U8-A14-A21

In [65]:
identities %>% select(clade, species, quality, X8.14, X21) %>%
  filter(quality & X8.14 != 'U:A' & X21 != 'A') %>% nrow()

identities %>% select(clade, species, quality, X8.14, X21) %>%
  filter(quality & X21 != 'A') %>% nrow()

Even though U:A:A is almost universally conserved, I'm going to focus on A21. The 4 tRNAs without U:A:A can easily be explained away by compensatory tRNAs with U:A:A.

#### Which isotypes/clades/species fail the consensus checks?

In [66]:
## Clade/isotype check
best_freqs %>% filter(positions == 'X21') %>% group_by(clade, isotype) %>%
  summarize(status=sum((variable == 'A')*count)/sum(count) == 1) %>%
  filter(!status)

## Species check
identities %>% select(clade, species, isotype, quality, X21) %>%
  filter(quality) %>%
  group_by(clade, species, isotype) %>%
  summarize(status=sum(X21 == 'A')/n() >= 0.1,
            freq=sum(X21 == 'A'), tRNAs=n()) %>%
  filter(!status)

Unnamed: 0,clade,isotype,status


Unnamed: 0,clade,species,isotype,status,freq,tRNAs
1,Fungi,aspeNidu_FGSC_A4,Phe,0,0,2
2,Fungi,sporReil_SRZ2,Phe,0,0,4
3,Fungi,ustiMayd_521,Phe,0,0,2


In [67]:
identities %>% select(clade, species, isotype, anticodon, score, isoscore, quality, X8, X14, X15, X16, X17, X17a, X18, X19, X20, X20a, X20b, X21, X22) %>%
  filter(quality & species %in% c("aspeNidu_FGSC_A4", "sporReil_SRZ2", "ustiMayd_521") & isotype == "Phe")

Unnamed: 0,clade,species,isotype,anticodon,score,isoscore,quality,X8,X14,X15,X16,X17,X17a,X18,X19,X20,X20a,X20b,X21,X22
1,Fungi,aspeNidu_FGSC_A4,Phe,GAA,65.0,86.7,1,U,A,G,U,U,-,G,G,U,A,-,U,G
2,Fungi,aspeNidu_FGSC_A4,Phe,GAA,65.1,86.7,1,U,A,G,U,U,-,G,G,U,A,-,U,G
3,Fungi,sporReil_SRZ2,Phe,GAA,56.5,97.8,1,U,A,G,U,U,-,G,G,U,A,-,C,G
4,Fungi,sporReil_SRZ2,Phe,GAA,63.8,100.0,1,U,A,G,U,U,-,G,G,U,A,-,C,G
5,Fungi,sporReil_SRZ2,Phe,GAA,60.9,101.0,1,U,A,G,U,U,-,G,G,U,A,-,C,G
6,Fungi,sporReil_SRZ2,Phe,GAA,60.6,101.0,1,U,A,G,U,U,-,G,G,U,A,-,C,G
7,Fungi,ustiMayd_521,Phe,GAA,62.7,93.1,1,U,A,G,U,U,-,G,G,U,A,-,C,G
8,Fungi,ustiMayd_521,Phe,GAA,61.4,101.0,1,U,A,G,U,U,-,G,G,U,A,-,C,G


In [73]:
display_html(paste(kable(identities %>% select(clade, species, isotype, anticodon, score, isoscore, quality, X8, X14, X15, X16, X17, X17a, X18, X19, X20, X20a, X20b, X21, X22) %>%
  filter(species %in% c("schiPomb_972H")), format = 'html'), collapse = '\n'))

clade,species,isotype,anticodon,score,isoscore,quality,X8,X14,X15,X16,X17,X17a,X18,X19,X20,X20a,X20b,X21,X22
Fungi,schiPomb_972H,Ala,TGC,70.6,87.6,True,U,A,G,U,-,-,G,G,U,-,-,A,G
Fungi,schiPomb_972H,Ala,AGC,57.1,76.4,True,U,A,G,A,U,-,G,G,U,U,-,A,U
Fungi,schiPomb_972H,Ala,CGC,68.5,104.2,True,U,A,G,G,-,-,G,G,U,-,-,A,U
Fungi,schiPomb_972H,Ala,AGC,57.1,76.4,False,U,A,G,A,U,-,G,G,U,U,-,A,U
Fungi,schiPomb_972H,Ala,TGC,70.6,87.6,False,U,A,G,U,-,-,G,G,U,-,-,A,G
Fungi,schiPomb_972H,Ala,AGC,57.1,76.4,False,U,A,G,A,U,-,G,G,U,U,-,A,U
Fungi,schiPomb_972H,Ala,AGC,57.1,76.4,False,U,A,G,A,U,-,G,G,U,U,-,A,U
Fungi,schiPomb_972H,Ala,AGC,57.1,76.4,False,U,A,G,A,U,-,G,G,U,U,-,A,U
Fungi,schiPomb_972H,Ala,AGC,57.1,76.4,False,U,A,G,A,U,-,G,G,U,U,-,A,U
Fungi,schiPomb_972H,Ala,AGC,57.1,76.4,False,U,A,G,A,U,-,G,G,U,U,-,A,U


It's hard to say whether these are misalignments or not. The standard alignment model lists A20a as A21, and Y21 as Y21i. None of these species have compensatory tRNAs. It's also unusual that all of them are Phe. It's probably not evolutionarily related since there are several more species more related to each of them (e.g., _Aspergillus oryzae_). One possibility is that the fungi Phe synthetase or processing machinery is more permissive in general, and that it can tolerate mutations.

#### What's known

M&G: Always A21, but there are 3 copies of G21 in _S. pombe_. It interacts as U8-A14-A21.

_S. pombe_ Met tRNAs disappear from our data because Thr scores higher through the isotype-specific models. They are also a very odd case - in addition to G21, elongator _S. pombe_ Met tRNAs contain short, 6 and 8 bp introns at position 37. More on that [here](http://rna.cshl.edu/content/free/chapters/21_rna_world_2nd.pdf).

### U33

#### Number of tRNAs without U33

In [35]:
identities %>% select(clade, species, quality, X33) %>%
  filter(quality & X33 != 'U') %>% nrow()

#### Which isotypes/clades/species fail the consensus checks?

In [36]:
## Clade/isotype check
best_freqs %>% filter(positions == 'X33') %>% group_by(clade, isotype) %>%
  summarize(status=sum((variable == 'U')*count)/sum(count) == 1) %>%
  filter(!status)

## Species check
identities %>% select(clade, species, isotype, quality, X33) %>%
  filter(quality) %>%
  group_by(clade, species, isotype) %>%
  summarize(status=sum(X33 == 'U')/n() >= 0.1,
            freq=sum(X33 == 'U'), tRNAs=n()) %>%
  filter(!status)

Unnamed: 0,clade,isotype,status
1,Insecta,iMet,False
2,Mammalia,iMet,False
3,Nematoda,iMet,False
4,Spermatophyta,iMet,False
5,Streptophyta,iMet,False
6,Vertebrata,iMet,False


Unnamed: 0,clade,species,isotype,status,freq,tRNAs
1,Fungi,flamVelu_KACC42780,iMet,False,0,1
2,Insecta,dm6,iMet,False,0,2
3,Insecta,dp4,iMet,False,0,1
4,Insecta,droAna3,iMet,False,0,2
5,Insecta,droEre2,iMet,False,0,1
6,Insecta,droGri2,iMet,False,0,3
7,Insecta,droMoj3,iMet,False,0,1
8,Insecta,droPer1,iMet,False,0,1
9,Insecta,droSec1,iMet,False,0,1
10,Insecta,droSim1,iMet,False,0,2


This is a pretty strong non-Fungi iMet identity element (which is known). M&G note some Cs here and there, plus a G33 in _Candida albicans_ that disrupts the anticodon arm and deforms the tRNA, decreasing charging efficiency.

In [37]:
identities %>% select(species, quality, isotype, isotype_ac) %>%
  filter(quality) %>%
  group_by(isotype_ac, species) %>% summarize(count=n()>=1) %>% 
  group_by(isotype_ac) %>% summarize(count=sum(count)) %>% arrange(count)

Unnamed: 0,isotype_ac,count
1,Tyr,143
2,His,150
3,Trp,154
4,Asp,156
5,Lys,158
6,Glu,159
7,Met,159
8,Cys,161
9,Ile,161
10,Pro,161


## Other thoughts

- Although the synthetase and tRNA should represent a stronger evolutionary signal w.r.t speciation, this isn't the case for tRNA processing that is not isotype-specific.
    - For instance, yeast has an unusually high amount of introns. 
    - This could apply to other processing enzymes, e.g., RNase P, RNase Z, etc.


# Strategy 2: Isotype-specific features

In [117]:
iso_cutoff_freqs = data.frame()
for (cutoff in c(0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0)) {
  df = clade_iso_ac_freqs %>%
    group_by(isotype, positions, variable) %>%
    summarize(count=sum(value), freq=sum(value)/sum(total)) %>%
    filter(freq >= cutoff) %>%
    mutate(cutoff=as.character(cutoff)) %>%
    select(isotype, positions, variable, freq, cutoff) %>%
    group_by(isotype, positions) %>%
    arrange(code_groups[variable], desc(freq)) %>%
    filter(row_number(positions) == 1)
  if (nrow(iso_cutoff_freqs) == 0) iso_cutoff_freqs = df
  else iso_cutoff_freqs = rbind(iso_cutoff_freqs, df)
}

In [165]:
head(cutoff_freqs)

Unnamed: 0,positions,variable,freq,cutoff,isotype
1,X17a,Absent,0.998238323245099,0.5,
2,X14,A,0.996431475291354,0.5,
3,X8,U,0.995482880115638,0.5,
4,X55,U,0.994714969735297,0.5,
5,X19,G,0.994353600144548,0.5,
6,X21,A,0.993901888156112,0.5,


In [232]:
iso_cutoff_freqs %>% 
  mutate(variable = as.character(variable)) %>%
  bind_rows(isotype_specific %>% mutate(variable = identity, freq = 0.9, cutoff = "Isotype-specific") %>% select(-identity)) %>%
  filter(positions %in% c(names(paired_positions), names(single_positions))) %>%
  filter(!(positions %in% cutoff_freqs$positions[cutoff_freqs$cutoff == 0.9 & cutoff_freqs$variable %in% c("A", "C", "G", "U", "AU", "UA", "GC", "CG", "UG", "GU")])) %>%
  filter(cutoff %in% c(0.95, 'Isotype-specific')) %>%
  select(-freq) %>%
  spread(cutoff, variable) %>% # generate NAs for missing identities
  gather(cutoff, variable, -isotype, -positions) %>%
  group_by(isotype, positions) %>%
  arrange(cutoff) %>% 
  summarize(identity = paste(variable, collapse = " / ")) %>%
  spread(isotype, identity, fill = '-') %>%
  kable(format = 'html') %>% as.character %>% display_html

positions,Ala,Arg,Asn,Asp,Cys,Gln,Glu,Gly,His,Ile,iMet,Leu,Lys,Met,Phe,Pro,Ser,Thr,Trp,Tyr,Val
X1.72,Paired / GC,Paired / KetoAmino,Paired / KetoAmino,Paired / KetoAmino,GC / GC,GC / GC,UA / UA,GC / PurinePyrimidine,GC / GC,GC / GC,AU / AU,PurinePyrimidine / GC,Paired / Paired,GC / GC,GC / GC,GC / GC,GC / GC,GC / GC,GC / GC,CG / CG,GC / PurinePyrimidine
X10.25,Paired / Paired,GC / GC,GC / GC,GU / Paired,GC / Paired,GC / PurinePyrimidine,Paired / Paired,GU / GU,Paired / Paired,GC / PurinePyrimidine,Paired / Paired,StrongPair / PurinePyrimidine,GC / Paired,GC / Paired,Paired / PurinePyrimidine,GU / Paired,Paired / Paired,GC / GC,GC / Paired,GC / GC,Paired / Paired
X11.24,PyrimidinePurine / PyrimidinePurine,CG / CG,Paired / CG,UA / UA,CG / PyrimidinePurine,UA / UA,UA / UA,UA / UA,UA / PyrimidinePurine,CG / CG,CG / PyrimidinePurine,CG / CG,CG / CG,PyrimidinePurine / PyrimidinePurine,CG / CG,UA / PyrimidinePurine,CG / CG,CG / PyrimidinePurine,PyrimidinePurine / PyrimidinePurine,CG / PyrimidinePurine,Paired / PyrimidinePurine
X12.23,Paired / KetoAmino,Paired / StrongPair,Paired / Paired,Paired / Paired,PyrimidinePurine / NA,StrongPair / Paired,StrongPair / StrongPair,NA / Paired,Paired / Paired,Paired / Paired,GC / KetoAmino,CG / CG,UA / UA,Paired / Paired,UA / KetoAmino,CG / Paired,Paired / AminoKeto,PyrimidinePurine / UA,Paired / NA,PyrimidinePurine / Paired,StrongPair / Paired
X13.22,-,NA / PyrimidinePurine,-,Paired / NA,CG / Paired,-,-,-,-,-,CG / CG,Mismatched / NA,CG / CG,CG / NA,CG / CG,Mismatched / NA,Mismatched / NA,-,CG / Paired,-,-
X15,Purine / V,Purine / A,Purine / Purine,Purine / Purine,G / G,Purine / Purine,Purine / Purine,Purine / Purine,G / G,Purine / Purine,G / G,G / G,Purine / Purine,G / Purine,G / Strong,G / G,G / G,G / G,A / Purine,G / G,G / G
X15.48,-,PurinePyrimidine / AU,PurinePyrimidine / NA,PurinePyrimidine / NA,GC / GC,PurinePyrimidine / NA,-,-,GC / Paired,PurinePyrimidine / PurinePyrimidine,GC / GC,GC / GC,Paired / PurinePyrimidine,GC / PurinePyrimidine,Paired / NA,GC / GC,GC / GC,GC / GC,AU / PurinePyrimidine,GC / GC,GC / GC
X16,D / H,Pyrimidine / U,Pyrimidine / Pyrimidine,U / Pyrimidine,B / NA,Pyrimidine / Pyrimidine,Pyrimidine / B,Pyrimidine / Pyrimidine,U / Keto,Pyrimidine / Pyrimidine,Pyrimidine / B,Pyrimidine / Pyrimidine,Pyrimidine / U,Pyrimidine / Pyrimidine,Pyrimidine / H,D / Keto,Weak / U,B / Pyrimidine,Pyrimidine / Pyrimidine,Pyrimidine / Pyrimidine,Pyrimidine / Pyrimidine
X17,-,Absent / Absent,-,Absent / Absent,Absent / NA,Absent / Absent,-,Absent / Absent,Absent / Absent,NA / Pyrimidine,Absent / Absent,-,Pyrimidine / Pyrimidine,-,U / NA,Absent / Absent,Absent / Absent,-,Absent / NA,U / NA,-
X17a,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / NA,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent,Absent / Absent


Literally hundreds of candidates here. I want to choose a position that is (a) not involved in 3D interactions, (b) not involved in base pairing, (c) has clear isotype specificity, (d) does not have clear functionality (aside from intron splicing, but this is an exception), (e) is a single position (we looked at a lot of base pairs above), and (f) differs between 99% and isotype-specific cutoffs. It'd also be nice if the isotype is not Pro and is of Type I, and the exception is not from Fungi.

First - we try Tyr-A38:

In [197]:
iso_cutoff_freqs %>% mutate(variable = as.character(variable)) %>%
  bind_rows(isotype_specific %>% mutate(variable = identity, freq = 0.9, cutoff = "Isotype-specific") %>% select(-identity)) %>%
  filter(isotype == "Tyr" & positions == "X38") %>%
  spread(cutoff, variable, fill = '-') %>% select(-freq)

Unnamed: 0,isotype,positions,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1,Isotype-specific
1,Tyr,X38,-,-,-,-,-,-,-,-,Amino
2,Tyr,X38,A,A,A,A,A,A,A,-,-
3,Tyr,X38,-,-,-,-,-,-,-,V,-


#### No. of Tyr RNAs without A38

In [198]:
identities %>% select(clade, species, isotype, quality, X38) %>%
  filter(quality & isotype == "Tyr" & X38 != 'A') %>% nrow()

#### Which species or clades fail the check?

In [202]:
## Clade check
best_freqs %>% filter(positions == 'X38' & isotype == "Tyr") %>% group_by(clade) %>%
  summarize(status=sum((variable == 'A')*count)/sum(count) == 1) %>%
  filter(!status)

## Species check
identities %>% select(clade, species, isotype, quality, X38) %>%
  filter(quality & isotype == "Tyr") %>%
  group_by(clade, species) %>%
  summarize(status=sum(X38 == 'A')/n() >= 0.1,
            freq=sum(X38 == 'A'), tRNAs=n()) %>%
  filter(!status)

Unnamed: 0,clade,status


Unnamed: 0,clade,species,status,freq,tRNAs
1,Fungi,sporReil_SRZ2,0,0,2
2,Fungi,ustiMayd_521,0,0,1


Yet another example with these two fungi. Next, we try Asp-U59:

#### No. of Asp RNAs without U59

In [208]:
iso_cutoff_freqs %>% mutate(variable = as.character(variable)) %>%
  bind_rows(isotype_specific %>% mutate(variable = identity, freq = 0.9, cutoff = "Isotype-specific") %>% select(-identity)) %>%
  filter(isotype == "Asp" & positions == "X59") %>%
  spread(cutoff, variable, fill = '-') %>% select(-freq)

Unnamed: 0,isotype,positions,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1,Isotype-specific
1,Asp,X59,-,-,-,-,-,-,-,-,H
2,Asp,X59,U,U,U,U,U,U,-,-,-
3,Asp,X59,-,-,-,-,-,-,H,H,-


#### Which species or clades fail the check?

In [219]:
## Clade check
best_freqs %>% filter(positions == 'X59' & isotype == "Asp") %>% group_by(clade) %>%
  summarize(status=sum((variable == 'U')*count)/sum(count) == 1) %>%
  filter(!status)

## Species check
identities %>% select(clade, species, isotype, quality, X59) %>%
  filter(quality & isotype == "Asp") %>%
  group_by(clade, species) %>%
  summarize(status=sum(X59 == 'U')/n() >= 0.1,
            freq=sum(X59 == 'U'), tRNAs=n()) %>%
  filter(!status)

Unnamed: 0,clade,status


Unnamed: 0,clade,species,status,freq,tRNAs
1,Fungi,aspeFumi_AF293,0,0,2
2,Fungi,aspeNidu_FGSC_A4,0,0,1
3,Fungi,enceCuni_GB_M1,0,0,1
4,Fungi,enceHell_ATCC50504,0,0,1
5,Fungi,enceInte_ATCC50506,0,0,1
6,Fungi,enceRoma_SJ_2008,0,0,1


More fungi examples. Next, we try Ile 29:41.

#### Ile G29:C41

In [224]:
identities %>% select(clade, species, isotype, quality, X29.41) %>%
  filter(quality & isotype == "Ile") %>%
  group_by(clade, species) %>% head

Unnamed: 0,clade,species,isotype,quality,X29.41
1,Insecta,aedAeg1,Ile,1,G:C
2,Insecta,aedAeg1,Ile,1,G:C
3,Insecta,aedAeg1,Ile,1,G:C
4,Insecta,aedAeg1,Ile,1,G:C
5,Insecta,aedAeg1,Ile,1,G:C
6,Insecta,aedAeg1,Ile,1,G:C


In [225]:
## No. tRNAs without G29:C41
identities %>% select(clade, species, isotype, quality, X29.41) %>%
  filter(quality & isotype == "Ile" & X29.41 != 'G:C') %>% nrow()

## Clade check
best_freqs %>% filter(positions == 'X29.41' & isotype == "Ile") %>% group_by(clade) %>%
  summarize(status=sum((variable == 'GC')*count)/sum(count) == 1) %>%
  filter(!status)

## Species check
identities %>% select(clade, species, isotype, quality, X29.41) %>%
  filter(quality & isotype == "Ile") %>%
  group_by(clade, species) %>%
  summarize(status=sum(X29.41 == 'G:C')/n() >= 0.1,
            freq=sum(X29.41 == 'G:C'), tRNAs=n()) %>%
  filter(!status)

Unnamed: 0,clade,status


Unnamed: 0,clade,species,status,freq,tRNAs
1,Fungi,schiPomb_972H,0,0,1
2,Fungi,sporReil_SRZ2,0,0,1
3,Fungi,ustiMayd_521,0,0,1


#### Cys A38

In [226]:
## No. tRNAs without A38
identities %>% select(clade, species, isotype, quality, X38) %>%
  filter(quality & isotype == "Cys" & X38 != 'A') %>% nrow()

## Clade check
best_freqs %>% filter(positions == 'X38' & isotype == "Cys") %>% group_by(clade) %>%
  summarize(status=sum((variable == 'A')*count)/sum(count) == 1) %>%
  filter(!status)

## Species check
identities %>% select(clade, species, isotype, quality, X38) %>%
  filter(quality & isotype == "Cys") %>%
  group_by(clade, species) %>%
  summarize(status=sum(X38 == 'A')/n() >= 0.1,
            freq=sum(X38 == 'A'), tRNAs=n()) %>%
  filter(!status)

Unnamed: 0,clade,status


Unnamed: 0,clade,species,status,freq,tRNAs
1,Fungi,aspeNidu_FGSC_A4,0,0,1
2,Fungi,chaeTher_VAR_THERMOPHILUM_DSM1,0,0,1
3,Fungi,myceTher_ATCC42464,0,0,2
4,Fungi,neurCras_OR74A,0,0,1
5,Fungi,thieTerr_NRRL_8126,0,0,2
6,Nematoda,panRed1,0,0,1


Finally, an example outside of fungi!

In [237]:
identities %>% filter(species == "panRed1", isotype == "Cys") %>% kable(format = 'html') %>% as.character %>% display_html

clade,domain,isotype,seqname,species,species_long,taxid,isotype_best,anticodon,score,isoscore,GC,intron,insertions,deletions,D.loop,AC.loop,TPC.loop,V.arm,quality,restrict,X1.72,X1,X1i1,X2.71,X2,X2i1,X3.70,X3,X3i1,X3i2,X3i3,X3i4,X3i5,X3i6,X4.69,X4,X4i1,X4i2,X4i3,X4i4,X4i5,X4i6,X4i7,X4i8,X4i9,X4i10,X4i11,X5.68,X5,X5i1,X5i2,X5i3,X5i4,X5i5,X5i6,X6.67,X6,X6i1,X7.66,X7,X7i1,X7i2,X7i3,X7i4,X7i5,X7i6,X7i7,X7i8,X7i9,X8,X8.14.21,X8.14,X8i1,X8i2,X9,X9.12.23,X9.23,X9i1,X10.25,X10,X10.25.45,X10.45,X10i1,X11.24,X11,X12.23,X12,X12i1,X13.22,X13,X13.22.46,X14,X14i1,X14i2,X14i3,X14i4,X14i5,X14i6,X14i7,X14i8,X14i9,X14i10,X14i11,X14i12,X14i13,X15,X15.48,X16,X16i1,X16i2,X16i3,X16i4,X16i5,X16i6,X16i7,X16i8,X16i9,X16i10,X16i11,X16i12,X16i13,X16i14,X16i15,X16i16,X17,X17a,X18,X18.55,X19,X19.56,X19i1,X19i2,X19i3,X19i4,X19i5,X19i6,X19i7,X19i8,X19i9,X19i10,X20,X20i1,X20i2,X20i3,X20i4,X20a,X20b,X21,X22,X22.46,X22i1,X23,X23i1,X23i2,X23i3,X23i4,X23i5,X23i6,X24,X24i1,X25,X25i1,X25i2,X25i3,X25i4,X26,X26.44,X26i1,X26i2,X26i3,X26i4,X26i5,X26i6,X26i7,X26i8,X26i9,X26i10,X27.43,X27,X27i1,X27i2,X27i3,X27i4,X28.42,X28,X28i1,X29.41,X29,X29i1,X29i2,X30.40,X30,X30i1,X31.39,X31,X32,X33,X34,X35,X35i1,X36,X37,X37i1,X37i2,X37i3,X37i4,X37i5,X37i6,X37i7,X37i8,X37i9,X38,X38i1,X39,X39i1,X39i2,X40,X40i1,X40i2,X41,X41i1,X41i2,X42,X42i1,X43,X44,X44i1,X44i2,X44i3,X44i4,X44i5,X44i6,X44i7,X44i8,X44i9,X44i10,X44i11,X44i12,X44i13,X44i14,X44i15,X44i16,X44i17,X44i18,X44i19,X44i20,X44i21,X44i22,X44i23,X45,V11.V21,V12.V22,V13.V23,V14.V24,V15.V25,V16.V26,V17.V27,V1,V2,V3,V4,V5,V11,V12,V13,V14,V15,V16,V17,V21,V22,V23,V24,V25,V26,V27,X46,X47,X47i1,X47i2,X47i3,X48,X49.65,X49,X49i1,X50.64,X50,X50i1,X50i2,X50i3,X50i4,X50i5,X50i6,X50i7,X51.63,X51,X51i1,X51i2,X51i3,X52.62,X52,X52i1,X53.61,X53,X53i1,X54,X54.58,X54i1,X54i2,X54i3,X54i4,X55,X55i1,X55i2,X55i3,X55i4,X55i5,X56,X56i1,X56i2,X56i3,X56i4,X57,X57i1,X58,X58i1,X58i2,X59,X59i1,X60,X60i1,X60i2,X60i3,X60i4,X60i5,X60i6,X60i7,X60i8,X60i9,X60i10,X60i11,X61,X61i1,X62,X63,X64,X64i1,X64i2,X65,X65i1,X65i2,X65i3,X65i4,X65i5,X65i6,X65i7,X65i8,X65i9,X66,X66i1,X67,X67i1,X68,X68i1,X68i2,X69,X69i1,X70,X70i1,X70i2,X70i3,X70i4,X70i5,X70i6,X70i7,X70i8,X70i9,X71,X71i1,X71i2,X72,X73,X74,X75,X76
Nematoda,eukaryota,Cys,panRed1_KB455139.1.trna1-CysGCA,panRed1,Panagrellus redivivus MT8872,.,Cys,GCA,69.8,91.5,0.4146341,0,0,0,7,7,8,0,True,False,G:C,G,.,G:C,G,.,G:C,G,.,.,.,.,.,.,U:G,U,.,.,.,.,.,.,.,.,.,.,.,C:G,C,.,.,.,.,.,.,U:G,U,.,A:U,A,.,.,.,.,.,.,.,.,.,U,U:A:A,U:A,.,.,A,A:U:A,A:A,.,G:C,G,G:C:G,G:G,.,C:G,C,U:A,U,.,C:G,C,C:G:G,A,.,.,.,.,.,.,.,.,.,.,.,.,.,G,G:C,U,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,-,-,G,G:U,G,G:C,.,.,.,.,.,.,.,.,.,.,C,.,.,.,.,-,-,A,G,G:G,.,A,.,.,.,.,.,.,G,.,C,.,.,.,.,A,A:A,.,.,.,.,.,.,.,.,.,.,A:U,A,.,.,.,.,U:A,U,.,C:G,C,.,.,G:C,G,.,G:C,G,C,U,G,C,.,A,G,.,.,.,.,.,.,.,.,.,U,.,C,.,.,C,.,.,G,.,.,A,.,U,A,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,G,-:-,-:-,-:-,-:-,-:-,-:-,-:-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,G,U,.,.,.,C,C:G,C,.,C:G,C,.,.,.,.,.,.,.,C:G,C,.,.,.,G:C,G,.,G:C,G,.,U,U:A,.,.,.,.,U,.,.,.,.,.,C,.,.,.,.,A,.,A,.,.,C,.,U,.,.,.,.,.,.,.,.,.,.,.,C,.,C,G,G,.,.,G,.,.,.,.,.,.,.,.,.,U,.,G,.,G,.,.,G,.,C,.,.,.,.,.,.,.,.,.,C,.,.,C,U,-,-,-
Nematoda,eukaryota,Cys,panRed1_KB455336.1.trna2-CysGCA,panRed1,Panagrellus redivivus MT8872,.,Cys,GCA,69.8,91.5,0.4146341,0,0,0,7,7,8,0,False,True,G:C,G,.,G:C,G,.,G:C,G,.,.,.,.,.,.,U:G,U,.,.,.,.,.,.,.,.,.,.,.,C:G,C,.,.,.,.,.,.,U:G,U,.,A:U,A,.,.,.,.,.,.,.,.,.,U,U:A:A,U:A,.,.,A,A:U:A,A:A,.,G:C,G,G:C:G,G:G,.,C:G,C,U:A,U,.,C:G,C,C:G:G,A,.,.,.,.,.,.,.,.,.,.,.,.,.,G,G:C,U,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,-,-,G,G:U,G,G:C,.,.,.,.,.,.,.,.,.,.,C,.,.,.,.,-,-,A,G,G:G,.,A,.,.,.,.,.,.,G,.,C,.,.,.,.,A,A:A,.,.,.,.,.,.,.,.,.,.,A:U,A,.,.,.,.,U:A,U,.,C:G,C,.,.,G:C,G,.,G:C,G,C,U,G,C,.,A,G,.,.,.,.,.,.,.,.,.,U,.,C,.,.,C,.,.,G,.,.,A,.,U,A,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,G,-:-,-:-,-:-,-:-,-:-,-:-,-:-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,G,U,.,.,.,C,C:G,C,.,C:G,C,.,.,.,.,.,.,.,C:G,C,.,.,.,G:C,G,.,G:C,G,.,U,U:A,.,.,.,.,U,.,.,.,.,.,C,.,.,.,.,A,.,A,.,.,C,.,U,.,.,.,.,.,.,.,.,.,.,.,C,.,C,G,G,.,.,G,.,.,.,.,.,.,.,.,.,U,.,G,.,G,.,.,G,.,C,.,.,.,.,.,.,.,.,.,C,.,.,C,U,-,-,-
Nematoda,eukaryota,Cys,panRed1_KB455421.1.trna1-CysGCA,panRed1,Panagrellus redivivus MT8872,.,Cys,GCA,69.8,91.5,0.4146341,0,0,0,7,7,8,0,False,True,G:C,G,.,G:C,G,.,G:C,G,.,.,.,.,.,.,U:G,U,.,.,.,.,.,.,.,.,.,.,.,C:G,C,.,.,.,.,.,.,U:G,U,.,A:U,A,.,.,.,.,.,.,.,.,.,U,U:A:A,U:A,.,.,A,A:U:A,A:A,.,G:C,G,G:C:G,G:G,.,C:G,C,U:A,U,.,C:G,C,C:G:G,A,.,.,.,.,.,.,.,.,.,.,.,.,.,G,G:C,U,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,-,-,G,G:U,G,G:C,.,.,.,.,.,.,.,.,.,.,C,.,.,.,.,-,-,A,G,G:G,.,A,.,.,.,.,.,.,G,.,C,.,.,.,.,A,A:A,.,.,.,.,.,.,.,.,.,.,A:U,A,.,.,.,.,U:A,U,.,C:G,C,.,.,G:C,G,.,G:C,G,C,U,G,C,.,A,G,.,.,.,.,.,.,.,.,.,U,.,C,.,.,C,.,.,G,.,.,A,.,U,A,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,G,-:-,-:-,-:-,-:-,-:-,-:-,-:-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,G,U,.,.,.,C,C:G,C,.,C:G,C,.,.,.,.,.,.,.,C:G,C,.,.,.,G:C,G,.,G:C,G,.,U,U:A,.,.,.,.,U,.,.,.,.,.,C,.,.,.,.,A,.,A,.,.,C,.,U,.,.,.,.,.,.,.,.,.,.,.,C,.,C,G,G,.,.,G,.,.,.,.,.,.,.,.,.,U,.,G,.,G,.,.,G,.,C,.,.,.,.,.,.,.,.,.,C,.,.,C,U,-,-,-
Nematoda,eukaryota,Cys,panRed1_KB455423.1.trna5-CysGCA,panRed1,Panagrellus redivivus MT8872,.,Cys,GCA,69.8,91.5,0.4146341,0,0,0,7,7,8,0,False,True,G:C,G,.,G:C,G,.,G:C,G,.,.,.,.,.,.,U:G,U,.,.,.,.,.,.,.,.,.,.,.,C:G,C,.,.,.,.,.,.,U:G,U,.,A:U,A,.,.,.,.,.,.,.,.,.,U,U:A:A,U:A,.,.,A,A:U:A,A:A,.,G:C,G,G:C:G,G:G,.,C:G,C,U:A,U,.,C:G,C,C:G:G,A,.,.,.,.,.,.,.,.,.,.,.,.,.,G,G:C,U,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,-,-,G,G:U,G,G:C,.,.,.,.,.,.,.,.,.,.,C,.,.,.,.,-,-,A,G,G:G,.,A,.,.,.,.,.,.,G,.,C,.,.,.,.,A,A:A,.,.,.,.,.,.,.,.,.,.,A:U,A,.,.,.,.,U:A,U,.,C:G,C,.,.,G:C,G,.,G:C,G,C,U,G,C,.,A,G,.,.,.,.,.,.,.,.,.,U,.,C,.,.,C,.,.,G,.,.,A,.,U,A,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,G,-:-,-:-,-:-,-:-,-:-,-:-,-:-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,G,U,.,.,.,C,C:G,C,.,C:G,C,.,.,.,.,.,.,.,C:G,C,.,.,.,G:C,G,.,G:C,G,.,U,U:A,.,.,.,.,U,.,.,.,.,.,C,.,.,.,.,A,.,A,.,.,C,.,U,.,.,.,.,.,.,.,.,.,.,.,C,.,C,G,G,.,.,G,.,.,.,.,.,.,.,.,.,U,.,G,.,G,.,.,G,.,C,.,.,.,.,.,.,.,.,.,C,.,.,C,U,-,-,-
Nematoda,eukaryota,Cys,panRed1_KB455423.1.trna6-CysGCA,panRed1,Panagrellus redivivus MT8872,.,Cys,GCA,69.8,91.5,0.4146341,0,0,0,7,7,8,0,False,True,G:C,G,.,G:C,G,.,G:C,G,.,.,.,.,.,.,U:G,U,.,.,.,.,.,.,.,.,.,.,.,C:G,C,.,.,.,.,.,.,U:G,U,.,A:U,A,.,.,.,.,.,.,.,.,.,U,U:A:A,U:A,.,.,A,A:U:A,A:A,.,G:C,G,G:C:G,G:G,.,C:G,C,U:A,U,.,C:G,C,C:G:G,A,.,.,.,.,.,.,.,.,.,.,.,.,.,G,G:C,U,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,-,-,G,G:U,G,G:C,.,.,.,.,.,.,.,.,.,.,C,.,.,.,.,-,-,A,G,G:G,.,A,.,.,.,.,.,.,G,.,C,.,.,.,.,A,A:A,.,.,.,.,.,.,.,.,.,.,A:U,A,.,.,.,.,U:A,U,.,C:G,C,.,.,G:C,G,.,G:C,G,C,U,G,C,.,A,G,.,.,.,.,.,.,.,.,.,U,.,C,.,.,C,.,.,G,.,.,A,.,U,A,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,G,-:-,-:-,-:-,-:-,-:-,-:-,-:-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,G,U,.,.,.,C,C:G,C,.,C:G,C,.,.,.,.,.,.,.,C:G,C,.,.,.,G:C,G,.,G:C,G,.,U,U:A,.,.,.,.,U,.,.,.,.,.,C,.,.,.,.,A,.,A,.,.,C,.,U,.,.,.,.,.,.,.,.,.,.,.,C,.,C,G,G,.,.,G,.,.,.,.,.,.,.,.,.,U,.,G,.,G,.,.,G,.,C,.,.,.,.,.,.,.,.,.,C,.,.,C,U,-,-,-


There are 5 copies of Cys in _Panagrellus redivivus_, a nematode commonly used to feed aquarium fish.

#### Asp-A46

In [238]:
## No. tRNAs without A46
identities %>% select(clade, species, isotype, quality, X46) %>%
  filter(quality & isotype == "Asp" & X46 != 'A') %>% nrow()

## Clade check
best_freqs %>% filter(positions == 'X46' & isotype == "Asp") %>% group_by(clade) %>%
  summarize(status=sum((variable == 'A')*count)/sum(count) == 1) %>%
  filter(!status)

## Species check
identities %>% select(clade, species, isotype, quality, X46) %>%
  filter(quality & isotype == "Asp") %>%
  group_by(clade, species) %>%
  summarize(status=sum(X46 == 'A')/n() >= 0.1,
            freq=sum(X46 == 'A'), tRNAs=n()) %>%
  filter(!status)

Unnamed: 0,clade,status


Unnamed: 0,clade,species,status,freq,tRNAs


Is this enriched in certain clades?

In [239]:
identities %>% select(clade, species, isotype, quality, X46) %>%
  filter(quality & isotype == "Asp" & X46 != 'A') %>% group_by(clade) %>% tally()

Unnamed: 0,clade,n
1,Mammalia,23
2,Nematoda,1
3,Spermatophyta,2


How about certain species?

In [251]:
identities %>% select(clade, species, isotype, quality, X46) %>%
  group_by(species) %>%
  mutate(ntRNAs = n()) %>%
  ungroup() %>%
  filter(quality & isotype == "Asp" & X46 != 'A') %>% 
  group_by(clade, species, ntRNAs) %>% 
  summarize(countG46 = n()) %>%
  ungroup() %>%
  mutate(nPseudo = c(191918, 228656, 1908, 222374, 102, 1679, 93827, 244, 154651, 103, 8, 12))

Unnamed: 0,clade,species,ntRNAs,countG46,nPseudo
1,Mammalia,balAcu1,436,7,191918
2,Mammalia,bosTau8,602,3,228656
3,Mammalia,macEug2,308,2,1908
4,Mammalia,oviAri3,527,1,222374
5,Mammalia,panTro4,399,1,102
6,Mammalia,sarHar1,379,2,1679
7,Mammalia,sorAra2,313,1,93827
8,Mammalia,tarSyr2,288,1,244
9,Mammalia,turTru2,385,5,154651
10,Nematoda,ce11,559,1,103


What's the typical score for these guys?

In [257]:
# Score for all identified species
identities %>% select(clade, species, isotype, quality, X46, score, isoscore) %>%
  filter(quality, isotype == "Asp") %>%
  group_by(X46) %>%
  summarize(score = mean(score), isoscore = mean(isoscore))

# Score for low pseudogene count species
identities %>% select(clade, species, isotype, quality, X46, score, isoscore) %>%
  filter(quality, isotype == "Asp") %>%
  mutate(species = ifelse(species %in% c("panTro4", "tarSyr2", "ce11", "araTha1", "orySat7"), "Low pseudo", "Hi pseudo")) %>%
  group_by(species, X46) %>%
  summarize(score = mean(score), isoscore = mean(isoscore))

Unnamed: 0,X46,score,isoscore
1,A,63.4185995623632,100.421881838074
2,G,60.95,88.6461538461539


Unnamed: 0,species,X46,score,isoscore
1,Hi pseudo,A,63.4113378684807,100.580498866213
2,Hi pseudo,G,60.2142857142857,87.4714285714286
3,Low pseudo,A,63.61875,96.05
4,Low pseudo,G,64.04,93.58


These actually score fairly well. In high tRNA pseudogene species, there's a slight difference between A46 and G46 Asp-specific scores, whereas in low pseudogene species, there's a negligible difference.

To be fair, there aren't a lot of low pseudogene species and tRNAs to compare. It's a small signal, a very small piece of evidence. I'll run the low pseudo guys through the single tRNA analysis pipeline.

In [262]:
identities %>%
  filter(quality, isotype == "Asp", X46 == "G") %>% 
  filter(species %in% c("panTro4", "tarSyr2", "ce11", "araTha1", "orySat7"))

Unnamed: 0,clade,domain,isotype,seqname,species,species_long,taxid,isotype_best,anticodon,score,⋯,X70i8,X70i9,X71,X71i1,X71i2,X72,X73,X74,X75,X76
1,Spermatophyta,eukaryota,Asp,araTha1_chr1.trna30-AspGTC,araTha1,Arabidopsis thaliana (TAIR10 Feb 2011),.,Asp,GTC,71.4,⋯,.,.,G,.,.,C,G,-,-,-
2,Nematoda,eukaryota,Asp,ce11_chrIV.trna8-AspGTC,ce11,Caenorhabditis elegans (C. elegans Feb 2013 WBcel235/ce11),.,Asp,GTC,63.3,⋯,.,.,G,.,.,A,G,-,-,-
3,Spermatophyta,eukaryota,Asp,orySat7_8.trna18-AspGTC,orySat7,Oryza sativa,.,Asp,GTC,60.3,⋯,.,.,G,.,.,C,G,-,-,-
4,Mammalia,eukaryota,Asp,panTro4_chr6.trna51-AspGTC,panTro4,Pan troglodytes (Chimp Feb. 2011 CSAC 2.1.4/panTro4),.,Asp,GTC,68.1,⋯,.,.,G,.,.,A,G,-,-,-
5,Mammalia,eukaryota,Asp,tarSyr2_KE931301v1.trna1-AspGTC,tarSyr2,Tarsius syrichta (Tarsier Sep. 2013 Tarsius_syrichta-2.0.1/tarSyr2),.,Asp,GTC,57.1,⋯,.,.,G,.,.,A,G,-,-,-


In [263]:
identities %>% filter(species == "araTha1", isotype == "Asp")

Unnamed: 0,clade,domain,isotype,seqname,species,species_long,taxid,isotype_best,anticodon,score,⋯,X70i8,X70i9,X71,X71i1,X71i2,X72,X73,X74,X75,X76
1,Spermatophyta,eukaryota,Asp,araTha1_chr1.trna5-AspGTC,araTha1,Arabidopsis thaliana (TAIR10 Feb 2011),.,Asp,GTC,69.8,⋯,.,.,G,.,.,C,G,-,-,-
2,Spermatophyta,eukaryota,Asp,araTha1_chr1.trna27-AspGTC,araTha1,Arabidopsis thaliana (TAIR10 Feb 2011),.,Asp,GTC,69.8,⋯,.,.,G,.,.,C,G,-,-,-
3,Spermatophyta,eukaryota,Asp,araTha1_chr1.trna28-AspGTC,araTha1,Arabidopsis thaliana (TAIR10 Feb 2011),.,Asp,GTC,69.8,⋯,.,.,G,.,.,C,G,-,-,-
4,Spermatophyta,eukaryota,Asp,araTha1_chr1.trna30-AspGTC,araTha1,Arabidopsis thaliana (TAIR10 Feb 2011),.,Asp,GTC,71.4,⋯,.,.,G,.,.,C,G,-,-,-
5,Spermatophyta,eukaryota,Asp,araTha1_chr1.trna174-AspGTC,araTha1,Arabidopsis thaliana (TAIR10 Feb 2011),.,Asp,GTC,69.8,⋯,.,.,G,.,.,C,G,-,-,-
6,Spermatophyta,eukaryota,Asp,araTha1_chr1.trna176-AspGTC,araTha1,Arabidopsis thaliana (TAIR10 Feb 2011),.,Asp,GTC,69.8,⋯,.,.,G,.,.,C,G,-,-,-
7,Spermatophyta,eukaryota,Asp,araTha1_chr1.trna179-AspGTC,araTha1,Arabidopsis thaliana (TAIR10 Feb 2011),.,Asp,GTC,69.8,⋯,.,.,G,.,.,C,G,-,-,-
8,Spermatophyta,eukaryota,Asp,araTha1_chr1.trna212-AspGTC,araTha1,Arabidopsis thaliana (TAIR10 Feb 2011),.,Asp,GTC,69.8,⋯,.,.,G,.,.,C,G,-,-,-
9,Spermatophyta,eukaryota,Asp,araTha1_chr1.trna213-AspGTC,araTha1,Arabidopsis thaliana (TAIR10 Feb 2011),.,Asp,GTC,63.9,⋯,.,.,A,.,.,C,G,-,-,-
10,Spermatophyta,eukaryota,Asp,araTha1_chr2.trna33-AspGTC,araTha1,Arabidopsis thaliana (TAIR10 Feb 2011),.,Asp,GTC,61.7,⋯,.,.,G,.,.,C,G,-,-,-


- _A. thaliana_'s Asp is the highest scoring one, and is single copy. It differs from all other Asp tRNAs by one base (G46). Same with _C. elegans_.
- In _O. sativa_, _T. syrichta_, and _P. troglodytes_, it's single copy.

