In [None]:
library(ggplot2)
library(reshape2)
library(dplyr)
library(stringr)
library(tidyr)
theme_set(theme_bw())
options(repr.plot.width=7, repr.plot.height=4)

In [2]:
identities = read.delim('identities.tsv', sep='\t', stringsAsFactors=FALSE)
identities$quality = as.logical(identities$quality)
identities$restrict = as.logical(identities$restrict)

# Introduction

Here, I measure base frequencies and tabulate identity elements by clade. Historically, a base has been called an identity element only if 100% of tRNAs carry it. I relax that requirement to a frequency above 95% (the cutoff used in the code cells of this notebook) among the quality and species-restricted tRNAs.
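
As a minimal sketch of what "measuring base frequencies" means here, the base R snippet below computes per-position base frequencies on a toy table and applies a cutoff. The `toy` data frame and all variable names are hypothetical and only illustrate the idea; the notebook itself does this with dplyr on `identities`.

```r
# Hypothetical base calls at two positions across five tRNAs.
toy <- data.frame(
  X14 = c("A", "A", "A", "A", "A"),
  X15 = c("G", "A", "G", "G", "A"),
  stringsAsFactors = FALSE
)
bases <- c("A", "C", "G", "U")

# Per-position frequency of each base (rows: positions, columns: bases).
base_freqs <- t(sapply(toy, function(col) {
  as.vector(table(factor(col, levels = bases))) / length(col)
}))
colnames(base_freqs) <- bases

# A class frequency is the sum over its member bases.
purine <- base_freqs[, "A"] + base_freqs[, "G"]

cutoff <- 0.95  # same cutoff as the filter() calls in this notebook
passes_A      <- base_freqs[, "A"] > cutoff
passes_purine <- purine > cutoff
```

In this toy set, X14 passes as A, while X15 passes only as a purine because neither A nor G alone reaches the cutoff.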

Terminology:
- An **identity element** (IDE) is a base or base class (G, purine, etc.) located at a specific position.
- A **consensus** identity element for a clade is one that all of its subclades contain.
- A **clade-specific** identity element for a subclade is one that at least one sister subclade lacks. The sister subclade may have an IDE at the same position, but not the same one. A clade-specific IDE can belong to multiple clades.
    - For example, if archaea and bacteria have A41 but eukaryota have G41, then G41 is a eukaryota-specific IDE, and A41 is an archaea- and bacteria-specific IDE. This does not preclude R41 from being a consensus IDE.
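
The A41/G41 example can be sketched in a few lines of base R. This is an illustration only: `ide_at_41`, `symbol_bases`, `is_consensus`, and `find_clade_specific` are hypothetical names, and the sketch assumes each subclade's IDE is a single base rather than a base class.

```r
# Toy data: the IDE called at position 41 in each of three domains
# (values taken from the example above).
ide_at_41 <- c(Archaea = "A", Bacteria = "A", Eukaryota = "G")

# Bases covered by each symbol; only the symbols needed here.
symbol_bases <- list(A = "A", G = "G", R = c("A", "G"))

# A symbol is a consensus IDE if every subclade's IDE falls within it.
is_consensus <- function(symbol, ides) {
  all(ides %in% symbol_bases[[symbol]])
}

# Group subclades by their IDE; if more than one group exists,
# each group's IDE is specific to the subclades in that group.
find_clade_specific <- function(ides) {
  groups <- split(names(ides), ides)
  if (length(groups) > 1) groups else list()
}

r_is_consensus <- is_consensus("R", ide_at_41)
a_is_consensus <- is_consensus("A", ide_at_41)
specific <- find_clade_specific(ide_at_41)
```

Here R41 is a consensus IDE, A41 is not, and `specific` groups archaea and bacteria under A and eukaryota under G.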


## Base combination table

Reproduced from euk-Leu

Symbol   |   Bases  | Rule  | Rank
:--------|----------|-------|------
R | G A | Purine | 2
Y | U C | Pyrimidine | 2
K | G U | Keto | 2
M | A C | Amino | 2
S | G C | Strong | 2
W | A U | Weak | 2
B | C G U | not A | 3
D | G A U | not C | 3
H | A C U | not G | 3
V | G C A | not U | 3
R:Y | G:C A:U | Purine-Pyrimidine | 2
Y:R | C:G U:A | Pyrimidine-Purine | 2
S:S | G:C C:G | Strong | 2
W:W | A:U U:A | Weak | 2
W:O | G:U U:G | Wobble | 2
B:V | G:C C:G U:A | not A:U | 3
V:B | G:C C:G A:U | not U:A | 3
D:H | A:U U:A G:C | not C:G | 3
H:D | A:U U:A C:G | not G:C | 3
W:C | A:U U:A G:C C:G | Watson-Crick | 4
G:N | G:C G:U C:G U:G | G | 5
U:N | U:G U:A A:U G:U | U | 5
C:O | A:U U:A G:C C:G G:U U:G | Canonical pairs | 6
Y:Y | U:C C:U U:U C:C | Pyrimidine-pyrimidine mismatch | 7
R:R | A:G G:A A:A G:G | Purine-purine mismatch | 7
M:M | A:A G:G C:C U:U A:G A:C C:A C:U G:A U:C | Mismatch | 8
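
To make the single-base part of the table concrete, here is a hedged sketch that encodes those rows as a lookup and finds the narrowest symbol covering a set of observed bases. The names `iupac` and `narrowest_symbol` are hypothetical, the four plain bases are added for completeness, and the paired symbols are left out to keep the example short.

```r
# Single-base symbols from the table above, mapped to the bases they cover.
# The plain bases A, C, G, U are added here; they are not rows of the table.
iupac <- list(
  A = "A", C = "C", G = "G", U = "U",
  R = c("G", "A"), Y = c("U", "C"), K = c("G", "U"),
  M = c("A", "C"), S = c("G", "C"), W = c("A", "U"),
  B = c("C", "G", "U"), D = c("G", "A", "U"),
  H = c("A", "C", "U"), V = c("G", "C", "A")
)

# Return the symbol(s) that cover all observed bases with the fewest members.
narrowest_symbol <- function(observed) {
  covers <- vapply(iupac, function(b) all(observed %in% b), logical(1))
  if (!any(covers)) return(character(0))  # e.g. all four bases observed
  sizes <- lengths(iupac)[covers]
  names(sizes)[sizes == min(sizes)]
}

sym_purine <- narrowest_symbol(c("A", "G"))
sym_not_a  <- narrowest_symbol(c("C", "G", "U"))
sym_none   <- narrowest_symbol(c("A", "C", "G", "U"))
```

The function ranks symbols by how many bases they cover, which matches the Rank column for the two- and three-base rows.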

# Eukaryotic tRNA identity elements

## Consensus IDEs

I will start with the easiest case: IDEs that are almost perfectly conserved among all eukaryotes.

In [3]:
positions = colnames(identities)[which(str_detect(colnames(identities), 'X\\d+\\.'))]
positions = c(positions, 'X8', 'X9', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X20a', 'X21', 'X26', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X44', 'X45', 'X46', 'X47', 'X48', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X73')

In [95]:
euk_freqs = identities %>% select(match(positions, colnames(identities))) %>%
  gather(positions, bases) %>%
  group_by(positions, bases) %>%
  tally() %>%
  group_by(positions) %>%
  mutate(freq=n/sum(n)) %>%
  group_by(positions) %>%
  summarize(A = sum(freq[bases == "A"]),
            C = sum(freq[bases == "C"]),
            G = sum(freq[bases == "G"]),
            U = sum(freq[bases == "U"]),
            Purine = sum(freq[bases %in% c("A", "G")]),
            Pyrimidine = sum(freq[bases %in% c("C", "U")]),
            Weak = sum(freq[bases %in% c("A", "U")]),
            Strong = sum(freq[bases %in% c("G", "C")]),
            Amino = sum(freq[bases %in% c("A", "C")]),
            Keto = sum(freq[bases %in% c("G", "U")]),
            B = sum(freq[bases %in% c("C", "G", "U")]),
            D = sum(freq[bases %in% c("A", "G", "U")]),
            H = sum(freq[bases %in% c("A", "C", "U")]),
            V = sum(freq[bases %in% c("A", "C", "G")]),
            GC = sum(freq[bases == "G:C"]),
            AU = sum(freq[bases == "A:U"]),
            UA = sum(freq[bases == "U:A"]),
            CG = sum(freq[bases == "C:G"]),
            GU = sum(freq[bases == "G:U"]),
            UG = sum(freq[bases == "U:G"]),
            PurinePyrimidine = sum(freq[bases %in% c("A:U", "G:C")]),
            PyrimidinePurine = sum(freq[bases %in% c("U:A", "C:G")]),
            StrongPair = sum(freq[bases %in% c("G:C", "C:G")]),
            WeakPair = sum(freq[bases %in% c("A:U", "U:A")]),
            Wobble = sum(freq[bases %in% c("G:U", "U:G")]),
            Paired = sum(freq[bases %in% c("A:U", "U:A", "C:G", "G:C", "G:U", "U:G")]),
            Mismatched = sum(freq[bases %in% c("A:A", "G:G", "C:C", "U:U", "A:G", "A:C", "C:A", "C:U", "G:A", "U:C")])
            )

In [96]:
consensus = euk_freqs %>%
  melt %>% 
  filter(value > 0.95) %>%
  group_by(positions) %>% # keep one call per position: the lowest passing frequency
  filter(row_number(value) == 1) %>%
  arrange(positions)
consensus# %>% unite("identity element", variable, str_replace("X", "", positions))

Using positions as id variables


Row | positions | variable | value
:---|-----------|----------|------
1 | X10.25 | Paired | 0.9849145
2 | X14 | A | 0.9942488
3 | X15 | Purine | 0.997079
4 | X18 | G | 0.9891507
5 | X18.55 | GU | 0.9785918
6 | X19 | G | 0.9878808
7 | X19.56 | Paired | 0.9835901
8 | X20 | H | 0.9689218
9 | X21 | A | 0.982819
10 | X26 | D | 0.9781654
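
The `row_number(value) == 1` filter keeps, for each position, the class with the smallest frequency above the cutoff. Because a broader class contains every narrower one, its frequency can only be equal or higher, so the smallest passing value usually belongs to the most specific class. A small base R sketch with made-up frequencies (the numbers and the name `ide_call` are hypothetical):

```r
# Hypothetical frequencies for one position; each broader class contains the
# narrower one, so its frequency can only be equal or higher.
freqs <- c(A = 0.994, Purine = 0.997, D = 0.999, C = 0.003)

cutoff <- 0.95
passing <- freqs[freqs > cutoff]

# Keep the single class with the lowest passing frequency,
# mirroring filter(row_number(value) == 1) in the cell above.
ide_call <- names(passing)[which.min(passing)]
```

With these numbers, A, Purine, and D all pass, and A is kept as the call.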


## Clade-specific IDEs

In [97]:
clade_freqs = identities %>%
  select(match(c('clade', positions), colnames(identities))) %>%
  gather(positions, bases, -clade) %>%
  group_by(clade, positions, bases) %>%
  tally() %>%
  group_by(clade, positions) %>%
  mutate(freq=n/sum(n)) %>%
  group_by(clade, positions) %>%
  summarize(A = sum(freq[bases == "A"]),
            C = sum(freq[bases == "C"]),
            G = sum(freq[bases == "G"]),
            U = sum(freq[bases == "U"]),
            Purine = sum(freq[bases %in% c("A", "G")]),
            Pyrimidine = sum(freq[bases %in% c("C", "U")]),
            Weak = sum(freq[bases %in% c("A", "U")]),
            Strong = sum(freq[bases %in% c("G", "C")]),
            Amino = sum(freq[bases %in% c("A", "C")]),
            Keto = sum(freq[bases %in% c("G", "U")]),
            B = sum(freq[bases %in% c("C", "G", "U")]),
            D = sum(freq[bases %in% c("A", "G", "U")]),
            H = sum(freq[bases %in% c("A", "C", "U")]),
            V = sum(freq[bases %in% c("A", "C", "G")]),
            GC = sum(freq[bases == "G:C"]),
            AU = sum(freq[bases == "A:U"]),
            UA = sum(freq[bases == "U:A"]),
            CG = sum(freq[bases == "C:G"]),
            GU = sum(freq[bases == "G:U"]),
            UG = sum(freq[bases == "U:G"]),
            PurinePyrimidine = sum(freq[bases %in% c("A:U", "G:C")]),
            PyrimidinePurine = sum(freq[bases %in% c("U:A", "C:G")]),
            StrongPair = sum(freq[bases %in% c("G:C", "C:G")]),
            WeakPair = sum(freq[bases %in% c("A:U", "U:A")]),
            Wobble = sum(freq[bases %in% c("G:U", "U:G")]),
            Paired = sum(freq[bases %in% c("A:U", "U:A", "C:G", "G:C", "G:U", "U:G")]),
            Mismatched = sum(freq[bases %in% c("A:A", "G:G", "C:C", "U:U", "A:G", "A:C", "C:A", "C:U", "G:A", "U:C")])
            )

In [98]:
clade_specific = clade_freqs %>%
  melt %>% 
  filter(value > 0.95) %>%
  group_by(clade, positions) %>% # keep one call per clade and position: the lowest passing frequency
  filter(row_number(value) == 1) %>%
  group_by(positions) %>%
  filter(n() != 7) %>% # drop positions where all 7 clades have a call
  group_by(positions, variable) %>%
  summarize(clades=toString(clade)) %>%
  arrange(positions)
clade_specific

Using clade, positions as id variables


Row | positions | variable | clades
:---|-----------|----------|-------
1 | X15.48 | PurinePyrimidine | Fungi
2 | X16 | Pyrimidine | Fungi, Nematoda
3 | X16 | H | Spermatophyta, Streptophyta, Vertebrata
4 | X1.72 | Paired | Fungi, Insecta, Mammalia, Nematoda, Spermatophyta, Streptophyta
5 | X26 | D | Fungi, Mammalia, Nematoda, Spermatophyta, Streptophyta, Vertebrata
6 | X2.71 | Paired | Fungi, Insecta, Nematoda, Spermatophyta
7 | X27.43 | Paired | Fungi, Insecta, Nematoda, Spermatophyta, Streptophyta, Vertebrata
8 | X3.70 | Paired | Fungi, Insecta, Mammalia, Nematoda, Spermatophyta, Streptophyta
9 | X44 | Weak | Spermatophyta
10 | X44 | D | Fungi, Mammalia, Streptophyta
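
The clade-count filter and the `toString` step can be illustrated on a toy table in base R. Everything here is hypothetical (`calls`, `n_clades`, and the three-clade setup); the notebook uses 7 clades and dplyr. Note that this sketch, like the filter it mirrors, drops a position whenever every clade has a call there, even if those calls differ between clades.

```r
# Hypothetical calls: one row per (clade, position) that passed the cutoff.
calls <- data.frame(
  clade     = c("Fungi", "Nematoda", "Insecta", "Fungi", "Nematoda"),
  positions = c("X14", "X14", "X14", "X16", "X16"),
  variable  = c("A", "A", "A", "Pyrimidine", "Pyrimidine"),
  stringsAsFactors = FALSE
)
n_clades <- 3  # total clades in this toy set (7 in the notebook)

# Number of clades with a call at each position.
per_position <- table(calls$positions)

# Keep positions where at least one clade lacks a call.
partial <- names(per_position)[per_position != n_clades]
kept <- calls[calls$positions %in% partial, ]

# Collapse the clades sharing each (position, class) into one string.
result <- aggregate(clade ~ positions + variable, data = kept, FUN = toString)
```

X14 is dropped because all three toy clades have a call there, leaving a single row for X16 with the clades "Fungi, Nematoda".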


# Isotype-specific identity elements

As above, there are two cases to explore: consensus and clade-specific identity elements.

## Isotype-specific consensus identity elements

In [85]:
# pool all clades: frequencies per isotype and position
euk_isotype_freqs = identities %>%
  select(match(c('isotype', positions), colnames(identities))) %>%
  gather(positions, bases, -isotype) %>%
  group_by(isotype, positions, bases) %>%
  tally() %>%
  group_by(isotype, positions) %>%
  mutate(freq=n/sum(n)) %>%
  group_by(isotype, positions) %>%
  summarize(A = sum(freq[bases == "A"]),
            C = sum(freq[bases == "C"]),
            G = sum(freq[bases == "G"]),
            U = sum(freq[bases == "U"]),
            Purine = sum(freq[bases %in% c("A", "G")]),
            Pyrimidine = sum(freq[bases %in% c("C", "U")]),
            Weak = sum(freq[bases %in% c("A", "U")]),
            Strong = sum(freq[bases %in% c("G", "C")]),
            Amino = sum(freq[bases %in% c("A", "C")]),
            Keto = sum(freq[bases %in% c("G", "U")]),
            B = sum(freq[bases %in% c("C", "G", "U")]),
            D = sum(freq[bases %in% c("A", "G", "U")]),
            H = sum(freq[bases %in% c("A", "C", "U")]),
            V = sum(freq[bases %in% c("A", "C", "G")]),
            GC = sum(freq[bases == "G:C"]),
            AU = sum(freq[bases == "A:U"]),
            UA = sum(freq[bases == "U:A"]),
            CG = sum(freq[bases == "C:G"]),
            GU = sum(freq[bases == "G:U"]),
            UG = sum(freq[bases == "U:G"]),
            PurinePyrimidine = sum(freq[bases %in% c("A:U", "G:C")]),
            PyrimidinePurine = sum(freq[bases %in% c("U:A", "C:G")]),
            StrongPair = sum(freq[bases %in% c("G:C", "C:G")]),
            WeakPair = sum(freq[bases %in% c("A:U", "U:A")]),
            Wobble = sum(freq[bases %in% c("G:U", "U:G")]),
            Paired = sum(freq[bases %in% c("A:U", "U:A", "C:G", "G:C", "G:U", "U:G")]),
            Mismatched = sum(freq[bases %in% c("A:A", "G:G", "C:C", "U:U", "A:G", "A:C", "C:A", "C:U", "G:A", "U:C")])
            )

In [87]:
isotype_consensus = euk_isotype_freqs %>%
  melt %>% 
  filter(value > 0.95) %>%
  group_by(isotype, positions) %>% # remove duplicates
  filter(row_number(value) == 1) %>%
  arrange(isotype)

Using isotype, positions as id variables


In [88]:
isotype_consensus

Row | isotype | positions | variable | value
:---|---------|-----------|----------|------
1 | Ala | X14 | A | 0.9896015
2 | Ala | X21 | A | 0.980759
3 | Ala | X37 | A | 0.9552941
4 | Ala | X44 | A | 0.9646679
5 | Ala | X56 | C | 0.9649715
6 | Ala | X18 | G | 0.9859962
7 | Ala | X19 | G | 0.9802657
8 | Ala | X46 | G | 0.9683112
9 | Ala | X20 | U | 0.9560531
10 | Ala | X33 | U | 0.980797


## Clade-specific isotype-specific IDEs

In [103]:
clade_iso_freqs = identities %>%
  select(match(c('clade', 'isotype', positions), colnames(identities))) %>%
  gather(positions, bases, -clade, -isotype) %>%
  group_by(clade, isotype, positions, bases) %>%
  tally() %>%
  group_by(clade, isotype, positions) %>%
  mutate(freq=n/sum(n)) %>%
  group_by(clade, isotype, positions) %>%
  summarize(A = sum(freq[bases == "A"]),
            C = sum(freq[bases == "C"]),
            G = sum(freq[bases == "G"]),
            U = sum(freq[bases == "U"]),
            Purine = sum(freq[bases %in% c("A", "G")]),
            Pyrimidine = sum(freq[bases %in% c("C", "U")]),
            Weak = sum(freq[bases %in% c("A", "U")]),
            Strong = sum(freq[bases %in% c("G", "C")]),
            Amino = sum(freq[bases %in% c("A", "C")]),
            Keto = sum(freq[bases %in% c("G", "U")]),
            B = sum(freq[bases %in% c("C", "G", "U")]),
            D = sum(freq[bases %in% c("A", "G", "U")]),
            H = sum(freq[bases %in% c("A", "C", "U")]),
            V = sum(freq[bases %in% c("A", "C", "G")]),
            GC = sum(freq[bases == "G:C"]),
            AU = sum(freq[bases == "A:U"]),
            UA = sum(freq[bases == "U:A"]),
            CG = sum(freq[bases == "C:G"]),
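            GU = sum(freq[bases == "G:U"]),
            UG = sum(freq[bases == "U:G"]),
            PurinePyrimidine = sum(freq[bases %in% c("A:U", "G:C")]),
            PyrimidinePurine = sum(freq[bases %in% c("U:A", "C:G")]),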
            StrongPair = sum(freq[bases %in% c("G:C", "C:G")]),
            WeakPair = sum(freq[bases %in% c("A:U", "U:A")]),
            Wobble = sum(freq[bases %in% c("G:U", "U:G")]),
            Paired = sum(freq[bases %in% c("A:U", "U:A", "C:G", "G:C", "G:U", "U:G")]),
            Mismatched = sum(freq[bases %in% c("A:A", "G:G", "C:C", "U:U", "A:G", "A:C", "C:A", "C:U", "G:A", "U:C")])
            )

In [None]:
clade_iso_specific = clade_iso_freqs %>%
  melt %>% 
  filter(value > 0.95) %>%
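  group_by(clade, isotype, positions) %>% # keep one call per clade, isotype, and position: the lowest passing frequency
  filter(row_number(value) == 1) %>%
  group_by(isotype, positions) %>%
  filter(n() != 7) %>% # drop isotype/position pairs where all 7 clades have a call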
  group_by(positions, isotype, variable) %>%
  summarize(clades=toString(clade)) %>%
  arrange(isotype, positions)
clade_iso_specific

Using clade, isotype, positions as id variables
