What if we aligned all tRNAs to the complete eukaryotic model only? Comparing identity elements across isotypes by position would then be much easier.

Also: track acceptor arm length, loop length, etc.
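
A rough sketch of how arm and loop lengths might be tracked, assuming the secondary structure is already available as a plain dot-bracket string. Converting the alignment's structure annotation into that form is not shown, and the function names and the toy cloverleaf are made up for illustration.

```python
import re

def hairpin_loop_lengths(structure):
    """Return the length of every hairpin loop in a dot-bracket string."""
    # A hairpin loop is a run of unpaired positions enclosed directly
    # by one opening and one closing bracket.
    return [len(m.group(1)) for m in re.finditer(r"\((\.+)\)", structure)]

def acceptor_stem_length(structure):
    """Count the consecutive opening brackets at the 5' end."""
    return len(structure) - len(structure.lstrip("("))

# Toy cloverleaf (hypothetical, not taken from any alignment):
# acceptor stem, D arm, anticodon arm, variable region, T arm, 3' tail.
toy = ("(" * 7 + ".." +
       "(" * 4 + "." * 8 + ")" * 4 + "." +
       "(" * 5 + "." * 7 + ")" * 5 + "." * 5 +
       "(" * 5 + "." * 7 + ")" * 5 +
       ")" * 7 + "." * 4)

loops = hairpin_loop_lengths(toy)       # D loop, anticodon loop, T loop
acceptor = acceptor_stem_length(toy)
```

Run per sequence, these values could be stored as extra columns next to the identity data.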

In [16]:
import pandas as pd
import subprocess
from tRNA_position import get_positions, position_generator

isotypes = ['Ala', 'Arg', 'Asn', 'Asp', 'Cys', 'Gln', 'Glu', 'Gly', 'His', 'Ile', 'iMet', 'Leu', 'Lys', 'Met', 'Phe', 'Pro', 'Ser', 'Thr', 'Trp', 'Tyr', 'Val']
# one covariance model is shared by every isotype
model = '/projects/lowelab/users/blin/tRNAscan/models/current/TRNAinf-euk.cm'
frames = []
for isotype in isotypes:
  # create new alignment file
  fasta = '/projects/lowelab/users/blin/tRNAscan/models/1.6/fasta/euk-' + isotype + '-r2-031616.fa'
  alignment = 'alignments/euk-' + isotype + '.sto'
  subprocess.call('cmalign -g --notrunc --matchonly -o {} {} {} > /dev/null'.format(alignment, model, fasta), shell=True)
  
  # get positions
  positions = get_positions(alignment)
  
  # get identities: one row per (position, symbol) with its frequency
  df = pd.concat(pd.DataFrame({'Position': [position.position], 
                               'Symbol': [symbol],
                               'Frequency': [freq],
                               'Isotype': isotype,
                               'Clade': 'Mammalia'}) for position, symbol, freq in position_generator(positions))
  frames.append(df)

# combine into larger df with a single concat
identities = pd.concat(frames)
  
identities.head()

   Clade     Frequency  Isotype  Position  Symbol
0  Mammalia  0.946895   Ala      1:89      C:O
0  Mammalia  0.961296   Ala      2:88      C:O
0  Mammalia  0.959496   Ala      3:87      C:O
0  Mammalia  0.988749   Ala      4:86      C:O
0  Mammalia  0.957696   Ala      5:85      C:O


In [104]:
key_ides = identities.loc[identities.Frequency >= 0.98].groupby(["Position", "Symbol"])
key_ides = key_ides.apply(lambda x: ', '.join(x.Isotype)).to_frame()
#key_ides.axes
#key_ides.columns = key_ides.columns.get_level_values(0)


In [105]:
key_ides

Position  Symbol  0
10:25     C:O     Asp, Gly, Ile, Phe, Ser, Thr
10:25     G:C     Gln
10:25     G:N     Ala, Arg, Asn, Glu, His, iMet, Leu, Lys, Met, ...
11:24     B:V     Gly
11:24     C:G     Ser
11:24     C:O     Asp, Glu, Ile, Met, Trp, Val
11:24     G:N     Cys, Lys, Phe
11:24     Y:R     Ala, Arg, Gln, His, Leu, Pro, Thr, Tyr
12:23     B:V     Ala, Cys, Ile, Pro, Val
12:23     C:O     Asp, Met, Ser


This is a great list, but I'd like to categorize isotypes into ranked groups, much like for bases and base pairs. 



Symbol  |  Isotypes  | Rule   | Rank
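
The grouping could work the way ambiguity codes do for bases: name each set of isotypes, rank the names from most to least specific, and report the most specific name that covers every isotype sharing a feature. A minimal sketch follows. The group names, memberships, and ranks are placeholders (the two aminoacyl-tRNA synthetase classes are used only as a familiar example), not a final scheme.

```python
# Hypothetical ranked groups: (rank, name, members). Lower rank = more specific.
# Membership below is an assumption made for illustration.
CLASS_I = {'Arg', 'Cys', 'Gln', 'Glu', 'Ile', 'Leu', 'Met', 'iMet', 'Trp', 'Tyr', 'Val'}
CLASS_II = {'Ala', 'Asn', 'Asp', 'Gly', 'His', 'Lys', 'Phe', 'Pro', 'Ser', 'Thr'}
GROUPS = [
    (1, 'Class I', CLASS_I),
    (1, 'Class II', CLASS_II),
    (2, 'All', CLASS_I | CLASS_II),
]

def classify(isotype_string):
    """Map a comma-separated isotype list to (rank, group name)."""
    isotypes = set(isotype_string.split(', '))
    # A single isotype is its own, most specific, group.
    if len(isotypes) == 1:
        return (0, next(iter(isotypes)))
    # Otherwise take the first (lowest-rank) group that contains them all.
    for rank, name, members in sorted(GROUPS, key=lambda g: g[0]):
        if isotypes <= members:
            return (rank, name)
    return (None, 'Unclassified')

print(classify('Gln'))
print(classify('Asp, Gly, Ser'))
print(classify('Asp, Met, Ser'))
```

Applied to the joined isotype strings in key_ides, this would fill the Rule and Rank columns once the real groups are decided.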


In [None]:
# Position is an index level after the groupby, not a column
key_ides[key_ides.index.get_level_values("Position") == ""]

According to this list, we have several new identity elements. Here, I look at the ones that 

I'm cross-referencing against Giegé et al. (1998), but work published since then may already have identified some of these IDEs.

Position  |  Isotype  | 
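
The cross-referencing itself could be automated along these lines. This is only a sketch: the candidate rows are copied from the key_ides output above, while the `known` set holds a single placeholder entry and is not transcribed from Giegé et al. (1998).

```python
# (Position, Symbol, Isotype) rows taken from the key_ides output above.
candidates = [
    ('10:25', 'G:C', 'Gln'),
    ('11:24', 'B:V', 'Gly'),
    ('11:24', 'C:G', 'Ser'),
]

# Placeholder for previously reported identity elements, keyed by
# (Position, Isotype). Hypothetical entry, NOT from the literature.
known = {('10:25', 'Gln')}

# Keep only the candidates that have no match in the known set.
novel = [row for row in candidates if (row[0], row[2]) not in known]

for position, symbol, isotype in novel:
    print(position, symbol, isotype)
```

With both tables held as DataFrames, the same anti-join can be written as a left `merge` with `indicator=True`, keeping the rows marked `left_only`.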