# Codon Usage and tRNA Anticodon

Using a list of GenBank files as input, this script:

- [x] Extracts all CDSs from each file
- [x] Calculates codon usage

---

- [x] Converts each sequence to FASTA (or only the concatenated tRNAs)
- [x] Uses tRNAscan-SE to identify the anticodons

---

It then generates an Excel spreadsheet containing the following fields:

- [x] Species

- [x] Translation table

- [x] Amino acid (three-letter code)

- [x] Codon for that amino acid

- [ ] Has anticodon in mito tRNAs? YES or EMPTY VALUE

- [x] Number of occurrences of that codon in the genes

- [x] /1000: (number of codon occurrences / total number of codons, including start and stop codons) * 1000

- [x] Fraction: codon occurrences / sum of all codon occurrences for that amino acid
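
To make the last two fields concrete, here is a minimal sketch with made-up counts for the two phenylalanine codons and the two tyrosine codons. The counts and the helper names (`per_thousand_example`, `fraction_example`) are illustrative assumptions, not values or functions from this analysis.

```python
# Toy counts (hypothetical, not taken from any genome in this notebook)
codon_counts = {"TTT": 30, "TTC": 10, "TAT": 45, "TAC": 15}
synonymous = {"Phe": ["TTT", "TTC"], "Tyr": ["TAT", "TAC"]}


def per_thousand_example(counts):
    """Occurrences of each codon per 1000 codons counted."""
    total = sum(counts.values())
    return {codon: round(n / total * 1000, 2) for codon, n in counts.items()}


def fraction_example(counts, codons_by_aa):
    """Share of each codon among the codons of the same amino acid."""
    result = {}
    for codons in codons_by_aa.values():
        aa_total = sum(counts[c] for c in codons)
        for c in codons:
            # An amino acid that never occurs gets a fraction of 0.0
            result[c] = round(counts[c] / aa_total, 2) if aa_total else 0.0
    return result


per_1000 = per_thousand_example(codon_counts)
fractions = fraction_example(codon_counts, synonymous)
print(per_1000["TTT"], fractions["TTT"])  # 300.0 0.75
```

With 100 codons in total, TTT (30 occurrences) scores 300 per thousand, and it accounts for 30 of the 40 phenylalanine codons, hence a fraction of 0.75.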

Let's import everything we need:

In [1]:
from Bio import SeqIO
import pandas
from Bio.Seq import Seq
from Bio.Data import CodonTable
from Bio.SeqUtils import seq3
import subprocess, os
import glob

For this script to work, run it from the directory that contains the `gbks/` folder with the GenBank files, or change the working directory with **os.chdir("path/to/directory/")**.

First, we add all `.gb` files to a list **(files ending in .gbk are not picked up!)**:

In [2]:
genbanks = glob.glob("./gbks/*.gb")
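
Because files with a `.gbk` extension are silently ignored by the pattern above, it can help to list them explicitly before running the analysis. The following is a small sketch using only the standard library; the function name `list_genbank_files` is a hypothetical helper, not part of the original script.

```python
from pathlib import Path


def list_genbank_files(directory):
    """Return the sorted .gb paths and, separately, any .gbk files that would be skipped."""
    folder = Path(directory)
    accepted = sorted(str(p) for p in folder.glob("*.gb"))
    skipped = sorted(str(p) for p in folder.glob("*.gbk"))
    return accepted, skipped
```

Calling `list_genbank_files("./gbks")` would return the same files as the `glob` call, plus a second list that should be empty if every file has been renamed to `.gb`.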

In [3]:
#genbank = !ls *gb
#genbank

# trnascan
There are files for the different genetic codes in /home/gabriel/bioinfo/anaconda3/lib/tRNAscan-SE/gcode/gcode.vertmito.
Looking at them may give me some ideas on how to run tRNAscan-SE for other translation tables.

In [4]:
def genbank_to_fasta(genbank):
    record = SeqIO.read(genbank, "genbank")
    species = record.annotations.get("organism").replace(" ", "_")
    seq = record.seq
    fasta = ">{}\n{}".format(species, seq)
    return(fasta)

In [5]:
for i in genbanks:
    seqio = SeqIO.read(i, "genbank")
    print(seqio.id, len(seqio))

NC_029755.1 14049
NC_028077.1 14402
NC_020322.1 13931
NC_025223.1 14513
NC_024877.1 14316
NC_032402.1 14205
NC_010777.1 13991
NC_024287.1 14601
NC_044695.1 14032
NC_059873.1 14149
NC_008063.1 14436
NC_044081.1 14213
XXXXXXXX 14877




NC_028078.1 14193
XXXXXXXX 14925
NC_033971.1 14776
NC_029756.1 14161
NC_005925.1 13874
NC_044696.1 14687
NC_026863.1 16000
NC_042829.1 14625
NC_020324.1 14459
NC_044099.1 14220
NC_042902.1 14683
NC_028068.1 14639
NC_044101.1 14092
NC_020323.1 14197
XXXXXXXX 13885
NC_044653.1 14074
NC_027682.1 14575
NC_058921.1 14724
NC_027492.1 14563
NC_060386.1 14610
NC_041120.1 14334
NC_053648.1 14431
NC_005924.1 14215
NC_057201.1 14458
NC_040861.1 14727
NC_053738.1 13874
XXXXXXXX 14496
NC_026290.1 14156
NC_025557.1 14407
NC_046736.1 15078
NC_025224.1 14442
NC_026123.1 14741
NC_025775.1 14414
NC_025634.1 14617
NC_060328.1 14325
NC_024281.1 14063
NC_005942.1 14381
NC_024878.1 14272
NC_040859.1 14941
NC_031355.1 14783


---
We also have to calculate codon usage for all CDSs. To do so, we first extract every CDS from the GenBank file, make sure that the length of each sequence is a multiple of 3 (otherwise adding "A" residues, as the poly-A tail would, to complete the last codon) and then concatenate the CDSs.

In [6]:
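def extract_cds(genbank):
    """Return [accession, species_name, gene, sequence] for every CDS in a GenBank file."""
    cds_list = []
    for record in SeqIO.parse(genbank, "genbank"):
        accession = record.id
        species_name = record.annotations.get("organism", "unknown").replace(" ", "_")
        for feature in record.features:
            if feature.type == "CDS":
                # Fall back to the product name when the /gene qualifier is missing
                gene = feature.qualifiers.get(
                    "gene", feature.qualifiers.get("product", ["unknown"]))[0]
                sequence = feature.location.extract(record).seq
                cds_list.append([accession, species_name, gene, sequence])
    return cds_list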

In [10]:
#Checking if loxosceles similis has 13 CDS's
lox = SeqIO.read('./gbks/loxosceles_similis.gb', "genbank")
print(dir(lox))
cds = 0
for i in lox.features:
    if i.type == "CDS":
        cds += 1
        print(i)
print("Total cds: ", cds)

['__add__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__le___', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper']
type: CDS
location: [444:1359](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GeneID:40508922']
    Key: gene, Value: ['ND2']
    Key: locus_tag, Value: ['FKU06_mgp12']
    Key: product, Value: ['NADH dehydrogenase subunit 2']
    Key: protein_id, Value: ['YP_009660716.1']
    Key: trans

In [9]:
#All seqs have all 13 CDS's
for i in genbanks:
    genes = extract_cds(i)
    total = concatenate_cds(fix_truncated(genes))
    print(genes[0][1], len(genes), len(total))

NameError: name 'concatenate_cds' is not defined

In [8]:
def fix_truncated(coding_list):
    """Pad CDSs whose length is not a multiple of 3 with 'A' (residues supplied by the poly-A tail)."""
    not_trunc_cds_list = []
    for cds in coding_list:
        truncated_nucs = len(cds[3]) % 3
        if truncated_nucs:
            # Complete the truncated stop codon with the missing 'A' residues
            missing_nucs = 3 - truncated_nucs
            cds[3] = cds[3] + (missing_nucs * 'A')
        not_trunc_cds_list.append(cds)
    return not_trunc_cds_list
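
Many mitochondrial genes end in an incomplete stop codon (`T` or `TA`) that is completed to `TAA` by polyadenylation, which is why the padding uses `A`. The padding rule can be sketched on plain strings as follows; `pad_with_poly_a` is a hypothetical stand-alone helper and the sequences are made up.

```python
def pad_with_poly_a(sequence):
    """Append 'A' until the sequence length is a multiple of three."""
    remainder = len(sequence) % 3
    if remainder:
        sequence += "A" * (3 - remainder)
    return sequence


print(pad_with_poly_a("ATGCCT"))    # ATGCCT (already complete)
print(pad_with_poly_a("ATGCCTT"))   # ATGCCTTAA (T completed to TAA)
print(pad_with_poly_a("ATGCCTTA"))  # ATGCCTTAA (TA completed to TAA)
```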


# NEXT STEP:

Take a look at the "CAI.py" file and use it as a reference to write a function that counts codons for any genetic code.

Adapted from CAI.py:

In [13]:
#from itertools import chain
#from Bio.Data import CodonTable
#from collections import Counter
#from Bio.SeqUtils import seq3

# get rid of Biopython warning
#import warnings
#from Bio import BiopythonWarning

#warnings.simplefilter("ignore", BiopythonWarning)


def set_ncbi_genetic_codes():


    ncbi_genetic_codes = {}
    for table, codes in CodonTable.unambiguous_dna_by_id.items():
        
        # invert the genetic code dictionary to map each amino acid to its codons
        # create dictionary of synonymous codons;
        # Example: {'Phe': ['TTT', 'TTC'], ..., 'End': ['TAA', 'TAG', 'TGA']}
        codons_for_amino_acid = {}
        for codon, amino_acid in codes.forward_table.items():
            amino_acid = seq3(amino_acid)
            codons_for_amino_acid[amino_acid] = codons_for_amino_acid.get(amino_acid, [])
            codons_for_amino_acid[amino_acid].append(codon)
        #Adding the stop codons
        codons_for_amino_acid["End"] = CodonTable.unambiguous_dna_by_id[table].stop_codons
        
        #create dictionary of synonymous codons for each genetic table;
        ncbi_genetic_codes[table] = codons_for_amino_acid
        
    return(ncbi_genetic_codes)

#codon_tables = set_ncbi_genetic_codes()

In [14]:
print(set_ncbi_genetic_codes().get(5))

{'Phe': ['TTT', 'TTC'], 'Leu': ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], 'Ser': ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC', 'AGA', 'AGG'], 'Tyr': ['TAT', 'TAC'], 'Cys': ['TGT', 'TGC'], 'Trp': ['TGA', 'TGG'], 'Pro': ['CCT', 'CCC', 'CCA', 'CCG'], 'His': ['CAT', 'CAC'], 'Gln': ['CAA', 'CAG'], 'Arg': ['CGT', 'CGC', 'CGA', 'CGG'], 'Ile': ['ATT', 'ATC'], 'Met': ['ATA', 'ATG'], 'Thr': ['ACT', 'ACC', 'ACA', 'ACG'], 'Asn': ['AAT', 'AAC'], 'Lys': ['AAA', 'AAG'], 'Val': ['GTT', 'GTC', 'GTA', 'GTG'], 'Ala': ['GCT', 'GCC', 'GCA', 'GCG'], 'Asp': ['GAT', 'GAC'], 'Glu': ['GAA', 'GAG'], 'Gly': ['GGT', 'GGC', 'GGA', 'GGG'], 'End': ['TAA', 'TAG']}


Each table is expected to have 64 codons and 21 entries (20 amino acids plus `End`). The check below is currently disabled (wrapped in a string):

In [15]:
'''for table in codon_tables.keys():
    codons = 0
    for aa in codon_tables[table]:
        codons += len(codon_tables[table][aa])
    print("Table number: {} \t Number of codons: {} \t Aminoacids: {}".format\
          (table, codons , len(codon_tables[table].keys())))'''


In [10]:
def count_codons(sequence):
    """Count every codon in a sequence whose length is a multiple of three."""
    from itertools import product

    # Start each of the 64 codons at zero so that absent codons are still reported
    codon_count_dict = {''.join(i): 0 for i in product("ATCG", repeat=3)}

    if len(sequence) % 3 != 0:
        raise ValueError("Input sequence not divisible by three")
    ambiguous = 0
    for i in range(0, len(sequence), 3):
        codon = str(sequence[i:i+3]).upper()
        if codon not in codon_count_dict:
            # Codons with IUPAC ambiguity characters (e.g. N) are skipped
            ambiguous += 1
            continue
        codon_count_dict[codon] += 1
    if ambiguous:
        print("This sequence probably has ambiguous IUPAC characters.",
              f"{ambiguous} codons with such characters have been ignored")
    return codon_count_dict

#print(count_codons(concat))

#codon_count = count_codons(concat)
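
For comparison, the same counting can be sketched with `collections.Counter`. This variant is an assumption-laden alternative, not the function used in the rest of the notebook: the name `count_codons_counter` is hypothetical, it returns the number of skipped codons instead of printing a message, and it only reports codons that actually occur (no zero-count entries).

```python
from collections import Counter


def count_codons_counter(sequence):
    """Return (codon counts, number of codons skipped for ambiguous characters)."""
    seq = str(sequence).upper()
    if len(seq) % 3:
        raise ValueError("Input sequence not divisible by three")
    codons = Counter(seq[i:i + 3] for i in range(0, len(seq), 3))
    # Keep only codons made of unambiguous nucleotides
    valid = {codon: n for codon, n in codons.items() if set(codon) <= set("ACGT")}
    ambiguous = sum(codons.values()) - sum(valid.values())
    return valid, ambiguous


counts, skipped = count_codons_counter("ATGTTTTTNTAA")
print(counts, skipped)  # {'ATG': 1, 'TTT': 1, 'TAA': 1} 1
```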

# NEXT STEP:
Calculate all metrics associated with codon usage and map them to the corresponding amino acids, species, etc.
Put this data into a dataframe and write it to an Excel file.

In [17]:
def per_thousand(codon_count_dict):
    total_codon_number = sum(codon_count_dict.values())
    per_thousand = {codon: round((count/total_codon_number)*1000, 2) 
                    for codon, count in codon_count_dict.items()}
    return(per_thousand)    

#per_thousand = per_thousand(codon_count)
#print(per_thousand)

In [18]:
#def identify_genetic_code(genbank):
#    record = SeqIO.read(genbank, "genbank")
#    genetic_code = set()
#    for i in record.features:
#        if i.type == "CDS":
#            genetic_code.add(i.qualifiers.get("transl_table")[0])
#    if len(genetic_code) != 1:
#        raise ValueError("The {} file has more than one genetic code for its CDS's: {} " 
#                             "Please correct that before submiting".format(genbank, genetic_code))
#    return(int(list(genetic_code)[0]))
    
#genetic_code = identify_genetic_code(genbank[0]) 
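
The idea behind the disabled helper, reading the `transl_table` qualifier of every CDS and insisting on a single value, can be sketched without touching any file. The function below is a hypothetical variant that accepts any iterable of objects with `.type` and `.qualifiers` attributes; the `SimpleNamespace` objects are stand-ins for real features and the optional `default` argument is an assumption added for illustration.

```python
from types import SimpleNamespace


def genetic_code_from_features(features, default=None):
    """Return the single transl_table shared by all CDS features."""
    codes = {
        int(feature.qualifiers["transl_table"][0])
        for feature in features
        if feature.type == "CDS" and "transl_table" in feature.qualifiers
    }
    if not codes:
        if default is None:
            raise ValueError("No CDS carries a transl_table qualifier")
        return default
    if len(codes) > 1:
        raise ValueError(f"More than one genetic code found: {sorted(codes)}")
    return codes.pop()


# Stand-in features (hypothetical, mimicking the attributes used above)
toy_features = [
    SimpleNamespace(type="CDS", qualifiers={"transl_table": ["5"]}),
    SimpleNamespace(type="tRNA", qualifiers={}),
    SimpleNamespace(type="CDS", qualifiers={"transl_table": ["5"]}),
]
print(genetic_code_from_features(toy_features))  # 5
```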

In [19]:
for genbank in genbanks:
    print(SeqIO.read(genbank, "genbank"))
    print("Genetic_code:", identify_genetic_code(genbank))
    print()

ID: NC_029755.1
Name: NC_029755
Description: Neoscona nautica mitochondrion, complete genome
Database cross-references: BioProject:PRJNA316287
Number of features: 76
/molecule_type=DNA
/topology=circular
/data_file_division=INV
/date=12-APR-2016
/accessions=['NC_029755']
/sequence_version=1
/keywords=['RefSeq']
/source=mitochondrion Neoscona nautica (brown sailor spider)
/organism=Neoscona nautica
/taxonomy=['Eukaryota', 'Metazoa', 'Ecdysozoa', 'Arthropoda', 'Chelicerata', 'Arachnida', 'Araneae', 'Araneomorphae', 'Entelegynae', 'Araneoidea', 'Araneidae', 'Neoscona']
/references=[Reference(title='The complete mitochondrial genome of Neoscona nautica', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)]
/comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence is identical to KR259804.
COMPLETENESS: full length.
Seq('AATATTGTCAGCTAATAAAGCTAATGAGTTCATACCTCATAAATGGAATTATTA...ATT')


NameError: name 'identify_genetic_code' is not defined

In [20]:
#print(codon_count)

In [21]:
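def fraction(codon_count_dict, transl_table):
    """Fraction of each codon among all codons that encode the same amino acid."""
    fractions = dict()
    for codon_list in transl_table.values():
        codon_occurrences_per_aa = sum(codon_count_dict[codon] for codon in codon_list)
        for codon in codon_list:
            # Avoid ZeroDivisionError when an amino acid never occurs in the CDSs
            if codon_occurrences_per_aa:
                fractions[codon] = round(codon_count_dict[codon]/codon_occurrences_per_aa, 2)
            else:
                fractions[codon] = 0.0
    return fractions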

#fraction = fraction(codon_count, codon_tables[genetic_code])

In [60]:
#print(trna_scan_parse)

# NEXT:
Create the main function and put all data into a dataframe.
**TIP:** In the main function, run the codon usage functions first, then the anticodon lookup.
Structure chosen for the dataframe: list of dictionaries.
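
As a rough illustration of the list-of-dictionaries layout, each dictionary is one row and each key one column. The sketch below writes two made-up rows to CSV text with the standard library only; the species name and all values are hypothetical. The same list could also be passed to `pandas.DataFrame`, which accepts a list of dictionaries.

```python
import csv
import io

# Hypothetical rows: one dictionary per codon
rows = [
    {"Species": "Example_species", "Translation Table": 5, "Aminoacid": "Phe",
     "Codon": "TTT", "Anticodon": None, "Number of Codon Occurences": 30,
     "/1000": 300.0, "Fraction": 0.75},
    {"Species": "Example_species", "Translation Table": 5, "Aminoacid": "Phe",
     "Codon": "TTC", "Anticodon": "GAA", "Number of Codon Occurences": 10,
     "/1000": 100.0, "Fraction": 0.25},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)  # None is written as an empty cell
print(buffer.getvalue())
```

A codon without a matching anticodon is stored as `None`, which ends up as an empty cell, in line with the "YES or EMPTY VALUE" field described earlier.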

In [33]:
#template:

'''excel_dict = [
    {'species': species_name, 'transl_table': translat_table,
     'aminoacid': Three-letter aa, "Has anticodon in mito?": Yes or no,
    'Number of codon occurences': number of codons, '/1000': per_thousand,
    'Fraction': fraction}
]'''


- [x] Species

- [x] Translation table

- [x] Amino acid (three-letter code)

- [x] Codon for that amino acid

- [ ] Has anticodon in mito tRNAs? YES or EMPTY VALUE

- [x] Number of occurrences of that codon in the genes

- [x] /1000: (number of codon occurrences / total number of codons, including start and stop codons) * 1000

- [x] Fraction: codon occurrences / sum of all codon occurrences for that amino acid

In [11]:
def concatenate_cds(not_trunc):
    concat = ''
    for i in not_trunc:
        concat += i[3]
    return concat

def species_name(not_trunc):
    species = set()
    for i in not_trunc:
        species.add(i[1])
    if len(species) == 1:
        return list(species)[0]
    else:
        raise ValueError("More than one organism for a single genbank file?")

In [74]:
coding_list = extract_cds( './gbks/l_laeta.gb')
notrunc = fix_truncated(coding_list)
concat = concatenate_cds(notrunc)

In [78]:
str(concat)

'ATTATTATCCCCTCTTTTTTGGTTGTTGTTATTTTGTATTTGGTTAGTTTTTTGTTTGTATTTGGTTTCGATGAGTGGTTTTTTATCTGGTTGGGACTGGAGATGAACATGTTTAGATTTGTGTTGTTGGTTTACCGGCGGTTTAACGCCGCGGCTTTGGAGGGGTGTTTTAGGTATTTTTTTGTTCAGAGTTTAGGGTCTGGTTTATTTTTAGGGGGGGTTTATTTAGGTTTAGGCGAGAGTATGATGCTTTTGGTATTGAGATATAAGATAGGGGTAGGTCCTTTCTATTTCTGGTTTCCTCCTGTAGTTGAAAGATTAGAATGGAGAAGTTGTGGTTTGTTAATAACCTTTCAAAAAGTGTTGCCTTTTTATTTGTTTTATATATTTAGAGGTTGGTTAGTTTGGCTAATGGGAGTAATGAGCTTGTTTATTGGTGTTGTGGGATCATTTAATCAATTGAGTTTAAGGAAGTTGATAGCTTTTTCTTCTATTTATCATTTGGGTTGGATGATGATTTGTCAGTGTGTAGATGGGGGGCTTTGGCTCTATTACTTGGTGGTGTATACTGTTTTAATTATTAGTGTTGTTACAATGTTTTTTAGGATAGGGGAAGGAGATTTGGTCCCTAGAAAAGGAAGCCGTCTGAGATTTATAGTAGGGATTTTCAGTATGGGGGGGGTTCCTCCTATATTGGGATTTGTACTAAAATTTATTAGTTTTGTCTATTTTTTAGAGTATGATTATTTTTTATTGTCTTTTATAATTGTTTTATCAATTGTTATGATATACGTTTATATGCGTTTTATTTATAATGAGTTTTTAGGGGTCGAGGAGGTTGCGTGGGTCGGGGGTTGAAATTTAGGTTTAGGTTTGAGAGAGAGAGTGAATTTATTGGGGGGGGTAATTGTTTTGTGTCTCTTGTGATTATTCTTATAATTACTGCGATGATTTTATTCAACTAATCATAAGGATATCGGAACGTTATATTTGTTGTTT

In [13]:
species_name(notrunc)

'Neoscona_nautica'

In [34]:
teste = SeqIO.read(genbanks[0], format = 'genbank')

In [39]:
print(dir(teste))

['__add__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__le___', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper']


In [45]:
teste.annotations.get('organism')

'Neoscona nautica'

In [48]:
from collections import Counter

count = Counter(teste.seq)

In [50]:
print(dir(count))

['__add__', '__and__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__iand__', '__init__', '__init_subclass__', '__ior__', '__isub__', '__iter__', '__le__', '__len__', '__lt__', '__missing__', '__module__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__ror__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__weakref__', '_keep_positive', 'clear', 'copy', 'elements', 'fromkeys', 'get', 'items', 'keys', 'most_common', 'pop', 'popitem', 'setdefault', 'subtract', 'total', 'update', 'values']


In [62]:
AT = (count.get("A") + count.get("T"))*100 / count.total()
print(round(AT, 2))
print(count.total())

78.77
14049


In [79]:
def calculate_AT(gb):
    seqio = SeqIO.read(gb, format="genbank")
    species = seqio.annotations.get('organism')
    full_seq = seqio.seq
    coding_list = extract_cds(gb)
    notrunc = fix_truncated(coding_list)
    pcg_concat = concatenate_cds(notrunc)
    mito_AT = get_perc_AT(full_seq)
    pcg_AT = get_perc_AT(pcg_concat)
    return ( species, mito_AT, pcg_AT)
    
def get_perc_AT(seq):
    count = Counter(str(seq).upper())
    # Indexing a Counter returns 0 for missing keys, so a sequence lacking A or T does not raise
    AT = ( count["A"] + count["T"] ) * 100 / count.total()
    AT = round(AT, 2)
    return AT
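
A dependency-free version of the AT calculation is sketched below for quick checks on plain strings. The name `at_percentage` is hypothetical; unlike the cell above it does not rely on `Counter.total()`, which only exists in Python 3.10 and later, and it counts lower-case letters as well.

```python
def at_percentage(sequence):
    """Percentage of A and T in a sequence, rounded to two decimals."""
    seq = str(sequence).upper()
    if not seq:
        raise ValueError("Empty sequence")
    at = seq.count("A") + seq.count("T")
    return round(at * 100 / len(seq), 2)


print(at_percentage("AATTGC"))  # 66.67
```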

In [85]:
data = []
for i in genbanks:
    #print(calculate_AT(i))
    data.append(calculate_AT(i))
AT_df = pandas.DataFrame(data, columns=['Species','AT %', 'AT % (PCGs)'])



In [88]:
AT_df

| | Species | AT % | AT % (PCGs) |
|---|---|---|---|
| 0 | Neoscona nautica | 78.77 | 78.53 |
| 1 | Cyrtarachne nagasakiensis | 75.7 | 74.98 |
| 2 | Phyxioschema suthepium | 67.4 | 66.61 |
| 3 | Pardosa laura | 77.42 | 76.88 |
| 4 | Plexippus paykulli | 73.49 | 72.79 |
| 5 | Araneus angulatus | 75.14 | 74.42 |
| 6 | Hypochilus thorelli | 70.33 | 69.03 |
| 7 | Telamonia vlijmi | 77.3 | 76.76 |
| 8 | Argiope perforata | 74.19 | 73.34 |
| 9 | Atypus karschi | 73.65 | 72.76 |


In [87]:
AT_df.to_csv("AT_percentage.csv", index=None)

In [15]:
def invert_codon_table(codon_table):
    new_codon_table = dict()
    for key in codon_table.keys():
        #print(key)
        for codon in codon_table.get(key):
            new_codon_table[codon] = key
            
    return(new_codon_table)    

In [38]:
#def has_anticodon(codon, aa_3letter, anticodon_dict):
#    #print(aa_3letter)
#    if aa_3letter in anticodon_dict.keys():
#        for value in anticodon_dict.get(aa_3letter):
#            anticodon = Seq(value, generic_dna)
#            if codon == str(anticodon.reverse_complement()):
#                return(value)
#    return(None)

In [39]:
def dataframe_parser(species, genetic_code, inverse_codon_table, 
                     codon_count_dict, per_thousand_dict, fraction_dict, 
                     anticodon_dict):
    #fields = ['Species', 'Translation Table', 'Aminoacid', 'Codon', 'Anticodon', 
    #              'Number of Codon Occurences', '/1000', 'Fraction']
    fields = ['Species', 'Translation Table', 'Aminoacid', 'Codon', 'Anticodon',
                  'Number of Codon Occurences', '/1000', 'Fraction']
    dataframe_dict = {k : list() for k in fields}
    for codon, fraction in fraction_dict.items():
        dataframe_dict.get('Species').append(species)
        dataframe_dict.get('Translation Table').append(genetic_code)
        aa_3letter = inverse_codon_table.get(codon)
        dataframe_dict.get('Aminoacid').append(aa_3letter)
        dataframe_dict.get('Codon').append(codon)
        dataframe_dict.get('Anticodon').append(has_anticodon(codon, aa_3letter, anticodon_dict))
        dataframe_dict.get('Number of Codon Occurences').append(codon_count_dict.get(codon))
        dataframe_dict.get('/1000').append(per_thousand_dict.get(codon))
        dataframe_dict.get('Fraction').append(fraction)
    return(dataframe_dict)
    

In [40]:
def create_dataframe(parsed_data):
    codon_usage = pandas.DataFrame(parsed_data)
    return(codon_usage)    


In [41]:
def concat_dataframes(dataframe_data):
    dataframes = []
    for i in dataframe_data.keys():
        dataframes.append(dataframe_data.get(i))
    final_dataframe = pandas.concat(dataframes)
    return(final_dataframe)

In [42]:
def export_dataframe_to_excel(final_dataframe):
    final_dataframe.to_excel("Codon_usage.xlsx", index=False, sheet_name="codon_usage")
    
def export_dataframe_to_csv(final_dataframe):
    final_dataframe.to_csv("Codon_usage.csv", index=False)

In [43]:
default_anticodons = {'Ala': ['TGC'],
                      'Arg': ['TCG'],
                      'Asn': ['GTT'],
                      'Asp': ['GTC'],
                      'Cys': ['GCA'],
                      'Gln': ['TTG'],
                      'Glu': ['TTC'],
                      'Gly': ['TCC'],
                      'His': ['GTG'],
                      'Ile': ['GAT'],
                      'Leu': ['TAA', 'TAG'],
                      'Lys': ['CTT'],
                      'Met': ['CAT'],
                      'Phe': ['GAA'],
                      'Pro': ['TGG'],
                      'Ser': ['GCT', 'TGA'],
                      'Thr': ['TGT'],
                      'Trp': ['TCA'],
                      'Tyr': ['GTA'],
                      'Val': ['TAC']}
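
Each anticodon in this table is written 5'→3', so the codon it pairs with under strict Watson-Crick rules is its reverse complement. The sketch below shows that relationship on plain strings; `reverse_complement` and `codon_read_by` are hypothetical helper names, and wobble pairing is deliberately ignored, so each anticodon maps to exactly one codon.

```python
# Translation table for complementing unambiguous DNA letters
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def codon_read_by(anticodon):
    """Codon that pairs with the anticodon without any wobble."""
    return reverse_complement(anticodon)


print(codon_read_by("GAA"))  # TTC (Phe)
print(codon_read_by("TGC"))  # GCA (Ala)
```

Under this rule only one codon per tRNA is flagged as having an anticodon, even though a single mitochondrial tRNA usually reads several synonymous codons.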

In [44]:
#def trnascan_se_parser(trnascan_output, sequence_name):
#    aminoacid = list()
#    anticodon = list()
#    for i in trnascan_output:
#         if i.startswith(sequence_name):
#                i = i.split("\t")
#                aminoacid.append(i[4].strip())
#                anticodon.append(i[5].strip())
#    trnascan_dict = {sequence_name: {i[0]: i[1] for i in zip(aminoacid, anticodon)}}
#    return(trnascan_dict)


def has_anticodon(codon, aa_3letter, anticodon_dict):
    #print(aa_3letter)
    if aa_3letter in anticodon_dict.keys():
        for value in anticodon_dict.get(aa_3letter):
            anticodon = Seq(value)
            if codon == str(anticodon.reverse_complement()):
                return(value)
    return(None)
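
To replace the default anticodon table with real tRNAscan-SE results, the tabular output has to be turned into the same `{amino acid: [anticodons]}` structure. The sketch below assumes the tab-separated layout implied by the commented-out parser (tRNA number in the second column, amino acid in the fifth, anticodon in the sixth). The function name and the sample lines are hypothetical, not real tRNAscan-SE output.

```python
def parse_trnascan_lines(lines):
    """Collect anticodons per amino acid from tab-separated tRNAscan-SE-style lines."""
    anticodons = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip headers, separators and malformed lines (tRNA number must be an integer)
        if len(fields) < 6 or not fields[1].strip().isdigit():
            continue
        amino_acid = fields[4].strip()
        anticodon = fields[5].strip().upper()
        anticodons.setdefault(amino_acid, [])
        if anticodon not in anticodons[amino_acid]:
            anticodons[amino_acid].append(anticodon)
    return anticodons


# Made-up lines in the assumed column layout
sample = [
    "Sequence\ttRNA\tBegin\tEnd\tType\tAnticodon\tIntron\tIntron\tScore",
    "Example_species\t1\t100\t165\tPhe\tGAA\t0\t0\t45.2",
    "Example_species\t2\t300\t362\tLeu\tTAA\t0\t0\t38.9",
    "Example_species\t3\t500\t561\tLeu\tTAG\t0\t0\t41.0",
]
print(parse_trnascan_lines(sample))
# {'Phe': ['GAA'], 'Leu': ['TAA', 'TAG']}
```

Keeping a list per amino acid avoids losing the second tRNA for leucine or serine, which a plain `{amino acid: anticodon}` mapping would overwrite.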

In [50]:
def main_func(genbank_list):
    """Run the codon usage analysis for every GenBank file and export the result."""
    codon_tables = set_ncbi_genetic_codes()
    dataframe_data = dict()
    # Iterate over the argument, not over the global list of files
    for index, genbank in enumerate(genbank_list, 1):
        print("Running analysis for {}".format(genbank))
        try:
            not_trunc = fix_truncated(extract_cds(genbank))
            cds_concat = concatenate_cds(not_trunc)
            species = species_name(not_trunc)

            codon_count_dict = count_codons(cds_concat)
            per_thousand_dict = per_thousand(codon_count_dict)

            # Hard-coded to the invertebrate mitochondrial code (table 5)
            #genetic_code = identify_genetic_code(genbank)
            genetic_code = 5
            fraction_dict = fraction(codon_count_dict, codon_tables[genetic_code])

            # tRNAscan-SE parsing is disabled; use the default anticodon set
            #anticodon_dict = trnascan_se_anticodon_parser(trnascan_output, species)
            anticodon_dict = default_anticodons

            inverse_codon_table = invert_codon_table(codon_tables[genetic_code])
            parsed_data = dataframe_parser(species, genetic_code, inverse_codon_table,
                                           codon_count_dict, per_thousand_dict, fraction_dict,
                                           anticodon_dict)
        except Exception as e:
            print("{} analysis failed".format(genbank))
            print(e)
            continue

        dataframe_data[index] = create_dataframe(parsed_data)

    final_dataframe = concat_dataframes(dataframe_data)
    print(final_dataframe)

    #export_dataframe_to_excel(final_dataframe)
    export_dataframe_to_csv(final_dataframe)

main_func(genbanks)

Running analysis for ./gbks/Neoscona_nautica_NC_029755.1.gb
Running analysis for ./gbks/Cyrtarachne_nagasakiensis_NC_028077.1.gb
Running analysis for ./gbks/Phyxioschema_suthepium_NC_020322.1.gb
Running analysis for ./gbks/Pardosa_laura_NC_025223.1.gb
Running analysis for ./gbks/Plexippus_paykulli_NC_024877.1.gb
Running analysis for ./gbks/Araneus_angulatus_NC_032402.1.gb
Running analysis for ./gbks/Hypochilus_thorelli_NC_010777.1.gb
This sequence probably has ambiguous IUPAC characters. 1 codons with such characters have been ignored
Running analysis for ./gbks/Telamonia_vlijmi_NC_024287.1.gb
Running analysis for ./gbks/Argiope_perforata_NC_044695.1.gb
Running analysis for ./gbks/Atypus_karschi_NC_059873.1.gb
Running analysis for ./gbks/Trichonephila_clavata_NC_008063.1.gb
Running analysis for ./gbks/Harpactocrates_apennicola_NC_044081.1.gb
Running analysis for ./gbks/phoneutria_sp.gb




This sequence probably has ambiguous IUPAC characters. 1 codons with such characters have been ignored
Running analysis for ./gbks/Hypsosinga_pygmaea_NC_028078.1.gb
Running analysis for ./gbks/l_laeta.gb
Running analysis for ./gbks/Agelena_silvatica_NC_033971.1.gb
Running analysis for ./gbks/Neoscona_adianta_NC_029756.1.gb
Running analysis for ./gbks/Cyriopagopus_schmidti_NC_005925.1.gb
Running analysis for ./gbks/Cyclosa_japonica_NC_044696.1.gb
Running analysis for ./gbks/Argyroneta_aquatica_NC_026863.1.gb
This sequence probably has ambiguous IUPAC characters. 4 codons with such characters have been ignored
Running analysis for ./gbks/Epeus_alboguttatus_NC_042829.1.gb
Running analysis for ./gbks/Pholcus_phalangioides_NC_020324.1.gb
Running analysis for ./gbks/Parachtes_romandiolae_NC_044099.1.gb
Running analysis for ./gbks/loxosceles_similis.gb
Running analysis for ./gbks/Tetragnatha_nitens_NC_028068.1.gb
Running analysis for ./gbks/Neoscona_scylla_NC_044101.1.gb
Running analysis for 

In [109]:
gb = SeqIO.read("./gbks/Oxyopes_sertatus_NC_025224.1.gb", "genbank")

In [110]:
print(dir(gb))

['__add__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__le___', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper']


In [111]:
print(gb.format("gb"))
print(len(gb))

LOCUS       NC_025224              14442 bp    DNA     circular INV 16-OCT-2014
DEFINITION  Oxyopes sertatus voucher 2013-phc-nj-ox-se mitochondrion, complete
            genome.
ACCESSION   NC_025224
VERSION     NC_025224.1
DBLINK      BioProject: PRJNA262796
KEYWORDS    RefSeq.
SOURCE      mitochondrion Oxyopes sertatus
  ORGANISM  Oxyopes sertatus
            Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Chelicerata; Arachnida;
            Araneae; Araneomorphae; Entelegynae; Lycosoidea; Oxyopidae; Oxyopes.
REFERENCE   1  (bases 1 to 14442)
  AUTHORS   Pan,W.J., Fang,H.Y., Zhang,P. and Pan,H.C.
  TITLE     The complete mitochondrial genome of striped lynx spider Oxyopes
            sertatus (Araneae: Oxyopidae)
  JOURNAL   Mitochondrial DNA, 1-2 (2014) In press
   PUBMED   25208169
  REMARK    Publication Status: Available-Online prior to print
REFERENCE   2  (bases 1 to 14442)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (04-OCT-2014) National Ce

In [112]:
print(len(gb.seq))

14442


## Just me messing around with codon tables and such

In [137]:
codon_tables = set_ncbi_genetic_codes()

print(codon_tables[2])

{'Phe': ['TTT', 'TTC'], 'Leu': ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], 'Ser': ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'], 'Tyr': ['TAT', 'TAC'], 'Cys': ['TGT', 'TGC'], 'Trp': ['TGA', 'TGG'], 'Pro': ['CCT', 'CCC', 'CCA', 'CCG'], 'His': ['CAT', 'CAC'], 'Gln': ['CAA', 'CAG'], 'Arg': ['CGT', 'CGC', 'CGA', 'CGG'], 'Ile': ['ATT', 'ATC'], 'Met': ['ATA', 'ATG'], 'Thr': ['ACT', 'ACC', 'ACA', 'ACG'], 'Asn': ['AAT', 'AAC'], 'Lys': ['AAA', 'AAG'], 'Val': ['GTT', 'GTC', 'GTA', 'GTG'], 'Ala': ['GCT', 'GCC', 'GCA', 'GCG'], 'Asp': ['GAT', 'GAC'], 'Glu': ['GAA', 'GAG'], 'Gly': ['GGT', 'GGC', 'GGA', 'GGG'], 'End': ['TAA', 'TAG', 'AGA', 'AGG']}


In [228]:
print(genbank[2])

Allochrocebus_lhoesti_NC_023962.1.gb


In [139]:
fields = ['species', 'transl_table', 'aminoacid', "Anticodon", 
                  'Number of codon occurences', '/1000', 'Fraction']
excel_output_dict = {k : list() for k in fields}

In [158]:
for key, value in fraction_dict.items():
    print(key, value)

TTT 0.45
TTC 0.55
TTA 0.16
TTG 0.03
CTT 0.15
CTC 0.16
CTA 0.45
CTG 0.04
TCT 0.19
TCC 0.29
TCA 0.28
TCG 0.03
AGT 0.06
AGC 0.14
TAT 0.4
TAC 0.6
TGT 0.38
TGC 0.62
TGA 0.91
TGG 0.09
CCT 0.22
CCC 0.4
CCA 0.37
CCG 0.02
CAT 0.34
CAC 0.66
CAA 0.95
CAG 0.05
CGT 0.17
CGC 0.32
CGA 0.45
CGG 0.06
ATT 0.46
ATC 0.54
ATA 0.85
ATG 0.15
ACT 0.21
ACC 0.33
ACA 0.44
ACG 0.02
AAT 0.33
AAC 0.67
AAA 0.93
AAG 0.07
GTT 0.23
GTC 0.29
GTA 0.4
GTG 0.08
GCT 0.26
GCC 0.43
GCA 0.3
GCG 0.01
GAT 0.45
GAC 0.55
GAA 0.79
GAG 0.21
GGT 0.22
GGC 0.34
GGA 0.32
GGG 0.12
TAA 0.85
TAG 0.08
AGA 0.0
AGG 0.08


In [168]:
ls = list()

In [169]:
ls.append(None)

In [170]:
ls

[None]

In [243]:
amino_acid = seq3("F")

In [244]:
amino_acid

'Phe'