# Updating the codon table

I want a script that takes an amino acid sequence as input (FASTA format) and creates a codon-optimized DNA sequence as output (FASTA). The optimization should be based on an Excel file specifying the codon usage. In this notebook, we go through the steps needed to create a Yarrowia-specific codon table and save it in an Excel file:

1. Read the Excel input

2. Remove codons with < 5% frequency

3. Recalculate the frequency of each codon (the codons left for each amino acid should sum to 100)

4. Assign codon values based on their frequencies: set lower and upper bounds in two separate columns of the Excel file

5. Create a new Excel file with the updated codon table

In [1]:
import pandas as pd #organise and process codon table data
from Bio import SeqIO #reading a FASTA file
from Bio.Seq import Seq #manipulating genetic sequences, translating them...
from Bio.SeqRecord import SeqRecord #manage a sequence and its metadata
import random #random number generator, needed for the last function

In [2]:
# Read the Excel file into a DataFrame
codon_table = pd.read_excel(r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\codon_table.xlsx', sheet_name=0)

Print the codon table to check that all 64 possible codons are present (pandas only displays the first and last rows):

In [11]:
print(codon_table) 

   Amino Acid Codon  Frequency
0           -   TAA       83.0
1           -   TAG       10.0
2           -   TGA        7.0
3           A   GCA        2.2
4           A   GCC       59.1
..        ...   ...        ...
59          V   GTG        8.5
60          V   GTT       33.9
61          W   TGG      100.0
62          Y   TAC       95.1
63          Y   TAT        4.9

[64 rows x 3 columns]


First, find the codons with <5% frequency in Yarrowia and remove them. Then recalculate the frequency of the remaining codons for each amino acid and store it in a new column called "New Frequency":

In [32]:
def updated_table(codon_table, output_excel_path):
    # Filter out codons with a frequency less than 5.0
    codon_table = codon_table[codon_table['Frequency'] >= 5.0].copy()

    # Group the DataFrame by Amino Acid
    grouped = codon_table.groupby('Amino Acid')

    # Calculate the total frequency for each amino acid
    amino_acid_totals = grouped['Frequency'].transform('sum')

    # Rescale the remaining frequencies so that they sum to 100 for each amino acid
    # (rule of three: if the kept codons add up to e.g. 97%, as for alanine, that 97% becomes the new 100%)
    codon_table['New Frequency'] = (
        codon_table['Frequency'] / amino_acid_totals * 100
    )

    # Write the updated DataFrame to a new Excel file
    codon_table.to_excel(output_excel_path, index=False)


In [33]:
opt_codon_table = r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\opt_codon_table.xlsx' 
updated_table(codon_table, opt_codon_table)

Now we can check that opt_codon_table has fewer rows and a new frequency column:

In [35]:
print(pd.read_excel(opt_codon_table)) #the updated table was only written to disk, so we read it back with pd.read_excel

   Amino Acid Codon  Frequency  New Frequency
0           -   TAA       83.0      83.000000
1           -   TAG       10.0      10.000000
2           -   TGA        7.0       7.000000
3           A   GCC       59.1      60.927835
4           A   GCT       37.9      39.072165
5           C   TGC       65.6      65.600000
6           C   TGT       34.4      34.400000
7           D   GAC       70.0      70.000000
8           D   GAT       30.0      30.000000
9           E   GAA        6.4       6.400000
10          E   GAG       93.6      93.600000
11          F   TTC       80.7      80.700000
12          F   TTT       19.3      19.300000
13          G   GGA       16.2      16.413374
14          G   GGC       30.6      31.003040
15          G   GGT       51.9      52.583587
16          H   CAC       85.2      85.200000
17          H   CAT       14.8      14.800000
18          I   ATC       65.0      65.789474
19          I   ATT       33.8      34.210526
20          K   AAG       97.3    
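The renormalisation can be checked by hand for alanine. The sketch below uses plain Python rather than pandas so that it runs on its own; the helper name `renormalise` is made up for this illustration, and only the three alanine codons whose frequencies are printed above are included.

```python
def renormalise(frequencies, cutoff=5.0):
    """Drop codons below `cutoff` percent and rescale the rest to sum to 100."""
    kept = {codon: freq for codon, freq in frequencies.items() if freq >= cutoff}
    total = sum(kept.values())
    return {codon: freq / total * 100 for codon, freq in kept.items()}

# Alanine frequencies as printed in the tables above.
# GCG is not shown there, so it is left out of this illustration.
alanine = {"GCA": 2.2, "GCC": 59.1, "GCT": 37.9}

new_frequencies = renormalise(alanine)
for codon, freq in new_frequencies.items():
    print(codon, round(freq, 2))
# GCC 60.93
# GCT 39.07
```

The two values match the "New Frequency" column for alanine: 59.1 / (59.1 + 37.9) * 100 and 37.9 / (59.1 + 37.9) * 100.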

Now we add two more columns to the table: the lower and upper bounds of each codon within its amino acid. We want the choice of codon to be random but still weighted by its natural frequency, so each codon gets an interval (two values) whose width equals its new frequency. A random number between 0 and 100 then selects the codon whose interval it falls in. This avoids repetitive sequences and gives us more flexibility, since a new sequence is generated every time the code is run:

In [51]:
def bounds(codon_table, output_excel_path):
    for amino_acid, group in codon_table.groupby("Amino Acid"):
        # Stack the codon frequencies of this amino acid into contiguous
        # intervals that together span 0-100
        lower_bound = 0.0
        for idx, row in group.iterrows():
            upper_bound = lower_bound + row['New Frequency']
            # Update the DataFrame with lower and upper bounds for this codon
            codon_table.loc[idx, 'Lower_Bound'] = lower_bound
            codon_table.loc[idx, 'Upper_Bound'] = upper_bound
            # The next interval starts exactly where this one ends. Adding 0.01 here
            # would leave gaps between intervals and push the last upper bound past 100
            lower_bound = upper_bound
    # Save the DataFrame with bounds to a new Excel file
    codon_table.round({'New Frequency': 2, 'Lower_Bound': 2, 'Upper_Bound': 2}).to_excel(output_excel_path, index=False)

In [52]:
bounds_codon_table = r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\bounds_codon_table.xlsx' 
bounds(pd.read_excel(opt_codon_table), bounds_codon_table)

Check if the new columns were created:

In [54]:
print(pd.read_excel(bounds_codon_table))

   Amino Acid Codon  Frequency  New Frequency  Lower_Bound  Upper_Bound
0           -   TAA       83.0          83.00         0.00        83.00
1           -   TAG       10.0          10.00        83.00        93.00
2           -   TGA        7.0           7.00        93.00       100.00
3           A   GCC       59.1          60.93         0.00        60.93
4           A   GCT       37.9          39.07        60.93       100.00
5           C   TGC       65.6          65.60         0.00        65.60
6           C   TGT       34.4          34.40        65.60       100.00
7           D   GAC       70.0          70.00         0.00        70.00
8           D   GAT       30.0          30.00        70.00       100.00
9           E   GAA        6.4           6.40         0.00         6.40
10          E   GAG       93.6          93.60         6.40       100.00
11          F   TTC       80.7          80.70         0.00        80.70
12          F   TTT       19.3          19.30        80.70      
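Contiguous intervals are what guarantee that every random value between 0 and 100 maps to exactly one codon. As a minimal sketch of that idea, the lookup can be written with running totals and the standard library's `bisect` module. The helper names `cumulative_upper_bounds` and `pick_codon` are assumptions made for this example, not part of the notebook's functions.

```python
import bisect
import itertools

def cumulative_upper_bounds(frequencies):
    """Return the running total of the frequencies: one upper bound per codon."""
    return list(itertools.accumulate(frequencies))

def pick_codon(codons, upper_bounds, value):
    """Return the codon whose interval contains `value` (0 <= value <= 100)."""
    # bisect_left finds the first upper bound that is >= value
    index = bisect.bisect_left(upper_bounds, value)
    # Guard against a last bound that ends up a hair below 100 through float rounding
    return codons[min(index, len(codons) - 1)]

# Stop codon frequencies from the table above
stop_codons = ["TAA", "TAG", "TGA"]
stop_bounds = cumulative_upper_bounds([83.0, 10.0, 7.0])

print(stop_bounds)                                    # [83.0, 93.0, 100.0]
print(pick_codon(stop_codons, stop_bounds, 50.0))     # TAA
print(pick_codon(stop_codons, stop_bounds, 83.005))   # TAG
print(pick_codon(stop_codons, stop_bounds, 99.9))     # TGA
```

Only the upper bounds are needed here, because each interval implicitly starts where the previous one ends.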

*FROM HERE IT SHOULD BE IN THE NEXT NOTEBOOK*

Let's now try to reverse translate the sequence using the upper and lower bounds and a random number generator:

In [56]:
#should be in the next notebook
my_fasta = r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\YALI2.txt' 
for seq_record in SeqIO.parse(my_fasta, "fasta"):
    print(seq_record.id)
    print(len(seq_record))
    print(repr(seq_record))
    name = seq_record.id
    AA_seq = seq_record.seq

YALI0_D05621g
592
SeqRecord(seq=Seq('MAHDSELELSDEKVVPSINQEKHSFFQRHLDNHPRMAQYNSQLQRFLKWIEVPT...IIS'), id='YALI0_D05621g', name='YALI0_D05621g', description='YALI0_D05621g  | Yarrowia lipolytica CLIB122 | YALI0D05621p | protein | length=592', dbxrefs=[])


In [55]:
def reverse_translate_amino_acids_fasta(AA_seq, codon_table):
    # Initialize an empty DNA sequence
    dna_sequence = Seq("")

    # AA_seq is seq_record.seq, so it holds only the residues and not the FASTA header line
    amino_acid_sequence = AA_seq 

    for amino_acid in amino_acid_sequence:
        # Filter the codon table for the current amino acid
        amino_acid_codons = codon_table[codon_table['Amino Acid'] == amino_acid]

        if not amino_acid_codons.empty:
            # Generate a random value between 0 and 100
            random_value = random.uniform(0, 100)

            # Take the first codon whose upper bound covers the random value (the rows are
            # stored in increasing order of their bounds). Comparing against the upper bound
            # only means a value that lands in a rounding gap between two intervals still
            # gets a codon; the last codon is the default if no bound covers the value.
            selected_codon = amino_acid_codons.iloc[-1]['Codon']
            for _, row in amino_acid_codons.iterrows():
                if random_value <= row['Upper_Bound']:
                    selected_codon = row['Codon']
                    break
            dna_sequence += Seq(selected_codon)

    return dna_sequence
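As an alternative to looking up explicit bounds, the standard library can draw a weighted item directly with `random.choices`. This is a sketch of a different technique from the one used in this notebook; the helper name `weighted_codon` is hypothetical, and the seed is fixed only so that the example is repeatable.

```python
import random

def weighted_codon(codons, weights, rng=random):
    """Draw one codon with probability proportional to its weight."""
    return rng.choices(codons, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed, used only to make the example repeatable

# Glutamate codons with their new frequencies from the table above
draws = [weighted_codon(["GAA", "GAG"], [6.4, 93.6], rng) for _ in range(10_000)]
gag_fraction = draws.count("GAG") / len(draws)
print(gag_fraction)  # close to 0.936

# An amino acid with a single codon always returns that codon
print(weighted_codon(["TGG"], [100.0], rng))  # TGG
```

With this approach the "New Frequency" column alone is enough, at the cost of no longer having the intervals visible in the Excel file.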

In [61]:
# Call the function to reverse translate the amino acid sequence
reverse_translated_dna = reverse_translate_amino_acids_fasta(AA_seq, pd.read_excel(bounds_codon_table))

# Create a SeqRecord object for the DNA sequence
dna_record = SeqRecord(reverse_translated_dna, id="reverse_translated_sequence")

# Print the SeqRecord or save it to a FASTA file
print(dna_record.format("fasta"))

>reverse_translated_sequence <unknown description>
ATGGCCCACGATTCCGAACTCGAGCTCTCCGACGAAAAGGTCGTTCCTTCCATCAACCAG
GAGAAGCACTCCTTCTTCCAGCGACACCTTGACAACCACCCCCGAATGGCTCAGTACAAC
TCTCAGCTTCAGCGATTCCTGAAGTGGATCGAGGTCCCCACCAAGGAGGGCGAGATTAAC
ACCTTCCTCAACAACGAGGACCTGAAGCCCGTCGAGGTTGCTCGACAGACTTGGGGTTGG
AAGAACTTCGTGTCTTTCTGGATCGCTGACTCCTTCAACATTAACACCTGGGAGATCGCT
GCCACCGGCATTCAGCTTGGACTCACCTGGTGGCAGGTTTGGCTTTGCGTCTGGATCGGT
TACTTCTTCTGCGGTGTTTTCGTTGTTCTTTCCGGACGAATTGGTGCTATCTACCACGTT
TCCTTTCCCGTCGCCGGTCGATCTACTTTTGGAATCTTCGGTTCTATCTGGCCCGTCATC
AACCGAGTCGTCATGGCTTGCGTTTGGTACGGAGTTCAGGGCTGGCTGGGTGGCCAGTGC
ATTCAGGTCTGCCTTCTTGCTATCTGGCCCTCCGCCCGACACATGAAGAACGGTATCCCT
GGTTCCGGCACCACTACTTTCGAGTTCCTTTCCTACTTCCTTTTCTGGCTGTTTTCCCTC
CCCTTCATCTACATTCGACCTCACAACCTGCGACACCTGTTCATGGTCAAGGCTGCCATT
GTTCCCGTCGCCGGTATCTCTTTCCTCGTCTGGACCTGCGTCAAGGCCCACGGCATCGGT
CCCATCATGAAGCAGCCTGCCACCGTTCACGGATCTGTTATGGGTTGGGCTTTCATGACC
GCTATCATGAACTCCCTGTCCAACTTCGCCACCATCATTGTCAACGCTCCCGATTTTACC
CGATTCGCTAAGGAGCCCAACGCCATCGTTCTCT
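A quick sanity check is to translate the generated DNA back and compare it with the input protein. The sketch below assumes the standard genetic code and builds the codon dictionary by hand so that it runs without Biopython; the names `CODON_TO_AA` and `translate` are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string codon by codon, ignoring an incomplete trailing codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))

# First line of the FASTA output above
first_line = "ATGGCCCACGATTCCGAACTCGAGCTCTCCGACGAAAAGGTCGTTCCTTCCATCAACCAG"
print(translate(first_line))  # MAHDSELELSDEKVVPSINQ
```

The result matches the first 20 residues of YALI0_D05621g shown in the `SeqRecord` above. For a full sequence, the translated string should equal `str(AA_seq)` and the DNA length should be three times the protein length.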

--------------------------------------


In [66]:
def reverse_translate_amino_acids_fasta(AA_seq, codon_table):
    # Define the forbidden sequences. The N positions of the original patterns
    # (GCTCTTCN, GGTCTCNNNNN) stand for any base, so only the fixed part is searched for;
    # a plain substring search would otherwise look for a literal "N" and never match
    forbidden_sequences = ["GCTCTTC", "GCGGCCGC", "GGTCTC"]
    
    # AA_seq is seq_record.seq, so it holds only the residues and not the FASTA header line
    amino_acid_sequence = AA_seq 

    while True:
        # Start every attempt from an empty list of codons, so that a rejected
        # sequence is discarded instead of being extended
        added_codons = []
        
        for amino_acid in amino_acid_sequence:
            # Filter the codon table for the current amino acid
            amino_acid_codons = codon_table[codon_table['Amino Acid'] == amino_acid]

            if not amino_acid_codons.empty:
                # Generate a random value between 0 and 100
                random_value = random.uniform(0, 100)

                # Take the first codon whose upper bound covers the random value
                # (rows are in increasing order of their bounds); default to the last codon
                selected_codon = amino_acid_codons.iloc[-1]['Codon']
                for _, row in amino_acid_codons.iterrows():
                    if random_value <= row['Upper_Bound']:
                        selected_codon = row['Codon']
                        break
                added_codons.append(selected_codon)
        
        # A forbidden site is longer than one codon, so it can only be detected
        # in the assembled sequence
        dna_sequence = Seq("".join(added_codons))
        contains_forbidden_sequence = any(seq in dna_sequence for seq in forbidden_sequences)
        
        if not contains_forbidden_sequence:
            break  # Exit the loop if the sequence is valid
    
    return dna_sequence


In [68]:
# Call the function to reverse translate the amino acid sequence
reverse_translated_dna = reverse_translate_amino_acids_fasta(AA_seq, pd.read_excel(bounds_codon_table))

# Create a SeqRecord object for the DNA sequence
dna_record = SeqRecord(reverse_translated_dna, id="reverse_translated_sequence")

# Print the SeqRecord or save it to a FASTA file
print(dna_record.format("fasta"))

>reverse_translated_sequence <unknown description>
ATGGCCCACGATTCTGAGCTGGAGCTCTCCGATGAGAAGGTCGTCCCCTCCATCAACCAG
GAGAAGCACTCTTTCTTCCAGCGACACCTGGACAACCATCCCCGAATGGCTCAGTACAAC
TCCCAGCTCCAGCGATTTCTTAAGTGGATTGAGGTCCCTACCAAGGAGGGTGAGATCAAC
ACCTTCCTCAACAACGAGGACCTCAAGCCCGTCGAGGTCGCCCGACAGACCTGGGGTTGG
AAGAACTTCGTCTCTTTCTGGATTGCCGACTCCTTCAACATTAACACCTGGGAGATCGCT
GCTACCGGTATCCAGCTCGGCCTGACCTGGTGGCAGGTCTGGCTCTGCGTCTGGATCGGA
TACTTCTTCTGCGGTGTCTTCGTGGTTCTCTCCGGCCGAATCGGTGCCATTTACCACGTC
TCCTTCCCCGTCGCTGGTCGATCCACCTTTGGTATCTTCGGTTCCATCTGGCCCGTCATC
AACCGAGTCGTTATGGCTTGTGTCTGGTACGGTGTCCAGGGCTGGCTCGGTGGACAGTGT
ATCCAGGTTTGTCTTCTGGCTATTTGGCCCTCCGCCCGACACATGAAGAACGGAATCCCT
GGCTCCGGAACCACCACCTTCGAGTTCCTCTCTTACTTCCTGTTCTGGCTGTTCTCTCTG
CCCTTTATCTACATCCGACCCCACAACCTTCGACACCTCTTCATGGTCAAGGCCGCCATC
GTTCCCGTTGCTGGTATCTCCTTCCTCGTCTGGACCTGTGTTAAGGCCCACGGTATTGGT
CCCATTATGAAGCAGCCCGCTACTGTCCACGGATCTGTGATGGGCTGGGCCTTTATGACC
GCTATTATGAACTCTCTGTCTAACTTCGCCACCATTATTGTGAACGCTCCTGACTTCACC
CGATTTGCCAAGGAGCCCAACGCTATTGTCCTCT
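Restriction sites can lie on either strand, and the `N` positions in the patterns stand for any base. A minimal sketch of a check that handles both points is shown below. The helper names `reverse_complement` and `find_forbidden_sites` are assumptions for this example, and stripping the leading and trailing `N`s is a simplification so that a site close to the end of the sequence is still reported.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def find_forbidden_sites(dna, patterns):
    """Return (pattern, strand, position) for every match on either strand."""
    hits = []
    for pattern in patterns:
        # Drop flanking N positions and treat any internal N as "any base"
        core = pattern.strip("N").replace("N", "[ACGT]")
        regex = re.compile(core)
        for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
            for match in regex.finditer(sequence):
                # Positions on the "-" strand are counted along the reverse complement
                hits.append((pattern, strand, match.start()))
    return hits

forbidden_sequences = ["GCTCTTCN", "GCGGCCGC", "GGTCTCNNNNN"]

print(find_forbidden_sites("ATGGGTCTCAAAAATAA", forbidden_sequences))
# [('GGTCTCNNNNN', '+', 3)]
print(find_forbidden_sites("ATGGAGACCTAA", forbidden_sequences))
# [('GGTCTCNNNNN', '-', 3)]
print(find_forbidden_sites("ATGGCCTAA", forbidden_sequences))
# []
```

The second example contains GAGACC, which is GGTCTC on the opposite strand and would be missed by a search of the forward strand alone.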

--------------------------------

OLD FUNCTIONS:
The first one avoids the recognition sequences of the three restriction enzymes, but still computes the upper and lower bounds inside the function instead of reading them from the table.

In [63]:
def reverse_translate_amino_acids_fasta(AA_seq, codon_table):
    # Define the forbidden sequences. The N (any base) positions of GCTCTTCN and
    # GGTCTCNNNNN are dropped, because a substring search would look for a literal "N"
    forbidden_sequences = ["GCTCTTC", "GCGGCCGC", "GGTCTC"]
    
    # AA_seq is seq_record.seq, so it does not include the FASTA header line
    amino_acid_sequence = AA_seq

    while True:
        # Start every attempt with an empty DNA sequence, so that a rejected
        # sequence is discarded instead of being extended
        dna_sequence = Seq("")
        
        for amino_acid in amino_acid_sequence:
            # Filter the codon table for the current amino acid
            amino_acid_codons = codon_table[codon_table['Amino Acid'] == amino_acid]

            if not amino_acid_codons.empty:
                # Assign codon value ranges based on their frequencies
                value_ranges = []
                lower_bound = 0
                for _, row in amino_acid_codons.iterrows():
                    upper_bound = lower_bound + row['New Frequency']
                    value_ranges.append((row['Codon'], lower_bound, upper_bound))
                    lower_bound = upper_bound

                # Generate a random value between 0 and 100
                random_value = random.uniform(0, 100)

                # Map the random value to the corresponding codon; default to the last codon
                # in case float rounding leaves the final upper bound just below 100
                selected_codon = value_ranges[-1][0]
                for codon, lower, upper in value_ranges:
                    if lower <= random_value <= upper:
                        selected_codon = codon
                        break
                dna_sequence += Seq(selected_codon)
        
        # A forbidden site is longer than one codon, so only the assembled sequence is checked
        contains_forbidden_sequence = any(seq in dna_sequence for seq in forbidden_sequences)
        
        if not contains_forbidden_sequence:
            break  # Exit the loop if the sequence is valid
    
    return dna_sequence


--------------------------------

In [16]:
def reverse_translate_amino_acids(amino_acid_sequence, codon_table):
    dna_sequence = Seq("")
    for amino_acid in amino_acid_sequence:
        # Filter the codon table for the current amino acid
        amino_acid_codons = codon_table[codon_table['Amino Acid'] == amino_acid]

        if not amino_acid_codons.empty:
            # Assign codon value ranges based on their frequencies
            value_ranges = []
            lower_bound = 0
            for _, row in amino_acid_codons.iterrows():
                upper_bound = lower_bound + row['New Frequency']
                value_ranges.append((row['Codon'], lower_bound, upper_bound))
                lower_bound = upper_bound  # no offset needed: the next interval starts where this one ends

            # Generate a random value within the specified range
            random_value = random.uniform(0, 100)

            # Map the random value to the corresponding codon
            for codon, lower, upper in value_ranges:
                if lower <= random_value <= upper:
                    dna_sequence += Seq(codon)
                    break

    # Residues without an entry in the codon table are skipped above, which leaves the
    # DNA shorter than three times the protein length. Pad with random codons to reach
    # the full length (the padding is appended at the end, not at the skipped positions)
    while len(dna_sequence) < len(amino_acid_sequence) * 3:
        random_codon = random.choice(codon_table['Codon'].tolist())
        dna_sequence += Seq(random_codon)

    # Trim the sequence to match the desired length
    dna_sequence = dna_sequence[:len(amino_acid_sequence) * 3]

    return dna_sequence

--------------------------------


The table contains three columns:

Amino acid: The amino acid encoded by a given codon ("-" stands for a stop codon)

Codon: The codon in question

Frequency: The frequency of a given codon when a specific amino acid is encoded.

(If the following is too confusing/complicated then just get started with the other things first, and then get back to this later.)

For example, alanine (A) can be encoded by 4 codons (GCA, GCC, GCG, GCT). GCC and GCT occur frequently (59% and 38%), whereas GCA and GCG are very rarely used (2% and 1%). So we can see that Yarrowia has a preference for GCC and GCT when encoding alanine.
Biologically speaking, this also correlates with the number of tRNA genes Yarrowia has for a given codon.

When doing codon optimization, we don't want to use the rare codons at all, and we would like the frequent codons to be selected according to their occurrence by using a random number generator.

I would remove the rare codons, and recalculate the frequency for the common codons. You can then proceed in slightly different ways, but you could assign the codons values from 0-100 based on their frequency.

In the case of alanine it might look like this.
After removing the rare codons, the new frequency is:

GCC = 60.9 %

GCT = 39.1 %

So let's assign GCC the values 0-60.9, and GCT 61.0-100.

We input the sequence AAA, and iterate through it, randomizing a number from 0-100 each time.

We might get the numbers 12, 89, and 35. This would then correspond to the sequence GCC + GCT + GCC.
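The alanine example can be reproduced in a few lines. In this sketch the three random numbers are fixed to the values used in the text, the helper name `codon_for` is made up for illustration, and the two intervals are written as touching (GCT starts where GCC ends) so that a value such as 60.95 still maps to a codon.

```python
# Intervals from the alanine example: (codon, lower, upper)
alanine_ranges = [("GCC", 0.0, 60.9), ("GCT", 60.9, 100.0)]

def codon_for(value, ranges):
    """Return the first codon whose upper bound is at least `value`."""
    for codon, _lower, upper in ranges:
        if value <= upper:
            return codon
    # Values above the last bound fall back to the last codon
    return ranges[-1][0]

random_numbers = [12, 89, 35]  # the draws from the text, fixed here instead of random
dna = "".join(codon_for(value, alanine_ranges) for value in random_numbers)
print(dna)  # GCCGCTGCC
```

Replacing `random_numbers` with three calls to `random.uniform(0, 100)` gives a different but equally valid sequence on each run.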