I want a script that takes an amino acid sequence as input (FASTA format), and creates a codon optimized DNA sequence as output (FASTA). 

It should be optimized based on an excel file specifying the codon usage, the user should be able to specify forbidden sequences (e.g. restriction sites) that the script avoids, and it should keep the sequence within an acceptable GC content (30-80%), both locally (specified window size, e.g. 15 bp) and globally (whole sequence). 

In the end the script should double check that the DNA generated translates to the AA sequence provided as input. 

You will need to use the Pandas package for reading the excel/working with the dataframe.

and you will need Biopython for reading FASTA format and doing some DNA-related operations (translate, gc-content, etc.)

To break it down in to smaller parts, I would recommend to simply start with reading the FASTA and Excel input. 

as a next step you can try building the DNA sequence (maybe even without considering codon preference) and verifying that it translates correctly. 

Then you can continue by considering codon preference.

and lastly avoiding the forbidden sequences


1. Read FASTA and Excel Input:

Read the amino acid sequence from the FASTA file.
Read the codon usage information from the Excel file. You can use libraries like pandas to handle Excel files.

2. Generate DNA Sequence without Codon Optimization:

Create a DNA sequence based on the input amino acid sequence without considering codon preference. This can be done using a simple translation table from amino acids to codons.

3. Verify Translation:

Translate the generated DNA sequence back to an amino acid sequence.
Compare the translated amino acid sequence with the input amino acid sequence to ensure they match.

4. Codon Optimization:

Implement a codon optimization algorithm based on the codon usage information from the Excel file. You can use optimization techniques such as Genetic Algorithms or other heuristics to find the best codon for each amino acid while avoiding forbidden sequences.

5. GC Content Control:

Calculate the GC content of the generated DNA sequence in a sliding window of the specified size.
If the GC content falls outside the acceptable range (20-80%) for any part of the sequence, make necessary adjustments.

6. Avoid Forbidden Sequences:

Check the generated DNA sequence for the presence of forbidden sequences (e.g., restriction sites) and make modifications to avoid them. You may need to implement a search and replace mechanism for this.

7. Iterate and Optimize:

Repeat steps 4-6 until you have a DNA sequence that meets all the criteria.

8. Write Output to FASTA:

Write the optimized DNA sequence to a FASTA file.

# Codon as input

### 1. Codon table

In [17]:
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import random

In [2]:
   # Read the Excel file into a DataFrame
codon_table = pd.read_excel(r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\codon_table.xlsx', sheet_name=0)

Visualise the codon table.

In [3]:
print(codon_table.tail()) 

   Amino Acid Codon  Frequency
59          V   GTG        8.5
60          V   GTT       33.9
61          W   TGG      100.0
62          Y   TAC       95.1
63          Y   TAT        4.9


Find codons with <5% frequency in Yarrowia and remove them.

In [4]:
def cod_opt_table(codon_table, output_excel_path):

    # Filter out codons with a frequency less than 5.0
    codon_table = codon_table[codon_table['Frequency'] >= 5.0]

    # Write the updated DataFrame to a new Excel file
    codon_table.to_excel(output_excel_path)

# Example usage:
opt_codon_table = r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\opt_codon_table.xlsx'  # Replace with the desired output path

cod_opt_table(codon_table, opt_codon_table)

Recalculate the frequency of the codons for each amino acid.

In [5]:
def calculate_new_frequencies(codon_table):
    # Group the DataFrame by Amino Acid
    grouped = codon_table.groupby('Amino Acid')

    # Calculate the total frequency for each amino acid
    amino_acid_totals = grouped['Frequency'].transform('sum')

    # Calculate the new frequencies for each codon based on the rule of three
    codon_table['New Frequency'] = (
        codon_table['Frequency'] / amino_acid_totals * 100
    )

    return codon_table

In [6]:
# Calculate the new frequencies
updated_opt_codon_table = calculate_new_frequencies(pd.read_excel(opt_codon_table))

# Print the updated DataFrame (with new frequencies)
print(updated_opt_codon_table)

    Unnamed: 0 Amino Acid Codon  Frequency  New Frequency
0            0          -   TAA       83.0      83.000000
1            1          -   TAG       10.0      10.000000
2            2          -   TGA        7.0       7.000000
3            4          A   GCC       59.1      60.927835
4            6          A   GCT       37.9      39.072165
5            7          C   TGC       65.6      65.600000
6            8          C   TGT       34.4      34.400000
7            9          D   GAC       70.0      70.000000
8           10          D   GAT       30.0      30.000000
9           11          E   GAA        6.4       6.400000
10          12          E   GAG       93.6      93.600000
11          13          F   TTC       80.7      80.700000
12          14          F   TTT       19.3      19.300000
13          15          G   GGA       16.2      16.413374
14          16          G   GGC       30.6      31.003040
15          18          G   GGT       51.9      52.583587
16          19

### 2. Sequence


Downloaded a random fasta sequence that we will codon optimise.

In [13]:
my_fasta = r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\YALI2.txt' 

Visualise your amino acid fasta sequence:

In [78]:
for seq_record in SeqIO.parse(my_fasta, "fasta"):
    print(seq_record.id)
    print(len(seq_record))
    print(repr(seq_record))
    name = seq_record.id
    AA_seq = seq_record.seq

YALI0_D05621g
592
SeqRecord(seq=Seq('MAHDSELELSDEKVVPSINQEKHSFFQRHLDNHPRMAQYNSQLQRFLKWIEVPT...IIS'), id='YALI0_D05621g', name='YALI0_D05621g', description='YALI0_D05621g  | Yarrowia lipolytica CLIB122 | YALI0D05621p | protein | length=592', dbxrefs=[])


Use the updated_opt_codon_table to reverse translate the amino acid sequence:

In [82]:

def reverse_translate_amino_acids_fasta(AA_seq, codon_table):
    # Initialize an empty DNA sequence
    dna_sequence = Seq("")
    
    # Remove the header line from the FASTA format
    amino_acid_sequence = AA_seq
    
    print(amino_acid_sequence)

    for amino_acid in amino_acid_sequence:
        # Filter the codon table for the current amino acid
        amino_acid_codons = codon_table[codon_table['Amino Acid'] == amino_acid]

        if not amino_acid_codons.empty:
            # Assign codon value ranges based on their frequencies
            value_ranges = []
            lower_bound = 0
            for _, row in amino_acid_codons.iterrows():
                upper_bound = lower_bound + row['New Frequency']
                value_ranges.append((row['Codon'], lower_bound, upper_bound))
                lower_bound = upper_bound

            # Generate a random value within the specified range
            random_value = random.uniform(0, 100)

            # Map the random value to the corresponding codon
            for codon, lower, upper in value_ranges:
                if lower <= random_value <= upper:
                    dna_sequence += Seq(codon)
                    break

    return dna_sequence



# Call the function to reverse translate the amino acid sequence
reverse_translated_dna = reverse_translate_amino_acids_fasta(AA_seq, updated_opt_codon_table)

# Create a SeqRecord object for the DNA sequence
dna_record = SeqRecord(reverse_translated_dna, id="reverse_translated_sequence")

# Print the SeqRecord or save it to a FASTA file
print(dna_record.format("fasta"))


MAHDSELELSDEKVVPSINQEKHSFFQRHLDNHPRMAQYNSQLQRFLKWIEVPTKEGEINTFLNNEDLKPVEVARQTWGWKNFVSFWIADSFNINTWEIAATGIQLGLTWWQVWLCVWIGYFFCGVFVVLSGRIGAIYHVSFPVAGRSTFGIFGSIWPVINRVVMACVWYGVQGWLGGQCIQVCLLAIWPSARHMKNGIPGSGTTTFEFLSYFLFWLFSLPFIYIRPHNLRHLFMVKAAIVPVAGISFLVWTCVKAHGIGPIMKQPATVHGSVMGWAFMTAIMNSLSNFATIIVNAPDFTRFAKEPNAIVLSQLIAVPTAFSLTSFIGIIVSSSATVLYDENIWNPLDVLHKFLEGNKSGSRAGVFFLGFAFAVAQLGTNIAANSLSAGTDMTALLPKYINIRRGGFICAGIALCICPWHLLSSSSNFTTYLSAYATFLSAIAGCSFSDYYLVRKGYIYVGDLYNASKGSTYMYRYGVNWRAFAAYFCGIAINVVGFADAVSDGGVNETARKMYQLNFFLGFLVSAISYYGFNWLSPVVGARETWSEDPNASAMYDEITTDELSQDSQSYDPEEWDRKIANDDPVKTTAIIS
>reverse_translated_sequence <unknown description>
ATGGCCCACGACTCTGAGCTGGAGCTGTCCGACGAGAAGGTTGTTCCTTCTATTAACCAG
GAGAAGCACTCTTTCTTCCAGCGACACCTGGATAACCACCCCCGAATGGCTCAGTACAAC
TCCCAGCTTCAGCGATTCCTCAAGTGGATCGAGGTCCCCACCAAGGAGGGTGAGATTAAC
ACCTTTCTTAACAACGAGGACCTGAAGCCTGTTGAGGTGGCCCGACAGACCTGGGGCTGG
AAGAACTTTGTTTCTTTCTGGATCGCCGACTCTTTCAACATCAACACCTGGGAGATCGCC
GCCACCGGTATCCAGCTGGGCCTCACTTGGTGGCAGGTCTGGCTCTGCGTG

In [80]:
def reverse_translate_amino_acids_fasta(AA_seq, codon_table):
    # Define the forbidden sequences
    forbidden_sequences = ["GCTCTTCN", "GCGGCCGC", "GGTCTCNNNNN"]
    
    # Initialize an empty DNA sequence
    dna_sequence = Seq("")
    
    # Remove the header line from the FASTA format
    amino_acid_sequence = AA_seq
    
    print(amino_acid_sequence)

    while True:
        # Initialize a list to keep track of added codons
        added_codons = []
        
        for amino_acid in amino_acid_sequence:
            # Filter the codon table for the current amino acid
            amino_acid_codons = codon_table[codon_table['Amino Acid'] == amino_acid]

            if not amino_acid_codons.empty:
                # Assign codon value ranges based on their frequencies
                value_ranges = []
                lower_bound = 0
                for _, row in amino_acid_codons.iterrows():
                    upper_bound = lower_bound + row['New Frequency']
                    value_ranges.append((row['Codon'], lower_bound, upper_bound))
                    lower_bound = upper_bound

                while True:
                    # Generate a random value within the specified range
                    random_value = random.uniform(0, 100)

                    # Map the random value to the corresponding codon
                    selected_codon = None
                    for codon, lower, upper in value_ranges:
                        if lower <= random_value <= upper:
                            selected_codon = codon
                            break
                    
                    # Check if the selected codon contains any forbidden sequences
                    contains_forbidden_sequence = any(seq in selected_codon for seq in forbidden_sequences)
                    
                    if not contains_forbidden_sequence:
                        # Add the selected codon to the DNA sequence
                        dna_sequence += Seq(selected_codon)
                        added_codons.append(selected_codon)
                        break
        
        # Check if the generated sequence contains any forbidden sequences
        contains_forbidden_sequence = any(seq in dna_sequence for seq in forbidden_sequences)
        
        if not contains_forbidden_sequence:
            break  # Exit the loop if the sequence is valid
    
    return dna_sequence

# Call the function to reverse translate the amino acid sequence
reverse_translated_dna = reverse_translate_amino_acids_fasta(AA_seq, updated_opt_codon_table)

# Create a SeqRecord object for the DNA sequence
dna_record = SeqRecord(reverse_translated_dna, id="reverse_translated_sequence")

# Print the SeqRecord or save it to a FASTA file
print(dna_record.format("fasta"))

MAHDSELELSDEKVVPSINQEKHSFFQRHLDNHPRMAQYNSQLQRFLKWIEVPTKEGEINTFLNNEDLKPVEVARQTWGWKNFVSFWIADSFNINTWEIAATGIQLGLTWWQVWLCVWIGYFFCGVFVVLSGRIGAIYHVSFPVAGRSTFGIFGSIWPVINRVVMACVWYGVQGWLGGQCIQVCLLAIWPSARHMKNGIPGSGTTTFEFLSYFLFWLFSLPFIYIRPHNLRHLFMVKAAIVPVAGISFLVWTCVKAHGIGPIMKQPATVHGSVMGWAFMTAIMNSLSNFATIIVNAPDFTRFAKEPNAIVLSQLIAVPTAFSLTSFIGIIVSSSATVLYDENIWNPLDVLHKFLEGNKSGSRAGVFFLGFAFAVAQLGTNIAANSLSAGTDMTALLPKYINIRRGGFICAGIALCICPWHLLSSSSNFTTYLSAYATFLSAIAGCSFSDYYLVRKGYIYVGDLYNASKGSTYMYRYGVNWRAFAAYFCGIAINVVGFADAVSDGGVNETARKMYQLNFFLGFLVSAISYYGFNWLSPVVGARETWSEDPNASAMYDEITTDELSQDSQSYDPEEWDRKIANDDPVKTTAIIS
>reverse_translated_sequence <unknown description>
ATGGCCCACGACTCCGAGCTCGAGCTGTCCGACGAAAAGGTGGTCCCCTCTATCAACCAG
GAGAAGCATTCTTTTTTTCAGCGACATCTGGACAACCACCCCCGAATGGCCCAGTACAAC
TCCCAGCTGCAGCGATTCCTCAAGTGGATTGAGGTGCCCACCAAGGAGGGTGAGATCAAC
ACCTTCCTCAACAACGAGGACCTGAAGCCTGTTGAGGTCGCCCGACAGACCTGGGGCTGG
AAGAACTTCGTTTCTTTCTGGATCGCTGATTCTTTCAACATTAACACCTGGGAGATTGCT
GCCACCGGCATTCAGCTTGGTCTGACTTGGTGGCAGGTCTGGCTCTGCGTT

Check that the sequence changes every time you run it:

In [71]:
print(dna_record.seq)

TACCGAGCCATGATCAACGCCTGCATCGACTCCGAACAGGAGAACTGCGAG


In [49]:
print(dna_record.seq)

ATGGCTCACGACTCCGAGCTGGAGCTGTCCGACGAAAAGGTCGTCCCCTCTATCAACCAGGAGAAGCACTCCTTCTTCCAGCGACACCTGGACAACCACCCCCGAATGGCTCAGTACAACTCTCAGCTCCAGCGATTCCTCAAGTGGATTGAGGTCCCTACTAAGGAGGGAGAGATTAACACCTTCCTTAACAACGAGGATCTGAAGCCCGTCGAGGTTGCCCGACAGACCTGGGGTTGGAAGAACTTCGTCTCCTTCTGGATCGCCGACTCCTTCAACATCAACACCTGGGAGATCGCCGCTACCGGTATTCAGCTTGGCCTTACTTGGTGGCAGGTCTGGCTCTGTGTCTGGATCGGTTACTTCTTCTGCGGTGTCTTCGTCGTTCTGTCCGGTCGAATTGGCGCTATCTACCACGTCTCTTTCCCCGTTGCTGGACGATCCACCTTCGGCATCTTCGGCTCTATTTGGCCCGTTATTAACCGAGTTGTTATGGCTTGCGTTTGGTACGGTGTTCAGGGTTGGCTTGGCGGTCAGTGTATTCAGGTCTGCCTCCTGGCTATTTGGCCCTCTGCTCGACATATGAAGAACGGTATCCCCGGTTCTGGAACTACCACTTTCGAGTTCCTTTCTTACTTCCTCTTCTGGCTCTTCTCCCTGCCCTTCATCTACATCCGACCCCACAACCTTCGACACCTCTTCATGGTTAAGGCCGCTATTGTCCCCGTCGCCGGTATCTCCTTCCTGGTCTGGACCTGCGTCAAGGCCCACGGCATTGGTCCCATCATGAAGCAGCCCGCCACTGTGCACGGTTCCGTCATGGGTTGGGCCTTTATGACCGCTATCATGAACTCCCTGTCTAACTTCGCCACTATTATTGTCAACGCCCCCGACTTCACCCGATTCGCCAAGGAGCCCAACGCCATTGTCCTTTCTCAGCTCATCGCCGTCCCTACCGCTTTTTCCCTCACCTCTTTCATTGGTATCATCGTTTCCTCTT

Check for forbidden sequences:


DIct No sapI (7nt), bsaI, notI, no stretches of the same nucleotide max 8, GC content (check 12 nt every new codon added), also check the reverse

In [50]:
from Bio.SeqUtils import GC
GC(dna_record.seq)

55.18018018018018

Translate the sequence back:

In [23]:
dna_record_new = dna_record.translate() 
print(dna_record_new)

ID: <unknown id>
Name: <unknown name>
Description: <unknown description>
Number of features: 0
/molecule_type=protein
Seq('MAHDSELELSDEKVVPSINQEKHSFFQRHLDNHPRMAQYNSQLQRFLKWIEVPT...IIS')


Compare two sequences: 

In [24]:
dna_record_new.seq == AA_seq

True

--------------------------------

In [25]:
def reverse_translate_amino_acids(amino_acid_sequence, codon_table):
    dna_sequence = Seq("")
    for amino_acid in amino_acid_sequence:
        # Filter the codon table for the current amino acid
        amino_acid_codons = codon_table[codon_table['Amino Acid'] == amino_acid]

        if not amino_acid_codons.empty:
            # Assign codon value ranges based on their frequencies
            value_ranges = []
            lower_bound = 0
            for _, row in amino_acid_codons.iterrows():
                upper_bound = lower_bound + row['New Frequency']
                value_ranges.append((row['Codon'], lower_bound, upper_bound))
                lower_bound = upper_bound

            # Generate a random value within the specified range
            random_value = random.uniform(0, 100)

            # Map the random value to the corresponding codon
            for codon, lower, upper in value_ranges:
                if lower <= random_value <= upper:
                    dna_sequence += Seq(codon)
                    break

    # Ensure that the DNA sequence matches the length of the amino acid sequence
    while len(dna_sequence) < len(amino_acid_sequence) * 3:
        # If it's shorter, add random codons (you can customize this part)
        random_codon = random.choice(codon_table['Codon'])
        dna_sequence += random_codon

    # Trim the sequence to match the desired length
    dna_sequence = dna_sequence[:len(amino_acid_sequence) * 3]

    return dna_sequence

--------------------------------

In [26]:
def read_fasta(my_fasta):
    sequence = ""
    
    try:
        with open(my_fasta, 'r') as file:
            lines = file.readlines()
            
            # Assuming the FASTA format has a header starting with '>'
            # followed by one or more lines of sequence data.
            for line in lines:
                if line.startswith(">"):
                    # Skip the header line
                    continue
                sequence += line.strip()
    except FileNotFoundError:
        print(f"File not found: {my_fasta}")
    
    return sequence

In [27]:
my_prot = read_fasta(my_fasta)
print(my_prot)

MAHDSELELSDEKVVPSINQEKHSFFQRHLDNHPRMAQYNSQLQRFLKWIEVPTKEGEINTFLNNEDLKPVEVARQTWGWKNFVSFWIADSFNINTWEIAATGIQLGLTWWQVWLCVWIGYFFCGVFVVLSGRIGAIYHVSFPVAGRSTFGIFGSIWPVINRVVMACVWYGVQGWLGGQCIQVCLLAIWPSARHMKNGIPGSGTTTFEFLSYFLFWLFSLPFIYIRPHNLRHLFMVKAAIVPVAGISFLVWTCVKAHGIGPIMKQPATVHGSVMGWAFMTAIMNSLSNFATIIVNAPDFTRFAKEPNAIVLSQLIAVPTAFSLTSFIGIIVSSSATVLYDENIWNPLDVLHKFLEGNKSGSRAGVFFLGFAFAVAQLGTNIAANSLSAGTDMTALLPKYINIRRGGFICAGIALCICPWHLLSSSSNFTTYLSAYATFLSAIAGCSFSDYYLVRKGYIYVGDLYNASKGSTYMYRYGVNWRAFAAYFCGIAINVVGFADAVSDGGVNETARKMYQLNFFLGFLVSAISYYGFNWLSPVVGARETWSEDPNASAMYDEITTDELSQDSQSYDPEEWDRKIANDDPVKTTAIIS


In [28]:
def translate_aa_to_dna(aa_sequence):
    """
    Translate an amino acid sequence to a DNA sequence without considering codon preference.

    Args:
        aa_sequence (str): The input amino acid sequence.

    Returns:
        str: The corresponding DNA sequence.
    """
    # Define a translation table for amino acids to codons (not taking freq into account)
    translation_table = {
        'A': 'GCT', 'C': 'TGT', 'D': 'GAC', 'E': 'GAG',
        'F': 'TTT', 'G': 'GGC', 'H': 'CAT', 'I': 'ATC',
        'K': 'AAG', 'L': 'CTG', 'M': 'ATG', 'N': 'AAC',
        'P': 'CCC', 'Q': 'CAG', 'R': 'CGT', 'S': 'TCT',
        'T': 'ACC', 'V': 'GTT', 'W': 'TGG', 'Y': 'TAC',
        '*': 'TAA',  # Stop codon
    }

    # Initialize the DNA sequence
    dna_sequence = ""

    # Translate each amino acid to its corresponding codon
    for aa in aa_sequence:
        codon = translation_table.get(aa)
        if codon:
            dna_sequence += codon
        else:
            # Handle unknown amino acids or gaps as needed
            raise ValueError(f"Unknown amino acid: {aa}")

    return dna_sequence


In [29]:
# Example usage:  
translated_dna = translate_aa_to_dna(my_prot)
print("DNA sequence:")
print(translated_dna)

DNA sequence:
ATGGCTCATGACTCTGAGCTGGAGCTGTCTGACGAGAAGGTTGTTCCCTCTATCAACCAGGAGAAGCATTCTTTTTTTCAGCGTCATCTGGACAACCATCCCCGTATGGCTCAGTACAACTCTCAGCTGCAGCGTTTTCTGAAGTGGATCGAGGTTCCCACCAAGGAGGGCGAGATCAACACCTTTCTGAACAACGAGGACCTGAAGCCCGTTGAGGTTGCTCGTCAGACCTGGGGCTGGAAGAACTTTGTTTCTTTTTGGATCGCTGACTCTTTTAACATCAACACCTGGGAGATCGCTGCTACCGGCATCCAGCTGGGCCTGACCTGGTGGCAGGTTTGGCTGTGTGTTTGGATCGGCTACTTTTTTTGTGGCGTTTTTGTTGTTCTGTCTGGCCGTATCGGCGCTATCTACCATGTTTCTTTTCCCGTTGCTGGCCGTTCTACCTTTGGCATCTTTGGCTCTATCTGGCCCGTTATCAACCGTGTTGTTATGGCTTGTGTTTGGTACGGCGTTCAGGGCTGGCTGGGCGGCCAGTGTATCCAGGTTTGTCTGCTGGCTATCTGGCCCTCTGCTCGTCATATGAAGAACGGCATCCCCGGCTCTGGCACCACCACCTTTGAGTTTCTGTCTTACTTTCTGTTTTGGCTGTTTTCTCTGCCCTTTATCTACATCCGTCCCCATAACCTGCGTCATCTGTTTATGGTTAAGGCTGCTATCGTTCCCGTTGCTGGCATCTCTTTTCTGGTTTGGACCTGTGTTAAGGCTCATGGCATCGGCCCCATCATGAAGCAGCCCGCTACCGTTCATGGCTCTGTTATGGGCTGGGCTTTTATGACCGCTATCATGAACTCTCTGTCTAACTTTGCTACCATCATCGTTAACGCTCCCGACTTTACCCGTTTTGCTAAGGAGCCCAACGCTATCGTTCTGTCTCAGCTGATCGCTGTTCCCACCGCTTTTTCTCTGACCTCTTTTATCGGCAT

In [30]:
verify_dna = r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\YALI2gene.txt'

In [31]:
reference_dna = read_fasta(verify_dna)
print(reference_dna)

ATGGCCCACGATTCCGAGCTCGAGCTGTCAGACGAAAAAGTCGTGCCTTCTATCAACCAAGAAAAGCACTCCTTTTTCCAACGGCATCTCGACAACCACCCCCGAATGGCCCAATACAACTCGCAACTGCAGCGGTTCCTCAAGTGGATCGAGGTCCCGACCAAGGAAGGAGAGATCAACACATTTCTCAACAACGAGGATCTGAAACCGGTCGAGGTCGCCCGCCAGACCTGGGGATGGAAGAACTTCGTGTCCTTCTGGATCGCAGACTCCTTCAACATCAACACCTGGGAAATTGCCGCCACAGGCATCCAGCTGGGACTCACCTGGTGGCAGGTCTGGCTGTGCGTGTGGATCGGCTACTTCTTCTGCGGAGTCTTTGTGGTCCTGTCTGGACGAATCGGAGCAATCTACCACGTGTCGTTTCCCGTGGCCGGCCGGTCGACATTTGGAATCTTTGGCTCCATCTGGCCCGTTATCAACCGAGTGGTCATGGCCTGCGTGTGGTACGGAGTCCAGGGCTGGCTGGGAGGCCAGTGTATCCAGGTGTGTCTTCTGGCAATCTGGCCTAGTGCTCGACACATGAAAAACGGAATCCCCGGAAGTGGAACCACCACCTTTGAGTTCCTCTCCTACTTTCTGTTCTGGCTCTTCTCCCTGCCCTTCATCTACATCAGACCCCACAACCTGCGGCACCTTTTCATGGTCAAGGCCGCCATCGTGCCTGTCGCAGGCATCTCATTCCTGGTCTGGACCTGTGTCAAGGCCCACGGAATCGGTCCCATCATGAAACAGCCCGCCACCGTTCACGGCTCGGTCATGGGCTGGGCCTTTATGACCGCCATCATGAACTCGCTGTCCAACTTTGCTACCATCATTGTCAACGCCCCGGACTTCACCCGTTTCGCCAAAGAACCTAACGCCATTGTGCTGTCGCAGCTCATTGCCGTGCCCACTGCATTCTCACTGACATCATTCATCGGTATCATTGTGTCATCTT

In [32]:
def compare_dna_sequences(seq1, seq2):
    """
    Compare two DNA sequences and identify mismatches.

    Args:
        seq1 (str): The first DNA sequence.
        seq2 (str): The second DNA sequence to compare.

    Returns:
        list: A list of tuples containing the position and the mismatched nucleotide.
    """
    mismatches = []
    min_len = min(len(seq1), len(seq2))

    for i in range(min_len):
        if seq1[i] != seq2[i]:
            mismatches.append((i, seq1[i], seq2[i]))

    return mismatches

# Ensure you have the 'dna_sequence' variable from the previous script
# Replace 'dna_sequence' with the actual DNA sequence you want to compare

In [33]:
compare_dna_sequences(translated_dna, reference_dna)

[(5, 'T', 'C'),
 (8, 'T', 'C'),
 (11, 'C', 'T'),
 (14, 'T', 'C'),
 (20, 'G', 'C'),
 (29, 'T', 'A'),
 (35, 'G', 'A'),
 (38, 'G', 'A'),
 (41, 'T', 'C'),
 (44, 'T', 'G'),
 (47, 'C', 'T'),
 (59, 'G', 'A'),
 (62, 'G', 'A'),
 (68, 'T', 'C'),
 (71, 'T', 'C'),
 (77, 'T', 'C'),
 (80, 'G', 'A'),
 (83, 'T', 'G'),
 (89, 'G', 'C'),
 (98, 'T', 'C'),
 (104, 'T', 'A'),
 (110, 'T', 'C'),
 (113, 'G', 'A'),
 (122, 'T', 'G'),
 (125, 'G', 'A'),
 (134, 'T', 'G'),
 (137, 'T', 'C'),
 (140, 'G', 'C'),
 (155, 'T', 'C'),
 (158, 'C', 'G'),
 (167, 'G', 'A'),
 (170, 'C', 'A'),
 (182, 'C', 'A'),
 (188, 'G', 'C'),
 (200, 'C', 'T'),
 (206, 'G', 'A'),
 (209, 'C', 'G'),
 (212, 'T', 'C'),
 (218, 'T', 'C'),
 (221, 'T', 'C'),
 (224, 'T', 'C'),
 (236, 'C', 'A'),
 (248, 'T', 'C'),
 (251, 'T', 'G'),
 (254, 'T', 'C'),
 (257, 'T', 'C'),
 (266, 'T', 'A'),
 (272, 'T', 'C'),
 (275, 'T', 'C'),
 (293, 'G', 'A'),
 (296, 'C', 'T'),
 (299, 'T', 'C'),
 (302, 'T', 'C'),
 (305, 'C', 'A'),
 (320, 'C', 'A'),
 (323, 'G', 'C'),
 (338, 'T', 'C

once you get the mismatches, you build again the dna sequence using the specific codon table.

after that you change the low frequency codons for higher frequency ones.

In [34]:
import pandas as pd

def translate_with_codon_table(aa_sequence, codon_preference_file):
    """
    Translate an amino acid sequence to a DNA sequence using a codon preference table from an Excel sheet,
    considering the highest-frequency codon for each amino acid.

    Args:
        aa_sequence (str): The input amino acid sequence.
        codon_preference_file (str): The path to the Excel file containing the codon preference table.
        sheet_name (str): The name of the sheet containing the codon preference data.

    Returns:
        str: The corresponding DNA sequence.
    """
    # Read the codon preference table from the Excel sheet
    codon_preference_df = codon_table
    
    # Create a dictionary to store the highest-frequency codon for each amino acid
    highest_frequency_codon = {}
    
    # Iterate over the DataFrame to find the highest-frequency codon for each amino acid
    for index, row in codon_preference_df.iterrows():
        amino_acid = row['Amino Acid']
        codon = row['Codon']
        frequency = row['Frequency']
        
        if amino_acid not in highest_frequency_codon or frequency > highest_frequency_codon[amino_acid]['Frequency']:
            highest_frequency_codon[amino_acid] = {'Codon': codon, 'Frequency': frequency}

    # Initialize the DNA sequence
    dna_sequence = ""
    
    # Translate each amino acid to its corresponding highest-frequency codon
    for aa in aa_sequence:
        codon_info = highest_frequency_codon.get(aa)
        if codon_info:
            dna_sequence += codon_info['Codon']
        else:
            # Handle unknown amino acids or gaps as needed
            raise ValueError(f"Unknown amino acid: {aa}")

    return dna_sequence


# Example usage:
dna_sequence = translate_with_codon_table(my_prot, codon_table)
print("DNA sequence:")
print(dna_sequence)



DNA sequence:
ATGGCCCACGACTCCGAGCTCGAGCTCTCCGACGAGAAGGTCGTCCCCTCCATCAACCAGGAGAAGCACTCCTTCTTCCAGCGACACCTCGACAACCACCCCCGAATGGCCCAGTACAACTCCCAGCTCCAGCGATTCCTCAAGTGGATCGAGGTCCCCACCAAGGAGGGTGAGATCAACACCTTCCTCAACAACGAGGACCTCAAGCCCGTCGAGGTCGCCCGACAGACCTGGGGTTGGAAGAACTTCGTCTCCTTCTGGATCGCCGACTCCTTCAACATCAACACCTGGGAGATCGCCGCCACCGGTATCCAGCTCGGTCTCACCTGGTGGCAGGTCTGGCTCTGCGTCTGGATCGGTTACTTCTTCTGCGGTGTCTTCGTCGTCCTCTCCGGTCGAATCGGTGCCATCTACCACGTCTCCTTCCCCGTCGCCGGTCGATCCACCTTCGGTATCTTCGGTTCCATCTGGCCCGTCATCAACCGAGTCGTCATGGCCTGCGTCTGGTACGGTGTCCAGGGTTGGCTCGGTGGTCAGTGCATCCAGGTCTGCCTCCTCGCCATCTGGCCCTCCGCCCGACACATGAAGAACGGTATCCCCGGTTCCGGTACCACCACCTTCGAGTTCCTCTCCTACTTCCTCTTCTGGCTCTTCTCCCTCCCCTTCATCTACATCCGACCCCACAACCTCCGACACCTCTTCATGGTCAAGGCCGCCATCGTCCCCGTCGCCGGTATCTCCTTCCTCGTCTGGACCTGCGTCAAGGCCCACGGTATCGGTCCCATCATGAAGCAGCCCGCCACCGTCCACGGTTCCGTCATGGGTTGGGCCTTCATGACCGCCATCATGAACTCCCTCTCCAACTTCGCCACCATCATCGTCAACGCCCCCGACTTCACCCGATTCGCCAAGGAGCCCAACGCCATCGTCCTCTCCCAGCTCATCGCCGTCCCCACCGCCTTCTCCCTCACCTCCTTCATCGGTAT

In [35]:
compare_dna_sequences(dna_sequence, reference_dna)

[(11, 'C', 'T'),
 (26, 'C', 'G'),
 (29, 'C', 'A'),
 (35, 'G', 'A'),
 (38, 'G', 'A'),
 (44, 'C', 'G'),
 (47, 'C', 'T'),
 (50, 'C', 'T'),
 (59, 'G', 'A'),
 (62, 'G', 'A'),
 (74, 'C', 'T'),
 (80, 'G', 'A'),
 (83, 'A', 'G'),
 (86, 'C', 'T'),
 (113, 'G', 'A'),
 (122, 'C', 'G'),
 (125, 'G', 'A'),
 (128, 'C', 'G'),
 (134, 'A', 'G'),
 (158, 'C', 'G'),
 (167, 'G', 'A'),
 (170, 'T', 'A'),
 (182, 'C', 'A'),
 (185, 'C', 'T'),
 (200, 'C', 'T'),
 (203, 'C', 'G'),
 (206, 'G', 'A'),
 (209, 'C', 'G'),
 (224, 'A', 'C'),
 (236, 'T', 'A'),
 (251, 'C', 'G'),
 (266, 'C', 'A'),
 (293, 'G', 'A'),
 (296, 'C', 'T'),
 (305, 'C', 'A'),
 (308, 'T', 'C'),
 (317, 'C', 'G'),
 (320, 'T', 'A'),
 (344, 'C', 'G'),
 (350, 'C', 'G'),
 (359, 'T', 'C'),
 (374, 'T', 'A'),
 (380, 'C', 'T'),
 (383, 'C', 'G'),
 (389, 'C', 'G'),
 (392, 'C', 'T'),
 (395, 'T', 'A'),
 (404, 'T', 'A'),
 (407, 'C', 'A'),
 (419, 'C', 'G'),
 (422, 'C', 'G'),
 (425, 'C', 'T'),
 (431, 'C', 'G'),
 (437, 'T', 'C'),
 (440, 'A', 'G'),
 (443, 'C', 'G'),
 (446,

DUMB you need to verify it by translating it into amino acids again and comparing the sequences:

In [36]:
def translate_dna_to_aa(dna_sequence):
    """
    Translate a DNA sequence to an amino acid sequence.

    Args:
        dna_sequence (str): The input DNA sequence.

    Returns:
        str: The corresponding amino acid sequence.
    """
    # Define a translation table for codons to amino acids
    translation_table = {
        'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L',
        'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TCG': 'S',
        'TAT': 'Y', 'TAC': 'Y', 'TAA': '*', 'TAG': '*',
        'TGT': 'C', 'TGC': 'C', 'TGA': '*', 'TGG': 'W',
        'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L',
        'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
        'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
        'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
        'ATT': 'I', 'ATC': 'I', 'ATA': 'I', 'ATG': 'M',
        'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
        'AAT': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',
        'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
        'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',
        'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
        'GAT': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
        'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',
    }

    # Initialize the amino acid sequence
    aa_sequence = ""
    
    # Iterate over the DNA sequence in triplets (codons)
    for i in range(0, len(dna_sequence), 3):
        codon = dna_sequence[i:i+3]
        aa = translation_table.get(codon)
        if aa:
            aa_sequence += aa
        else:
            # Handle unknown codons or gaps as needed
            raise ValueError(f"Unknown codon: {codon}")

    return aa_sequence

In [37]:
verify_sequence = translate_dna_to_aa(dna_sequence)
print(verify_sequence) 

MAHDSELELSDEKVVPSINQEKHSFFQRHLDNHPRMAQYNSQLQRFLKWIEVPTKEGEINTFLNNEDLKPVEVARQTWGWKNFVSFWIADSFNINTWEIAATGIQLGLTWWQVWLCVWIGYFFCGVFVVLSGRIGAIYHVSFPVAGRSTFGIFGSIWPVINRVVMACVWYGVQGWLGGQCIQVCLLAIWPSARHMKNGIPGSGTTTFEFLSYFLFWLFSLPFIYIRPHNLRHLFMVKAAIVPVAGISFLVWTCVKAHGIGPIMKQPATVHGSVMGWAFMTAIMNSLSNFATIIVNAPDFTRFAKEPNAIVLSQLIAVPTAFSLTSFIGIIVSSSATVLYDENIWNPLDVLHKFLEGNKSGSRAGVFFLGFAFAVAQLGTNIAANSLSAGTDMTALLPKYINIRRGGFICAGIALCICPWHLLSSSSNFTTYLSAYATFLSAIAGCSFSDYYLVRKGYIYVGDLYNASKGSTYMYRYGVNWRAFAAYFCGIAINVVGFADAVSDGGVNETARKMYQLNFFLGFLVSAISYYGFNWLSPVVGARETWSEDPNASAMYDEITTDELSQDSQSYDPEEWDRKIANDDPVKTTAIIS


now we compare to see if there are mismatches:

In [38]:
def compare_aa_sequences(seq1, seq2):
    """
    Compare two amino acid sequences and identify and indicate mismatches.

    Args:
        seq1 (str): The first amino acid sequence.
        seq2 (str): The second amino acid sequence to compare.

    Returns:
        list: A list of tuples containing the position and the mismatched amino acids.
    """
    mismatches = []
    min_len = min(len(seq1), len(seq2))

    for i in range(min_len):
        if seq1[i] != seq2[i]:
            mismatches.append((i, seq1[i], seq2[i]))

    return mismatches

# Example usage:
input_aa_sequence1 = verify_sequence  
input_aa_sequence2 =  my_prot  
mismatches = compare_aa_sequences(input_aa_sequence1, input_aa_sequence2)

if not mismatches:
    print("The two amino acid sequences are identical.")
else:
    print("Mismatches found:")
    for pos, aa1, aa2 in mismatches:
        print(f"Position {pos + 1}: {aa1} (expected) != {aa2} (actual)")

The two amino acid sequences are identical.


WORKS!

Now only the highest frequency codons appear, and so have 100% frequency.

now let's try with the sequence that Jonathan gave me, although it is not in fasta format

In [39]:
def calculate_gc_content(dna_sequence):
    """
    Calculate the GC content (percentage of G and C bases) of a DNA sequence.

    Args:
        dna_sequence (str): The input DNA sequence.

    Returns:
        float: The GC content as a percentage.
    """
    if len(dna_sequence) == 0:
        return 0.0
    
    gc_count = dna_sequence.count('G') + dna_sequence.count('C')
    total_bases = len(dna_sequence)
    
    gc_content = (gc_count / total_bases) * 100.0
    return gc_content

In [40]:
gc_content = calculate_gc_content(dna_sequence)
print(gc_content)

61.03603603603604


In [41]:
import pandas as pd

def translate_with_codon_table_restricted(aa_sequence, codon_table, forbidden_sequence):
    """
    Translate an amino acid sequence to a DNA sequence using a codon preference table from an Excel sheet,
    considering the highest-frequency codon for each amino acid, and using the second highest if the highest is forbidden.

    Args:
        aa_sequence (str): The input amino acid sequence.
        codon_table (DataFrame): The DataFrame containing the codon preference table.
        forbidden_sequence (str): The sequence that needs to be avoided.

    Returns:
        str: The corresponding DNA sequence.
    """
    # Create a dictionary to store the highest and second highest frequency codons for each amino acid
    codon_info = {}
    
    # Iterate over the DataFrame to find the highest and second highest frequency codons for each amino acid
    for index, row in codon_table.iterrows():
        amino_acid = row['Amino Acid']
        codon = row['Codon']
        frequency = row['Frequency']
        
        if amino_acid not in codon_info:
            codon_info[amino_acid] = {'Highest': codon, 'Highest_Frequency': frequency, 'Second_Highest': None}
        else:
            if frequency > codon_info[amino_acid]['Highest_Frequency']:
                codon_info[amino_acid]['Second_Highest'] = codon_info[amino_acid]['Highest']
                codon_info[amino_acid]['Second_Highest_Frequency'] = codon_info[amino_acid]['Highest_Frequency']
                codon_info[amino_acid]['Highest'] = codon
                codon_info[amino_acid]['Highest_Frequency'] = frequency
            elif codon_info[amino_acid]['Second_Highest'] is None or frequency > codon_info[amino_acid]['Second_Highest_Frequency']:
                codon_info[amino_acid]['Second_Highest'] = codon
                codon_info[amino_acid]['Second_Highest_Frequency'] = frequency

    # Initialize the DNA sequence
    dna_sequence = ""
    
    # Translate each amino acid to its corresponding highest-frequency codon, and if it's forbidden, use the second highest
    for aa in aa_sequence:
        highest_codon = codon_info[aa]['Highest']
        second_highest_codon = codon_info[aa]['Second_Highest'] if codon_info[aa]['Second_Highest'] is not None else codon_info[aa]['Highest']
        
        if aa_sequence.startswith(highest_codon) and not dna_sequence.endswith(forbidden_sequence):
            dna_sequence += highest_codon
        else:
            dna_sequence += second_highest_codon

    return dna_sequence

# Example usage:
forbidden_seq = "AGCTAA"  # Replace with your forbidden sequence
dna_sequence_2 = translate_with_codon_table_restricted(my_prot, codon_table, forbidden_seq)
print("DNA sequence 2:")
print(dna_sequence_2)


DNA sequence 2:
ATGGCTCATGATTCTGAACTGGAACTGTCTGATGAAAAAGTTGTTCCTTCTATTAATCAAGAAAAACATTCTTTTTTTCAACGTCATCTGGATAATCATCCTCGTATGGCTCAATATAATTCTCAACTGCAACGTTTTCTGAAATGGATTGAAGTTCCTACTAAAGAAGGCGAAATTAATACTTTTCTGAATAATGAAGATCTGAAACCTGTTGAAGTTGCTCGTCAAACTTGGGGCTGGAAAAATTTTGTTTCTTTTTGGATTGCTGATTCTTTTAATATTAATACTTGGGAAATTGCTGCTACTGGCATTCAACTGGGCCTGACTTGGTGGCAAGTTTGGCTGTGTGTTTGGATTGGCTATTTTTTTTGTGGCGTTTTTGTTGTTCTGTCTGGCCGTATTGGCGCTATTTATCATGTTTCTTTTCCTGTTGCTGGCCGTTCTACTTTTGGCATTTTTGGCTCTATTTGGCCTGTTATTAATCGTGTTGTTATGGCTTGTGTTTGGTATGGCGTTCAAGGCTGGCTGGGCGGCCAATGTATTCAAGTTTGTCTGCTGGCTATTTGGCCTTCTGCTCGTCATATGAAAAATGGCATTCCTGGCTCTGGCACTACTACTTTTGAATTTCTGTCTTATTTTCTGTTTTGGCTGTTTTCTCTGCCTTTTATTTATATTCGTCCTCATAATCTGCGTCATCTGTTTATGGTTAAAGCTGCTATTGTTCCTGTTGCTGGCATTTCTTTTCTGGTTTGGACTTGTGTTAAAGCTCATGGCATTGGCCCTATTATGAAACAACCTGCTACTGTTCATGGCTCTGTTATGGGCTGGGCTTTTATGACTGCTATTATGAATTCTCTGTCTAATTTTGCTACTATTATTGTTAATGCTCCTGATTTTACTCGTTTTGCTAAAGAACCTAATGCTATTGTTCTGTCTCAACTGATTGCTGTTCCTACTGCTTTTTCTCTGACTTCTTTTATTGGC

In [42]:
gc_content = calculate_gc_content(dna_sequence_2)
print(gc_content)

38.457207207207205


In [43]:
mismatches = compare_aa_sequences(dna_sequence, dna_sequence_2)
print(mismatches)

[(5, 'C', 'T'), (8, 'C', 'T'), (11, 'C', 'T'), (14, 'C', 'T'), (17, 'G', 'A'), (20, 'C', 'G'), (23, 'G', 'A'), (26, 'C', 'G'), (29, 'C', 'T'), (32, 'C', 'T'), (35, 'G', 'A'), (38, 'G', 'A'), (41, 'C', 'T'), (44, 'C', 'T'), (47, 'C', 'T'), (50, 'C', 'T'), (53, 'C', 'T'), (56, 'C', 'T'), (59, 'G', 'A'), (62, 'G', 'A'), (65, 'G', 'A'), (68, 'C', 'T'), (71, 'C', 'T'), (74, 'C', 'T'), (77, 'C', 'T'), (80, 'G', 'A'), (83, 'A', 'T'), (86, 'C', 'T'), (89, 'C', 'G'), (92, 'C', 'T'), (95, 'C', 'T'), (98, 'C', 'T'), (101, 'C', 'T'), (104, 'A', 'T'), (110, 'C', 'T'), (113, 'G', 'A'), (116, 'C', 'T'), (119, 'C', 'T'), (122, 'C', 'T'), (125, 'G', 'A'), (128, 'C', 'G'), (131, 'G', 'A'), (134, 'A', 'T'), (137, 'C', 'T'), (140, 'C', 'G'), (143, 'G', 'A'), (149, 'C', 'T'), (152, 'G', 'A'), (155, 'C', 'T'), (158, 'C', 'T'), (161, 'C', 'T'), (164, 'G', 'A'), (167, 'G', 'A'), (170, 'T', 'C'), (173, 'G', 'A'), (176, 'C', 'T'), (179, 'C', 'T'), (182, 'C', 'T'), (185, 'C', 'T'), (188, 'C', 'G'), (191, 'C', 'T

In [44]:
import pandas as pd
from collections import defaultdict

def calculate_codon_frequencies_as_percentages(dna_sequence, codon_table):
    """
    Calculate the frequency of each codon used in a DNA sequence for each amino acid as percentages relative to the total frequency.

    Args:
        dna_sequence (str): The DNA sequence.
        codon_table (DataFrame): The DataFrame containing the codon preference table.

    Returns:
        dict: A dictionary where keys are amino acids and values are dictionaries containing codon frequencies as percentages.
    """
    # Create a dictionary to store codon frequencies for each amino acid
    codon_frequencies = defaultdict(dict)
    
    # Create a dictionary to store the total frequency for each amino acid
    total_frequencies = defaultdict(int)
    
    # Iterate over the DNA sequence in steps of 3 to extract codons
    for i in range(0, len(dna_sequence), 3):
        codon = dna_sequence[i:i+3]
        
        # Find the corresponding amino acid for the codon in the codon table
        amino_acid = codon_table[codon_table['Codon'] == codon]['Amino Acid'].values[0]
        
        # Update the codon frequencies for the amino acid
        if amino_acid in codon_frequencies:
            if codon in codon_frequencies[amino_acid]:
                codon_frequencies[amino_acid][codon] += 1
            else:
                codon_frequencies[amino_acid][codon] = 1
        else:
            codon_frequencies[amino_acid] = {codon: 1}
        
        # Update the total frequency for the amino acid
        total_frequencies[amino_acid] += 1
    
    # Calculate frequencies as percentages
    codon_frequencies_percentages = {}
    for amino_acid, frequencies in codon_frequencies.items():
        total_frequency = total_frequencies[amino_acid]
        percentages = {codon: (count / total_frequency) * 100 for codon, count in frequencies.items()}
        codon_frequencies_percentages[amino_acid] = percentages
    
    return codon_frequencies_percentages

# Example usage:

codon_frequencies_percentages = calculate_codon_frequencies_as_percentages(dna_sequence_2, codon_table)

# Print codon frequencies as percentages for each amino acid
for amino_acid, percentages in codon_frequencies_percentages.items():
    print(f"Amino Acid: {amino_acid}")
    for codon, percentage in percentages.items():
        print(f"  Codon: {codon}, Frequency Percentage: {percentage:.2f}%")


Amino Acid: M
  Codon: ATG, Frequency Percentage: 100.00%
Amino Acid: A
  Codon: GCT, Frequency Percentage: 100.00%
Amino Acid: H
  Codon: CAT, Frequency Percentage: 100.00%
Amino Acid: D
  Codon: GAT, Frequency Percentage: 100.00%
Amino Acid: S
  Codon: TCT, Frequency Percentage: 100.00%
Amino Acid: E
  Codon: GAA, Frequency Percentage: 100.00%
Amino Acid: L
  Codon: CTG, Frequency Percentage: 100.00%
Amino Acid: K
  Codon: AAA, Frequency Percentage: 100.00%
Amino Acid: V
  Codon: GTT, Frequency Percentage: 100.00%
Amino Acid: P
  Codon: CCT, Frequency Percentage: 100.00%
Amino Acid: I
  Codon: ATT, Frequency Percentage: 100.00%
Amino Acid: N
  Codon: AAT, Frequency Percentage: 100.00%
Amino Acid: Q
  Codon: CAA, Frequency Percentage: 100.00%
Amino Acid: F
  Codon: TTT, Frequency Percentage: 100.00%
Amino Acid: R
  Codon: CGT, Frequency Percentage: 100.00%
Amino Acid: Y
  Codon: TAT, Frequency Percentage: 100.00%
Amino Acid: W
  Codon: TGG, Frequency Percentage: 100.00%
Amino Acid: T



The table contains three columns:

Amino acid: The amino acid encoded by a given codon (“-“ is for stop codon)

Codon: The codon in question

Frequency: The frequency of a given codon when a specific AA in encoded. 

(If the following is too confusing/complicated then just get started with the other things first, and then get back to this later.)

For example alanine (A), can be encoded by 4 codons (GCA, GCC, GCG, GCT). GCC and GCT occur frequently (59% and 38%), whereas GCA and GCG are very rarely used (2% and 1%). So we can see that Yarrowia has a preference GCC and GCT in regards to alanine.
Biologically speaking this also correlates with the number of tRNA genes Yarrowia has for a given codon.

When doing codon optimization, we don’t want to use the rare codons at all, and we would like to frequent codons to be selected according to their occurrence by using a random number generator.

I would remove the rare codons, and recalculate the frequency for the common codons. You can then proceed in slightly different ways, but you could assign the codons values from 0-100 based on their frequency.

In the case of alanine it might look like this:
After removing the rare codons, the new frequency is:
            GCC = 60.9 %

GCT = 39.1 %

So lets assign GCC the values 0-60.9, and GCT 61.0-100.

We input the sequence AAA, and iterate through it, randomizing a number from 0-100 each time.

We might get the numbers 12, 89, and 35. This would then correspond to the sequence GCC + GCT + GCC.

In [45]:
def find_second_highest_codon(codon_table):
    """
    Find the second-highest frequency codon for each amino acid in the codon preference table.

    Args:
        codon_table (DataFrame): The DataFrame containing the codon preference table.

    Returns:
        dict: A dictionary where keys are amino acids and values are the second-highest frequency codons.
    """
    second_highest_codons = {}

    # Group the codon table by amino acid and sort within each group by frequency
    grouped = codon_table.groupby('Amino Acid')
    
    for amino_acid, group in grouped:
        sorted_group = group.sort_values(by='Frequency', ascending=False)
        codons = sorted_group['Codon'].values
        if len(codons) >= 2:
            second_highest_codons[amino_acid] = codons[1]
        else:
            second_highest_codons[amino_acid] = "This amino acid is only encoded by one codon"

    return second_highest_codons

In [46]:
second_highest_codons = find_second_highest_codon(codon_table)
print(second_highest_codons)

{'-': 'TAG', 'A': 'GCT', 'C': 'TGT', 'D': 'GAT', 'E': 'GAA', 'F': 'TTT', 'G': 'GGC', 'H': 'CAT', 'I': 'ATT', 'K': 'AAA', 'L': 'CTG', 'M': 'This amino acid is only encoded by one codon', 'N': 'AAT', 'P': 'CCT', 'Q': 'CAA', 'R': 'CGT', 'S': 'TCT', 'T': 'ACT', 'V': 'GTT', 'W': 'This amino acid is only encoded by one codon', 'Y': 'TAT'}


In [47]:
Codon = "Codon"
print(sheet2[Codon])

codons = sheet2[Codon]
print(codons)

codon_sequence = ''.join(codons)
print(codon_sequence)


NameError: name 'sheet2' is not defined

In [None]:
def substitute_rare_codons(dna_sequence, codon_table):

    # Calculate the threshold frequency for rare codons (e.g., 10%)
    threshold_frequency = 0.10

    # Create a dictionary to map rare codons to "-"
    rare_codon_mapping = {
        codon: "-" if frequency < threshold_frequency else codon
        for codon, frequency in zip(codon_table['Codon'], codon_table['Frequency'])
    }

    # Substitute rare codons in the DNA sequence
    substituted_sequence = ''.join([rare_codon_mapping.get(codon, codon) for codon in dna_sequence])

    return substituted_sequence

In [None]:
rare_codon_sequence = substitute_rare_codons(codon_sequence, codon_table)
print(rare_codon_sequence)

AAAAACAAGAATACAACCACGACTAGAAGCAGGAGTATAATCATGATTCAACACCAGCATCCACCCCCGCCTCGACGCCGGCGTCTACTCCTGCTTGAAGACGAGGATGCAGCCGCGGCTGGAGGCGGGGGTGTAGTCGTGGTTTAATACTAGTATTCATCCTCGTCTTGATGCTGGTGTTTATTCTTGTTT


- use biopython functions
- rare codons <5%
- in rare cases you might need rare codons - if restriction site is present in code - let it proceed but flag - say that it needs to be changed by hand
- delete rare codons from codon_table

- safranal try / acetoin
- assemble and transform parts for heat sensitivity
- recombination decide 

-----------------------------