# Sequence optimisation

The user should be able to specify forbidden sequences (e.g. restriction sites, polynucleotide stretches) for the script to avoid. The script should also keep the GC content within an acceptable range (30-80%), both locally (within a specified window size, e.g. 15 bp) and globally (over the whole sequence). Finally, the script should double-check that the generated DNA translates back to the AA sequence provided as input. The steps are:

1. Download your FASTA sequence from FungiDB and visualise it using Biopython.

2. Create a dictionary of forbidden sequences: the SapI (7 nt), BsaI and NotI recognition sites, stretches of 8 or more identical nucleotides, and GC content limits. Every time a new codon is added, 12 nt of both the forward and reverse sequences should be checked.

3. Run the sequence through the code - check that a new sequence is generated every time.

4. Translate the sequence and check if it matches the original one.

5. Find a way to collect all optimised sequences in a multifasta (bloc)
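
The local and global GC requirement described above can be sketched in plain Python. This is only an illustration, not the notebook's final implementation: the function names `gc_percent` and `gc_violations` are hypothetical, and the defaults (15 nt window, 30-80%) are taken from the description at the top of this page.

```python
def gc_percent(seq):
    """Return the GC content of seq as a percentage (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_violations(seq, window=15, lower=30.0, upper=80.0):
    """Return (start, gc) for every window whose GC content is outside [lower, upper].

    Sequences shorter than the window are checked as one single window.
    """
    flagged = []
    last_start = max(len(seq) - window, 0)
    for start in range(last_start + 1):
        gc = gc_percent(seq[start:start + window])
        if not lower <= gc <= upper:
            flagged.append((start, round(gc, 1)))
    return flagged


toy = "A" * 15 + "G" * 15               # hypothetical test sequence
print(gc_percent(toy))                   # global GC content of the whole sequence
print(gc_violations(toy, window=15))     # local windows that break the limits
```

Calling `gc_percent` on the whole sequence gives the global value, while `gc_violations` reports where the local limits are broken, so the two checks can be flagged separately.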


In [1]:
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import random
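try:
    from Bio.SeqUtils import GC
except ImportError:
    # Recent Biopython releases provide gc_fraction (a 0-1 fraction) instead of GC
    from Bio.SeqUtils import gc_fraction
    GC = lambda seq: 100 * gc_fraction(seq)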

Downloaded a random fasta sequence that we will codon optimise.

In [6]:
my_fasta = r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\YALI2.txt' 
codon_table = pd.read_excel(r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\bounds_codon_table.xlsx')

        "add_stop (bool, optional): Adds final stop codon (TAA) if True. Defaults to True.
        GC_limit_upper (int, optional): Upper limit for local GC content. Defaults to 80.
        GC_limit_lower (int, optional): Lower limit for local GC content. Defaults to 20.
        window_size_gc (int, optional): Window size for local GC content. Defaults to 12.
        window_size_poly (int, optional): Shortest stretch of polynucleotides not allowed. Defaults to 8.
        lenient_GC (bool, optional): Tries to stay within GC constraints, but may proceed if high local GC content is unavoidable. Defaults to False.
        quiet (bool, optional): Silences print comments during the script. Defaults to True."

Let's now reverse translate the sequence using the upper and lower bounds and a random number generator:

In [21]:
def codon_opt(AA_seq, codon_table):
    """Convert an AA sequence into a Yarrowia lipolytica codon-optimised DNA sequence.

    Args:
        AA_seq (str or Seq): Amino acid sequence
        codon_table (pandas.DataFrame): Optimised codon table with the columns
            'Amino Acid', 'Codon', 'Lower_Bound' and 'Upper_Bound'
    Returns:
        dna_sequence (Seq): Codon-optimised DNA sequence
    """
    # Initialize an empty DNA sequence
    dna_sequence = Seq("")

    for amino_acid in AA_seq:
        # Filter the codon table for the current amino acid
        amino_acid_codons = codon_table[codon_table['Amino Acid'] == amino_acid]

        # Amino acids that are missing from the table are skipped
        if not amino_acid_codons.empty:
            # Generate a random value within the specified range
            random_value = random.uniform(0, 100)

            # Map the random value to the corresponding codon using the bounds from the table
            for _, row in amino_acid_codons.iterrows():
                # Half-open interval: a value exactly on a boundary between two codons
                # selects the codon whose Lower_Bound equals that value.
                if row['Lower_Bound'] <= random_value < row['Upper_Bound']:
                    dna_sequence += Seq(row['Codon'])
                    break
            else:
                # No interval matched (e.g. random_value == 100): use the last codon listed
                dna_sequence += Seq(amino_acid_codons.iloc[-1]['Codon'])

    return dna_sequence
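
To make the bounds lookup easier to follow, here is a minimal sketch without pandas. It assumes that the bounds of all codons for one amino acid split the range 0-100 into consecutive intervals, so that a uniform random number picks each codon in proportion to the width of its interval. The numbers in `TOY_BOUNDS` are invented for illustration and are not taken from `bounds_codon_table.xlsx`; `pick_codon` is a hypothetical helper name.

```python
import random

# Hypothetical bounds for lysine only: (amino acid, codon, lower bound, upper bound)
TOY_BOUNDS = [
    ("K", "AAG", 0.0, 85.0),
    ("K", "AAA", 85.0, 100.0),
]


def pick_codon(amino_acid, bounds, rng=random):
    """Pick one codon for amino_acid, weighted by the width of its interval."""
    rows = [row for row in bounds if row[0] == amino_acid]
    if not rows:
        raise KeyError(f"no codons listed for amino acid {amino_acid!r}")
    value = rng.uniform(0, 100)
    for _, codon, lower, upper in rows:
        if lower <= value < upper:
            return codon
    # Only reached if value falls outside every interval (e.g. exactly 100)
    return rows[-1][1]


rng = random.Random(0)  # seeded so that the run is reproducible
picks = [pick_codon("K", TOY_BOUNDS, rng) for _ in range(2000)]
print(picks.count("AAG") / len(picks))  # close to 0.85 with these toy bounds
```

Because each call draws a fresh random number, repeated runs on the same protein give different DNA sequences, which is the behaviour step 3 asks to check.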

In [22]:
for seq_record in SeqIO.parse(my_fasta, "fasta"):
    print(seq_record.id)
    print(len(seq_record))
    print(repr(seq_record))
    # Call the function to reverse translate the amino acid sequence
    reverse_translated_dna = codon_opt(seq_record.seq, codon_table)
    # Create a SeqRecord object for the DNA sequence
    dna_record = SeqRecord(reverse_translated_dna, id=(seq_record.id+"_opti"), description="")
    # Print the SeqRecord or save it to a FASTA file
    print(dna_record.format("fasta"))

YALI0_D05621g
592
SeqRecord(seq=Seq('MAHDSELELSDEKVVPSINQEKHSFFQRHLDNHPRMAQYNSQLQRFLKWIEVPT...IIS'), id='YALI0_D05621g', name='YALI0_D05621g', description='YALI0_D05621g  | Yarrowia lipolytica CLIB122 | YALI0D05621p | protein | length=592', dbxrefs=[])
>YALI0_D05621g_opti
ATGGCCCACGACTCCGAGCTTGAGCTCTCCGACGAGAAGGTCGTGCCTTCTATCAACCAG
GAGAAGCACTCCTTTTTCCAGCGACACCTCGACAACCACCCCCGAATGGCTCAGTACAAC
TCCCAGCTCCAGCGATTCCTCAAGTGGATCGAGGTTCCTACCAAGGAGGGTGAGATCAAC
ACTTTCCTCAACAACGAGGATCTGAAGCCCGTCGAGGTCGCTCGACAGACCTGGGGCTGG
AAGAACTTTGTCTCCTTCTGGATCGCTGACTCCTTTAACATCAACACCTGGGAGATCGCT
GCCACCGGCATCCAGCTCGGCCTCACCTGGTGGCAGGTCTGGCTCTGCGTTTGGATCGGC
TACTTCTTCTGTGGTGTCTTCGTTGTCCTCTCTGGCCGAATCGGTGCCATTTACCATGTC
TCCTTCCCCGTTGCTGGACGATCCACTTTTGGCATCTTCGGCTCCATTTGGCCCGTCATC
AACCGAGTTGTCATGGCCTGTGTCTGGTACGGTGTCCAGGGTTGGCTTGGCGGCCAGTGC
ATCCAGGTCTGCCTCCTTGCCATTTGGCCCTCCGCCCGACACATGAAGAACGGTATCCCC
GGCTCCGGTACTACTACCTTCGAGTTCCTGTCCTACTTCCTGTTTTGGCTTTTCTCCCTG
CCCTTCATCTACATCCGACCCCATAACCTCCGACACCTCTTCATGGTCAAGGCTGC
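
Step 4 (translate the DNA and compare it with the input) can be sketched as follows. This version assumes the standard genetic code and builds the table by hand so that it runs without Biopython; `translate` and `matches_protein` are hypothetical helper names. With Biopython available, `Seq.translate()` could be used for the same comparison.

```python
from itertools import product

# Standard genetic code in the usual TCAG ordering (first base varies slowest)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon; stop codons become '*'."""
    dna = dna.upper()
    if len(dna) % 3:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(STANDARD_CODE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def matches_protein(dna, protein):
    """True if dna translates to protein, ignoring a trailing stop codon."""
    return translate(dna).rstrip("*") == str(protein).rstrip("*")


# First 60 nt of the optimised output above and the first 20 residues of the input
dna_start = "ATGGCCCACGACTCCGAGCTTGAGCTCTCCGACGAGAAGGTCGTGCCTTCTATCAACCAG"
protein_start = "MAHDSELELSDEKVVPSINQ"
print(matches_protein(dna_start, protein_start))  # True
```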

-----------------------------------

flag GC content:

In [None]:
def reverse_translate_amino_acids_fasta(AA_seq, codon_table, max_attempts=1000):
    # Recognition sites to avoid: SapI (GCTCTTC), NotI (GCGGCCGC) and BsaI (GGTCTC).
    # The trailing N wildcards are left out because a plain substring search
    # cannot match them; only the fixed part of each site has to be absent.
    forbidden_sequences = ["GCTCTTC", "GCGGCCGC", "GGTCTC"]
    # SapI and BsaI sites are not palindromic, so the reverse complements are checked too
    forbidden_sequences += [str(Seq(site).reverse_complement()) for site in forbidden_sequences]

    # Since we have previously defined AA_seq as seq_record.seq, we do not have the header line from the FASTA format
    amino_acid_sequence = AA_seq

    # max_attempts is an added safeguard (assumed value) against an endless loop
    for attempt in range(max_attempts):
        # Start every attempt from an empty list of codons
        added_codons = []

        for amino_acid in amino_acid_sequence:
            # Filter the codon table for the current amino acid
            amino_acid_codons = codon_table[codon_table['Amino Acid'] == amino_acid]
            if amino_acid_codons.empty:
                continue

            # Generate a random value within the specified range
            random_value = random.uniform(0, 100)

            # Map the random value to the corresponding codon using the bounds from the table.
            # If the value is exactly on a boundary, the codon that appears first is selected.
            selected_codon = None
            for _, row in amino_acid_codons.iterrows():
                if row['Lower_Bound'] <= random_value <= row['Upper_Bound']:
                    selected_codon = row['Codon']
                    break
            if selected_codon is None:
                # No interval matched (gap in the bounds): use the last codon listed
                selected_codon = amino_acid_codons.iloc[-1]['Codon']
            added_codons.append(selected_codon)

        # Restriction sites are longer than one codon, so check the assembled sequence
        dna_sequence = "".join(added_codons)
        if not any(site in dna_sequence for site in forbidden_sequences):
            return Seq(dna_sequence)  # Valid sequence found

    raise RuntimeError(f"No sequence free of forbidden sites found in {max_attempts} attempts")


In [None]:
# Call the function to reverse translate the amino acid sequence
# (seq_record and codon_table were defined in the cells above)
reverse_translated_dna = reverse_translate_amino_acids_fasta(seq_record.seq, codon_table)

# Create a SeqRecord object for the DNA sequence
dna_record = SeqRecord(reverse_translated_dna, id="reverse_translated_sequence", description="")

# Print the SeqRecord or save it to a FASTA file
print(dna_record.format("fasta"))
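
For step 5, one possible way to collect several optimised sequences in a single multi-FASTA file is sketched below with the standard library only. `write_multifasta` is a hypothetical helper and the records are toy data. With Biopython, `SeqIO.write` also accepts a list of `SeqRecord` objects and would be the more direct route inside this notebook.

```python
import io


def write_multifasta(records, handle, width=60):
    """Write (identifier, sequence) pairs to handle in FASTA format.

    Sequence lines are wrapped at `width` characters. Returns the number of records written.
    """
    count = 0
    for identifier, sequence in records:
        sequence = str(sequence)
        handle.write(f">{identifier}\n")
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
        count += 1
    return count


# Toy records; in the notebook these would be the optimised sequences and their ids
toy_records = [("geneA_opti", "ATG" * 25), ("geneB_opti", "GGC")]
buffer = io.StringIO()  # stands in for open("optimised.fasta", "w")
written = write_multifasta(toy_records, buffer)
print(written)
print(buffer.getvalue())
```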

Check for forbidden sequences in this order: GC content, then restriction enzyme (RE) sites, then polynucleotide stretches.

First flag, then remove.
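
The flagging half of this plan could look roughly like the sketch below, which only reports positions and does not remove anything. It assumes the fixed parts of the recognition sites (SapI GCTCTTC, BsaI GGTCTC, NotI GCGGCCGC) and a minimum forbidden homopolymer length of 8. The function names are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")
RESTRICTION_SITES = {"SapI": "GCTCTTC", "BsaI": "GGTCTC", "NotI": "GCGGCCGC"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def flag_sites(dna, sites=RESTRICTION_SITES):
    """Return (enzyme, strand, start) for every site found on either strand."""
    dna = dna.upper()
    hits = []
    for name, site in sites.items():
        patterns = [("+", site)]
        rc = reverse_complement(site)
        if rc != site:  # palindromic sites only need to be searched once
            patterns.append(("-", rc))
        for strand, pattern in patterns:
            # The lookahead also finds overlapping occurrences
            for match in re.finditer(f"(?={pattern})", dna):
                hits.append((name, strand, match.start()))
    return sorted(hits, key=lambda hit: hit[2])


def flag_homopolymers(dna, min_length=8):
    """Return (base, start, length) for runs of min_length or more identical bases."""
    pattern = r"([ACGT])\1{%d,}" % (min_length - 1)
    return [(m.group(1), m.start(), len(m.group())) for m in re.finditer(pattern, dna.upper())]


toy = "AAGGTCTCAA" + "GAAGAGC" + "T" * 8  # hypothetical test sequence
print(flag_sites(toy))         # [('BsaI', '+', 2), ('SapI', '-', 10)]
print(flag_homopolymers(toy))  # [('T', 17, 8)]
```

Keeping the flagging separate from the removal step makes it possible to inspect what was found before deciding how to replace the offending codons.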

In [None]:
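flagged = [site for site in ("GCTCTTC", "GGTCTC", "GCGGCCGC") if site in dna_record.seq]
print(flagged)  # forbidden sites found on the forward strand (empty list if none)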

In [None]:
GC(dna_record.seq)