# Sequence optimisation

The user should be able to specify forbidden sequences (e.g. restriction sites, homopolymer stretches) that the script avoids. The script should also keep the GC content within an acceptable range (30-80%), both locally (within a specified window size, e.g. 15 bp) and globally (over the whole sequence). Finally, the script should double-check that the generated DNA translates back to the amino acid sequence provided as input. The steps are:

1. Download a FASTA sequence from FungiDB and visualise it using Biopython.

2. Create a dictionary of forbidden sequences: the SapI (7 nt), BsaI and NotI recognition sites, stretches of the same nucleotide (max 8), and GC content limits. Every time a new codon is added, the last 12 nt of both the forward and reverse sequences should be checked.

3. Run the sequence through the code - check that a new sequence is generated every time.

4. Translate the sequence and check if it matches the original one.
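
Step 2 could be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation: the names `FORBIDDEN_SITES`, `MAX_HOMOPOLYMER`, `WINDOW`, `reverse_complement` and `tail_is_allowed` are hypothetical, the recognition sequences are the commonly listed ones for SapI, BsaI and NotI and should be confirmed against the enzyme supplier, and "max 8" is read here as "runs of up to 8 identical nucleotides are tolerated".

```python
# Hypothetical names throughout; confirm the recognition sites before relying on them.
FORBIDDEN_SITES = {
    "SapI": "GCTCTTC",   # 7 nt
    "BsaI": "GGTCTC",    # 6 nt
    "NotI": "GCGGCCGC",  # 8 nt, palindromic
}
MAX_HOMOPOLYMER = 8  # assumption: runs longer than this are rejected
WINDOW = 12          # nt re-checked after each codon is added

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def tail_is_allowed(dna, window=WINDOW):
    """Return True if the last `window` nt contain no forbidden motif on either strand."""
    tail = dna[-window:].upper()
    motifs = list(FORBIDDEN_SITES.values())
    # A run one longer than the allowed maximum counts as forbidden
    motifs += [base * (MAX_HOMOPOLYMER + 1) for base in "ACGT"]
    for strand in (tail, reverse_complement(tail)):
        if any(motif in strand for motif in motifs):
            return False
    return True


print(tail_is_allowed("ATGGCTCACGAT"))  # True: no forbidden motif
print(tail_is_allowed("ATGGCTCTTCAA"))  # False: SapI site on the forward strand
print(tail_is_allowed("ATGGAGACCAAA"))  # False: BsaI site on the reverse strand
```

A 12 nt window is enough here because any motif of up to 10 nt that overlaps the newest 3 nt codon lies entirely within the last 12 nt; the longest motif in this sketch is the 9 nt homopolymer run.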



In [1]:
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import random

In rare cases you might need rare codons. If a restriction site is still present in the sequence, let the script proceed but flag it, stating that the site needs to be changed by hand. The aim of this notebook is to avoid this.
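
One possible way to produce such a flag is to scan the finished sequence and report where each site remains. The sketch below is an assumption about how this could look, not code from this notebook: `SITES` and `find_remaining_sites` are hypothetical names, and the recognition sequences should be confirmed against the enzyme supplier.

```python
# Hypothetical helper: report forbidden sites that survived optimisation.
SITES = {"SapI": "GCTCTTC", "BsaI": "GGTCTC", "NotI": "GCGGCCGC"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_remaining_sites(dna, sites=SITES):
    """Return (enzyme, strand, start) for every site found.

    `start` is the 0-based position of the match on the forward strand;
    strand "-" means the site itself reads on the reverse strand.
    """
    dna = dna.upper()
    hits = []
    for enzyme, site in sites.items():
        rc_site = site.translate(_COMPLEMENT)[::-1]
        # A palindromic site (e.g. NotI) reads the same on both strands
        patterns = {"+": site} if rc_site == site else {"+": site, "-": rc_site}
        for strand, pattern in patterns.items():
            start = dna.find(pattern)
            while start != -1:
                hits.append((enzyme, strand, start))
                start = dna.find(pattern, start + 1)
    return hits


for enzyme, strand, start in find_remaining_sites("ATGGGTCTCAAAGAAGAGCTAA"):
    print(f"WARNING: {enzyme} site on {strand} strand at position {start}; change by hand")
```

The script can then continue and leave the decision to the user, as described above, rather than aborting on the first hit.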

A random FASTA sequence was downloaded; this is the sequence we will codon optimise.

In [2]:
my_fasta = r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\YALI2.txt' 

Visualise your amino acid fasta sequence:

In [4]:
for seq_record in SeqIO.parse(my_fasta, "fasta"):
    print(seq_record.id)
    print(len(seq_record))
    print(repr(seq_record))
    name = seq_record.id
    AA_seq = seq_record.seq

YALI0_D05621g
592
SeqRecord(seq=Seq('MAHDSELELSDEKVVPSINQEKHSFFQRHLDNHPRMAQYNSQLQRFLKWIEVPT...IIS'), id='YALI0_D05621g', name='YALI0_D05621g', description='YALI0_D05621g  | Yarrowia lipolytica CLIB122 | YALI0D05621p | protein | length=592', dbxrefs=[])


Use the updated_opt_codon_table to reverse translate the amino acid sequence. First open the excel file with the new codon table including the upper and lower bounds:

In [7]:
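# Read the Excel file into a DataFrame (replace PATH with the codon table file, then uncomment).
# The variable name must match the one passed to the reverse translation function: updated_opt_codon_table
#updated_opt_codon_table = pd.read_excel(r'C:\Users\Candela\OneDrive - Danmarks Tekniske Universitet\Codopti\PATH', sheet_name=0)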
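
To make the upper and lower bounds concrete, here is a small worked example in plain Python. The frequencies are hypothetical and are not taken from the real codon table; they only illustrate how cumulative frequencies turn into one value range per codon. The reverse translation function in this notebook reads the columns `Amino Acid`, `Codon` and `New Frequency` from the table.

```python
from itertools import accumulate

# Hypothetical frequencies (in %) for the four alanine codons
alanine = [("GCT", 40.0), ("GCC", 45.0), ("GCA", 10.0), ("GCG", 5.0)]

# Each codon's upper bound is the running total of the frequencies so far;
# its lower bound is the previous codon's upper bound.
upper_bounds = list(accumulate(freq for _, freq in alanine))
lower_bounds = [0.0] + upper_bounds[:-1]

for (codon, _), lower, upper in zip(alanine, lower_bounds, upper_bounds):
    print(f"{codon}: {lower:5.1f} to {upper:5.1f}")
```

A random value drawn between 0 and the final upper bound then falls into exactly one range, so each codon is chosen with a probability proportional to its frequency.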

---------------------------------

In [None]:

def reverse_translate_amino_acids_fasta(AA_seq, codon_table):
    # Collect the chosen codons in a list and join them once at the end
    codons = []

    # AA_seq is already a Seq parsed by SeqIO, so there is no FASTA header to strip
    amino_acid_sequence = str(AA_seq)

    print(amino_acid_sequence)

    for amino_acid in amino_acid_sequence:
        # Filter the codon table for the current amino acid
        amino_acid_codons = codon_table[codon_table['Amino Acid'] == amino_acid]

        if amino_acid_codons.empty:
            # Fail loudly instead of silently dropping a residue
            raise ValueError(f"No codons found for amino acid '{amino_acid}'")

        # Assign codon value ranges based on their frequencies
        value_ranges = []
        lower_bound = 0
        for _, row in amino_acid_codons.iterrows():
            upper_bound = lower_bound + row['New Frequency']
            value_ranges.append((row['Codon'], lower_bound, upper_bound))
            lower_bound = upper_bound

        # Draw within the actual cumulative total, so frequencies that do not
        # sum to exactly 100 cannot leave the value outside every range
        random_value = random.uniform(0, upper_bound)

        # Map the random value to the corresponding codon
        for codon, lower, upper in value_ranges:
            if lower <= random_value <= upper:
                codons.append(codon)
                break

    return Seq("".join(codons))

# Call the function to reverse translate the amino acid sequence
reverse_translated_dna = reverse_translate_amino_acids_fasta(AA_seq, updated_opt_codon_table)

# Create a SeqRecord object for the DNA sequence
dna_record = SeqRecord(reverse_translated_dna, id="reverse_translated_sequence")

# Print the SeqRecord or save it to a FASTA file
print(dna_record.format("fasta"))
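
Step 3 asks that a new sequence is generated every time. A minimal way to check this is to run the reverse translation several times and count the distinct results. The sketch below is self-contained and uses a hypothetical toy table (`TOY_TABLE`) and a hypothetical helper (`toy_reverse_translate`); `random.choices` stands in for the range-based draw used in this notebook, and both pick codons in proportion to their frequencies.

```python
import random

# Hypothetical toy codon table: amino acid -> [(codon, weight), ...]
TOY_TABLE = {
    "M": [("ATG", 100.0)],
    "A": [("GCT", 40.0), ("GCC", 45.0), ("GCA", 10.0), ("GCG", 5.0)],
    "K": [("AAG", 80.0), ("AAA", 20.0)],
}


def toy_reverse_translate(protein, rng):
    """Pick one codon per residue, weighted by its frequency."""
    chosen = []
    for residue in protein:
        codons, weights = zip(*TOY_TABLE[residue])
        chosen.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(chosen)


rng = random.Random()  # unseeded, so every notebook run differs
protein = "MAKAKAAKKA" * 3
variants = {toy_reverse_translate(protein, rng) for _ in range(20)}
print(f"{len(variants)} distinct sequences out of 20 runs")
```

Passing a seeded generator, e.g. `random.Random(0)`, would make a run reproducible when a particular sequence needs to be regenerated.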


Check for forbidden sequences:



In [None]:
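# Bio.SeqUtils.GC is no longer available in newer Biopython releases;
# gc_fraction replaces it and returns a fraction between 0 and 1
from Bio.SeqUtils import gc_fraction
100 * gc_fraction(dna_record.seq)  # global GC content in percent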
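
The global value does not cover the local requirement (30-80% GC within a window of e.g. 15 bp). A sliding-window check could look roughly like this. It is a plain-Python sketch under stated assumptions: `gc_percent` and `gc_violations` are hypothetical names, and the thresholds are the ones given in the introduction.

```python
def gc_percent(dna):
    """GC content of `dna` as a percentage (0 for an empty string)."""
    dna = dna.upper()
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna) if dna else 0.0


def gc_violations(dna, window=15, low=30.0, high=80.0):
    """Return (start, gc) for every window whose GC content is outside [low, high].

    Sequences shorter than `window` produce no windows and hence no violations.
    """
    violations = []
    for start in range(max(len(dna) - window + 1, 0)):
        gc = gc_percent(dna[start:start + window])
        if not low <= gc <= high:
            violations.append((start, gc))
    return violations


seq = "ATGGCCGAAATTTTAAATTTAAAGGCGCC"
print(f"global GC: {gc_percent(seq):.1f}%")
for start, gc in gc_violations(seq):
    print(f"window starting at {start}: {gc:.1f}% GC")
```

With a real record, the sequence can be passed as `str(dna_record.seq)`.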

Translate the sequence to check if it's the same as the input one:

In [None]:
dna_record_new = dna_record.translate() 
print(dna_record_new)

In [None]:
dna_record_new.seq == AA_seq