# Smart5UTR Sequence Optimization Tutorial

In this tutorial, we demonstrate how to use a pre-trained autoencoder model and the `optimize_sequences` function to iteratively optimize 5' UTR sequences. We first load the required modules, the autoencoder model, and the scaler. Then we extract a subset of our dataset to use as example sequences for optimization.

In [1]:
## Use this setting when TensorFlow depends on a protobuf version that is not compatible with your currently installed protobuf version
import os
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

In [2]:
import sys
sys.path.append("..")
from Smart5UTR.train import load_MTAE
from Smart5UTR.optimization import optimize_sequences
import pandas as pd

## Load the Pre-trained Autoencoder Model

We will load the pre-trained autoencoder model and the corresponding scaler using the `load_MTAE` function.

In [3]:
autoencoder, scaler = load_MTAE(
    model_path="../models/Smart5UTR/Smart5UTR_egfp_m1pseudo2_Model.h5",
    scaler_path="../models/egfp_m1pseudo2.scaler",
)

## Prepare Example Data

Next, we extract a subset of the dataset to use as example sequences for optimization. We keep the sequences whose measured mean ribosome load (MRL, column `rl`) lies within 0.5 of the desired initial value (`init_rl`), and then randomly select a specified number of them (`nbr_sequences`) for optimization.

In [4]:
init_rl = 5
nbr_sequences = 5
iterations = 20
coef = 1.25

# Get prototypical sequences
df = pd.read_csv('../data/GSM3130440_egfp_m1pseudo_2.csv')
df.sort_values('total', inplace=True, ascending=False)
df.reset_index(inplace=True, drop=True)
df = df.iloc[:20000]  # select data from the hold-out test dataset

df = df.loc[(df['rl'] < init_rl + 0.5) & (df['rl'] > init_rl - 0.5)]
df = df.loc[~df['utr'].str.contains('TAG|TGA|TAA|ATG')]    ## Optional: exclude sequences containing a stop codon (TAG, TGA, TAA) or an upstream ATG
df = df.sample(frac=1, random_state=123).reset_index(drop=True).iloc[:nbr_sequences]  # Randomly select sequences

p_seqs = df['utr'].to_list()

In [5]:
p_seqs

['GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGGTT',
 'AAGAGATCGCTCGTACCTCTGGCTGCTCGGCCCGCCCAAGGGGGAGGTTA',
 'CACAGCCTCCGACCAAATATCGGCCTCCCCGATACACGGTCCGAAACACC',
 'CCCTGGGAGGAGCACCCGTGGTTCCGAGCTTTCGGAGCCCGTTCTCGGCA',
 'TGTTCTGGCGTCCCCACGCCGTTTTCAATTGGGACCGGGCGGGGGGACTC']
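
The selection logic above can be tried out on a small table before running it on the full dataset. The following is a minimal sketch under stated assumptions: the helper name `select_prototypes`, its `window`, `exclude_motifs`, and `seed` parameters, and the toy sequences are all hypothetical and are not part of the Smart5UTR package. Only the column names `utr` and `rl` are taken from the tutorial.

```python
import pandas as pd

def select_prototypes(df, init_rl, nbr_sequences, window=0.5,
                      exclude_motifs=("ATG", "TAG", "TGA", "TAA"), seed=123):
    """Pick up to `nbr_sequences` UTRs whose MRL lies strictly within init_rl +/- window."""
    in_window = (df["rl"] > init_rl - window) & (df["rl"] < init_rl + window)
    subset = df.loc[in_window]
    if exclude_motifs:
        # Drop any sequence containing one of the motifs anywhere (not frame-aware)
        pattern = "|".join(exclude_motifs)
        subset = subset.loc[~subset["utr"].str.contains(pattern)]
    # Shuffle reproducibly, then keep the first nbr_sequences rows
    shuffled = subset.sample(frac=1, random_state=seed).reset_index(drop=True)
    return shuffled.iloc[:nbr_sequences]

# Toy data, invented for illustration (not taken from the real dataset)
toy = pd.DataFrame({
    "utr": ["CCCGGGAAACCC", "CCATGCCCGGGA", "GGGCCCTTTCCC", "ACCCTAACCGGG", "CCCCCCGGGGGG"],
    "rl": [5.2, 5.1, 4.7, 4.9, 6.3],
})
picked = select_prototypes(toy, init_rl=5, nbr_sequences=2)
print(picked)
```

If fewer sequences pass the filters than requested, the helper simply returns what is available, so it is worth checking `len(picked)` before optimizing.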

## Optimize 5' UTR Sequences

Next, we use the `optimize_sequences` function to iteratively optimize the example sequences with the pre-trained autoencoder model and the scaler. The function returns a dictionary containing the optimized sequences and their predicted MRL values.

In [6]:
optimized_seqs = optimize_sequences(autoencoder, scaler, p_seqs, iterations, coef)

%%% generating new UTR seqs by autoencoder ... %%% 
Processing optimization for seq No. 0
Processing optimization for seq No. 1
Processing optimization for seq No. 2
Processing optimization for seq No. 3
Processing optimization for seq No. 4


In [10]:
# Extract the relevant information
original_seqs = p_seqs
experimental_mrl = df['rl'].to_list()
optimized_seqs_list = [seq_info[-1][0].upper() for seq_info in optimized_seqs.values()]
final_predicted_mrl = [seq_info[-1][1] for seq_info in optimized_seqs.values()]
iterations_list = [iterations] * len(original_seqs)  # one entry per sequence actually selected

# Construct the DataFrame to show the result
results_df = pd.DataFrame({
    'Original Sequence': original_seqs,
    'Experimental MRL': experimental_mrl,
    'Optimized Sequence': optimized_seqs_list,
    'Iterations': iterations_list,
    'Optimized MRL': final_predicted_mrl
})

results_df

| | Original Sequence | Experimental MRL | Optimized Sequence | Iterations | Optimized MRL |
|---|---|---|---|---|---|
| 0 | GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTG... | 5.464849 | CCCAGCTAAACATACCGACGAACGCGAGCGCGGCAACAAAAGGCAC... | 20 | 7.157289 |
| 1 | AAGAGATCGCTCGTACCTCTGGCTGCTCGGCCCGCCCAAGGGGGAG... | 4.872592 | CGAAAATCGACATCACCCCAGGCCGCACTAAGCGAGCAACGGCAAA... | 20 | 7.091981 |
| 2 | CACAGCCTCCGACCAAATATCGGCCTCCCCGATACACGGTCCGAAA... | 5.385016 | CCGAACCGACGAACAAACATCGGGTAACGCAACAAACGGCAAATAA... | 20 | 7.293249 |
| 3 | CCCTGGGAGGAGCACCCGTGGTTCCGAGCTTTCGGAGCCCGTTCTC... | 4.764144 | CCGAGGAACGCATAACCATTTCCGGGAGAAAACGAAAACCGGCAAC... | 20 | 7.327154 |
| 4 | TGTTCTGGCGTCCCCACGCCGTTTTCAATTGGGACCGGGCGGGGGG... | 4.54163 | CCCACATCCACCACTAGGCAGCAACCAAACGCATTCGGGCGGGCAA... | 20 | 7.007205 |
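
To see how much each sequence changed, it can help to summarize the optimization history per sequence. The sketch below assumes that each value in the returned dictionary is a list of `(sequence, predicted_mrl)` pairs with the final result last, which is what the `seq_info[-1][0]` and `seq_info[-1][1]` indexing in the cell above implies. The helper `summarize_trajectories` and the mock data are hypothetical. Note that the gain compares a measured MRL with a model prediction, so it is only a rough indication.

```python
def summarize_trajectories(optimized, start_mrl):
    """Summarize the final sequence, final predicted MRL, and gain for each entry.

    Assumes `optimized` maps a key to a list of (sequence, predicted_mrl) pairs
    whose last element is the final result (an assumption, not a documented API).
    """
    rows = []
    for (key, history), start in zip(optimized.items(), start_mrl):
        final_seq, final_mrl = history[-1][0], history[-1][1]
        rows.append({
            "key": key,
            "final_seq": final_seq.upper(),
            "final_mrl": float(final_mrl),
            "gain": float(final_mrl) - start,
            "n_steps": len(history),
        })
    return rows

# Mock data shaped like the assumed return value (all values are invented)
mock = {
    0: [("acgt", 5.0), ("accg", 6.0), ("cccg", 7.0)],
    1: [("ttga", 4.5), ("tcga", 6.5)],
}
summary = summarize_trajectories(mock, start_mrl=[5.0, 4.5])
for row in summary:
    print(row)
```

With the real objects from this tutorial, the call would be `summarize_trajectories(optimized_seqs, df['rl'].to_list())`, provided the assumed structure holds.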


## Customization and Further Exploration
To customize the optimization process or explore different design requirements, try changing the parameters below, or modify the package's functions (`Smart5UTR/optimization.py`) if necessary:

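- `init_rl`: Initial mean ribosome load (MRL); sequences whose measured MRL lies within 0.5 of this value are candidates for selection

- `nbr_sequences`: Number of sequences to select or generate

- `iterations`: Number of optimization iterations

- `coef`: Optimization coefficient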
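
One way to explore these parameters is to run the optimization over a small grid of `iterations` and `coef` values and compare the results. The sketch below is hedged accordingly: `sweep` and `fake_optimize` are hypothetical names introduced only for illustration, and `fake_optimize` is a stand-in that returns a dummy score so the example runs without the model.

```python
from itertools import product

def sweep(optimize_fn, seqs, iterations_grid, coef_grid):
    """Call optimize_fn(seqs, iterations, coef) for every combination and collect the results."""
    results = {}
    for iterations, coef in product(iterations_grid, coef_grid):
        results[(iterations, coef)] = optimize_fn(seqs, iterations, coef)
    return results

# Stand-in optimizer (hypothetical): returns a dummy score per sequence index
def fake_optimize(seqs, iterations, coef):
    return {i: iterations * coef for i, _ in enumerate(seqs)}

grid = sweep(fake_optimize, ["ACGT", "TTTT"], iterations_grid=[10, 20], coef_grid=[1.0, 1.25])
print(sorted(grid))
```

To use the real optimizer, pass a wrapper that follows the call shown earlier in this tutorial, for example `lambda s, it, c: optimize_sequences(autoencoder, scaler, s, it, c)`. Each combination reruns the full optimization, so keep the grid small.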

You can also try novel sequences from outside the tutorial's dataset. The example below generates random 50-nt sequences; when no experimental MRL value is available, the autoencoder's encoder function can be used to predict the initial MRL value:

In [11]:
import random

def make_random_sequences(nbr_sequences, length, constant='', no_uaug=False, no_stop=False):
    """Generate random DNA sequences, optionally rejecting uATGs and/or stop codons.

    `constant` is appended to the 3' end and counts toward `length`.
    """
    nucleotides = 'ATCG'
    stop_codons = ('TAG', 'TGA', 'TAA')
    seqs = []
    while len(seqs) < nbr_sequences:
        # Draw the random part; filters apply to it before the constant region is appended
        new_seq = ''.join(random.choice(nucleotides) for _ in range(length - len(constant)))
        if no_uaug and 'ATG' in new_seq:
            continue  # reject sequences containing an upstream ATG
        if no_stop and any(codon in new_seq for codon in stop_codons):
            continue  # reject sequences containing a stop codon
        seqs.append(new_seq + constant)
    return seqs



In [12]:
seq_len = 50
rand_seqs = make_random_sequences(nbr_sequences, seq_len, no_uaug=True, no_stop=False)

In [13]:
rand_seqs

['CACTTGGTTACTATACACACGAGTAAGGTCAGCGGCATCAGACCGGCTGC',
 'GGTCACCTGGGCTGGAGCTGGTAGCAGCGAGAAACCTAAGCCAGGTCTGA',
 'CTTTTAGAGCCCTCCGCCGCATTCTTGCAGCAAGGGAGACAGGCTAGTGT',
 'TTCGCGAGCGCCCTTTTGTCGCGTGAGCCCTTCACCTTATACCCACCCTG',
 'CGCGGATCAATATTTAAAGCTACAAAGGACTCCTAGATAAGACATTACAC']
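
Before passing generated sequences to the model, it can be useful to confirm that they meet the requested constraints. The following is a minimal sketch; the helper `check_sequences` is hypothetical and not part of the Smart5UTR package. Like the generator above, it looks for motifs anywhere in the sequence and does not consider reading frame.

```python
STOP_CODONS = ("TAG", "TGA", "TAA")

def check_sequences(seqs, length, no_uaug=False, no_stop=False):
    """Return a list of (index, reason) pairs for sequences that violate the constraints."""
    problems = []
    for i, seq in enumerate(seqs):
        if len(seq) != length:
            problems.append((i, "length"))
        if set(seq) - set("ACGT"):
            problems.append((i, "alphabet"))
        if no_uaug and "ATG" in seq:
            problems.append((i, "uATG"))
        if no_stop and any(codon in seq for codon in STOP_CODONS):
            problems.append((i, "stop"))
    return problems

# Short made-up examples: one clean, one with an ATG, one with TAA, one too short
examples = ["ACGTACGT", "ACATGCGT", "ACGTAAGT", "ACGT"]
problems = check_sequences(examples, length=8, no_uaug=True, no_stop=True)
print(problems)
```

For the sequences generated in this tutorial, the corresponding call would be `check_sequences(rand_seqs, seq_len, no_uaug=True)`, and an empty list means every sequence passed.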