# Using the Pre-trained Model to Predict MRL Values for Sequences

This tutorial shows how to use the pre-trained model and the `predict_seq` function to predict mean ribosome load (MRL) values, either for a single sequence or for multiple sequences stored in a CSV file.

There are two strategies available for predicting MRL values using the `predict_seq` function: 'trim' and 'sliding_window'. 

1. 'trim': This strategy keeps only the first 50 nt of the input sequence and predicts the MRL value directly from that trimmed sequence.

2. 'sliding_window': This strategy slides a 50 nt window along the input sequence, one nucleotide at a time, and predicts an MRL value for each window. The final predicted MRL value is the average of all window predictions. We recommend this strategy for sequences much longer than 50 nt.

Choose the strategy that fits your requirements and the characteristics of your data. Sequences shorter than 50 nt are zero-padded on the right to a length of 50 nt.
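To make trimming and right-side zero padding concrete, here is a minimal sketch of a one-hot encoder for a fixed 50 nt input. The function `onehot_padded` is a hypothetical stand-in written for illustration; it is not Smart5UTR's `onehot_singleseq`, whose channel order and output shape may differ.

```python
import numpy as np

WINDOW = 50  # fixed input length in nt, as described above

def onehot_padded(seq, length=WINDOW, alphabet="ACGT"):
    """Illustrative one-hot encoder: trims to `length` and zero-pads on the right.

    Hypothetical stand-in; the real `onehot_singleseq` may use a different
    channel order or output shape.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    mtx = np.zeros((length, len(alphabet)), dtype=np.float32)
    # Only the first `length` bases are encoded; the remaining rows stay zero.
    for pos, base in enumerate(seq.upper()[:length]):
        mtx[pos, index[base]] = 1.0
    return mtx

short = onehot_padded("ACGT" * 5)    # 20 nt: rows 20..49 remain all-zero padding
long_ = onehot_padded("ACGT" * 18)   # 72 nt: only the first 50 nt are kept
print(short.shape, int(short.sum()), int(long_.sum()))
```

Each encoded position contributes exactly one `1`, so the sum of the matrix equals the number of nucleotides that were actually encoded.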


In [17]:
# Force the pure-Python protobuf implementation. Use this setting when TensorFlow requires a
# protobuf version that is incompatible with the installed one. Set it before importing TensorFlow.
import os
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'


In [18]:
import sys
sys.path.append("..")
from Smart5UTR.train import load_MTAE
from Smart5UTR.dataloader import onehot_singleseq

import pandas as pd
import numpy as np


In [19]:
def predict_seq(seq, model, scaler, strategy='trim'):
    """Predict the MRL value of a single sequence using the 'trim' or 'sliding_window' strategy."""
    window_size = 50  # input length used by the model (nt)

    valid_bases = set('atcgATCG')
    if not all(base in valid_bases for base in seq):
        raise ValueError("Invalid sequence. The sequence must contain only A, T, C, and G (case-insensitive).")

    if strategy == 'trim':
        # Keep only the first 50 nt and predict once
        seq_mtx = onehot_singleseq(seq[:window_size])
        pred_label = model.predict(seq_mtx)[1].reshape(-1)  # output index 1 is used as the MRL prediction
        pred_label = scaler.inverse_transform(pred_label)   # undo the label scaling
        return pred_label[0]
    elif strategy == 'sliding_window':
        predictions = []
        # Slide one nucleotide at a time; sequences shorter than 50 nt yield a single window
        n_windows = max(len(seq) - window_size + 1, 1)
        for i in range(n_windows):
            window_seq = seq[i : i + window_size]
            seq_mtx = onehot_singleseq(window_seq)
            pred_label = model.predict(seq_mtx)[1].reshape(-1)
            pred_label = scaler.inverse_transform(pred_label)
            predictions.append(pred_label)
        # Average the per-window predictions
        return np.mean(np.array(predictions), axis=0)[0]
    else:
        raise ValueError("Invalid strategy. Must be 'trim' or 'sliding_window'.")

In [20]:
# Load the trained Smart5UTR model and scaler
modelpath = "../models/Smart5UTR/Smart5UTR_egfp_m1pseudo2_Model.h5"
scalerpath = "../models/egfp_m1pseudo2.scaler"

model, scaler = load_MTAE(modelpath, scalerpath)


## Predicting MRL Value for a Single Sequence

Here is an example of how to use the `predict_seq` function to predict the MRL value for a single sequence.

In [22]:
seq = "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
strategy = 'sliding_window'  # or 'trim'

pred_label = predict_seq(seq, model, scaler, strategy)

print("predictive MRL = ", pred_label)

predictive MRL =  5.9894896
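The averaging step of the 'sliding_window' strategy can be illustrated without loading the model. The sketch below replaces the network with a hypothetical stand-in scorer, `toy_score` (the G/C fraction of a window), purely to show how windows are enumerated and averaged; the numbers it prints are not MRL values. The input is a 72 nt ACGT repeat similar to the example sequence above.

```python
import numpy as np

def toy_score(window):
    """Hypothetical stand-in for the model: fraction of G/C bases in the window."""
    return sum(base in "GC" for base in window.upper()) / len(window)

def sliding_window_mean(seq, score_fn, window_size=50):
    """Average `score_fn` over every window, stepping one nucleotide at a time."""
    # A sequence no longer than `window_size` yields exactly one window.
    n_windows = max(len(seq) - window_size + 1, 1)
    scores = [score_fn(seq[i:i + window_size]) for i in range(n_windows)]
    return float(np.mean(scores)), n_windows

seq = "ACGT" * 18  # 72 nt, so 72 - 50 + 1 = 23 windows
mean_score, n_windows = sliding_window_mean(seq, toy_score)
short_score, short_windows = sliding_window_mean("ACGT" * 5, toy_score)
print(n_windows, round(mean_score, 3), short_windows)
```

Because every window triggers one scoring call, the cost of this strategy grows linearly with sequence length, which is worth keeping in mind when each call is a full model prediction.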


## Batch Prediction for Sequences in a CSV File

To predict MRL values for sequences in a CSV file, you can use the `pandas` library to read the CSV file, pass each sequence to the `predict_seq` function, add the predicted MRL values to a new column, and save the results to a new CSV file. Here is an example:

In [23]:
# Load the example data
testdata = pd.read_csv('../data/example_testUTRs.csv')

testdata

| Unnamed: 0 | name | utr |
|---|---|---|
| 0 | 1-INS-50AX2 | GACTGAACAATTCAAACATTACAAACATTACTAACAAACCACTAAT... |
| 1 | 2-INS-50AX2 | GACACAAACTGAGAGACAAGAATTCAAGAGACGAACAAATAAAGAA... |
| 2 | 3-INS-50AX2 | GGATAATAACGGAAATAATAGAAGTGATAACTATTAAACTTAATAA... |
| 3 | 4-INS-50AX2 | GGCAAATAATAGAAATAATAATTATTAACACAATTAAACACAACGT... |
| 4 | 5-INS-A50X2 | AGGATTGCGGATATCATTATTATTGGAGAACCTTATTCGGGGGGGGCC |
| 5 | 6-INS-A50X2 | CCGGGAATTTGTTGTAGTGAGTTTTGTTTAAGAGTTTGATAGTAAG... |
| 6 | 7-INS-A50X2 | GCTGATTGTATTGCTCGTCTGTGATAACTCTAATAAACCTAGAAGG... |
| 7 | 8-INSI-A50X2 | TGTCGTGGATAAAGCTGACACCCAATTACTACGATTGTACAACGTA... |


In [26]:
# Use the prediction function to predict the MRL value for each sequence in the 'utr' column
testdata['predictive MRL'] = testdata['utr'].apply(lambda x: predict_seq(x, model, scaler, 'trim'))


In [27]:
testdata

| Unnamed: 0 | name | utr | predictive MRL |
|---|---|---|---|
| 0 | 1-INS-50AX2 | GACTGAACAATTCAAACATTACAAACATTACTAACAAACCACTAAT... | 7.692575 |
| 1 | 2-INS-50AX2 | GACACAAACTGAGAGACAAGAATTCAAGAGACGAACAAATAAAGAA... | 7.57422 |
| 2 | 3-INS-50AX2 | GGATAATAACGGAAATAATAGAAGTGATAACTATTAAACTTAATAA... | 7.536555 |
| 3 | 4-INS-50AX2 | GGCAAATAATAGAAATAATAATTATTAACACAATTAAACACAACGT... | 7.737344 |
| 4 | 5-INS-A50X2 | AGGATTGCGGATATCATTATTATTGGAGAACCTTATTCGGGGGGGGCC | 6.582365 |
| 5 | 6-INS-A50X2 | CCGGGAATTTGTTGTAGTGAGTTTTGTTTAAGAGTTTGATAGTAAG... | 6.918161 |
| 6 | 7-INS-A50X2 | GCTGATTGTATTGCTCGTCTGTGATAACTCTAATAAACCTAGAAGG... | 6.818514 |
| 7 | 8-INSI-A50X2 | TGTCGTGGATAAAGCTGACACCCAATTACTACGATTGTACAACGTA... | 5.411742 |


In [28]:
# Save the results to a new CSV file
testdata.to_csv("../data/example_testUTRs_prediction.csv", index=False)
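
Because `predict_seq` raises a `ValueError` for any character other than A, T, C, or G, a single problematic row (for example one containing N or the RNA base U, or a missing value) would abort the whole `apply` call. One possible approach, sketched below with a small made-up table, is to flag valid rows first and predict only on those. The helper `is_valid_utr` and the `valid` column are hypothetical names introduced for this illustration; only the `name` and `utr` column names mirror the example CSV.

```python
import pandas as pd

VALID_BASES = set("ACGT")

def is_valid_utr(seq):
    """True if `seq` is a non-empty string made only of A, T, C, G (case-insensitive)."""
    return isinstance(seq, str) and len(seq) > 0 and set(seq.upper()) <= VALID_BASES

# Small illustrative table (made-up rows, not taken from the example data).
demo = pd.DataFrame({
    "name": ["ok-1", "has-N", "rna-U", "missing"],
    "utr": ["GACTGAACAATT", "GACNNGAACA", "GACUGAACAA", None],
})

demo["valid"] = demo["utr"].apply(is_valid_utr)
clean = demo[demo["valid"]].copy()  # rows that are safe to pass to predict_seq
print(demo[["name", "valid"]])
```

Rows flagged as invalid can then be inspected or corrected by hand instead of silently disappearing from the output file.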
