# Packages

In [1]:
import pickle
import pandas as pd

# Importing Your Data for Evaluation

The nucleotide sequences need to be converted into DDS values. The notebook `DDS_featuring.ipynb`, located in the `Pre_processing` folder, contains the instructions for performing this conversion. It is essential to ensure that all sequences have the same number of nucleotides so they match the corresponding model.

In this tutorial, we will evaluate a set of sequences that have been grouped by their number of nucleotides, so that each group can be passed to the model built for that length.

The code below imports the nucleotide sequences into a dataframe (`df_fasta1`):

In [2]:
df_fasta1 = pd.read_csv('fasta_sequences/Test_1.fasta', sep='\t', header=None)
df_fasta1

Unnamed: 0,0
0,GTCAAAAGTTTGTT
1,GTCAAAAAAGCCGC
2,GACGCTATAGCGAC
3,GTCGGATGATTGAC
4,GCCTAAAAATTGAC
5,GTCATTAGATTGAC
6,GTCAATAAATTGAC
7,GTCACTAAATTGAC
8,GTCGGTTTTTTGAC
9,GTCATTTAACTGAC
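
Because each model expects sequences of a single fixed length, it can help to confirm that every imported sequence has the same number of nucleotides before going further. The sketch below runs this check on a small in-memory sample. `df_sample` and `same_length` are hypothetical names used only for illustration; in the notebook, the same check could be applied to `df_fasta1`.

```python
import pandas as pd

# Hypothetical sample mirroring the layout of df_fasta1 (one sequence per row, in column 0)
df_sample = pd.DataFrame({0: ["GTCAAAAGTTTGTT", "GTCAAAAAAGCCGC", "GACGCTATAGCGAC"]})


def same_length(df, column=0):
    """Return True if every sequence in the given column has the same length."""
    lengths = df[column].str.len()
    return lengths.nunique() == 1


print(same_length(df_sample))
print(df_sample[0].str.len().unique())
```

If the check fails, the sequences should be split by length first, so that each subset matches one model.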


# Importing the File Containing the Desired Features

In this case, we will use the DDS values. We can import the data using the following code:

In [3]:
df_test1 = pd.read_csv('fasta_sequences/dds_valueTest_1.fasta', sep='\t', header=None)
df_test1

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,-1.44,-1.28,-1.45,-1.0,-1.0,-1.0,-1.3,-1.44,-1.0,-1.0,-1.44,-1.44,-1.0,
1,-1.44,-1.28,-1.45,-1.0,-1.0,-1.0,-1.0,-1.0,-1.3,-2.27,-1.84,-2.24,-2.27,
2,-1.3,-1.45,-2.24,-2.27,-1.28,-0.58,-0.88,-0.58,-1.3,-2.27,-2.24,-1.3,-1.45,
3,-1.44,-1.28,-2.24,-1.84,-1.3,-0.88,-1.44,-1.3,-0.88,-1.0,-1.44,-1.3,-1.45,
4,-2.27,-1.84,-1.28,-0.58,-1.0,-1.0,-1.0,-1.0,-0.88,-1.0,-1.44,-1.3,-1.45,
5,-1.44,-1.28,-1.45,-0.88,-1.0,-0.58,-1.3,-1.3,-0.88,-1.0,-1.44,-1.3,-1.45,
6,-1.44,-1.28,-1.45,-1.0,-0.88,-0.58,-1.0,-1.0,-0.88,-1.0,-1.44,-1.3,-1.45,
7,-1.44,-1.28,-1.45,-1.45,-1.28,-0.58,-1.0,-1.0,-0.88,-1.0,-1.44,-1.3,-1.45,
8,-1.44,-1.28,-2.24,-1.84,-1.44,-1.0,-1.0,-1.0,-1.0,-1.0,-1.44,-1.3,-1.45,
9,-1.44,-1.28,-1.45,-0.88,-1.0,-1.0,-0.58,-1.0,-1.45,-1.28,-1.44,-1.3,-1.45,


This code reads the DDS values from the file `dds_valueTest_1.fasta`, which is located in the `fasta_sequences` directory. The `sep='\t'` argument specifies that the file is tab-separated, and `header=None` indicates that the file has no header row. Note that the last column (index 13) contains no values; the next cell removes it.

In [4]:
df_test1 = df_test1.drop(df_test1.columns[13], axis=1)  # Drop the empty trailing column
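
Dropping the column by position works for this file, but it assumes the empty column always sits at index 13. A possible alternative is to drop any column that contains only missing values, which does not depend on the sequence length. The sketch below uses a small hypothetical dataframe (`df_demo`) in place of `df_test1`. The empty column most likely comes from a trailing tab at the end of each line, although that is an assumption about the input file.

```python
import pandas as pd

nan = float("nan")

# Hypothetical stand-in for df_test1: two feature columns plus one empty trailing column
df_demo = pd.DataFrame({0: [-1.44, -1.30], 1: [-1.28, -1.45], 2: [nan, nan]})

# Remove every column whose values are all missing
df_clean = df_demo.dropna(axis=1, how="all")

print(df_clean.shape)
```

Columns that contain at least one value are kept, so real features with occasional gaps are not removed by this call.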

# Selecting the Appropriate Model

Next, we need to choose the model that matches the data. In this case, we will load the model for 13 base pairs. You can confirm which model to use by checking the shape of the feature dataframe.

In [5]:
print(df_test1.shape[1])

13


This code prints the number of columns in `df_test1`, which corresponds to the features available for the model. Ensure that this aligns with the expected input for the 13-base pair model.
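
In the sample data above, each 14-nucleotide sequence has 13 DDS values, which is consistent with one value per pair of adjacent nucleotides. That relationship is inferred from the tables rather than stated in this notebook, so treat it as an assumption. Under it, a quick consistency check between the sequence file and the feature file might look like the sketch below. The dataframes are hypothetical stand-ins for `df_fasta1` and `df_test1`, and the DDS values are placeholders.

```python
import pandas as pd

# Hypothetical stand-ins for df_fasta1 (sequences) and df_test1 (DDS features)
df_seq = pd.DataFrame({0: ["GTCAAAAGTTTGTT", "GTCAAAAAAGCCGC"]})
df_dds = pd.DataFrame([[-1.0] * 13, [-1.0] * 13])  # placeholder values

seq_len = int(df_seq[0].str.len().iloc[0])
n_features = df_dds.shape[1]

# Each sequence needs exactly one row of features
rows_match = len(df_seq) == len(df_dds)
# Assumption: one DDS value per adjacent nucleotide pair, so n_features == seq_len - 1
cols_match = n_features == seq_len - 1

print(rows_match, cols_match)
```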

Here, you need to change the file path to select the model that is suitable for your dataset. Use the following code to load the model:

In [6]:
with open('../Models/TFBS_classification/13_bps/rf13', 'rb') as f:
    rf13 = pickle.load(f)

Make sure to adjust the path `../Models/TFBS_classification/13_bps/rf13` to match the location of your model file. This code will load the Random Forest model designed for 13 base pairs, allowing you to use it for classification tasks.
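
To catch a mismatch between the data and the chosen model early, the loading step can be wrapped in a small helper that compares feature counts. The sketch below assumes the pickled object is a scikit-learn estimator: recent scikit-learn versions store the number of training features in `n_features_in_`, but older pickles may lack it, so the helper skips the check in that case. `load_model` is a hypothetical name, and the demonstration uses a stand-in object rather than a real Random Forest.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


def load_model(path, expected_features):
    """Load a pickled model and compare its feature count, if it reports one."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    # n_features_in_ may be missing on models pickled with older library versions
    n_in = getattr(model, "n_features_in_", None)
    if n_in is not None and n_in != expected_features:
        raise ValueError(
            f"Model expects {n_in} features, but the data has {expected_features}"
        )
    return model


# Demonstration with a stand-in object saved to a temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "rf_demo"
    with open(demo_path, "wb") as f:
        pickle.dump(SimpleNamespace(n_features_in_=13), f)
    model = load_model(demo_path, expected_features=13)

print(model.n_features_in_)
```

In the notebook, the equivalent call would be `load_model('../Models/TFBS_classification/13_bps/rf13', df_test1.shape[1])`.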

Here, we will make predictions using the loaded model:

In [7]:
y_pred13 = rf13.predict(df_test1)

This code uses the `predict` method of the Random Forest model (`rf13`) to classify the sequences in `df_test1`. The predictions will be stored in the variable `y_pred13`.

# Exporting the Prediction Results to a CSV File

Finally, we will export the prediction results to a CSV file:

In [8]:
# Finally, export the prediction results to a CSV file
results = {
    "Sequence": [],
    "Classification": []
}

for i in range(len(y_pred13)):
    results["Sequence"].append(df_fasta1[0][i])
    if y_pred13[i] == 0:
        results["Classification"].append("nonTFBS")
    else:
        results["Classification"].append("TFBS")

results = pd.DataFrame(results)
results.to_csv("tfbs_prediction.csv", index=False)

results

Unnamed: 0,Sequence,Classification
0,GTCAAAAGTTTGTT,TFBS
1,GTCAAAAAAGCCGC,nonTFBS
2,GACGCTATAGCGAC,nonTFBS
3,GTCGGATGATTGAC,nonTFBS
4,GCCTAAAAATTGAC,nonTFBS
5,GTCATTAGATTGAC,TFBS
6,GTCAATAAATTGAC,TFBS
7,GTCACTAAATTGAC,TFBS
8,GTCGGTTTTTTGAC,TFBS
9,GTCATTTAACTGAC,TFBS


In this code, we create a dictionary to store each sequence and its corresponding prediction (either "TFBS" or "nonTFBS"). We then convert this dictionary into a pandas DataFrame and export it as a CSV file named `tfbs_prediction.csv`, without including the index.
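
The same table can also be built without an explicit loop. The sketch below is one possible variant, not the notebook's own code: it uses hypothetical stand-ins (`sequences`, `predictions`) for `df_fasta1[0]` and `y_pred13`, and adds a guard so that a mismatch between the number of sequences and predictions is reported instead of failing partway through.

```python
import pandas as pd

# Hypothetical stand-ins for df_fasta1[0] and y_pred13
sequences = pd.Series(["GTCAAAAGTTTGTT", "GTCAAAAAAGCCGC", "GTCATTAGATTGAC"])
predictions = [1, 0, 1]

# Guard against misaligned inputs before pairing sequences with predictions
if len(sequences) != len(predictions):
    raise ValueError("Number of sequences and predictions differ")

results_demo = pd.DataFrame({
    "Sequence": sequences.tolist(),
    # Same rule as the loop: 0 becomes "nonTFBS", anything else becomes "TFBS"
    "Classification": ["nonTFBS" if p == 0 else "TFBS" for p in predictions],
})

print(results_demo)
```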

The exported file allows you to easily review and analyze the prediction results.
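
As a first step in that review, a count of each class gives a quick overview. The sketch below rebuilds the `Classification` column from the results table shown above under a hypothetical name (`results_demo`); in practice, the same call could be applied to `results` or to the file read back with `pd.read_csv("tfbs_prediction.csv")`.

```python
import pandas as pd

# Classifications copied from the results table in this tutorial
results_demo = pd.DataFrame({
    "Classification": [
        "TFBS", "nonTFBS", "nonTFBS", "nonTFBS", "nonTFBS",
        "TFBS", "TFBS", "TFBS", "TFBS", "TFBS",
    ]
})

# Number of sequences assigned to each class
counts = results_demo["Classification"].value_counts()

print(counts)
```

For the ten sequences in this tutorial, this gives 6 sequences classified as TFBS and 4 as nonTFBS.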