> In this notebook, you will find the analysis of mutant sequences that underwent machine learning models. This analysis aims to identify the optimal protein sequence for a specific function: achieving high fitness under light conditions and low fitness when deactivated.

> The mutant sequences were generated in the `sequence_generator.ipynb` notebook.

In [1]:
import pandas as pd
import numpy as np 

### 1. Loading the results

> All possible sequences of single mutants were inputted into the `onehot` model through the `train_and_predict.py` script, separately for both the 'light' and 'darkness' models.

In [31]:
# Loading single mutants with predicted fitness under the light model
single_mut_light_fitness = pd.read_csv('output_light_single_mutant.csv')

# Loading single mutants with predicted fitness under the darkness model
single_mut_darkness_fitness = pd.read_csv('output_darkness_single_mutant.csv')

### 2. Fitness analysis

> Let's look at a few statics on our predicted fitnesses.

In [32]:
single_mut_light_fitness['pred'].describe()

count    4275.000000
mean        2.191816
std         0.432109
min        -4.069784
25%         2.191816
50%         2.191816
75%         2.191816
max         4.876305
Name: pred, dtype: float64

In [33]:
single_mut_darkness_fitness['pred'].describe()

count    4275.000000
mean        2.926169
std         0.362640
min        -0.915769
25%         2.926169
50%         2.926169
75%         2.926169
max         7.282917
Name: pred, dtype: float64

### 3.  The Optimal Optoprotein Sequence

In [28]:
# Loading single mutants with predicted fitness under the light model
single_mut_light_fitness = pd.read_csv('data/Darkness/data.csv')

# Loading single mutants with predicted fitness under the darkness model
single_mut_darkness_fitness = pd.read_csv('data/Light/data.csv')

In [29]:
single_mut_darkness_fitness['pred'] = single_mut_darkness_fitness['log_fitness']
single_mut_light_fitness['pred'] = single_mut_light_fitness['log_fitness']

In [36]:
# Sort the light DataFrame to maximize fitness under light
sorted_light_df = single_mut_light_fitness.sort_values(by='pred', ascending=False)

# Sort the darkness DataFrame to minimize fitness under darkness
sorted_darkness_df = single_mut_darkness_fitness.sort_values(by='pred', ascending=True)

# Calculate the combined score (light fitness - darkness fitness)
sorted_light_df['combined_score'] = sorted_light_df['pred'] - sorted_darkness_df['pred']

sorted_combined_df = sorted_light_df.sort_values(by='combined_score', ascending=False)
optimal_sequence = sorted_combined_df.iloc[0]


print("Optimal Sequence:")
print(f"Mutated Position: {optimal_sequence['Mutated_Position']}")
print(f"Original AA: {optimal_sequence['Original_AA']}")
print(f"Mutated AA: {optimal_sequence['Mutated_AA']}")
print(f"Sequence: {optimal_sequence['seq']}")
print(f"Fitness under Light: {optimal_sequence['pred']} (Lightness model)")
print(f"Fitness under Darkness: {sorted_darkness_df.iloc[0]['pred']} (Darkness model)")
print(f"Combined Score: {optimal_sequence['combined_score']}")


Optimal Sequence:
Mutated Position: 117
Original AA: P
Mutated AA: L
Sequence: MLDMGQDRPIDGSGAPGADDTRVEVQPPAQWVLDLIEASPIASVVSDPRLADNPLIAINQAFTDLTGYSEEECVGRNCRFLAGSGTEPWLTDKIRQGVREHKPVLVEILNYKKDGTLFRNAVLVAPIYDDDDELLYFLGSQVEVDDDQPNMGMARRERAAEMLKTLSPRQLEVTTLVASGLRNKEVAARLGLSEKTVKMHRGLVMEKLNLKTSADLVRIAVEAGI
Fitness under Light: 3.722253897677456 (Lightness model)
Fitness under Darkness: -0.9157694121481356 (Darkness model)
Combined Score: 3.3330677580126764
