# Summary

This notebook uses the relaxed structures of the NOVO wild type provided by Vladimir Slaykovskiy ([here](https://www.kaggle.com/code/vslaykovsky/nesp-thermonet)) and computes their Rosetta energy scores. The scores use the default energy function, ref15 (named ref2015 in PyRosetta), since this is also the energy function used to relax the structures. On their own, the energy scores (with deletion mutations assigned a quantile of the other scores) score 0.471; after an equal-weight ensemble with the best notebook to date (518, [here](https://www.kaggle.com/code/vslaykovsky/nesp-thermonet-v2)), they achieve 0.579.

### New version update: now the notebook runs PyRosetta on Kaggle, both in interactive and commit mode!

# A short guide to Rosetta

Rosetta is a software suite that "includes algorithms for computational modeling and analysis of protein structures" ([see here](https://www.rosettacomm)). It can be downloaded for free [here](https://www.rosettacommons.org/software/license-and-download) after a short registration. It also has an excellent Python interface, PyRosetta, which is downloadable at the same link. There are excellent Colab-based tutorials for PyRosetta; see [here](https://rosettacommons.github.io/PyRosetta.notebooks/). I recommend going through chapters 1-3 to understand basic usage and energy scores, and then through chapters 4-6 if you want to design or relax structures yourself. The tutorials are short, clear, and easy to follow.

# My Rosetta code
Robert Hatch discovered that we could run PyRosetta on Kaggle in interactive mode by passing `cache_wheel_on_google_drive=False` to `pyrosettacolabsetup.install_pyrosetta`. Installing PyRosetta requires entering a username and password during the installation, so getting it to work in commit mode as well took (quite) a bit of tinkering.
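
A minimal sketch of how prompts can be answered automatically, assuming the installer reads both credentials through `input()`. It uses `unittest.mock` from the standard library; `run_with_canned_inputs` and `fake_installer` are hypothetical names for illustration only, and `fake_installer` merely stands in for the real installer call, which is not invoked here.

```python
import builtins
from unittest import mock


def run_with_canned_inputs(func, answers):
    """Call func() while input() returns the given answers, in order."""
    # side_effect with a list makes each input() call return the next item.
    with mock.patch.object(builtins, "input", side_effect=list(answers)):
        return func()


def fake_installer():
    # Hypothetical stand-in for an installer that prompts twice.
    user = input("Username: ")
    pwd = input("Password: ")
    return user, pwd


result = run_with_canned_inputs(fake_installer, ["my_user", "my_pass"])
print(result)
```

Compared with reassigning `builtins.input` directly, the patch is undone when the `with` block exits, so later cells get the real `input()` back.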

## Preparations
Get the file names of the relaxed structures and the corresponding mutations from [this dataset](https://www.kaggle.com/datasets/vslaykovsky/thermonet-wildtype-relaxed).

In [1]:
import os
import pandas as pd


relaxed_path = 'data/thermonet_wildtype_relaxed/wildtypeA'
# Keep only the PDB files; names look like wildtypeA_Q60D_relaxed.pdb
relaxed_pdbs = [file for file in os.listdir(relaxed_path) if file.endswith(".pdb")]
# The mutation code (e.g. Q60D) is the second underscore-separated field
mutation_to_pdb_df = pd.DataFrame({
    'mutation': [file.split('_')[1] for file in relaxed_pdbs],
    'path': relaxed_pdbs,
})
mutation_to_pdb_df.head()

|   | mutation | path |
|---|----------|------|
| 0 | Q60D | wildtypeA_Q60D_relaxed.pdb |
| 1 | T167E | wildtypeA_T167E_relaxed.pdb |
| 2 | K18F | wildtypeA_K18F_relaxed.pdb |
| 3 | S154H | wildtypeA_S154H_relaxed.pdb |
| 4 | E109Y | wildtypeA_E109Y_relaxed.pdb |
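
If you want the wild-type residue, position, and mutant residue separately rather than the combined code, a regular expression is one option. This is only a sketch: it assumes every file follows the `<prefix>_<WT><position><MUT>_relaxed.pdb` pattern seen above, and `parse_relaxed_name` is a hypothetical helper, not part of the original notebook.

```python
import re

# Assumed file-name pattern: <prefix>_<WT><position><MUT>_relaxed.pdb
PDB_NAME = re.compile(
    r"^(?P<prefix>[^_]+)_(?P<wt>[A-Z])(?P<pos>\d+)(?P<mut>[A-Z])_relaxed\.pdb$"
)


def parse_relaxed_name(file_name):
    """Return (wild-type residue, 1-based position, mutant residue) or None."""
    match = PDB_NAME.match(file_name)
    if match is None:
        return None  # the name does not follow the assumed pattern
    return match["wt"], int(match["pos"]), match["mut"]


parsed = parse_relaxed_name("wildtypeA_Q60D_relaxed.pdb")
print(parsed)  # ('Q', 60, 'D')
```

Returning `None` for unexpected names makes it easy to spot files that `split('_')` would have parsed silently.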


## Importing and installing PyRosetta
### Basic imports

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# !pip install pyrosettacolabsetup
import pyrosettacolabsetup
import builtins

### Installing PyRosetta
This step requires a username and password; you can get them for free [here](https://els2.comotion.uw.edu/product/pyrosetta). Enter yours as `username` and `password` in place of my secrets, or create your own secrets.

In [3]:
# !pip install --quiet kaggle

In [4]:
# from kaggle_secrets import UserSecretsClient

# global username
# username = "levinthal" #UserSecretsClient().get_secret("levinthal")
# global password
# password = "paradox"# UserSecretsClient().get_secret("paradox")

# '''
# Here I do some tinkering with builtins.input, so that input(), which usually waits
# for user input, automatically returns the username and then the password.
# '''
# global flag
# flag = 0
# def mock_input1(*args, **kwargs):
#     global flag
#     global username
#     global password
#     if flag == 0:
#         flag = 1
#         return username
#     return password
# builtins.input = mock_input1

# pyrosettacolabsetup.install_pyrosetta(prefix='/home/xux/Desktop/Enzyme_stab/Enzyme_pred/pyrosetta', cache_wheel_on_google_drive=False)

In [5]:
# ! pip install './pyrosetta/wheels/pyrosetta-2022.49+release.201d7639f91-cp37-cp37m-linux_x86_64.whl'

## Calculating Rosetta scores
This cell takes some time. To skip it, comment it out and use the next cell instead, which loads the already calculated scores.

In [None]:
# %%time
# '''
import pyrosetta
pyrosetta.init()
from pyrosetta.teaching import *

scores = []
sfxn = get_score_function(True)  # default full-atom score function
for i, file_name in enumerate(mutation_to_pdb_df['path']):
    if i % 100 == 0:
        print(i)  # progress indicator
    pose = pyrosetta.pose_from_pdb(os.path.join(relaxed_path, file_name))
    scores.append(sfxn(pose))

mutation_to_pdb_df['scores'] = scores
scores = mutation_to_pdb_df
scores.head()
# '''

Run this cell instead if you commented out the previous cell to save time; it loads the precomputed scores.

In [None]:
scores = pd.read_csv('data/rosetta/NOVO_wt_relaxed_scores.csv')
scores.head()


# The rest
From here, it's just regular ranking and submission.

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import rankdata

In [None]:
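def find_mut(row):
    """Label a row with its mutation: WT residue, 1-based position, MUT residue, and key."""
    mut = row.mutant_seq
    seq = row.sequence
    same = True
    # Scan for the first position where wild type and mutant differ
    for i,(x,y) in enumerate(zip(seq,mut)):
        if x!=y:
            same = False
            break
    row['WT'] = seq[i]
    row['position'] = i+1
    if not same:
        if len(mut) < len(seq):
            row['MUT'] = 'X'  # deletion: no replacement residue
        else:
            row['MUT'] = mut[i]
    else:
        # No difference found in the compared region
        row['position'] = -1
        row['MUT'] = 'X'
    row['mutation_key'] = row['WT']+str(row['position']) + row['MUT']
    return row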
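
To see what keys to expect, here is a small standalone sketch of the same first-difference logic on toy sequences. `mutation_key` is a hypothetical helper written for this illustration, and unlike `find_mut` it returns `None` when no difference is found instead of building a key with position -1.

```python
def mutation_key(wild_type, mutant):
    """Return a key such as 'T3S' for the first differing residue (1-based)."""
    for index, (wt_res, mut_res) in enumerate(zip(wild_type, mutant)):
        if wt_res != mut_res:
            # A shorter mutant is treated as a deletion and labelled 'X'.
            new_res = "X" if len(mutant) < len(wild_type) else mut_res
            return f"{wt_res}{index + 1}{new_res}"
    return None  # no difference found in the overlapping region


toy_wt = "MKTAY"  # made-up sequence, not the NOVO wild type
sub_key = mutation_key(toy_wt, "MKSAY")  # substitution T -> S at position 3
del_key = mutation_key(toy_wt, "MKAY")   # deletion of T at position 3
same_key = mutation_key(toy_wt, toy_wt)  # identical sequences
print(sub_key, del_key, same_key)
```

Deletion keys end in `X`, so they never match a relaxed-structure file name and end up without a Rosetta score after the join.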

In [None]:
novo_test = pd.read_csv("data/test.csv")
novo_test = novo_test.rename({'protein_sequence': 'mutant_seq', 'seq_id': 'source_df_id'}, axis = 1)
novo_test['sequence'] = 'VPVNPEPDATSVENVALKTGSGDSQSDPIKADLEVKGQSALPFDVDCWAILCKGAPNVLQRVNEKTKNSNRDRSGANKGPFKDPQKWGIKALPPKNPSWSAQDFKSPEEYAFASSLQGGTNAILAPVNLASQNSQGGVLNGFYSANKVAQFDPSKPQQTKGTWFQITKFTGAAGPYCKALGSNDKSVCDKNKNIAGDWGFDPAKWAYQYDEKNNKFNYVGK'
novo_test = novo_test.apply(find_mut,axis=1)
novo_test = novo_test.join(scores.set_index('mutation'), on='mutation_key')
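# Negate the energies so that lower (more favorable) energy gets a higher rank
novo_test['scores'] = -novo_test['scores']
# Rows without a score (e.g. deletions) get the 25% quantile of the available scores;
# Series.quantile skips NaN and avoids taking quantiles of the non-numeric columns
novo_test.loc[novo_test['scores'].isna(), 'scores'] = novo_test['scores'].quantile(q=0.25)
novo_test['scores_rank'] = rankdata(novo_test['scores'])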
novo_test.head()
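
A toy run of the negate, fill, and rank steps may make the cell above easier to follow. The numbers are invented for illustration, and the sketch uses `Series.rank()` in place of `scipy.stats.rankdata`; both default to average ranks for ties.

```python
import pandas as pd

# Invented energies; None marks a mutation with no relaxed structure.
toy = pd.DataFrame({
    "mutation_key": ["A1G", "A1X", "K2E", "K2R", "T3S"],
    "scores": [-400.0, None, -410.0, -395.0, -405.0],
})

# Lower energy should mean a higher rank, so flip the sign first.
toy["scores"] = -toy["scores"]

# Fill missing scores with the 25% quantile of the available ones.
fill_value = toy["scores"].quantile(q=0.25)
toy["scores"] = toy["scores"].fillna(fill_value)

# Rank 1 is the lowest negated score, i.e. the least favorable energy.
toy["scores_rank"] = toy["scores"].rank(method="average")
print(toy)
```

Here the missing row receives 398.75 and lands at rank 2 of 5, which mirrors the intent of placing unscored mutations near the bottom of the ranking.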

In [None]:
submission_rosetta_scores = novo_test[['source_df_id','scores_rank']]
submission_rosetta_scores = submission_rosetta_scores.rename({'source_df_id': 'seq_id', 'scores_rank': 'tm'}, axis = 1)
submission_rosetta_scores.to_csv('submission_rosetta_scores.csv', index=False)
submission_rosetta_scores.head()

In [None]:
thermonet2_ensemble_same_weight = pd.read_csv('data/thermonet/ensemble_submission.csv')
thermonet2_ensemble_same_weight['tm'] = rankdata(thermonet2_ensemble_same_weight['tm'])
thermonet2_ensemble_same_weight['tm'] = thermonet2_ensemble_same_weight['tm'] + submission_rosetta_scores['tm']
thermonet2_ensemble_same_weight.to_csv('submissions/pyrosetta_thermonet_ensemble.csv', index=False)
thermonet2_ensemble_same_weight.head()
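
The cell above adds the two `tm` columns by row position, which assumes both files list `seq_id` in the same order. A slightly more defensive sketch aligns on `seq_id` first; `rank_ensemble` is a hypothetical helper and the two small frames are invented data.

```python
import pandas as pd


def rank_ensemble(sub_a, sub_b, key="seq_id", target="tm"):
    """Equal-weight rank ensemble of two submissions, aligned on `key`."""
    # validate raises if a key appears more than once in either frame.
    merged = sub_a.merge(sub_b, on=key, suffixes=("_a", "_b"),
                         validate="one_to_one")
    # Rank each prediction column separately, then sum the ranks.
    merged[target] = merged[f"{target}_a"].rank() + merged[f"{target}_b"].rank()
    return merged[[key, target]]


# Invented submissions; note that b lists the ids in a different order.
a = pd.DataFrame({"seq_id": [1, 2, 3], "tm": [0.2, 0.9, 0.5]})
b = pd.DataFrame({"seq_id": [3, 1, 2], "tm": [10.0, 30.0, 20.0]})
combined = rank_ensemble(a, b)
print(combined)
```

Since the competition metric only depends on rank order, summing ranks gives both models equal weight regardless of the scale of their raw predictions.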