# <u><center>MLe-KCNQ2 VARIANT PREDICTION</center></u>

The final MLe-KCNQ2 model was used to predict the pathogenicity of 293 variants annotated in ClinVar as (1) of uncertain significance or (2) with conflicting interpretations, none of which were part of the initial dataset. Because 3 of these variants appeared in both subclasses, each of them was counted only once. As a result, **290 variants were predicted**.
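The de-duplication step described above can be illustrated with a minimal sketch. The variant names (`V1` to `V4`), the `subclass` labels and the two toy tables are hypothetical placeholders, not the real ClinVar entries; the sketch only shows how a variant listed in both subclasses ends up being counted once.

```python
import pandas as pd

# Hypothetical toy lists; the real ClinVar subclasses hold 293 entries in total.
uncertain = pd.DataFrame({"Variant": ["V1", "V2", "V3"],
                          "subclass": "uncertain_significance"})
conflicting = pd.DataFrame({"Variant": ["V3", "V4"],
                            "subclass": "conflicting_interpretation"})

# Stack both subclasses into a single table.
combined = pd.concat([uncertain, conflicting], ignore_index=True)

# Keep the first occurrence of each variant so that shared variants count once.
unique_variants = (combined
                   .drop_duplicates(subset="Variant", keep="first")
                   .reset_index(drop=True))

n_total = len(combined)          # entries before de-duplication
n_unique = len(unique_variants)  # entries after de-duplication
print(n_total, n_unique)
```

In the notebook's case the same logic would turn 293 entries into 293 - 3 = 290 unique variants.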

In [1]:
import utils
import numpy as np
import pandas as pd

First, these 290 variants (the "challenge data") had to be characterised in the same way as the collected KCNQ2 variants. Here, the characterised data are already provided:

In [2]:
# Challenge set = uncertain significance + conflicting interpretation variants
X_ch_full = pd.read_csv("X_challenge.csv", header = "infer")
X_ch_full

Unnamed: 0,Variant,initial_aa,final_aa,topological_domain,functional_domain,d_size,d_hf,d_vol,d_msa,d_charge,d_pol,d_aro,residue_conserv,secondary_str,pLDDT,str_landscape,MTR
0,K4N,K,N,Cytoplasmic,unknown_function,14.07,-0.4,32.6,8.2,pos_to_neu,p_to_p,na_to_na,0.74279,coil,38.76,GAP1,0.703
1,R6P,R,P,Cytoplasmic,unknown_function,59.07,-2.9,29.8,-16.1,pos_to_neu,p_to_np,na_to_na,0.74378,coil,42.67,GAP1,0.707
2,G8C,G,C,Cytoplasmic,unknown_function,-46.09,-2.9,-28.0,46.6,neu_to_neu,p_to_p,na_to_na,0.64081,coil,35.85,GAP1,0.715
3,G8S,G,S,Cytoplasmic,unknown_function,-30.02,0.4,-18.1,13.5,neu_to_neu,p_to_p,na_to_na,0.64081,coil,35.85,GAP1,0.715
4,Y11H,Y,H,Cytoplasmic,unknown_function,26.03,1.9,29.3,2.0,neu_to_pos,np_to_p,a_to_na,0.60965,coil,34.88,GAP1,0.643
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285,P852L,P,L,Cytoplasmic,unknown_function,-16.04,-5.4,-22.6,56.1,neu_to_neu,np_to_np,na_to_na,0.59805,coil,38.32,GAP4,0.797
286,S855L,S,L,Cytoplasmic,unknown_function,-26.08,-4.6,-45.5,30.4,neu_to_neu,p_to_np,na_to_na,0.51862,coil,33.75,GAP4,0.809
287,G858S,G,S,Cytoplasmic,unknown_function,-30.02,0.4,-18.1,13.5,neu_to_neu,p_to_p,na_to_na,0.63367,coil,34.02,GAP4,0.847
288,F862L,F,L,Cytoplasmic,unknown_function,34.02,-1.0,13.6,-4.6,neu_to_neu,np_to_np,a_to_na,0.62613,coil,34.23,GAP4,0.927


In [3]:
# X_ch will store the data of interest = 16 features
X_ch = X_ch_full[["initial_aa", "final_aa", "topological_domain",
       "functional_domain", "d_size", "d_hf", "d_vol", "d_msa", "d_charge",
       "d_pol", "d_aro", "residue_conserv", "secondary_str", "pLDDT",
       "str_landscape", "MTR"]]

# X_ch_names will store the variant names (their nomenclature)
X_ch_names = X_ch_full["Variant"].to_list()

In [4]:
# Convert into numpy array
X_ch = X_ch.to_numpy()
X_ch

array([['K', 'N', 'Cytoplasmic', ..., 38.76, 'GAP1', 0.703],
       ['R', 'P', 'Cytoplasmic', ..., 42.67, 'GAP1', 0.707],
       ['G', 'C', 'Cytoplasmic', ..., 35.85, 'GAP1', 0.715],
       ...,
       ['G', 'S', 'Cytoplasmic', ..., 34.02, 'GAP4', 0.847],
       ['F', 'L', 'Cytoplasmic', ..., 34.23, 'GAP4', 0.927],
       ['A', 'P', 'Cytoplasmic', ..., 38.58, 'GAP4', 0.87]], dtype=object)

The encoding scheme used for the training set is then applied to `X_ch`, so the `X_train` data must also be loaded:

In [5]:
# Load X_train and convert it into numpy array
X_train = pd.read_csv("X_train.csv",header="infer")
X_train = X_train.to_numpy()
X_train

array([['R', 'Q', 'hB', ..., 92.83, 'FRAG3', 0.474],
       ['F', 'L', 'S6', ..., 94.57, 'FRAG2', 0.157],
       ['G', 'E', 'Cytoplasmic', ..., 93.63, 'FRAG2', 0.12],
       ...,
       ['L', 'P', 'S6', ..., 95.78, 'FRAG2', 0.162],
       ['A', 'S', 'Cytoplasmic', ..., 35.95, 'GAP4', 0.857],
       ['P', 'L', 'Cytoplasmic', ..., 38.26, 'GAP3', 0.717]], dtype=object)

Then, we can apply the same encoding scheme:

In [6]:
# Apply to X_ch the same encoding scheme as used for X_train
X_ch_enc, X_ch_df = utils.challenge_encoding(X_train, X_ch)
X_ch_df

Unnamed: 0,d_size,d_hf,d_vol,d_msa,residue_conserv,plddt,MTR,A,C,D,...,coil,helix,membrane_helix,FRAG1,FRAG2,FRAG3,GAP1,GAP2,GAP3,GAP4
0,14.07,-0.4,32.6,8.2,0.74279,38.76,0.703,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,59.07,-2.9,29.8,-16.1,0.74378,42.67,0.707,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,-46.09,-2.9,-28.0,46.6,0.64081,35.85,0.715,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,-30.02,0.4,-18.1,13.5,0.64081,35.85,0.715,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,26.03,1.9,29.3,2.0,0.60965,34.88,0.643,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285,-16.04,-5.4,-22.6,56.1,0.59805,38.32,0.797,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
286,-26.08,-4.6,-45.5,30.4,0.51862,33.75,0.809,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
287,-30.02,0.4,-18.1,13.5,0.63367,34.02,0.847,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
288,34.02,-1.0,13.6,-4.6,0.62613,34.23,0.927,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
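The internals of `utils.challenge_encoding` are not shown here. As a rough illustration of the general idea (encoding new data with the categories seen in the training set), the sketch below uses `pd.get_dummies` followed by a column re-alignment. The miniature frames, the choice of features and the default `secondary_str_` column prefix are assumptions made for the example and do not reproduce the actual helper.

```python
import pandas as pd

# Hypothetical miniature frames with one numeric and one categorical feature.
train = pd.DataFrame({"d_size": [14.07, -46.09, 26.03],
                      "secondary_str": ["coil", "helix", "membrane_helix"]})
challenge = pd.DataFrame({"d_size": [59.07, -30.02],
                          "secondary_str": ["coil", "coil"]})

# The dummy columns are defined by the training data only.
train_enc = pd.get_dummies(train, columns=["secondary_str"], dtype=float)

# Encode the challenge data, then align it to the training columns:
# categories absent from the challenge set become all-zero columns, and
# categories never seen in training would be dropped.
challenge_enc = pd.get_dummies(challenge, columns=["secondary_str"], dtype=float)
challenge_enc = challenge_enc.reindex(columns=train_enc.columns, fill_value=0.0)

print(challenge_enc)
```

Aligning to the training columns keeps the number and order of features identical between both sets, which is what a model fitted on the training encoding expects.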


After that, we can predict `X_ch` using the final MLe-KCNQ2 configuration. To that end, we first load the preprocessed training and test data and then run the prediction:

In [7]:
# Load some needed data
X_train = np.loadtxt("X_train_preprocessed.csv", delimiter = ",")
X_train_df = pd.read_csv("X_train_df_preprocessed.csv", header = "infer")
X_test = np.loadtxt("X_test_preprocessed.csv", delimiter = ",")
y_train = np.loadtxt("y_train.csv", delimiter = ",")

# MLe-KCNQ2 prediction over X_ch
X_ch_pred, X_ch_proba = utils.MLeKCNQ2_prediction(X_train, X_train_df, y_train, X_test, X_ch_enc)

Finally, a DataFrame is created with the prediction results:

In [8]:
# Create a DataFrame (df) from the list of variant names built earlier
predictions =  pd.DataFrame(X_ch_names, columns = ["Variant name"])

# Add the prediction results as new columns
predictions["Predicted label"] = X_ch_pred # 0.0 = benign ; 1.0 = pathogenic
predictions["Probability of being benign"] = [pair[0] for pair in X_ch_proba] # ranges from 0 (min) to 1 (max)
predictions["Probability of being pathogenic"] = [pair[1] for pair in X_ch_proba] # ranges from 0 (min) to 1 (max)
predictions

Unnamed: 0,Variant name,Predicted label,Probability of being benign,Probability of being pathogenic
0,K4N,0.0,0.618902,0.381098
1,R6P,0.0,0.789676,0.210324
2,G8C,0.0,0.851654,0.148346
3,G8S,0.0,0.904191,0.095809
4,Y11H,0.0,0.826731,0.173269
...,...,...,...,...
285,P852L,0.0,0.835520,0.164480
286,S855L,0.0,0.871728,0.128272
287,G858S,0.0,0.878435,0.121565
288,F862L,1.0,0.409877,0.590123
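As a quick sanity check on a table like the one above, the sketch below re-enters six of the displayed rows by hand and verifies two things: that the two class probabilities sum to 1, and that the predicted label matches the class with the larger probability. The latter is an assumption about how the label is derived, although it is consistent with the rows shown. The short column names `p_benign` and `p_pathogenic` are illustrative only.

```python
import numpy as np
import pandas as pd

# Rows copied from the displayed output (only a subset of the 290 predictions).
shown = pd.DataFrame({
    "Variant name": ["K4N", "R6P", "G8C", "G8S", "Y11H", "F862L"],
    "Predicted label": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    "p_benign": [0.618902, 0.789676, 0.851654, 0.904191, 0.826731, 0.409877],
    "p_pathogenic": [0.381098, 0.210324, 0.148346, 0.095809, 0.173269, 0.590123],
})

# Assumption: label 1.0 is assigned when the pathogenic probability is the larger one.
label_from_proba = (shown["p_pathogenic"] > shown["p_benign"]).astype(float)
consistent = bool((label_from_proba == shown["Predicted label"]).all())

# The two class probabilities should sum to 1 within rounding error.
sums_to_one = bool(np.allclose(shown["p_benign"] + shown["p_pathogenic"], 1.0, atol=1e-6))

# Rank the variants from most to least likely pathogenic.
ranked = shown.sort_values("p_pathogenic", ascending=False)["Variant name"].tolist()

print(consistent, sums_to_one, ranked)
```

Sorting by the pathogenic probability rather than by the binary label keeps borderline cases such as K4N (0.38) visible next to clear-cut ones.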


For the processed predictions, see **Data S7**.