## Sliding windows and Classifiers
This is the final step of feature engineering. It should be run after feature extraction and after the one-hot-encoding procedure.<br>
The end of the notebook shows an example of some classifiers that try to predict single proteins.

### Some libraries

In [20]:
#Standard
import sys
import pandas as pd
import numpy as np
import time
import warnings
import matplotlib.pyplot as plt
import pickle
from joblib import dump, load
#Sliding windows
from scipy import signal
#Module
from modules.feature_extraction import *
from modules.feature_preprocessing import *
from modules.models import *
#Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
#Metrics
from sklearn import metrics
print('Done')

Done


### Importing DataFrame 'one-hot-enc'

In [48]:
#Import data
df = pd.read_csv(r'datasets\one-hot-enc.csv')
#Get data ready
LIP = df.LIP
df.drop(['LIP_SCORE', 'LIP'], inplace = True, axis = 1)
#Replace NaN in REL_ASA with the column mean
df.loc[df.REL_ASA.isna(), 'REL_ASA'] = df.REL_ASA.mean()
df.head()

Unnamed: 0,PDB_ID,CHAIN_ID,RES_ID,REL_ASA,PHI,PSI,NH_O_1_relidx,NH_O_1_energy,O_NH_1_relidx,O_NH_1_energy,...,MC_SC,NO_EDGE_LOC,SC_MC,SC_SC,HBOND,IAC,NO_EDGE_TYPE,PIPISTACK,VDW,CHAIN_LEN
0,1cee,A,1,1.0,360.0,97.6,0.0,0.0,2.0,-0.3,...,0,1,0,0,0,0,1,0,0,179
1,1cee,A,2,0.348485,-91.3,147.8,48.0,-0.1,50.0,-1.7,...,0,0,0,2,2,0,0,0,1,179
2,1cee,A,3,0.387324,-142.6,136.7,-2.0,-0.3,50.0,-0.2,...,0,1,0,0,0,0,1,0,0,179
3,1cee,A,4,0.005917,-94.3,154.4,48.0,-1.9,50.0,-1.6,...,0,0,0,3,2,0,0,0,3,179
4,1cee,A,5,0.346341,-112.5,70.5,-2.0,-0.2,71.0,-1.6,...,0,0,1,0,1,0,0,0,1,179


In [49]:
df.columns

Index(['PDB_ID', 'CHAIN_ID', 'RES_ID', 'REL_ASA', 'PHI', 'PSI',
       'NH_O_1_relidx', 'NH_O_1_energy', 'O_NH_1_relidx', 'O_NH_1_energy',
       'NH_O_2_relidx', 'H_O_2_energy', 'O_NH_2_relidx', 'O_NH_2_energy',
       'INTRA_CONTACTS', 'INTER_CONTACTS', 'ALA', 'ARG', 'ASN', 'ASP', 'CYS',
       'GDP', 'GLN', 'GLU', 'GLY', 'GTP', 'HIS', 'HYP', 'ILE', 'LEU', 'LYS',
       'MET', 'MSE', 'PHE', 'PRO', 'PTR', 'SEP', 'SER', 'THR', 'TPO', 'TRP',
       'TYR', 'VAL', '10_ELIX', 'ALPHA_ELIX', 'BEND', 'ISOLATED_BETA_BRIGE',
       'NO_STRUCT', 'PI_ELIX', 'STRAND', 'TURN', 'ZERO', 'LIG_MC', 'LIG_SC',
       'MC_MC', 'MC_SC', 'NO_EDGE_LOC', 'SC_MC', 'SC_SC', 'HBOND', 'IAC',
       'NO_EDGE_TYPE', 'PIPISTACK', 'VDW', 'CHAIN_LEN'],
      dtype='object')

### Apply sliding windows

Use an odd window size, otherwise some extra NaN values could be generated.

In [22]:
df_slided = sliding_window(data=df, k=5, sd=1)

  sliced = sliced.rolling(window = k, center = True).apply(lambda x: np.dot(x,window)/k)
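
To illustrate what a centered, weighted window does, here is a minimal NumPy sketch. It is not the code of `sliding_window` from `modules`: the names `gaussian_weights` and `centered_window_smooth` are hypothetical, and the Gaussian weighting is an assumption suggested by the `sd` parameter and the `scipy.signal` import. The division by `k` mirrors the `np.dot(x,window)/k` expression in the warning above. With an odd `k`, every residue has the same number of neighbours on each side, and only the first and last `k // 2` positions lack a full window and become NaN. For simplicity the sketch rejects even sizes.

```python
import numpy as np


def gaussian_weights(k, sd):
    """Gaussian weights centered on the middle of a window of size k (assumed shape)."""
    offsets = np.arange(k) - (k - 1) / 2
    return np.exp(-0.5 * (offsets / sd) ** 2)


def centered_window_smooth(values, k=5, sd=1.0):
    """Weighted centered window over a 1-D sequence; edges without a full window are NaN."""
    if k % 2 == 0:
        raise ValueError("this sketch only supports odd window sizes")
    values = np.asarray(values, dtype=float)
    weights = gaussian_weights(k, sd)
    half = k // 2
    out = np.full(values.shape, np.nan)
    for i in range(half, len(values) - half):
        # Weighted sum of the k values centered on position i, divided by k
        out[i] = np.dot(values[i - half:i + half + 1], weights) / k
    return out


smoothed = centered_window_smooth(np.ones(7), k=5, sd=1.0)
print(smoothed)
```

In a real dataset the window would normally be applied one protein chain at a time, so that values from the end of one chain do not leak into the start of the next.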


### Model - Leave-One-Out of Protein
Here we run a loop that, for every protein in the dataset, uses that protein as the test set and all the others as the training set. <br>
For each protein we print the accuracy, balanced accuracy, precision, recall, F1 score and confusion matrix.
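
The same leave-one-protein-out idea can be sketched with scikit-learn's `LeaveOneGroupOut`, using the protein identifier as the group label. This is only an illustration on synthetic data and not the implementation of `loo_cv`, which lives in `modules.models`; the helper name `leave_one_protein_out` and the toy identifiers `p1`, `p2`, `p3` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import LeaveOneGroupOut


def leave_one_protein_out(X, y, groups, make_clf):
    """Train on all proteins but one, test on the held-out protein, repeat for each protein."""
    scores = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf = make_clf()  # fresh classifier for every fold
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        protein = groups[test_idx][0]
        scores[protein] = (
            balanced_accuracy_score(y[test_idx], pred),
            f1_score(y[test_idx], pred, zero_division=0),
        )
    return scores


# Synthetic stand-in for the residue table: 3 "proteins" of 20 residues each
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)
groups_toy = np.repeat(np.array(["p1", "p2", "p3"]), 20)

scores = leave_one_protein_out(
    X_toy, y_toy, groups_toy,
    lambda: RandomForestClassifier(n_estimators=20, random_state=0),
)
for protein, (bal_acc, f1) in scores.items():
    print(protein, round(bal_acc, 3), round(f1, 3))
```

Grouping by protein keeps all residues of the held-out protein out of the training set, which a plain row-wise split would not guarantee.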

#### Build Classifier

In [11]:
clf = RandomForestClassifier(n_estimators=100)

#### Validation phase - LOOCV
This can take a very long time for some classifiers. Set `verbose = True` to see the results for each protein (advised when testing a model for the first time).

In [23]:
results = loo_cv(data = df_slided, clf=clf, 
                 target = LIP,
                 bad_condition=(0.75, 0.75),
                 ign_warnings = True, 
                 get_times = True, verbose = True)

Ieration number: 1
1cee
Accuracy: 0.971
Balanced Accuracy: 0.966
Precision: 0.9
Recall: 0.957
F1 score: 0.928
[[186   5]
 [  2  45]]
______________________________________________________
Ieration number: 2
1dev
Accuracy: 1.0
Balanced Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
[[194   0]
 [  0  41]]
______________________________________________________
Ieration number: 3
1dow
Accuracy: 1.0
Balanced Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
[[206   0]
 [  0  31]]
______________________________________________________
Ieration number: 4
1fqj
Accuracy: 0.972
Balanced Accuracy: 0.985
Precision: 0.714
Recall: 1.0
F1 score: 0.833
[[321  10]
 [  0  25]]
______________________________________________________
Ieration number: 5
1g3j
Accuracy: 1.0
Balanced Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
[[522   0]
 [  0  41]]
______________________________________________________
Ieration number: 6
1hrt
Accuracy: 0.677
Balanced Accuracy: 0.58
Precision: 0.672
Rec

Ieration number: 46
1tce
Accuracy: 0.983
Balanced Accuracy: 0.991
Precision: 0.75
Recall: 1.0
F1 score: 0.857
[[113   2]
 [  0   6]]
______________________________________________________
Ieration number: 47
1r1r
Accuracy: 0.997
Balanced Accuracy: 0.999
Precision: 0.875
Recall: 1.0
F1 score: 0.933
[[734   2]
 [  0  14]]
______________________________________________________
Ieration number: 48
1mxl
Accuracy: 0.991
Balanced Accuracy: 0.995
Precision: 0.923
Recall: 1.0
F1 score: 0.96
[[93  1]
 [ 0 12]]
______________________________________________________
Ieration number: 49
2fym
Accuracy: 1.0
Balanced Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
[[858   0]
 [  0  15]]
______________________________________________________
Ieration number: 50
1iwq
Accuracy: 1.0
Balanced Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
[[139   0]
 [  0  18]]
______________________________________________________
Ieration number: 51
1fv1
Accuracy: 0.95
Balanced Accuracy: 0.9
Precision: 0.9
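
The metrics printed for each protein can be recomputed from its confusion matrix. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`) and uses the matrix printed for 1cee above; the function name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(matrix):
    """Derive the printed metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    recall = tp / (tp + fn)        # fraction of true positives that were found
    specificity = tn / (tn + fp)   # fraction of true negatives that were found
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Confusion matrix printed for 1cee in the first iteration
cee = metrics_from_confusion([[186, 5], [2, 45]])
for name, value in cee.items():
    print(name, round(value, 3))
```

Balanced accuracy averages recall and specificity, so it is less affected than plain accuracy by the small share of positive residues in each protein.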

#### Get some results of the model

In [24]:
print('The avarage balanced accuracy is {}\n'.format(results[1]))
print('The avarage f1 score is {}\n'.format(results[3]))
print('The proteins which gave bad results are:\n{}\n'.format(results[4]))
print('Number of proteins which gave bad results: {}\n'.format(len(results[4])))

The avarage balanced accuracy is 0.9356388888888889

The avarage f1 score is 0.8970277777777778

The proteins which gave bad results are:
[('1hrt', 0.58, 0.788), ('1i7w', 1.0, 0.0), ('1kil', 0.409, 0.581), ('1rf8', 0.5, 0.0), ('1sc5', 0.747, 0.609), ('1sqq', 0.986, 0.719), ('1ymh', 0.645, 0.773), ('2a6q', 0.572, 0.935), ('1ozs', 0.876, 0.727), ('1j2x', 0.503, 0.299)]

Number of proteins which gave bad results: 10
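
The flagged list is consistent with a rule that marks a protein as bad when either its balanced accuracy or its F1 score falls below the matching threshold in `bad_condition=(0.75, 0.75)`. That reading is inferred from the output, not from the `loo_cv` source, and the tuple order (balanced accuracy, F1 score) is likewise an assumption. A minimal sketch of such a rule, with the hypothetical helper `flag_bad_proteins` and a few scores taken from the output above:

```python
def flag_bad_proteins(scores, bad_condition=(0.75, 0.75)):
    """Return (protein, balanced_accuracy, f1) for proteins below either threshold (assumed rule)."""
    min_bal_acc, min_f1 = bad_condition
    return [
        (protein, bal_acc, f1)
        for protein, (bal_acc, f1) in scores.items()
        if bal_acc < min_bal_acc or f1 < min_f1
    ]


# (balanced accuracy, F1 score) pairs copied from the run above
sample_scores = {
    "1cee": (0.966, 0.928),
    "1hrt": (0.58, 0.788),
    "1sqq": (0.986, 0.719),
}
bad = flag_bad_proteins(sample_scores)
print(bad)
```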



### Example of testing for 1cee

In [25]:
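#1cee occupies the first 238 rows: use them as test set and the rest as training set
#The first 3 columns (PDB_ID, CHAIN_ID, RES_ID) are identifiers, not features
X_train = np.array(df_slided.iloc[238:, 3:])
y_train = np.array(LIP[238:])
X_test = np.array(df_slided.iloc[:238, 3:])
y_test = np.array(LIP[:238])

clf.fit(X_train, y_train)
metrics.balanced_accuracy_score(y_test, clf.predict(X_test))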

0.9443578032750362
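
The hard-coded offset 238 only works while 1cee stays in the first rows of the table. A boolean mask on `PDB_ID` is less brittle. The sketch below shows the idea on a toy frame; the helper `split_by_protein` and the toy values are hypothetical, and it assumes, as in this notebook, that the first three columns are identifiers.

```python
import numpy as np
import pandas as pd


def split_by_protein(features, target, pdb_id, n_id_cols=3):
    """Hold out every row of one protein as test set; the remaining rows form the training set."""
    test_mask = (features["PDB_ID"] == pdb_id).to_numpy()
    X = features.iloc[:, n_id_cols:].to_numpy()  # skip the identifier columns
    y = np.asarray(target)
    return X[~test_mask], y[~test_mask], X[test_mask], y[test_mask]


# Toy stand-in for df_slided and LIP (values are made up)
toy = pd.DataFrame({
    "PDB_ID": ["1cee", "1cee", "1dev", "1dev", "1dev"],
    "CHAIN_ID": ["A", "A", "A", "A", "A"],
    "RES_ID": [1, 2, 1, 2, 3],
    "REL_ASA": [0.1, 0.2, 0.3, 0.4, 0.5],
})
toy_target = pd.Series([1, 0, 0, 1, 0])

toy_X_train, toy_y_train, toy_X_test, toy_y_test = split_by_protein(toy, toy_target, "1cee")
print(toy_X_train.shape, toy_X_test.shape)
```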

### Feature Selection

In [26]:
#Pair each importance with its feature name
feat_imp = []
if isinstance(clf, RandomForestClassifier):
    feat_imp = list(zip(clf.feature_importances_, df_slided.columns[3:]))

sorted(feat_imp, reverse=True)

[(0.5614222450479627, 'CHAIN_LEN'),
 (0.05320667734488691, 'O_NH_2_relidx'),
 (0.04563324046450787, 'NH_O_2_relidx'),
 (0.04125705603471625, 'REL_ASA'),
 (0.032581884403664704, 'NH_O_1_relidx'),
 (0.02313349083681923, 'O_NH_1_relidx'),
 (0.020632809765358306, 'NH_O_1_energy'),
 (0.019861685248961123, 'O_NH_1_energy'),
 (0.019240127547776136, 'PHI'),
 (0.01874258683914374, 'O_NH_2_energy'),
 (0.018146950512249845, 'PSI'),
 (0.016375540754529343, 'H_O_2_energy'),
 (0.014663629890030618, 'NO_STRUCT'),
 (0.006451414579456024, 'GLU'),
 (0.006236285605738747, 'LEU'),
 (0.006186174627804969, 'LYS'),
 (0.00607352923278314, 'GLY'),
 (0.0059812160371114165, 'ALPHA_ELIX'),
 (0.005717945351781138, 'STRAND'),
 (0.005641283738779465, 'SER'),
 (0.00527321610837323, 'ASP'),
 (0.0050775535740121246, 'BEND'),
 (0.004782090372172028, 'GLN'),
 (0.00467930172862432, 'TURN'),
 (0.004665456949157267, 'VAL'),
 (0.004629752498896194, 'ARG'),
 (0.004526373550760653, 'ALA'),
 (0.004324371388161699, 'ILE'),
 (0.0

### New dataset with fewer features

In [27]:
low_feat = list(df_slided.columns[47:64])
df_new = df_slided.copy()
df_new.drop(low_feat, axis = 1, inplace = True)
df_new.head()

Unnamed: 0,PDB_ID,CHAIN_ID,RES_ID,REL_ASA,PHI,PSI,NH_O_1_relidx,NH_O_1_energy,O_NH_1_relidx,O_NH_1_energy,...,THR,TPO,TRP,TYR,VAL,10_ELIX,ALPHA_ELIX,BEND,ISOLATED_BETA_BRIGE,CHAIN_LEN
0,1cee,A,1,0.309174,37.295511,61.732172,6.825167,-0.059349,15.237319,-0.341909,...,0.148373,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,88.917602
1,1cee,A,2,0.248632,1.699767,65.861247,10.602472,-0.115939,19.014625,-0.449374,...,0.148373,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,88.917602
2,1cee,A,3,0.156897,-44.335322,68.548685,11.191255,-0.308026,24.106508,-0.491738,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,88.917602
3,1cee,A,4,0.099614,-54.612304,62.93297,11.713213,-0.49208,27.384748,-0.665566,...,0.121306,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,88.917602
4,1cee,A,5,0.08047,-51.320814,53.228107,13.085949,-0.52402,29.632794,-0.959209,...,0.027067,0.0,0.0,0.0,0.027067,0.0,0.0,0.0,0.0,88.917602
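
The positional slice `columns[47:64]` depends on the column order of the one-hot-encoded file. As a hedged alternative, low-importance features can be selected by name from the importance scores. The helper `drop_low_importance`, the threshold and the toy values below are all hypothetical and only show the mechanics; they are not the selection used in this notebook.

```python
import pandas as pd


def drop_low_importance(features, importances, threshold, keep=()):
    """Drop the columns whose importance is below threshold, unless listed in keep."""
    low = [
        name for name, score in importances.items()
        if score < threshold and name not in keep and name in features.columns
    ]
    return features.drop(columns=low), low


# Toy frame and illustrative importance scores (not the notebook's real values)
toy = pd.DataFrame({
    "REL_ASA": [0.3, 0.1],
    "PI_ELIX": [0.0, 0.0],
    "VDW": [1.0, 3.0],
    "CHAIN_LEN": [179, 179],
})
toy_importances = {"REL_ASA": 0.041, "PI_ELIX": 0.0001, "VDW": 0.002, "CHAIN_LEN": 0.561}

reduced, dropped = drop_low_importance(toy, toy_importances, threshold=0.003)
print(dropped)
print(list(reduced.columns))
```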


### Re-test the model to check that eliminating some features did not compromise the accuracy

In [28]:
clf = RandomForestClassifier(n_estimators=100)

results = loo_cv(data = df_new, clf=clf, 
                 target = LIP,
                 bad_condition=(0.75, 0.75),
                 ign_warnings = True, 
                 get_times = True, verbose = True)

Ieration number: 1
1cee
Accuracy: 0.962
Balanced Accuracy: 0.944
Precision: 0.896
Recall: 0.915
F1 score: 0.905
[[186   5]
 [  4  43]]
______________________________________________________
Ieration number: 2
1dev
Accuracy: 1.0
Balanced Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
[[194   0]
 [  0  41]]
______________________________________________________
Ieration number: 3
1dow
Accuracy: 1.0
Balanced Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
[[206   0]
 [  0  31]]
______________________________________________________
Ieration number: 4
1fqj
Accuracy: 0.972
Balanced Accuracy: 0.985
Precision: 0.714
Recall: 1.0
F1 score: 0.833
[[321  10]
 [  0  25]]
______________________________________________________
Ieration number: 5
1g3j
Accuracy: 1.0
Balanced Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
[[522   0]
 [  0  41]]
______________________________________________________
Ieration number: 6
1hrt
Accuracy: 0.692
Balanced Accuracy: 0.583
Precision: 0.672


Ieration number: 46
1tce
Accuracy: 0.975
Balanced Accuracy: 0.987
Precision: 0.667
Recall: 1.0
F1 score: 0.8
[[112   3]
 [  0   6]]
______________________________________________________
Ieration number: 47
1r1r
Accuracy: 0.997
Balanced Accuracy: 0.999
Precision: 0.875
Recall: 1.0
F1 score: 0.933
[[734   2]
 [  0  14]]
______________________________________________________
Ieration number: 48
1mxl
Accuracy: 0.991
Balanced Accuracy: 0.995
Precision: 0.923
Recall: 1.0
F1 score: 0.96
[[93  1]
 [ 0 12]]
______________________________________________________
Ieration number: 49
2fym
Accuracy: 0.998
Balanced Accuracy: 0.933
Precision: 1.0
Recall: 0.867
F1 score: 0.929
[[858   0]
 [  2  13]]
______________________________________________________
Ieration number: 50
1iwq
Accuracy: 1.0
Balanced Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
[[139   0]
 [  0  18]]
______________________________________________________
Ieration number: 51
1fv1
Accuracy: 0.95
Balanced Accuracy: 0.9
Precisi

In [30]:
print('The avarage balanced accuracy is {}\n'.format(results[1]))
print('The avarage f1 score is {}\n'.format(results[3]))
print('The proteins which gave bad results are:\n{}\n'.format(results[4]))
print('Number of proteins which gave bad results: {}\n'.format(len(results[4])))

The avarage balanced accuracy is 0.9329444444444444

The avarage f1 score is 0.8942777777777778

The proteins which gave bad results are:
[('1hrt', 0.583, 0.804), ('1i7w', 1.0, 0.0), ('1kil', 0.44, 0.584), ('1rf8', 0.5, 0.0), ('1sc5', 0.733, 0.58), ('1sqq', 0.986, 0.719), ('1ymh', 0.487, 0.735), ('2a6q', 0.691, 0.963), ('1mv0', 0.769, 0.7), ('1ozs', 0.876, 0.727), ('1j2x', 0.566, 0.337)]

Number of proteins which gave bad results: 11



### Train the model with all the data available and save it

In [33]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(df_new.iloc[:, 3:], LIP)
dump(clf, 'random_forest.joblib')

['random_forest.joblib']
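
A saved model can be restored later with `load` from `joblib`. The round trip below is a small self-contained sketch on synthetic data, written to a temporary directory so that it does not touch `random_forest.joblib`; the file name `toy_forest.joblib` is hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem, only to have a fitted model to save
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_forest.joblib")
    dump(model, path)       # serialize the fitted model
    restored = load(path)   # read it back into memory

# The restored model should give the same predictions as the original one
same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

New data passed to a restored model needs the same preprocessing as the training data: the same sliding window and the same feature columns in the same order.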

In [36]:
#Sanity check on the first 10000 training rows (not an estimate of generalization)
metrics.balanced_accuracy_score(LIP[:10000], clf.predict(df_new.iloc[:10000, 3:]))

1.0

### Search for the best n_estimator and sliding window size

Here is a short cell to check new parameters for future re-training.

In [17]:
n_estimator = [80, 100, 120, 140] 
win_size = [3, 5, 7]
for i in n_estimator:
    for j in win_size:
        #Skip this combination, as in the original run
        if (i == 80) and (j == 3):
            continue
        
        #Rebuild the windowed dataset for the current window size
        df_slided = sliding_window(data=df, k=j, sd=1)
        low_feat = list(df_slided.columns[47:64])
        df_new = df_slided.drop(low_feat, axis = 1)
        
        clf = RandomForestClassifier(n_estimators=i)
        results = loo_cv(data = df_new, clf=clf, 
                         target = LIP,
                         bad_condition=(0.75, 0.75),
                         ign_warnings = True, 
                         get_times = True, verbose = False)
        
        print('-------------------------------------------------------------------------')
        print('-------------------------------------------------------------------------\n')
        print('MODEL: n_estimator = {}; sliding_size = {}\n'.format(i, j))
        print('The avarage balanced accuracy is {}\n'.format(results[1]))
        print('The avarage f1 score is {}\n'.format(results[3]))
        print('Number of proteins which gave bad results: {}\n'.format(len(results[4])))
        print('-------------------------------------------------------------------------')
        print('-------------------------------------------------------------------------\n')

Time taken: 231.5304365158081

-------------------------------------------------------------------------
-------------------------------------------------------------------------

MODEL: n_estimator = 80; sliding_size = 5

The avarage balanced accuracy is 0.937236111111111

The avarage f1 score is 0.8977777777777778

Number of proteins which gave bad results: 9

-------------------------------------------------------------------------
-------------------------------------------------------------------------



  sliced = sliced.rolling(window = k, center = True).apply(lambda x: np.dot(x,window)/k)


Time taken: 247.37936210632324

-------------------------------------------------------------------------
-------------------------------------------------------------------------

MODEL: n_estimator = 80; sliding_size = 7

The avarage balanced accuracy is 0.9313472222222221

The avarage f1 score is 0.8932777777777778

Number of proteins which gave bad results: 11

-------------------------------------------------------------------------
-------------------------------------------------------------------------



  sliced = sliced.rolling(window = k, center = True).apply(lambda x: np.dot(x,window)/k)


Time taken: 245.64192485809326

-------------------------------------------------------------------------
-------------------------------------------------------------------------

MODEL: n_estimator = 100; sliding_size = 3

The avarage balanced accuracy is 0.929361111111111

The avarage f1 score is 0.8935277777777778

Number of proteins which gave bad results: 9

-------------------------------------------------------------------------
-------------------------------------------------------------------------



  sliced = sliced.rolling(window = k, center = True).apply(lambda x: np.dot(x,window)/k)


Time taken: 288.61536836624146

-------------------------------------------------------------------------
-------------------------------------------------------------------------

MODEL: n_estimator = 100; sliding_size = 5

The avarage balanced accuracy is 0.9304444444444443

The avarage f1 score is 0.8937916666666665

Number of proteins which gave bad results: 9

-------------------------------------------------------------------------
-------------------------------------------------------------------------



  sliced = sliced.rolling(window = k, center = True).apply(lambda x: np.dot(x,window)/k)


Time taken: 310.2429156303406

-------------------------------------------------------------------------
-------------------------------------------------------------------------

MODEL: n_estimator = 100; sliding_size = 7

The avarage balanced accuracy is 0.9308055555555556

The avarage f1 score is 0.8935555555555555

Number of proteins which gave bad results: 10

-------------------------------------------------------------------------
-------------------------------------------------------------------------



  sliced = sliced.rolling(window = k, center = True).apply(lambda x: np.dot(x,window)/k)


Time taken: 275.9439899921417

-------------------------------------------------------------------------
-------------------------------------------------------------------------

MODEL: n_estimator = 120; sliding_size = 3

The avarage balanced accuracy is 0.9305416666666666

The avarage f1 score is 0.8939027777777778

Number of proteins which gave bad results: 10

-------------------------------------------------------------------------
-------------------------------------------------------------------------



  sliced = sliced.rolling(window = k, center = True).apply(lambda x: np.dot(x,window)/k)


Time taken: 317.3324348926544

-------------------------------------------------------------------------
-------------------------------------------------------------------------

MODEL: n_estimator = 120; sliding_size = 5

The avarage balanced accuracy is 0.9369583333333333

The avarage f1 score is 0.8952916666666666

Number of proteins which gave bad results: 10

-------------------------------------------------------------------------
-------------------------------------------------------------------------



  sliced = sliced.rolling(window = k, center = True).apply(lambda x: np.dot(x,window)/k)


Time taken: 360.44796442985535

-------------------------------------------------------------------------
-------------------------------------------------------------------------

MODEL: n_estimator = 120; sliding_size = 7

The avarage balanced accuracy is 0.9354722222222223

The avarage f1 score is 0.897

Number of proteins which gave bad results: 10

-------------------------------------------------------------------------
-------------------------------------------------------------------------



  sliced = sliced.rolling(window = k, center = True).apply(lambda x: np.dot(x,window)/k)


Time taken: 309.3839752674103

-------------------------------------------------------------------------
-------------------------------------------------------------------------

MODEL: n_estimator = 140; sliding_size = 3

The avarage balanced accuracy is 0.9309166666666668

The avarage f1 score is 0.8979722222222222

Number of proteins which gave bad results: 10

-------------------------------------------------------------------------
-------------------------------------------------------------------------



  sliced = sliced.rolling(window = k, center = True).apply(lambda x: np.dot(x,window)/k)


Time taken: 374.53809118270874

-------------------------------------------------------------------------
-------------------------------------------------------------------------

MODEL: n_estimator = 140; sliding_size = 5

The avarage balanced accuracy is 0.9363194444444444

The avarage f1 score is 0.8987500000000002

Number of proteins which gave bad results: 11

-------------------------------------------------------------------------
-------------------------------------------------------------------------



  sliced = sliced.rolling(window = k, center = True).apply(lambda x: np.dot(x,window)/k)


Time taken: 448.02454018592834

-------------------------------------------------------------------------
-------------------------------------------------------------------------

MODEL: n_estimator = 140; sliding_size = 7

The avarage balanced accuracy is 0.9363194444444444

The avarage f1 score is 0.8961111111111111

Number of proteins which gave bad results: 10

-------------------------------------------------------------------------
-------------------------------------------------------------------------
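
To compare the runs more easily, the printed figures can be collected in one place and ranked. The sketch below copies the values from the output above; the variable names are hypothetical.

```python
# (n_estimator, sliding_size, average balanced accuracy, average f1 score, bad proteins)
grid_results = [
    (80, 5, 0.937236111111111, 0.8977777777777778, 9),
    (80, 7, 0.9313472222222221, 0.8932777777777778, 11),
    (100, 3, 0.929361111111111, 0.8935277777777778, 9),
    (100, 5, 0.9304444444444443, 0.8937916666666665, 9),
    (100, 7, 0.9308055555555556, 0.8935555555555555, 10),
    (120, 3, 0.9305416666666666, 0.8939027777777778, 10),
    (120, 5, 0.9369583333333333, 0.8952916666666666, 10),
    (120, 7, 0.9354722222222223, 0.897, 10),
    (140, 3, 0.9309166666666668, 0.8979722222222222, 10),
    (140, 5, 0.9363194444444444, 0.8987500000000002, 11),
    (140, 7, 0.9363194444444444, 0.8961111111111111, 10),
]

# Rank by average balanced accuracy, highest first
ranked = sorted(grid_results, key=lambda row: row[2], reverse=True)
best = ranked[0]
spread = ranked[0][2] - ranked[-1][2]

print('Best: n_estimator = {}; sliding_size = {}'.format(best[0], best[1]))
print('Spread of balanced accuracy across the grid: {:.4f}'.format(spread))
```

The spread across the whole grid is below 0.01. Since the classifiers are built without a fixed `random_state`, differences of this size may be within run-to-run variation, so the ranking should be read with caution.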

