## Sliding windows with RandomForestClassifier
This is the final step of feature engineering. It should be run after feature extraction and after one-hot encoding.<br>
The end of the notebook has an example of a RandomForestClassifier that tries to predict the first protein, '1cee'.

### Some libraries

In [1]:
import pandas as pd
import numpy as np
import time
from scipy import signal
from modules.feature_extraction import *
import warnings

### Importing DataFrame 'one-hot-enc'

In [2]:
df = pd.read_csv(r'datasets\one-hot-enc.csv')
LIP = df.LIP
df.drop(['LIP_SCORE', 'LIP'], inplace = True, axis = 1)
df.head()

Unnamed: 0,PDB_ID,CHAIN_ID,RES_ID,REL_ASA,PHI,PSI,NH_O_1_relidx,NH_O_1_energy,O_NH_1_relidx,O_NH_1_energy,...,MC_SC,NO_EDGE_LOC,SC_MC,SC_SC,HBOND,IAC,NO_EDGE_TYPE,PIPISTACK,VDW,CHAIN_LEN
0,1cee,A,1,1.0,360.0,97.6,0.0,0.0,2.0,-0.3,...,0,1,0,0,0,0,1,0,0,179
1,1cee,A,2,0.348485,-91.3,147.8,48.0,-0.1,50.0,-1.7,...,0,0,0,2,2,0,0,0,1,179
2,1cee,A,3,0.387324,-142.6,136.7,-2.0,-0.3,50.0,-0.2,...,0,1,0,0,0,0,1,0,0,179
3,1cee,A,4,0.005917,-94.3,154.4,48.0,-1.9,50.0,-1.6,...,0,0,0,3,2,0,0,0,3,179
4,1cee,A,5,0.346341,-112.5,70.5,-2.0,-0.2,71.0,-1.6,...,0,0,1,0,1,0,0,0,1,179


### Apply sliding windows

Use an odd window size; an even size can generate extra NaN values.

In [8]:
df_slided = sliding_windows(data=df, window=7, std=1, get_time=True, ignore_warnings=True)
df_slided.head()

24.9086856842041


Unnamed: 0,PDB_ID,CHAIN_ID,RES_ID,REL_ASA,PHI,PSI,NH_O_1_relidx,NH_O_1_energy,O_NH_1_relidx,O_NH_1_energy,...,MC_SC,NO_EDGE_LOC,SC_MC,SC_SC,HBOND,IAC,NO_EDGE_TYPE,PIPISTACK,VDW,CHAIN_LEN
0,1cee,A,1,0.189104,29.538351,45.893053,8.393152,-0.188069,11.042499,-0.341764,...,0.0,0.181524,0.0,0.441171,0.352937,0.0,0.181524,0.0,0.352937,64.080718
1,1cee,A,2,0.171384,1.743089,47.714779,8.533527,-0.114541,13.773902,-0.350905,...,0.0,0.174881,0.001587,0.401716,0.364636,0.0,0.174881,0.0,0.260446,64.080718
2,1cee,A,3,0.112079,-31.954699,49.379458,8.146105,-0.22589,17.377634,-0.358541,...,0.0,0.162191,0.019334,0.439584,0.372271,0.0,0.162191,0.0,0.372271,64.080718
3,1cee,A,4,0.07274,-38.616482,45.319353,8.477671,-0.353073,19.677972,-0.480324,...,0.0,0.088234,0.086647,0.486572,0.451283,0.0,0.088234,0.0,0.553886,64.080718
4,1cee,A,5,0.058043,-36.939893,38.438409,9.499458,-0.379061,21.324981,-0.6888,...,0.0,0.019334,0.142857,0.352937,0.513541,0.0,0.019334,0.0,0.494207,64.080718


### NaN values
If we have n NaN values before applying sliding windows, after the procedure we will have:

    (number of NaN values) proportional to (n * window_size)

This is because, if a NaN is present anywhere in a window, the feature of the center residue is also set to NaN.
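
To illustrate the propagation, here is a minimal sketch of a centered Gaussian-weighted moving average in plain Python. It is not the implementation of `sliding_windows` from `modules.feature_extraction`, whose internals are not shown in this notebook; the helper `gaussian_smooth` and its edge handling (truncating the window and renormalising the weights) are assumptions made for illustration.

```python
import math

def gaussian_smooth(values, window=7, std=1.0):
    """Centered Gaussian-weighted moving average (hypothetical helper).

    A window containing at least one NaN yields NaN for its center element.
    At the sequence edges the window is truncated and the weights renormalised.
    """
    half = window // 2
    # Gaussian weights for offsets -half..half
    weights = [math.exp(-0.5 * (k / std) ** 2) for k in range(-half, half + 1)]
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        chunk = values[lo:hi]
        if any(math.isnan(v) for v in chunk):
            smoothed.append(float('nan'))
            continue
        # Keep only the weights that match the (possibly truncated) window
        w = weights[lo - i + half: hi - i + half]
        smoothed.append(sum(wi * vi for wi, vi in zip(w, chunk)) / sum(w))
    return smoothed

# One NaN in the middle of a 15-element series
series = [1.0] * 15
series[7] = float('nan')
result = gaussian_smooth(series, window=7)
nan_count = sum(math.isnan(v) for v in result)
print(nan_count)  # 7: every window that covers position 7
```

A single NaN therefore turns into up to `window` NaN values. In the run below, 46 NaN values become 279 with `window=7`, which is under the bound 46 × 7 = 322; overlapping windows around neighbouring NaNs would be one plausible explanation for the difference.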

In [12]:
NaN_before = len(df[df.REL_ASA.isna()].index)
NaN_after = len(df_slided[df_slided.REL_ASA.isna()].index)
print('Numbers of NaN in REL_ASA before sliding windows: {}'.format(NaN_before))
print('Numbers of NaN in REL_ASA after sliding windows: {}'.format(NaN_after))

Numbers of NaN in REL_ASA before sliding windows: 46
Numbers of NaN in REL_ASA after sliding windows: 279


## Classifier (written quickly; needs to be checked and adjusted!)
### Random Forest

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics



### Set NaN values to the mean

In [22]:
# Use .loc so the assignment acts on df_slided itself
# (chained indexing raises pandas' SettingWithCopyWarning)
df_slided.loc[df_slided.REL_ASA.isna(), 'REL_ASA'] = df_slided.REL_ASA.mean()


### Prepare dataset

In [38]:
X = np.array(df_slided.iloc[:, 3:])
y = LIP.values
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

array([0, 0, 0, ..., 1, 0, 0], dtype=int64)

In [42]:
# Hold out the first 238 rows (the first protein, 1cee) as the test set
X_train, X_test = X[238:, :], X[0:238, :]
y_train, y_test = y[238:], y[0:238]

In [43]:
# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [44]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
metrics.confusion_matrix(y_test, y_pred)

Accuracy: 0.9789915966386554
Precision: 0.9375
Recall: 0.9574468085106383


array([[188,   3],
       [  2,  45]], dtype=int64)
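
As a cross-check, the three scores can be recomputed by hand from the confusion matrix above. scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, so for binary labels the layout is `[[TN, FP], [FN, TP]]`. The variable names below are only for illustration.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
tn, fp = 188, 3
fn, tp = 2, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of residues classified correctly
precision = tp / (tp + fp)                  # predicted positives that are truly positive
recall = tp / (tp + fn)                     # true positives that were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.979 0.9375 0.9574
```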

### Leave-One-Out of Protein
Here we loop over every protein in the dataset, using that protein as the test set and all the others as the training set. <br>
For each protein we print the accuracy, precision, recall and confusion matrix.
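
The splitting scheme itself can be sketched independently of the classifier. The generator below is a plain-Python illustration (the name `leave_one_group_out` is hypothetical, and the PDB IDs are toy data); scikit-learn offers the same idea as `LeaveOneGroupOut` in `sklearn.model_selection`, which could replace the manual DataFrame handling.

```python
def leave_one_group_out(groups):
    """Yield (group, train_idx, test_idx), holding out one group at a time."""
    seen = []
    for g in groups:  # unique groups in order of first appearance
        if g not in seen:
            seen.append(g)
    for held_out in seen:
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx

# Toy example: one PDB ID per residue (row)
pdb_ids = ['1cee', '1cee', '1dev', '1dow', '1dow', '1dow']
splits = list(leave_one_group_out(pdb_ids))
for pdb_id, train_idx, test_idx in splits:
    print(pdb_id, train_idx, test_idx)
# 1cee [2, 3, 4, 5] [0, 1]
# 1dev [0, 1, 3, 4, 5] [2]
# 1dow [0, 1, 2] [3, 4, 5]
```

Splitting by protein rather than by random rows keeps residues of the held-out protein entirely out of the training set, which matters here because neighbouring residues share most of their sliding window.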

In [85]:
df_clf = df_slided.copy()
df_clf['LIP'] = LIP

for pdb_id in df_clf.PDB_ID.unique():
    # Hold out the current protein as the test set and train on all the others
    # (iloc[:, 3:] skips the first three columns)
    df_target = df_clf[df_clf.PDB_ID == pdb_id].iloc[:, 3:]
    df_train = df_clf.iloc[:, 3:].drop(df_target.index)

    # Separate the LIP label from the features
    y_train = df_train['LIP'].values
    X_train = df_train.drop('LIP', axis=1).values

    y_test = df_target['LIP'].values
    X_test = df_target.drop('LIP', axis=1).values

    # Fit a random forest on the training proteins and predict the held-out one
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    print(pdb_id)
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
    print('______________________________________________________')


[0 0 0 ... 1 0 0]
1cee
Accuracy: 0.9705882352941176
Precision: 0.9
Recall: 0.9574468085106383
[[186   5]
 [  2  45]]
______________________________________________________
[0 0 0 ... 1 0 0]
1dev
Accuracy: 0.9957446808510638
Precision: 1.0
Recall: 0.975609756097561
[[194   0]
 [  1  40]]
______________________________________________________
[0 0 0 ... 1 0 0]
1dow
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
[[206   0]
 [  0  31]]
______________________________________________________
[0 0 0 ... 1 0 0]
1fqj
Accuracy: 0.9691011235955056
Precision: 0.6944444444444444
Recall: 1.0
[[320  11]
 [  0  25]]
______________________________________________________
[0 0 0 ... 1 0 0]
1g3j
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
[[522   0]
 [  0  41]]
______________________________________________________
[0 0 0 ... 1 0 0]
1hrt
Accuracy: 0.6615384615384615
Precision: 0.6610169491525424
Recall: 0.9512195121951219
[[ 4 20]
 [ 2 39]]
______________________________________________________
[0 0 0 ... 1 0 0]




1j2j
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
[[ 3  0]
 [ 0 38]]
______________________________________________________
[0 0 0 ... 1 0 0]
1jsu
Accuracy: 0.9575856443719413
Precision: 1.0
Recall: 0.6231884057971014
[[544   0]
 [ 26  43]]
______________________________________________________
[0 0 0 ... 1 0 0]
1kil
Accuracy: 0.4420289855072464
Precision: 0.8571428571428571
Recall: 0.4426229508196721
[[ 7  9]
 [68 54]]
______________________________________________________
[0 0 0 ... 1 0 0]
1l8c
Accuracy: 0.9931506849315068
Precision: 0.9803921568627451
Recall: 1.0
[[95  1]
 [ 0 50]]
______________________________________________________
[0 0 0 ... 1 0 0]
1p4q
Accuracy: 0.9607843137254902
Precision: 0.975
Recall: 0.8863636363636364
[[108   1]
 [  5  39]]
______________________________________________________
[0 0 0 ... 1 0 0]
1pq1
Accuracy: 0.9666666666666667
Precision: 0.8064516129032258
Recall: 1.0
[[149   6]
 [  0  25]]
______________________________________________________
[0 0 0 ... 



1sc5
Accuracy: 0.91
Precision: 0.7096774193548387
Recall: 0.55
[[251   9]
 [ 18  22]]
______________________________________________________
[0 0 0 ... 1 0 0]
1sqq
Accuracy: 0.9730021598272138
Precision: 0.5614035087719298
Recall: 1.0
[[869  25]
 [  0  32]]
______________________________________________________
[0 0 0 ... 1 0 0]
1tba
Accuracy: 0.9757085020242915
Precision: 0.9767441860465116
Recall: 0.8936170212765957
[[199   1]
 [  5  42]]
______________________________________________________
[0 0 0 ... 1 0 0]
1th1
Accuracy: 0.9964912280701754
Precision: 1.0
Recall: 0.9642857142857143
[[514   0]
 [  2  54]]
______________________________________________________
[0 0 0 ... 1 0 0]
1xtg
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
[[427   0]
 [  0  56]]
______________________________________________________
[0 0 0 ... 1 0 0]
1ymh
Accuracy: 0.676923076923077
Precision: 0.7619047619047619
Recall: 0.7441860465116279
[[12 10]
 [11 32]]
______________________________________________________
[0 0

KeyboardInterrupt: 