# sbio_lip_predictor

Predicting LIP-tagged amino acid residues


## API

*./train.py*

*./predict.py*



## Feature pre-processing

*./modules/feature_preprocessing.py*

### 1. Handling Categorical Features

Categorical features, such as the residue name or the type of secondary structure, are handled with one-hot encoding (https://en.wikipedia.org/wiki/One-hot). This procedure creates one binary feature for each level of the categorical feature, assigning *1* to the feature that matches the instance's categorical value and *0* to all the others.

One-hot encoding has been applied to the following features: <br>
      
    1. Type of residues
    2. Type of secondary structure
    3. Type of contacts
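
As an illustration of the encoding described above, here is a minimal sketch using `pandas.get_dummies`. The column names and values (`residue`, `secondary_structure`) are hypothetical and are not taken from `feature_preprocessing.py`.

```python
import pandas as pd

# Toy table of residues; column names and values are illustrative only.
residues = pd.DataFrame({
    "residue": ["ALA", "GLY", "ALA"],
    "secondary_structure": ["H", "E", "H"],
})

# One new 0/1 column is created for every level of each categorical column.
encoded = pd.get_dummies(
    residues, columns=["residue", "secondary_structure"], dtype=int
)
print(encoded)
```

Each row ends up with exactly one *1* per original categorical column.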
    
### 2. Sliding Window

To capture information about each residue's context, a sliding window with Gaussian filtering has been applied to the instances (residues) using the pandas rolling procedure (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html). <br> Every value is replaced by a Gaussian-weighted mean over a window of size *k* centered on that value. In this way the values of the neighbouring residues inside the window are also taken into account: each one is given a multiplicative factor that is *1* for the central value and decreases towards 0 as the distance increases.

Parameters of the sliding window:

    1. window size (integer, default = 5): size of the window (usually an odd number, so that the window is symmetric around the central residue).
    2. std (float, default = 1): standard deviation of the Gaussian filter. Higher values make the multiplicative factor decrease more slowly as the distance increases.


N.B.
For the first and last values of every chain, the window is completed by mirroring the following/preceding residues. E.g.:
       
     res1, res2, res3 --> res3, res2, res1, res2, res3
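
The Gaussian-weighted window and the mirroring at the chain ends can be sketched with plain NumPy. This is only an approximation of the idea, not the project's code, which relies on the pandas rolling procedure. The function name `gaussian_window_mean` is hypothetical, and two assumptions are made: the window size is odd, and the weights are normalised so that they sum to 1.

```python
import numpy as np

def gaussian_window_mean(values, window=5, std=1.0):
    """Gaussian-weighted mean over a centered window, mirroring the chain ends."""
    if window % 2 == 0:
        raise ValueError("this sketch assumes an odd window size")
    values = np.asarray(values, dtype=float)
    half = window // 2
    # Mirror the neighbours at both ends: [r1, r2, r3] -> [r3, r2, r1, r2, r3, ...]
    padded = np.pad(values, half, mode="reflect")
    # Unnormalised Gaussian factors: 1 at the center, decaying with distance.
    offsets = np.arange(window) - half
    weights = np.exp(-0.5 * (offsets / std) ** 2)
    # Assumption: divide by the sum of the weights to obtain a weighted mean.
    weights /= weights.sum()
    return np.convolve(padded, weights, mode="valid")

smoothed = gaussian_window_mean([0, 0, 1, 0, 0], window=5, std=1.0)
print(smoothed)
```

The output has the same length as the input chain, so every residue keeps one value per feature.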
     
## Model

*./modules/models*

### 1. Leave One Out Cross Validation Protein Based (LOO-CV Protein Based)

To validate the model, a *LOO-CV* has been applied. In our case the "one left out" is not a single instance but a whole protein. <br>
For each protein *p*, the model is trained on all the available proteins except *p* and then tested on *p*.
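
The protein-based split can be sketched as a small generator. The function name `leave_one_protein_out` and the protein identifiers are hypothetical; the code in `./modules/models` may be organised differently. scikit-learn's `LeaveOneGroupOut` provides the same kind of split.

```python
import numpy as np

def leave_one_protein_out(protein_ids):
    """Yield (protein, train_indices, test_indices), one split per protein."""
    protein_ids = np.asarray(protein_ids)
    for protein in np.unique(protein_ids):
        test_mask = protein_ids == protein
        # All residues of the held-out protein form the test set.
        yield protein, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# One protein identifier per residue (toy data).
ids = ["p1", "p1", "p2", "p3", "p3", "p3"]
splits = list(leave_one_protein_out(ids))
for protein, train_idx, test_idx in splits:
    print(protein, train_idx, test_idx)
```

Splitting by protein keeps residues of the same chain from appearing in both the training and the test set.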

### 2. Feature Selection

A Random Forest Classifier has been used to select the best features. Since the contact types extracted from the RING server did not have a meaningful impact (less than *0.001%*), they were excluded from the input matrix.
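
A minimal sketch of importance-based selection follows, assuming scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The README does not name the library, so this choice is an assumption. The synthetic data, feature names and threshold are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first column carries information about the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)
feature_names = np.array(["informative", "noise_a", "noise_b"])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_  # non-negative, sums to 1

# Hypothetical cut-off: keep features whose importance reaches the threshold.
threshold = 0.05
kept = feature_names[importances >= threshold]
for name, value in zip(feature_names, importances):
    print(f"{name}: {value:.3f}")
```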

### 3. Models

Various algorithms have been tried with various parameters (k-Nearest Neighbours, Support Vector Classifier, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Multi-Layer Perceptron, Random Forest, Decision Tree, AdaBoost, Logistic Regression).

The best-performing algorithms were the *Random Forest Classifier* and the *Multi-Layer Perceptron* (MLP), with a balanced accuracy above *0.90* and an F1 score greater than *0.85*.

In the end the *Random Forest Classifier* was chosen because of its better performance and shorter training time. <br>
After a grid search, the best parameters appeared to be a *sliding window size between 3 and 7*, a *standard deviation of the Gaussian filter around 1*, a *number of estimators around 100* (80-120) and *no limit on tree depth or number of leaves*.

___FINAL MODEL___

    1. Sliding Window:
       a) Window Size = 5
       b) Standard Deviation = 1
    2. Classifier:
       a) Type = Random Forest Classifier
       b) Parameters = {n_estimator: 100}
  
___CURRENT VALIDATION RESULT___

For every protein, the balanced accuracy and the F1 score have been computed; the final results are the averages of these two scores over all proteins. The random seed was not set, so the exact values can vary between runs.

    1. Balanced Accuracy: 0.930 
    2. F1-Score: 0.894
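
The per-protein averaging can be sketched as follows. The helper names and toy labels are hypothetical and do not reproduce the scores above. One choice here is an assumption: when F1 is undefined (no positive residues, true or predicted), it is set to 0. The README does not say how such proteins are handled.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined (assumption)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == positive) & (y_pred == positive)))
    fp = int(np.sum((y_true != positive) & (y_pred == positive)))
    fn = int(np.sum((y_true == positive) & (y_pred != positive)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy (y_true, y_pred) pairs, one per held-out protein.
per_protein = {
    "p1": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "p2": ([1, 0, 0], [1, 1, 0]),
}
mean_bal_acc = np.mean([balanced_accuracy(t, p) for t, p in per_protein.values()])
mean_f1 = np.mean([f1_score(t, p) for t, p in per_protein.values()])
print(mean_bal_acc, mean_f1)
```

Averaging per protein gives every protein the same weight, regardless of its number of residues.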