# Model Definition
## Introduction

This notebook uses a simple dataset for epitope prediction, a task that arises in vaccine development. The data come from the Kaggle COVID-19/SARS B-cell Epitope Prediction dataset, which was cloned to a GitHub repository for this project. The notebook goes through the following steps:
1. Load the training set
2. Test models
3. Score models

## Setup

In [1]:
## Environment libraries
import os, types
import ibm_boto3
from botocore.client import Config
import warnings

## Data processing libraries
import numpy as np
import pandas as pd

## Plot libraries
import matplotlib.pyplot as plt

## Machine learning classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

## Performance metric libraries
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_validate
from sklearn import metrics

In [2]:
# Handle warnings
warnings.filterwarnings('ignore')  # one of "error", "ignore", "always", "default", "module" or "once"
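
A blanket `'ignore'` filter also hides messages that can matter when reading the scores later, such as scikit-learn's convergence or undefined-metric warnings. As a hedged alternative, the sketch below silences only one category and keeps the rest visible. The choice of `ConvergenceWarning` as the category to silence is an assumption for illustration, not something this notebook does.

```python
import warnings
from sklearn.exceptions import ConvergenceWarning

# Record warnings inside this block so we can inspect which ones get through
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')  # show everything by default
    # Silence only convergence warnings (assumed choice for illustration)
    warnings.filterwarnings('ignore', category=ConvergenceWarning)

    warnings.warn('solver did not converge', ConvergenceWarning)  # suppressed
    warnings.warn('some other problem', UserWarning)              # still reported

caught_categories = [w.category for w in caught]
print(caught_categories)
```

With this pattern, unrelated warnings remain visible while the noisy category is filtered out.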

## Load Training Data

In [3]:
# The code was removed by Watson Studio for sharing.

| | peptide_length | chou_fasman | emini | kolaskar_tongaonkar | parker | isoelectric_point | aromaticity | hydrophobicity | stability | kmeans_feature | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.476285 | 0.017313 | 0.431655 | 0.619814 | 0.24855 | 0.566651 | 0.564298 | 0.264627 | 0.0 | 1 |
| 1 | 0.0 | 0.233202 | 0.004408 | 0.865707 | 0.284809 | 0.295412 | 0.359258 | 0.597317 | 0.148556 | 0.333333 | 1 |
| 2 | 0.0 | 0.314229 | 0.084398 | 0.292566 | 0.733319 | 0.530951 | 0.503623 | 0.880225 | 0.170325 | 0.333333 | 1 |
| 3 | 0.0 | 0.865613 | 0.062751 | 0.235012 | 0.845722 | 0.064573 | 0.245679 | 0.447703 | 0.192377 | 0.333333 | 1 |
| 4 | 0.0 | 0.671937 | 0.046989 | 0.23741 | 0.753154 | 0.37224 | 0.569787 | 0.429961 | 0.123374 | 0.333333 | 1 |


## Model Testing and Results
Eight models were tested using 10-fold cross-validation:
1. Logistic Regression
2. Linear Support Vector Classifier
3. Decision Tree Classifier
4. Gaussian Naive Bayes
5. K-Neighbors Classifier
6. Quadratic Discriminant Analysis
7. Random Forest Classifier
8. MLP Neural Network Classifier

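The MLP classifier was selected based on its mean accuracy (0.729), the highest of the eight models. Its precision, recall and F1 score in the results table are all 0, however, so it correctly identified no positive samples in any fold.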
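
Accuracy on its own can be misleading when one class dominates, so a majority-class baseline is a useful reference point when models are compared by accuracy. The following is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the roughly 27% positive rate are assumptions chosen for illustration, not the notebook's dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)

# Synthetic, illustrative data: 1000 samples with about 27% positives (assumed ratio)
X_demo = rng.random((1000, 5))
y_demo = (rng.random(1000) < 0.27).astype(int)

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_pred)
baseline_recall = recall_score(y_demo, y_pred)
print(baseline_accuracy, baseline_recall)
```

A classifier whose accuracy is close to this baseline while its recall is near zero is probably predicting the majority class almost exclusively.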

In [4]:
# Create variable and target arrays
X = df.drop(['target'], axis = 1).to_numpy()
y = df['target'].to_numpy()

In [5]:
# Define dictionary with performance metrics
scoring = {'accuracy':make_scorer(accuracy_score), 
           'precision':make_scorer(precision_score),
           'recall':make_scorer(recall_score), 
           'f1_score':make_scorer(f1_score)}
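
For these four standard metrics, scikit-learn also accepts predefined scorer names as strings, which is a shorter way to write a similar request. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration). Note that the result keys then follow the string names, so the F1 key is `test_f1` rather than `test_f1_score`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Predefined scorer names instead of make_scorer(...) wrappers
string_scoring = ['accuracy', 'precision', 'recall', 'f1']

cv_results = cross_validate(LogisticRegression(max_iter=1000),
                            X_demo, y_demo, cv=5, scoring=string_scoring)

# Each metric yields one array with a score per fold
result_keys = sorted(k for k in cv_results if k.startswith('test_'))
print(result_keys)
```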

In [6]:
# Instantiate the machine learning classifiers
log_model = LogisticRegression(max_iter=10000)
svc_model = LinearSVC(dual=False)
dtr_model = DecisionTreeClassifier(max_depth=5)
rfc_model = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
gnb_model = GaussianNB()
knc_model = KNeighborsClassifier(3)
qda_model = QuadraticDiscriminantAnalysis()
mlp_model = MLPClassifier(alpha=1, max_iter=1000)
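
Several of these estimators (the decision tree, the random forest and the MLP) use random numbers during fitting, so scores can differ slightly between runs unless a seed is fixed. As a hedged sketch, assuming reproducibility is wanted, passing `random_state` makes repeated fits identical. The seed value 42 and the synthetic data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_and_score(seed):
    """Fit a small random forest with a fixed seed and return class probabilities."""
    model = RandomForestClassifier(max_depth=5, n_estimators=10,
                                   max_features=1, random_state=seed)
    return model.fit(X_demo, y_demo).predict_proba(X_demo)

# Two fits with the same seed should give identical outputs
same_seed_matches = np.array_equal(fit_and_score(42), fit_and_score(42))
print(same_seed_matches)
```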

In [9]:
# Define the models evaluation function
def models_evaluation(X, y, folds):
    
    '''
    X : data set features
    y : data set target
    folds : number of cross-validation folds
    
    '''
    
    # Perform cross-validation for each machine learning classifier
    log = cross_validate(log_model, X, y, cv=folds, scoring=scoring)
    svc = cross_validate(svc_model, X, y, cv=folds, scoring=scoring)
    dtr = cross_validate(dtr_model, X, y, cv=folds, scoring=scoring)
    rfc = cross_validate(rfc_model, X, y, cv=folds, scoring=scoring)
    gnb = cross_validate(gnb_model, X, y, cv=folds, scoring=scoring)
    knc = cross_validate(knc_model, X, y, cv=folds, scoring=scoring)   
    qda = cross_validate(qda_model, X, y, cv=folds, scoring=scoring)
    mlp = cross_validate(mlp_model, X, y, cv=folds, scoring=scoring)
    
    models_scores_table = pd.DataFrame({'Logistic Regression':[log['test_accuracy'].mean(),
                                                               log['test_precision'].mean(),
                                                               log['test_recall'].mean(),
                                                               log['test_f1_score'].mean()],
                                       
                                      'Support Vector Classifier':[svc['test_accuracy'].mean(),
                                                                   svc['test_precision'].mean(),
                                                                   svc['test_recall'].mean(),
                                                                   svc['test_f1_score'].mean()],
                                       
                                      'Decision Tree':[dtr['test_accuracy'].mean(),
                                                       dtr['test_precision'].mean(),
                                                       dtr['test_recall'].mean(),
                                                       dtr['test_f1_score'].mean()],
                                       
                                      'Random Forest':[rfc['test_accuracy'].mean(),
                                                       rfc['test_precision'].mean(),
                                                       rfc['test_recall'].mean(),
                                                       rfc['test_f1_score'].mean()],
                                       
                                      'Gaussian Naive Bayes':[gnb['test_accuracy'].mean(),
                                                              gnb['test_precision'].mean(),
                                                              gnb['test_recall'].mean(),
                                                              gnb['test_f1_score'].mean()],
                                       
                                       'K Neighbor Classifier':[knc['test_accuracy'].mean(),
                                                              knc['test_precision'].mean(),
                                                              knc['test_recall'].mean(),
                                                              knc['test_f1_score'].mean()],
                                        
                                        'QuaD Analysis':[qda['test_accuracy'].mean(),
                                                              qda['test_precision'].mean(),
                                                              qda['test_recall'].mean(),
                                                              qda['test_f1_score'].mean()],
                                       
                                       'Neural Network':[mlp['test_accuracy'].mean(),
                                                              mlp['test_precision'].mean(),
                                                              mlp['test_recall'].mean(),
                                                              mlp['test_f1_score'].mean()]},
                                      
                                      index=['Accuracy', 'Precision', 'Recall', 'F1 Score'])
    ### Add 'Best Score' column
    models_scores_table['Best Score'] = models_scores_table.idxmax(axis=1)
    
    ### Return models performance metrics scores data frame
    return models_scores_table

## Run models_evaluation function
models_evaluation(X, y, 10)

| | Logistic Regression | Support Vector Classifier | Decision Tree | Random Forest | Gaussian Naive Bayes | K Neighbor Classifier | QuaD Analysis | Neural Network | Best Score |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.72684 | 0.728719 | 0.566731 | 0.711079 | 0.715904 | 0.50232 | 0.706248 | 0.728852 | Neural Network |
| Precision | 0.420078 | 0.4483 | 0.2959 | 0.360442 | 0.385342 | 0.236481 | 0.343946 | 0.0 | Support Vector Classifier |
| Recall | 0.032156 | 0.019796 | 0.248027 | 0.026219 | 0.071983 | 0.350777 | 0.113508 | 0.0 | K Neighbor Classifier |
| F1 Score | 0.057523 | 0.036401 | 0.239563 | 0.045789 | 0.117175 | 0.279058 | 0.15853 | 0.0 | K Neighbor Classifier |
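
The evaluation function repeats the same four lines for every classifier. A more compact variant can loop over a dictionary of models and build an equivalent table. This is a sketch under stated assumptions rather than the notebook's own code: `evaluate_models`, `demo_models` and the synthetic data are hypothetical names introduced here, and only three of the eight classifiers are included to keep it short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical subset of models, keyed by the column label used in the table
demo_models = {
    'Logistic Regression': LogisticRegression(max_iter=10000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=0),
    'Gaussian Naive Bayes': GaussianNB(),
}

# Scorer name -> row label in the results table
metric_names = {'accuracy': 'Accuracy', 'precision': 'Precision',
                'recall': 'Recall', 'f1': 'F1 Score'}

def evaluate_models(models, X, y, folds):
    """Cross-validate each model and return mean scores, one column per model."""
    columns = {}
    for name, model in models.items():
        cv = cross_validate(model, X, y, cv=folds, scoring=list(metric_names))
        columns[name] = [cv[f'test_{metric}'].mean() for metric in metric_names]
    table = pd.DataFrame(columns, index=list(metric_names.values()))
    # Name of the best-scoring model for each metric
    table['Best Score'] = table.idxmax(axis=1)
    return table

demo_table = evaluate_models(demo_models, X_demo, y_demo, 5)
print(demo_table)
```

Adding or removing a classifier then only requires editing the dictionary, which keeps the table construction free of copy-and-paste errors.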
