# Model Training
## Introduction

This notebook uses a simple dataset for epitope prediction in vaccine development. The data comes from the Kaggle COVID-19/SARS B-cell Epitope Prediction dataset, which was cloned to a GitHub repository for this project. The notebook goes through the following steps:
1. Load the training set
2. Determine optimal hyperparameters for the classifier
3. Train an MLP classifier model


## Setup

In [1]:
## Environment libraries
import os, types
import ibm_boto3
from botocore.client import Config
import warnings

## Data processing libraries
import numpy as np
import pandas as pd

## Plot libraries
#import matplotlib.pyplot as plt

## Machine learning classifier and tools
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV

## Performance metric libraries
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV

In [2]:
# Handle warnings
warnings.filterwarnings('ignore')  # "error", "ignore", "always", "default", "module", or "once"
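
Ignoring every warning also hides messages that may matter later in the notebook. The sketch below shows a narrower filter, assuming the noisy warnings here are scikit-learn's `ConvergenceWarning` (a plausible guess given `max_iter=100` in the search, not something the notebook confirms).

```python
import warnings
from sklearn.exceptions import ConvergenceWarning

# Silence only convergence warnings; all other warning categories
# keep their default behaviour.
warnings.filterwarnings("ignore", category=ConvergenceWarning)
```

Deprecation or data-conversion warnings remain visible with this filter.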

## Load Training Data

In [3]:
def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.

if os.environ.get('RUNTIME_ENV_LOCATION_TYPE') == 'external':
    endpoint_d5592f3b715a4ab696c8411b475cdae6 = 'https://s3-api.us-geo.objectstorage.softlayer.net'
else:
    endpoint_d5592f3b715a4ab696c8411b475cdae6 = 'https://s3-api.us-geo.objectstorage.service.networklayer.com'

client_d5592f3b715a4ab696c8411b475cdae6 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='qDV6Rk1jMFRAdD1sFE2uZh2tpbGIpHsvc66ObIULfDQ8',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url=endpoint_d5592f3b715a4ab696c8411b475cdae6)

body = client_d5592f3b715a4ab696c8411b475cdae6.get_object(Bucket='covid19epitopeprediction-donotdelete-pr-vjiedfg7ztqsrx',Key='train_data.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df = pd.read_csv(body)
df.head()

| Unnamed: 0 | peptide_length | chou_fasman | emini | kolaskar_tongaonkar | parker | isoelectric_point | aromaticity | hydrophobicity | stability | kmeans_feature | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.476285 | 0.017313 | 0.431655 | 0.619814 | 0.24855 | 0.566651 | 0.564298 | 0.264627 | 0.0 | 1 |
| 1 | 0.0 | 0.233202 | 0.004408 | 0.865707 | 0.284809 | 0.295412 | 0.359258 | 0.597317 | 0.148556 | 0.333333 | 1 |
| 2 | 0.0 | 0.314229 | 0.084398 | 0.292566 | 0.733319 | 0.530951 | 0.503623 | 0.880225 | 0.170325 | 0.333333 | 1 |
| 3 | 0.0 | 0.865613 | 0.062751 | 0.235012 | 0.845722 | 0.064573 | 0.245679 | 0.447703 | 0.192377 | 0.333333 | 1 |
| 4 | 0.0 | 0.671937 | 0.046989 | 0.23741 | 0.753154 | 0.37224 | 0.569787 | 0.429961 | 0.123374 | 0.333333 | 1 |


## Model Optimization and Evaluation
After several iterations, the model performed slightly better with the L-BFGS solver and three hidden layers of 50, 100, and 50 units.

In [4]:
# Create feature and target arrays
X = df.drop(['target'], axis = 1).to_numpy()
y = df['target'].to_numpy()
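
The `df.head()` output includes an `Unnamed: 0` column, which looks like a row index saved with the CSV, and `df.drop(['target'], axis = 1)` keeps it as a feature. The sketch below, assuming that column really is only an index, uses a hypothetical helper `split_features_target` to leave it out. The demo frame is made up for illustration and is not the notebook's data.

```python
import pandas as pd

def split_features_target(frame, target_col='target', index_col='Unnamed: 0'):
    """Return (X, y) arrays, leaving out the target and the saved index column."""
    # errors='ignore' keeps this working if the index column is absent.
    features = frame.drop(columns=[target_col, index_col], errors='ignore')
    return features.to_numpy(), frame[target_col].to_numpy()

# Tiny made-up frame, for illustration only.
demo = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'parker': [0.6, 0.3, 0.7],
    'stability': [0.26, 0.15, 0.17],
    'target': [1, 0, 1],
})
X_demo, y_demo = split_features_target(demo)
print(X_demo.shape, y_demo.shape)  # (3, 2) (3,)
```

Dropping the column would change the feature matrix, so the scores reported in this notebook would no longer apply as-is.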

In [5]:
# Hold out 30% of the samples as a test set, preserving the class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)
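
As a quick sanity check on `stratify=y`, the class proportions of the two subsets can be compared. The sketch below uses synthetic stand-in labels (30% positives, an arbitrary choice) and a hypothetical helper `positive_rate`; with the real data, `y_train` and `y_test` would take the place of the demo arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def positive_rate(labels):
    """Fraction of samples labelled 1."""
    return float(np.mean(labels))

# Synthetic stand-in data: 200 samples, 60 of them positive.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 4))
y_demo = np.array([1] * 60 + [0] * 140)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=0
)
print(len(y_tr), len(y_te))
print(positive_rate(y_tr), positive_rate(y_te))
```

The two printed rates should be close to each other and to the overall positive rate.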

In [6]:
# Instantiate the machine learning classifier
mlp = MLPClassifier(max_iter=100)

In [7]:
# Define the hyperparameter search space
parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'lbfgs'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}
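
To get a feel for the cost of an exhaustive search over this space, the candidates can be counted with `ParameterGrid`. The sketch also shows a hypothetical alternative layout (`split_space`, not part of the notebook) that varies `learning_rate` only for `'sgd'`, since scikit-learn documents that parameter as used only by that solver.

```python
from sklearn.model_selection import ParameterGrid

# The search space from the cell above.
parameter_space = {
    'hidden_layer_sizes': [(50, 50, 50), (50, 100, 50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'lbfgs'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive'],
}

n_candidates = len(ParameterGrid(parameter_space))  # 3 * 2 * 2 * 2 * 2
n_fits = n_candidates * 3  # cv=3 folds per candidate
print(n_candidates, n_fits)  # 48 144

# Hypothetical alternative: split the grid by solver so that
# learning_rate is only varied for 'sgd'.
shared = {
    'hidden_layer_sizes': parameter_space['hidden_layer_sizes'],
    'activation': parameter_space['activation'],
    'alpha': parameter_space['alpha'],
}
split_space = [
    {**shared, 'solver': ['sgd'], 'learning_rate': ['constant', 'adaptive']},
    {**shared, 'solver': ['lbfgs']},
]
n_split_candidates = len(ParameterGrid(split_space))
print(n_split_candidates)  # 36
```

`GridSearchCV` accepts a list of dictionaries like `split_space` as its parameter grid, so the smaller layout could be passed in the same way.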

In [8]:
# Perform hyperparameter search
clf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3)
clf.fit(X_train, y_train)

GridSearchCV(cv=3, estimator=MLPClassifier(max_iter=100), n_jobs=-1,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.0001, 0.05],
                         'hidden_layer_sizes': [(50, 50, 50), (50, 100, 50),
                                                (100,)],
                         'learning_rate': ['constant', 'adaptive'],
                         'solver': ['sgd', 'lbfgs']})

In [9]:
# Show best parameter set
print('Best parameters found:\n', clf.best_params_)

## Show all results
#means = clf.cv_results_['mean_test_score']
#stds = clf.cv_results_['std_test_score']
#for mean, std, params in zip(means, stds, clf.cv_results_['params']):
#    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

Best parameters found:
 {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver': 'lbfgs'}
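
Beyond the single best setting, it can help to see how close the runners-up were. The sketch below ranks candidates from `cv_results_` in a DataFrame. It runs on synthetic data with a deliberately small grid so it finishes quickly; `summarize_search` is a hypothetical helper, and with the notebook's fitted search one would call `summarize_search(clf)` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

def summarize_search(search, top_n=5):
    """Return the top_n candidates of a fitted search, best first."""
    results = pd.DataFrame(search.cv_results_)
    cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    return results[cols].sort_values("rank_test_score").head(top_n)

# Synthetic stand-in data and a tiny grid, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_search = GridSearchCV(
    MLPClassifier(solver="lbfgs", max_iter=200, random_state=0),
    {"hidden_layer_sizes": [(10,), (20,)], "alpha": [0.0001, 0.05]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

summary = summarize_search(demo_search)
print(summary)
```

If several candidates score within one standard deviation of the best, the choice among them is probably not significant.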


In [14]:
# Instantiate the final classifier. alpha=1e-4 differs from the grid-search best (0.05);
# learning_rate and learning_rate_init have no effect with solver='lbfgs'.
mlp = MLPClassifier(activation='relu', hidden_layer_sizes=(50, 100, 50), max_iter=1000, alpha=1e-4,
                    solver='lbfgs', random_state=42, learning_rate='adaptive', learning_rate_init=0.1)

In [15]:
# Train model
mlp.fit(X_train,y_train)
print("Training set score: %f" % mlp.score(X_train, y_train))

Training set score: 0.840809
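
The training score alone does not show how well the model generalizes, and the held-out split created earlier is not used in the cells above. The sketch below shows one way to score a fitted model on held-out data with `classification_report`. It runs on synthetic data with a small network, and `evaluate_on_holdout` is a hypothetical helper; with the notebook's objects the call would be `evaluate_on_holdout(mlp, X_test, y_test)`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def evaluate_on_holdout(model, X_holdout, y_holdout):
    """Return (accuracy, per-class text report) on held-out data."""
    y_pred = model.predict(X_holdout)
    report = classification_report(y_holdout, y_pred)
    return model.score(X_holdout, y_holdout), report

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=0
)
demo_model = MLPClassifier(
    hidden_layer_sizes=(20,), solver="lbfgs", max_iter=500, random_state=42
).fit(X_tr, y_tr)

accuracy, report = evaluate_on_holdout(demo_model, X_te, y_te)
print("Test set score: %f" % accuracy)
print(report)
```

A large gap between the training and test scores would suggest overfitting.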
