# Model selection for predictive modeling tool (Early screening for oral cancer)

This notebook explains the steps followed to select the optimal model for classifying oral cancer lesions based on colour images. A mass-screening tool was developed (in MATLAB/Python) based on this work. For more details, see the IIT Roorkee Master's thesis repository **(Belvin Thomas, "Identification and classification of oral cancer lesions in color images using SVM and ANN", 2013)**.

Selecting a model and its optimal parameters is an important step in developing a predictive modelling tool, because it determines how well the tool handles the bias-variance trade-off, that is, how well it avoids both underfitting and overfitting. **An ensemble of the selected models and associated parameters is suggested for optimum generalisation.** This should make predictions more robust when the tool meets an unseen image in a real-world mass-screening scenario.

## This file contains:

**1) Loading the cleaned data:** It contains texture features obtained from a repository of cancerous and non-cancerous images. Suitable features are selected from a set of texture features based on gray-level co-occurrence and gray-level run length.

        For more details about the data and the feature selection mechanism, see the thesis cited above.

**2) Splitting of data:** The data is split into training, validation and test sets in a 60-20-20 ratio.

**3) Fitting a base model, cross validation and hyperparameter tuning based on the following machine learning algorithms:**

         - Logistic Regression
         - Support Vector Machines (SVM)
         - Multi-Layer Perceptron (MLP)
         - Random forest classifier
         - Gradient Boosting classifier
         
Machine learning algorithm implementations from the *scikit-learn library* are used to train the models. Hyperparameters are tuned using `GridSearchCV`.

For the full dataset and more test data, contact me at belvinthomas@gmail.com.
         

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

oc = pd.read_csv('OC_data_cleaned.csv')
oc.head()

Unnamed: 0,Labels,F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,F11,F12,F13,F14,F15,F16,F17,F18
0,1,0.315018,0.10506,0.075914,0.064558,0.767077,0.770861,0.769463,0.088371,0.087806,0.089052,0.088658,0.56653,0.059893,0.277701,0.500227,0.393685,0.016851,0.065507
1,1,0.597311,0.456494,0.403967,0.367882,0.399237,0.373282,0.365941,0.336841,0.312095,0.290886,0.264992,0.511623,0.128474,0.175892,0.339991,0.433176,0.064335,0.052067
2,1,0.489999,0.256581,0.220746,0.210259,0.487443,0.454825,0.444068,0.258086,0.259474,0.264092,0.25897,0.419175,0.097771,0.161377,0.362233,0.275433,0.043811,0.049042
3,1,0.666515,0.74435,0.584654,0.490782,0.0,0.044284,0.106833,0.423656,0.400398,0.388102,0.368839,0.685056,0.030992,0.411515,0.632773,0.476661,0.013883,0.022316
4,1,0.686092,0.527778,0.48335,0.463071,0.261286,0.253401,0.27384,0.465242,0.443937,0.423321,0.395139,0.605882,0.058807,0.289854,0.499319,0.353755,0.014066,0.081336


**Train-Validation-Test Data split (0.6-0.2-0.2)**

In [2]:
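# Note: if 'Unnamed: 0' is a real column of the CSV (a saved row index),
# it stays in `features` here; drop it too if it should not act as a predictor.
features = oc.drop('Labels', axis=1)
labels = oc['Labels']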

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

In [3]:
for dataset in [y_train, y_val, y_test]:
    print(round(len(dataset) / len(labels), 2))

0.6
0.2
0.2
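The two-stage split above can also be stratified so that each subset keeps roughly the same class ratio as the full dataset. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption, standing in for the lesion features); the `stratify` argument is not used in the cells above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 18 texture features and binary labels (assumption).
X, y = make_classification(n_samples=500, n_features=18, random_state=42)

# First split: 60% train, 40% held out, preserving the class ratio.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

# Second split: halve the held-out part into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

parts = [('train', y_train), ('val', y_val), ('test', y_test)]
split_sizes = {name: len(part) for name, part in parts}
positive_rates = {name: float(np.mean(part)) for name, part in [('all', y)] + parts}

print(split_sizes)
print({name: round(rate, 2) for name, rate in positive_rates.items()})
```

With a small medical dataset, stratification reduces the chance that the validation or test set ends up with a noticeably different share of positive cases than the training set.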


In [4]:
X_train.to_csv('OCtrain_features.csv', index=False)
X_val.to_csv('OCval_features.csv', index=False)
X_test.to_csv('OCtest_features.csv', index=False)

y_train.to_csv('OCtrain_labels.csv', index=False)
y_val.to_csv('OCval_labels.csv', index=False)
y_test.to_csv('OCtest_labels.csv', index=False) 

**Logistic Regression - Cross validation and hyperparameter (C) tuning**

In [5]:
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

tr_features = pd.read_csv('OCtrain_features.csv')
tr_labels = pd.read_csv('OCtrain_labels.csv')

In [6]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [7]:
lr = LogisticRegression()
parameters = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100,152]
}

cv = GridSearchCV(lr, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

BEST PARAMS: {'C': 152}

0.524 (+/-0.021) for {'C': 0.001}
0.906 (+/-0.052) for {'C': 0.01}
0.931 (+/-0.037) for {'C': 0.1}
0.938 (+/-0.035) for {'C': 1}
0.955 (+/-0.042) for {'C': 10}
0.958 (+/-0.035) for {'C': 100}
0.958 (+/-0.052) for {'C': 152}
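Several values of `C` give almost the same mean score, so it can help to look at the whole ranking rather than only `best_params_`. The following is a minimal sketch on synthetic data (an assumption, not the lesion dataset) that loads `cv_results_` into a DataFrame and sorts it by `rank_test_score`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training features and labels (assumption).
X, y = make_classification(n_samples=300, n_features=18, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.001, 0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)

# One row per candidate; a stable sort keeps tied candidates in grid order.
results = pd.DataFrame(search.cv_results_)
summary = (results[['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score', kind='mergesort')
           .reset_index(drop=True))
print(summary)
```

When two candidates are within one standard deviation of each other, the smaller `C` (stronger regularisation) is often a reasonable choice, although that is a judgment call rather than a rule.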


In [8]:
cv.best_estimator_

LogisticRegression(C=152, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [9]:
joblib.dump(cv.best_estimator_, 'OCmodel_LR.pkl')

['OCmodel_LR.pkl']
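A quick way to check that a model saved with `joblib.dump` survives the round trip is to reload it and compare predictions. This sketch uses synthetic data and a temporary file named `demo_model.pkl` (both hypothetical, chosen only for illustration).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=200, n_features=18, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should predict exactly what the original predicts.
same_predictions = bool(np.array_equal(model.predict(X), restored.predict(X)))
print(same_predictions)
```

Pickled estimators are generally only safe to reload with the same scikit-learn version that wrote them, so it is worth recording that version next to the `.pkl` files.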

**SVM - Model fitting, Cross validation and hyperparameter tuning**

In [10]:
from sklearn.svm import SVC
svc = SVC()
parameters = {
    'kernel': ['linear', 'rbf','poly','sigmoid'],
    'C': [0.1, 1, 10,100]
}

cv = GridSearchCV(svc, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

BEST PARAMS: {'C': 10, 'kernel': 'linear'}

0.944 (+/-0.06) for {'C': 0.1, 'kernel': 'linear'}
0.937 (+/-0.064) for {'C': 0.1, 'kernel': 'rbf'}
0.944 (+/-0.046) for {'C': 0.1, 'kernel': 'poly'}
0.528 (+/-0.026) for {'C': 0.1, 'kernel': 'sigmoid'}
0.955 (+/-0.028) for {'C': 1, 'kernel': 'linear'}
0.941 (+/-0.047) for {'C': 1, 'kernel': 'rbf'}
0.962 (+/-0.04) for {'C': 1, 'kernel': 'poly'}
0.253 (+/-0.118) for {'C': 1, 'kernel': 'sigmoid'}
0.972 (+/-0.047) for {'C': 10, 'kernel': 'linear'}
0.958 (+/-0.06) for {'C': 10, 'kernel': 'rbf'}
0.965 (+/-0.062) for {'C': 10, 'kernel': 'poly'}
0.233 (+/-0.108) for {'C': 10, 'kernel': 'sigmoid'}
0.965 (+/-0.049) for {'C': 100, 'kernel': 'linear'}
0.948 (+/-0.073) for {'C': 100, 'kernel': 'rbf'}
0.952 (+/-0.07) for {'C': 100, 'kernel': 'poly'}
0.226 (+/-0.104) for {'C': 100, 'kernel': 'sigmoid'}


In [11]:
cv.best_estimator_

SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [12]:
joblib.dump(cv.best_estimator_, 'OCmodel_SVM.pkl')

['OCmodel_SVM.pkl']

**MLP - Model fitting, Cross validation and hyperparameter tuning**

In [13]:
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)
mlp = MLPClassifier()
scaler = preprocessing.StandardScaler().fit(tr_features)
tr_features_scaled=scaler.transform(tr_features)

parameters = {
    'hidden_layer_sizes': [(3,), (50,), (18,)],
    'activation': ['relu', 'tanh', 'logistic'],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'solver':['lbfgs', 'sgd', 'adam'],
    'early_stopping' : [True]
}

cv = GridSearchCV(mlp, parameters, cv=5)
cv.fit(tr_features_scaled, tr_labels.values.ravel())

print_results(cv)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


BEST PARAMS: {'activation': 'logistic', 'early_stopping': True, 'hidden_layer_sizes': (3,), 'learning_rate': 'invscaling', 'solver': 'lbfgs'}

0.938 (+/-0.047) for {'activation': 'relu', 'early_stopping': True, 'hidden_layer_sizes': (3,), 'learning_rate': 'constant', 'solver': 'lbfgs'}
0.742 (+/-0.292) for {'activation': 'relu', 'early_stopping': True, 'hidden_layer_sizes': (3,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.572 (+/-0.329) for {'activation': 'relu', 'early_stopping': True, 'hidden_layer_sizes': (3,), 'learning_rate': 'constant', 'solver': 'adam'}
0.948 (+/-0.062) for {'activation': 'relu', 'early_stopping': True, 'hidden_layer_sizes': (3,), 'learning_rate': 'invscaling', 'solver': 'lbfgs'}
0.484 (+/-0.351) for {'activation': 'relu', 'early_stopping': True, 'hidden_layer_sizes': (3,), 'learning_rate': 'invscaling', 'solver': 'sgd'}
0.493 (+/-0.04) for {'activation': 'relu', 'early_stopping': True, 'hidden_layer_sizes': (3,), 'learning_rate': 'invscaling', 'solver': 'a

In [14]:
cv.best_estimator_

MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
              beta_1=0.9, beta_2=0.999, early_stopping=True, epsilon=1e-08,
              hidden_layer_sizes=(3,), learning_rate='invscaling',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='lbfgs',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [15]:
joblib.dump(cv.best_estimator_, 'OCmodel_MLP.pkl')

['OCmodel_MLP.pkl']
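In the MLP cell above, the scaler is fitted on the whole training set before cross validation, so every validation fold contributes to the scaling statistics. One common alternative is to put the scaler and the classifier in a `Pipeline`, so that scaling is refitted inside each fold. The sketch below assumes synthetic data and a deliberately small grid. It also fixes `solver='lbfgs'`, since the scikit-learn documentation describes `learning_rate` as used only by `sgd` and `early_stopping` as effective only for `sgd` and `adam`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training features and labels (assumption).
X, y = make_classification(n_samples=300, n_features=18, random_state=0)

# Scaling and the classifier are tuned together, fold by fold.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('mlp', MLPClassifier(solver='lbfgs', max_iter=1000, random_state=0)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid = {
    'mlp__hidden_layer_sizes': [(3,), (18,)],
    'mlp__activation': ['relu', 'logistic'],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Saving `search.best_estimator_` then stores the fitted scaler together with the network, so new data can be passed to `predict` without a separate scaling step.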

**Random Forest - Model fitting, Cross validation and hyperparameter tuning**

In [16]:
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [2, 4, 8, 16, 32, None]
}

cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

BEST PARAMS: {'max_depth': None, 'n_estimators': 50}

0.951 (+/-0.014) for {'max_depth': 2, 'n_estimators': 5}
0.934 (+/-0.055) for {'max_depth': 2, 'n_estimators': 50}
0.934 (+/-0.055) for {'max_depth': 2, 'n_estimators': 250}
0.931 (+/-0.049) for {'max_depth': 4, 'n_estimators': 5}
0.941 (+/-0.052) for {'max_depth': 4, 'n_estimators': 50}
0.941 (+/-0.052) for {'max_depth': 4, 'n_estimators': 250}
0.944 (+/-0.041) for {'max_depth': 8, 'n_estimators': 5}
0.938 (+/-0.056) for {'max_depth': 8, 'n_estimators': 50}
0.938 (+/-0.056) for {'max_depth': 8, 'n_estimators': 250}
0.934 (+/-0.04) for {'max_depth': 16, 'n_estimators': 5}
0.931 (+/-0.049) for {'max_depth': 16, 'n_estimators': 50}
0.944 (+/-0.034) for {'max_depth': 16, 'n_estimators': 250}
0.934 (+/-0.071) for {'max_depth': 32, 'n_estimators': 5}
0.938 (+/-0.047) for {'max_depth': 32, 'n_estimators': 50}
0.944 (+/-0.034) for {'max_depth': 32, 'n_estimators': 250}
0.944 (+/-0.035) for {'max_depth': None, 'n_estimators': 5}
0.951 (+/-0

In [17]:
cv.best_estimator_

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [18]:
joblib.dump(cv.best_estimator_, 'OCmodel_RandomForest.pkl')

['OCmodel_RandomForest.pkl']

**Gradient Boost - Model fitting, Cross validation and hyperparameter tuning**

In [19]:
from sklearn.ensemble import GradientBoostingClassifier
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)
gb = GradientBoostingClassifier()
parameters = {
    'n_estimators': [5, 50, 250, 500],
    'max_depth': [1, 3, 5, 7, 9],
    'learning_rate': [0.01, 0.1, 1, 10, 100]
}

cv = GridSearchCV(gb, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)


BEST PARAMS: {'learning_rate': 1, 'max_depth': 1, 'n_estimators': 50}

0.796 (+/-0.284) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 5}
0.899 (+/-0.05) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 50}
0.945 (+/-0.059) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 250}
0.945 (+/-0.059) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 500}
0.927 (+/-0.074) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 5}
0.931 (+/-0.062) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}
0.938 (+/-0.051) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 250}
0.938 (+/-0.051) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500}
0.924 (+/-0.064) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 5}
0.934 (+/-0.059) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 50}
0.934 (+/-0.059) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 250}
0.934 (+/-0.059) for {'learning_rate'

In [20]:
cv.best_estimator_

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=1, loss='deviance', max_depth=1,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=50,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [21]:
joblib.dump(cv.best_estimator_, 'OCmodel_GradientBoost.pkl')

['OCmodel_GradientBoost.pkl']

### Model evaluation (applying the saved models to the validation set)

In [22]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
from time import time

val_features = pd.read_csv('OCval_features.csv')
val_labels = pd.read_csv('OCval_labels.csv')

te_features = pd.read_csv('OCtest_features.csv')
te_labels = pd.read_csv('OCtest_labels.csv')

In [23]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(tr_features)
val_features_scaled=scaler.transform(val_features)
te_features_scaled=scaler.transform(te_features)

In [24]:
models = {}

for mdl in ['LR', 'SVM', 'RandomForest', 'GradientBoost']:
    models[mdl] = joblib.load('OCmodel_{}.pkl'.format(mdl))
    
# The MLP is loaded separately because it expects scaled features.
MLPmodel = joblib.load('OCmodel_MLP.pkl')

In [25]:
models

{'LR': LogisticRegression(C=152, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
 'SVM': SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False),
 'RandomForest': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        m

In [26]:
def evaluate_model(name, model, features, labels):
    start = time()
    pred = model.predict(features)
    end = time()
    accuracy = round(accuracy_score(labels, pred), 3)
    precision = round(precision_score(labels, pred), 3)
    recall = round(recall_score(labels, pred), 3)
    print('{} -- Accuracy: {} / Precision: {} / Recall: {} / Latency: {}ms'.format(
        name, accuracy, precision, recall, round((end - start) * 1000, 1)))
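To make the three metrics reported by `evaluate_model` concrete, the sketch below derives them by hand from a confusion matrix. The ten labels are made up for illustration, and treating label 1 as the positive (cancerous) class is an assumption.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy labels: five positives, five negatives; one positive is missed.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that are found

print(tp, fp, fn, tn)                # 4 0 1 5
print(accuracy, precision, recall)   # 0.9 1.0 0.8
```

A precision of 1.0 with recall below 1.0 means no healthy case was flagged, but some positive cases were missed. For a screening tool, the missed positives (recall) are usually the more costly error.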

In [27]:
for name, mdl in models.items():
    evaluate_model(name, mdl, val_features, val_labels)

LR -- Accuracy: 0.969 / Precision: 1.0 / Recall: 0.944 / Latency: 0.0ms
SVM -- Accuracy: 0.958 / Precision: 1.0 / Recall: 0.926 / Latency: 1.1ms
RandomForest -- Accuracy: 0.958 / Precision: 1.0 / Recall: 0.926 / Latency: 3.1ms
GradientBoost -- Accuracy: 0.969 / Precision: 1.0 / Recall: 0.944 / Latency: 0.0ms


In [28]:
evaluate_model('MLP', MLPmodel, val_features_scaled, val_labels)

MLP -- Accuracy: 0.958 / Precision: 1.0 / Recall: 0.926 / Latency: 0.0ms
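Several models report a latency of 0.0ms, which probably means that a single `predict` call finished faster than the resolution of `time()` on the machine used. A minimal sketch of a finer measurement, assuming synthetic data and a hypothetical helper `mean_latency_ms`, averages many calls timed with `perf_counter`.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the validation features (assumption).
X, y = make_classification(n_samples=200, n_features=18, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)


def mean_latency_ms(model, features, repeats=100):
    """Average wall-clock time of model.predict over several repeats, in ms."""
    start = perf_counter()
    for _ in range(repeats):
        model.predict(features)
    return (perf_counter() - start) * 1000 / repeats


latency = mean_latency_ms(model, X)
print('{:.3f} ms per predict call'.format(latency))
```

Averaging over repeats also smooths out one-off delays, so the numbers are easier to compare across models.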


### Final Model Selection (applying the saved models to the test set)

In [29]:
evaluate_model('Random Forest', models['RandomForest'], te_features, te_labels)

Random Forest -- Accuracy: 0.969 / Precision: 0.941 / Recall: 1.0 / Latency: 19.4ms


In [30]:
evaluate_model('SVM', models['SVM'], te_features, te_labels)

SVM -- Accuracy: 0.948 / Precision: 0.939 / Recall: 0.958 / Latency: 0.0ms


In [31]:
evaluate_model('LR', models['LR'], te_features, te_labels)

LR -- Accuracy: 0.948 / Precision: 0.939 / Recall: 0.958 / Latency: 0.0ms


In [32]:
evaluate_model('MLP', MLPmodel, te_features_scaled, te_labels)

MLP -- Accuracy: 0.927 / Precision: 0.936 / Recall: 0.917 / Latency: 0.0ms


In [33]:
evaluate_model('GradientBoost ', models['GradientBoost'], te_features, te_labels)

GradientBoost  -- Accuracy: 0.948 / Precision: 0.922 / Recall: 0.979 / Latency: 0.0ms
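The introduction suggests combining the selected models into an ensemble. One possible way to do that, sketched here under stated assumptions, is scikit-learn's `VotingClassifier` with hard (majority) voting. The data is synthetic and the hyperparameters only loosely echo the grid search results above; they are placeholders, not the tuned models from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the lesion features and labels (assumption).
X, y = make_classification(n_samples=400, n_features=18, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Five voters, one per algorithm family; an odd count avoids tied votes
# in a two-class problem. Scale-sensitive models get their own scaler.
ensemble = VotingClassifier(
    estimators=[
        ('lr', make_pipeline(StandardScaler(),
                             LogisticRegression(C=100, max_iter=1000))),
        ('svm', make_pipeline(StandardScaler(), SVC(kernel='linear', C=10))),
        ('mlp', make_pipeline(StandardScaler(),
                              MLPClassifier(hidden_layer_sizes=(3,), solver='lbfgs',
                                            max_iter=1000, random_state=0))),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('gb', GradientBoostingClassifier(n_estimators=50, max_depth=1,
                                          random_state=0)),
    ],
    voting='hard',
)
ensemble.fit(X_train, y_train)
ensemble_accuracy = ensemble.score(X_test, y_test)
print(round(ensemble_accuracy, 3))
```

Soft voting (averaging predicted probabilities) is another option, but it would require `SVC(probability=True)`, since the SVM saved above was trained with `probability=False`.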
