## Diabetic Retinopathy Debrecen Dataset - SVM/Neural Network/Ensemble

The goal of this analysis is to classify patients as either positive or negative for diabetic retinopathy. Three kinds of classifiers are compared: Support Vector Machine (SVM), Neural Network, and Ensemble classifiers. The dataset contains 1151 instances and 20 attributes (categorical and continuous, with the class label as the last attribute) and can be found [here](http://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set).

### Set up Environment 

In [1]:
# import libraries 
import pickle
import warnings
import numpy as np
import pandas as pd
import sklearn as sk

from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score 

%matplotlib inline
warnings.simplefilter("ignore")

### Read in Data 

In [2]:
# create list with column names (the numeric suffix is the column position)
col_names = (
    ['quality', 'prescreen']
    + ['ma' + str(i) for i in range(2, 8)]
    + ['exudate' + str(i) for i in range(8, 16)]
    + ['eu_dist', 'diameter', 'amfm_class', 'label']
)

# read in data, add column names 
data = pd.read_csv("messidor_features.txt", names = col_names)

# preview data 
print(data.info())

data.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1151 entries, 0 to 1150
Data columns (total 20 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   quality     1151 non-null   int64  
 1   prescreen   1151 non-null   int64  
 2   ma2         1151 non-null   int64  
 3   ma3         1151 non-null   int64  
 4   ma4         1151 non-null   int64  
 5   ma5         1151 non-null   int64  
 6   ma6         1151 non-null   int64  
 7   ma7         1151 non-null   int64  
 8   exudate8    1151 non-null   float64
 9   exudate9    1151 non-null   float64
 10  exudate10   1151 non-null   float64
 11  exudate11   1151 non-null   float64
 12  exudate12   1151 non-null   float64
 13  exudate13   1151 non-null   float64
 14  exudate14   1151 non-null   float64
 15  exudate15   1151 non-null   float64
 16  eu_dist     1151 non-null   float64
 17  diameter    1151 non-null   float64
 18  amfm_class  1151 non-null   int64  
 19  label       1151 non-null  

index,quality,prescreen,ma2,ma3,ma4,ma5,ma6,ma7,exudate8,exudate9,exudate10,exudate11,exudate12,exudate13,exudate14,exudate15,eu_dist,diameter,amfm_class,label
857,1,1,24,23,22,16,14,9,47.3263,26.655198,12.243846,7.371068,5.069121,2.92386,1.354087,0.274686,0.488392,0.106393,0,1
665,1,1,12,12,9,8,6,2,114.678102,59.917325,25.885087,1.193682,0.015745,0.005904,0.0,0.0,0.551971,0.085614,1,0
1148,1,0,49,48,48,45,43,33,30.461898,13.96698,1.763305,0.137858,0.011221,0.0,0.0,0.0,0.560632,0.129843,0,0
502,1,1,66,64,63,62,56,42,39.164755,9.526313,3.345879,0.245682,0.004607,0.001536,0.0,0.0,0.481664,0.079847,0,1
801,1,1,8,8,8,7,7,5,56.855339,21.532559,4.982154,0.335777,0.003893,0.003893,0.003893,0.003893,0.561032,0.136257,1,0
174,1,0,13,13,13,12,11,6,31.004416,14.292234,0.994879,0.047469,0.0,0.0,0.0,0.0,0.549666,0.133508,0,0
1096,1,1,64,63,63,60,56,50,22.276818,10.053161,0.868831,0.056869,0.0,0.0,0.0,0.0,0.565797,0.127955,0,0
133,1,1,63,63,61,59,55,48,13.198181,5.947414,1.164862,0.10156,0.020004,0.006155,0.003078,0.0,0.507123,0.090788,0,1
806,1,1,38,36,32,19,13,8,159.901453,86.301094,26.082431,2.430988,0.072704,0.0,0.0,0.0,0.531522,0.144385,1,1
579,1,1,19,19,19,18,16,14,66.900142,18.928429,7.771131,1.621016,0.155301,0.018271,0.0,0.0,0.494315,0.112669,1,0


### Data Preprocessing 

Now that the data has been read in, the features and class labels need to be split. 

In [3]:
# separate features and labels 
labels = data['label']
features = data.drop(['label'], axis = 1) 

# check shape
print(labels.shape)
print(features.shape)

(1151,)
(1151, 19)
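
Since accuracy is the only metric used in the rest of this analysis, it can help to know how balanced the two classes are before reading the scores. The sketch below is a minimal illustration on a small made-up Series (`labels_demo` is a hypothetical stand-in, not the real `labels` column); the same two calls should work on `labels` directly.

```python
import pandas as pd

# hypothetical stand-in for data['label'] (assumption: binary 0/1 labels)
labels_demo = pd.Series([1, 0, 1, 1, 0, 1, 0, 1])

# fraction of records in each class
class_fractions = labels_demo.value_counts(normalize=True)

# accuracy obtained by always predicting the majority class
baseline_accuracy = class_fractions.max()

print(class_fractions)
print('majority-class baseline:', baseline_accuracy)
```

Any cross-validated accuracy reported later is only meaningful relative to this kind of baseline.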


### Support Vector Machines (SVM)

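For Support Vector Machines (SVM), scaling the data matters because the algorithm is sensitive to the relative magnitudes of the features. To fit the scaler on only the training portion of each cross-validation fold, the scaler and the classifier are wrapped in a Pipeline, and the Pipeline is passed to the cross-validation function. 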

In [4]:
# create scaler 
scaler = StandardScaler() 

# create support vector classification 
svc = SVC() 

# create pipeline 
pipe = Pipeline(steps=[('scaler', scaler), ('svc', svc)]) 

# get cv scores
cv_scores = cross_val_score(pipe, features, labels, cv=5)

# print scores 
print('accuracy:', cv_scores.mean()) 

accuracy: 0.7011368341803125
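
To make explicit what the Pipeline does inside the cross-validation loop, the sketch below repeats the same procedure by hand: the scaler is fit on each training fold only and then applied to the held-out fold. It runs on synthetic data from `make_classification` (an assumption, used so the example is self-contained), and it assumes `cross_val_score` with `cv=5` uses unshuffled stratified folds for a classifier, which is the documented default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the real features/labels (assumption)
X, y = make_classification(n_samples=200, n_features=19, random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # fit the scaler on the training fold only, so no test statistics leak in
    fold_scaler = StandardScaler().fit(X[train_idx])
    model = SVC().fit(fold_scaler.transform(X[train_idx]), y[train_idx])
    # the held-out fold is transformed with the training-fold mean and std
    manual_scores.append(model.score(fold_scaler.transform(X[test_idx]), y[test_idx]))

# the Pipeline performs the same fit/transform steps per fold
pipe_demo = Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
pipe_scores = cross_val_score(pipe_demo, X, y, cv=5)

print('manual:  ', np.round(manual_scores, 4))
print('pipeline:', np.round(pipe_scores, 4))
```

Scaling the full dataset once before cross-validation would instead let the held-out folds influence the scaler's mean and standard deviation.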


Now, let's see if the model can be improved by tuning the parameters, specifically the kernel:

In [5]:
# set parameters to tune
params = {'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid']}

# create gridsearchcv + fit data 
grid_search = GridSearchCV(pipe, params, cv=5) 
grid_search.fit(features, labels) 

# print results 
print('best parameters:', grid_search.best_params_)

best parameters: {'svc__kernel': 'linear'}


According to the grid search, the best kernel is linear. Now, let's pass the grid search itself into another cross-validation loop (nested cross-validation), so that the accuracy is estimated on data that was not used to choose the kernel. 

In [6]:
# get cv scores
cv_scores = cross_val_score(grid_search, features, labels, cv=5)

# print results 
print('accuracy:', cv_scores.mean())

accuracy: 0.7228646715603239
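
Passing a `GridSearchCV` object to `cross_val_score` amounts to a nested loop: the search is rerun on every outer training fold, and the winning model is scored on the matching outer test fold. The sketch below writes that loop out by hand on synthetic data (an assumption, as are the names `pipe_demo` and `params_demo`), with a reduced kernel list to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the real features/labels (assumption)
X, y = make_classification(n_samples=150, n_features=19, random_state=0)

pipe_demo = Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
params_demo = {'svc__kernel': ['linear', 'rbf']}

outer_scores, outer_choices = [], []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # inner loop: choose the kernel using only the outer training fold
    search = GridSearchCV(pipe_demo, params_demo, cv=5)
    search.fit(X[train_idx], y[train_idx])
    outer_choices.append(search.best_params_['svc__kernel'])
    # outer loop: score the refit best model on data the search never saw
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print('kernel chosen per outer fold:', outer_choices)
print('nested accuracy:', np.mean(outer_scores))
```

The chosen kernel may differ between outer folds, which is why the nested score describes the tuning procedure as a whole rather than one fixed set of parameters.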


Tuning the kernel improved the accuracy slightly, from about 0.70 to about 0.72. Now, let's also search for the best 'C' parameter, which sets the cost of a misclassification during training. 

In [7]:
# set parameters 
params = {
    'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid'], 
    'svc__C': list(range(50, 101, 10))
    }

# create gridsearchCV + fit data 
grid_search = GridSearchCV(pipe, params, cv=5) 
grid_search.fit(features, labels) 

# get cv scores 
cv_scores = cross_val_score(grid_search, features, labels, cv=5)

# print results 
print('accuracy:', cv_scores.mean())

accuracy: 0.7454357236965933
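
A fitted grid search also keeps the score of every parameter combination, which shows how much 'C' and the kernel each contribute. The sketch below is one way to inspect this, assuming synthetic data and a reduced grid (`params_demo` is a hypothetical, smaller version of the grid above); the `cv_results_` keys follow scikit-learn's `param_<name>` naming.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the real features/labels (assumption)
X, y = make_classification(n_samples=150, n_features=19, random_state=0)

pipe_demo = Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
params_demo = {'svc__kernel': ['linear', 'rbf'], 'svc__C': [50, 100]}

search = GridSearchCV(pipe_demo, params_demo, cv=5).fit(X, y)

# one row per parameter combination, best-ranked first
columns = ['param_svc__kernel', 'param_svc__C',
           'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')

print(results)
```

If the top rows differ by less than their `std_test_score`, the apparent winner is within the noise of the folds.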


With the 'C' parameter tuned as well, the accuracy increased to about 0.75. Let's try using a neural network to classify our data: 

### Neural Networks (NN) 

The neural network used here is a multi-layer perceptron (MLP), a supervised learning algorithm. Like SVMs, MLPs are sensitive to feature scaling, so the input is standardized inside the pipeline. 

In [8]:
# create NN 
mlp = MLPClassifier() 

# create pipeline
pipe = Pipeline(steps=[('scaler', scaler), ('mlp', mlp)]) 

# set parameters 
params = {
    'mlp__hidden_layer_sizes': [(10, ), (20, ), (30, ), (40, ), (50, ), (60, )], 
    'mlp__activation': ['logistic', 'tanh', 'relu']
    }

# create gridsearchCV + fit data
grid_search = GridSearchCV(pipe, params, cv=5) 
grid_search.fit(features, labels) 

# get cv scores
cv_scores = cross_val_score(grid_search, features, labels, cv=5) 

# print accuracy 
print('accuracy:', cv_scores.mean()) 

accuracy: 0.7315189158667419
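
The nested search over the MLP grid is noticeably slower than the earlier cells, and a rough count of model fits explains why. The sketch below only counts fits; it assumes the default `refit=True` of `GridSearchCV`, which adds one final fit per search.

```python
from sklearn.model_selection import ParameterGrid

# same grid as in the cell above
params_demo = {
    'mlp__hidden_layer_sizes': [(10,), (20,), (30,), (40,), (50,), (60,)],
    'mlp__activation': ['logistic', 'tanh', 'relu'],
}
inner_folds = 5   # cv=5 inside GridSearchCV
outer_folds = 5   # cv=5 in cross_val_score

# 6 layer sizes x 3 activations
n_candidates = len(ParameterGrid(params_demo))

# each search fits every candidate on every inner fold, then refits the best one
fits_per_search = n_candidates * inner_folds + 1

# cross_val_score repeats the whole search once per outer fold
fits_nested = outer_folds * fits_per_search

print('candidates:', n_candidates)
print('fits per grid search:', fits_per_search)
print('fits in the nested loop:', fits_nested)
```

The explicit `grid_search.fit(features, labels)` call before `cross_val_score` adds one more full search on top of the nested total.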


Unfortunately, the neural network (accuracy of about 0.73) did not improve on the tuned SVM (about 0.75). 

### Ensemble Classifiers 

Ensemble classifiers combine the predictions of multiple base estimators to improve the accuracy of the predictions. There are several types of ensembles, but we will only be exploring Random Forests and AdaBoost.

#### Random Forests

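Random forests are a kind of ensemble classifier in which many decision trees are built independently, so they can be trained in parallel. Each tree is fit on a bootstrap sample of the training data and considers only a random subset of the input features at each split, which makes the trees differ from one another.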

In [9]:
# create random forest classifier 
rfc = RandomForestClassifier() 

# set params 
params = { 
    'max_depth': list(range(35, 56)), 
    'min_samples_leaf': [8, 10, 12], 
    'max_features': ['sqrt', 'log2']
    } 

# create gridsearchCV 
grid_search = GridSearchCV(rfc, params, cv=5) 
grid_search.fit(features, labels) 

# get cv scores 
cv_scores = cross_val_score(grid_search, features, labels, cv=5) 

# print accuracy 
print('accuracy:', cv_scores.mean())

accuracy: 0.6767588932806324
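
The `max_features` values in the grid control how many features each tree considers per split. With 19 features, the square root and the base-2 logarithm both fall between 4 and 5, so the two settings may resolve to the same number. The sketch below checks this on synthetic data (an assumption), reading the fitted trees' `max_features_` attribute; the exact rounding is a scikit-learn implementation detail and could differ between versions.

```python
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

n_features = 19  # number of feature columns after dropping the label

# both values lie between 4 and 5
print('sqrt:', math.sqrt(n_features), 'log2:', math.log2(n_features))

# synthetic stand-in with the same number of features (assumption)
X, y = make_classification(n_samples=100, n_features=n_features, random_state=0)

resolved = {}
for setting in ['sqrt', 'log2']:
    forest = RandomForestClassifier(n_estimators=5, max_features=setting,
                                    random_state=0).fit(X, y)
    # each fitted tree records the integer number of features it samples per split
    resolved[setting] = forest.estimators_[0].max_features_

print(resolved)
```

If both settings resolve to the same integer, that grid dimension doubles the search time without changing the candidate models.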


#### AdaBoost 

The other type of ensemble classifier explored here, AdaBoost, uses boosting: the base classifiers are trained one after another, and each new classifier puts more weight on the training samples that the previous classifiers got wrong. This contrasts with Random Forests, where the trees are trained independently of one another. 

In [None]:
# create ada boost classifier 
ada = AdaBoostClassifier() 

# set params 
params = { 'n_estimators': list(range(50, 251, 25)) }

# create gridsearchCV + fit data 
grid_search = GridSearchCV(ada, params, cv=5)
grid_search.fit(features, labels) 

# get cv scores 
cv_scores = cross_val_score(grid_search, features, labels, cv=5)

# print accuracy 
print('accuracy:', cv_scores.mean())

accuracy: 0.6941501976284586
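
The reweighting idea behind boosting can be shown with one round of the textbook AdaBoost update on made-up predictions. This is a sketch of the classic formulation, not necessarily the exact variant that `AdaBoostClassifier` runs internally; all arrays below are hypothetical.

```python
import numpy as np

# made-up labels and the predictions of one weak learner (two mistakes)
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])

# every sample starts with the same weight
weights = np.full(len(y_true), 1 / len(y_true))

missed = y_pred != y_true
error = weights[missed].sum()              # weighted error of this learner
alpha = 0.5 * np.log((1 - error) / error)  # the learner's vote in the ensemble

# raise the weight of missed samples, lower the rest, then renormalize
new_weights = weights * np.exp(np.where(missed, alpha, -alpha))
new_weights /= new_weights.sum()

print('weighted error:', error)
print('learner weight (alpha):', alpha)
print('new sample weights:', np.round(new_weights, 4))
```

After the update the two missed samples together carry half of the total weight, so the next weak learner is pushed to get them right.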


Although AdaBoost did not perform best overall, it did perform better than the random forest classifier (about 0.69 versus 0.68). For the final model, let's use the Support Vector Machine classifier, since it achieved the highest cross-validated accuracy (about 0.75). 

### Final Model 

In [None]:
# create support vector classification 
svc = SVC() 

# create pipeline 
pipe = Pipeline(steps=[('scaler', scaler), ('svc', svc)]) 

# set parameters 
params = {
    'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid'], 
    'svc__C': list(range(50, 101, 10))
    }

# create final model 
final_model = GridSearchCV(pipe, params, cv=5) 
final_model.fit(features, labels) 

# print best params 
print('best parameters:', final_model.best_params_) 

# dump final model (the with-block closes the file handle after writing)
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(final_model, f)

best parameters: {'svc__C': 70, 'svc__kernel': 'linear'}


For the final model, the best parameters are shown above. Now, let's use the final model to classify a new record: 

In [None]:
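# new record to classify (19 values, in the same column order as features)
# note: the pipeline standardizes its input itself, so records should be given in raw feature units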
record = [ 0.05905386, 0.2982129, 0.68613149, 0.75078865, 0.87119216, 0.88615694,
  0.93600623, 0.98369184, -0.47426472, -0.57642756, -0.53115361, -0.42789774,
 -0.21907738, -0.20090532, -0.21496782, -0.2080998, 0.06692373, -2.81681183,
 -0.7117194 ]

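# load the model (the with-block closes the file handle after reading)
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

# predict returns an array with one label per record, so take the first entry
if loaded_model.predict([record])[0] == 1: 
    print('Positive for Diabetic Retinopathy')
else: 
    print('Negative for Diabetic Retinopathy')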

Positive for Diabetic Retinopathy
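
A quick way to gain confidence in the saved file is to check that the reloaded model predicts exactly what the in-memory model does. The sketch below does this round trip on synthetic data with a plain scaler-plus-SVC pipeline and a temporary file (all assumptions, chosen to keep the example self-contained); the same check should apply to `final_model` and `finalized_model.sav`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the real features/labels (assumption)
X, y = make_classification(n_samples=100, n_features=19, random_state=0)

model = Pipeline(steps=[('scaler', StandardScaler()),
                        ('svc', SVC(kernel='linear'))]).fit(X, y)

# hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo_model.sav')

with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# the restored pipeline carries its own fitted scaler, so raw features go in
same = np.array_equal(model.predict(X), restored.predict(X))
print('predictions identical after reload:', same)
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, and pickle files from untrusted sources should not be loaded at all.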
