Binary classification problem using Pima Indians Diabetes Data

Data Dictionary: (Features)
0. Number of times pregnant.
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).

In [35]:
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import pylab as pl
%matplotlib inline

### Pima Indians Diabetes Dataset

In [53]:
# Load CSV using Pandas from URL (the file has no header row, so column names are supplied)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
print(data.shape)


(768, 9)


In [40]:
# We can see plenty of 0s indicating possible missing values

data.head(20)


   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1
5     5   116    74     0     0  25.6  0.201   30      0
6     3    78    50    32    88  31.0  0.248   26      1
7    10   115     0     0     0  35.3  0.134   29      0
8     2   197    70    45   543  30.5  0.158   53      1
9     8   125    96     0     0   0.0  0.232   54      1


In [17]:
data.describe()


             preg        plas        pres        skin        test        mass        pedi         age       class
count       768.0       768.0       768.0       768.0       768.0       768.0       768.0       768.0       768.0
mean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578    0.471876   33.240885    0.348958
std      3.369578   31.972618   19.355807   15.952218  115.244002     7.88416    0.331329   11.760232    0.476951
min           0.0         0.0         0.0         0.0         0.0         0.0       0.078        21.0         0.0
25%           1.0        99.0        62.0         0.0         0.0        27.3     0.24375        24.0         0.0
50%           3.0       117.0        72.0        23.0        30.5        32.0      0.3725        29.0         0.0
75%           6.0      140.25        80.0        32.0      127.25        36.6     0.62625        41.0         1.0
max          17.0       199.0       122.0        99.0       846.0        67.1        2.42        81.0         1.0


In [41]:
# Count for total number of zero values

print((data[['preg','plas','pres','skin','test','mass','pedi','age']] == 0).sum())


preg    111
plas      5
pres     35
skin    227
test    374
mass     11
pedi      0
age       0
dtype: int64


### Mark missing values

In [42]:
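# Mark zero values as missing or NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
# Note: a zero in 'preg' is a legitimate value (no pregnancies), but it is marked here as well
data[['preg','plas','pres','skin','test','mass','pedi','age']] = data[['preg','plas','pres','skin','test','mass','pedi','age']].replace(0, np.nan)
# Count the number of NaN values in each column
print(data.isnull().sum())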


preg     111
plas       5
pres      35
skin     227
test     374
mass      11
pedi       0
age        0
class      0
dtype: int64
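
A zero is only implausible for some of these measurements. The sketch below restricts the replacement to a chosen subset of columns. The toy rows and the `zero_invalid` list are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (made up for illustration), reusing the notebook's column names
toy = pd.DataFrame({
    'preg': [6, 0, 8],
    'plas': [148, 0, 183],
    'mass': [33.6, 26.6, 0.0],
    'class': [1, 0, 1],
})

# Hypothetical choice: columns where a zero cannot be a real measurement
zero_invalid = ['plas', 'mass']

# Replace zeros with NaN only in those columns; 'preg' and 'class' are untouched
marked = toy.copy()
marked[zero_invalid] = marked[zero_invalid].replace(0, np.nan)

missing_counts = marked.isnull().sum()
print(missing_counts)
```

With this choice the zero in `preg` stays a valid count rather than becoming a missing value.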


In [43]:
# Let's take a look at the NaNs in the first 20 rows
print(data.head(20))


    preg   plas  pres  skin   test  mass   pedi  age  class
0    6.0  148.0  72.0  35.0    NaN  33.6  0.627   50      1
1    1.0   85.0  66.0  29.0    NaN  26.6  0.351   31      0
2    8.0  183.0  64.0   NaN    NaN  23.3  0.672   32      1
3    1.0   89.0  66.0  23.0   94.0  28.1  0.167   21      0
4    NaN  137.0  40.0  35.0  168.0  43.1  2.288   33      1
5    5.0  116.0  74.0   NaN    NaN  25.6  0.201   30      0
6    3.0   78.0  50.0  32.0   88.0  31.0  0.248   26      1
7   10.0  115.0   NaN   NaN    NaN  35.3  0.134   29      0
8    2.0  197.0  70.0  45.0  543.0  30.5  0.158   53      1
9    8.0  125.0  96.0   NaN    NaN   NaN  0.232   54      1
10   4.0  110.0  92.0   NaN    NaN  37.6  0.191   30      0
11  10.0  168.0  74.0   NaN    NaN  38.0  0.537   34      1
12  10.0  139.0  80.0   NaN    NaN  27.1  1.441   57      0
13   1.0  189.0  60.0  23.0  846.0  30.1  0.398   59      1
14   5.0  166.0  72.0  19.0  175.0  25.8  0.587   51      1
15   7.0  100.0   NaN   NaN    NaN  30.0

Missing values can cause errors in some machine learning algorithms.

In the next section, we try to evaluate a Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values.

In [44]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score


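# Split dataset into inputs (X) and outputs (y)
values = data.values
X = values[:,0:8]
y = values[:,8]
# Evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
# random_state is omitted: it has no effect without shuffle=True and recent scikit-learn rejects it
kfold = KFold(n_splits=3)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())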

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

It doesn't work!
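
The failure can be reproduced in isolation and anticipated with a quick check. This is a minimal sketch on a made-up array (`X_toy` and `y_toy` are illustrative names, not from the notebook). The exact error text depends on the installed scikit-learn version.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up feature matrix with one missing entry
X_toy = np.array([[1.0, 2.0], [np.nan, 1.0], [3.0, 4.0], [4.0, 3.5]])
y_toy = np.array([0, 0, 1, 1])

# Cheap check before fitting: does the input contain any NaN?
has_nan = bool(np.isnan(X_toy).any())
print("contains NaN:", has_nan)

# LDA validates its input and raises ValueError when NaN is present
try:
    LinearDiscriminantAnalysis().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("fit failed:", err)
```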

### Remove rows with missing values

In [45]:
# Remove rows with missing values using dropna() in Pandas
# Mark zero values as missing or NaN
data[['preg','plas','pres','skin','test','mass','pedi','age']] = data[['preg','plas','pres','skin','test','mass','pedi','age']].replace(0, np.nan)
# Drop every row that contains at least one NaN
data.dropna(inplace=True)
# Summarize the number of rows and columns in the dataset
print(data.shape)


(336, 9)


In [47]:
# Try LDA algorithm again after removing missing values
# Split dataset into inputs and outputs
values = data.values
X = values[:,0:8]
y = values[:,8]
# Evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3)  # random_state dropped: it has no effect without shuffle=True
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())


0.779761904762


This gives an accuracy of ~78%, which is not bad. However, dropping rows with missing values can be too limiting for some predictive modeling problems: here it reduced the dataset from 768 rows to 336. An alternative is to impute the missing values.
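
Before committing to row deletion, it can help to measure how much data it discards (in this notebook, 336 of 768 rows survive, roughly 44%). The sketch below does this on a made-up frame, and the names `toy`, `complete` and `retained` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame (made up for illustration) with missing values already marked as NaN
toy = pd.DataFrame({
    'plas': [148.0, 85.0, np.nan, 89.0],
    'skin': [35.0, np.nan, np.nan, 23.0],
    'class': [1, 0, 1, 0],
})

# dropna() without inplace=True returns a copy, leaving the original intact
complete = toy.dropna()

# Fraction of rows that survive deletion
retained = len(complete) / len(toy)
print(f"kept {len(complete)} of {len(toy)} rows ({retained:.0%})")
```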

### Imputation for missing values



Methods for imputing missing values:
    1) A constant value that has meaning within the domain, such as 0, distinct from all other values.
    2) A value from another randomly selected record.
    3) A mean, median, or mode value for the column.
    4) A value estimated by another predictive model.
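
The first three strategies listed above can be sketched with plain pandas on a single made-up column. The sentinel `-1` and the variable names are assumptions for illustration. The fourth strategy (a predictive model) is not shown.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries
col = pd.Series([2.0, np.nan, 4.0, 4.0, np.nan, 10.0])

# 1) Constant: a sentinel distinct from all observed values (hypothetical choice of -1)
filled_constant = col.fillna(-1)

# 2) Random record: copy values drawn at random from the observed entries
n_missing = int(col.isna().sum())
donors = col.dropna().sample(n=n_missing, replace=True, random_state=0)
filled_random = col.copy()
filled_random[col.isna()] = donors.to_numpy()

# 3) Mean, median, or mode of the observed values
filled_mean = col.fillna(col.mean())
filled_median = col.fillna(col.median())
filled_mode = col.fillna(col.mode().iloc[0])

print(filled_mean.tolist())
```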
    

In [48]:
# Use fillna() to replace missing values with each column's mean
# (if rows with NaN were already dropped above, reload and re-mark the data first;
# otherwise there is nothing left to fill)
data.fillna(data.mean(), inplace=True)
# Count the number of NaN values in each column
print(data.isnull().sum())


preg     0
plas     0
pres     0
skin     0
test     0
mass     0
pedi     0
age      0
class    0
dtype: int64


In [50]:
# The example below uses scikit-learn's imputer class to replace missing values with the mean of each column,
# then prints the number of NaN values in the transformed matrix

In [56]:
# Imputer was removed from sklearn.preprocessing; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer

# Load CSV using Pandas from URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)


# Mark zero values as missing or NaN
data[['preg','plas','pres','skin','test','mass','pedi','age']] = data[['preg','plas','pres','skin','test','mass','pedi','age']].replace(0, np.nan)


# Fill missing values with mean column values (the default strategy is 'mean')
values = data.values
imputer = SimpleImputer()
transformed_values = imputer.fit_transform(values)
# Count the total number of NaN values in the transformed matrix
print(np.isnan(transformed_values).sum())


0


In [57]:
from sklearn.impute import SimpleImputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score


In [58]:
# The example below shows the LDA algorithm trained on the imputed dataset
# Split dataset into inputs and outputs
values = data.values
X = values[:,0:8]
y = values[:,8]
# Fill missing values with mean column values
imputer = SimpleImputer()
transformed_X = imputer.fit_transform(X)
# Evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
# random_state is omitted: it has no effect without shuffle=True
kfold = KFold(n_splits=3)
result = cross_val_score(model, transformed_X, y, cv=kfold, scoring='accuracy')
print(result.mean())


0.765625


LDA on the imputed dataset gives an accuracy of ~77%.
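
One caveat with imputing before cross-validation is that the column means are computed from all rows, including those later used for testing. A common alternative, sketched here on synthetic data (`X_demo` and `y_demo` are made up and are not the diabetes dataset), is to put the imputer and the model in a pipeline so the means are re-estimated on each training fold.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries to simulate missing values
mask = rng.random(X_demo.shape) < 0.1
X_demo[mask] = np.nan

# The imputer is fitted inside each fold, on that fold's training rows only
pipe = make_pipeline(SimpleImputer(strategy='mean'), LinearDiscriminantAnalysis())
scores = cross_val_score(pipe, X_demo, y_demo, cv=KFold(n_splits=3), scoring='accuracy')
print(scores.mean())
```

The same pipeline object could be passed the notebook's `X` and `y` in place of the synthetic arrays.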

Some algorithms can be made robust to missing values, for example k-nearest neighbors (KNN) and classification and regression trees (CART), although not every implementation supports them.