In [1]:
import pandas as pd
from collections import Counter
from sklearn.model_selection import GridSearchCV as gscv

#GridSearchCV returned scores even lower than the untuned classifiers, so that
#step is dropped from the public output.



This is a simple exercise in building a classification model that predicts diabetes in a set of Pima Indians from features such as blood pressure, skin thickness, and glucose levels (details below).

The dataset covers Pima Indians of central Arizona and was sourced from the UCI Machine Learning Repository. As of 2014, the majority of the population lives in the federally recognized Gila River Indian Community (GRIC). In historic times a large number of Akimel O'Odham migrated north to occupy the banks of the Salt River, where they formed the Salt River Pima-Maricopa Indian Community (SRPMIC). Both tribes are confederations of two distinct ethnicities, which include the Maricopa.

Today the GRIC is a sovereign tribe residing on more than 550,000 acres (2,200 km²) of land in central Arizona. The community is divided into seven districts (similar to states) with a council representing individual subgovernments. It is self-governed by an elected Governor (currently Gregory Mendoza), Lieutenant Governor (currently Stephen Roe-Lewis) and an 18-member Tribal Council. The council is elected by district, with the number of electees determined by district population. There are more than 19,000 enrolled members overall.

The dataset consists of 9 columns of data:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable of diabetic outcome (0 or 1)
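
Because the class variable is binary, it helps to know how many rows fall in each class before reading any classifier scores. The snippet below is a minimal sketch on a hypothetical toy column (the `toy` frame and its values are made up for illustration and are not taken from diabetes.csv); it assumes the usual coding for this dataset, where 1 means diabetic and 0 means non-diabetic.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the Outcome column; the real values live in diabetes.csv.
toy = pd.DataFrame({'Outcome': [1, 0, 1, 0, 1, 0, 0, 0]})

# Count the rows in each class (assumed coding: 0 = non-diabetic, 1 = diabetic).
counts = Counter(toy['Outcome'])
total = float(sum(counts.values()))

for label in sorted(counts):
    share = counts[label] / total
    print('class {}: {} rows ({:.1%})'.format(label, counts[label], share))
```

On the real data the same pattern would be applied to df['Outcome'] once the csv has been loaded.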



In [2]:
#loading in the csv as a dataframe

df = pd.read_csv('diabetes.csv')

In [None]:
#data source: UCI Machine Learning Repository
#https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes

In [3]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
from scipy.stats import pearsonr as pr

#As we can see, no individual feature is strongly correlated with a diabetic outcome
#(Glucose is the highest, at about 0.47), using pearson's r as a measurement of correlation.
#pr() returns (r, p-value); only r is printed below.

cor = pr(df['Glucose'], df['Outcome'])
cor2 = pr(df['Pregnancies'], df['Outcome'])
cor3 = pr(df['BloodPressure'], df['Outcome'])
cor4 = pr(df['SkinThickness'], df['Outcome'])
cor5 = pr(df['Insulin'], df['Outcome'])
cor6 = pr(df['BMI'], df['Outcome'])
cor7 = pr(df['DiabetesPedigreeFunction'], df['Outcome'])
cor8 = pr(df['Age'], df['Outcome'])

#values are printed in this order: Glucose, Pregnancies, BloodPressure, SkinThickness,
#Insulin, BMI, DiabetesPedigreeFunction, Age
print('Correlations: ')
print("")
for c in (cor, cor2, cor3, cor4, cor5, cor6, cor7, cor8):
    print(c[0])

Correlations: 

0.466581398307
0.221898153034
0.0650683595503
0.0747522319183
0.130547954884
0.292694662644
0.173844065653
0.238355983027


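As the figures above show, no individual feature is strongly correlated with a diabetic outcome. Glucose has the highest coefficient (r ≈ 0.47); every other feature is below 0.3. These are correlation coefficients only: the p-values that pearsonr also returns are not printed, so the figures by themselves say nothing about statistical significance.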
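
To report the coefficient and its p-value together, the per-feature calls can be folded into one loop. The snippet below is a sketch on hypothetical synthetic data: the `toy` frame, its two feature columns and the numbers used to generate them are assumptions made for illustration, not values from diabetes.csv.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical toy data: one feature that depends on the outcome, one that does not.
rng = np.random.RandomState(0)
outcome = rng.randint(0, 2, size=200)
toy = pd.DataFrame({
    'Glucose': 100 + 30 * outcome + rng.normal(0, 25, size=200),
    'BloodPressure': rng.normal(70, 12, size=200),
    'Outcome': outcome,
})

# pearsonr returns the coefficient r and a two-sided p-value.
results = {}
for name in ['Glucose', 'BloodPressure']:
    r, p = pearsonr(toy[name], toy['Outcome'])
    results[name] = (r, p)
    print('{}: r = {:.3f}, p = {:.3g}'.format(name, r, p))
```

With the real dataframe, the list of names would be the eight feature columns and `toy` would be replaced by df.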

However, when we combine all the data features and use them together to predict the outcome, the results are a lot more promising. 

In [5]:
#loading in the big guns
#we are attempting to build a classifier that takes all features as input and returns the best
#F1 score, so several classifiers are imported for comparison

from sklearn.svm import SVC
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.naive_bayes import GaussianNB as NB
from sklearn.neighbors import KNeighborsClassifier as KNC
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression as reg

#the train-test splitter

from sklearn.model_selection import train_test_split


In [6]:
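#build the feature matrix (f) and label vector (l), then split them into train and test sets
#.values replaces .as_matrix(), which was removed in pandas 1.0
f = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
        'BMI', 'DiabetesPedigreeFunction', 'Age']].values

l = df['Outcome'].values

#no random_state is set, so each run produces a different split (25% test by default)
f_train, f_test, l_train, l_test = train_test_split(f, l)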
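
If a repeatable split is wanted without saving the arrays to disk, train_test_split accepts a random_state seed and a stratify argument that keeps the class proportions equal in both parts. The snippet below is a sketch on hypothetical toy arrays; the array sizes, the seed value 42 and the names X and y are illustrative assumptions, not the notebook's own settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays: 100 rows, 8 features, 35 positives and 65 negatives.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 8))
y = np.array([1] * 35 + [0] * 65)

# A fixed seed makes the split repeatable; stratify=y preserves the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Calling again with the same seed returns exactly the same partition.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(np.array_equal(X_test, X_test2))
print('positives in test set: {} of {}'.format(int(y_test.sum()), len(y_test)))
```

A seed chosen in advance also avoids picking the split after looking at the scores.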

In [7]:
#load the saved copy of the train-test split that yielded the highest scores.
#the block in triple quotes was run once to write that split to disk.
#the file handle is named fh so that it does not overwrite the feature matrix f.
import pickle
"""
with open('f_train.pkl', 'wb') as fh:
    pickle.dump(f_train, fh)
    
with open('f_test.pkl', 'wb') as fh:
    pickle.dump(f_test, fh)
    
with open('l_train.pkl', 'wb') as fh:
    pickle.dump(l_train, fh)
    
with open('l_test.pkl', 'wb') as fh:
    pickle.dump(l_test, fh)
"""

with open('f_train.pkl', 'rb') as fh:
    f_train = pickle.load(fh)
    
with open('f_test.pkl', 'rb') as fh:
    f_test = pickle.load(fh)
    
with open('l_train.pkl', 'rb') as fh:
    l_train = pickle.load(fh)
    
with open('l_test.pkl', 'rb') as fh:
    l_test = pickle.load(fh)
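
Scores from one hand-picked split can look better than a classifier would do on new data. One hedged alternative is k-fold cross-validation, which averages the F1 score over several splits. The snippet below is a sketch on synthetic data from make_classification; the data, the choice of two classifiers, the five folds and the seeds are all assumptions for illustration and do not reproduce the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic data with 8 features and about 35% positives.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           weights=[0.65, 0.35], random_state=0)

# Five stratified folds; every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = [
    ('Gaussian Naive Bayes', GaussianNB()),
    ('Logistic Regression', LogisticRegression(max_iter=1000)),
]

scores = {}
for name, clf in candidates:
    # scoring='f1' reports the F1 score of the positive class (label 1).
    fold_f1 = cross_val_score(clf, X, y, cv=cv, scoring='f1')
    scores[name] = fold_f1
    print('{}: mean F1 = {:.3f} (std {:.3f})'.format(
        name, fold_f1.mean(), fold_f1.std()))
```

The spread across folds gives a rough sense of how much a single split can move the score.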

In [8]:
#function to fit and compare a small set of classifiers.

def try_clf(f_train, f_test, l_train, l_test):
    #(title, classifier) pairs, evaluated in this order
    classifiers = [
        ('SVC results', SVC(kernel='linear')),
        ('QDA results', QDA()),
        ('RFC results', RFC()),
        ('Gaussian Naive Bayes results', NB()),
        ('KNC results', KNC()),
        ('Logistic Regression results', reg()),
    ]

    for title, clf in classifiers:
        clf.fit(f_train, l_train)
        pred_lst = clf.predict(f_test)
        print(title)
        print("")
        #target_names follow the sorted class labels: 0 = non-diabetic, 1 = diabetic
        print(classification_report(l_test, pred_lst,
                                    target_names=["non-diabetic", "diabetic"]))
        print("")
        print("")
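
classification_report assigns target_names to the class labels in sorted order, so with an outcome coded 0/1 the first name always describes class 0. The snippet below is a small sketch on hypothetical toy labels (y_true and y_pred are made up) that makes the mapping visible through the support counts; it relies on the output_dict option, which is available in scikit-learn 0.20 and later.

```python
from sklearn.metrics import classification_report

# Hypothetical toy labels: three class-0 rows and two class-1 rows.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]

# The first name is attached to label 0, the second to label 1.
report = classification_report(
    y_true, y_pred,
    target_names=['non-diabetic', 'diabetic'],
    output_dict=True)

print('non-diabetic support: {}'.format(report['non-diabetic']['support']))
print('diabetic support: {}'.format(report['diabetic']['support']))
```

Comparing the support column with the known class counts is a quick check that the names are in the right order.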

In [9]:
try_clf(f_train, f_test, l_train, l_test)

SVC results

              precision    recall  f1-score   support

non-diabetic       0.82      0.87      0.85       123
    diabetic       0.74      0.67      0.70        69

 avg / total       0.79      0.80      0.79       192



QDA results

              precision    recall  f1-score   support

non-diabetic       0.82      0.84      0.83       123
    diabetic       0.70      0.68      0.69        69

 avg / total       0.78      0.78      0.78       192



RFC results

              precision    recall  f1-score   support

non-diabetic       0.79      0.87      0.83       123
    diabetic       0.71      0.58      0.64        69

 avg / total       0.76      0.77      0.76       192



Gaussian Naive Bayes results

              precision    recall  f1-score   support

non-diabetic       0.83      0.81      0.82       123
    diabetic       0.68      0.70      0.69        69

 avg / total       0.77      0.77      0.77       192



KNC results

              precision    recall 