# Results

Voting classifier (soft voting) test accuracy: 0.841 (about 84.1%)

# The Dataset

This notebook was made with reference to the following two sources.


2015 codebook: https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

Data cleaning reference: https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset-notebook

The purpose of this analysis is to clean the BRFSS data into a usable format for machine learning. The dataset is a collection of answers from telephone interviews conducted by the CDC (Centers for Disease Control and Prevention) in the US in 2015. The original dataset has 330 columns, but we select only a subset that research in the field has identified as **important risk factors**.

The following risk factors are considered important; they are listed in no particular order:


1. blood pressure (high)
2. cholesterol (high)
3. smoking
4. diabetes
5. obesity
6. age
7. sex
8. race
9. diet
10. exercise
11. alcohol consumption
12. BMI
13. Household Income
14. Marital Status
15. Sleep
16. Time since last checkup
17. Education
18. Health care coverage
19. Mental Health

In order to pick out the above listed risk factors, I will consult the 2015 codebook to get a better understanding of the feature names.
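As a rough sketch of that lookup step, the columns chosen later in this notebook can be grouped by the risk factor they cover. The grouping labels, the dictionary name `RISK_FACTOR_COLUMNS`, and both helper functions are hypothetical; only the column names themselves come from the selection cell below.

```python
import pandas as pd

# Hypothetical grouping of the BRFSS 2015 columns used in this notebook.
# The group labels are an illustration, not codebook terminology.
RISK_FACTOR_COLUMNS = {
    "diabetes (target)": ["DIABETE3"],
    "blood pressure": ["_RFHYPE5"],
    "cholesterol": ["TOLDHI2", "_CHOLCHK"],
    "BMI": ["_BMI5"],
    "smoking": ["SMOKE100"],
    "other chronic conditions": ["CVDSTRK3", "_MICHD"],
    "exercise": ["_TOTINDA"],
    "diet": ["_FRTLT1", "_VEGLT1"],
    "alcohol consumption": ["_RFDRHV5"],
    "health care coverage": ["HLTHPLN1", "MEDCOST"],
    "general health": ["GENHLTH", "MENTHLTH", "PHYSHLTH", "DIFFWALK"],
    "demographics": ["SEX", "_AGEG5YR", "EDUCA", "INCOME2"],
}


def selected_columns(groups=RISK_FACTOR_COLUMNS):
    """Flatten the grouping into a single list of column names."""
    return [col for cols in groups.values() for col in cols]


def select_risk_factors(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `raw` restricted to the selected columns."""
    cols = selected_columns()
    missing = [c for c in cols if c not in raw.columns]
    if missing:
        raise KeyError(f"Columns not found in the raw data: {missing}")
    return raw[cols].copy()
```

Keeping the mapping in one place makes it easier to check each column against the codebook and to spot a risk factor from the list that has no column yet.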

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

plt.rcParams['font.size'] = 14
plt.rcParams['figure.figsize'] = (15, 7)

In [None]:
# Load in the datasets
df = pd.read_csv('/content/drive/MyDrive/Datasets/Diabetes 2015/2015.csv')

In [None]:
df.shape

(441456, 330)

In [None]:
# Select specific columns 
df = df[['DIABETE3',
        '_RFHYPE5',  
        'TOLDHI2', '_CHOLCHK', 
        '_BMI5', 
        'SMOKE100', 
        'CVDSTRK3', '_MICHD', 
        '_TOTINDA', 
        '_FRTLT1', '_VEGLT1', 
        '_RFDRHV5', 
        'HLTHPLN1', 'MEDCOST', 
        'GENHLTH', 'MENTHLTH', 'PHYSHLTH', 'DIFFWALK', 
        'SEX', '_AGEG5YR', 'EDUCA', 'INCOME2']]

In [None]:
df

Unnamed: 0,DIABETE3,_RFHYPE5,TOLDHI2,_CHOLCHK,_BMI5,SMOKE100,CVDSTRK3,_MICHD,_TOTINDA,_FRTLT1,_VEGLT1,_RFDRHV5,HLTHPLN1,MEDCOST,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,SEX,_AGEG5YR,EDUCA,INCOME2
0,3.0,2.0,1.0,1.0,4018.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,2.0,5.0,18.0,15.0,1.0,2.0,9.0,4.0,3.0
1,3.0,1.0,2.0,2.0,2509.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,1.0,3.0,88.0,88.0,2.0,2.0,7.0,6.0,1.0
2,3.0,1.0,1.0,1.0,2204.0,,1.0,,9.0,9.0,9.0,9.0,1.0,2.0,4.0,88.0,15.0,,2.0,11.0,4.0,99.0
3,3.0,2.0,1.0,1.0,2819.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,5.0,30.0,30.0,1.0,2.0,9.0,4.0,8.0
4,3.0,1.0,2.0,1.0,2437.0,2.0,2.0,2.0,2.0,9.0,1.0,1.0,1.0,2.0,5.0,88.0,20.0,2.0,2.0,9.0,5.0,77.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
441451,1.0,2.0,1.0,1.0,1842.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,4.0,88.0,88.0,1.0,2.0,11.0,2.0,4.0
441452,3.0,1.0,2.0,1.0,2834.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,88.0,88.0,2.0,2.0,2.0,5.0,2.0
441453,3.0,2.0,1.0,1.0,4110.0,1.0,2.0,2.0,9.0,9.0,9.0,1.0,1.0,2.0,4.0,20.0,88.0,2.0,2.0,11.0,4.0,5.0
441454,3.0,2.0,2.0,1.0,2315.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,3.0,88.0,88.0,2.0,1.0,7.0,5.0,1.0


# Clean the dataset

In [None]:
# Drop Missing Values
df = df.dropna()
df.shape

(343606, 22)

## Modify and clean the dataset to be more suitable to ML algorithms

In [None]:
# DIABETE3
# Make this ordinal: 0 is no diabetes or only during pregnancy, 1 is pre-diabetes or borderline diabetes, 2 is diabetes
# Remove all 7 (don't know)
# Remove all 9 (refused)
df['DIABETE3'] = df['DIABETE3'].replace({2:0, 3:0, 1:2, 4:1})
df = df[df.DIABETE3 != 7]
df = df[df.DIABETE3 != 9]
df.DIABETE3.unique()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


array([0., 2., 1.])
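The `SettingWithCopyWarning` shown above typically appears when a column is assigned on a DataFrame that pandas treats as a filtered slice of another one. One possible way to avoid it, and to cut down on the repeated recode-then-filter cells that follow, is a small helper that filters first and then works on an explicit copy. The function name `recode_column` and its parameters are hypothetical and not part of the original notebook.

```python
import pandas as pd


def recode_column(frame, column, mapping=None, drop_values=()):
    """Drop rows holding non-answer codes, then recode the remaining values.

    Filtering happens before recoding so that a recoded value can never
    collide with a code that is about to be dropped.
    """
    keep = ~frame[column].isin(list(drop_values))
    out = frame[keep].copy()  # explicit copy, so the assignment below is unambiguous
    if mapping:
        out[column] = out[column].replace(mapping)
    return out


# Toy frame with the DIABETE3 codes handled in the cell above.
toy = pd.DataFrame({"DIABETE3": [1.0, 2.0, 3.0, 4.0, 7.0, 9.0]})
cleaned = recode_column(
    toy,
    "DIABETE3",
    mapping={2: 0, 3: 0, 1: 2, 4: 1},
    drop_values=(7, 9),
)
print(cleaned["DIABETE3"].tolist())
```

The same call pattern would cover the binary columns below, for example `mapping={2: 0}` with `drop_values=(7, 9)`.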

In [None]:
#1 _RFHYPE5
# Change 1 to 0 so it represents no high blood pressure, and 2 to 1 so it represents high blood pressure
# Remove all 9 (don't know, refused, or missing)
df['_RFHYPE5'] = df['_RFHYPE5'].replace({1:0, 2:1})
df = df[df._RFHYPE5 != 9]
df._RFHYPE5.unique()

array([1., 0.])

In [None]:
#2 TOLDHI2
# Change 2 to 0 because it is No
# Remove all 7 (don't know)
# Remove all 9 (refused)
df['TOLDHI2'] = df['TOLDHI2'].replace({2:0})
df = df[df.TOLDHI2 != 7]
df = df[df.TOLDHI2 != 9]
df.TOLDHI2.unique()

array([1., 0.])

In [None]:
#3 _CHOLCHK
# Change 3 to 0 and 2 to 0 for Not checked cholesterol in past 5 years
# Remove 9
df['_CHOLCHK'] = df['_CHOLCHK'].replace({3:0,2:0})
df = df[df._CHOLCHK != 9]
df._CHOLCHK.unique()

array([1., 0.])

In [None]:
#4 _BMI5 (stored as BMI * 100, so 4018 is really 40.18; divide by 100 and round to the nearest whole number)
df['_BMI5'] = df['_BMI5'].div(100).round(0)
df._BMI5.unique()

array([40., 25., 28., 24., 27., 30., 26., 23., 34., 33., 21., 22., 31.,
       38., 20., 19., 32., 46., 41., 37., 36., 29., 35., 18., 54., 45.,
       39., 47., 43., 55., 49., 42., 17., 16., 48., 44., 50., 59., 15.,
       52., 53., 57., 51., 14., 58., 63., 61., 56., 60., 74., 62., 64.,
       13., 66., 73., 65., 68., 85., 71., 84., 67., 70., 82., 79., 92.,
       72., 88., 96., 81., 12., 77., 95., 75., 91., 69., 76., 87., 89.,
       83., 98., 86., 80., 90., 78., 97.])

In [None]:
#5 SMOKE100
# Change 2 to 0 because it is No
# Remove all 7 (don't know)
# Remove all 9 (refused)
df['SMOKE100'] = df['SMOKE100'].replace({2:0})
df = df[df.SMOKE100 != 7]
df = df[df.SMOKE100 != 9]
df.SMOKE100.unique()

array([1., 0.])

In [None]:
#6 CVDSTRK3
# Change 2 to 0 because it is No
# Remove all 7 (don't know)
# Remove all 9 (refused)
df['CVDSTRK3'] = df['CVDSTRK3'].replace({2:0})
df = df[df.CVDSTRK3 != 7]
df = df[df.CVDSTRK3 != 9]
df.CVDSTRK3.unique()

array([0., 1.])

In [None]:
#7 _MICHD
#Change 2 to 0 because this means did not have MI or CHD
df['_MICHD'] = df['_MICHD'].replace({2: 0})
df._MICHD.unique()

array([0., 1.])

In [None]:
#8 _TOTINDA
# 1 for physical activity
# change 2 to 0 for no physical activity
# Remove all 9 (don't know/refused)
df['_TOTINDA'] = df['_TOTINDA'].replace({2:0})
df = df[df._TOTINDA != 9]
df._TOTINDA.unique()

array([0., 1.])

In [None]:
#9 _FRTLT1
# Change 2 to 0, meaning fruit consumed less than once per day. 1 means fruit consumed one or more times per day
# Remove all 9 (don't know, refused, or missing)
df['_FRTLT1'] = df['_FRTLT1'].replace({2:0})
df = df[df._FRTLT1 != 9]
df._FRTLT1.unique()

array([0., 1.])

In [None]:
#10 _VEGLT1
# Change 2 to 0, meaning vegetables consumed less than once per day. 1 means vegetables consumed one or more times per day
# Remove all 9 (don't know, refused, or missing)
df['_VEGLT1'] = df['_VEGLT1'].replace({2:0})
df = df[df._VEGLT1 != 9]
df._VEGLT1.unique()

array([1., 0.])

In [None]:
#11 _RFDRHV5
# Change 1 to 0 (1 was no for heavy drinking). Change 2 to 1 (2 was yes for heavy drinking)
# Remove all 9 (don't know, refused, or missing)
df['_RFDRHV5'] = df['_RFDRHV5'].replace({1:0, 2:1})
df = df[df._RFDRHV5 != 9]
df._RFDRHV5.unique()

array([0., 1.])

In [None]:
df._RFDRHV5.value_counts()

0.0    282758
1.0     15879
Name: _RFDRHV5, dtype: int64

In [None]:
#12 HLTHPLN1
# 1 is yes, change 2 to 0 because it is No health care access
# remove 7 and 9 for don't know or refused
df['HLTHPLN1'] = df['HLTHPLN1'].replace({2:0})
df = df[df.HLTHPLN1 != 7]
df = df[df.HLTHPLN1 != 9]
df.HLTHPLN1.unique()

array([1., 0.])

In [None]:
#13 MEDCOST
# Change 2 to 0 for no, 1 is already yes
# remove 7 for don't know and 9 for refused
df['MEDCOST'] = df['MEDCOST'].replace({2:0})
df = df[df.MEDCOST != 7]
df = df[df.MEDCOST != 9]
df.MEDCOST.unique()

array([0., 1.])

In [None]:
#14 GENHLTH
# This is an ordinal variable that I want to keep (1 is Excellent -> 5 is Poor)
# Remove 7 and 9 for don't know and refused
df = df[df.GENHLTH != 7]
df = df[df.GENHLTH != 9]
df.GENHLTH.unique()

array([5., 3., 2., 4., 1.])

In [None]:
#15 MENTHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad mental health days)
# remove 77 (don't know/not sure) and 99 (refused)
df['MENTHLTH'] = df['MENTHLTH'].replace({88:0})
df = df[df.MENTHLTH != 77]
df = df[df.MENTHLTH != 99]
df.MENTHLTH.unique()

array([18.,  0., 30.,  3.,  5., 15., 10.,  6., 20.,  2., 25.,  1., 29.,
        4.,  7.,  8., 21., 14., 26.,  9., 16., 28., 11., 12., 24., 17.,
       13., 23., 27., 19., 22.])

In [None]:
#16 PHYSHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad physical health days)
# remove 77 (don't know/not sure) and 99 (refused)
df['PHYSHLTH'] = df['PHYSHLTH'].replace({88:0})
df = df[df.PHYSHLTH != 77]
df = df[df.PHYSHLTH != 99]
df.PHYSHLTH.unique()

array([15.,  0., 30.,  2., 14., 28.,  7., 20.,  3., 10.,  1.,  5., 17.,
        4., 19.,  6., 21., 12.,  8., 25., 27., 22., 29., 24.,  9., 16.,
       18., 23., 13., 26., 11.])

In [None]:
#17 DIFFWALK
# change 2 to 0 for no. 1 is already yes
# remove 7 (don't know/not sure) and 9 (refused)
df['DIFFWALK'] = df['DIFFWALK'].replace({2:0})
df = df[df.DIFFWALK != 7]
df = df[df.DIFFWALK != 9]
df.DIFFWALK.unique()

array([1., 0.])

In [None]:
#18 SEX
# in other words - is respondent male (somewhat arbitrarily chose this change because men are at higher risk for heart disease)
# change 2 to 0 (female as 0). Male is 1
df['SEX'] = df['SEX'].replace({2:0})
df.SEX.unique()

array([0., 1.])

In [None]:
#19 _AGEG5YR
# already ordinal. 1 is 18-24, all the way up to 13, which is 80 and older. 5 year increments.
# remove 14 because it is don't know or missing
df = df[df._AGEG5YR != 14]
df._AGEG5YR.unique()

array([ 9.,  7., 11., 10., 13.,  8.,  4.,  6.,  2., 12.,  5.,  1.,  3.])

In [None]:
#20 EDUCA
# This is already an ordinal variable with 1 being never attended school or kindergarten only up to 6 being college 4 years or more
# Scale here is 1-6
# Remove 9 for refused:
df = df[df.EDUCA != 9]
df.EDUCA.unique()

array([4., 6., 3., 5., 2., 1.])

In [None]:
#21 INCOME2
# Variable is already ordinal with 1 being less than $10,000 all the way up to 8 being $75,000 or more
# Remove 77 and 99 for don't know and refused
df = df[df.INCOME2 != 77]
df = df[df.INCOME2 != 99]
df.INCOME2.unique()

array([3., 1., 8., 6., 4., 7., 2., 5.])

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 253680 entries, 0 to 441455
Data columns (total 22 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   DIABETE3  253680 non-null  float64
 1   _RFHYPE5  253680 non-null  float64
 2   TOLDHI2   253680 non-null  float64
 3   _CHOLCHK  253680 non-null  float64
 4   _BMI5     253680 non-null  float64
 5   SMOKE100  253680 non-null  float64
 6   CVDSTRK3  253680 non-null  float64
 7   _MICHD    253680 non-null  float64
 8   _TOTINDA  253680 non-null  float64
 9   _FRTLT1   253680 non-null  float64
 10  _VEGLT1   253680 non-null  float64
 11  _RFDRHV5  253680 non-null  float64
 12  HLTHPLN1  253680 non-null  float64
 13  MEDCOST   253680 non-null  float64
 14  GENHLTH   253680 non-null  float64
 15  MENTHLTH  253680 non-null  float64
 16  PHYSHLTH  253680 non-null  float64
 17  DIFFWALK  253680 non-null  float64
 18  SEX       253680 non-null  float64
 19  _AGEG5YR  253680 non-null  float64
 20  EDUC

# Data Visualization

In [None]:
# Check whether features are categorical or numerical
df.nunique().sort_values()

_VEGLT1      2
_RFHYPE5     2
TOLDHI2      2
_CHOLCHK     2
SEX          2
SMOKE100     2
CVDSTRK3     2
_MICHD       2
_TOTINDA     2
_FRTLT1      2
DIFFWALK     2
_RFDRHV5     2
HLTHPLN1     2
MEDCOST      2
DIABETE3     3
GENHLTH      5
EDUCA        6
INCOME2      8
_AGEG5YR    13
MENTHLTH    31
PHYSHLTH    31
_BMI5       84
dtype: int64
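The unique-value counts above suggest three rough groups of features: binary flags, ordinal scales, and near-continuous measures. A minimal sketch of that grouping follows; the function name `group_by_cardinality` and both thresholds are assumptions chosen to match the counts printed above (13 is the number of `_AGEG5YR` levels), not rules from the original notebook.

```python
import pandas as pd


def group_by_cardinality(frame, binary_max=2, ordinal_max=13):
    """Split column names into groups based on their number of unique values."""
    counts = frame.nunique()
    is_binary = counts <= binary_max
    is_ordinal = (counts > binary_max) & (counts <= ordinal_max)
    return {
        "binary": counts[is_binary].index.tolist(),
        "ordinal": counts[is_ordinal].index.tolist(),
        "continuous": counts[counts > ordinal_max].index.tolist(),
    }


# Toy frame with one column of each kind (values are made up).
toy = pd.DataFrame({
    "SEX": [0.0, 1.0] * 20,
    "GENHLTH": [1.0, 2.0, 3.0, 4.0, 5.0] * 8,
    "_BMI5": [float(v) for v in range(15, 55)],
})
groups = group_by_cardinality(toy)
print(groups)
```

Cardinality alone cannot tell an ordinal scale from a nominal one, and the target `DIABETE3` would land in the ordinal group, so the result is only a starting point for deciding how to encode each column.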

# Machine Learning

## Data Processing for ML

In [None]:
# Get the dummy variables 
_RFHYPE5 = pd.get_dummies(df._RFHYPE5, drop_first=True, prefix='_RFHYPE5')
TOLDHI2 = pd.get_dummies(df.TOLDHI2, drop_first=True, prefix='TOLDHI2')
_CHOLCHK = pd.get_dummies(df._CHOLCHK, drop_first=True, prefix='_CHOLCHK')
SMOKE100 = pd.get_dummies(df.SMOKE100, drop_first=True, prefix='SMOKE100')

CVDSTRK3 = pd.get_dummies(df.CVDSTRK3, drop_first=True, prefix='CVDSTRK3')
_MICHD = pd.get_dummies(df._MICHD, drop_first=True, prefix='_MICHD')
_TOTINDA = pd.get_dummies(df._TOTINDA, drop_first=True, prefix='_TOTINDA')
_FRTLT1 = pd.get_dummies(df._FRTLT1, drop_first=True, prefix='_FRTLT1')

_VEGLT1 = pd.get_dummies(df._VEGLT1, drop_first=True, prefix='_VEGLT1')
_RFDRHV5 = pd.get_dummies(df._RFDRHV5, drop_first=True, prefix='_RFDRHV5')
HLTHPLN1 = pd.get_dummies(df.HLTHPLN1, drop_first=True, prefix='HLTHPLN1')
MEDCOST = pd.get_dummies(df.MEDCOST, drop_first=True, prefix='MEDCOST')

GENHLTH = pd.get_dummies(df.GENHLTH, drop_first=True, prefix='GENHLTH')
DIFFWALK = pd.get_dummies(df.DIFFWALK, drop_first=True, prefix='DIFFWALK')
SEX = pd.get_dummies(df.SEX, drop_first=True, prefix='SEX')
EDUCA = pd.get_dummies(df.EDUCA, drop_first=True, prefix='EDUCA')
INCOME2 = pd.get_dummies(df.INCOME2, drop_first=True, prefix='INCOME2')
_AGEG5YR = pd.get_dummies(df._AGEG5YR, drop_first=True, prefix='_AGEG5YR')

# Drop un-encoded features
df.drop(['_RFHYPE5','TOLDHI2','_CHOLCHK','SMOKE100', 'CVDSTRK3', '_MICHD', '_TOTINDA', '_FRTLT1',
         '_VEGLT1', '_RFDRHV5', 'HLTHPLN1', 'MEDCOST', 'GENHLTH', 'DIFFWALK', 'SEX', 'EDUCA', 'INCOME2', '_AGEG5YR'], 
         axis = 1, inplace = True)

# Add the results to the original df
df = pd.concat([_RFHYPE5,TOLDHI2,_CHOLCHK,SMOKE100, CVDSTRK3, _MICHD, _TOTINDA, _FRTLT1,
         _VEGLT1, _RFDRHV5, HLTHPLN1, MEDCOST, GENHLTH, DIFFWALK, SEX, EDUCA, INCOME2, _AGEG5YR,
          df], axis=1)

df.head()

Unnamed: 0,_RFHYPE5_1.0,TOLDHI2_1.0,_CHOLCHK_1.0,SMOKE100_1.0,CVDSTRK3_1.0,_MICHD_1.0,_TOTINDA_1.0,_FRTLT1_1.0,_VEGLT1_1.0,_RFDRHV5_1.0,HLTHPLN1_1.0,MEDCOST_1.0,GENHLTH_2.0,GENHLTH_3.0,GENHLTH_4.0,GENHLTH_5.0,DIFFWALK_1.0,SEX_1.0,EDUCA_2.0,EDUCA_3.0,EDUCA_4.0,EDUCA_5.0,EDUCA_6.0,INCOME2_2.0,INCOME2_3.0,INCOME2_4.0,INCOME2_5.0,INCOME2_6.0,INCOME2_7.0,INCOME2_8.0,_AGEG5YR_2.0,_AGEG5YR_3.0,_AGEG5YR_4.0,_AGEG5YR_5.0,_AGEG5YR_6.0,_AGEG5YR_7.0,_AGEG5YR_8.0,_AGEG5YR_9.0,_AGEG5YR_10.0,_AGEG5YR_11.0,_AGEG5YR_12.0,_AGEG5YR_13.0,DIABETE3,_BMI5,MENTHLTH,PHYSHLTH
0,1,1,1,1,0,0,0,0,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0.0,40.0,18.0,15.0
1,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0.0,25.0,0.0,0.0
3,1,1,1,0,0,0,0,1,0,0,1,1,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,28.0,30.0,30.0
5,1,0,1,0,0,0,1,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0.0,27.0,0.0,0.0
6,1,1,1,0,0,0,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0.0,24.0,3.0,0.0
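The per-column encoding above can also be written as a single `pd.get_dummies` call using its `columns` argument, which encodes the listed columns and leaves the rest untouched. The sketch below uses a made-up three-row frame. Unlike the cell above, this form places the untouched columns first and the dummy columns after them, and recent pandas versions return boolean rather than integer dummies.

```python
import pandas as pd

# Toy frame standing in for the cleaned data (values are made up).
toy = pd.DataFrame({
    "SEX": [0.0, 1.0, 1.0],
    "GENHLTH": [1.0, 3.0, 5.0],
    "_BMI5": [40.0, 25.0, 28.0],
})

# Columns to one-hot encode; everything else passes through unchanged.
categorical = ["SEX", "GENHLTH"]

encoded = pd.get_dummies(toy, columns=categorical, drop_first=True)
print(encoded.columns.tolist())
```

With `drop_first=True`, the lowest level of each encoded column becomes the implicit reference category, as in the original cell.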


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Instantiate a MinMaxScaler object
scaler = MinMaxScaler()

# Create a list of continuous features
con_feat = ['_BMI5', 'MENTHLTH', 'PHYSHLTH']

# Fit on dataset
df[con_feat] = scaler.fit_transform(df[con_feat])
df.head()

Unnamed: 0,_RFHYPE5_1.0,TOLDHI2_1.0,_CHOLCHK_1.0,SMOKE100_1.0,CVDSTRK3_1.0,_MICHD_1.0,_TOTINDA_1.0,_FRTLT1_1.0,_VEGLT1_1.0,_RFDRHV5_1.0,HLTHPLN1_1.0,MEDCOST_1.0,GENHLTH_2.0,GENHLTH_3.0,GENHLTH_4.0,GENHLTH_5.0,DIFFWALK_1.0,SEX_1.0,EDUCA_2.0,EDUCA_3.0,EDUCA_4.0,EDUCA_5.0,EDUCA_6.0,INCOME2_2.0,INCOME2_3.0,INCOME2_4.0,INCOME2_5.0,INCOME2_6.0,INCOME2_7.0,INCOME2_8.0,_AGEG5YR_2.0,_AGEG5YR_3.0,_AGEG5YR_4.0,_AGEG5YR_5.0,_AGEG5YR_6.0,_AGEG5YR_7.0,_AGEG5YR_8.0,_AGEG5YR_9.0,_AGEG5YR_10.0,_AGEG5YR_11.0,_AGEG5YR_12.0,_AGEG5YR_13.0,DIABETE3,_BMI5,MENTHLTH,PHYSHLTH
0,1,1,1,1,0,0,0,0,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.325581,0.6,0.5
1,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0.0,0.151163,0.0,0.0
3,1,1,1,0,0,0,0,1,0,0,1,1,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.186047,1.0,1.0
5,1,0,1,0,0,0,1,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0.0,0.174419,0.0,0.0
6,1,1,1,0,0,0,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0.0,0.139535,0.1,0.0
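Fitting the scaler on the full dataset, as above, lets the minimum and maximum of rows that later end up in the test set influence the transform. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on randomly generated stand-in data; it is an illustration under those assumptions, not the procedure used to obtain the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random stand-in for the three continuous columns (not BRFSS data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "_BMI5": rng.integers(12, 99, size=200).astype(float),
    "MENTHLTH": rng.integers(0, 31, size=200).astype(float),
    "PHYSHLTH": rng.integers(0, 31, size=200).astype(float),
})
con_feat = ["_BMI5", "MENTHLTH", "PHYSHLTH"]

# Split first, then copy so the assignments below act on independent frames.
train, test = train_test_split(toy, test_size=0.2, random_state=0)
train, test = train.copy(), test.copy()

scaler = MinMaxScaler()
train[con_feat] = scaler.fit_transform(train[con_feat])  # learn min/max from train only
test[con_feat] = scaler.transform(test[con_feat])        # reuse the train min/max

print(train[con_feat].describe().loc[["min", "max"]])
```

With this ordering, scaled test values can fall slightly outside the 0 to 1 range when the test set contains a value more extreme than anything seen in training.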


In [None]:
df.describe()

Unnamed: 0,_RFHYPE5_1.0,TOLDHI2_1.0,_CHOLCHK_1.0,SMOKE100_1.0,CVDSTRK3_1.0,_MICHD_1.0,_TOTINDA_1.0,_FRTLT1_1.0,_VEGLT1_1.0,_RFDRHV5_1.0,HLTHPLN1_1.0,MEDCOST_1.0,GENHLTH_2.0,GENHLTH_3.0,GENHLTH_4.0,GENHLTH_5.0,DIFFWALK_1.0,SEX_1.0,EDUCA_2.0,EDUCA_3.0,EDUCA_4.0,EDUCA_5.0,EDUCA_6.0,INCOME2_2.0,INCOME2_3.0,INCOME2_4.0,INCOME2_5.0,INCOME2_6.0,INCOME2_7.0,INCOME2_8.0,_AGEG5YR_2.0,_AGEG5YR_3.0,_AGEG5YR_4.0,_AGEG5YR_5.0,_AGEG5YR_6.0,_AGEG5YR_7.0,_AGEG5YR_8.0,_AGEG5YR_9.0,_AGEG5YR_10.0,_AGEG5YR_11.0,_AGEG5YR_12.0,_AGEG5YR_13.0,DIABETE3,_BMI5,MENTHLTH,PHYSHLTH
count,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0
mean,0.429001,0.424121,0.96267,0.443169,0.040571,0.094186,0.756544,0.634256,0.81142,0.056197,0.951053,0.084177,0.351167,0.298195,0.124448,0.047623,0.168224,0.440342,0.015937,0.037362,0.247359,0.275583,0.423072,0.046448,0.063048,0.079372,0.10203,0.143764,0.170368,0.356295,0.029951,0.043847,0.05449,0.06369,0.078126,0.103729,0.121539,0.131047,0.126908,0.092766,0.062993,0.068444,0.296921,0.190493,0.106159,0.141403
std,0.494934,0.49421,0.189571,0.496761,0.197294,0.292087,0.429169,0.481639,0.391175,0.230302,0.215759,0.277654,0.477336,0.457466,0.330093,0.212968,0.374066,0.496429,0.125234,0.189648,0.431478,0.446809,0.494048,0.210454,0.24305,0.270318,0.302689,0.350851,0.375957,0.478905,0.170453,0.204754,0.226982,0.244201,0.26837,0.304909,0.326753,0.337452,0.33287,0.290105,0.24295,0.252508,0.69816,0.076845,0.247095,0.290598
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139535,0.0,0.0
50%,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.174419,0.0,0.0
75%,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22093,0.066667,0.1
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0


## Split train test

In [None]:
from sklearn.model_selection import train_test_split

y = df.pop('DIABETE3')
X = df

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
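The split above uses neither a fixed seed nor stratification, so the exact scores will vary between runs and rare classes may be unevenly represented. A hedged sketch of a reproducible, stratified split is shown below on made-up labels; the class counts are illustrative and are not the actual `DIABETE3` distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced three-class labels (not the real class counts).
y_toy = np.array([0] * 840 + [1] * 20 + [2] * 140)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,
    random_state=42,   # fixed seed so the split is repeatable
    stratify=y_toy,    # keep class proportions similar in both parts
)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share.round(3), test_share.round(3))
```

In the notebook this would correspond to passing `random_state=...` and `stratify=y` to the existing `train_test_split` call.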

## Use Voting Classifier

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [None]:
# Build and train a Gaussian Naive Bayes
gnb = GaussianNB()
cv = cross_val_score(gnb,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.70674321 0.70770406 0.71588361 0.71238513 0.70555337]
0.7096538744653256


In [None]:
# Build and train a Logistic Regression
lr = LogisticRegression(max_iter = 2000)
cv = cross_val_score(lr,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.84808692 0.84606667 0.84981153 0.84835793 0.84685129]
0.8478348658566908


In [None]:
# Build and train a Decision Tree
dt = tree.DecisionTreeClassifier(random_state = 1)
cv = cross_val_score(dt,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.77067678 0.76966666 0.76954347 0.7692971  0.76675372]
0.7691875471370091


In [None]:
# Build and train a KNN
knn = KNeighborsClassifier()
cv = cross_val_score(knn,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.82926409 0.82756412 0.83261475 0.82916554 0.8271903 ]
0.82915975850749


In [None]:
# Build and train a Random Forest Classifier
rf = RandomForestClassifier(random_state = 1)
cv = cross_val_score(rf,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.84143487 0.84005519 0.84219863 0.84237109 0.84002661]
0.84121727574766


In [None]:
# Build and train an SVC (temporarily removed as it takes too long to train)
# svc = SVC(probability = True)
# cv = cross_val_score(svc,X_train,y_train,cv=5)
# print(cv)
# print(cv.mean())

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier(random_state =1)
cv = cross_val_score(xgb,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.84939269 0.84683042 0.85087093 0.85059992 0.84800926]
0.8491406440624989
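The cells above repeat the same three lines for each model. One possible way to condense them is a loop over a dictionary of estimators, sketched below on synthetic data from `make_classification`. The dictionary name `models`, the synthetic dataset, and the choice of three estimators are assumptions made to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (not the BRFSS data).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=1
)

models = {
    "gnb": GaussianNB(),
    "lr": LogisticRegression(max_iter=2000),
    "dt": DecisionTreeClassifier(random_state=1),
}

mean_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy, as in the cells above.
    scores = cross_val_score(model, X_toy, y_toy, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.round(3)} mean={scores.mean():.3f}")
```

Collecting the means in one dictionary also makes it straightforward to sort or tabulate the models afterwards.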


In [None]:
# Build a Voting Classifier
from sklearn.ensemble import VotingClassifier
# voting_clf = VotingClassifier(estimators = [('lr',lr),('knn',knn),('rf',rf),('gnb',gnb),('svc',svc),('xgb',xgb)], voting = 'soft') 
voting_clf = VotingClassifier(estimators = [('lr',lr),('knn',knn),('rf',rf),('gnb',gnb),('xgb',xgb)], voting = 'soft') 
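With `voting = 'soft'`, the ensemble averages the class probabilities predicted by each estimator and picks the class with the highest average, whereas hard voting takes a majority over the predicted labels. The NumPy sketch below illustrates the difference with made-up probabilities for a single sample; it mirrors the idea only and is not scikit-learn's internal code.

```python
import numpy as np

# Made-up class probabilities from three models for one sample.
# Shape: (n_models, n_samples, n_classes)
probas = np.array([
    [[0.9, 0.0, 0.1]],  # model A is very confident in class 0
    [[0.4, 0.1, 0.5]],  # model B slightly prefers class 2
    [[0.4, 0.1, 0.5]],  # model C slightly prefers class 2
])

# Soft voting: average the probabilities, then take the most probable class.
avg_proba = probas.mean(axis=0)
soft_pred = avg_proba.argmax(axis=1)

# Hard voting: each model votes for its own top class, majority wins.
votes = probas.argmax(axis=2)  # shape (n_models, n_samples)
hard_pred = np.array([
    np.bincount(votes[:, i], minlength=probas.shape[2]).argmax()
    for i in range(votes.shape[1])
])

print("soft:", soft_pred, "hard:", hard_pred)
```

In this toy case the two schemes disagree, because one confident model outweighs two lukewarm ones under soft voting. Soft voting also requires every estimator to expose `predict_proba`, which is why the commented-out `SVC` was created with `probability = True`.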

In [None]:
cv = cross_val_score(voting_clf,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.84113922 0.83680307 0.84362758 0.84256818 0.83820341]
0.8404682953677591


In [None]:
# Train the Voting Classifier
voting_clf.fit(X_train,y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(max_iter=2000)),
                             ('knn', KNeighborsClassifier()),
                             ('rf', RandomForestClassifier(random_state=1)),
                             ('gnb', GaussianNB()),
                             ('xgb', XGBClassifier(random_state=1))],
                 voting='soft')

In [None]:
# Get the accuracy score of the classification
print(voting_clf.score(X_test, y_test))

0.8413946704509618
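A single accuracy figure can be hard to interpret for a three-class target if the classes are imbalanced, because a model that always predicts the most common class can already score well. The sketch below compares plain accuracy with balanced accuracy for such a baseline. The label counts are made up for illustration and are not the actual `DIABETE3` distribution.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up, imbalanced three-class labels (not the real class counts).
y_toy = np.array([0] * 84 + [1] * 2 + [2] * 14)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, pred)
bal_acc = balanced_accuracy_score(y_toy, pred)
print(f"accuracy={acc:.3f} balanced_accuracy={bal_acc:.3f}")
```

Running the same comparison on `y_test`, for example with `DummyClassifier` fitted on `X_train, y_train`, would show how much of the voting classifier's score comes from the class distribution and how much from the features.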
