# HW6 Comprehensive Supervised Learning
Contributors: Zongyu Wu, Tony Wilson, Chandler Smith

Summary: The overall purpose of this assignment is to tie together a variety of supervised learning techniques in order to appropriately analyze several questions related to the Behavioral Risk Factor Surveillance System. More specifically, the goal is to use supervised ML to identify patterns of comorbidity among the survey respondents. This is a comparative exercise focusing on pre (2019) and post (2021) covid health. 

Notes from discussion:

Approach: One approach for understanding behavioral factors behind comorbidity is to apply classification algorithms to the BRFSS data. Such algorithms can be trained to predict whether an individual has multiple chronic conditions based on their responses to the survey questions. By analyzing the features that are most important for predicting comorbidity, one can identify risk factors and inform the development of targeted prevention and intervention strategies.

Response related:
Depression = ADDEPEV3
 Ever told Asthma = _CASTHM1
 COPD = (CHCCOPD2 for 2019 and CHCCOPD3 for 2021)
 Cancer = (combine CHCSCNCR and CHCOCNCR)
 Ever told Heart Condition = combination of CVDCRHD4, CVDINFR4 and CVDSTRK3
 Diabetes = DIABETE4
 

Key demographic features:
Age: _AGE_G
 Marital: MARITAL
 Sex: _SEX
 Income: INCOME2
 Education: _EDUCAG

- Do NOT refer to correlation
- BRFSS data can be used to understand comorbidity, which refers to the presence of multiple chronic conditions in an individual. 
- RFSS uses a complex sampling and weighting scheme to measure prevalence of many health conditions, behavioral and lifestyle related risk factors and emerging health issues in states. 
- Do not use the weight variable as we are conducting estimations. 


- 2019 Codebook: https://www.cdc.gov/brfss/annual_data/2019/pdf/codebook19_llcp-v2-508.HTML
- 2021 Codebook: https://www.cdc.gov/brfss/annual_data/2021/pdf/codebook21_llcp-v2-508.pdf

## Clean, Standardize, and Merge data


In [1]:
import pandas as pd
import numpy as np
import os
from datetime import datetime

import sklearn 

# Libraries related to outlier detection
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

from sklearn import datasets
import random
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
from scipy.stats import pearsonr

import warnings
warnings.filterwarnings('ignore') 
sns.set(rc={'figure.figsize':(11,8)})

pd.options.display.float_format = '{:.2f}'.format

## <font color= darkgreen> Basic EDA

### <font color= lightblue> Step 1: Read the data and merge into Pandas Data Frame

In [2]:
###############################################################################################################################################################################
'''                                                                       Data Frame Setup                                                                                  '''
###############################################################################################################################################################################
## The first column is index: skipping that column to end read csv to Panda Data Frame
df_21 = pd.read_csv("res/brfss21-1.csv")
df_21.drop(columns="Unnamed: 0", inplace=True)
df_21.head()



Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,...,_FRTRES1,_VEGRES1,_FRUTSU1,_VEGESU1,_FRTLT1A,_VEGLT1A,_FRT16A,_VEG23A,_FRUITE1,_VEGETE1
0,1.0,1.0,b'01192021',b'01',b'19',b'2021',1100.0,b'2021000001',2021000001.0,1.0,...,1.0,1.0,100.0,214.0,1.0,1.0,1.0,1.0,0.0,0.0
1,1.0,1.0,b'01212021',b'01',b'21',b'2021',1100.0,b'2021000002',2021000002.0,1.0,...,1.0,1.0,100.0,128.0,1.0,1.0,1.0,1.0,0.0,0.0
2,1.0,1.0,b'01212021',b'01',b'21',b'2021',1100.0,b'2021000003',2021000003.0,1.0,...,1.0,1.0,100.0,71.0,1.0,2.0,1.0,1.0,0.0,0.0
3,1.0,1.0,b'01172021',b'01',b'17',b'2021',1100.0,b'2021000004',2021000004.0,1.0,...,1.0,1.0,114.0,165.0,1.0,1.0,1.0,1.0,0.0,0.0
4,1.0,1.0,b'01152021',b'01',b'15',b'2021',1100.0,b'2021000005',2021000005.0,1.0,...,1.0,1.0,100.0,258.0,1.0,1.0,1.0,1.0,0.0,0.0


In [3]:
df_19 = pd.read_csv("res/brfss19-1.csv")
df_19.drop(columns="Unnamed: 0", inplace= True)
df_19.head()


Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,...,_VEGESU1,_FRTLT1A,_VEGLT1A,_FRT16A,_VEG23A,_FRUITE1,_VEGETE1,_FLSHOT7,_PNEUMO3,_AIDTST4
0,1,1,1182019,1,18,2019,1100,2019000001,2019000001,1.0,...,114.0,1,1,1,1,0,0,2.0,1.0,2.0
1,1,1,1132019,1,13,2019,1100,2019000002,2019000002,1.0,...,121.0,1,1,1,1,0,0,1.0,1.0,2.0
2,1,1,1182019,1,18,2019,1100,2019000003,2019000003,1.0,...,164.0,1,1,1,1,0,0,1.0,2.0,2.0
3,1,1,1182019,1,18,2019,1200,2019000004,2019000004,1.0,...,,9,9,1,1,1,1,9.0,9.0,
4,1,1,1042019,1,4,2019,1100,2019000005,2019000005,1.0,...,178.0,1,1,1,1,0,0,2.0,1.0,2.0


### <font color= lightblue> Step 2: Summary of Stats

In [4]:
###############################################################################################################################################################################
'''                                                                    Data Frame Description                                                                               '''
###############################################################################################################################################################################
df_19.describe()

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,...,_VEGESU1,_FRTLT1A,_VEGLT1A,_FRT16A,_VEG23A,_FRUITE1,_VEGETE1,_FLSHOT7,_PNEUMO3,_AIDTST4
count,418268.0,418268.0,418268.0,418268.0,418268.0,418268.0,418268.0,418268.0,418268.0,149941.0,...,364838.0,418268.0,418268.0,418268.0,418268.0,418268.0,418268.0,159112.0,159112.0,377977.0
mean,29.62,6.54,6727352.51,6.58,14.53,2019.04,1117.44,2019004884.52,2019004884.52,1.0,...,204.14,2.19,2.19,1.0,1.0,0.11,0.13,2.23,2.37,1.97
std,16.15,3.34,3304672.99,3.31,8.49,0.21,37.95,3653.32,3653.32,0.0,...,267.9,2.4,2.63,0.04,0.05,0.32,0.35,2.47,2.73,1.56
min,1.0,1.0,1012020.0,1.0,1.0,2019.0,1100.0,2019000001.0,2019000001.0,1.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,18.0,4.0,4082019.0,4.0,7.0,2019.0,1100.0,2019002011.0,2019002011.0,1.0,...,114.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0
50%,27.0,7.0,7012019.0,7.0,14.0,2019.0,1100.0,2019004137.0,2019004137.0,1.0,...,165.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,2.0
75%,42.0,9.0,9302019.0,9.0,22.0,2019.0,1100.0,2019006895.0,2019006895.0,1.0,...,229.0,2.0,2.0,1.0,1.0,0.0,0.0,2.0,2.0,2.0
max,72.0,12.0,12312019.0,12.0,31.0,2020.0,1200.0,2019017419.0,2019017419.0,1.0,...,13204.0,9.0,9.0,1.0,1.0,2.0,2.0,9.0,9.0,9.0


In [5]:
###############################################################################################################################################################################
'''                                                                     Data Frame Shape                                                                                    '''
###############################################################################################################################################################################
df_19.shape

(418268, 250)

In [6]:
###############################################################################################################################################################################
'''                                                                    Data Frame Description                                                                               '''
###############################################################################################################################################################################
df_21.describe()

Unnamed: 0,_STATE,FMONTH,DISPCODE,_PSU,CTELENM1,PVTRESD1,COLGHOUS,STATERE1,LADULT1,COLGSEX,...,_FRTRES1,_VEGRES1,_FRUTSU1,_VEGESU1,_FRTLT1A,_VEGLT1A,_FRT16A,_VEG23A,_FRUITE1,_VEGETE1
count,438693.0,438693.0,438693.0,438693.0,117786.0,117786.0,30.0,117786.0,117786.0,30.0,...,438693.0,438693.0,387606.0,378566.0,438693.0,438693.0,438693.0,438693.0,438693.0,438693.0
mean,30.74,6.41,1118.19,2021006064.89,1.0,1.0,1.0,1.0,1.01,1.63,...,0.88,0.86,178.34,271.54,2.27,2.26,0.99,0.99,0.13,0.15
std,15.33,3.42,38.58,6383.75,0.0,0.02,0.0,0.0,0.08,0.49,...,0.32,0.34,691.29,1036.23,2.49,2.71,0.07,0.09,0.35,0.38
min,1.0,1.0,1100.0,2021000001.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,20.0,3.0,1100.0,2021002091.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,57.0,114.0,1.0,1.0,1.0,1.0,0.0,0.0
50%,31.0,6.0,1100.0,2021004338.0,1.0,1.0,1.0,1.0,1.0,2.0,...,1.0,1.0,100.0,167.0,1.0,1.0,1.0,1.0,0.0,0.0
75%,41.0,9.0,1100.0,2021007674.0,1.0,1.0,1.0,1.0,1.0,2.0,...,1.0,1.0,200.0,229.0,2.0,2.0,1.0,1.0,0.0,0.0
max,78.0,12.0,1200.0,2021039095.0,2.0,2.0,1.0,1.0,2.0,2.0,...,1.0,1.0,19800.0,39600.0,9.0,9.0,1.0,1.0,2.0,2.0


In [7]:
###############################################################################################################################################################################
'''                                                                     Data Frame Shape                                                                                    '''
###############################################################################################################################################################################
df_21.shape

(438693, 250)

### <font color= lightblue> Step 3: Choose Feature Space

<font color=white>We will first need to choose our feature space.  The five demographics provided will be the first on the list:
<font color= red>
* Age:  _AGE_G
* Marital: MARITAL
* Sex: _SEX
* Income: INCOME2
* Education: _EDUCAG

<font color= white>
Next we want to choose some features which we think will have some bearing on the response columns. We also want to pair the 250 features down to 20 (as per professor). Another demographic could be their race and city vs rural living as well as access to health care (insurance) and routine checkups:
<br><br>
<font color= red>

* Race: _RACE
* Urban / Rural: _METSTAT
* Health Care Access (Insurance): PRIMINSR &ensp;&ensp;&ensp;&ensp;<font color=lightblue>-- Health Insurance data not included in the data provided<font color= red>
* Health Care Access (Routine checkup): CHECKUP1

<font color= white>
Also, we should also track physical activity, High blood pressure, High Cholesterol, smoking habits, drinking habits, BMI, kidney disease
<br><br>
<font color= red>

* Physical exersice: _TOTINDA
* HBP: _RFHYPE6 &ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;<font color=lightblue>-- Blood Pressure metrics not included in the data provided<font color= white><br>

Instead we will use Currently taking blood pressure medication and Be very carefull with our clean up <br><font color= red>

* High Cholesterol: _RFCHOL3 &ensp;&ensp;&ensp;&ensp;<font color=lightblue>-- Cholesterol metrics not included in the data provided<font color= red>
* Four level smoker status: _SMOKER3
* Number of drinks per week: _DRNKWK1&ensp;&ensp;&ensp;&ensp;<font color=lightblue> Continuous Data<font color= red>
* Total Fruit per Day: _FRUTSU1&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;<font color=lightblue> Continuous Data<font color= red>
* Total Vegetables per Day: _VEGESU1&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;<font color=lightblue> Continuous Data<font color= red>
* Total French Fry per Day: FRNCHDA_&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;<font color=lightblue> Continuous Data<font color= red>
* Body Mass Index: _BMI5&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;<font color=lightblue> Continuous Data<font color= red>
* Kidney Disease: CHCKDNY2

<font color= white>
The last demographic that Tony would like to include specifically is whether the participant is a veteran:
br><br>
<font color= red>

* Veteran Status: VETERAN3


In [8]:
###############################################################################################################################################################################
'''                                                               Choosing Response and Features                                                                            '''
###############################################################################################################################################################################
df_19_trim = df_19.loc[:, ['ADDEPEV3', '_CASTHM1', 'CHCCOPD2', 'CHCSCNCR', 'CHCOCNCR', 'CVDCRHD4', 'CVDINFR4', 'CVDSTRK3',
                            'DIABETE4', '_AGE_G', 'MARITAL', '_SEX', 'INCOME2', '_EDUCAG', '_RACE', '_METSTAT', 'CHECKUP1',
                            'BPMEDS', '_TOTINDA', '_SMOKER3', '_DRNKWK1', '_FRUTSU1', '_VEGESU1','FRNCHDA_', '_BMI5', 
                            'CHCKDNY2', 'VETERAN3']]
df_19_trim.shape

(418268, 27)

In [9]:
###############################################################################################################################################################################
'''                                                               Choosing Response and Features                                                                            '''
###############################################################################################################################################################################
df_21_trim = df_21.loc[:, ['ADDEPEV3', '_CASTHM1', 'CHCCOPD3', 'CHCSCNCR', 'CHCOCNCR', 'CVDCRHD4', 'CVDINFR4', 'CVDSTRK3',
                            'DIABETE4', '_AGE_G', 'MARITAL', '_SEX', 'INCOME3', '_EDUCAG', '_RACE', '_METSTAT', 'CHECKUP1',
                            'BPMEDS', '_TOTINDA', '_SMOKER3', '_DRNKWK1', '_FRUTSU1', '_VEGESU1','FRNCHDA_', '_BMI5', 
                            'CHCKDNY2', 'VETERAN3']]
df_21_trim.shape

(438693, 27)

### <font color= lightblue> Step 4: Missing Data Analysis

##### BPMEDS
From looking at the data itself, most of the missing values from BPMEDS is because the previous question answered something other than yes to every choice other than yes.  Therefore we can infer that these were a no and that 7 and 9 should be dropped.
* 2021-
- missing: 266,560 
<br>vs<br>
- Previous other than yes: 266,560

there is only 0.26% which was I don't know or Refused

* 2019-
- missing: 248,634 
<br>vs<br>
- Previous other than yes: 248,634

there is only 0.21% which was I don't know or Refused

In [10]:
###############################################################################################################################################################################
'''                                                                   Analyze, Clean, and Impute                                                                            '''
###############################################################################################################################################################################
## Print Original Shape for comparisson
print('Original 19 shape:\t', df_21_trim.shape)
print('Original 21 shape:\t', df_21_trim.shape)

## Interpret and clean df
# output how many NaN there is per column
print('Missing values by field (Pre-Interpolate):')

# Visualize the finished Data Frame
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df_19_trim.isnull().sum(), '\n')
    print(df_21_trim.isnull().sum(), '\n')

## Replace missing values in BPMEDS with No answer and drop 7 and 9 response as NAN
df_19_trim = df_19_trim[df_19_trim['BPMEDS'] != 7] 
df_19_trim = df_19_trim[df_19_trim['BPMEDS'] != 9]
df_19_trim['BPMEDS'].fillna(2, inplace= True) 

# How are we doing now
print(df_19_trim.isnull().sum(), '\n')
print(df_19_trim.shape)

# Drop the remaining NaN
df_19_trim.dropna(inplace= True)


## Replace missing values in BPMEDS with No answer and drop 7 and 9 response as NAN
df_21_trim = df_21_trim[df_21_trim['BPMEDS'] != 7] 
df_21_trim = df_21_trim[df_21_trim['BPMEDS'] != 9]
df_21_trim['BPMEDS'].fillna(2, inplace= True) 

# How are we doing now
print(df_21_trim.isnull().sum(), '\n')
print(df_21_trim.shape)

# Drop the remaining NaN
df_21_trim.dropna(inplace= True)

# Output final shape
print('Final 19 shape:\t', df_19_trim.shape)
print('Final 21 shape:\t', df_21_trim.shape)

Original 19 shape:	 (438693, 27)
Original 21 shape:	 (438693, 27)
Missing values by field (Pre-Interpolate):
ADDEPEV3        10
_CASTHM1         0
CHCCOPD2         8
CHCSCNCR         8
CHCOCNCR         9
CVDCRHD4         8
CVDINFR4        10
CVDSTRK3        11
DIABETE4         9
_AGE_G           0
MARITAL         49
_SEX             0
INCOME2       6881
_EDUCAG          0
_RACE            3
_METSTAT      8458
CHECKUP1        10
BPMEDS      248634
_TOTINDA         0
_SMOKER3         0
_DRNKWK1         0
_FRUTSU1     44600
_VEGESU1     53430
FRNCHDA_     38866
_BMI5        36203
CHCKDNY2        11
VETERAN3      1374
dtype: int64 

ADDEPEV3         3
_CASTHM1         0
CHCCOPD3         3
CHCSCNCR         2
CHCOCNCR         3
CVDCRHD4         2
CVDINFR4         2
CVDSTRK3         2
DIABETE4         3
_AGE_G           0
MARITAL          5
_SEX             0
INCOME3       8847
_EDUCAG          0
_RACE            0
_METSTAT      7054
CHECKUP1         2
BPMEDS      266560
_TOTINDA         0
_S

## Create categorical columns

- Column 1: Binary - indicates whether an individual has haf any of the chronic conditions. 
- Column 2: Multiclass - create a milticlass column with values that inidicate the total number of chronic conditions.

### <font color= lightblue> Step 5: Binary and Multiclass Response Created

<font color= yellow>First we create the binaries for 2019

In [11]:
###############################################################################################################################################################################
'''                                                                   Convert Response Column for 2019                                                                      '''
###############################################################################################################################################################################
## Binary Comorbidity Response Features from our respective responses
# Create the Depression comorbidity variable
df_19_trim['Depression'] = df_19_trim.apply(lambda row: 1 if row[0] == 1 else (0 if row[0] == 2 else np.NAN), axis=1)
print(df_19_trim[['Depression', 'ADDEPEV3']].value_counts())
print(df_19_trim['Depression'].isnull().sum())
print(df_19_trim['ADDEPEV3'].value_counts(), '\n')

# Create the Asthma comorbidity variable
df_19_trim['Asthma'] = df_19_trim.apply(lambda row: 1 if row[1] == 1 else (0 if row[1] == 2 else np.NAN), axis=1)
print(df_19_trim[['Asthma', '_CASTHM1']].value_counts())
print(df_19_trim['Asthma'].isnull().sum())
print(df_19_trim['_CASTHM1'].value_counts(), '\n')

# Create the COPD comorbidity variable
df_19_trim['COPD'] = df_19_trim.apply(lambda row: 1 if row[2] == 1 else (0 if row[2] == 2 else np.NAN), axis=1)
print(df_19_trim[['COPD', 'CHCCOPD2']].value_counts())
print(df_19_trim['COPD'].isnull().sum())
print(df_19_trim['CHCCOPD2'].value_counts(), '\n')

# Create the Cancer comorbidity variable
df_19_trim['Cancer'] = df_19_trim.apply(lambda row: 1 if row[3] == 1 else (1 if row[4] == 1 else (0 if row[3] == 2 else ( 0 if row[4] == 2 else np.NAN))), axis=1)
print(df_19_trim[['Cancer', 'CHCSCNCR', 'CHCOCNCR']].value_counts())
print(df_19_trim['Cancer'].isnull().sum())
print(df_19_trim['CHCSCNCR'].value_counts(), '\n')
print(df_19_trim['CHCOCNCR'].value_counts(), '\n')
print(df_19_trim['Cancer'].value_counts(), '\n')

# Create the Heart Condition comorbidity variable
df_19_trim['Heart'] = df_19_trim.apply(lambda row: 1 if row[5] == 1 else (1 if row[6] == 1 else (1 if row[7] == 1 else (0 if row[5] == 2 else (0 if row[6] == 2 else (0 if row[7] == 2 else np.NAN))))), axis=1)
print(df_19_trim[['Heart', 'CVDCRHD4', 'CVDINFR4', 'CVDSTRK3']].value_counts())
print(df_19_trim['Heart'].isnull().sum())
print(df_19_trim['CVDCRHD4'].value_counts(), '\n')
print(df_19_trim['CVDINFR4'].value_counts(), '\n')
print(df_19_trim['CVDSTRK3'].value_counts(), '\n')
print(df_19_trim['Heart'].value_counts(), '\n')

# Create the Depression comorbidity variable
df_19_trim['Diabetes'] = df_19_trim.apply(lambda row: 1 if row[8] == 1 else (0 if row[8] == 2 else (0 if row[8] == 3 else (0 if row[8] == 4 else np.NAN))), axis=1)
print(df_19_trim[['Diabetes', 'DIABETE4']].value_counts())
print(df_19_trim['Diabetes'].isnull().sum())
print(df_19_trim['DIABETE4'].value_counts(), '\n')

print(df_19_trim.head())

Depression  ADDEPEV3
0.00        2.00        264512
1.00        1.00         64620
dtype: int64
1373
2.00    264512
1.00     64620
7.00      1188
9.00       185
Name: ADDEPEV3, dtype: int64 

Asthma  _CASTHM1
1.00    1           297376
0.00    2            30814
dtype: int64
2315
1    297376
2     30814
9      2315
Name: _CASTHM1, dtype: int64 

COPD  CHCCOPD2
0.00  2.00        300875
1.00  1.00         28179
dtype: int64
1451
2.00    300875
1.00     28179
7.00      1398
9.00        53
Name: CHCCOPD2, dtype: int64 

Cancer  CHCSCNCR  CHCOCNCR
0.00    2.00      2.00        267482
1.00    1.00      2.00         26841
        2.00      1.00         26565
        1.00      1.00          8057
0.00    7.00      2.00           660
        2.00      7.00           509
1.00    7.00      1.00           143
        1.00      7.00            89
0.00    2.00      9.00            58
        9.00      2.00            13
1.00    1.00      9.00            10
        9.00      1.00             1
dtype: 

<font color= yellow>Then we create the binaries for 2021

In [12]:
###############################################################################################################################################################################
'''                                                                   Convert Response Column for 2021                                                                      '''
###############################################################################################################################################################################
## Binary Comorbidity Response Features from our respective responses
# Create the Depression comorbidity variable
df_21_trim['Depression'] = df_21_trim.apply(lambda row: 1 if row[0] == 1 else (0 if row[0] == 2 else np.NAN), axis=1)
print(df_21_trim[['Depression', 'ADDEPEV3']].value_counts())
print(df_21_trim['Depression'].isnull().sum())
print(df_21_trim['ADDEPEV3'].value_counts(), '\n')

# Create the Asthma comorbidity variable
df_21_trim['Asthma'] = df_21_trim.apply(lambda row: 1 if row[1] == 1 else (0 if row[1] == 2 else np.NAN), axis=1)
print(df_21_trim[['Asthma', '_CASTHM1']].value_counts())
print(df_21_trim['Asthma'].isnull().sum())
print(df_21_trim['_CASTHM1'].value_counts(), '\n')

# Create the COPD comorbidity variable
df_21_trim['COPD'] = df_21_trim.apply(lambda row: 1 if row[2] == 1 else (0 if row[2] == 2 else np.NAN), axis=1)
print(df_21_trim[['COPD', 'CHCCOPD3']].value_counts())
print(df_21_trim['COPD'].isnull().sum())
print(df_21_trim['CHCCOPD3'].value_counts(), '\n')

# Create the Cancer comorbidity variable
df_21_trim['Cancer'] = df_21_trim.apply(lambda row: 1 if row[3] == 1 else (1 if row[4] == 1 else (0 if row[3] == 2 else ( 0 if row[4] == 2 else np.NAN))), axis=1)
print(df_21_trim[['Cancer', 'CHCSCNCR', 'CHCOCNCR']].value_counts())
print(df_21_trim['Cancer'].isnull().sum())
print(df_21_trim['CHCSCNCR'].value_counts(), '\n')
print(df_21_trim['CHCOCNCR'].value_counts(), '\n')
print(df_21_trim['Cancer'].value_counts(), '\n')

# Create the Heart Condition comorbidity variable
df_21_trim['Heart'] = df_21_trim.apply(lambda row: 1 if row[5] == 1 else (1 if row[6] == 1 else (1 if row[7] == 1 else (0 if row[5] == 2 else (0 if row[6] == 2 else (0 if row[7] == 2 else np.NAN))))), axis=1)
print(df_21_trim[['Heart', 'CVDCRHD4', 'CVDINFR4', 'CVDSTRK3']].value_counts())
print(df_21_trim['Heart'].isnull().sum())
print(df_21_trim['CVDCRHD4'].value_counts(), '\n')
print(df_21_trim['CVDINFR4'].value_counts(), '\n')
print(df_21_trim['CVDSTRK3'].value_counts(), '\n')
print(df_21_trim['Heart'].value_counts(), '\n')

# Create the Depression comorbidity variable
df_21_trim['Diabetes'] = df_21_trim.apply(lambda row: 1 if row[8] == 1 else (0 if row[8] == 2 else (0 if row[8] == 3 else (0 if row[8] == 4 else np.NAN))), axis=1)
print(df_21_trim[['Diabetes', 'DIABETE4']].value_counts())
print(df_21_trim['Diabetes'].isnull().sum())
print(df_21_trim['DIABETE4'].value_counts(), '\n')

print(df_21_trim.head())

Depression  ADDEPEV3
0.00        2.00        268786
1.00        1.00         68765
dtype: int64
1447
2.00    268786
1.00     68765
7.00      1183
9.00       264
Name: ADDEPEV3, dtype: int64 

Asthma  _CASTHM1
1.00    1.00        303552
0.00    2.00         33089
dtype: int64
2357
1.00    303552
2.00     33089
9.00      2357
Name: _CASTHM1, dtype: int64 

COPD  CHCCOPD3
0.00  2.00        311016
1.00  1.00         26759
dtype: int64
1223
2.00    311016
1.00     26759
7.00      1157
9.00        66
Name: CHCCOPD3, dtype: int64 

Cancer  CHCSCNCR  CHCOCNCR
0.00    2.00      2.00        278474
1.00    2.00      1.00         25632
        1.00      2.00         25540
                  1.00          7869
0.00    7.00      2.00           588
        2.00      7.00           468
1.00    7.00      1.00           137
        1.00      7.00            88
0.00    2.00      9.00            56
        9.00      2.00            17
1.00    1.00      9.00            14
        9.00      1.00             

<font color= yellow>Drop the coloumns no longer needed and any missing values introduced

In [13]:
###############################################################################################################################################################################
'''                                                                   Re-Clean Data fropm both years                                                                        '''
###############################################################################################################################################################################
## First we need to drop the columns we no longer need and then the new NaNs for 2019
print('df_19_trim Original Shape_\t', df_19_trim.shape)
df_19_trim.drop(columns=['ADDEPEV3', '_CASTHM1', 'CHCCOPD2', 'CHCSCNCR', 'CHCOCNCR', 'CVDCRHD4', 'CVDINFR4', 'CVDSTRK3', 'DIABETE4'], inplace=True)
df_19_trim.dropna(inplace= True)
print('df_19_trim New Shape_\t', df_19_trim.shape, '\n')

## First we need to drop the columns we no longer need and then the new NaNs for 2021
print('df_21_trim Original Shape_\t', df_21_trim.shape)
df_21_trim.drop(columns=['ADDEPEV3', '_CASTHM1', 'CHCCOPD3', 'CHCSCNCR', 'CHCOCNCR', 'CVDCRHD4', 'CVDINFR4', 'CVDSTRK3', 'DIABETE4'], inplace=True)
df_21_trim.dropna(inplace= True)
print('df_19_trim New Shape_\t', df_21_trim.shape)

df_19_trim Original Shape_	 (330505, 33)
df_19_trim New Shape_	 (325245, 24) 

df_21_trim Original Shape_	 (338998, 33)
df_19_trim New Shape_	 (333899, 24)


<font color= yellow>Now we can move on to our last Comorbidity Variable for each year

In [14]:
###############################################################################################################################################################################
'''                                                                   Convert Response Column for 2019/2021                                                                 '''
###############################################################################################################################################################################
## Multiclass Response from the Binary Comorbidities
# First for 2019
df_19_trim['Comorbidity'] = df_19_trim.apply(lambda row: 1 if row[18] == 1 else (1 if row[19] == 1 else (1 if row[20] == 1 else (1 if row[21] == 1 else (1 if row[22] == 1 else (1 if row[23] == 1 else 0))))), axis=1)

# Then for 2021
df_21_trim['Comorbidity'] = df_21_trim.apply(lambda row: 1 if row[18] == 1 else (1 if row[19] == 1 else (1 if row[20] == 1 else (1 if row[21] == 1 else (1 if row[22] == 1 else (1 if row[23] == 1 else 0))))), axis=1)

### <font color= lightblue> Step 6: Make sure there are no NaNs left over

In [15]:
###############################################################################################################################################################################
'''                                                            Check for missing values again for 2019/2021                                                                 '''
###############################################################################################################################################################################
# output how many NaN there is per column
print('Missing values by field (Pre-Interpolate):')

# Visualize the finished Data Frame
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df_19_trim.isnull().sum(), '\n')
    print(df_21_trim.isnull().sum(), '\n')

# We can see that there is one column left in each that have a different name so let us fix this now
df_21_trim = df_21_trim.rename(columns={'INCOME3': 'INCOME2'})

# Recheck by visualizing the finished Data Frame
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df_19_trim.isnull().sum(), '\n')
    print(df_21_trim.isnull().sum(), '\n')

# Final shape of DF's
print('Final Shapes:\n', df_19_trim.shape, '\n', df_21_trim.shape)

Missing values by field (Pre-Interpolate):
_AGE_G         0
MARITAL        0
_SEX           0
INCOME2        0
_EDUCAG        0
_RACE          0
_METSTAT       0
CHECKUP1       0
BPMEDS         0
_TOTINDA       0
_SMOKER3       0
_DRNKWK1       0
_FRUTSU1       0
_VEGESU1       0
FRNCHDA_       0
_BMI5          0
CHCKDNY2       0
VETERAN3       0
Depression     0
Asthma         0
COPD           0
Cancer         0
Heart          0
Diabetes       0
Comorbidity    0
dtype: int64 

_AGE_G         0
MARITAL        0
_SEX           0
INCOME3        0
_EDUCAG        0
_RACE          0
_METSTAT       0
CHECKUP1       0
BPMEDS         0
_TOTINDA       0
_SMOKER3       0
_DRNKWK1       0
_FRUTSU1       0
_VEGESU1       0
FRNCHDA_       0
_BMI5          0
CHCKDNY2       0
VETERAN3       0
Depression     0
Asthma         0
COPD           0
Cancer         0
Heart          0
Diabetes       0
Comorbidity    0
dtype: int64 

_AGE_G         0
MARITAL        0
_SEX           0
INCOME2        0
_EDUCAG  

### <font color= lightblue> Step 6: Merge Data Frames

In [16]:
df = pd.concat([df_19_trim, df_21_trim])
df.head()

Unnamed: 0,_AGE_G,MARITAL,_SEX,INCOME2,_EDUCAG,_RACE,_METSTAT,CHECKUP1,BPMEDS,_TOTINDA,...,_BMI5,CHCKDNY2,VETERAN3,Depression,Asthma,COPD,Cancer,Heart,Diabetes,Comorbidity
0,6.0,2.0,2.0,3.0,1.0,2.0,1.0,1.0,1.0,2.0,...,2817.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1
1,6.0,1.0,2.0,5.0,3.0,1.0,1.0,1.0,2.0,1.0,...,1854.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1
2,6.0,3.0,2.0,7.0,4.0,2.0,1.0,1.0,1.0,1.0,...,3162.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,1.0,1
4,6.0,1.0,2.0,99.0,3.0,1.0,2.0,1.0,2.0,2.0,...,2148.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1
6,6.0,2.0,1.0,7.0,4.0,1.0,2.0,1.0,2.0,1.0,...,3298.0,2.0,2.0,0.0,0.0,1.0,0.0,0.0,1.0,1


In [17]:
df.describe()

Unnamed: 0,_AGE_G,MARITAL,_SEX,INCOME2,_EDUCAG,_RACE,_METSTAT,CHECKUP1,BPMEDS,_TOTINDA,...,_BMI5,CHCKDNY2,VETERAN3,Depression,Asthma,COPD,Cancer,Heart,Diabetes,Comorbidity
count,659144.0,659144.0,659144.0,659144.0,659144.0,659144.0,659144.0,659144.0,659144.0,659144.0,...,659144.0,659144.0,659144.0,659144.0,659144.0,659144.0,659144.0,659144.0,659144.0,659144.0
mean,4.42,2.31,1.53,18.67,3.06,1.97,1.31,1.43,1.67,1.25,...,2847.1,1.97,1.88,0.2,0.9,0.08,0.18,0.11,0.13,0.97
std,1.59,1.72,0.5,29.69,0.97,2.21,0.46,1.04,0.47,0.51,...,647.48,0.33,0.41,0.4,0.29,0.27,0.38,0.32,0.34,0.18
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1200.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,1.0,1.0,5.0,2.0,1.0,1.0,1.0,1.0,1.0,...,2412.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
50%,5.0,1.0,2.0,7.0,3.0,1.0,1.0,1.0,2.0,1.0,...,2741.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
75%,6.0,3.0,2.0,9.0,4.0,1.0,2.0,1.0,2.0,1.0,...,3162.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
max,6.0,9.0,2.0,99.0,9.0,9.0,2.0,9.0,2.0,9.0,...,9933.0,9.0,9.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [18]:
df.shape

(659144, 25)

<font color= white>The final Data Frames have the following Response Columns<font color= red>

* Depression
* Asthma
* COPD
* Cancer
* Heart
* Diabetes
* Comorbidity

<font color=white>And the following Feature Space<font color= red>

* _AGE_G
* MARITAL
* _SEX
* INCOME2
* _EDUCAG
* _RACE
* _METSTAT
* CHECKUP1
* BPMEDS
* _TOTINDA
* _SMOKER3
* _DRNKWK1
* _FRUTSU1
* _VEGESU1
* FRNCHDA_
* _BMI5
* CHCKDNY2
* VETERAN3

<font color=white>The shapes of the different Dat Frames are as follows:<font color= red>

1. 2019: <br>
325245, 25<br><br>
2. 2021: <br>
333899, 25<br><br>
3. Merged: <br>
659144, 25<br><br>

## Analysis 1
Using both years, run exploratory data analysis using crosstabs, visuals, and basic frequency distributions to understand how chronic conditions are distribituted across geography and demography. You may also use the newly created comorbidity variables for this analysis. 

- Summarize salient features of the healthiest and least healthy states in the country.
- Discuss if you noticed any associations of the risk factors such as Age, Sex, Income, Education, Marital Status etc. with the level of comorbidities, while comparing the two years of data.

## Analysis 2
1. Using only 2021, use several classification algorithms to classify comorbility variables 1 and 2:
- Logistic Regression
- KNN
- RF
- Gradient Boosting
- XGBoost
- Catboost

2. Write a short report describing the performance metrics. 
3. In the end, choose one model each for the classification of both categorical variables

### Import Packages Used

In [69]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
from catboost import CatBoostClassifier

### Train Test Split

In [71]:
# Get feature space and response
binary = ['Depression', 'Asthma', 'COPD', 'Cancer', 'Heart', 'Diabetes']
multi = ['Comorbidity']

# Feature space
X = df_21_trim.copy()
X.drop(binary, inplace=True, axis=1)
X.drop(multi, inplace=True, axis=1)
print('Feature space size: ', X.shape, '\n')

# Standarize
std = ['_DRNKWK1', '_FRUTSU1', '_VEGESU1', 'FRNCHDA_', '_BMI5']
scalar = StandardScaler()
X[std] = scalar.fit_transform(X[std])

# Response
y_binary = df_21_trim[binary].copy()
y_multi = df_21_trim[multi].copy()
print('Response size:')
print('Binary: ', y_binary.shape)
print('Multi: ', y_multi.shape, '\n')

# Over sample
ros = RandomOverSampler(random_state=0)

# Train Test Split
# Since we have many rows, 10% test is enough.
# Responses are highly imbalanced, so we over sample for each response.
X_train_binary, X_test_binary, y_train_binary, y_test_binary = {}, {}, {}, {}
for response in binary:
    # Over sample for each response.
    ros = RandomOverSampler(random_state=0)
    X_resampled_binary, y_resampled_binary = ros.fit_resample(X, y_binary[response])
    X_train, X_test, y_train, y_test = train_test_split(X_resampled_binary, y_resampled_binary, test_size=0.1, random_state=42)
    X_train_binary[response] = X_train.reset_index(drop=True)
    X_test_binary[response] = X_test.reset_index(drop=True)
    y_train_binary[response] = y_train.reset_index(drop=True)
    y_test_binary[response] = y_test.reset_index(drop=True)
    print('Train set for ', response, X_train.shape)

ros = RandomOverSampler(random_state=0)
X_resampled_multi, y_resampled_multi = ros.fit_resample(X, y_multi)
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_resampled_multi, y_resampled_multi, test_size=0.1, random_state=42)
X_train_multi.reset_index(inplace=True, drop=True)
X_test_multi.reset_index(inplace=True, drop=True)
y_train_multi.reset_index(inplace=True, drop=True)
y_test_multi.reset_index(inplace=True, drop=True)
print('Train set for multi case', X_train_multi.shape)


Feature space size:  (333899, 18) 

Response size:
Binary:  (333899, 6)
Multi:  (333899, 1) 

Train set for  Depression (479453, 18)
Train set for  Asthma (542296, 18)
Train set for  COPD (554112, 18)
Train set for  Cancer (495964, 18)
Train set for  Heart (536263, 18)
Train set for  Diabetes (522509, 18)
Train set for multi case (581113, 18)


### Logistic Regresison

#### Binary

In [73]:
# Configure estimator
lgr = LogisticRegression(random_state=42,
                         class_weight='balanced')
parameters = {
    'C': [1e-3, 1e-2, 1e-1, 1],
}

# Grid search for 5 reponses
for response in binary:
    clf = GridSearchCV(lgr, param_grid=parameters)
    clf.fit(X_train_binary[response], y_train_binary[response])
    y_predict = clf.predict(X_test_binary[response])
    print('For the response ', response, ', classification report on test set is:')
    print(classification_report(y_test_binary[response], y_predict))
    print('Best parameters are: ', clf.best_params_)
    print('Best score on train set is: ', clf.best_score_)
    print(pd.crosstab(y_predict, y_test_binary[response], rownames=['y_predict'], colnames=['y_test']))
    print()


For the response  Depression , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.63      0.65      0.64     26722
         1.0       0.64      0.62      0.63     26551

    accuracy                           0.64     53273
   macro avg       0.64      0.64      0.64     53273
weighted avg       0.64      0.64      0.64     53273

Best parameters are:  {'C': 1}
Best score on train set is:  0.6354637501815812
y_test      0.00   1.00
y_predict              
0.00       17454  10095
1.00        9268  16456

For the response  Asthma , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.61      0.59      0.60     30197
         1.0       0.60      0.62      0.61     30059

    accuracy                           0.61     60256
   macro avg       0.61      0.61      0.61     60256
weighted avg       0.61      0.61      0.61     60256

Best parameters are:  {'C': 0.001}
B

#### Multi

In [74]:
# Configure estimator
lgr = LogisticRegression(random_state=42,
                         class_weight='balanced')
parameters = {
    'C': [1e-3, 1e-2, 1e-1, 1],
}

# Grid search for multi response
clf = GridSearchCV(lgr, param_grid=parameters)
clf.fit(X_train_multi, y_train_multi)
y_predict = clf.predict(X_test_multi)
print('For the multi response, classification report on test set is:')
print(classification_report(y_test_multi, y_predict))
print('Best parameters are: ', clf.best_params_)
print('Best score on train set is: ', clf.best_score_)
print(pd.crosstab(y_predict, y_test_multi.Comorbidity, rownames=['y_predict'], colnames=['y_test']))

For the multi response, classification report on test set is:
              precision    recall  f1-score   support

           0       0.61      0.61      0.61     32363
           1       0.61      0.61      0.61     32206

    accuracy                           0.61     64569
   macro avg       0.61      0.61      0.61     64569
weighted avg       0.61      0.61      0.61     64569

Best parameters are:  {'C': 1}
Best score on train set is:  0.6092343494889769
y_test         0      1
y_predict              
0          19673  12628
1          12690  19578


### KNN

#### Binary

In [75]:
# Configure estimator
knn = KNeighborsClassifier(n_neighbors=5, leaf_size=10)

# Train and test for 5 reponses
for response in binary:
    knn.fit(X_train_binary[response], y_train_binary[response])
    y_predict = clf.predict(X_test_binary[response])
    print('For the response ', response, ', classification report on test set is:')
    print(classification_report(y_test_binary[response], y_predict))
    print(pd.crosstab(y_predict, y_test_binary[response], rownames=['y_predict'], colnames=['y_test']))


For the response  Depression , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.45      0.38      0.41     26722
         1.0       0.46      0.52      0.49     26551

    accuracy                           0.45     53273
   macro avg       0.45      0.45      0.45     53273
weighted avg       0.45      0.45      0.45     53273

y_test      0.00   1.00
y_predict              
0          10253  12646
1          16469  13905
For the response  Asthma , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.55      0.46      0.50     30197
         1.0       0.53      0.61      0.57     30059

    accuracy                           0.54     60256
   macro avg       0.54      0.54      0.54     60256
weighted avg       0.54      0.54      0.54     60256

y_test      0.00   1.00
y_predict              
0          14035  11667
1          16162  18392
For the response  CO

#### Multi

In [76]:
# Configure estimator
knn = KNeighborsClassifier(n_neighbors=5, leaf_size=10)


# Train and test for multi response
knn.fit(X_train_multi, y_train_multi)
y_predict = knn.predict(X_test_multi)
print('For the multi response, classification report on test set is:')
print(classification_report(y_test_multi, y_predict))
print(pd.crosstab(y_predict, y_test_multi.Comorbidity, rownames=['y_predict'], colnames=['y_test']))

For the multi response, classification report on test set is:
              precision    recall  f1-score   support

           0       0.91      1.00      0.95     32363
           1       1.00      0.90      0.95     32206

    accuracy                           0.95     64569
   macro avg       0.95      0.95      0.95     64569
weighted avg       0.95      0.95      0.95     64569

y_test         0      1
y_predict              
0          32363   3316
1              0  28890


### Random Forest

#### Binary

In [77]:
# Configure estimator
rf = RandomForestClassifier(n_jobs=5)


# Train and test for 5 reponses
for response in binary:
    rf.fit(X_train_binary[response], y_train_binary[response])
    y_predict = rf.predict(X_test_binary[response])
    print('For the response ', response, ', classification report on test set is:')
    print(classification_report(y_test_binary[response], y_predict))
    print(pd.crosstab(y_predict, y_test_binary[response], rownames=['y_predict'], colnames=['y_test']))


For the response  Depression , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.98      0.93      0.96     26722
         1.0       0.94      0.98      0.96     26551

    accuracy                           0.96     53273
   macro avg       0.96      0.96      0.96     53273
weighted avg       0.96      0.96      0.96     53273

y_test      0.00   1.00
y_predict              
0.00       24958    460
1.00        1764  26091
For the response  Asthma , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00     30197
         1.0       1.00      0.99      1.00     30059

    accuracy                           1.00     60256
   macro avg       1.00      1.00      1.00     60256
weighted avg       1.00      1.00      1.00     60256

y_test      0.00   1.00
y_predict              
0.00       30188    166
1.00           9  29893
For the response  CO

#### Multi

In [78]:
# Configure estimator
rf = RandomForestClassifier(n_jobs=5)

# Train and test for multi response
rf.fit(X_train_multi, y_train_multi)
y_predict = rf.predict(X_test_multi)
print('For the multi response, classification report on test set is:')
print(classification_report(y_test_multi, y_predict))
print(pd.crosstab(y_predict, y_test_multi.Comorbidity, rownames=['y_predict'], colnames=['y_test']))

For the multi response, classification report on test set is:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     32363
           1       1.00      1.00      1.00     32206

    accuracy                           1.00     64569
   macro avg       1.00      1.00      1.00     64569
weighted avg       1.00      1.00      1.00     64569

y_test         0      1
y_predict              
0          32363      4
1              0  32202


### Gradient Boost

#### Binary

In [79]:
# Configure estimator
gbc = GradientBoostingClassifier()

# Train and test for 5 reponses
for response in binary:
    gbc.fit(X_train_binary[response], y_train_binary[response])
    y_predict = gbc.predict(X_test_binary[response])
    print('For the response ', response, ', classification report on test set is:')
    print(classification_report(y_test_binary[response], y_predict))
    print(pd.crosstab(y_predict, y_test_binary[response], rownames=['y_predict'], colnames=['y_test']))
    print()


For the response  Depression , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.65      0.69      0.67     26722
         1.0       0.67      0.63      0.65     26551

    accuracy                           0.66     53273
   macro avg       0.66      0.66      0.66     53273
weighted avg       0.66      0.66      0.66     53273

y_test      0.00   1.00
y_predict              
0.00       18408   9757
1.00        8314  16794

For the response  Asthma , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.64      0.57      0.60     30197
         1.0       0.61      0.67      0.64     30059

    accuracy                           0.62     60256
   macro avg       0.62      0.62      0.62     60256
weighted avg       0.62      0.62      0.62     60256

y_test      0.00   1.00
y_predict              
0.00       17222   9853
1.00       12975  20206

For the response  

#### Multi

In [80]:
# Configure estimator
gbc = GradientBoostingClassifier()

# Train and test for multi response
gbc.fit(X_train_multi, y_train_multi)
y_predict = gbc.predict(X_test_multi)
print('For the multi response, classification report on test set is:')
print(classification_report(y_test_multi, y_predict))
print(pd.crosstab(y_predict, y_test_multi.Comorbidity, rownames=['y_predict'], colnames=['y_test']))

For the multi response, classification report on test set is:
              precision    recall  f1-score   support

           0       0.62      0.68      0.64     32363
           1       0.64      0.58      0.61     32206

    accuracy                           0.63     64569
   macro avg       0.63      0.63      0.63     64569
weighted avg       0.63      0.63      0.63     64569

y_test         0      1
y_predict              
0          21880  13636
1          10483  18570


### XGBoost

#### Binary

In [81]:
# Configure estimator
xbc = xgb.XGBClassifier(n_estimators=100,
						max_depth=10,
						eta=0.01,
						min_child_weight=5,
						random_state=100)

# Train and test for 5 reponses
for response in binary:
    xbc.fit(X_train_binary[response], y_train_binary[response])
    y_predict = xbc.predict(X_test_binary[response])
    print('For the response ', response, ', classification report on test set is:')
    print(classification_report(y_test_binary[response], y_predict))
    print(pd.crosstab(y_predict, y_test_binary[response], rownames=['y_predict'], colnames=['y_test']))
    print()

For the response  Depression , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.67      0.70      0.68     26722
         1.0       0.68      0.65      0.67     26551

    accuracy                           0.68     53273
   macro avg       0.68      0.68      0.68     53273
weighted avg       0.68      0.68      0.68     53273

y_test      0.00   1.00
y_predict              
0          18685   9185
1           8037  17366

For the response  Asthma , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.68      0.61      0.64     30197
         1.0       0.64      0.71      0.67     30059

    accuracy                           0.66     60256
   macro avg       0.66      0.66      0.66     60256
weighted avg       0.66      0.66      0.66     60256

y_test      0.00   1.00
y_predict              
0          18443   8850
1          11754  21209

For the response  

#### Multi

In [82]:
# Configure estimator
xbg = xgb.XGBClassifier(n_estimators=100,
						max_depth=10,
						eta=0.01,
						min_child_weight=5,
						random_state=100)

# Train and test for multi response
xbg.fit(X_train_multi, y_train_multi)
y_predict = xbg.predict(X_test_multi)
print('For the multi response, classification report on test set is:')
print(classification_report(y_test_multi, y_predict))
print(pd.crosstab(y_predict, y_test_multi.Comorbidity, rownames=['y_predict'], colnames=['y_test']))

For the multi response, classification report on test set is:
              precision    recall  f1-score   support

           0       0.67      0.79      0.73     32363
           1       0.75      0.61      0.67     32206

    accuracy                           0.70     64569
   macro avg       0.71      0.70      0.70     64569
weighted avg       0.71      0.70      0.70     64569

y_test         0      1
y_predict              
0          25720  12630
1           6643  19576


### Catboost

#### Binary

In [83]:
# Configure estimator
cbc = CatBoostClassifier(iterations=2,
                           depth=2,
                           learning_rate=1,
                           loss_function='Logloss',
                           verbose=True)

# Train and test for 5 reponses
for response in binary:
    cbc.fit(X_train_binary[response], y_train_binary[response])
    y_predict = cbc.predict(X_test_binary[response])
    print('For the response ', response, ', classification report on test set is:')
    print(classification_report(y_test_binary[response], y_predict))
    print(pd.crosstab(y_predict, y_test_binary[response], rownames=['y_predict'], colnames=['y_test']))
    print()

0:	learn: 0.6678843	total: 16.8ms	remaining: 16.8ms
1:	learn: 0.6509854	total: 29.1ms	remaining: 0us
For the response  Depression , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.60      0.70      0.65     26722
         1.0       0.64      0.54      0.59     26551

    accuracy                           0.62     53273
   macro avg       0.62      0.62      0.62     53273
weighted avg       0.62      0.62      0.62     53273

y_test      0.00   1.00
y_predict              
0.00       18704  12242
1.00        8018  14309

0:	learn: 0.6696567	total: 19.5ms	remaining: 19.5ms
1:	learn: 0.6641108	total: 36.6ms	remaining: 0us
For the response  Asthma , classification report on test set is:
              precision    recall  f1-score   support

         0.0       0.60      0.59      0.59     30197
         1.0       0.59      0.60      0.60     30059

    accuracy                           0.60     60256
   macro avg       0.6

#### Multi

In [84]:
# Configure estimator
cbc = CatBoostClassifier(iterations=2,
                           depth=2,
                           learning_rate=1,
                           loss_function='Logloss',
                           verbose=True)

# Train and test for multi response
cbc.fit(X_train_multi, y_train_multi)
y_predict = cbc.predict(X_test_multi)
print('For the multi response, classification report on test set is:')
print(classification_report(y_test_multi, y_predict))
print(pd.crosstab(y_predict, y_test_multi.Comorbidity, rownames=['y_predict'], colnames=['y_test']))

0:	learn: 0.6711616	total: 18.8ms	remaining: 18.8ms
1:	learn: 0.6655812	total: 31ms	remaining: 0us
For the multi response, classification report on test set is:
              precision    recall  f1-score   support

           0       0.58      0.70      0.64     32363
           1       0.63      0.50      0.55     32206

    accuracy                           0.60     64569
   macro avg       0.61      0.60      0.60     64569
weighted avg       0.60      0.60      0.60     64569

y_test         0      1
y_predict              
0          22763  16174
1           9600  16032


### Analysis 2 Report

At first, I was trying to use a single train set and test set for all those chronic conditions and all models. But when I finished running those models, the performance on each condition is really poor that the model tend to predict most samples into one class. The multiclass case is even worse that actually all test cases are classified as 0 for all models.

The reason is that the classes are highly imbalanced for each chronic condition since only a small population has those conditions. Moreover, the population who have each condition is different from each other. So I over sampled for each condition, and over sampled again for the multiclass case. The resulting performance is then improved dramatically.

There are 5 features with a large range, `'_DRNKWK1', '_FRUTSU1', '_VEGESU1', 'FRNCHDA_', '_BMI5'`, that may affect the performance. So standardization is applied to those features. All other features have small range with 10, all they are categorical, so the rest are kept unchanged. The resulting performance is then slightly improved.

The test accuracy for Logistic Regression ranges from 0.61 to 0.73 for binary cases, and is 0.61 for the multiclass case. KNN has poorer performance for the binary cases ranging from 0.45 to 0.67. But in multiclass situation, it archives 0.95 accuracy which is a huge improvement. Random Forest has accuracy from 0.96 to 1.00 on binary cases and archives 1.00 accuracy on the multiclass situation. Scores of Gradient Boost is from 0.62 to 0.75 for binary and 0.63 for the multiclass case. XGBoost has accuracy for binary cases from 0.66 to 0.78 and 0.70 for multiclass case. CatBoost has accuracy for binary cases from 0.60 to 0.72 and 0.60 for multiclass case.

Among those models random forest would be chosen as the model for classification because of its impressive performance. KNN could also be used for classify the multiclass category as it has a relative high score.