# Assignment - Decision Tree Classification

In this assignment, we will focus on healthcare. This data set is made available by the Center for Clinical and Translational Research, Virginia Commonwealth University. It contains data about 10 years of clinical care at 130 US Hospitals. Each row represents a single patient. The columns include the characteristics of deidentified diabetes patients. This is a binary classification task: predict whether a diabetes patient is readmitted to the hospital within 30 days of their discharge (1=Yes, 0=No). This is an important performance metric for hospitals as they try to minimize these types of readmissions.

## Description of Variables

The description of variables are provided in "Healthcare - Data Dictionary.docx"

## Goal

Use the **healthcare.csv** data set and build a model to predict **readmitted**. Build (at least) **two decision tree** models.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code is also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.



# Read and Prepare the Data
## Also, perform feature engineering: create one new variable from existing ones

In [1]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(42)


In [2]:
#We will predict the "price_gte_150" value in the data set:

hlthce = pd.read_csv("healthcare.csv")
hlthce.head()

Unnamed: 0,race,gender,age,admission_type,discharge_disposition,admission_source,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,...,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,insulin,change,diabetesMed,readmitted
0,Other,Female,70-80,2,3,1,14,,InternalMedicine,32,...,486.0,404,428,9,,,No,No,No,1
1,Caucasian,Female,80-90,1,3,5,4,MC,,44,...,38.0,438,599,9,,,Steady,Ch,Yes,0
2,AfricanAmerican,Male,50-60,5,1,1,6,HM,,29,...,296.0,585,428,9,,,Up,Ch,Yes,1
3,Caucasian,Female,50-60,1,1,6,3,HM,InternalMedicine,47,...,250.02,401,493,4,,>8,No,Ch,Yes,0
4,AfricanAmerican,Female,40-50,3,1,1,4,UN,Hematology,92,...,486.0,287,595,7,,>7,No,No,No,0


In [3]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(hlthce, test_size=0.3)

In [4]:
train_set.isna().sum()

race                      114
gender                      0
age                         0
admission_type              0
discharge_disposition       0
admission_source            0
time_in_hospital            0
payer_code               2492
medical_specialty        3043
num_lab_procedures          0
num_procedures              0
num_medications             0
number_outpatient           0
number_emergency            0
number_inpatient            0
diag_1                      3
diag_2                     12
diag_3                     62
number_diagnoses            0
max_glu_serum               0
A1Cresult                   0
insulin                     0
change                      0
diabetesMed                 0
readmitted                  0
dtype: int64

In [5]:
test_set.isna().sum()

race                       60
gender                      0
age                         0
admission_type              0
discharge_disposition       0
admission_source            0
time_in_hospital            0
payer_code               1032
medical_specialty        1306
num_lab_procedures          0
num_procedures              0
num_medications             0
number_outpatient           0
number_emergency            0
number_inpatient            0
diag_1                      0
diag_2                      8
diag_3                     36
number_diagnoses            0
max_glu_serum               0
A1Cresult                   0
insulin                     0
change                      0
diabetesMed                 0
readmitted                  0
dtype: int64

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer


In [7]:
# We can't use the following columns in this tutorial, because they are not for binary classification tasks
#NOTE FROM GLORIA - dropping these to work on seeing if they do anything and experince dropping items. 

train = train_set.drop(['race', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3'], axis=1)
test = test_set.drop(['race', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3'], axis=1)

In [8]:
train.shape

(6066, 19)

In [9]:
test.shape

(2600, 19)

In [10]:
train.head()

Unnamed: 0,gender,age,admission_type,discharge_disposition,admission_source,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,max_glu_serum,A1Cresult,insulin,change,diabetesMed,readmitted
4166,Male,80-90,1,3,7,4,39,1,19,5,0,0,9,,>7,Steady,Ch,Yes,1
5546,Female,80-90,1,1,7,8,73,0,16,0,0,0,9,,,No,No,Yes,0
2957,Male,60-70,1,1,7,2,41,2,19,0,0,2,8,,,No,No,No,1
6329,Male,40-50,1,1,7,4,54,1,20,0,0,0,5,,>7,Steady,No,Yes,0
565,Female,70-80,3,1,1,2,15,2,8,0,0,1,9,,,Steady,No,Yes,0


In [11]:
test.head()

Unnamed: 0,gender,age,admission_type,discharge_disposition,admission_source,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,max_glu_serum,A1Cresult,insulin,change,diabetesMed,readmitted
3936,Male,50-60,1,1,7,6,56,0,14,0,0,0,9,,,No,No,Yes,0
6333,Male,70-80,1,1,7,5,65,0,10,0,0,0,9,,Norm,No,No,Yes,0
5639,Female,70-80,1,3,7,3,70,0,18,0,0,0,9,,,Down,Ch,Yes,1
2036,Female,80-90,1,1,7,5,44,0,20,0,0,1,9,,,Steady,No,Yes,1
841,Male,60-70,1,1,7,3,53,3,7,0,0,0,7,,>7,No,No,Yes,0


In [12]:
train.describe()

Unnamed: 0,admission_type,discharge_disposition,admission_source,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,readmitted
count,6066.0,6066.0,6066.0,6066.0,6066.0,6066.0,6066.0,6066.0,6066.0,6066.0,6066.0,6066.0
mean,1.995384,4.179525,5.802176,4.577481,43.734257,1.338938,16.384603,0.375701,0.277613,0.847511,7.564293,0.468678
std,1.431933,5.857465,4.039676,3.051092,19.423605,1.699481,8.084403,1.28025,1.432545,1.601625,1.842855,0.499059
min,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
25%,1.0,1.0,1.0,2.0,33.0,0.0,11.0,0.0,0.0,0.0,6.0,0.0
50%,1.0,1.0,7.0,4.0,45.0,1.0,15.0,0.0,0.0,0.0,9.0,0.0
75%,3.0,6.0,7.0,6.0,57.0,2.0,20.0,0.0,0.0,1.0,9.0,1.0
max,8.0,28.0,20.0,14.0,109.0,6.0,67.0,40.0,63.0,19.0,15.0,1.0


In [32]:
train_y = train['readmitted']
test_y = test['readmitted']

train_inputs = train.drop(['readmitted'], axis=1)
test_inputs = test.drop(['readmitted'], axis=1)

In [33]:
train_y.describe()

count    6066.000000
mean        0.468678
std         0.499059
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: readmitted, dtype: float64

In [34]:
test_y.describe()

count    2600.000000
mean        0.463462
std         0.498759
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: readmitted, dtype: float64

In [35]:
def new_col(df):
    #Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()
    
    df1['per_person_amt'] = np.where(df1['num_procedures'] > 0, 1, 0)
    
    return df1[['per_person_amt']]
    # You can use this to check whether the calculation is made correctly:
    #return df1

In [36]:
#Let's test the new function:

# Send train set to the function we created
new_col(train_set)

Unnamed: 0,per_person_amt
4166,1
5546,0
2957,1
6329,1
565,1
...,...
5734,1
5191,0
5390,0
860,1


In [37]:
train_inputs.dtypes

gender                   object
age                      object
admission_type            int64
discharge_disposition     int64
admission_source          int64
time_in_hospital          int64
num_lab_procedures        int64
num_procedures            int64
num_medications           int64
number_outpatient         int64
number_emergency          int64
number_inpatient          int64
number_diagnoses          int64
max_glu_serum            object
A1Cresult                object
insulin                  object
change                   object
diabetesMed              object
dtype: object

In [38]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [None]:
categorical_columns = ['A1Cresult','admission_source', 'admission_type', 'age', 'change', 
                       'diabetesMed', 'discharge_disposition', 'gender', 'insulin', 'max_glu_serum']

# binary_columns = ['per_person_amt']

numeric_columns = ['num_lab_procedures', 'num_medications', 'num_procedures', 'number_diagnoses', 
                   'number_emergency', 'number_inpatient','number_outpatient', 'time_in_hospital']

In [39]:
numeric_columns

['admission_type',
 'discharge_disposition',
 'admission_source',
 'time_in_hospital',
 'num_lab_procedures',
 'num_procedures',
 'num_medications',
 'number_outpatient',
 'number_emergency',
 'number_inpatient',
 'number_diagnoses']

In [21]:
categorical_columns

['gender',
 'age',
 'max_glu_serum',
 'A1Cresult',
 'insulin',
 'change',
 'diabetesMed']

In [23]:
#binary_columns

In [40]:
feat_eng_columns = ['per_person_amt']

In [41]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [42]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [43]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [44]:
my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col))])

In [45]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        # ('binary', binary_transformer, binary_columns),
        ('trans', my_new_column, feat_eng_columns)],
        remainder='passthrough')

#passtrough is an optional step. You don't have to use it.

In [46]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

ValueError: A given column is not a column of the dataframe

In [None]:
train_x.shape

# Baseline:

# Decision Tree Model 1:

# Decision Tree Model 2:

# Discussion

Briefly answer the following questions: (2 points) 
1) Which model performs the best (and why)?<br>
2) What is the baseline?<br>
3) Does the best model perform better than the baseline (and why)?<br>
4) Does the best model exhibit any overfitting; what did you do about it?