# Assignment - Decision Tree Classification

In this assignment, we will focus on healthcare. This data set is made available by the Center for Clinical and Translational Research, Virginia Commonwealth University. It contains data about 10 years of clinical care at 130 US hospitals. Each row represents a single patient. The columns include the characteristics of deidentified diabetes patients. This is a binary classification task: predict whether a diabetes patient is readmitted to the hospital within 30 days of discharge (1 = Yes, 0 = No). This is an important performance metric for hospitals, as they try to minimize these types of readmissions.

## Description of Variables

The variables are described in "Healthcare - Data Dictionary.docx".

## Goal

Use the **healthcare.csv** data set and build a model to predict **readmitted**. Build (at least) **two decision tree** models.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.



# Read and Prepare the Data
## Also, perform feature engineering: create one new variable from existing ones

In [None]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(42)


In [None]:
# We will predict the "readmitted" value in the data set:

hlthce = pd.read_csv("healthcare.csv")
hlthce.head()

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(hlthce, test_size=0.3)

In [None]:
train_set.isna().sum()

In [None]:
test_set.isna().sum()

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer


In [None]:
# Drop columns that will not be used as features in this notebook.
# Note from Gloria: dropping these to see whether they have any effect and to get experience dropping columns.

train = train_set.drop(['race', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3'], axis=1)
test = test_set.drop(['race', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3'], axis=1)

In [None]:
train.shape

In [None]:
test.shape

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.describe()

In [None]:
# Target variable (readmitted within 30 days: 1 = Yes, 0 = No)
train_1 = train['readmitted']
test_2 = test['readmitted']

train_inputs = train.drop(['readmitted'], axis=1)
test_inputs = test.drop(['readmitted'], axis=1)

In [None]:
train_1.describe()

In [None]:
test_2.describe()

In [None]:
def new_col(df):
    """Return a one-column DataFrame flagging patients with at least one procedure."""
    # Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()

    # 1 if the patient had one or more procedures, otherwise 0
    df1['per_person_amt'] = np.where(df1['num_procedures'] > 0, 1, 0)

    # To check whether the calculation is correct, temporarily return df1 instead
    return df1[['per_person_amt']]

In [None]:
#Let's test the new function:

# Send train set to the function we created
new_col(train_set)

In [None]:
train_inputs.dtypes

In [None]:
categorical_columns = ['A1Cresult','admission_source', 'admission_type', 'age', 'change', 
                       'diabetesMed', 'discharge_disposition', 'gender', 'insulin', 'max_glu_serum']

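# Assumption: no input columns are handled as binary here. The empty list keeps
# the later cells that reference binary_columns runnable.
binary_columns = []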

numeric_columns = ['num_lab_procedures', 'num_medications', 'num_procedures', 'number_diagnoses', 
                   'number_emergency', 'number_inpatient','number_outpatient', 'time_in_hospital']

In [None]:
numeric_columns

In [None]:
categorical_columns

In [None]:
binary_columns

In [None]:
# Input column passed to new_col(), which derives 'per_person_amt' from it
feat_eng_columns = ['num_procedures']

In [None]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [None]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [None]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [None]:
my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col))])

In [None]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns),
        ('trans', my_new_column, feat_eng_columns)],
        remainder='passthrough')

# remainder='passthrough' is optional: it keeps any columns not listed above unchanged.

In [None]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

In [None]:
train_x.shape
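
The test inputs need to go through the same fitted preprocessor, using `transform` rather than `fit_transform`, so that the scaling statistics and one-hot categories learned from the training data are reused. A minimal sketch on a made-up two-column frame (the `toy_*` names and values are illustrative assumptions, not the healthcare data):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the test frame contains an age band unseen in training
toy_train = pd.DataFrame({"age": ["[50-60)", "[60-70)", "[50-60)"],
                          "time_in_hospital": [3, 7, 2]})
toy_test = pd.DataFrame({"age": ["[70-80)", "[60-70)"],
                         "time_in_hospital": [4, 1]})

toy_pre = ColumnTransformer([
    ("num", StandardScaler(), ["time_in_hospital"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["age"]),
])

# Fit on the training data only, then reuse the fitted transformer on the test data
toy_train_x = toy_pre.fit_transform(toy_train)
toy_test_x = toy_pre.transform(toy_test)

# Both outputs have the same number of columns
print(toy_train_x.shape, toy_test_x.shape)
```

In this notebook the analogous call would be `preprocessor.transform(test_inputs)`. Because of `handle_unknown='ignore'`, a category that appears only in the test set is encoded as all zeros instead of raising an error.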

# Baseline:
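
One common baseline for a binary classification task is to always predict the majority class. The sketch below shows the idea with scikit-learn's `DummyClassifier` on made-up labels (the `*_demo` arrays are illustrative assumptions, not the healthcare data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only: 8 negatives, 2 positives
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_acc)  # 0.8, the share of the majority class
```

In this notebook the same pattern could be fitted on `train_x` and `train_1`, then scored on the training data and on the transformed test inputs.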

# Decision Tree Model 1:
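
A first model could be a single tree with its depth and leaf size limited by hand. The sketch below uses synthetic data from `make_classification` so that it runs on its own. The hyperparameter values are arbitrary starting points, not recommended settings for the healthcare data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed inputs and target
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Limit depth and leaf size so the tree cannot memorize the training data
tree_1 = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=42)
tree_1.fit(X_tr, y_tr)

# Report both scores; a large gap between them suggests overfitting
train_acc = accuracy_score(y_tr, tree_1.predict(X_tr))
test_acc = accuracy_score(y_te, tree_1.predict(X_te))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

In this notebook, `train_x` and `train_1` would take the place of `X_tr` and `y_tr`.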

# Decision Tree Model 2:
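
A second model could pick its hyperparameters by cross-validated search instead of by hand. The sketch below uses `GridSearchCV` on synthetic data. The grid values are illustrative assumptions, and a randomized search would be an equally reasonable choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed inputs and target
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Small, hypothetical grid over two regularization hyperparameters
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [5, 20]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=3, scoring="accuracy")
grid.fit(X_tr, y_tr)

# Best estimator refitted on the full training split
best_tree = grid.best_estimator_
print(grid.best_params_)
print(f"test accuracy: {best_tree.score(X_te, y_te):.3f}")
```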

# Discussion

Briefly answer the following questions: (2 points) 
1) Which model performs the best (and why)?<br>
2) What is the baseline?<br>
3) Does the best model perform better than the baseline (and why)?<br>
4) Does the best model exhibit any overfitting; what did you do about it?
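
One way to check for overfitting is to compare training and test accuracy for an unrestricted tree and for a depth-limited one. The sketch below does this on synthetic data. The names and the `max_depth=3` setting are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed inputs and target
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

models = {
    "full tree": DecisionTreeClassifier(random_state=42),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = (accuracy_score(y_tr, model.predict(X_tr)),
                    accuracy_score(y_te, model.predict(X_te)))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

A fully grown tree typically reaches perfect training accuracy, so the gap between its training and test scores gives a rough measure of how much it overfits.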