# ODSC Predictive Modeling Workshop

11/4/2017

## Prerequisites 

Download Data: https://goo.gl/AoR7xn

Anaconda: https://www.anaconda.com/download/

## Load Packages

In [None]:
%matplotlib inline 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd
import numpy as np
import seaborn

## Data Import


The dataset is extracted from a research study that can be found online. <br>
https://www.hindawi.com/journals/bmri/2014/781670/sup/

Data dictionary<br>
https://www.hindawi.com/journals/bmri/2014/781670/tab1/

The business problem is defined as<br>
"**Can we predict which patients will be readmitted to the hospital within 30 days of discharge**?"

Here is an article on the background of hospital readmission. In short, one big problem in the healthcare industry today is the rising cost of patient readmission.
https://www.speechmed.com/cost-hospital-readmission/


The data set we are using today has been downsampled for demo purposes.

First, let's read the Excel table into a pandas DataFrame.
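
The import cell itself is not shown here. A minimal sketch of what it could look like, assuming the downloaded file is saved next to the notebook as `readmission_data.xlsx` (a hypothetical name, so adjust it to your download) and that an Excel engine such as `openpyxl` is installed:

```python
from pathlib import Path

import pandas as pd

# Hypothetical file name: replace with the path of the file you downloaded.
DATA_PATH = Path("readmission_data.xlsx")


def load_data(path):
    """Read the first sheet of an Excel workbook into a pandas DataFrame."""
    return pd.read_excel(path)


if DATA_PATH.exists():
    df = load_data(DATA_PATH)
    print(df.shape)  # (number of rows, number of columns)
else:
    print(f"{DATA_PATH} not found; point DATA_PATH at your download.")
```

The rest of the notebook assumes the resulting DataFrame is named `df`.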

## Exploratory Data Analysis

The first thing you always want to do is to look at the raw data file. No summary stats can be more intuitive than actually looking at the raw data.

In [None]:
df[:3].transpose()

There is a combination of categorical, numeric and text features in the data set. There are missing values in some columns, such as weight. For this workshop, we will exclude text features from the analysis.
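
One way to see which columns are numeric and which are not is to group them by dtype. This is only a sketch on a toy DataFrame with illustrative column names. Note that pandas cannot tell categorical columns from free-text columns by dtype alone, so that distinction still has to be made by hand.

```python
import pandas as pd

# Toy data for illustration only; in the notebook, run the same calls on df.
toy = pd.DataFrame({
    "time_in_hospital": [1, 3, 2],
    "gender": ["Female", "Male", "Female"],
    "diag_1_desc": ["free text a", "free text b", "free text c"],
})

# Numeric columns (integers and floats).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else: categorical and text columns end up together here.
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```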

Next, let's get some summary stats for the numeric features.

In [None]:
df.describe()

In [None]:
# density plot
df.number_inpatient.plot.density()

In [None]:
# histogram
df.number_inpatient.plot.hist()

Look for a few things:
- Anomalous values, e.g. outliers, values that do not make business sense, etc.
- Extremely skewed features, i.e. a big difference between mean and median
- Features with few unique values, e.g. min and 75% values are the same, etc.
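
The checks in the list above can be collected into one small table. This is a sketch on toy data, not part of the original workshop code; the column names are illustrative and any thresholds you apply to the output are a judgment call.

```python
import pandas as pd

# Toy data for illustration only; in the notebook, use the numeric columns of df.
toy = pd.DataFrame({
    "number_inpatient": [0, 0, 0, 0, 0, 0, 0, 0, 3, 12],
    "num_procedures": [1, 2, 3, 2, 1, 3, 2, 1, 2, 3],
})

checks = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),          # large positive values indicate a long right tail
    "n_unique": toy.nunique(),   # very low counts flag near-constant features
    "min_eq_q75": toy.min() == toy.quantile(0.75),  # min and 75% value coincide
})
print(checks)
```

In this toy example `number_inpatient` is flagged on both counts: its mean sits well above its median, and its minimum equals its 75th percentile.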

Take a look at frequency tables for categorical features.

In [None]:
# frequency table
df.weight.value_counts(sort=False, dropna=False)

The frequency table shows that the weight column has a large percentage of missing values. This feature may not add much value to the model, but we will keep it in the data for now.
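
To read the missing share directly as a proportion rather than a raw count, `value_counts` accepts `normalize=True`. A short sketch on a toy column (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy column for illustration; in the notebook this would be df.weight.
weight = pd.Series(
    ["[75-100)", np.nan, np.nan, "[50-75)", np.nan, np.nan, np.nan, np.nan]
)

# normalize=True turns counts into proportions; dropna=False keeps NaN as its own row.
shares = weight.value_counts(normalize=True, dropna=False)
print(shares)

# Shortcut for just the missing share of a single column.
missing_share = weight.isnull().mean()
print(f"missing: {missing_share:.0%}")
```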

In [None]:
# bar chart
pd.crosstab(df.weight, columns='N').plot.bar()

In general, you would continue reviewing each feature to flag potential problems. Let's assume that we have done that and move on.

## Data Preparation

### Missing value imputation

Take a look at the missing value distribution in the data set.

In [None]:
df.isnull().sum()

There are many ways to impute missing values, depending on the variable type and the algorithm used to train the model:
- constant 
- median imputation 
- forward fill 
- backward fill 
- frequency based 
- model based 
- missing value indicator
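
Most of the options listed above map to a one-line pandas call. The sketch below uses toy series for illustration; model-based imputation is left out because it needs a fitted estimator from a separate library.

```python
import numpy as np
import pandas as pd

# Toy series for illustration only.
num = pd.Series([1.0, np.nan, 3.0, np.nan, 10.0])
cat = pd.Series(["a", None, "b", "a", None])

constant = num.fillna(0)               # constant
median = num.fillna(num.median())      # median imputation
forward = num.ffill()                  # forward fill: carry the last seen value down
backward = num.bfill()                 # backward fill: pull the next seen value up
frequent = cat.fillna(cat.mode()[0])   # frequency based: most common category
indicator = num.isnull().astype(int)   # missing value indicator (1 = was missing)

print(pd.DataFrame({
    "num": num, "constant": constant, "median": median,
    "ffill": forward, "bfill": backward, "was_missing": indicator,
}))
print(frequent.tolist())
```

A missing value indicator is usually kept alongside an imputed column rather than instead of it, so the model can still see that the value was originally absent.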

In [None]:
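# impute missing numeric values with the column median
# (numeric_only=True skips non-numeric columns; newer pandas versions raise an error without it)
df = df.fillna(df.median(numeric_only=True))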

In [None]:
# encode remaining (categorical) missing values as a separate 'MISSING' category
df = df.fillna(value='MISSING')

In [None]:
# check imputation
df.isnull().sum()

### Split data into feature columns and target column

**We will exclude text features for this workshop.**