__DD:__ To install ml_metrics I had to open the Anaconda command prompt and run __pip install ml_metrics__.  My understanding is that pip should generally be avoided inside an Anaconda environment, but I had no other option.
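If the extra pip dependency is a concern, RMSE can also be computed from its definition with NumPy alone, since it is the square root of the mean squared error. The sketch below is a possible substitute, not what this notebook uses; the name `rmse_np` is made up for illustration.

```python
import numpy as np

def rmse_np(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Errors are 3 and 4, so the result is sqrt((9 + 16) / 2) = sqrt(12.5)
print(rmse_np([3.0, 5.0], [0.0, 1.0]))
```

The same value can be obtained by taking `np.sqrt` of the result of sklearn's `mean_squared_error`, which is already imported below.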

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from ml_metrics import rmse
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

In [2]:
boston = load_boston()

# Convert the matrix to pandas
bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos['MEDV'] = boston.target
#bos.head()
#bos.describe()

__Step 1:__
Use sklearn.datasets to get the Boston Housing dataset.  Fit a linear regressor to the data as a baseline.  There is no need to do Cross-Validation.  We will simply be exploring the change in results.

In [3]:
train_set = bos.sample(frac=0.7, random_state=100)
# Test set = every row that was not sampled into the training set
test_set = bos.drop(train_set.index)
# Get the training and testing row indices for later use
train_index = train_set.index.values.astype(int)
test_index = test_set.index.values.astype(int)

# Convert the training and testing sets back to NumPy arrays
X_train = train_set.iloc[:, :-1].values # training features (every column except the target)
Y_train = train_set.iloc[:, -1].values # training target (MEDV, the last column)
X_test = test_set.iloc[:, :-1].values # testing features
Y_test = test_set.iloc[:, -1].values # testing target

# Fit a linear regression to the training data
reg = LinearRegression(normalize=True).fit(X_train, Y_train)

#Predict
Y_pred = reg.predict(X_test)

#Get measures
orig_mae = mean_absolute_error(Y_test,Y_pred)
orig_mse = mean_squared_error(Y_test,Y_pred)
orig_rmse_val = rmse(Y_test,Y_pred)
orig_r2 = r2_score(Y_test,Y_pred)

In [4]:
res_frame = pd.DataFrame({'data':'original',
                   'imputation':'none',
                   'mae': orig_mae, 
                   'mse': orig_mse, 
                   'rmse':orig_rmse_val, 
                   'R2':orig_r2,
                   'mae_diff':np.nan,
                   'mse_diff':np.nan,
                   'rmse_diff':np.nan,
                   'R2_diff':np.nan}, index=[0])

__Question 1:__ What is the loss and what are the goodness of fit parameters?  This will be our baseline for comparison.


In [5]:
res_frame

| | data | imputation | mae | mse | rmse | R2 | mae_diff | mse_diff | rmse_diff | R2_diff |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | original | none | 3.604571 | 24.098505 | 4.909023 | 0.70494 | NaN | NaN | NaN | NaN |


__Step 2:__ (repeat for each percentage value below)
Select 1%, 5%, 10%, 20%, 33%, and 50% of your data in a single column, completely at random [hold that column selection constant throughout all iterations], replace the original values with NaN (i.e., “not a number”, e.g., np.nan), and then impute the missing values.

__Question 2:__ In each case [1%, 5%, 10%, 20%, 33%, 50%] perform a fit with the imputed data and compare the loss and goodness of fit to your baseline.  [Note: you should have (6) models to compare against your baseline at this point.]
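The mask-then-impute procedure for all six percentages can also be written as one helper called in a loop. The following is a minimal sketch on synthetic stand-in data (not the Boston dataset); the helper name `mcar_mean_impute` and the `demo` frame are assumptions made for illustration, and mean imputation is used because that is the method chosen later in this notebook.

```python
import numpy as np
import pandas as pd

def mcar_mean_impute(df, column, frac, seed=99):
    """Blank out `frac` of `column` completely at random, then mean-impute it."""
    out = df.copy()
    # Rows are drawn uniformly at random, independent of any column's values (MCAR)
    missing_idx = out.sample(frac=frac, random_state=seed).index
    out.loc[missing_idx, column] = np.nan
    # Fill the gaps with the mean of the values that are still observed
    out[column] = out[column].fillna(out[column].mean())
    return out, missing_idx

# Synthetic stand-in data with the same column names used in this notebook
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "DIS": rng.uniform(1, 12, 200),
    "MEDV": rng.uniform(5, 50, 200),
})

imputed = {}
for frac in [0.01, 0.05, 0.10, 0.20, 0.33, 0.50]:
    imputed[frac], idx = mcar_mean_impute(demo, "DIS", frac)
    print(frac, len(idx))
```

Keeping the seed fixed across fractions mirrors the constant `random_state` used in the cells below.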


In [6]:
in_sample1 = bos.sample(frac=0.01, random_state=99)
in_sample5 = bos.sample(frac=0.05, random_state=99)
in_sample10 = bos.sample(frac=0.1, random_state=99)
in_sample20 = bos.sample(frac=0.2, random_state=99)
in_sample33 = bos.sample(frac=0.33, random_state=99)
in_sample50 = bos.sample(frac=0.50, random_state=99)

# Rows that were not selected for masking at each percentage
out_sample1 = bos.drop(in_sample1.index)
out_sample5 = bos.drop(in_sample5.index)
out_sample10 = bos.drop(in_sample10.index)
out_sample20 = bos.drop(in_sample20.index)
out_sample33 = bos.drop(in_sample33.index)
out_sample50 = bos.drop(in_sample50.index)

In [7]:
print(out_sample1.shape[0] + in_sample1.shape[0])
print(out_sample5.shape[0] + in_sample5.shape[0])
print(out_sample10.shape[0] + in_sample10.shape[0])
print(out_sample20.shape[0] + in_sample20.shape[0])
print(out_sample33.shape[0] + in_sample33.shape[0])
print(out_sample50.shape[0] + in_sample50.shape[0])
print(bos.shape[0])

506
506
506
506
506
506
506


In [8]:
in_sample1['DIS'] = np.nan
in_sample5['DIS'] = np.nan
in_sample10['DIS'] = np.nan
in_sample20['DIS'] = np.nan
in_sample33['DIS'] = np.nan
in_sample50['DIS'] = np.nan

**Choose an imputation method**

In [9]:
in_sample1['DIS'] = in_sample1['DIS'].fillna(out_sample1['DIS'].mean())
in_sample5['DIS'] = in_sample5['DIS'].fillna(out_sample5['DIS'].mean())
in_sample10['DIS'] = in_sample10['DIS'].fillna(out_sample10['DIS'].mean())
in_sample20['DIS'] = in_sample20['DIS'].fillna(out_sample20['DIS'].mean())
in_sample33['DIS'] = in_sample33['DIS'].fillna(out_sample33['DIS'].mean())
in_sample50['DIS'] = in_sample50['DIS'].fillna(out_sample50['DIS'].mean())

**Rejoin the imputed and original datasets**

In [10]:
imputed_data1 = pd.concat([in_sample1, out_sample1])
imputed_data1 = imputed_data1.sort_index()

imputed_data5 = pd.concat([in_sample5, out_sample5])
imputed_data5 = imputed_data5.sort_index()

imputed_data10 = pd.concat([in_sample10, out_sample10])
imputed_data10 = imputed_data10.sort_index()

imputed_data20 = pd.concat([in_sample20, out_sample20])
imputed_data20 = imputed_data20.sort_index()

imputed_data33 = pd.concat([in_sample33, out_sample33])
imputed_data33 = imputed_data33.sort_index()

imputed_data50 = pd.concat([in_sample50, out_sample50])
imputed_data50 = imputed_data50.sort_index()

**Use the same training and testing indices to fit the model**

In [11]:
train_set1 = imputed_data1.iloc[train_index]
test_set1 = imputed_data1.iloc[test_index]

train_set5 = imputed_data5.iloc[train_index]
test_set5 = imputed_data5.iloc[test_index]

train_set10 = imputed_data10.iloc[train_index]
test_set10 = imputed_data10.iloc[test_index]

train_set20 = imputed_data20.iloc[train_index]
test_set20 = imputed_data20.iloc[test_index]

train_set33 = imputed_data33.iloc[train_index]
test_set33 = imputed_data33.iloc[test_index]

train_set50 = imputed_data50.iloc[train_index]
test_set50 = imputed_data50.iloc[test_index]

In [12]:
X_train1 = train_set1.iloc[:, :-1].values
Y_train1 = train_set1.iloc[:, -1].values
X_test1 = test_set1.iloc[:, :-1].values
Y_test1 = test_set1.iloc[:, -1].values

X_train5 = train_set5.iloc[:, :-1].values
Y_train5 = train_set5.iloc[:, -1].values
X_test5 = test_set5.iloc[:, :-1].values
Y_test5 = test_set5.iloc[:, -1].values

X_train10 = train_set10.iloc[:, :-1].values
Y_train10 = train_set10.iloc[:, -1].values
X_test10 = test_set10.iloc[:, :-1].values
Y_test10 = test_set10.iloc[:, -1].values

X_train20 = train_set20.iloc[:, :-1].values
Y_train20 = train_set20.iloc[:, -1].values
X_test20 = test_set20.iloc[:, :-1].values
Y_test20 = test_set20.iloc[:, -1].values

X_train33 = train_set33.iloc[:, :-1].values
Y_train33 = train_set33.iloc[:, -1].values
X_test33 = test_set33.iloc[:, :-1].values
Y_test33 = test_set33.iloc[:, -1].values

X_train50 = train_set50.iloc[:, :-1].values
Y_train50 = train_set50.iloc[:, -1].values
X_test50 = test_set50.iloc[:, :-1].values
Y_test50 = test_set50.iloc[:, -1].values

**Fit a new model to the imputed dataset**

In [13]:
reg1 = LinearRegression().fit(X_train1, Y_train1)
reg5 = LinearRegression().fit(X_train5, Y_train5)
reg10 = LinearRegression().fit(X_train10, Y_train10)
reg20 = LinearRegression().fit(X_train20, Y_train20)
reg33 = LinearRegression().fit(X_train33, Y_train33)
reg50 = LinearRegression().fit(X_train50, Y_train50)

In [14]:
Y_pred1 = reg1.predict(X_test1)
Y_pred5 = reg5.predict(X_test5)
Y_pred10 = reg10.predict(X_test10)
Y_pred20 = reg20.predict(X_test20)
Y_pred33 = reg33.predict(X_test33)
Y_pred50 = reg50.predict(X_test50)

mae1 = mean_absolute_error(Y_test1,Y_pred1)
mae5 = mean_absolute_error(Y_test5,Y_pred5)
mae10 = mean_absolute_error(Y_test10,Y_pred10)
mae20 = mean_absolute_error(Y_test20,Y_pred20)
mae33 = mean_absolute_error(Y_test33,Y_pred33)
mae50 = mean_absolute_error(Y_test50,Y_pred50)

mse1 = mean_squared_error(Y_test1,Y_pred1)
mse5 = mean_squared_error(Y_test5,Y_pred5)
mse10 = mean_squared_error(Y_test10,Y_pred10)
mse20 = mean_squared_error(Y_test20,Y_pred20)
mse33 = mean_squared_error(Y_test33,Y_pred33)
mse50 = mean_squared_error(Y_test50,Y_pred50)

rmse_val1 = rmse(Y_test1,Y_pred1)
rmse_val5 = rmse(Y_test5,Y_pred5)
rmse_val10 = rmse(Y_test10,Y_pred10)
rmse_val20 = rmse(Y_test20,Y_pred20)
rmse_val33 = rmse(Y_test33,Y_pred33)
rmse_val50 = rmse(Y_test50,Y_pred50)

r21 = r2_score(Y_test1,Y_pred1)
r25 = r2_score(Y_test5,Y_pred5)
r210 = r2_score(Y_test10,Y_pred10)
r220 = r2_score(Y_test20,Y_pred20)
r233 = r2_score(Y_test33,Y_pred33)
r250 = r2_score(Y_test50,Y_pred50)


# DD: Left off here

In [None]:
# Results row for the 1% model, using the metrics computed in the previous cell
temp_frame = pd.DataFrame({'data':'1% imputed',
                   'imputation':'Mean',
                   'mae': mae1, 
                   'mse': mse1, 
                   'rmse':rmse_val1,
                   'R2':r21,
                   'mae_diff':mae1-orig_mae,
                   'mse_diff':mse1-orig_mse,
                   'rmse_diff':rmse_val1-orig_rmse_val,
                   'R2_diff':r21-orig_r2
                   }, index=[0])

In [None]:
res_frame = pd.concat([res_frame, temp_frame])
res_frame
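The remaining models need the same kind of row, so one option is to build every row with a small helper and assemble the table once. This is only a sketch: `regression_scores` and `result_row` are hypothetical helpers, the metrics are computed from their definitions rather than with sklearn or ml_metrics, and the arrays are made-up numbers used to exercise the code, not results from this notebook.

```python
import numpy as np
import pandas as pd

def regression_scores(y_true, y_pred):
    """MAE, MSE, RMSE and R2 computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "mae": float(np.mean(np.abs(err))),
        "mse": mse,
        "rmse": mse ** 0.5,
        "R2": 1.0 - float(np.sum(err ** 2)) / ss_tot,
    }

def result_row(data_label, imputation, scores, baseline=None):
    """One table row; the *_diff columns are NaN for the baseline itself."""
    row = {"data": data_label, "imputation": imputation, **scores}
    for name, value in scores.items():
        row[name + "_diff"] = np.nan if baseline is None else value - baseline[name]
    return row

# Made-up targets and predictions, only to exercise the helpers
y_true = np.array([10.0, 20.0, 30.0, 40.0])
toy_predictions = {
    "1% imputed": np.array([11.0, 19.0, 31.0, 38.0]),
    "5% imputed": np.array([12.0, 18.0, 33.0, 37.0]),
}
baseline = regression_scores(y_true, np.array([10.5, 19.5, 30.5, 39.5]))

rows = [result_row("original", "none", baseline)]
for label, y_pred in toy_predictions.items():
    rows.append(result_row(label, "Mean", regression_scores(y_true, y_pred), baseline))

toy_frame = pd.DataFrame(rows)
print(toy_frame)
```

In the notebook itself, the toy arrays would be replaced by `Y_test1`/`Y_pred1` through `Y_test50`/`Y_pred50`.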

__Step 3:__ Take two columns and create data “Missing at Random” when controlled for a third variable (i.e., if Variable Z is > 30, then Variables X, Y are randomly missing).  Use your preferred imputation method to fill in 10%, 20% and 30% of your missing data.

__Question 3:__ In each case [10%, 20%, 30%] perform a fit with the imputed data and compare the loss and goodness of fit to your baseline.  [Note: you should have (9) models to compare against your baseline at this point.]
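One possible way to generate the "Missing at Random" pattern is sketched below on synthetic data. Several things here are assumptions: the helper name `make_mar`, the column names X, Y and Z, the threshold of 30 taken from the example in Step 3, the reading of 10%/20%/30% as a share of all rows, and the use of column means as the imputation method.

```python
import numpy as np
import pandas as pd

def make_mar(df, target_cols, control_col, threshold, frac, seed=0):
    """Set `target_cols` to NaN in randomly chosen rows where control_col > threshold.

    `frac` is the share of ALL rows that end up missing (an assumption),
    capped by the number of rows that satisfy the condition.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    eligible = out.index[out[control_col] > threshold].to_numpy()
    n_missing = min(int(round(frac * len(out))), len(eligible))
    chosen = rng.choice(eligible, size=n_missing, replace=False)
    out.loc[chosen, target_cols] = np.nan
    return out

# Synthetic stand-in data: Z controls whether X and Y may go missing
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "X": rng.normal(10, 2, 300),
    "Y": rng.normal(50, 5, 300),
    "Z": rng.permutation(np.linspace(0, 100, 300)),
})

mar_imputed = {}
for frac in [0.10, 0.20, 0.30]:
    holed = make_mar(demo, ["X", "Y"], "Z", threshold=30, frac=frac)
    # Mean imputation per column, computed from the observed values only
    mar_imputed[frac] = holed.fillna(holed.mean())
    print(frac, int(holed["X"].isna().sum()))
```

Missingness here depends only on the observed variable Z, never on the values of X or Y themselves, which is what separates this pattern from "Missing Not at Random".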


__Step 4:__  Create a “Missing Not at Random” pattern in which 25% of the data is missing for a single column.

__Question 4:__ Perform a fit with the imputed data [25%] and compare the loss and goodness of fit to your baseline.  [Note: you should have (10) models to compare against your baseline at this point.]
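A "Missing Not at Random" pattern makes the chance of a value being missing depend on the value itself. A simple way to sketch this is to censor the largest 25% of a column. The helper name `make_mnar`, the choice of censoring the top of the distribution, and the synthetic DIS column are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def make_mnar(df, column, frac=0.25):
    """Blank out the largest `frac` of values in `column` (missingness depends on the value)."""
    out = df.copy()
    n_missing = int(round(frac * len(out)))
    largest_idx = out[column].nlargest(n_missing).index
    out.loc[largest_idx, column] = np.nan
    return out

# Synthetic stand-in data
rng = np.random.default_rng(2)
demo = pd.DataFrame({"DIS": rng.uniform(1, 12, 200)})

mnar = make_mnar(demo, "DIS", frac=0.25)
mnar_imputed = mnar.fillna(mnar.mean())

# The observed mean is pulled below the true mean because only large values were removed
print(demo["DIS"].mean(), mnar["DIS"].mean())
```

Because the observed values are no longer representative, mean imputation fills the gaps with a value that is systematically too low in this setup, which is worth keeping in mind when comparing against the baseline.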


__Step 5:__ Describe your imputation approach and summarize your findings.  What impact did the missing data have on your baseline model’s performance? 