__DD:__ To install ml_metrics I had to open the Anaconda command prompt and run __pip install ml_metrics__. My understanding is that pip should generally be avoided inside an Anaconda environment, but I had no other option.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from ml_metrics import rmse
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

In [2]:
boston = load_boston()

# Convert the matrix to pandas
bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos['MEDV'] = boston.target
#bos.head()
#bos.describe()

In [3]:
bos.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


__Step 1:__
Use sklearn.datasets to get the Boston Housing dataset.  Fit a linear regressor to the data as a baseline.  There is no need to do Cross-Validation.  We will simply be exploring the change in results.

In [4]:
train_set = bos.sample(frac=0.7, random_state=100)
test_set = bos.drop(train_set.index) # every row that was not sampled into the training set
# Get the training and testing row indices for later use
train_index = train_set.index.values.astype(int)
test_index = test_set.index.values.astype(int)

# Convert the training and testing sets back to NumPy arrays
X_train = train_set.iloc[:, :-1].values # features: every column except the target
Y_train = train_set.iloc[:, -1].values # target only (MEDV, the last column)
X_test = test_set.iloc[:, :-1].values # features
Y_test = test_set.iloc[:, -1].values # target

# Fit a linear regression to the training data
reg = LinearRegression(normalize=True).fit(X_train, Y_train)

#Predict
Y_pred = reg.predict(X_test)

#Get measures
orig_mae = mean_absolute_error(Y_test,Y_pred)
orig_mse = mean_squared_error(Y_test,Y_pred)
orig_rmse_val = rmse(Y_test,Y_pred)
orig_r2 = r2_score(Y_test,Y_pred)

In [5]:
res_frame = pd.DataFrame({'data':'original',
                   'imputation':'none',
                   'mae': orig_mae, 
                   'mse': orig_mse, 
                   'rmse':orig_rmse_val, 
                   'R2':orig_r2,
                   'mae_diff':np.nan,
                   'mse_diff':np.nan,
                   'rmse_diff':np.nan,
                   'R2_diff':np.nan}, index=[0])

__Question 1:__ What is the loss and what are the goodness of fit parameters?  This will be our baseline for comparison.


In [6]:
res_frame

Unnamed: 0,data,imputation,mae,mse,rmse,R2,mae_diff,mse_diff,rmse_diff,R2_diff
0,original,none,3.604571,24.098505,4.909023,0.70494,,,,
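
For reference, the four columns reported above can be written out directly in NumPy. This is a minimal sketch, not the code that produced the table: `regression_metrics` is a hypothetical helper and the toy values are made up. It also shows that RMSE is just the square root of MSE, so it can be computed without an extra package.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))   # mean absolute error
    mse = np.mean(residuals ** 2)      # mean squared error (the loss)
    rmse = np.sqrt(mse)                # same units as the target

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {'mae': mae, 'mse': mse, 'rmse': rmse, 'R2': r2}

# Toy values for illustration only (not taken from the Boston data)
toy_metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```

Lower MAE, MSE and RMSE mean a smaller loss, while an R2 closer to 1 means a better fit, which is how the `*_diff` columns should be read.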


__Step 2:__ (repeat for each percentage value below)
Select 1%, 5%, 10%, 20%, 33%, and 50% of your data in a single column, completely at random [hold that column selection constant throughout all iterations], replace the original values with NaN (i.e., “not a number”, e.g., np.nan), and then impute the missing values.

__Question 2:__ In each case [1%, 5%, 10%, 20%, 33%, 50%] perform a fit with the imputed data and compare the loss and goodness of fit to your baseline.  [Note: you should have (6) models to compare against your baseline at this point.]


In [7]:
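def definedsample(data, fraction, featurename, resultframe):
    # Rows whose feature value will be blanked out and then imputed
    in_sample = data.sample(frac=fraction, random_state=99)
    # All remaining rows keep their original values
    out_sample = data.drop(in_sample.index)
    in_sample[featurename] = np.nan
    # Mean imputation, using the mean of the untouched rows
    in_sample[featurename] = in_sample[featurename].fillna(out_sample[featurename].mean())
    sampleddata = pd.concat([in_sample, out_sample])
    sampleddata = sampleddata.sort_index()
    # Reuse the baseline train/test split; the index values are labels, so select with .loc
    train_set = sampleddata.loc[train_index]
    test_set = sampleddata.loc[test_index]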
    X_train = train_set.iloc[:, :-1].values
    Y_train = train_set.iloc[:, -1].values
    X_test = test_set.iloc[:, :-1].values
    Y_test = test_set.iloc[:, -1].values
    
    reg = LinearRegression().fit(X_train, Y_train)
    Y_pred = reg.predict(X_test)
    
    mae = mean_absolute_error(Y_test,Y_pred)
    mse = mean_squared_error(Y_test,Y_pred)
    rmse_val = rmse(Y_test,Y_pred)
    r2 = r2_score(Y_test,Y_pred)
    
    # Format the fraction as a percentage (0.01 -> '1%') so the row label is accurate
    temp_frame = pd.DataFrame({'data': f'{fraction:.0%} imputed',
                   'imputation':'Mean',
                   'mae': mae,
                   'mse': mse,
                   'rmse':rmse_val,
                   'R2':r2,
                   'mae_diff':mae-orig_mae,
                   'mse_diff':mse-orig_mse,
                   'rmse_diff':rmse_val-orig_rmse_val,
                   'R2_diff':r2-orig_r2
                   }, index=[0])
    resultframe = pd.concat([resultframe, temp_frame])
    return resultframe

In [8]:
res_frame = definedsample(bos,.01,'DIS',res_frame)
res_frame = definedsample(bos,.05,'DIS',res_frame)
res_frame = definedsample(bos,.10,'DIS',res_frame)
res_frame = definedsample(bos,.20,'DIS',res_frame)
res_frame = definedsample(bos,.5,'DIS',res_frame)

In [9]:
res_frame

Unnamed: 0,data,imputation,mae,mse,rmse,R2,mae_diff,mse_diff,rmse_diff,R2_diff
0,original,none,3.604571,24.098505,4.909023,0.70494,,,,
0,1% imputed,Mean,3.638791,24.316646,4.931191,0.702269,0.034219,0.218141,0.022168,-0.002671
0,5% imputed,Mean,3.553488,23.97754,4.896687,0.706421,-0.051083,-0.120964,-0.012336,0.001481
0,10% imputed,Mean,3.547616,24.027511,4.901786,0.705809,-0.056955,-0.070994,-0.007236,0.000869
0,20% imputed,Mean,3.64101,25.303311,5.03024,0.690188,0.036439,1.204806,0.121217,-0.014752
0,50% imputed,Mean,3.624329,25.662844,5.065851,0.685786,0.019757,1.564339,0.156828,-0.019154


In [10]:
# sampledefinitions = [.01, .05]
# for sampledefinition in sampledefinitions:
#     res_frame = definedsample(bos,sampledefinition,'DIS',res_frame)
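
The "missing completely at random" step from Step 2 could also be driven by a loop over all six fractions. The following is only a sketch under assumptions: `impute_mcar` is a hypothetical helper, and `toy` is synthetic data standing in for the Boston frame so that the example runs on its own. It reports how many values were imputed and how mean imputation shrinks the spread of the column.

```python
import numpy as np
import pandas as pd

def impute_mcar(data, fraction, featurename, seed=99):
    """Blank out `fraction` of one column completely at random, then mean-impute."""
    out = data.copy()
    # Row selection ignores every value in the data, hence "completely at random"
    missing_idx = out.sample(frac=fraction, random_state=seed).index
    out.loc[missing_idx, featurename] = np.nan
    # Series.mean() skips NaN, so this is the mean of the observed values only
    out[featurename] = out[featurename].fillna(out[featurename].mean())
    return out, missing_idx

# Synthetic stand-in for the real data (assumption, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'DIS': rng.uniform(1, 12, size=200),
                    'MEDV': rng.uniform(5, 50, size=200)})

fractions = [0.01, 0.05, 0.10, 0.20, 0.33, 0.50]
rows = []
for fraction in fractions:
    imputed, missing_idx = impute_mcar(toy, fraction, 'DIS')
    rows.append({'fraction': fraction,
                 'n_imputed': len(missing_idx),
                 'dis_std': imputed['DIS'].std()})
summary = pd.DataFrame(rows)
```

Collecting the rows in a list and building the frame once avoids repeated `pd.concat` calls, and keeping the seed fixed keeps the results reproducible across iterations.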

__Step 3:__ Take two columns and create data “Missing at Random” when controlled for a third variable (i.e., if Variable Z is > 30, then Variables X, Y are randomly missing).  Use your preferred imputation method to fill in 10%, 20% and 30% of your missing data.

In [11]:
#If INDUS > 9.6 then CHAS,NOX may be missing
# in_sample1['DIS'] = in_sample1['DIS'].fillna(out_sample1['DIS'].mean())
# imputed_data1 = pd.concat([in_sample1, out_sample1])
# imputed_data1 = imputed_data1.sort_index()
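
One possible way to build the "Missing at Random" pattern described in Step 3 is sketched below. It assumes the rule from the comment above (CHAS and NOX may be missing only when INDUS > 9.6), interprets the percentage as a share of all rows, and uses mean imputation. `impute_mar` is a hypothetical helper and `toy` is synthetic data, not the Boston frame.

```python
import numpy as np
import pandas as pd

def impute_mar(data, fraction, control, threshold, targets, seed=99):
    """Make `targets` missing at random among rows where control > threshold,
    then fill each target column with the mean of its observed values."""
    out = data.copy()
    # Only rows that satisfy the condition on the control variable are eligible
    eligible = out.index[out[control] > threshold].to_numpy()
    # `fraction` is a share of all rows, capped at the number of eligible rows
    n_missing = min(int(round(fraction * len(out))), len(eligible))
    rng = np.random.default_rng(seed)
    missing_idx = rng.choice(eligible, size=n_missing, replace=False)
    out.loc[missing_idx, targets] = np.nan
    # fillna with a Series fills column by column, matching on column name
    out[targets] = out[targets].fillna(out[targets].mean())
    return out, missing_idx

# Synthetic stand-in with the three column names used above (assumption)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'INDUS': rng.uniform(0, 28, size=200),
                    'CHAS': rng.integers(0, 2, size=200).astype(float),
                    'NOX': rng.uniform(0.3, 0.9, size=200)})

results = {fraction: impute_mar(toy, fraction, 'INDUS', 9.6, ['CHAS', 'NOX'])
           for fraction in (0.10, 0.20, 0.30)}
```

Because missingness depends only on the observed INDUS value and not on CHAS or NOX themselves, the pattern is MAR rather than MNAR.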

__Question 3:__ In each case [10%, 20%, 30%] perform a fit with the imputed data and compare the loss and goodness of fit to your baseline.  [Note: you should have (9) models to compare against your baseline at this point.]

__Step 4:__  Create a “Missing Not at Random” pattern in which 25% of the data is missing for a single column.

__Question 4:__ Perform a fit with the imputed data [25%] and compare the loss and goodness of fit to your baseline.  [Note: you should have (10) models to compare against your baseline at this point.]
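
A "Missing Not at Random" pattern means that whether a value is missing depends on the value itself. One simple mechanism, chosen here only as an assumption, is to drop the largest 25% of values in the column. `impute_mnar` is a hypothetical helper and `toy` is synthetic data used so the sketch runs on its own.

```python
import numpy as np
import pandas as pd

def impute_mnar(data, fraction, featurename):
    """Remove the largest `fraction` of values in one column, so missingness
    depends on the unobserved value itself, then mean-impute from the rest."""
    out = data.copy()
    n_missing = int(round(fraction * len(out)))
    # The rows to blank out are chosen by the value that is about to go missing
    missing_idx = out[featurename].nlargest(n_missing).index
    out.loc[missing_idx, featurename] = np.nan
    out[featurename] = out[featurename].fillna(out[featurename].mean())
    return out, missing_idx

# Synthetic stand-in for the real data (assumption, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'DIS': rng.uniform(1, 12, size=200),
                    'MEDV': rng.uniform(5, 50, size=200)})

imputed, missing_idx = impute_mnar(toy, 0.25, 'DIS')
```

Under this mechanism the observed mean is biased downward, so mean imputation replaces the largest values with a number that is too small. That is why MNAR is generally expected to hurt a model more than the random patterns.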


__Step 5:__ Describe your imputation approach and summarize your findings.  What impact did the missing data have on your baseline model’s performance? 