# Evaluation of imputation approaches

Four imputation approaches are tested:
1. Remove entries with missing total_bedrooms values (baseline)
2. Impute with the average of the existing total_bedrooms values
3. Impute with a linear regression prediction from total_rooms values
4. Impute with a nearest-neighbour prediction: each missing total_bedrooms value is replaced by the average total_bedrooms of the entry's nearest neighbours
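
As a rough illustration of approaches 1 and 2, the sketch below uses plain pandas on a made-up toy frame. The `Impute` module itself is not shown in this notebook, so this is only an assumption about what its helpers might do, not their actual implementation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for housing.csv; the values are made up.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Approach 1 (baseline): drop rows whose total_bedrooms is missing.
dropped = toy.dropna(subset=["total_bedrooms"])

# Approach 2: fill missing values with the mean of the observed ones.
# Series.mean() skips NaN by default.
mean_bedrooms = toy["total_bedrooms"].mean()
filled = toy.fillna({"total_bedrooms": mean_bedrooms})

print(len(dropped), filled["total_bedrooms"].tolist())
```

Dropping rows shrinks the data set, while mean filling keeps every row but reduces the variance of the imputed column.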

In [1]:
import pandas
import numpy as np
import sklearn.linear_model as lm
from sklearn.model_selection import KFold
import DataPrepUtil
import Impute

## Evaluation function (using linear regression)

In [2]:
def lr_evaluate(housing, n_splits=5):
    # Target as a column vector; every other column is used as a feature.
    y = housing.median_house_value.values.reshape(-1, 1)
    X = housing.drop(columns=['median_house_value']).values

    model = lm.LinearRegression()
    # No random_state is set, so the folds (and the scores) vary between runs.
    kf = KFold(n_splits=n_splits, shuffle=True)

    # LinearRegression.score returns R^2, so these accumulate scores, not errors.
    total_train_score = 0
    total_validation_score = 0
    for train_index, validation_index in kf.split(X):
        X_train, X_validation = X[train_index], X[validation_index]
        y_train, y_validation = y[train_index], y[validation_index]
        model.fit(X_train, y_train)
        total_train_score += model.score(X_train, y_train)
        total_validation_score += model.score(X_validation, y_validation)

    avg_train_score = total_train_score / n_splits
    avg_validation_score = total_validation_score / n_splits
    print("Average training r2 score: " + str(avg_train_score))
    print("Average validation r2 score: " + str(avg_validation_score))
    print("Average of training and validation r2 score: " + str((avg_train_score + avg_validation_score) / 2))
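
The same averaging can probably be expressed more compactly with scikit-learn's `cross_validate`. The sketch below is an alternative, not the function used for the results in this notebook; the name `lr_evaluate_cv`, the `seed` parameter, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate


def lr_evaluate_cv(X, y, n_splits=5, seed=0):
    """Return (mean training R^2, mean validation R^2) over shuffled K folds."""
    # Fixing random_state makes the folds reproducible between runs.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    result = cross_validate(
        LinearRegression(), X, y,
        cv=kf, scoring="r2", return_train_score=True,
    )
    return result["train_score"].mean(), result["test_score"].mean()


# Synthetic data (made up): a linear target with a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

train_r2, validation_r2 = lr_evaluate_cv(X_demo, y_demo)
print(train_r2, validation_r2)
```

Fixing the seed means that differences between imputation approaches are not confounded by different fold assignments.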

## Reading data

In [3]:
housing1 = pandas.read_csv('./housing.csv')
DataPrepUtil.transform_ocean_proximity(housing1)
housing2 = pandas.read_csv('./housing.csv')
DataPrepUtil.transform_ocean_proximity(housing2)
housing3 = pandas.read_csv('./housing.csv')
DataPrepUtil.transform_ocean_proximity(housing3)
housing4 = pandas.read_csv('./housing.csv')
DataPrepUtil.transform_ocean_proximity(housing4)
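
Reading and transforming the file four times could likely be replaced by loading it once and taking independent copies. The sketch below uses an in-memory CSV with made-up values as a stand-in for `./housing.csv`; it assumes any shared preprocessing (such as `DataPrepUtil.transform_ocean_proximity`, whose behaviour is not shown here) can be applied once to the base frame before copying.

```python
import io

import pandas as pd

# In-memory stand-in for './housing.csv'; the values are made up.
csv_text = "total_rooms,total_bedrooms\n880,129\n7099,\n1467,190\n"
base = pd.read_csv(io.StringIO(csv_text))

# One deep copy per imputation approach, so in-place edits stay isolated.
copies = [base.copy() for _ in range(4)]

# Modifying one copy leaves the base frame and the other copies untouched.
copies[0].loc[0, "total_rooms"] = -1
print(base.loc[0, "total_rooms"], copies[1].loc[0, "total_rooms"])
```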

## Evaluation of approach 1

In [4]:
Impute.remove_incomplete_entries(housing1)
lr_evaluate(housing1)

Average training r2 score: 0.6466055946564022
Average validation r2 score: 0.6449707540538443
Average of training and validation r2 score: 0.6457881743551233


## Evaluation of approach 2

In [5]:
Impute.fill_average(housing2)
lr_evaluate(housing2)

Average training r2 score: 0.6456505766826136
Average validation r2 score: 0.6432810292574983
Average of training and validation r2 score: 0.644465802970056


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
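
The message above is pandas' SettingWithCopyWarning, which is typically raised when a value is assigned through chained indexing. The source of `Impute` is not shown, so the following is only a sketch of the usual remedy on a made-up toy frame: do the assignment in a single `.loc` call on the original DataFrame.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are made up.
toy = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0]})
missing = toy["total_bedrooms"].isna()

# Chained forms such as toy["total_bedrooms"].loc[missing] = ... assign
# through an intermediate object and can trigger the warning.
# A single .loc call on the frame itself selects rows and column together.
toy.loc[missing, "total_bedrooms"] = toy["total_bedrooms"].mean()

print(toy["total_bedrooms"].tolist())
```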


## Evaluation of approach 3

In [6]:
Impute.fill_lr_prediction_from_other_column(housing3, 'total_rooms')
lr_evaluate(housing3)

Average training r2 score: 0.6464231032809773
Average validation r2 score: 0.645326433409665
Average of training and validation r2 score: 0.6458747683453212
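
For reference, regression-based imputation from a single column can be sketched as follows. This is a minimal illustration on made-up data, under the assumption that `Impute.fill_lr_prediction_from_other_column` fits on the complete rows and predicts the missing ones; its actual implementation is not shown in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame; the values are made up and exactly linear (bedrooms = 0.2 * rooms).
toy = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0, 400.0],
    "total_bedrooms": [20.0, 40.0, np.nan, 80.0],
})
missing = toy["total_bedrooms"].isna()

# Fit on the rows where total_bedrooms is observed.
model = LinearRegression()
model.fit(toy.loc[~missing, ["total_rooms"]], toy.loc[~missing, "total_bedrooms"])

# Predict the missing values from total_rooms and write them back with .loc.
toy.loc[missing, "total_bedrooms"] = model.predict(toy.loc[missing, ["total_rooms"]])

print(toy["total_bedrooms"].tolist())
```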




## Evaluation of approach 4

In [7]:
Impute.fill_nn_prediction(housing4, 4)
lr_evaluate(housing4)

Average training r2 score: 0.6464353151622129
Average validation r2 score: 0.6444294546996068
Average of training and validation r2 score: 0.6454323849309098
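
A nearest-neighbour imputation can be sketched with scikit-learn's `KNNImputer`. This assumes that the second argument of `Impute.fill_nn_prediction` is the number of neighbours, which the notebook does not confirm; the toy data is made up, and `KNNImputer` is a swapped-in technique rather than the author's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame; the values are made up. The last row has a missing total_bedrooms.
toy = pd.DataFrame({
    "total_rooms": [100.0, 110.0, 120.0, 130.0, 5000.0, 105.0],
    "total_bedrooms": [20.0, 22.0, 24.0, 26.0, 900.0, np.nan],
})

# Each missing value becomes the mean of that column over the 4 nearest rows,
# with distance measured on the columns present in both rows.
imputer = KNNImputer(n_neighbors=4)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled.loc[5, "total_bedrooms"])
```

Because the distance is computed on raw feature values, columns with large ranges dominate the neighbour search on real data; scaling the features first is a common precaution.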


