## Dealing with Missing Values
[Resources](https://www.kaggle.com/code/alexisbcook/missing-values)

Goal: Discuss and test on 3 appraoches to handle missing values in the dataset.

## Import Libraries

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Load the Dataset
Dataset used: [Melbourne Housing Snapshot](https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot) from Kaggle

In [6]:
os.listdir("./archive")

path = "./archive/melb_data.csv"

In [19]:
data = pd.read_csv(path) 
data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


## Features (X) and Labels (y)

In [20]:
numerical_features = data.select_dtypes(include="number")

X = numerical_features.drop(["Price"], axis=1)
y = data["Price"]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state=0)

In [21]:
print(f"X_train shape: {X_train.shape}")
print(f"X_valid shape: {X_valid.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_valid shape: {y_valid.shape}")

X_train shape: (10864, 12)
X_valid shape: (2716, 12)
y_train shape: (10864,)
y_valid shape: (2716,)


## Train Model and MAE

In [23]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators = 10, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    return mean_absolute_error(y_valid, y_pred)

## Approach 1: Drop Columns w/ Missing Values

In [24]:
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

dropped_X_train = X_train.drop(cols_with_missing, axis=1)
dropped_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(dropped_X_train, dropped_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop columns with missing values):
188390.99375306827


## Approach 2: Use SimpleImputer to Replace Missing Values w/ Mean Value along each column

In [29]:
from sklearn.impute import SimpleImputer

# create objest of Imputer
imputer = SimpleImputer()

# imputation
imputed_X_train = pd.DataFrame(imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(imputer.fit_transform(X_valid))

imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

# mae score
print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))


MAE from Approach 2 (Imputation):
177927.81902307313


## Approach 3: Imputation + Keep Track of Which Values were Imputed

In [30]:
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# maje new columns, indicating what'll be imputed
for col in cols_with_missing:
    X_train_plus[col+"_was_missing"] = X_train_plus[col].isnull()
    X_valid_plus[col+"_was_missing"] = X_valid_plus[col].isnull()
    
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

MAE from Approach 3 (An Extension to Imputation):
176926.4156725577


## Why did imputation perform better than dropping the columns?

Dropping columns remove a lot of useful information

In [31]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

(10864, 12)
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64
