Since most scikit-learn models raise an error when fitted on data that contains missing values, we need a strategy for dealing with the missing data.

- Simple Option: Drop every column that contains missing values. Usually not recommended, as it also discards the valid values in those columns.
- Imputation: Fill each missing value with a number, for example the column mean. Usually performs better than dropping columns.
- Extension to Imputation: Impute as above, but also add a column indicating which values were originally missing. This sometimes helps, but the benefit is not guaranteed.
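
As a rough illustration of the third option, the sketch below flags the missing entries before imputing them. It runs on a small made-up DataFrame (the `toy` frame and its values are hypothetical, not taken from `melb_data.csv`), and the `_was_missing` suffix is just a naming choice for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, used only to illustrate the idea
toy = pd.DataFrame({
    "Rooms": [2.0, 3.0, np.nan, 4.0],
    "Landsize": [150.0, np.nan, 300.0, 450.0],
    "Distance": [2.5, 5.0, 7.5, 10.0],
})

toy_plus = toy.copy()
cols_with_missing = [col for col in toy.columns if toy[col].isnull().any()]

# Record where values were missing before they are overwritten (1 = was missing)
for col in cols_with_missing:
    toy_plus[col + "_was_missing"] = toy_plus[col].isnull().astype(int)

# SimpleImputer() fills with the column mean by default
imputer = SimpleImputer()
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy_plus),
    columns=toy_plus.columns,
    index=toy_plus.index,
)

print(toy_imputed)
```

Only `Rooms` and `Landsize` get an indicator column here, because `Distance` has no missing values. The indicator columns contain no missing values themselves, so the imputer leaves them unchanged.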

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("melb_data.csv")
y = data.Price

# DataFrame.drop([list, of, labels], axis=(1 for column axis, 0 for row axis))
predictors = data.drop(['Price'], axis=1)

# .select_dtypes() selects columns based on their data types and returns a new DataFrame with only those columns
# DataFrame.select_dtypes(exclude=[list, of, datatypes, in, single, quotes])
X = predictors.select_dtypes(exclude=['object'])

# Both test and train sizes are specified for clarity and readability
X_train, X_val, Y_train, Y_val = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0) # test_size = 0.25 by default

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae

# Model comparison function on the basis of mae
def score(X_train, X_val, y_train, y_val):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    return mae(y_val, predictions)

First, we will score the model using the first approach: dropping the columns that contain missing values.

In [3]:
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_val.drop(cols_with_missing, axis=1)

print(score(reduced_X_train, reduced_X_valid, Y_train, Y_val))

183550.22137772635


Next, we will score the model using imputation, with `SimpleImputer` replacing the missing values with the mean of each column.

In [4]:
from sklearn.impute import SimpleImputer

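# SimpleImputer() uses strategy="mean" by default
my_imputer = SimpleImputer()
# Learn the column means from the training data and fill the training set
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
# Reuse the training means to fill the validation set
imputed_X_val = pd.DataFrame(my_imputer.transform(X_val))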

print(score(imputed_X_train, imputed_X_val, Y_train, Y_val))

178166.46269899711


It is important to understand why two different methods, `fit_transform()` and `transform()`, were used on the training and validation sets respectively.

- `fit_transform()` does two things:
    - Learns the fill value for each column from the data it is given, whether that is the mean, median, most frequent value, etc., depending on the chosen strategy.
    - Fills in the missing values with these learned values and returns the result.

- `transform()` fills in the missing values based on the exact statistics that `fit_transform()` learned from the training data. It does not relearn them.

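This is why `fit_transform()` is used only on the training set. We want both the model and the preprocessing to rely only on the training data and the values derived from it. If we called `fit_transform()` on the validation set, the fill values would be recomputed from the validation data, so the two sets would be imputed with different statistics and information from the validation set would leak into preprocessing. The resulting score can look better than it realistically should, because the pipeline has used data it is supposed to be evaluated on.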
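
The difference can be seen on a tiny made-up example. This is a minimal sketch, assuming hypothetical `train` and `val` frames with a single `Landsize` column. It inspects the fitted imputer's `statistics_` attribute and compares the correct call with a refit on the validation data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy split, not taken from melb_data.csv
train = pd.DataFrame({"Landsize": [100.0, 200.0, np.nan, 300.0]})
val = pd.DataFrame({"Landsize": [np.nan, 1000.0]})

imputer = SimpleImputer()  # mean strategy by default

# fit_transform() learns the training mean (200.0) and fills the training set
train_filled = imputer.fit_transform(train)

# transform() reuses the training mean; nothing is relearned from val
val_filled = imputer.transform(val)

print(imputer.statistics_)  # fill value learned from the training data
print(val_filled[0, 0])     # 200.0, the training mean

# What we want to avoid: refitting on the validation set
leaky_filled = SimpleImputer().fit_transform(val)
print(leaky_filled[0, 0])   # 1000.0, computed from the validation data itself
```

With the refit, the missing validation value is filled with a number that exists only in the validation set, which is exactly the kind of information the pipeline should not see.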