
# Dealing with Missing Values

This notebook shows how to deal with missing values effectively. It is based on Kaggle's 'Intermediate Machine Learning' course; more information about the methods and strategies can be found here: https://www.kaggle.com/alexisbcook/missing-values.

In [None]:
# Import the required Libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer

#### 1. Import the required Data

---



In [None]:
# Import the Data Set
file_path_train = '/content/drive/MyDrive/Colab Notebooks/Kaggle Course/Intermediate Machine Learning/train.csv'
file_path_test = '/content/drive/MyDrive/Colab Notebooks/Kaggle Course/Intermediate Machine Learning/test.csv'

# Read the data
X_full = pd.read_csv(file_path_train, index_col = 'Id')
X_test_full = pd.read_csv(file_path_test, index_col = 'Id')

# Assign the dependent variable - Remove rows with missing target values
X_full.dropna(axis = 0, subset = ['SalePrice'], inplace = True)   # inplace = True modifies the existing data frame
y = X_full.SalePrice

# Separate the target from the predictors
X_full.drop(['SalePrice'], axis = 1, inplace = True)

# Assign features - To keep things simple, we'll use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])
X_test = X_test_full.select_dtypes(exclude=['object'])

# Split the data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 0)

#### 2. Set up Regressor used to score Data processing Approaches

---



In [None]:
# We are using a Random Forest Regressor to score the performance

def score_mae(X_train, X_val, y_train, y_val):
  # Develop Model and make Predictions
  model = RandomForestRegressor(n_estimators = 100, random_state = 0)
  model.fit(X_train,y_train)
  preds = model.predict(X_val)

  error = mean_absolute_error(y_val, preds)
  return (error)
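
The score returned by `score_mae` is the mean absolute error, i.e. the average of the absolute differences between predictions and true values, so lower is better. As a minimal sketch on made-up numbers (the `y_true` and `y_pred` arrays are illustrative assumptions, not taken from the housing data), the metric can be checked by hand:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up target values and predictions, for illustration only
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Average of the absolute errors: (10 + 10 + 30) / 3
manual_mae = np.mean(np.abs(y_true - y_pred))

# The same quantity computed by scikit-learn
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Because the error is expressed in the units of the target, the scores reported in section 3 can be read directly as an average deviation in sale price.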

#### 3. Test Data Processing Approaches

---



In [None]:
# Variable Containing the Score of each Approach
method_score = []

##### 3.1 Drop Columns with Missing Values

This approach drops every column that contains at least one missing value.


In [None]:
# Detect columns with missing values
drop_cols = [col for col in X_train.columns if X_train[col].isnull().any()]   # any() is True if the column contains at least one missing value

# Remove columns with missing values
reduced_X_train = X_train.drop(drop_cols, axis = 1)
reduced_X_val = X_val.drop(drop_cols, axis = 1)

# Count the number of missing values in each column
print(X_train.shape)
numMissingValues = X_train.isnull().sum()
print(numMissingValues[numMissingValues>0])

(1168, 36)
LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64
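
When deciding whether a column is worth dropping, the share of missing entries is often easier to judge than the raw count. The following is a small sketch on a made-up frame (`toy_df` and its values are assumptions for illustration); it relies on the fact that the mean of a boolean mask is the fraction of `True` entries.

```python
import numpy as np
import pandas as pd

# Made-up data frame, for illustration only
toy_df = pd.DataFrame({
    'x': [1.0, np.nan, 3.0, np.nan],
    'y': [1.0, 2.0, 3.0, 4.0],
    'z': [np.nan, 2.0, 3.0, 4.0],
})

# Fraction of missing entries per column
missing_share = toy_df.isnull().mean()

# Keep only columns with missing entries, largest share first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

Applied to `X_train`, the counts printed above correspond to 212 of 1168 rows for `LotFrontage`, roughly 18%.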


In [None]:
# Predict using the reduced feature set
method_score.append(score_mae(reduced_X_train, reduced_X_val, y_train, y_val))
print(method_score)

[17837.82570776256]


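##### 3.2 Impute Missing Values with Column Means

This approach uses scikit-learn's `SimpleImputer` to fill in missing values. With its default settings, each missing entry is replaced by the mean of its column, computed on the training data.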
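
To make the effect concrete, here is a minimal sketch on made-up data (the `toy_*` names and values are illustrative assumptions, not part of the housing data set). It fits the imputer on a small training frame and then reuses the learned column means on a validation frame, so no statistics are computed from the validation data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and validation frames, for illustration only
toy_train = pd.DataFrame({'a': [1.0, 3.0, np.nan], 'b': [10.0, np.nan, 30.0]})
toy_val = pd.DataFrame({'a': [np.nan], 'b': [5.0]})

# Default strategy is 'mean'
toy_imputer = SimpleImputer()

# Learn the column means on the training frame and fill its gaps
toy_train_imputed = pd.DataFrame(
    toy_imputer.fit_transform(toy_train),
    columns=toy_train.columns,
    index=toy_train.index,
)

# Fill gaps in the validation frame with the training means
toy_val_imputed = pd.DataFrame(
    toy_imputer.transform(toy_val),
    columns=toy_val.columns,
    index=toy_val.index,
)

print(toy_imputer.statistics_)   # column means learned from toy_train
print(toy_val_imputed)
```

Passing `columns` and `index` when rebuilding the data frame is one way to keep the original labels; other strategies such as `'median'` or `'most_frequent'` can be selected through the `strategy` parameter.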

In [None]:
# Define the imputer
my_imputer = SimpleImputer()

# Impute using the train features as a fitting set for the imputer
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))   # Transform np array to data frame
imputed_X_val = pd.DataFrame(my_imputer.transform(X_val))

# Imputation returns a NumPy array and thus loses the column names. Reassign column names here
imputed_X_train.columns = X_train.columns
imputed_X_val.columns = X_val.columns

In [None]:
# Predict using imputed dataset
method_score.append(score_mae(imputed_X_train, imputed_X_val, y_train, y_val))
print(method_score)

[17837.82570776256, 18062.894611872147]


##### 3.3 An Extension to Imputation

Here, the imputation approach from 3.2 is extended by keeping track of which values have been imputed. 
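
As an alternative to building the indicator columns by hand, `SimpleImputer` accepts an `add_indicator` parameter that appends them automatically. The sketch below is not the approach used in this notebook; it runs on made-up data, and the `toy` frame and the `_was_missing` naming of the output columns are assumptions chosen to mirror the manual version.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame with one missing value in column 'a', for illustration only
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

# add_indicator=True appends one 0/1 column per feature that had
# missing values when the imputer was fitted
indicator_imputer = SimpleImputer(add_indicator=True)
toy_out = indicator_imputer.fit_transform(toy)

# Positions of the features that received an indicator column
flagged = indicator_imputer.indicator_.features_

# Rebuild the column names: original columns first, then the indicators
out_columns = list(toy.columns) + [toy.columns[i] + '_was_missing' for i in flagged]
toy_out_df = pd.DataFrame(toy_out, columns=out_columns, index=toy.index)
print(toy_out_df)
```

Only column `a` receives an indicator here, because `b` has no missing values in the data the imputer was fitted on.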

In [None]:
# Make a copy to avoid changing data when imputing 
X_train_plus = X_train.copy()
X_val_plus = X_val.copy()

# Make new columns indicating what will be imputed. df[col].isnull() returns True/False for each row of column col in data frame df
for col in drop_cols:
  X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
  X_val_plus[col + '_was_missing'] = X_val_plus[col].isnull()

In [None]:
# Define the imputer
my_imputer = SimpleImputer()

# Impute using the train features as a fitting set for the imputer
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))   # Transform np array to data frame
imputed_X_val_plus = pd.DataFrame(my_imputer.transform(X_val_plus))

# Imputation returns a NumPy array and thus loses the column names. Reassign column names here
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_val_plus.columns = X_val_plus.columns

In [None]:
method_score.append(score_mae(imputed_X_train_plus, imputed_X_val_plus, y_train, y_val))
print(method_score)

[17837.82570776256, 18062.894611872147, 18148.417180365297]


##### 3.4 Conclusion

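Only three of the 36 numerical columns contain missing values, and the most affected one, `LotFrontage`, is missing in 212 of 1168 training rows (about 18%). With such a low share, one would expect dropping whole columns to be unwise, since it removes information that could help fit the model; dropping columns usually makes sense only when a significant share of a column's entries is missing. In this case, however, dropping columns gave the lowest MAE (about 17,838), ahead of plain imputation (about 18,063) and imputation with indicator columns (about 18,148). This shows that it must always be checked whether imputed values actually make sense for the columns in question.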