
In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
df= pd.read_csv('/content/drive/MyDrive/dataset/melb_data.csv')

In [None]:
y=df.Price

In [None]:
melb_predictors=df.drop(['Price'], axis=1)

In [None]:
# Keep only the numeric predictors (drop columns of dtype object)
X = melb_predictors.select_dtypes(exclude=['object'])

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1, random_state=42)

Define Function to Measure Quality of Each Approach
We define a function score_dataset() to compare different approaches to dealing with missing values. This function reports the mean absolute error (MAE) from a random forest model.



In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
  """Fit a random forest on the training data and return the validation MAE."""
  model = RandomForestRegressor(n_estimators=10, random_state=0)
  model.fit(X_train, y_train)
  preds = model.predict(X_valid)
  return mean_absolute_error(y_valid, preds)
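
As a quick check on what the metric means, the sketch below computes MAE by hand and compares it with `mean_absolute_error`. The prices are made up for illustration and do not come from the Melbourne data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted prices, for illustration only.
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred = np.array([110_000.0, 190_000.0, 330_000.0])

# MAE is the mean of the absolute differences between truth and prediction.
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Both values should agree. A lower MAE means the predictions are closer to the true prices on average.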

Score from Approach 1 (Drop Columns with Missing Values)
Since we are working with both training and validation sets, we are careful to drop the same columns in both DataFrames.
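
To illustrate why the column list is built once and applied to both splits, here is a minimal sketch on toy data. The frames and the `toy_` names are hypothetical and only serve the example.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical data): column "b" has a gap only in the training split.
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})
toy_valid = pd.DataFrame({"a": [7.0, 8.0], "b": [9.0, 10.0]})

# Decide which columns to drop by looking at the training split only...
toy_cols_with_missing = [c for c in toy_train.columns if toy_train[c].isnull().any()]

# ...then drop that same list from both splits so the feature sets match.
toy_reduced_train = toy_train.drop(columns=toy_cols_with_missing)
toy_reduced_valid = toy_valid.drop(columns=toy_cols_with_missing)

print(toy_cols_with_missing)
print(list(toy_reduced_train.columns) == list(toy_reduced_valid.columns))
```

If each split were checked separately, `toy_valid` would keep column `b` while `toy_train` would lose it, and the model would see different features at fit and predict time.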



In [None]:
# Identify columns with missing values, using the training data only
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

# Drop the same columns from both splits
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from approach 1 (drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

MAE from approach 1 (drop columns with missing values):
190529.05409916543


Score from Approach 2 (Imputation)
Next, we use SimpleImputer to replace missing values with the mean value along each column.

Although it is simple, filling in the mean value generally performs quite well, though this varies by dataset. Statisticians have experimented with more complex ways to determine imputed values (such as regression imputation), but these strategies typically give no additional benefit once you plug the results into sophisticated machine learning models.
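
Before applying this to the real data, a small sketch on hypothetical values shows what mean imputation does. The column names and numbers below are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap per column in the training split.
demo_train = pd.DataFrame({"rooms": [2.0, np.nan, 4.0], "area": [50.0, 70.0, np.nan]})
demo_valid = pd.DataFrame({"rooms": [np.nan], "area": [np.nan]})

demo_imputer = SimpleImputer(strategy="mean")  # "mean" is also the default

# The column means are learned from the training split only...
filled_train = pd.DataFrame(demo_imputer.fit_transform(demo_train), columns=demo_train.columns)
# ...and reused to fill gaps in the validation split.
filled_valid = pd.DataFrame(demo_imputer.transform(demo_valid), columns=demo_valid.columns)

print(demo_imputer.statistics_)  # learned per-column means
print(filled_valid)
```

Calling `fit_transform` on the training data and only `transform` on the validation data keeps validation values from influencing the fill values.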



In [None]:
from sklearn.impute import SimpleImputer

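my_imputer = SimpleImputer()  # default strategy is the column mean
# Fit on the training data only, then reuse the learned means for validation
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed the column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns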

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

MAE from Approach 2 (Imputation):
178339.54558173785


We see that Approach 2 has lower MAE than Approach 1, so Approach 2 performed better on this dataset.

Score from Approach 3 (An Extension to Imputation)
Next, we impute the missing values, while also keeping track of which values were imputed.



In [None]:
X_train_plus=X_train.copy()
X_valid_plus=X_valid.copy()

# Add a boolean indicator column for each column that has missing values
for col in cols_with_missing:
  X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
  X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation: fit a fresh imputer on the extended training data
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

imputed_X_train_plus.columns=X_train_plus.columns
imputed_X_valid_plus.columns=X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

MAE from Approach 3 (An Extension to Imputation):
177621.39934778033
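
As an alternative to building the `_was_missing` columns by hand, `SimpleImputer` can append indicator columns itself through its `add_indicator` parameter. The sketch below uses hypothetical toy data and is not the method used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only "Car" has a missing value.
demo = pd.DataFrame({"Car": [1.0, np.nan, 2.0], "Rooms": [3.0, 2.0, 4.0]})

indicator_imputer = SimpleImputer(add_indicator=True)
demo_out = indicator_imputer.fit_transform(demo)

# Output columns: the imputed originals, followed by one 0/1 indicator
# for each column that had missing values during fit (here only "Car").
print(demo_out)
```

The result is a NumPy array without column names, so the names would still need to be restored before inspecting it as a DataFrame.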


In [None]:
print(X_train.shape)
missing_val_count_by_column=(X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

(12222, 12)
Car               58
BuildingArea    5803
YearBuilt       4847
dtype: int64
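
These counts are easier to judge as shares of the rows. From the output above, BuildingArea is missing in 5803 of 12222 training rows (about 47%), YearBuilt in about 40%, and Car in under 1%. A sketch of the same calculation on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only.
frame = pd.DataFrame({
    "Car": [1.0, 2.0, np.nan, 1.0],
    "BuildingArea": [np.nan, 120.0, np.nan, 95.0],
    "Rooms": [2, 3, 4, 3],
})

# The mean of a boolean mask is the fraction of True values per column.
missing_share = frame.isnull().mean()
missing_share = missing_share[missing_share > 0]
print(missing_share)
```

On the notebook's data, the equivalent call would be `X_train.isnull().mean()`.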


Conclusion
As is common, imputing missing values (Approaches 2 and 3) gave better results than simply dropping the columns with missing values (Approach 1): the MAE fell from about 190,529 with Approach 1 to about 178,340 with Approach 2 and about 177,621 with Approach 3.

