# Five Important Ways of Imputing Missing Values

You can impute missing values using simple statistics or machine learning models. This process is known as data imputation and is commonly used in data preprocessing to handle missing or incomplete data. There are several methods and models you can use, depending on the nature of your data and of the missing values:

1. **Simple Imputation Techniques**:
- Mean/Median Imputation: Replacing missing values with the mean or median of the column. Suitable for numerical data.

- Mode Imputation: Replacing missing values with the mode (most frequent value) of that variable in the dataset. Useful for categorical data.

2. **K-Nearest Neighbors (KNN)**: This algorithm can be used to impute missing values based on the similarity of rows.

3. **Regression Imputation**: Use a regression model to predict the missing values based on other variables in your dataset.

4. **Decision Trees and Random Forests**: Some tree-based implementations can handle missing values natively. They can also be used to predict missing values based on the patterns learned from the rest of the data.

5. **Multiple Imputation by Chained Equations (MICE)**: Each variable with missing values is modeled in turn from the other variables, and the cycle is repeated for several rounds.

# Simple Imputation Techniques
## Mean/Median Imputation

### Mean Imputation 

In [22]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# load the dataset Titanic
# df = sns.load_dataset('titanic')
# df.head()

In [23]:
# check the number of missing values in each column
# df.isnull().sum().sort_values(ascending=False)

In [24]:
# impute missing values with mean
# df['age'] = df['age'].fillna(df['age'].mean())

# check the number of missing values in each column
# df.isnull().sum().sort_values(ascending=False)

### Median Imputation

In [25]:
# impute missing values with median
# df['age'] = df['age'].fillna(df['age'].median())
# df.isnull().sum().sort_values(ascending=False)
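
To see why the choice between mean and median matters, here is a minimal sketch on a made-up column (the values are illustrative and are not taken from the Titanic data). One extreme value pulls the mean well above the typical age, while the median barely moves. The sketch also shows that filling every gap with the same constant shrinks the spread of the column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one extreme value (95) and two missing entries
ages = pd.Series([22.0, 25.0, 27.0, 30.0, 95.0, np.nan, np.nan], name="age")

# pandas skips NaN when computing mean() and median()
mean_value = ages.mean()      # (22 + 25 + 27 + 30 + 95) / 5 = 39.8
median_value = ages.median()  # middle of the five observed values = 27.0

mean_filled = ages.fillna(mean_value)
median_filled = ages.fillna(median_value)

print("fill value (mean):  ", mean_value)
print("fill value (median):", median_value)

# Every gap receives the same constant, so the standard deviation drops
print("std of observed values:", ages.std())
print("std after mean fill:   ", mean_filled.std())
```

For a skewed column, the median is usually the safer default of the two.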


## Mode Imputation

In [26]:
# imputing missing values by mode
# df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
# df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# df.isnull().sum().sort_values(ascending=False)
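
The same simple strategies are also available through scikit-learn's `SimpleImputer`, which can be convenient when the imputation has to be fitted on training data and reused later. The sketch below is one possible way to use it, on a small hypothetical table rather than the Titanic data; the column names only mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one categorical and one numeric column with gaps
toy = pd.DataFrame({
    "embarked": pd.Series(["S", "C", "S", np.nan, "Q", "S"], dtype=object),
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
})

# 'most_frequent' is the mode strategy and also works on string columns
cat_imputer = SimpleImputer(strategy="most_frequent")
# 'median' (or 'mean') only works on numeric columns
num_imputer = SimpleImputer(strategy="median")

# fit_transform expects 2D input and returns a 2D array, hence [[...]] and ravel()
toy["embarked"] = cat_imputer.fit_transform(toy[["embarked"]]).ravel()
toy["age"] = num_imputer.fit_transform(toy[["age"]]).ravel()

print(toy)
```

After fitting, the learned fill values are stored in the `statistics_` attribute of each imputer.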




# K-Nearest Neighbors Algorithm (KNN)

In [27]:
# load the dataset
# data = sns.load_dataset('titanic')

# check the number of missing values in each column
# data.isnull().sum().sort_values(ascending=False)

In [28]:
# imputing missing values with KNN imputer
# from sklearn.impute import KNNImputer

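# call the KNNImputer class with number of neighbors = 4
# imputer = KNNImputer(n_neighbors=4)

# impute missing values with the KNN imputer
# (age alone gives KNN nothing to measure similarity on, so other numeric columns are passed too)
# num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']
# data['age'] = imputer.fit_transform(data[num_cols])[:, 0]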


# check the number of missing values in each column

# data.isnull().sum().sort_values(ascending=False)
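
KNN imputation compares rows using the columns that are present in both, so it needs at least one feature besides the one being filled. The sketch below contrasts the two cases on a small hypothetical table (not the Titanic data): with `age` as the only column, `KNNImputer` has no distances to work with and falls back to the column mean, while adding `fare` lets it average the ages of the two most similar passengers.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: two clear groups (cheap fare / young, high fare / older)
toy = pd.DataFrame({
    "age":  [20.0, 22.0, np.nan, 60.0, 62.0, np.nan],
    "fare": [7.0, 8.0, 7.5, 80.0, 82.0, 81.0],
})

# Case 1: age only. No other feature exists, so the column mean (41.0) is used.
age_only = KNNImputer(n_neighbors=2).fit_transform(toy[["age"]])

# Case 2: age and fare. Neighbours are chosen by fare, then their ages are averaged.
both = KNNImputer(n_neighbors=2).fit_transform(toy[["age", "fare"]])

print("age only :", age_only[[2, 5], 0])
print("age+fare :", both[[2, 5], 0])
```

Because the distance is Euclidean, features measured on very different scales may need to be scaled before imputing.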

# Regression Imputation

In [29]:
# load the dataset
df = sns.load_dataset('titanic')

# check the missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [30]:
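# impute missing values with regression imputer
# (the first import is required: it enables the experimental IterativeImputer)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# call the IterativeImputer class with max_iter = 10
imputer = IterativeImputer(max_iter=10)

# impute missing values with the regression imputer
# (with age as the only input column there is nothing to regress on and the
# imputer falls back to the column mean, so other numeric columns are passed too)
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']
df['age'] = imputer.fit_transform(df[num_cols])[:, 0]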

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64
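
Regression imputation only helps when the imputer can see predictor columns. As a rough illustration, the sketch below builds synthetic data (not the Titanic data) in which `y` depends linearly on `x`, hides some `y` values, and compares `IterativeImputer` given `y` alone with `IterativeImputer` given both columns. The variable names and the data-generating formula are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: y = 3x + 5 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(0, 0.1, size=200)
toy = pd.DataFrame({"x": x, "y": y})

# Hide the first 20 values of y, keeping the truth for comparison
true_y = toy.loc[:19, "y"].to_numpy().copy()
toy.loc[:19, "y"] = np.nan

# y alone: no predictor, so the result is the same as mean imputation
single = IterativeImputer(max_iter=10, random_state=0).fit_transform(toy[["y"]])

# x and y together: y is regressed on x (default estimator: BayesianRidge)
multi = IterativeImputer(max_iter=10, random_state=0).fit_transform(toy[["x", "y"]])

err_single = np.abs(single[:20, 0] - true_y).mean()
err_multi = np.abs(multi[:20, 1] - true_y).mean()
print("mean absolute error, y only :", err_single)
print("mean absolute error, x and y:", err_multi)
```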

# Random Forests for Imputing Missing Values

In [31]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error,mean_absolute_percentage_error
from sklearn.impute import SimpleImputer

# load the data
df = sns.load_dataset('titanic')

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [32]:
# remove deck column
df.drop('deck',axis=1,inplace=True)

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [33]:
# encode the categorical columns using LabelEncoder
from sklearn.preprocessing import LabelEncoder

# columns to encode
columns_to_encode = ['sex','embarked','who','class','embark_town','alive']

# dictionary to store the LabelEncoder for each column
label_encoders = {}

# loop to apply LabelEncoder to each column
for col in columns_to_encode:
    # create a new LabelEncoder for the column
    le = LabelEncoder()

    # fit the encoder and replace the labels with integer codes
    df[col] = le.fit_transform(df[col])

    # store the encoder in the dictionary so the codes can be inverse transformed later
    label_encoders[col] = le

# check the first few rows of the dataframe
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True
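
Keeping one encoder per column is what makes the later inverse transform possible. The minimal sketch below, on a small hypothetical table, shows the round trip and how to look up which integer stands for which label. `LabelEncoder` assigns codes in sorted label order, so the integers carry no meaningful ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Third", "Second"],
})
original = toy.copy()

# One encoder per column, stored by column name
encoders = {}
for col in ["sex", "class"]:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    encoders[col] = le

# classes_ holds the labels in sorted order; the position is the integer code
for col, le in encoders.items():
    print(col, dict(zip(le.classes_, range(len(le.classes_)))))

# Decode with the matching encoder to get the original labels back
decoded = toy.copy()
for col, le in encoders.items():
    decoded[col] = le.inverse_transform(toy[col])
```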


In [34]:
# split the data into two parts: one with missing values, one without
df_with_missing = df[df['age'].isna()]

# dropna removes all rows that contain a missing value
df_without_missing = df.dropna()

In [35]:
print('The shape of the original dataset',df.shape)
print('The shape of the dataset with missing values removed is',df_without_missing.shape)
print('The shape of the dataset with missing values is',df_with_missing.shape)

The shape of the original dataset (891, 14)
The shape of the dataset with missing values removed is (714, 14)
The shape of the dataset with missing values is (177, 14)


In [36]:
df_with_missing.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,,0,0,7.8792,1,2,2,False,1,1,True


In [37]:
df_without_missing.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


In [38]:
# print the columns
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive',
       'alone'],
      dtype='object')

In [39]:
# Random Forest imputation

# imports needed by this cell
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error, r2_score

# split the data into X and y, using only the rows with no missing values
X = df_without_missing.drop(['age'],axis=1)
y = df_without_missing['age']

# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# fit a Random Forest regressor that predicts age from the other columns
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# evaluate the model
y_pred = rf_model.predict(X_test)
print("Random Forest RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("Random Forest R2: ", r2_score(y_test, y_pred))
print("Random Forest MAE: ", mean_absolute_error(y_test, y_pred))
print("Random Forest MAPE: ", mean_absolute_percentage_error(y_test, y_pred))

Random Forest RMSE:  11.081260589808045
Random Forest R2:  0.33769388288226154
Random Forest MAE:  8.666661815622195
Random Forest MAPE:  0.40839466096086574
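
To make the four scores above easier to read, the sketch below computes each of them by hand with NumPy on a few made-up values and checks the results against the scikit-learn functions. Note that `mean_absolute_percentage_error` returns a fraction, so a value of about 0.41 corresponds to an average error of roughly 41% of the true age.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Hypothetical true and predicted ages
y_true = np.array([22.0, 38.0, 26.0, 35.0, 54.0])
y_hat = np.array([25.0, 30.0, 28.0, 40.0, 45.0])

err = y_true - y_hat

rmse = np.sqrt(np.mean(err ** 2))             # root mean squared error
mae = np.mean(np.abs(err))                    # mean absolute error
mape = np.mean(np.abs(err) / np.abs(y_true))  # mean absolute percentage error (fraction)
# R2: 1 minus residual sum of squares over total sum of squares around the mean
r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Each hand-computed value should match the library function
print(np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat))))
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mape, mean_absolute_percentage_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```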


In [40]:
# check the number of missing values in each column
df_with_missing.isnull().sum().sort_values(ascending=False)

age            177
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64

In [41]:
# predict missing values
y_pred = rf_model.predict(df_with_missing.drop(['age'],axis=1))
y_pred

array([32.97658333, 35.64221825, 18.347     , 35.57148611, 20.65142857,
       26.7619855 , 36.648     , 18.63142857, 21.80633333, 33.55618169,
       31.06587652, 35.90741667, 18.63142857, 24.824     , 31.03      ,
       39.405     , 25.849     , 26.7619855 , 31.06587652, 19.41142857,
       31.06587652, 31.06587652, 26.7619855 , 26.27095821, 29.23514286,
       31.06587652, 48.25650595, 27.94      , 31.87071429, 31.99628481,
       30.015     , 20.85816667, 33.755     , 60.19168831, 26.00185714,
       26.24316667, 28.91733333, 49.31      , 28.55277778, 48.25650595,
       18.63142857, 20.85816667, 33.78929167, 26.7619855 , 26.63      ,
       32.01066667, 28.22883333, 28.55277778, 31.99628481, 29.72904762,
       48.25650595, 27.67733333, 56.26333333, 18.63142857, 34.65645944,
       60.44168831, 39.405     , 35.7725    , 18.63142857, 24.78266667,
       34.305     , 31.06587652, 31.602     , 20.85816667, 25.296     ,
       36.97133333, 26.7619855 , 24.85777778, 55.52      , 35.57

In [42]:
# replace the missing values with the predicted values
# (take an explicit copy first so pandas does not raise SettingWithCopyWarning)
df_with_missing = df_with_missing.copy()
df_with_missing['age'] = y_pred

# check the missing values
df_with_missing.isnull().sum().sort_values(ascending=False)


survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [43]:
# concatenate the two dataframes
df_complete = pd.concat([df_with_missing,df_without_missing],axis=0)

# print the shape of the complete dataframe
print('The shape of the complete dataframe is :',df_complete.shape)

# check the first 5 rows of the complete dataframe
df_complete.head()

The shape of the complete dataframe is : (891, 14)


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,32.976583,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,35.642218,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,18.347,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,35.571486,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,20.651429,0,0,7.8792,1,2,2,False,1,1,True
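
Splitting and concatenating works, but it changes the row order (the imputed rows come first unless `sort_index()` is called afterwards). An alternative is to write the predictions straight into the missing positions with a boolean mask and `.loc`. The sketch below shows this on a small hypothetical table; the feature list and model settings are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data with two missing ages
toy = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 13.00, 51.86, 8.46],
    "age": [22.0, 38.0, np.nan, 35.0, 35.0, np.nan, 54.0, 2.0],
})
features = ["pclass", "fare"]

# Boolean mask marking the rows whose age is missing
missing = toy["age"].isna()

# Train on the rows where age is known
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(toy.loc[~missing, features], toy.loc[~missing, "age"])

# Write the predictions only into the missing positions of a copy
filled = toy.copy()
filled.loc[missing, "age"] = model.predict(toy.loc[missing, features])

print(filled)
```

The original row order and the observed ages stay untouched, and no `SettingWithCopyWarning` is triggered because the assignment is made on an explicit copy.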


In [44]:
for col in columns_to_encode:
    # retrieve the corresponding LabelEncoder for each column
    le = label_encoders[col]

    # inverse transform the encoded values of df_complete itself
    # (using df[col] here would return labels in the original row order and misalign them)
    df_complete[col] = le.inverse_transform(df_complete[col])

# check the first 5 rows of the complete dataframe
df_complete.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,male,32.976583,0,0,8.4583,Q,Third,man,True,Queenstown,no,True
17,1,2,male,35.642218,0,0,13.0,S,Second,man,True,Southampton,yes,True
19,1,3,female,18.347,0,0,7.225,C,Third,woman,False,Cherbourg,yes,True
26,0,3,male,35.571486,0,0,7.225,C,Third,man,True,Cherbourg,no,True
28,1,3,female,20.651429,0,0,7.8792,Q,Third,woman,False,Queenstown,yes,True


# Multiple Imputation by Chained Equations

In [45]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer


# load the dataset

df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
