<a href="https://colab.research.google.com/github/falahamro/Imputing_Missing_Data/blob/main/Performing_mean_or_median_imputation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Mean or median Imputation

Mean or median imputation replaces every missing value (NA) in a variable with that variable's mean (appropriate when the variable is approximately normally distributed) or its median (preferable when the distribution is skewed).
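To make the mean-versus-median choice concrete, here is a minimal sketch on made-up data (the `toy` dataframe and its column names are hypothetical and are not part of the credit approval dataset). It shows how a single extreme value pulls the mean of a skewed column away from the bulk of the data, while the median stays put:

```python
import pandas as pd

# Hypothetical toy data: one roughly symmetric column, one right-skewed column.
toy = pd.DataFrame({
    "symmetric": [9.0, 10.0, 11.0, None, 10.0],
    "skewed": [1.0, 2.0, 2.0, None, 95.0],
})

# Both statistics skip missing values by default.
means = toy.mean()
medians = toy.median()

print(means)
print(medians)

# Fill the symmetric column with its mean and the skewed column with its median.
filled = toy.fillna({
    "symmetric": means["symmetric"],
    "skewed": medians["skewed"],
})
print(filled)
```

In the skewed column the mean lands far from most observations because of the single large value, so the median is the more representative fill value there.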

In this project, I will replace missing values with the mean or the median using pandas and scikit-learn, both open-source Python libraries.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

In [2]:
# Let's load our dataset 

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
data = pd.read_csv("/content/drive/MyDrive/Feature Engineering/Chapter02/creditApprovalUCI.csv")
data.head()

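|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | 1 |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | 1 |
| 2 | a | 24.5 | NaN | u | g | q | h | NaN | NaN | NaN | 0 | f | g | 280.0 | 824 | 1 |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | 1 |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | 1 |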


In mean and median imputation, the mean or median should be learned from the train set only, so that no information from the test set leaks into the preprocessing step. Therefore, let's first separate the data into train and test sets and their respective targets:

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis= 1), data['A16'], test_size=0.3,
    random_state=0)

In [8]:
# we use the pandas shape attribute to check the size of the returned datasets
X_train.shape, X_test.shape

((483, 15), (207, 15))

In [9]:
# check the fraction of missing values per variable in the train set
X_train.isnull().mean()

A1     0.008282
A2     0.022774
A3     0.140787
A4     0.008282
A5     0.008282
A6     0.008282
A7     0.008282
A8     0.140787
A9     0.140787
A10    0.140787
A11    0.000000
A12    0.000000
A13    0.000000
A14    0.014493
A15    0.000000
dtype: float64

After imputation (next step), the fraction of missing values for the A2, A3, A8, A11, and A15 variables should be 0.

In [10]:
# let's replace the missing values with the median in five numerical variables using pandas:

for var in ['A2', 'A3', 'A8', 'A11', 'A15']:
  value = X_train[var].median()
  X_train[var] = X_train[var].fillna(value)
  X_test[var] = X_test[var].fillna(value)

We calculate the median using the train set and then use this value to replace the missing data in both the train and test sets.

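By default, pandas' fillna() returns a new object with the imputed values, which is why the loop above assigns the result back to the column. Passing inplace=True modifies the object it is called on instead, e.g. X_train[var].fillna(value, inplace=True). Note that the fill value is still required, and that calling it on a column selected with X_train[var] is chained assignment, which recent pandas versions warn about and which may not update X_train, so reassigning as above is the safer pattern.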
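The loop above can also be written without explicit iteration, because `DataFrame.fillna()` accepts a Series (or dict) that maps column names to fill values and leaves every other column untouched. The sketch below uses small made-up frames as stand-ins for `X_train` and `X_test` (the values and the `num_vars` list are assumptions for illustration) and then checks that no missing values remain in the imputed columns:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the X_train / X_test frames used above.
X_train = pd.DataFrame({
    "A1": ["b", "a", None, "b"],
    "A2": [30.0, np.nan, 24.0, 58.0],
    "A3": [0.0, 4.0, np.nan, 1.5],
})
X_test = pd.DataFrame({
    "A1": ["a", None],
    "A2": [np.nan, 20.0],
    "A3": [5.0, np.nan],
})

num_vars = ["A2", "A3"]

# Medians are learned from the train set only: a Series indexed by column name.
medians = X_train[num_vars].median()

# fillna() aligns the Series on column names; columns not listed are left alone.
X_train = X_train.fillna(medians)
X_test = X_test.fillna(medians)

# The fraction of missing values in the imputed columns should now be 0.
print(X_train[num_vars].isnull().mean())
print(X_test[num_vars].isnull().mean())
```

The categorical column `A1` keeps its missing values here, since median imputation only applies to the numerical variables.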

**Impute missing values with the median using scikit-learn**

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
         data[['A2', 'A3', 'A8', 'A11', 'A15']], data['A16'],
         test_size=0.3, random_state=0)

Create a median imputation transformer using SimpleImputer() from scikit-learn:

In [14]:
imputer = SimpleImputer(strategy='median')

In [17]:
# then we fit the SimpleImputer() to the train set so that it learns the median values of the variables 

imputer.fit(X_train)

SimpleImputer()

In [18]:
# inspect the learned median values: 
imputer.statistics_

array([ 31.89019068,   4.84148193,   2.36901205,   2.51759834,
       966.25258799])

Note: the output shown above appears to come from a run with the default `strategy='mean'`: the repr reads `SimpleImputer()` rather than `SimpleImputer(strategy='median')`, and the statistics are column means (a median of the integer-valued A11 could not be 2.5176). With `strategy='median'`, `imputer.statistics_` holds the train-set medians instead.

In [19]:
# let's replace the missing values with medians; transform() returns a NumPy array:

X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
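The fit and transform steps above can be sanity-checked on a tiny array where the medians are easy to compute by hand. This is only a sketch: the `train` and `test` arrays are made up for illustration and are unrelated to the credit approval data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays with missing entries.
train = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [100.0, 30.0],
])
test = np.array([[np.nan, np.nan]])

# fit() learns one median per column from the train array, ignoring NaN.
imputer = SimpleImputer(strategy="median").fit(train)
print(imputer.statistics_)

# transform() fills the test array with the medians learned from train.
filled_test = imputer.transform(test)
print(filled_test)
```

The first column's median is taken over 1, 3, and 100, and the second over 10, 20, and 30, so the test row is filled with values learned entirely from the train array.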

In [20]:
pd.DataFrame(X_train, columns= ['A2', 'A3', 'A8', 'A11', 'A15'])

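|   | A2 | A3 | A8 | A11 | A15 |
|---|----|----|----|-----|-----|
| 0 | 46.08 | 3.000 | 2.375 | 8.0 | 4159.0 |
| 1 | 15.92 | 2.875 | 0.085 | 0.0 | 0.0 |
| 2 | 36.33 | 2.125 | 0.085 | 1.0 | 1187.0 |
| 3 | 22.17 | 0.585 | 0.000 | 0.0 | 0.0 |
| 4 | 57.83 | 7.040 | 14.000 | 6.0 | 1332.0 |
| ... | ... | ... | ... | ... | ... |
| 478 | 36.75 | 4.710 | 0.000 | 0.0 | 0.0 |
| 479 | 41.75 | 0.960 | 2.500 | 0.0 | 600.0 |
| 480 | 19.58 | 0.665 | 1.665 | 0.0 | 5.0 |
| 481 | 22.83 | 2.290 | 2.290 | 7.0 | 2384.0 |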
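
Rebuilding the dataframe from the NumPy array as above resets the row index to 0..n-1, so the rows no longer line up with the index of `y_train`. One possible way to keep both the column names and the original index is to pass them explicitly when wrapping the array. The sketch below uses a made-up frame `X` with arbitrary index labels (an assumption for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with a shuffled index, as train_test_split would produce.
X = pd.DataFrame(
    {"A2": [30.0, np.nan, 24.0], "A15": [0.0, 560.0, np.nan]},
    index=[596, 303, 204],
)

imputer = SimpleImputer(strategy="median")

# Wrap the returned array, reusing the original column names and index.
X_imputed = pd.DataFrame(
    imputer.fit_transform(X),
    columns=X.columns,
    index=X.index,
)
print(X_imputed)
```

Keeping the index makes it straightforward to join the imputed features back to the target or to the remaining, non-imputed columns.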
