## Missing Value Handling in Machine Learning

### Method 4: Impute missing values with the sklearn impute module (SimpleImputer)

YouTube explanation: https://youtu.be/ZGqedVjagK8

#### Univariate vs. Multivariate Imputation

**Univariate Imputation** algorithms impute values in the i-th feature dimension using only the non-missing values in that same feature dimension (e.g. impute.SimpleImputer).

**Multivariate Imputation** algorithms use the entire set of available feature dimensions to estimate the missing values (e.g. impute.IterativeImputer).
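
To make the distinction concrete, here is a minimal sketch on a made-up two-column array (the data is hypothetical, not taken from the sales dataset used later). It assumes a scikit-learn version in which `IterativeImputer` is still experimental and has to be enabled through `sklearn.experimental`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
# IterativeImputer is experimental and must be enabled explicitly before import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: the second column is exactly twice the first.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, np.nan]])

# Univariate: the gap is filled from column 1 alone, (2 + 4 + 6) / 3 = 4.
X_uni = SimpleImputer(strategy="mean").fit_transform(X)

# Multivariate: the gap is estimated with the help of column 0 as well.
X_multi = IterativeImputer(random_state=0).fit_transform(X)

print("univariate fill:  ", X_uni[3, 1])
print("multivariate fill:", X_multi[3, 1])
```

The univariate fill ignores the relationship between the two columns, while the multivariate estimate can take it into account.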

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.impute import SimpleImputer #Import SimpleImputer class from sklearn.impute

In [2]:
from platform import python_version
print("python",python_version())
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))

python 3.7.9
pandas 1.2.1
numpy 1.19.2
sklearn 0.23.2


### The SimpleImputer class

The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located.
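
As a quick illustration of the four strategies, the sketch below fills the same gap in a small hypothetical column with each one. The data and the `fill_value=0` choice are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with one missing entry at row 2.
X = np.array([[1.0], [2.0], [np.nan], [2.0], [10.0]])

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    # fill_value is only used by the "constant" strategy.
    "constant": SimpleImputer(strategy="constant", fill_value=0),
}

# Record the value each strategy writes into the missing cell.
fills = {name: imp.fit_transform(X)[2, 0] for name, imp in imputers.items()}
for name, value in fills.items():
    print(f"{name:>13}: {value}")
```

The non-missing values are 1, 2, 2 and 10, so the mean fill is 3.75, the median and most frequent fills are both 2, and the constant fill is 0.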

In [3]:
# Import Dataset
df_saless = pd.read_excel("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Feature_Engineering/Missing_Value/Saless.xlsx")

In [4]:
df_saless

Unnamed: 0,Date,Store_Type,City_Type,Day_Temp,No_of_Customers,Sales,Product_Quality
0,2020-10-01,1,1,30.0,100.0,3112.0,A
1,2020-10-02,2,1,32.0,115.0,3682.0,A
2,2020-10-03,3,3,31.0,,2774.0,A
3,2020-10-04,1,2,29.0,105.0,3182.0,
4,2020-10-05,1,2,33.0,104.0,1368.0,B
5,2020-10-07,2,2,,,,B
6,2020-11-24,2,3,26.0,90.0,4232.0,C
7,2020-11-25,3,3,,96.0,,
8,2020-11-26,2,2,27.0,100.0,2356.0,B
9,2020-11-28,3,1,,,,A


In [5]:
df_saless["Day_Temp"].mean()

28.11111111111111

In [6]:
df_saless["No_of_Customers"].mean()

99.44444444444444

In [7]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Impute NaN values in the "Day_Temp" column (position 3) with the mean of that column.
imputer = imputer.fit(df_saless.iloc[:, 3:4])
df_saless.iloc[:, 3:4] = imputer.transform(df_saless.iloc[:, 3:4])

In [8]:
# We can call fit_transform directly instead of calling fit and then transform.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Impute NaN values in the "Day_Temp" column with the mean of that column.
df_saless.iloc[:, 3:4] = imputer.fit_transform(df_saless.iloc[:, 3:4])
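
Keeping `fit` and `transform` separate matters when the fill value should be learned from one dataset and applied to another. The following is a minimal sketch with hypothetical `X_train` and `X_new` arrays (not part of the notebook's data).

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[30.0], [32.0], [np.nan], [28.0]])  # hypothetical training data
X_new = np.array([[np.nan], [35.0]])                    # hypothetical unseen data

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)          # learns the column mean: (30 + 32 + 28) / 3 = 30
print(imputer.statistics_)    # learned fill value, one entry per column

# transform reuses the mean learned from X_train, not the mean of X_new.
X_new_filled = imputer.transform(X_new)
print(X_new_filled)
```

When the same data is both fitted and transformed, as in the cells above, `fit_transform` is simply the shorter spelling.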

In [9]:
df_saless

Unnamed: 0,Date,Store_Type,City_Type,Day_Temp,No_of_Customers,Sales,Product_Quality
0,2020-10-01,1,1,30.0,100.0,3112.0,A
1,2020-10-02,2,1,32.0,115.0,3682.0,A
2,2020-10-03,3,3,31.0,,2774.0,A
3,2020-10-04,1,2,29.0,105.0,3182.0,
4,2020-10-05,1,2,33.0,104.0,1368.0,B
5,2020-10-07,2,2,28.111111,,,B
6,2020-11-24,2,3,26.0,90.0,4232.0,C
7,2020-11-25,3,3,28.111111,96.0,,
8,2020-11-26,2,2,27.0,100.0,2356.0,B
9,2020-11-28,3,1,28.111111,,,A


In [10]:
# Impute NaN values in the "Day_Temp" and "No_of_Customers" columns with the mean of each respective column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df_saless.iloc[:, 3:5])
df_saless.iloc[:, 3:5] = imputer.transform(df_saless.iloc[:, 3:5])

In [11]:
df_saless

Unnamed: 0,Date,Store_Type,City_Type,Day_Temp,No_of_Customers,Sales,Product_Quality
0,2020-10-01,1,1,30.0,100.0,3112.0,A
1,2020-10-02,2,1,32.0,115.0,3682.0,A
2,2020-10-03,3,3,31.0,99.444444,2774.0,A
3,2020-10-04,1,2,29.0,105.0,3182.0,
4,2020-10-05,1,2,33.0,104.0,1368.0,B
5,2020-10-07,2,2,28.111111,99.444444,,B
6,2020-11-24,2,3,26.0,90.0,4232.0,C
7,2020-11-25,3,3,28.111111,96.0,,
8,2020-11-26,2,2,27.0,100.0,2356.0,B
9,2020-11-28,3,1,28.111111,99.444444,,A
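
Positional slices such as `iloc[:, 3:5]` break silently if the column order changes. One possible alternative is to select the columns by name. The sketch below uses a small hypothetical frame that only borrows column names from the sales data, and the `num_cols` list is an illustrative helper.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small hypothetical frame reusing column names from the sales data.
df = pd.DataFrame({
    "Day_Temp": [30.0, np.nan, 26.0],
    "No_of_Customers": [100.0, 110.0, np.nan],
    "Product_Quality": ["A", "B", "A"],
})

num_cols = ["Day_Temp", "No_of_Customers"]  # columns to impute, chosen by name

imputer = SimpleImputer(strategy="mean")
# Each listed column is filled with its own mean; other columns are untouched.
df[num_cols] = imputer.fit_transform(df[num_cols])
print(df)
```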


**The SimpleImputer class also supports categorical data** represented as string values or pandas categoricals when using the 'most_frequent' or 'constant' strategy:

In [12]:
import pandas as pd
df = pd.DataFrame([["a", "x"],
                    [np.nan, "y"],
                    ["a", np.nan],
                    ["b", "y"]], dtype="category")

In [13]:
df

Unnamed: 0,0,1
0,a,x
1,,y
2,a,
3,b,y


In [14]:
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))

[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]


In [15]:
# Reload the dataset so that the original missing values are restored
df_saless = pd.read_excel("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Feature_Engineering/Missing_Value/Saless.xlsx")

In [16]:
df_saless

Unnamed: 0,Date,Store_Type,City_Type,Day_Temp,No_of_Customers,Sales,Product_Quality
0,2020-10-01,1,1,30.0,100.0,3112.0,A
1,2020-10-02,2,1,32.0,115.0,3682.0,A
2,2020-10-03,3,3,31.0,,2774.0,A
3,2020-10-04,1,2,29.0,105.0,3182.0,
4,2020-10-05,1,2,33.0,104.0,1368.0,B
5,2020-10-07,2,2,,,,B
6,2020-11-24,2,3,26.0,90.0,4232.0,C
7,2020-11-25,3,3,,96.0,,
8,2020-11-26,2,2,27.0,100.0,2356.0,B
9,2020-11-28,3,1,,,,A


In [17]:
imp_cat = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

In [18]:
# Replace NaN in the last column ("Product_Quality") with its most frequent value.
df_saless.iloc[:, -1:] = imp_cat.fit_transform(df_saless.iloc[:, -1:])

In [19]:
df_saless

Unnamed: 0,Date,Store_Type,City_Type,Day_Temp,No_of_Customers,Sales,Product_Quality
0,2020-10-01,1,1,30.0,100.0,3112.0,A
1,2020-10-02,2,1,32.0,115.0,3682.0,A
2,2020-10-03,3,3,31.0,,2774.0,A
3,2020-10-04,1,2,29.0,105.0,3182.0,A
4,2020-10-05,1,2,33.0,104.0,1368.0,B
5,2020-10-07,2,2,,,,B
6,2020-11-24,2,3,26.0,90.0,4232.0,C
7,2020-11-25,3,3,,96.0,,A
8,2020-11-26,2,2,27.0,100.0,2356.0,B
9,2020-11-28,3,1,,,,A


In [20]:
# Reload the dataset so that the original missing values are restored
df_saless = pd.read_excel("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Feature_Engineering/Missing_Value/Saless.xlsx")

In [21]:
df_saless

Unnamed: 0,Date,Store_Type,City_Type,Day_Temp,No_of_Customers,Sales,Product_Quality
0,2020-10-01,1,1,30.0,100.0,3112.0,A
1,2020-10-02,2,1,32.0,115.0,3682.0,A
2,2020-10-03,3,3,31.0,,2774.0,A
3,2020-10-04,1,2,29.0,105.0,3182.0,
4,2020-10-05,1,2,33.0,104.0,1368.0,B
5,2020-10-07,2,2,,,,B
6,2020-11-24,2,3,26.0,90.0,4232.0,C
7,2020-11-25,3,3,,96.0,,
8,2020-11-26,2,2,27.0,100.0,2356.0,B
9,2020-11-28,3,1,,,,A


In [22]:
# To replace NaN with a specific value, use strategy="constant" together with fill_value.
imp_cat = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value='G')

In [23]:
# Replace NaN in the last column ("Product_Quality") with the constant 'G'.
df_saless.iloc[:, -1:] = imp_cat.fit_transform(df_saless.iloc[:, -1:])

In [24]:
df_saless

Unnamed: 0,Date,Store_Type,City_Type,Day_Temp,No_of_Customers,Sales,Product_Quality
0,2020-10-01,1,1,30.0,100.0,3112.0,A
1,2020-10-02,2,1,32.0,115.0,3682.0,A
2,2020-10-03,3,3,31.0,,2774.0,A
3,2020-10-04,1,2,29.0,105.0,3182.0,G
4,2020-10-05,1,2,33.0,104.0,1368.0,B
5,2020-10-07,2,2,,,,B
6,2020-11-24,2,3,26.0,90.0,4232.0,C
7,2020-11-25,3,3,,96.0,,G
8,2020-11-26,2,2,27.0,100.0,2356.0,B
9,2020-11-28,3,1,,,,A
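
If `fill_value` is left at its default with `strategy="constant"`, scikit-learn documents that numeric data is filled with 0 and string or object data with the string "missing_value". The sketch below checks this on two hypothetical arrays. It is an illustration of the documented default, not part of the original notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column stored as an object array.
X_cat = np.array([["A"], [np.nan], ["B"]], dtype=object)
# Hypothetical numeric column.
X_num = np.array([[1.5], [np.nan], [3.0]])

# No fill_value given: the default depends on the data type.
cat_filled = SimpleImputer(strategy="constant").fit_transform(X_cat)
num_filled = SimpleImputer(strategy="constant").fit_transform(X_num)

print(cat_filled.ravel())
print(num_filled.ravel())
```

Passing an explicit `fill_value`, as in the cell above, is usually clearer than relying on these defaults.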


In [25]:
# Import the dataset
df_titanic = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Titanic/titanic_train.csv")

In [26]:
df_titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [27]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [28]:
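# Impute the numeric column "Age" using strategy='mean'
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform returns an (n, 1) array; ravel() flattens it to 1-D for the column assignment.
df_titanic["Age"] = imputer.fit_transform(df_titanic[["Age"]]).ravel()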

In [29]:
df_titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [30]:
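# Impute the categorical column "Embarked" using strategy='most_frequent'
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# fit_transform returns an (n, 1) array; ravel() flattens it to 1-D for the column assignment.
df_titanic["Embarked"] = imputer.fit_transform(df_titanic[["Embarked"]]).ravel()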

In [31]:
df_titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
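
The two Titanic steps above (mean for a numeric column, most frequent for a categorical one) can also be combined into a single object. The sketch below is one possible way to do that with `ColumnTransformer`. It runs on a tiny hypothetical frame that only borrows the column names `Age` and `Embarked`, so the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the two Titanic columns imputed above.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})

# One imputer per column group; the output columns follow this order.
ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["Age"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["Embarked"]),
])

# The result is a NumPy array, so it is wrapped back into a DataFrame.
filled = pd.DataFrame(ct.fit_transform(df), columns=["Age", "Embarked"])
print(filled)
print(filled.isnull().sum())
```

Here the missing age becomes (22 + 26 + 36) / 3 = 28 and the missing port becomes "S". Because numeric and string columns are stacked into one array, the resulting frame has object dtype and may need casting back with `astype`.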