<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Imputers in scikit-learn

In the last notebook, we learned how to deal with missing data. We also looked at the idea of imputing data: filling each gap with a value that seems reasonable, such as the column's mean or most frequent value.

scikit-learn comes with a `SimpleImputer` class (in the `sklearn.impute` module), which makes this easier.

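It's best practice to split your data into training and test sets before imputing values. Otherwise the fill values are computed using the test rows as well, and information from the test set leaks into training ("data leakage"). Note that the cells in this notebook impute on the whole DataFrame without splitting.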
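As a minimal sketch of that idea, the cell below splits first and then fits the imputer on the training rows only. The `toy_df` DataFrame and its values are made up for illustration and are not taken from the automobiles dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data (not the automobiles dataset)
toy_df = pd.DataFrame({
    "horsepower": [100.0, np.nan, 120.0, 90.0, np.nan, 110.0, 150.0, 95.0],
    "price": [13000, 16500, 17450, 9000, 15250, 12000, 21000, 8500],
})

X = toy_df[["horsepower"]]
y = toy_df["price"]

# Split BEFORE imputing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

imp = SimpleImputer(strategy="mean")
X_train_imputed = imp.fit_transform(X_train)  # mean is learned from training rows only
X_test_imputed = imp.transform(X_test)        # test rows reuse the training mean
```

Calling `transform` (rather than `fit_transform`) on the test set is the important part: the test rows never influence the fill value.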

In [None]:
import numpy as np
import pandas as pd

In [None]:
auto_df = pd.read_csv('../../Data/automobiles.csv')
auto_df 

In [None]:
# drop the symboling and normalised_losses columns
auto_df = auto_df.drop(['symboling', 'normalised_losses'], axis=1)
# drop the rows without a price
auto_df = auto_df.dropna(subset=['price']) 

In [None]:
from sklearn.impute import SimpleImputer

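Imputing is done in 3 stages: 
1. Create an instance of the imputer
2. Fit the imputer to the data (this is where it learns the fill value, e.g. the mode)
3. Transform the data (this is where the missing values are replaced)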
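The three stages can be sketched on a tiny made-up array (the `values` array is purely illustrative). After fitting, the learned fill value is available in the imputer's `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with one missing value
values = np.array([[2.0], [4.0], [np.nan], [4.0]])

imputer = SimpleImputer(strategy="most_frequent")  # 1. create the imputer
imputer.fit(values)                                # 2. fit: learn the most frequent value
filled = imputer.transform(values)                 # 3. transform: replace the NaN

learned = imputer.statistics_  # one learned fill value per column
```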

In [None]:
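# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
imp_mode = SimpleImputer(strategy='most_frequent', missing_values=np.nan)   # "most_frequent" is same as the "mode"
# fit: learn the most frequent value of the column
imp_mode.fit(auto_df[['num_of_doors']])
# transform: replace the missing values with the learned value
auto_df['num_of_doors'] = imp_mode.transform(auto_df[['num_of_doors']])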


However, we can fit and transform the data in a single call with `fit_transform`, as shown below.

In [None]:
imp_mode = SimpleImputer(strategy='most_frequent', missing_values=np.nan)   # "most_frequent" is same as the "mode"
auto_df['num_of_doors'] = imp_mode.fit_transform(auto_df[['num_of_doors']])
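One detail worth knowing: with scikit-learn's default output settings, `fit_transform` returns a 2D NumPy array rather than a DataFrame. The sketch below uses a made-up `toy_df` to show the shape of the result and one way to write it back into a single column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column; dtype=object keeps the strings and NaN as plain Python objects
toy_df = pd.DataFrame({"num_of_doors": ["four", "two", np.nan, "four"]}, dtype=object)

imp_mode = SimpleImputer(strategy="most_frequent", missing_values=np.nan)
result = imp_mode.fit_transform(toy_df[["num_of_doors"]])

# result is a 2D array with one column; flatten it before assigning to a single column
toy_df["num_of_doors"] = result.ravel()
```

Flattening with `ravel()` is optional here, but it makes it explicit that a one-dimensional column is being assigned.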

In [None]:
imp_mean = SimpleImputer(strategy='mean', missing_values=np.nan)
auto_df[['bore', 'stroke']] = imp_mean.fit_transform(auto_df[['bore','stroke']])

In [None]:
imp_median = SimpleImputer(strategy='median', missing_values=np.nan)
# use the median imputer here (not imp_mean), so the median strategy is actually applied
auto_df[['horsepower','peak_rpm']] = imp_median.fit_transform(auto_df[['horsepower','peak_rpm']])
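The choice between `mean` and `median` matters most when a column contains outliers. As a rough illustration on made-up numbers (not values from the dataset), the mean is pulled towards the outlier while the median is not:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one extreme value and one missing value
values = np.array([[100.0], [110.0], [120.0], [1000.0], [np.nan]])

mean_fill = SimpleImputer(strategy="mean").fit(values).statistics_[0]
median_fill = SimpleImputer(strategy="median").fit(values).statistics_[0]

print(mean_fill)    # mean of the four known values: 332.5
print(median_fill)  # median of the four known values: 115.0
```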

In [None]:
auto_df