### Feature engineering - Imputers
scikit-learn `SimpleImputer` on the Titanic dataset

***
#### Environment
`conda activate sklearn-env`

***
#### Goals
- Replace missing continuous values with the median of the observed values in the same column.
- Replace missing discrete or categorical values with the most frequent value in the same column, or with a constant placeholder.
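As a minimal sketch of these goals, the snippet below applies `SimpleImputer` to a small made-up frame (the `toy` data and its `port` column are hypothetical, not taken from the Titanic dataset). It shows the `mean` and `median` strategies side by side for a numeric column and `most_frequent` for a categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 70.0],
    "port": pd.Series(["S", "C", "S", np.nan], dtype=object),
})

# Continuous column: fill NaN with a statistic of the observed values.
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy[["age"]])
median_filled = SimpleImputer(strategy="median").fit_transform(toy[["age"]])

# Categorical column: fill NaN with the most frequent observed value.
port_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy[["port"]])

print(mean_filled.ravel())    # the gap becomes 40.0, the mean of 20, 30, 70
print(median_filled.ravel())  # the gap becomes 30.0, the median of 20, 30, 70
print(port_filled.ravel())    # the gap becomes "S"
```

The median is less sensitive to outliers than the mean, which is one reason it is often preferred for skewed columns.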

***
#### References
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html


#### Basic Python imports

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import seaborn as sns

# Make numpy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)

#### Load the dataset from https://www.openml.org using the sklearn API

https://www.openml.org/d/40945

In [None]:
from sklearn.datasets import fetch_openml

# Load data from https://www.openml.org/d/40945
raw_dataset = fetch_openml("titanic", version=1, as_frame=True).frame
dataset = raw_dataset.copy()
dataset.head(10)

In [None]:
# Drop columns that are not used in this notebook.
dataset = dataset.drop(columns=['boat', 'body', 'home.dest'])
dataset

#### Verify missing data in the dataset

Notice:
- The `age`, `fare`, `cabin` and `embarked` columns contain missing values.
- `age` and `fare` are continuous.
- `cabin` and `embarked` are categorical.

In [None]:
dataset.isna().sum()
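Raw counts are easier to judge next to the column type and the share of rows affected. The helper below is a hypothetical sketch (`missing_summary` is not part of pandas or scikit-learn), demonstrated on a made-up frame so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count and missing percentage per column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    # Keep only columns that actually have gaps, worst first.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "sex": ["male", "female", "female", "male"],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(dataset)` in this notebook should list the same columns as `dataset.isna().sum()`, restricted to those with at least one gap.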

In [None]:
sns.set(rc={'figure.figsize':(15,15)})
sns.heatmap(dataset.isnull(), cbar = False).set_title("Missing values heatmap")

#### Impute the continuous columns (`age`, `fare`) with the median

In [None]:
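from sklearn.impute import SimpleImputer

# Learn the median of each column; missing entries are ignored when it is computed.
imputer_numeric = SimpleImputer(strategy='median').fit(dataset[['age', 'fare']])

In [None]:
# Fill the missing entries of both columns with the learned medians.
dataset[['age', 'fare']] = imputer_numeric.transform(dataset[['age', 'fare']])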
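A fitted `SimpleImputer` stores the value it will use for each column in its `statistics_` attribute, which is a convenient way to check what was actually filled in. The sketch below uses a made-up frame (the `toy` values are hypothetical) rather than the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.0, np.nan, 8.05],
})

imputer = SimpleImputer(strategy="median").fit(toy)

# statistics_ holds one fill value per input column, in column order.
fill_values = dict(zip(toy.columns, imputer.statistics_))
print(fill_values)

# transform returns a NumPy array, so the column labels are restored by hand.
filled = pd.DataFrame(imputer.transform(toy), columns=toy.columns, index=toy.index)
print(filled)
```

Because the statistics are learned in `fit` and applied in `transform`, the same fitted imputer can be reused on other data with the same columns.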

In [None]:
dataset.isna().sum()

In [None]:
sns.set(rc={'figure.figsize':(15,15)})
sns.heatmap(dataset.isnull(), cbar = False).set_title("Missing values heatmap")

#### Impute `embarked` with the most frequent value

In [None]:
from sklearn.impute import SimpleImputer

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0.
imputer_embarked = SimpleImputer(missing_values=np.nan,
                                 strategy='most_frequent').fit(dataset[['embarked']])

In [None]:
dataset[['embarked']] = imputer_embarked.transform(dataset[['embarked']])

#### Impute `cabin` with a constant placeholder

In [None]:
from sklearn.impute import SimpleImputer

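# Assumes missing `cabin` entries are stored as None; if your scikit-learn
# version loads them as NaN instead, pass missing_values=np.nan.
dataset[['cabin']] = SimpleImputer(missing_values=None,
                                   strategy='constant',
                                   fill_value='unknown_cabin').fit_transform(dataset[['cabin']])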

In [None]:
dataset.isna().sum()
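The three imputation steps can also be bundled into a single object. The following is only a sketch of one possible arrangement using `ColumnTransformer`, shown on a made-up frame (the `toy` values are hypothetical and every missing entry is assumed to be NaN); it is not how this notebook performs the imputation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not the Titanic dataset; all gaps are NaN.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.0, np.nan, 8.05],
    "embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
    "cabin": pd.Series(["C85", np.nan, np.nan, "E46"], dtype=object),
})

# One imputer per group of columns, each with its own strategy.
imputers = ColumnTransformer([
    ("numeric", SimpleImputer(strategy="median"), ["age", "fare"]),
    ("embarked", SimpleImputer(strategy="most_frequent"), ["embarked"]),
    ("cabin", SimpleImputer(strategy="constant", fill_value="unknown_cabin"), ["cabin"]),
])

# The output columns follow the order of the transformers listed above.
columns = ["age", "fare", "embarked", "cabin"]
filled = pd.DataFrame(imputers.fit_transform(toy), columns=columns, index=toy.index)
print(filled)
```

Because numeric and string outputs are stacked into one array, the resulting columns have `object` dtype and may need to be converted back with `astype` before further numeric work.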

Notice the replaced values in the dataset, for example `unknown_cabin` in the `cabin` column.

In [None]:
dataset.sample(20)