### Feature Engineering Techniques: Imputation

This is the second notebook in a hands-on series on [feature engineering techniques](https://heartbeat.fritz.ai/hands-on-with-feature-engineering-techniques-variables-types-b2120e534680).

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

In [3]:
grades = pd.read_csv("data/class-grades.csv", na_values="NA")
grades.head(2)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33


### Mean/Median Imputation for Missing Values

In [4]:
grades.isna().sum()

Prefix        0
Assignment    0
Tutorial      0
Midterm       0
TakeHome      1
Final         3
dtype: int64

In [5]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(grades)

SimpleImputer()

In [6]:
train = imputer.transform(grades)
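
The cell above uses the mean; the median variant named in the heading only changes the `strategy` argument. The sketch below illustrates it on a small made-up frame (`toy` is hypothetical, not the class-grades file). It also shows that `SimpleImputer` returns a NumPy array by default, so the column names have to be reattached if a DataFrame is wanted.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data for illustration only (not the class-grades file).
toy = pd.DataFrame({
    "TakeHome": [50.0, 70.0, np.nan, 90.0],
    "Final": [40.0, np.nan, 60.0, 100.0],
})

# Same imputer as above, but filling with the per-column median.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled = median_imputer.fit_transform(toy)  # a NumPy array by default

# Reattach the column names and index to get a DataFrame back.
toy_filled = pd.DataFrame(filled, columns=toy.columns, index=toy.index)
print(toy_filled)
```

The median is usually the safer choice when a column is skewed or has outliers, since a few extreme grades can pull the mean away from the typical value.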

### Arbitrary Value Imputation for Missing Values

In [7]:
# create the imputer, with 999 as the arbitrary fill value
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=999)

In [8]:
# fit the imputer to the data
imputer.fit(grades)

# apply the transformation to the same data
train = imputer.transform(grades)
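
An arbitrary value is only useful if it cannot be confused with a real observation. The following is a minimal pandas sketch of that idea, assuming a made-up `Final` column and the same fill value of 999; the range check is an illustrative addition, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column for illustration only.
toy = pd.DataFrame({"Final": [52.5, np.nan, 68.33, np.nan]})

ARBITRARY_VALUE = 999

# The fill value should sit outside the observed range,
# otherwise imputed entries look like genuine grades.
if ARBITRARY_VALUE <= toy["Final"].max():
    raise ValueError("fill value overlaps with observed data")

toy_filled = toy.fillna(ARBITRARY_VALUE)

# Imputed rows can be recovered later because the value is distinctive.
n_flagged = int((toy_filled["Final"] == ARBITRARY_VALUE).sum())
print(toy_filled)
print("imputed rows:", n_flagged)
```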

### End-of-Tail Imputation

In [11]:
from feature_engine.imputation import EndTailImputer

In [16]:
# create the imputer
imputer = EndTailImputer(imputation_method='gaussian', tail='right')

# fit the imputer to the train set
imputer.fit(grades)

# transform the data
train = imputer.transform(grades)
train.head(2)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
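
To make the idea concrete, the Gaussian right-tail value can be computed by hand. This sketch assumes the common rule of mean plus three standard deviations (the factor of 3 and the toy `Final` column are assumptions for illustration); it is a pandas approximation of the technique, not the library's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column for illustration only.
toy = pd.DataFrame({"Final": [40.0, 50.0, 60.0, np.nan]})

FOLD = 3  # assumed multiplier for the standard deviation

# Right-tail value under a Gaussian assumption: mean + FOLD * std.
# pandas .std() uses the sample standard deviation (ddof=1).
tail_value = toy["Final"].mean() + FOLD * toy["Final"].std()

# Replace missing entries with the tail value.
toy_filled = toy.fillna({"Final": tail_value})
print(tail_value)
print(toy_filled)
```

Because the fill value sits at the far end of the distribution, imputed rows stand apart from observed ones, much like arbitrary value imputation but with a value derived from the data.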


### Frequent Category Imputation

In [17]:
# create the imputer, using the most frequent value of each column as the fill value
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# fit the imputer to the train set
imputer.fit(grades)

SimpleImputer(strategy='most_frequent')

In [19]:
# transform the data
train = imputer.transform(grades)
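
Most-frequent imputation is typically applied to categorical columns, which the grades data does not contain. As a hedged illustration, the sketch below uses a made-up `Section` column (hypothetical, not in the dataset) and plain pandas to fill missing labels with the mode.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column for illustration only.
toy = pd.DataFrame({"Section": ["A", "B", "A", np.nan, "A", "B"]})

# mode() ignores missing values; take the first entry in case of ties.
most_frequent = toy["Section"].mode().iloc[0]

# Fill missing labels with the most frequent category.
toy_filled = toy.fillna({"Section": most_frequent})
print(toy_filled)
```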

### Complete Case Analysis

In [20]:
# drop every row that contains at least one missing value
train = grades.dropna()

In [22]:
train.shape

(95, 6)

In [23]:
grades.shape

(99, 6)
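
Comparing the two shapes shows how many rows complete case analysis discards. A small sketch of that check, on a made-up frame (the `toy` data is hypothetical), might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "TakeHome": [51.48, np.nan, 80.0, 75.0],
    "Final": [52.5, 68.33, np.nan, 90.0],
})

# Keep only rows with no missing value in any column.
complete = toy.dropna()

# Fraction of rows lost; worth checking before committing to this approach.
dropped_fraction = 1 - len(complete) / len(toy)
print(complete)
print("fraction of rows dropped:", dropped_fraction)
```

When the dropped fraction is small, as with the 4 of 99 rows above, discarding rows is often acceptable; when it is large, one of the imputation methods is usually preferable.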

### Random Sample Imputation

In [25]:
from feature_engine.imputation import RandomSampleImputer

# create a random sampler imputer
imputer = RandomSampleImputer(random_state=42)

# fit with data
imputer.fit(grades)

# transform the data
train = imputer.transform(grades)
train.shape
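
The idea behind random sample imputation can also be sketched with pandas alone: draw replacements from the observed values of the same column. The toy `Final` column and the seed are assumptions for illustration, and this is an approximation of the technique rather than the library's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column for illustration only.
toy = pd.DataFrame({"Final": [52.5, np.nan, 68.33, np.nan, 90.0]})

observed = toy["Final"].dropna()
missing_mask = toy["Final"].isna()

# Draw one observed value per missing entry (with replacement),
# using a fixed seed so the result is reproducible.
sampled = observed.sample(
    n=int(missing_mask.sum()), replace=True, random_state=42
)

toy_filled = toy.copy()
# Assign by position; the sampled index labels are not meaningful here.
toy_filled.loc[missing_mask, "Final"] = sampled.to_numpy()
print(toy_filled)
```

Sampling from the observed values preserves the column's distribution better than a single constant, at the cost of adding randomness that must be seeded for reproducibility.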