# Imputation

The problems in this notebook correspond to the concepts covered in `Lectures/Cleaning/Imputation`.

In [1]:
import pandas as pd
import numpy as np

##### 1. Load `iris` with NAs

In these problems you will be working with a slightly different version of the iris data set. Load this version below, then use `.info` to identify the column(s) with missing values.

In [2]:
iris = pd.read_csv("../../Data/iris_w_nas.csv")

##### Sample Solution

In [3]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   135 non-null    float64
 4   iris_class    150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
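`.info()` reports non-null counts, so the number of missing entries has to be read off by subtraction (150 - 135 = 15 for `petal_width`). As a complementary check, `isna().sum()` counts missing entries directly. The sketch below uses a small made-up frame; `toy` and its values are hypothetical stand-ins, not the iris file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the iris data
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "petal_width": [0.2, np.nan, 2.5, np.nan],
})

# Number of missing entries in each column
na_counts = toy.isna().sum()

# Names of the columns that contain at least one missing value
cols_with_nas = na_counts[na_counts > 0].index.tolist()

print(na_counts)
print(cols_with_nas)
```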


##### 2. Train test split

Make a train-test split for these data. Remember to stratify by `iris_class`.

##### Sample Solution

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
iris_train, iris_test = train_test_split(iris.copy(),
                                            shuffle=True,
                                            random_state=233,
                                            stratify = iris['iris_class'])
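One way to confirm that stratification worked is to compare class proportions in the two pieces. The sketch below does this on made-up balanced data (the frame `toy` and its columns `x` and `label` are assumptions for illustration), using the same `train_test_split` arguments as the solution above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 3 classes with 50 rows each
toy = pd.DataFrame({
    "x": np.arange(150, dtype=float),
    "label": np.repeat([0, 1, 2], 50),
})

toy_train, toy_test = train_test_split(
    toy,
    shuffle=True,
    random_state=233,
    stratify=toy["label"],
)

# With stratification, each class should make up roughly 1/3 of both pieces
train_props = toy_train["label"].value_counts(normalize=True).sort_index()
test_props = toy_test["label"].value_counts(normalize=True).sort_index()

print(train_props)
print(test_props)
```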

##### 4. `SimpleImputer`

Use `sklearn`'s `SimpleImputer` object to impute the missing values of the training set with the median of that column.

##### Sample Solution

In [6]:
from sklearn.impute import SimpleImputer

In [7]:
impute = SimpleImputer(strategy = 'median')

impute.fit(iris_train[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])

impute.transform(iris_train[['sepal_length', 
                             'sepal_width', 
                             'petal_length', 
                             'petal_width']])[iris_train.petal_width.isna()]

array([[6.7, 3.3, 5.7, 1.3],
       [6.5, 3. , 5.8, 1.3],
       [5.2, 3.4, 1.4, 1.3],
       [5.1, 3.8, 1.5, 1.3],
       [6.1, 2.8, 4. , 1.3],
       [5.4, 3.9, 1.3, 1.3],
       [5.1, 3.5, 1.4, 1.3],
       [5.8, 2.7, 5.1, 1.3],
       [5.1, 2.5, 3. , 1.3],
       [6.1, 2.6, 5.6, 1.3]])
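Every imputed `petal_width` above equals 1.3 because a fitted `SimpleImputer` stores a single statistic per column and reuses it for every missing entry. The fitted values are available in the `statistics_` attribute. A minimal sketch on made-up one-column data (`X_train` here is hypothetical, not the iris training set):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column training data with one missing value
X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)

# Median of the observed values 1, 2, 4, 100; the NaN is ignored when fitting
learned_median = imputer.statistics_[0]

# The missing entry is replaced by the learned median
X_filled = imputer.transform(X_train)

print(learned_median)
print(X_filled.ravel())
```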

##### 5. Imputing test data

Impute the missing values in the test set.

##### Sample Solution

In [8]:
impute.transform(iris_test[['sepal_length', 
                             'sepal_width', 
                             'petal_length', 
                             'petal_width']])[iris_test.petal_width.isna()]

array([[4.4, 3. , 1.3, 1.3],
       [5.8, 2.7, 3.9, 1.3],
       [6.5, 2.8, 4.6, 1.3],
       [6.3, 2.8, 5.1, 1.3],
       [7. , 3.2, 4.7, 1.3]])
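Note that the test values are filled with the median learned from the training set, so no information from the test set leaks into the imputation. `transform` returns a NumPy array, which drops the column names and the shuffled row index. One possible way to keep them is to wrap the result back into a `DataFrame`, sketched here on made-up data (`toy_train`, `toy_test` and the columns `a` and `b` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cols = ["a", "b"]

# Hypothetical train and test frames with non-default row labels
toy_train = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, np.nan, 30.0]},
    index=[7, 3, 9],
)
toy_test = pd.DataFrame({"a": [4.0], "b": [np.nan]}, index=[5])

# Fit on the training data only
imputer = SimpleImputer(strategy="median").fit(toy_train[cols])

# Transform the test data, restoring column names and the original index
toy_test_filled = pd.DataFrame(
    imputer.transform(toy_test[cols]),
    columns=cols,
    index=toy_test.index,
)

print(toy_test_filled)
```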

##### 6. MLR imputation

Build a multiple linear regression model using `sepal_length`, `sepal_width` and `petal_length` to impute the missing `petal_width` values in the training and test sets.

##### Sample Solution

In [9]:
from sklearn.linear_model import LinearRegression

In [10]:
reg = LinearRegression()

features = ['sepal_length', 'sepal_width', 'petal_length']

# Fit only on the training rows where petal_width was observed
iris_train_no_nas = iris_train.loc[~iris_train.petal_width.isna()].copy()
reg.fit(iris_train_no_nas[features], iris_train_no_nas['petal_width'])

# Fill the missing training values with the model's predictions
train_nas = iris_train.petal_width.isna()
iris_train['petal_width_imputed'] = iris_train['petal_width'].copy()
iris_train.loc[train_nas, 'petal_width_imputed'] = reg.predict(iris_train.loc[train_nas, features])

# Fill the missing test values using the same model, which was fit on the training set
test_nas = iris_test.petal_width.isna()
iris_test['petal_width_imputed'] = iris_test['petal_width'].copy()
iris_test.loc[test_nas, 'petal_width_imputed'] = reg.predict(iris_test.loc[test_nas, features])



In [11]:
iris_train.loc[iris_train.petal_width.isna()]

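|     | sepal_length | sepal_width | petal_length | petal_width | iris_class | petal_width_imputed |
|-----|--------------|-------------|--------------|-------------|------------|---------------------|
| 124 | 6.7 | 3.3 | 5.7 | NaN | 2 | 2.087729 |
| 104 | 6.5 | 3.0 | 5.8 | NaN | 2 | 2.119041 |
| 28  | 5.2 | 3.4 | 1.4 | NaN | 0 | 0.137195 |
| 19  | 5.1 | 3.8 | 1.5 | NaN | 0 | 0.295013 |
| 71  | 6.1 | 2.8 | 4.0 | NaN | 1 | 1.204596 |
| 16  | 5.4 | 3.9 | 1.3 | NaN | 0 | 0.147931 |
| 0   | 5.1 | 3.5 | 1.4 | NaN | 0 | 0.178882 |
| 101 | 5.8 | 2.7 | 5.1 | NaN | 2 | 1.829203 |
| 98  | 5.1 | 2.5 | 3.0 | NaN | 1 | 0.817573 |
| 134 | 6.1 | 2.6 | 5.6 | NaN | 2 | 2.011481 |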


In [12]:
iris_test.loc[iris_test.petal_width.isna()]

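|     | sepal_length | sepal_width | petal_length | petal_width | iris_class | petal_width_imputed |
|-----|--------------|-------------|--------------|-------------|------------|---------------------|
| 38  | 4.4 | 3.0 | 1.3 | NaN | 0 | 0.165346 |
| 82  | 5.8 | 2.7 | 3.9 | NaN | 1 | 1.192503 |
| 54  | 6.5 | 2.8 | 4.6 | NaN | 1 | 1.440293 |
| 133 | 6.3 | 2.8 | 5.1 | NaN | 2 | 1.746911 |
| 50  | 7.0 | 3.2 | 4.7 | NaN | 1 | 1.474131 |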
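With real data there is no ground truth for the imputed values, so it can help to sanity check the procedure on synthetic data where the truth is known. The sketch below wraps the regression imputation step in a helper and applies it to made-up data that follows an exact linear relationship. The function name `regression_impute`, the frame `toy` and its columns are all assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def regression_impute(model, df, features, target):
    """Return a copy of df[target] with NaNs replaced by model predictions."""
    filled = df[target].copy()
    missing = filled.isna()
    # Only call predict when there is something to fill
    if missing.any():
        filled.loc[missing] = model.predict(df.loc[missing, features])
    return filled


# Hypothetical data following y = 2*x1 + 3*x2 + 1 exactly
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=20), "x2": rng.normal(size=20)})
toy["y"] = 2 * toy["x1"] + 3 * toy["x2"] + 1
truth = toy["y"].copy()

# Knock out two values to play the role of missing data
toy.loc[[4, 11], "y"] = np.nan

# Fit only on the rows where the target was observed
features = ["x1", "x2"]
observed = toy.dropna(subset=["y"])
model = LinearRegression().fit(observed[features], observed["y"])

toy["y_imputed"] = regression_impute(model, toy, features, "y")

print(toy.loc[[4, 11], ["y", "y_imputed"]])
print(truth.loc[[4, 11]])
```

Because the synthetic relationship has no noise, the imputed values should match the held-out truth up to floating point error. On real data such as iris, the imputed values are only estimates.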


##### 7. `IterativeImputer`

`sklearn` also has another imputer object, currently in development, that more easily implements imputation similar to what you were asked to do in problem 6.

Because this object is not yet stable, we will not demonstrate it here, but you can read more about `IterativeImputer` at <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer">https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer</a>. Learn it on your own, or watch for later versions of `sklearn` in which this object has become stable.

##### 8. `KNNImputer`

Another model-based approach to imputation is to use a $k$-nearest neighbors regressor to impute the missing values. This <b>can</b> be implemented directly using `sklearn`'s `KNNImputer` object, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer">https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer</a>.

Use `KNNImputer` to impute the missing `petal_width` values in the training and test sets. Imagine you are building a model to classify the iris class of each observation. Should you include `iris_class` as input into this imputation?

##### Sample Solution

In [13]:
from sklearn.impute import KNNImputer

In [14]:
impute = KNNImputer()


impute.fit(iris_train[['sepal_length', 
                           'sepal_width', 
                           'petal_length',
                           'petal_width']])

impute.transform(iris_train[['sepal_length', 
                           'sepal_width', 
                           'petal_length',
                           'petal_width']])[iris_train.petal_width.isna()]

array([[6.7 , 3.3 , 5.7 , 2.22],
       [6.5 , 3.  , 5.8 , 2.  ],
       [5.2 , 3.4 , 1.4 , 0.26],
       [5.1 , 3.8 , 1.5 , 0.18],
       [6.1 , 2.8 , 4.  , 1.22],
       [5.4 , 3.9 , 1.3 , 0.22],
       [5.1 , 3.5 , 1.4 , 0.24],
       [5.8 , 2.7 , 5.1 , 1.98],
       [5.1 , 2.5 , 3.  , 1.08],
       [6.1 , 2.6 , 5.6 , 1.84]])

In [15]:
impute.transform(iris_test[['sepal_length', 
                           'sepal_width', 
                           'petal_length',
                           'petal_width']])[iris_test.petal_width.isna()]

array([[4.4 , 3.  , 1.3 , 0.16],
       [5.8 , 2.7 , 3.9 , 1.18],
       [6.5 , 2.8 , 4.6 , 1.46],
       [6.3 , 2.8 , 5.1 , 1.82],
       [7.  , 3.2 , 4.7 , 1.5 ]])
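With its default settings, `KNNImputer` uses `n_neighbors=5` and uniform weights, so each imputed value is the plain average of that feature over the nearest rows that have it observed. The imputed values above, all multiples of 0.02, are consistent with averaging five one-decimal measurements. The sketch below illustrates the mechanism on a made-up array with `n_neighbors=2` (the array `X` is hypothetical):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the last row is missing its second feature
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [10.0, 100.0],
    [2.1, np.nan],
])

knn = KNNImputer(n_neighbors=2)
X_filled = knn.fit_transform(X)

# Judged by the first feature, the two nearest rows to 2.1 are 2.0 and 3.0,
# so the imputed value is the mean of their second features, 20 and 30
imputed = X_filled[4, 1]
manual = np.mean([20.0, 30.0])

print(imputed, manual)
```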

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).