# The Experiment

With our datasets now cleaned of all NaN values, we're going to load them and deliberately remove data so we can measure how well imputation recovers it. Let's get started!

## Importing Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Preparation

We're going to load up the `iris_cleaned` dataset and designate it as our "control" group.

In [2]:
iris_ctrl = pd.read_csv('datasets/iris/iris_cleaned')
iris_ctrl = iris_ctrl.drop('Unnamed: 0', axis=1)
iris_ctrl.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Next we'll make a copy to act as our "experimental" group.

In [3]:
iris_exp = iris_ctrl.copy()
iris_exp.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Now we'll randomly replace roughly 10% of the values in each feature column with NaN.

In [4]:
# Defining feature columns
features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
            'petal width (cm)']

# Inserting NaN values into the experimental group
for col in features:
    # Draw 10% of the rows (15 of 150) for each column.
    # Sampling is done with replacement, so a row may be drawn more than once
    # and a column can end up with slightly fewer than 15 NaN values.
    iris_exp.loc[iris_exp.sample(frac=0.1, replace=True).index, col] = np.nan

iris_exp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  137 non-null    float64
 1   sepal width (cm)   135 non-null    float64
 2   petal length (cm)  135 non-null    float64
 3   petal width (cm)   135 non-null    float64
 4   target             150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
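
The non-null counts above are not all identical because rows are drawn with replacement. A minimal sketch of the difference, using a hypothetical one-column frame `toy` and a fixed `random_state` (both are assumptions for illustration, not part of the experiment):

```python
import numpy as np
import pandas as pd

# Hypothetical 150-row frame standing in for the iris data
toy = pd.DataFrame({'x': np.arange(150, dtype=float)})

# With replacement: 15 draws, but duplicates are possible
with_repl = toy.sample(frac=0.1, replace=True, random_state=0).index

# Without replacement: always 15 distinct rows
without_repl = toy.sample(frac=0.1, replace=False, random_state=0).index

print(len(with_repl), with_repl.nunique())        # 15 draws, at most 15 unique rows
print(len(without_repl), without_repl.nunique())  # 15 draws, exactly 15 unique rows
```

If an exact 10% per column were required, `replace=False` would be the simpler choice.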


In [5]:
# Selecting the rows that contain at least one NaN value
nan_rows = iris_exp[iris_exp.isna().any(axis=1)]
nan_rows

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
1,4.9,,,0.2,setosa
3,4.6,,1.5,0.2,setosa
5,5.4,,1.7,0.4,setosa
7,5.0,3.4,,0.2,setosa
15,5.7,4.4,,,setosa
18,5.7,,1.7,0.3,setosa
25,5.0,,1.6,0.2,setosa
28,5.2,3.4,1.4,,setosa
29,4.7,3.2,1.6,,setosa
33,5.5,4.2,1.4,,setosa


## Create an Answer Key

Now that we've replaced roughly 10% of each feature column with `NaN` values, we'll **use the index of each affected row** to **subset an answer key** from the **control group** to measure our results against.

In [6]:
# Creating list of index labels for the affected rows
null_idx = list(nan_rows.index)

# Creating Answer Key to compare future results against
# (.loc looks rows up by index label, which is what null_idx holds)
answer_key = iris_ctrl.loc[null_idx]
answer_key

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
1,4.9,3.0,1.4,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
7,5.0,3.4,1.5,0.2,setosa
15,5.7,4.4,1.5,0.4,setosa
18,5.7,3.8,1.7,0.3,setosa
25,5.0,3.0,1.6,0.2,setosa
28,5.2,3.4,1.4,0.2,setosa
29,4.7,3.2,1.6,0.2,setosa
33,5.5,4.2,1.4,0.2,setosa
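
Label-based and position-based lookups return the same rows here only because the control group has a default `RangeIndex`. A minimal sketch of how they diverge once the index is no longer 0, 1, 2, ... (the frame `shuffled` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame whose index labels are not in positional order
shuffled = pd.DataFrame({'v': [10, 20, 30]}, index=[2, 0, 1])
labels = [0, 1]

by_label = shuffled.loc[labels, 'v'].tolist()    # rows whose index label is 0 and 1
by_position = shuffled['v'].iloc[labels].tolist()  # rows at positions 0 and 1

print(by_label)     # [20, 30]
print(by_position)  # [10, 20]
```

Since the answer key is built from index labels, a label-based lookup keeps working even if the frame is later filtered or reordered.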


## One-Hot-Encoding

KNN Imputation fills each missing value with the mean of that feature across the *k* nearest neighbors. Since it relies on distances and *means*, non-numerical values are useless to us. To fix this, we're going to One-Hot Encode our target variable.

Important Note: Normally, I would Label Encode the target variable, since we don't want multiple target variables (e.g. $y_{1}, y_{2}, y_{3} = w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + \dots + b$). However, KNN Imputation requires scaling our data, and keeping every column between 0 and 1 keeps things simple and easy to revert. Therefore, we will make binary dummy columns.

In [7]:
# Creating Dummy Columns
# dtype=int keeps the dummies as 0/1 (newer pandas versions default to booleans)
target_vars = iris_exp['target']
target_dummies = pd.get_dummies(target_vars, drop_first=False, dtype=int)

# Adding dummies to dataframe
iris_exp2 = pd.concat([iris_exp, target_dummies], axis=1)
iris_exp2 = iris_exp2.drop('target', axis=1)
iris_exp2.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),setosa,versicolor,virginica
0,5.1,3.5,1.4,0.2,1,0,0
1,4.9,,,0.2,1,0,0
2,4.7,3.2,1.3,0.2,1,0,0
3,4.6,,1.5,0.2,1,0,0
4,5.0,3.6,1.4,0.2,1,0,0


In [8]:
iris_exp2.describe()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),setosa,versicolor,virginica
count,137.0,135.0,135.0,135.0,150.0,150.0,150.0
mean,5.781022,3.068889,3.781481,1.200741,0.333333,0.333333,0.333333
std,0.816154,0.429763,1.782761,0.749776,0.472984,0.472984,0.472984
min,4.3,2.0,1.0,0.1,0.0,0.0,0.0
25%,5.1,2.8,1.6,0.3,0.0,0.0,0.0
50%,5.7,3.0,4.4,1.3,0.0,0.0,0.0
75%,6.4,3.4,5.1,1.8,1.0,1.0,1.0
max,7.9,4.4,6.9,2.5,1.0,1.0,1.0


## Scaling

As mentioned earlier, we need to scale our data when we use KNN Imputation. This step is necessary because KNN Imputation uses the **distance between points** to determine the nearest neighbors, so features with larger numeric ranges would dominate the distance and bias the result.

On a dataset like this, where every feature is measured in centimeters, scaling matters less, but using the MinMax scaler won't hurt.

In [9]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaled_iris = pd.DataFrame(scaler.fit_transform(iris_exp2), 
                           columns=iris_exp2.columns)
scaled_iris.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),setosa,versicolor,virginica
0,0.222222,0.625,0.067797,0.041667,1.0,0.0,0.0
1,0.166667,,,0.041667,1.0,0.0,0.0
2,0.111111,0.5,0.050847,0.041667,1.0,0.0,0.0
3,0.083333,,0.084746,0.041667,1.0,0.0,0.0
4,0.194444,0.666667,0.067797,0.041667,1.0,0.0,0.0
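
Notice that the NaN cells survive scaling untouched: `MinMaxScaler` ignores missing values when fitting and passes them through when transforming. A small sketch on a made-up array `X` (hypothetical values, not the iris data) shows this, along with the round trip through `inverse_transform`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical 3x2 array with one missing value
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0]])

toy_scaler = MinMaxScaler()
X_scaled = toy_scaler.fit_transform(X)            # min/max computed from non-NaN values
X_restored = toy_scaler.inverse_transform(X_scaled)

print(X_scaled)    # the NaN stays NaN; other values land in [0, 1]
print(X_restored)  # original units are recovered
```

This is why the same fitted scaler can later be reused to convert the imputed values back to centimeters.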


# KNN Imputation

Now that the data is preprocessed, we'll begin our experiment! We'll start by imputing each missing value with the mean of its 5 nearest neighbors.

In [10]:
from sklearn.impute import KNNImputer

impute = KNNImputer(n_neighbors = 5)

# Applying to dataframe
knn_iris = pd.DataFrame(impute.fit_transform(scaled_iris), 
                           columns=scaled_iris.columns)

knn_iris

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),setosa,versicolor,virginica
0,0.222222,0.625000,0.067797,0.041667,1.0,0.0,0.0
1,0.166667,0.558333,0.088136,0.041667,1.0,0.0,0.0
2,0.111111,0.500000,0.050847,0.041667,1.0,0.0,0.0
3,0.083333,0.533333,0.084746,0.041667,1.0,0.0,0.0
4,0.194444,0.666667,0.067797,0.041667,1.0,0.0,0.0
...,...,...,...,...,...,...,...
145,0.666667,0.416667,0.711864,0.916667,0.0,0.0,1.0
146,0.555556,0.208333,0.677966,0.750000,0.0,0.0,1.0
147,0.611111,0.416667,0.711864,0.791667,0.0,0.0,1.0
148,0.527778,0.583333,0.745763,0.916667,0.0,0.0,1.0
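
To see what the imputer does on a scale small enough to check by hand, here is a sketch on a made-up four-row array (the values and the choice of `n_neighbors=2` are assumptions for illustration). The missing cell in row 1 is filled with the mean of the second column from the two rows closest to it in the first column:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: row 1 is missing its second feature
X_toy = np.array([[1.0, 2.0],
                  [1.1, np.nan],
                  [0.9, 2.2],
                  [10.0, 20.0]])

toy_imputer = KNNImputer(n_neighbors=2)
X_filled = toy_imputer.fit_transform(X_toy)

# Rows 0 and 2 are the nearest neighbors of row 1, so the imputed value
# is the mean of 2.0 and 2.2; the distant row 3 is not used
print(X_filled[1, 1])  # approximately 2.1
```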


In [11]:
# Inverting Scaling
inverse_knn_iris = pd.DataFrame(scaler.inverse_transform(knn_iris), 
                           columns=knn_iris.columns)
inverse_knn_iris.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),setosa,versicolor,virginica
0,5.1,3.5,1.4,0.2,1.0,0.0,0.0
1,4.9,3.34,1.52,0.2,1.0,0.0,0.0
2,4.7,3.2,1.3,0.2,1.0,0.0,0.0
3,4.6,3.28,1.5,0.2,1.0,0.0,0.0
4,5.0,3.6,1.4,0.2,1.0,0.0,0.0


## Collecting Results

To evaluate the results, we need to index the individual cells that contain NaNs. Let's make a function that automates the process: subsetting the NaN values of each column, compiling them into a list, retrieving the answers, and calculating the RMSE of the KNN estimations.
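
One way such a function might look is sketched below. It is not the notebook's own implementation: it swaps the list-of-indices approach for a boolean mask, and the helper name `rmse_of_imputed_cells` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def rmse_of_imputed_cells(original, with_nans, imputed, columns):
    """Hypothetical helper: RMSE computed only over cells that were NaN."""
    mask = with_nans[columns].isna()                        # True where a value was removed
    errors = (imputed[columns] - original[columns]).where(mask)  # NaN elsewhere
    per_column = np.sqrt((errors ** 2).mean())              # mean() skips NaN by default
    overall = np.sqrt(np.nanmean((errors ** 2).to_numpy()))
    return per_column, overall

# Made-up frames standing in for the control, experimental, and imputed data
original = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
with_nans = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
imputed = pd.DataFrame({'a': [1.0, 2.5, 3.0], 'b': [4.0, 5.0, 5.0]})

per_column, overall = rmse_of_imputed_cells(original, with_nans, imputed, ['a', 'b'])
print(per_column)  # a: 0.5, b: 1.0
print(overall)     # sqrt((0.25 + 1.0) / 2), about 0.79
```

In the experiment itself, the arguments would presumably be `iris_ctrl`, `iris_exp`, and `inverse_knn_iris`, with `features` as the column list.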
