# The Experiment

With our datasets cleaned of all NaN values, we're going to load them and deliberately remove some of the data so we can see how well imputation recovers it. Let's get started!

## Importing Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Preparation

We're going to load up the `iris_cleaned` dataset and designate it as our "control" group.

In [2]:
iris_ctrl = pd.read_csv('datasets/iris/iris_cleaned')
iris_ctrl = iris_ctrl.drop('Unnamed: 0', axis=1)
iris_ctrl.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Next we'll make a copy to act as our "experimental" group.

In [3]:
iris_exp = iris_ctrl.copy()
iris_exp.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Now we'll randomly replace roughly 10% of the values in each feature column with NaN.

In [4]:
# defining feature columns
features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 
            'petal width (cm)']

# Inserting NaN values into Experiment Group

for col in features:
    # Draw 10% of the row count, with replacement, for each column.
    # A row may be drawn more than once, so each column ends up
    # with at most 10% of its values missing.
    iris_exp.loc[iris_exp.sample(frac=0.1, replace=True).index, col] = np.nan

iris_exp.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,,setosa
1,4.9,3.0,,0.2,setosa
2,4.7,3.2,,,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
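To check how much data was actually removed, one option is to count the missing cells per column. The sketch below is self-contained and uses a small stand-in frame (`demo`, with hypothetical columns `a` and `b`) instead of `iris_exp`. Because sampling with replacement can draw the same row twice, each column should end up with at most 15 missing values out of 150, not necessarily exactly 15.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for iris_exp: 150 rows, two numeric columns
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((150, 2)), columns=['a', 'b'])

for seed, col in enumerate(demo.columns):
    # 15 draws with replacement; repeated draws land on the same row
    drawn = demo.sample(frac=0.1, replace=True, random_state=seed).index
    demo.loc[drawn, col] = np.nan

missing_count = demo.isna().sum()      # NaN cells per column
missing_fraction = demo.isna().mean()  # share of NaN cells per column
print(missing_count)
print(missing_fraction)
```

Running the same two summary lines on `iris_exp[features]` would report the real counts for the experimental group.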


In [5]:
# selecting the rows that contain at least one NaN value
nan_rows = iris_exp[iris_exp.isna().any(axis=1)]
nan_rows

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,,setosa
1,4.9,3.0,,0.2,setosa
2,4.7,3.2,,,setosa
5,,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,,setosa
13,4.3,3.0,,0.1,setosa
16,5.4,3.9,,0.4,setosa
17,5.1,3.5,1.4,,setosa
20,,3.4,1.7,0.2,setosa
29,4.7,,1.6,,setosa


## Create an Answer Key

Now that we've replaced roughly 10% of the data with `NaN` values, we'll **use the index of each affected row** to **subset an answer key** from the **control group**. This gives us the true values to measure our results against.

In [6]:
# Creating list of row indices that contain NaN values
null_idx = list(nan_rows.index)

# Creating Answer Key to compare future results against
# (.loc selects by index label, which is what null_idx holds)
answer_key = iris_ctrl.loc[null_idx]
answer_key

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
13,4.3,3.0,1.1,0.1,setosa
16,5.4,3.9,1.3,0.4,setosa
17,5.1,3.5,1.4,0.3,setosa
20,5.4,3.4,1.7,0.2,setosa
29,4.7,3.2,1.6,0.2,setosa
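The answer key above works at the row level. If a cell-level view is useful later, a boolean mask of the removed cells can pull out just the true values that went missing. This is a minimal sketch on a tiny stand-in frame; the names `ctrl`, `exp`, and `removed` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for iris_ctrl and iris_exp
ctrl = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})
exp = ctrl.copy()
exp.loc[1, 'x'] = np.nan
exp.loc[2, 'y'] = np.nan

removed = exp.isna()                 # True where a value was removed
affected_rows = removed.any(axis=1)  # rows with at least one NaN
answer_key = ctrl.loc[affected_rows] # row-level answer key

# Cell-level answer key: only the values that were removed (row-major order)
true_values = ctrl.to_numpy()[removed.to_numpy()]
print(answer_key)
print(true_values)
```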


## One-Hot-Encoding

KNN Imputation fills each missing value with the mean of that feature across the *k* nearest neighbors. Since it relies on distances and *means*, non-numerical values can't be used directly. To fix this, we're going to One Hot Encode our Target variable.

Important Note: Normally, I would Label Encode the Target Variable, since we don't want multiple target variables (e.g. $y_{1}, y_{2}, y_{3} = w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + \dots + b$). However, KNN Imputation requires scaling our data, and keeping everything between 0 and 1 keeps things simple and makes the scaling easier to revert. Therefore, we will make binary dummy columns.
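To make the "mean of the *k* nearest neighbors" idea concrete, here is a small sketch on a toy matrix (the values in `X` are made up for illustration, not taken from the iris data). With `n_neighbors=2`, each missing entry is replaced by the average of that column over the two closest rows, where closeness is measured only on the features both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (hypothetical): 4 rows, 3 features, two missing entries
X = np.array([
    [1.0,    2.0, np.nan],
    [3.0,    4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0,    8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(X)

# Row 2, column 0: nearest rows with a value there are rows 1 and 3 -> (3 + 8) / 2
# Row 0, column 2: nearest rows with a value there are rows 1 and 2 -> (3 + 5) / 2
print(filled)
```

Values that were already present are left untouched; only the `NaN` cells change.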

In [9]:
# Creating Dummy Columns
target_vars = iris_exp['target']
# dtype=int keeps the dummies as 0/1 (newer pandas versions default to booleans)
target_dummies = pd.get_dummies(target_vars, drop_first=False, dtype=int)

# Adding dummies to dataframe
iris_exp2 = pd.concat([iris_exp, target_dummies], axis=1)
iris_exp2 = iris_exp2.drop('target', axis=1)
iris_exp2.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),setosa,versicolor,virginica
0,5.1,3.5,1.4,,1,0,0
1,4.9,3.0,,0.2,1,0,0
2,4.7,3.2,,,1,0,0
3,4.6,3.1,1.5,0.2,1,0,0
4,5.0,3.6,1.4,0.2,1,0,0


In [10]:
iris_exp2.describe()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),setosa,versicolor,virginica
count,137.0,135.0,136.0,136.0,150.0,150.0,150.0
mean,5.811679,3.055556,3.724265,1.222059,0.333333,0.333333,0.333333
std,0.801384,0.447158,1.750788,0.756277,0.472984,0.472984,0.472984
min,4.3,2.0,1.0,0.1,0.0,0.0,0.0
25%,5.1,2.8,1.6,0.3,0.0,0.0,0.0
50%,5.8,3.0,4.2,1.3,0.0,0.0,0.0
75%,6.4,3.35,5.1,1.8,1.0,1.0,1.0
max,7.7,4.4,6.7,2.5,1.0,1.0,1.0


## Scaling

As mentioned earlier, we need to scale our data when we use KNN Imputation. This is a necessary step since KNN Imputation uses the **distance between points** to determine the nearest neighbors, so features with larger numeric ranges would dominate the distance and bias the result.

Normally, on a dataset like this, scaling wouldn't be strictly necessary since the features are on a similar scale, but using the MinMax scaler won't hurt.
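As a quick sanity check on what MinMax scaling does, the sketch below compares `MinMaxScaler` against the formula `(x - min) / (max - min)` on a single toy column. The column `col` is made up for illustration, using 4.3 and 7.7 as the minimum and maximum (the sepal length range reported by `describe()` above). It also shows that `MinMaxScaler` ignores `NaN` when fitting and keeps it in the output, which is why scaling can happen before imputation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column (hypothetical) with one missing value
col = np.array([[4.3], [5.1], [np.nan], [7.7]])

scaled = MinMaxScaler().fit_transform(col)

# Same computation by hand, skipping NaN when finding min and max
col_min, col_max = np.nanmin(col), np.nanmax(col)
manual = (col - col_min) / (col_max - col_min)

print(scaled.ravel())   # 5.1 maps to (5.1 - 4.3) / (7.7 - 4.3)
print(manual.ravel())
```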

In [11]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaled_iris = pd.DataFrame(scaler.fit_transform(iris_exp2), 
                           columns=iris_exp2.columns)
scaled_iris.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),setosa,versicolor,virginica
0,0.235294,0.625,0.070175,,1.0,0.0,0.0
1,0.176471,0.416667,,0.041667,1.0,0.0,0.0
2,0.117647,0.5,,,1.0,0.0,0.0
3,0.088235,0.458333,0.087719,0.041667,1.0,0.0,0.0
4,0.205882,0.666667,0.070175,0.041667,1.0,0.0,0.0


# KNN Imputation

Now that the data is preprocessed, we'll begin our experiment! We'll start by filling each missing value with the mean of its 5 nearest neighbors.

In [12]:
from sklearn.impute import KNNImputer

impute = KNNImputer(n_neighbors=5)

# Applying to dataframe
knn_iris = pd.DataFrame(impute.fit_transform(scaled_iris),
                        columns=scaled_iris.columns)

knn_iris

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),setosa,versicolor,virginica
0,0.235294,0.625000,0.070175,0.075000,1.0,0.0,0.0
1,0.176471,0.416667,0.087719,0.041667,1.0,0.0,0.0
2,0.117647,0.500000,0.091228,0.033333,1.0,0.0,0.0
3,0.088235,0.458333,0.087719,0.041667,1.0,0.0,0.0
4,0.205882,0.666667,0.070175,0.041667,1.0,0.0,0.0
...,...,...,...,...,...,...,...
145,0.705882,0.458333,0.736842,0.916667,0.0,0.0,1.0
146,0.588235,0.208333,0.701754,0.716667,0.0,0.0,1.0
147,0.647059,0.416667,0.736842,0.791667,0.0,0.0,1.0
148,0.558824,0.583333,0.771930,0.916667,0.0,0.0,1.0


In [13]:
# Inverting Scaling
inverse_knn_iris = pd.DataFrame(scaler.inverse_transform(knn_iris),
                                columns=knn_iris.columns)
inverse_knn_iris.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),setosa,versicolor,virginica
0,5.1,3.5,1.4,0.28,1.0,0.0,0.0
1,4.9,3.0,1.5,0.2,1.0,0.0,0.0
2,4.7,3.2,1.52,0.18,1.0,0.0,0.0
3,4.6,3.1,1.5,0.2,1.0,0.0,0.0
4,5.0,3.6,1.4,0.2,1.0,0.0,0.0
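One way the imputed values might be measured against the true ones is to compare only the cells that were removed. The sketch below is an assumption about how that scoring could look, not the notebook's own method. It runs the same scale, impute, and inverse-scale steps on a synthetic stand-in frame (`ctrl`, with hypothetical columns `x` and `y`) and reports the root mean squared error and mean absolute error on the removed cells.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for iris_ctrl: two strongly related numeric columns
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
ctrl = pd.DataFrame({'x': x, 'y': 2 * x + rng.normal(0, 0.1, size=100)})

# Experimental copy with a few known cells removed from column 'y'
exp = ctrl.copy()
exp.loc[[3, 17, 42, 64, 88], 'y'] = np.nan
removed = exp.isna().to_numpy()   # True where a value was removed

# Same pipeline as above: scale, impute, then undo the scaling
scaler = MinMaxScaler()
scaled = scaler.fit_transform(exp)
imputed = scaler.inverse_transform(
    KNNImputer(n_neighbors=5).fit_transform(scaled))

# Score only the removed cells against the control values
errors = imputed[removed] - ctrl.to_numpy()[removed]
rmse = float(np.sqrt(np.mean(errors ** 2)))
mae = float(np.mean(np.abs(errors)))
print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}")
```

In the notebook itself, the equivalent comparison would use `inverse_knn_iris[features]` against `iris_ctrl[features]`, masked by `iris_exp[features].isna()`.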
