# The Experiment

With our datasets now cleaned of all NaN values, we're going to load them and remove data! Let's get started!

## Importing Libraries.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Preparation

We're going to load up the `iris_cleaned` dataset and designate it as our "control" group.

In [2]:
iris_ctrl = pd.read_csv('datasets/iris/iris_cleaned')
iris_ctrl = iris_ctrl.drop('Unnamed: 0', axis=1)
iris_ctrl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


Changing column names for ease of use.

In [3]:
name_change = {'sepal length (cm)':'sepal_length', 
               'sepal width (cm)':'sepal_width', 
               'petal length (cm)':'petal_length',
              'petal width (cm)':'petal_width'}
iris_ctrl = iris_ctrl.rename(name_change, axis=1)
iris_ctrl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   target        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


Next we'll make a copy to act as our "experimental" group.

In [4]:
iris_exp = iris_ctrl.copy()
iris_exp.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Now to randomly replace 10% of data with NaN values.

In [5]:
# defining feature columns
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Inserting NaN values into Experiment Group

for col in iris_exp[features]:
    # Every cell has a 10% chance of being selected
    # Rows may be selected more than once
    iris_exp.loc[iris_exp.sample(frac=0.1, replace=True).index, col] = np.nan

iris_exp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  136 non-null    float64
 1   sepal_width   137 non-null    float64
 2   petal_length  136 non-null    float64
 3   petal_width   136 non-null    float64
 4   target        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [6]:
# obtaining indices of rows with NaN values
nan_cols = iris_exp['sepal_length']
nan_cols = nan_cols[nan_cols.isna()]
nan_cols

19    NaN
33    NaN
50    NaN
53    NaN
54    NaN
59    NaN
60    NaN
71    NaN
73    NaN
103   NaN
113   NaN
114   NaN
122   NaN
130   NaN
Name: sepal_length, dtype: float64

## Create an Answer Key

Now that we've replaced 10% of the data with `NaN` values, we'll **use the index of each affected row** to **subset an answer key** from the **control group** to measure our results against. 

In [7]:
# Creating list of indices 
null_idx = list(nan_cols.index)

# Creating Answer Key to compare future results against
answer_key = iris_ctrl.sepal_length.iloc[null_idx]
answer_key

19     5.1
33     5.5
50     7.0
53     5.5
54     6.5
59     5.2
60     5.0
71     6.1
73     6.1
103    6.3
113    5.7
114    5.8
122    7.7
130    7.4
Name: sepal_length, dtype: float64

## Label Encoding

KNN Imputation works by finding the mean value of *k* nearest neighbors.  Since it calculates the *mean*, non-numerical characters are useless to us. To fix this, we're going to Label Encode our Target variable.  


In [22]:
def cat_codes(df, columns):
    """
    Input: Data frame and list of columns
    Output: Columns converted to categories and assigned cat_codes
    """
    for i in columns:
        df[i] = df[i].astype('category')
        df[i] = df[i].cat.codes

In [24]:
label_encoding = ['target']
cat_codes(iris_exp, label_encoding)
iris_exp.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,,0.2,0


## Scaling
As mentioned earlier, we need to scale our data when we use KNN Imputation.  This is a necessary step since KNN Imputation uses the **distance between points** to determine the nearest neighbors and so larger numbers can create a bias.

Normally, on a dataset like this, scaling wouldn't be necessary since the data are on the same scale, but using the MinMax scaler won't hurt it.

In [25]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaled_iris = pd.DataFrame(scaler.fit_transform(iris_exp), 
                           columns=iris_exp.columns)
scaled_iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,0.222222,0.625,0.067797,0.041667,0.0
1,0.166667,0.416667,0.067797,0.041667,0.0
2,0.111111,0.5,0.050847,0.041667,0.0
3,0.083333,0.458333,0.084746,0.041667,0.0
4,0.194444,0.666667,,0.041667,0.0


# KNN Imputation

Now that the data is preprocessed, we'll begin our experiment!  We'll start by imputing the means of the 5 nearest neighbors.

In [26]:
from sklearn.impute import KNNImputer

impute = KNNImputer(n_neighbors = 5)

# Applying to dataframe
knn_iris = pd.DataFrame(impute.fit_transform(scaled_iris), 
                           columns=scaled_iris.columns)

knn_iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,0.222222,0.625000,0.067797,0.041667,0.0
1,0.166667,0.416667,0.067797,0.041667,0.0
2,0.111111,0.500000,0.050847,0.041667,0.0
3,0.083333,0.458333,0.084746,0.041667,0.0
4,0.194444,0.666667,0.067797,0.041667,0.0
...,...,...,...,...,...
145,0.666667,0.416667,0.711864,0.916667,1.0
146,0.555556,0.208333,0.677966,0.750000,1.0
147,0.611111,0.416667,0.711864,0.791667,1.0
148,0.527778,0.583333,0.789831,0.916667,1.0


In [27]:
# Inverting Scaling
inverse_knn_iris = pd.DataFrame(scaler.inverse_transform(knn_iris), 
                           columns=knn_iris.columns)
inverse_knn_iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0


## Collecting Results

In order to evaluate the results, we're going to need to index the CELLS that contain NaNs.  Let's make a function that will automate the process of subsetting the NaN values of different columns, compiling them into a list, retrieving the answers, and calculate the RMSE of the KNN estimations.  

Honestly, it will be easier if we just make this function run the entire process.

In [21]:
def impute_analysis(df, column=None):
    # Create lists to store actual and imputed values for each columns
    features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

    # create dictionaries for actuals, predictions, and indexes
    actuals = {}
    preds = {}
    index = {}
    for i in range(len(features)):
        combo_names= features[i]
        # Distribute names to separate lists
        actuals[combo_names] = []
        preds[combo_names] = []
        indexes[combo_names] = []

    # Create copy of Dataframe for manipulation
    df_preds = df.copy()
    
    # Preprocess experimental data 
    
    # Add NaNs 
    features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']        
    for col in df_preds[features]:
        df_preds.loc[df_preds.sample(frac=0.1, replace=True).index, col] = np.nan

    # label encode target variable
    label_encoding = ['target']
    cat_codes(df_preds, label_encoding)
    
    # Create Answer Keys
   
    
    # Get indexes of NaN values for each column
    for col in df_preds[features]:
        nan_cols = df_preds[col]
        nan_cols = nan_cols[nan_cols.isna()]

        # Creating list of indices 
        null_idx = list(nan_cols.index)
        index[col] = null_idx
        # Creating Answer Key to compare future results against
        actuals[col] = df[col].iloc[null_idx]
        

        

impute_analysis(inverse_knn_iris, column=None)

{'sepal_length': 19     5.10
 33     4.92
 50     6.26
 53     5.74
 54     6.22
 59     5.66
 60     5.60
 71     5.66
 73     6.34
 103    6.56
 113    6.12
 114    6.62
 122    7.20
 130    6.86
 Name: sepal_length, dtype: float64,
 'sepal_width': 27     3.70
 34     3.44
 37     3.26
 40     3.54
 41     3.50
 47     3.42
 62     2.74
 79     2.40
 89     2.68
 92     2.72
 96     2.72
 97     2.64
 100    3.08
 Name: sepal_width, dtype: float64,
 'petal_length': 4      1.40
 18     1.44
 21     1.46
 26     1.48
 34     1.48
 38     1.34
 44     1.48
 51     4.56
 61     4.58
 62     4.30
 63     4.40
 68     4.36
 90     4.06
 148    5.66
 Name: petal_length, dtype: float64,
 'petal_width': 25     0.16
 40     0.24
 50     1.48
 54     1.28
 58     1.42
 75     1.42
 80     1.12
 89     1.30
 91     1.34
 98     1.12
 100    2.20
 119    1.82
 128    1.84
 129    1.88
 Name: petal_width, dtype: float64}