# The Experiment

With our datasets now cleaned of all NaN values, we're going to load them and deliberately remove data so we can test how well imputation recovers it. Let's get started!

## Importing Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
from functions import cat_codes

## Preprocess

We're going to load up the `iris_cleaned` dataset and designate it as our "control" group.

In [2]:
iris_ctrl = pd.read_csv('datasets/iris/iris_cleaned')
iris_ctrl = iris_ctrl.drop('Unnamed: 0', axis=1)
iris_ctrl.describe()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


Next, we'll rename the columns for ease of use.

In [3]:
name_change = {'sepal length (cm)': 'sepal_length',
               'sepal width (cm)': 'sepal_width',
               'petal length (cm)': 'petal_length',
               'petal width (cm)': 'petal_width'}
iris_ctrl = iris_ctrl.rename(name_change, axis=1)
iris_ctrl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   target        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


Next we'll make a copy to act as our "experimental" group.

In [4]:
iris_exp = iris_ctrl.copy()
iris_exp.head()

,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


### Label Encoding

KNN Imputation fills a missing value with the mean of that feature across the *k* nearest neighbors. Because both the distance calculation and the *mean* require numbers, string labels can't be used as they are. To fix this, we're going to Label Encode our Target variable.


In [5]:
def cat_codes(df, columns):
    """
    Input: Data frame and list of columns
    Output: None; the listed columns are converted in place to
            categories and replaced by their integer cat_codes
    """
    for col in columns:
        df[col] = df[col].astype('category').cat.codes

In [6]:
label_encoding = ['target']
cat_codes(iris_exp, label_encoding)
iris_exp.head()

,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
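
Because `cat_codes` overwrites the labels with integers, it can help to record which code stands for which species. The following is a minimal sketch on a toy column, assuming pandas' default behavior of sorting string categories; `species` and `code_to_label` are illustrative names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the `target` column (illustrative values only)
species = pd.Series(['setosa', 'virginica', 'versicolor', 'setosa'])

# Same conversion cat_codes performs: object -> category -> integer codes
as_category = species.astype('category')
codes = as_category.cat.codes

# Categories are sorted, so a category's position is its code
code_to_label = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(code_to_label)
```

Keeping a lookup like this around makes it possible to translate the encoded target back into species names later.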


### Scaling
As mentioned earlier, we need to scale our data when we use KNN Imputation. This is a necessary step since KNN Imputation uses the **distance between points** to determine the nearest neighbors, so features with larger numeric ranges would dominate that distance and bias the result.

On a dataset like this, scaling matters less because the four measurements are all in centimeters and span similar ranges, but applying the MinMax scaler won't hurt.
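
To see why unscaled features can bias a distance-based method, consider this small sketch. The numbers are made up for illustration (they are not Iris measurements), and `min_max_scale` and `nearest_to_first` are hand-written helpers, with `min_max_scale` standing in for what `MinMaxScaler` does per column by default.

```python
import numpy as np

# Hypothetical data: column 0 spans 0-1, column 1 spans 0-1000
X = np.array([[0.0,    0.0],
              [1.0,   10.0],
              [0.1,  200.0],
              [0.5, 1000.0]])

def min_max_scale(a):
    """Rescale each column to the range [0, 1]."""
    col_min = a.min(axis=0)
    col_max = a.max(axis=0)
    return (a - col_min) / (col_max - col_min)

def nearest_to_first(a):
    """Index of the row closest (Euclidean) to row 0, excluding row 0."""
    dists = np.linalg.norm(a[1:] - a[0], axis=1)
    return int(np.argmin(dists)) + 1

print('nearest before scaling:', nearest_to_first(X))
print('nearest after scaling: ', nearest_to_first(min_max_scale(X)))
```

Before scaling, the large-range column decides the neighbor on its own (row 1). After scaling, both columns contribute comparably and row 2 becomes the nearest.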

In [7]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaled_iris = pd.DataFrame(scaler.fit_transform(iris_exp), 
                           columns=iris_exp.columns)
scaled_iris.head()

,sepal_length,sepal_width,petal_length,petal_width,target
0,0.222222,0.625,0.067797,0.041667,0.0
1,0.166667,0.416667,0.067797,0.041667,0.0
2,0.111111,0.5,0.050847,0.041667,0.0
3,0.083333,0.458333,0.084746,0.041667,0.0
4,0.194444,0.666667,0.067797,0.041667,0.0


In [8]:
# Store copy of preprocessed dataframe for future iterations
iris_exp_copy = scaled_iris.copy()

## Remove Data
Now we'll randomly replace roughly 10% of the feature values with NaN values.

In [9]:
# Set seed for reproducibility
np.random.seed(42)

# defining feature columns
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Inserting NaN values into Experiment Group
for col in features:
    # Draw 10% of the row labels (15 draws) for this column.
    # Draws are made with replacement, so a row may be drawn more than
    # once and slightly fewer than 15 cells may end up as NaN.
    drop_idx = scaled_iris.sample(frac=0.1, replace=True).index
    scaled_iris.loc[drop_idx, col] = np.nan

scaled_iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  136 non-null    float64
 1   sepal_width   135 non-null    float64
 2   petal_length  135 non-null    float64
 3   petal_width   136 non-null    float64
 4   target        150 non-null    float64
dtypes: float64(5)
memory usage: 6.0 KB
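
The columns above end up with 14 or 15 missing values rather than exactly 15 because the row labels are drawn with replacement, so the same row can be drawn twice. The sketch below illustrates the difference on a stand-in frame; `df`, the column name `x`, and the `random_state` value are arbitrary choices for the example, not part of the experiment.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same number of rows as the Iris data
df = pd.DataFrame({'x': np.arange(150, dtype=float)})

# With replacement (as in the cell above): 15 draws, repeats possible
with_repl = df.sample(frac=0.1, replace=True, random_state=0).index
n_unique_with = with_repl.nunique()

# Without replacement: 15 draws, all distinct
without_repl = df.sample(frac=0.1, replace=False, random_state=0).index
n_unique_without = without_repl.nunique()

print(f'draws: {len(with_repl)}, unique rows hit: {n_unique_with}')
print(f'draws: {len(without_repl)}, unique rows hit: {n_unique_without}')
```

Sampling without replacement would give exactly 15 NaNs per column; sampling with replacement gives at most 15.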


In [10]:
# Total number of feature values 
num_vals = len(scaled_iris.index) * len(features)
print(f'The dataset (without target) has a total of {num_vals} values')

# Calculate number of NaNs
num_nan = scaled_iris.isna().sum().sum()
print(f'There are {num_nan} NaN values')

# Percent of missing values
percent_nan = (num_nan / num_vals) * 100
print(f'{round(percent_nan, 2)}% of the dataset is missing')

# Calculate number of rows with missing values

# subsetting the rows that have a NaN in at least one feature column
nan_cols = scaled_iris[features]
nan_cols = nan_cols[nan_cols.isna().any(axis=1)]
nan_rows = len(nan_cols.index)
print(f'There are {nan_rows} rows with missing values')

# Percentage of entries with missing data
total_missing = (nan_rows / len(scaled_iris.index)) * 100
print(f'{round(total_missing, 2)}% of the rows contain missing values')


The dataset (without target) has a total of 600 values
There are 58 NaN values
9.67% of the dataset is missing
There are 46 rows with missing values
30.67% of the rows contain missing values


## Create an Answer Key

We've replaced roughly 10% of the data with `NaN` values, and about 31% of the rows have been affected. Now we'll **subset an answer key** from the **control group**, **using the index of each affected row**, to measure our results against.

In [11]:
# Creating list of indices 
null_idx = list(nan_cols.index)

# Creating Answer Key to compare future results against
# (null_idx holds index labels, so select with .loc)
answer_key = iris_ctrl.loc[null_idx]
answer_key

,sepal_length,sepal_width,petal_length,petal_width,target
1,4.9,3.0,1.4,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
13,4.3,3.0,1.1,0.1,setosa
14,5.8,4.0,1.2,0.2,setosa
17,5.1,3.5,1.4,0.3,setosa
20,5.4,3.4,1.7,0.2,setosa
21,5.1,3.7,1.5,0.4,setosa
34,4.9,3.1,1.5,0.2,setosa


# KNN Imputation

Now that the data is preprocessed, we'll begin our experiment! We'll start by imputing each missing value with the mean of its 5 nearest neighbors.
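
To make "the mean of the nearest neighbors" concrete, here is a hand-rolled sketch on toy numbers. It is not scikit-learn's implementation: `knn_impute_one` is a hypothetical helper that assumes every donor row is complete and measures Euclidean distance only over the features the incomplete row actually has.

```python
import numpy as np

def knn_impute_one(row, donors, k):
    """Fill NaNs in `row` with the mean of its k nearest complete donors.

    Distance is Euclidean over the features that `row` has observed.
    """
    row = np.asarray(row, dtype=float)
    donors = np.asarray(donors, dtype=float)
    observed = ~np.isnan(row)

    # Distance from `row` to every donor, using observed features only
    dists = np.linalg.norm(donors[:, observed] - row[observed], axis=1)

    # Indices of the k closest donors
    nearest = np.argsort(dists)[:k]

    # Replace each missing feature with the neighbors' mean for it
    filled = row.copy()
    filled[~observed] = donors[nearest][:, ~observed].mean(axis=0)
    return filled

# Toy data (illustrative values, already on a 0-1 scale)
donors = np.array([[0.1, 0.2],
                   [0.2, 0.4],
                   [0.9, 1.0]])
incomplete = [0.15, np.nan]

filled_row = knn_impute_one(incomplete, donors, k=2)
print(filled_row)
```

The two closest donors are the first two rows, so the missing second feature is filled with the mean of 0.2 and 0.4, which is 0.3.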

In [12]:
from sklearn.impute import KNNImputer

impute = KNNImputer(n_neighbors=5)

# Applying to dataframe
knn_iris = pd.DataFrame(impute.fit_transform(scaled_iris),
                        columns=scaled_iris.columns)

knn_iris

,sepal_length,sepal_width,petal_length,petal_width,target
0,0.222222,0.625000,0.067797,0.041667,0.0
1,0.166667,0.458333,0.067797,0.033333,0.0
2,0.111111,0.500000,0.050847,0.041667,0.0
3,0.083333,0.458333,0.084746,0.041667,0.0
4,0.194444,0.666667,0.067797,0.041667,0.0
...,...,...,...,...,...
145,0.666667,0.416667,0.711864,0.916667,1.0
146,0.555556,0.208333,0.677966,0.750000,1.0
147,0.611111,0.416667,0.711864,0.791667,1.0
148,0.527778,0.583333,0.745763,0.916667,1.0


In [13]:
# Inverting Scaling
inverse_knn_iris = pd.DataFrame(scaler.inverse_transform(knn_iris),
                                columns=knn_iris.columns)
inverse_knn_iris.head()

,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.1,1.4,0.18,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0


## Collecting Results

To evaluate the results, we need to index the individual cells that contained NaNs. We could write a function that automates subsetting the NaN values of each column, compiling them into a list, retrieving the answers, and calculating the RMSE of the KNN estimates.

Honestly, it will be easier if that function runs the entire process, so we'll walk through the steps by hand first.
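
One way to index the individual cells, sketched here on a toy frame, is to record a boolean mask of the NaN positions before imputing and then use it to pull the matching cells out of both the true and the imputed data. The names `truth`, `holes`, and `imputed`, and all of the values, are made up for illustration.

```python
import numpy as np
import pandas as pd

truth = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

# Knock out two cells, then remember exactly where the holes are
holes = truth.copy()
holes.loc[0, 'a'] = np.nan
holes.loc[2, 'b'] = np.nan
mask = holes.isna().to_numpy()

# Pretend these values came back from an imputer (illustrative only)
imputed = holes.fillna({'a': 1.5, 'b': 6.0})

# Boolean indexing keeps only the cells that were imputed
errors = truth.to_numpy()[mask] - imputed.to_numpy()[mask]

n_imputed = int(mask.sum())
n_perfect = int(np.isclose(errors, 0).sum())
print(n_imputed, n_perfect, errors)
```

With a mask like this, the cells that were never removed are left out of the error calculation entirely, so they cannot be mistaken for perfect imputations.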

In [14]:
# Subsetting data to match that of our answer key
test_iris = inverse_knn_iris.iloc[null_idx]
test_iris.head()

,sepal_length,sepal_width,petal_length,petal_width,target
1,4.9,3.1,1.4,0.18,0.0
3,4.6,3.1,1.5,0.2,0.0
7,5.0,3.4,1.5,0.26,0.0
8,4.4,2.9,1.34,0.2,0.0
13,4.3,3.0,1.34,0.1,0.0


In [15]:
# Resetting indexes of test_iris and answer_key for iteration
test_iris = test_iris.reset_index()
test_iris.drop(['index', 'target'], axis=1, inplace=True)
answer_key = answer_key.reset_index()
answer_key.drop(['index', 'target'], axis=1, inplace=True)

# Calculate results
results = pd.DataFrame((round((answer_key - test_iris), 3)))

results.head()

,sepal_length,sepal_width,petal_length,petal_width
0,0.0,-0.1,0.0,0.02
1,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,-0.06
3,0.0,0.0,0.06,0.0
4,0.0,-0.0,-0.24,0.0


Now that we have a dataframe with the rows that had data imputed, we need to measure how accurate the imputations were.

To do this we first need to find out if there happen to be any perfect imputations (i.e., `actual values - predictions == 0`). We'll count the imperfect imputations and subtract that count from the total number of values that were imputed (58, as calculated earlier).

In [16]:
# Imputes where y - y_hat != 0
imperfect_imputes = 0

for col in results.columns:
    for i in range(len(results)):
        # 0.0 == -0.0 in Python, so a single comparison covers both
        if results[col][i] != 0:
            imperfect_imputes += 1

# We know from earlier that there were 58 total NaN values
total_imputes = 58

# Imputes where y - y_hat == 0
perfect_imputes = total_imputes - imperfect_imputes

print(f'Total Values Imputed: {total_imputes}')
print(f'Imperfect Imputations: {imperfect_imputes}')
print(f'Perfect Imputations: {perfect_imputes}')

Total Values Imputed: 58
Imperfect Imputations: 57
Perfect Imputations: 1


So we did end up with 1 estimation that matched the true value to three decimal places! That's pretty impressive!

Now we'll find out the RMSE. To do this we'll create a loop that collects all of the errors from the imputed values and squares them. From there, the squared errors are added together and divided by the number of imputed values to get the mean, and then we take the square root of that mean.
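
The same calculation can be written compactly. This is a minimal sketch with made-up errors (not this notebook's results), and `rmse_from_errors` is an illustrative helper rather than something defined in `functions.py`.

```python
import numpy as np

def rmse_from_errors(errors):
    """Root mean squared error: sqrt(mean(error ** 2))."""
    errors = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Four hypothetical imputation errors (actual - imputed)
example_errors = [0.0, -0.1, 0.3, 0.2]
print(rmse_from_errors(example_errors))
```

A zero error adds nothing to the sum of squares but still counts toward the number of values, which is why the mean is taken over every imputed value and not only the imperfect ones.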

In [17]:
squared_terms = []
for col in results.columns:
    for i in range(len(results)):
        # Only nonzero errors are collected; zeros would add nothing to the sum
        if results[col][i] != 0:
            error = results[col][i]
            squared_error = error**2
            squared_terms.append(squared_error)

# Divide by all 58 imputed values, including the perfect one
n = 58
sum_sqr_err = sum(squared_terms)
mse = sum_sqr_err/n
rmse = np.sqrt(mse)

print(f'RMSE for KNN Imputation on Iris dataset is {rmse}')

RMSE for KNN Imputation on Iris dataset is 0.4271315714277568


The RMSE when 10% of the data is missing is about 0.427. Even though only about 10% of the values were missing, roughly 31% of the rows were affected, which is relatively close to some real-world datasets. How much will this change if we increase the amount of missing data?

# 20% Missing Data
We stored a copy of the preprocessed dataframe (`iris_exp_copy`) earlier, so a clean version of the preprocessed dataset is always available. Generalizing the code above, we can automate this process with the `knn_continuous` function stored in the file `functions.py`.

In [18]:
from functions import knn_continuous

knn_continuous(df=iris_exp, target="target", frac_nan=0.2, n_neighbors=5, 
               seed=7)

The dataset (without target) has a total of 600 values
There are 109 NaN values
18.17% of the dataset is missing
There are 78 rows with missing values
52.0% of the rows contain missing values


-----------------------------------------------------------------


Total Values Imputed: 109
Imperfect Imputations: 104
Perfect Imputations: 5


-----------------------------------------------------------------


RMSE for KNN Imputation on dataset is 0.34823275986884333
