# The Experiment

With our datasets now cleaned of all NaN values, we're going to load them and deliberately remove some of the data. Let's get started!

## Importing Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Preparation

We're going to load up the `iris_cleaned` dataset and designate it as our "control" group.

In [2]:
iris_ctrl = pd.read_csv('datasets/iris/iris_cleaned')
iris_ctrl = iris_ctrl.drop('Unnamed: 0', axis=1)
iris_ctrl.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Next we'll make a copy to act as our "experimental" group.

In [3]:
iris_exp = iris_ctrl.copy()
iris_exp.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
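
The reason for calling `.copy()` rather than using plain assignment is that assignment only creates a second name for the same DataFrame, so any NaN values inserted later would also appear in the control group. A minimal sketch on a toy frame (the names `ctrl`, `alias`, and `exp` are illustrative and not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the control group (illustrative values only)
ctrl = pd.DataFrame({'a': [1.0, 2.0, 3.0]})

alias = ctrl        # plain assignment: same object under a second name
exp = ctrl.copy()   # independent copy (pandas copies the data by default)

# Modify only the copy
exp.loc[0, 'a'] = np.nan

print(alias is ctrl)            # the alias is the very same object
print(ctrl['a'].isna().sum())   # number of NaN values in the control
print(exp['a'].isna().sum())    # number of NaN values in the copy
```

Here the control frame keeps all of its values while the copy has one missing cell, which is the separation the experiment relies on.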


Now we'll randomly replace roughly 10% of the values in each feature column with `NaN`.

In [4]:
# Defining feature columns
features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 
            'petal width (cm)']

# Inserting NaN values into the experimental group

for col in features:
    # Draw 10% of the row count with replacement, so a row may be drawn
    # more than once and a column can end up with slightly under 10% NaN
    iris_exp.loc[iris_exp.sample(frac=0.1, replace=True).index, col] = np.nan

iris_exp.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,,setosa
3,4.6,3.1,1.5,,setosa
4,5.0,3.6,1.4,0.2,setosa
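
Because the rows are sampled with replacement, repeated draws land on the same cell, so a column may end up with slightly fewer than 10% missing values. The sketch below checks this on a synthetic 150-row frame rather than the actual iris data, and contrasts it with sampling without replacement. The column names `x` and `y`, the frame names, and the seeds are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic 150-row stand-in for the experimental group (not the iris data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(150, 2)), columns=['x', 'y'])

with_repl = demo.copy()
without_repl = demo.copy()

for seed, col in enumerate(demo.columns):
    # 15 draws with replacement: duplicate rows are possible
    idx = with_repl.sample(frac=0.1, replace=True, random_state=seed).index
    with_repl.loc[idx, col] = np.nan

    # Variant: 15 distinct rows, so exactly 10% of the column is blanked
    idx = without_repl.sample(frac=0.1, replace=False, random_state=seed).index
    without_repl.loc[idx, col] = np.nan

# NaN counts per column: at most 15 with replacement, exactly 15 without
print(with_repl.isna().sum())
print(without_repl.isna().sum())
```

If an exact 10% per column matters for the comparison, `replace=False` is the simpler choice. With replacement, the share is only approximate.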


In [15]:
# Selecting the rows that contain at least one NaN value
nan_rows = iris_exp[iris_exp.isna().any(axis=1)]
nan_rows

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
1,,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,,setosa
3,4.6,3.1,1.5,,setosa
7,,3.4,1.5,0.2,setosa
14,,4.0,1.2,0.2,setosa
16,,3.9,,0.4,setosa
17,5.1,3.5,1.4,,setosa
20,,3.4,1.7,0.2,setosa
23,5.1,3.3,1.7,,setosa
25,5.0,,1.6,0.2,setosa
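
To see how the row selection works, it may help to look at the intermediate mask: `isna()` marks each missing cell, and `any(axis=1)` collapses that to one boolean per row. A small sketch on a toy frame (the frame `toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: rows 0 and 1 each contain at least one NaN, row 2 has none
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': [np.nan, np.nan, 6.0]})

# One boolean per row: True if any cell in that row is missing
row_mask = toy.isna().any(axis=1)

print(row_mask.tolist())            # [True, True, False]
print(list(toy[row_mask].index))    # [0, 1]
```

A row with several missing cells still appears only once, which is why the number of affected rows can be smaller than the number of missing cells.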


## Create an Answer Key

Now that we've replaced roughly 10% of each feature column with `NaN` values, we'll **use the index of each affected row** to **subset an answer key** from the **control group** to measure our results against.

In [18]:
# Creating list of index labels for the affected rows
null_idx = list(nan_rows.index)

# Creating Answer Key to compare future results against
# (.loc selects by label, matching the labels stored in null_idx)
answer_key = iris_ctrl.loc[null_idx]
answer_key

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
7,5.0,3.4,1.5,0.2,setosa
14,5.8,4.0,1.2,0.2,setosa
16,5.4,3.9,1.3,0.4,setosa
17,5.1,3.5,1.4,0.3,setosa
20,5.4,3.4,1.7,0.2,setosa
23,5.1,3.3,1.7,0.5,setosa
25,5.0,3.0,1.6,0.2,setosa
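
One way the answer key might be used later is to fill the missing cells by some method and then compare only the blanked-out cells against the control values. The sketch below assumes column-mean imputation and mean absolute error purely for illustration. Neither choice comes from this notebook, and the toy frame stands in for the iris data.

```python
import numpy as np
import pandas as pd

# Toy control group and an experimental copy with two blanked cells
ctrl = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                     'b': [10.0, 20.0, 30.0, 40.0]})
exp = ctrl.copy()
exp.loc[1, 'a'] = np.nan
exp.loc[3, 'b'] = np.nan

# Remember which cells were removed, and build the answer key by label
mask = exp.isna()
null_idx = exp[mask.any(axis=1)].index
answer_key = ctrl.loc[null_idx]

# Assumed imputation method: fill each column with its own mean
filled = exp.fillna(exp.mean())

# Score only the cells that were originally blanked out
abs_err = (filled - ctrl).abs()
mae = abs_err.to_numpy()[mask.to_numpy()].mean()

print(answer_key)
print(mae)
```

Restricting the error to the masked cells keeps the untouched values, which trivially match the control, from diluting the score.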
