# Obtaining data

In this notebook, we're going to load a few different datasets, inspect them, and do any necessary cleaning so that we have clean datasets against which to measure the effectiveness of KNN Imputation.

Our datasets will include:

* The Iris Dataset
* The Titanic Dataset 
* The King County Housing Dataset
* The Terry Stops Dataset

Let's load our necessary libraries.

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Inspecting and Scrubbing

## Iris Dataset

The Iris Dataset is a clean, curated dataset.  The features are all continuous variables, while the target is divided into 3 classes.  The classes are evenly balanced, and the target variable comes out of the box with numeric class labels (0, 1, 2).  We're going to take a step back and relabel the target classes with their species names for future interpretability.
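The claims above are easy to check directly. The following is a minimal sketch, assuming scikit-learn 0.23 or later for the `as_frame=True` option; the name `iris_check` is only illustrative and is not used elsewhere in this notebook.

```python
from sklearn import datasets

# as_frame=True returns the features and target together as a pandas DataFrame
iris_check = datasets.load_iris(as_frame=True).frame

# dtype of every feature column (continuous features should be floats)
feature_dtypes = iris_check.drop(columns='target').dtypes

# Total number of missing values across the whole frame
n_missing = int(iris_check.isna().sum().sum())

# Number of rows per class in the target
class_counts = iris_check['target'].value_counts().sort_index()

print(feature_dtypes)
print(n_missing)
print(class_counts)
```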

In [2]:
from sklearn import datasets

# loading dataset
data = datasets.load_iris()

# Concatenating the features and target into a single DataFrame
# np.c_ stacks the feature matrix and the target vector column-wise
iris_df = pd.DataFrame(data=np.c_[data['data'], data['target']], 
                                  columns=data['feature_names'] + ['target'])
iris_df['target'] = iris_df.target.astype(int)
iris_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [3]:
# Mapping the numeric class labels back to species names.
# A single vectorized .map() replaces the row-by-row loop, whose chained
# assignment (iris_df['target'][i] = ...) raised SettingWithCopyWarning.
label_map = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
iris_df['target'] = iris_df['target'].map(label_map)
iris_df.target.value_counts()


versicolor    50
virginica     50
setosa        50
Name: target, dtype: int64

In [4]:
iris_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [5]:
# exporting to csv
iris_df.to_csv('datasets/iris/iris_cleaned')
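One detail worth keeping in mind with `to_csv`: by default it also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below illustrates this with an in-memory buffer, assuming the cleaned file is later read back with `pd.read_csv`; the two-row `toy` frame is hypothetical and only stands in for `iris_df`.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for iris_df
toy = pd.DataFrame({'petal width (cm)': [0.2, 1.3],
                    'target': ['setosa', 'versicolor']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to pass `index=False` in the export above depends on how the downstream notebooks load the file, so the original call is left unchanged here.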

## Titanic Dataset
This is a classic dataset that is also used for classification problems.  It will be a gentle introduction to applying KNN Imputation to a slightly larger dataset that contains categorical, continuous, and discrete variables.
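To give a rough idea of what KNN Imputation does with mixed variable types, here is a minimal sketch using scikit-learn's `KNNImputer`. The toy rows, the 0/1 encoding of `Sex`, and the choice of `n_neighbors=2` are all assumptions made for illustration; they are not the settings used later in this project.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy rows, not taken from the Titanic file
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Pclass': [3, 1, 3, 1],
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Fare': [7.25, 71.28, 7.92, np.nan],
})

# KNNImputer only accepts numeric input, so encode the categorical column first
toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})

# Each missing value is replaced by the mean of that feature over the
# n_neighbors nearest rows that have the feature observed
imputer = KNNImputer(n_neighbors=2)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```

Because the neighbors are found by distance, features on very different scales (such as `Fare` versus an encoded `Sex`) can dominate the result, so scaling the features beforehand is commonly recommended.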

In [6]:
titanic = pd.read_csv('datasets/titanicdataset/full.csv')
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Embarked,WikiId,Name_wiki,Age_wiki,Hometown,Boarded,Destination,Lifeboat,Body,Class
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,S,691.0,"Braund, Mr. Owen Harris",22.0,"Bridgerule, Devon, England",Southampton,"Qu'Appelle Valley, Saskatchewan, Canada",,,3.0
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,C,90.0,"Cumings, Mrs. Florence Briggs (née Thayer)",35.0,"New York, New York, US",Cherbourg,"New York, New York, US",4,,1.0
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,...,S,865.0,"Heikkinen, Miss Laina",26.0,"Jyväskylä, Finland",Southampton,New York City,14?,,3.0
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,...,S,127.0,"Futrelle, Mrs. Lily May (née Peel)",35.0,"Scituate, Massachusetts, US",Southampton,"Scituate, Massachusetts, US",D,,1.0
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,...,S,627.0,"Allen, Mr. William Henry",35.0,"Birmingham, West Midlands, England",Southampton,New York City,,,3.0


In [7]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
 12  WikiId       1304 non-null   float64
 13  Name_wiki    1304 non-null   object 
 14  Age_wiki     1302 non-null   float64
 15  Hometown     1304 non-null   object 
 16  Boarded      1304 non-null   object 
 17  Destination  1304 non-null   object 
 18  Lifeboat     502 non-null    object 
 19  Body  

Since the target variable is the column `Survived`, any rows where a passenger's survival is unknown **can't be used**.
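In the `info()` output above, only 891 of the 1309 rows have a non-null `Survived` value. One way to keep just the labeled rows is `dropna` with a `subset`, sketched here on a hypothetical toy frame (the names `toy` and `labeled` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the Titanic frame
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived': [0.0, 1.0, np.nan, np.nan],
})

# Keep only rows with a known target; the original frame is left untouched
labeled = toy.dropna(subset=['Survived']).reset_index(drop=True)

print(len(toy), len(labeled))
```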

In [8]:
deceased = titanic[['Survived', 'Body']]
deceased.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    float64
 1   Body      130 non-null    object 
dtypes: float64(1), object(1)
memory usage: 20.6+ KB


### Body
The `Body` column records which deceased passengers' bodies were recovered and by which ship (Wikipedia).  A passenger who survived has no body record, so any NaN value where `Survived` == `1.0` can be replaced with an explicit label.

In [17]:
# Survivors have no body record, so label them explicitly.
# Using .loc avoids the chained assignment that raised SettingWithCopyWarning.
titanic.loc[titanic['Survived'] == 1.0, 'Body'] = 'Not Applicable'

# Remaining NaNs (including rows where Survived is unknown) are labeled 'Lost at Sea'
titanic['Body'] = titanic['Body'].fillna('Lost at Sea')
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
 12  WikiId       1304 non-null   float64
 13  Name_wiki    1304 non-null   object 
 14  Age_wiki     1302 non-null   float64
 15  Hometown     1304 non-null   object 
 16  Boarded      1304 non-null   object 
 17  Destination  1304 non-null   object 
 18  Lifeboat     502 non-null    object 
 19  Body  


In [19]:
deceased = titanic[['Survived', 'Body']]
deceased.Body.value_counts()

Lost at Sea       837
Not Applicable    342
176MB               1
1{?}MB[86][87]      1
19MB                1
                 ... 
187MB               1
17MB                1
50MB                1
209MB               1
47MB                1
Name: Body, Length: 132, dtype: int64
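Some of the remaining `Body` values, such as `1{?}MB[86][87]`, carry bracketed numbers that look like footnote markers left over from the Wikipedia source. That reading is an assumption, but if it holds, they could be stripped with a regular expression. A minimal sketch on three sample values copied from the output above:

```python
import pandas as pd

# Sample values copied from the value_counts output above
body = pd.Series(['176MB', '1{?}MB[86][87]', 'Lost at Sea'])

# Remove bracketed numeric markers such as [86]; everything else is left as-is
cleaned = body.str.replace(r'\[\d+\]', '', regex=True)

print(cleaned.tolist())
```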

In [11]:
# Dropping rows where 'Survived' is NaN (currently disabled)
# titanic = titanic[titanic['Survived'].notna()]
# titanic.info()

`Body` details 