#### Bookkeeping

In [14]:
# Adding a folder full of useful code into the sys.path
import os
import sys
base_path = os.getcwd()
lib_path = os.path.join(base_path,"Useful_Code")
sys.path.append(lib_path)
print base_path
print lib_path

# Relevant imports
import numpy as np
import pandas as pd
import sklearn
from sklearn import datasets
import scipy
import matplotlib.pyplot as plt
import Swiss

C:\Users\Juan\Machine Learning\Week 3
C:\Users\Juan\Machine Learning\Week 3\Useful_Code


Hi this notebook is heavily adopted from:
"Python Machine Learning 2nd Edition by Sebastian Raschka, Packt Publishing Ltd. 2017

Code Repository: https://github.com/rasbt/python-machine-learning-book-2nd-edition

Code License: MIT License"
by Juan Bohorquez, no licenses just me.

Additional resources:


# Dealing With Missing Data

Sometimes your dataset will have holes in it. This is obviously bad. Here we'll explore a few methods that are useful when dealing with broken data.

## But first a word about dataframes and csv files

Throughout the book Raschka uses dataframes as the standard datastructure. Here he presents two ways we might create a dataframe:
1. Importing a .csv, this is convenient as a good fraction of datasets are available originally as .csvs
2. Building a dataframe object using a constructor
I'll show examples of doing both

### 1. Building a dataframe from a .csv file.
I'll use Greg's FIFA data file as an example of how pandas imports .csvs

In [15]:
dfcsv = pd.read_csv("FIFA_Data\\FIFA_WorkRate_Analysis_195.csv",sep="\t")
dfcsv

Unnamed: 0,Attacking_Work_Rate,Marking,Aggression,Reactions,Speed,Stamina
0,High,22,63,96,92,92
1,Medium,13,48,95,87,74
2,High,21,56,88,90,79
3,High,30,78,93,77,89
4,Medium,10,29,85,61,44
5,Medium,13,38,88,56,25
6,High,25,80,88,82,79
7,High,51,65,87,95,78
8,Medium,15,84,85,74,75
9,Medium,11,23,81,52,38


### 2. Building a dataframe from an existing dataset
Here I'll use truncated data from the iris dataset in sklearn.
the DataFrame constructor allows you to input data in a variety of different ways. Here I just demonstrate how to do this using a 2D numpy array of data along with an array of feature names.

More info on the DataFrame class, and some examples, can be found here:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [22]:
iris = datasets.load_iris()
iris_data = iris['data']
df = pd.DataFrame(data = data[:5], columns = iris['feature_names'])
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


### Writing data to a .csv
Here I'll write to a .csv with tab separators

Further documentation and examples are available here:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

In [23]:
data_path = os.path.join(base_path,"Iris_Data\iris.csv")
df.to_csv(path_or_buf=data_path,sep='\t')

## Now lets poke some holes in the data

I've made a file called Swiss with one method (so far) that randomly deletes entries in a 2D numpy array. Docstring:  

Swiss.holes(): a function that, pseudo-randomly, pokes holes into a dataset.  
Parameters:  
* data - a 2D numpy array of data in the form of Scikit-learn datasets. ie: data.size = ["Number of Samples","Number of Features"]  
* ratio - desired ratio of deleted datapoints to untouched datapoints  
* seed -  a seed to be fed into the random number generator for repeatability  

Return:  
* data - the input numpy array with random entries removed

In [24]:
holy_data = Swiss.holes(iris_data,ratio = 0.1,seed = 599594)
holy_df = pd.DataFrame(data = holy_data,columns = iris['feature_names'])
holy_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1


Now we have data with holes in it. We can't exactly plug this data into a ML algorithm so we'll have to get rid of the holes in our data.
Raschka presents two methods for dealing with problem data:

## Method 1: Deleting problem data
pandas has methods built in that indicate if a particular index contains Nan, and then has methods which can be used to delete the identified rows or columns, here are a bunch of examples of how you might delete problem data.

Deleting problem data is often the best way to go if you can afford to lose some data. Imputing data can introduce noise into your model and reduce its efficacy.

In [26]:
# removing rows that contain missing values
drop_rows = holy_df.dropna(axis=0)
# removing columns that contain missing values
drop_columns = holy_df.dropna(axis=1)
# drop rows where all columns are NaN
no_empty = holy_df.dropna(how='all')
# drop rows that have less than 3 real values
thresh = holy_df.dropna(thresh=4)
# only drop rows where NaN appears in a given column
# colum values = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)','petal width (cm)']
given_column = holy_df.dropna(subset = ['sepal length (cm)'])

#Uncomment lines below to check out datasets treated in a given way

new_iris = drop_rows
#new_iris = drop_columns
#new_iris = no_empty
#new_iris = thresh
#new_iris = given_column

new_iris

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
1,4.9,3.0,1.4,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1
10,5.4,3.7,1.5,0.2
11,4.8,3.4,1.6,0.2
13,4.3,3.0,1.1,0.1


Above we show the new dataframe after removing all samples with missing data. Note that with a roughly 10% data error rate we lose 51 out of 150 samples, thats about 33%!


## Method 2: Imputing missing values
deleting data has obvious drawbacks. If you delete too much data might not have enough training data, and deleting features might make you lose relevant fitting data.
Another method is to fill in these gaps. Commonly by taking the mean value of the data in that column.

The book details using the built in Imputer class as we do below. Common implementations use the mean value for a feature with missing data to fill in the gaps. Other strategies, such as regression are also viable but can be computationally costly. Below we implement mean imputing.

In [31]:
from sklearn.preprocessing import Imputer

imr = Imputer(missing_values='NaN', strategy = 'mean', axis = 0)
imr = imr.fit(holy_df.values)
imputed_iris = imr.transform(holy_df.values)
imputed_iris
patchy_iris = pd.DataFrame(data = imputed_iris,columns = iris['feature_names'])
patchy_iris

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.100000,3.50000,3.766418,0.200000
1,4.900000,3.00000,1.400000,0.200000
2,4.700000,3.20000,3.766418,0.200000
3,4.600000,3.10000,1.500000,0.200000
4,5.000000,3.60000,1.400000,0.200000
5,5.800752,3.90000,1.700000,0.400000
6,4.600000,3.40000,1.400000,0.300000
7,5.000000,3.40000,1.500000,0.200000
8,4.400000,2.90000,1.400000,0.200000
9,4.900000,3.10000,1.500000,0.100000
