# Python Machine Learning: Preprocessing

Preprocessing is an essential step in the machine learning workflow, and it strongly affects how well models perform. This notebook introduces the major preprocessing steps for machine learning.


## Load Data

For today, we will be working with the `penguins` data set. This data set is from [Kaggle](https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris) and includes penguins of three different species, the island where each penguin was found, and several measurements for each penguin.

First, let's import some packages we'll need.

In [None]:
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Now, let's load in the data from the `data` subfolder of this directory.

**Question:** How many columns are there in this data set? How many rows?

In [None]:
data = pd.read_csv('../data/penguins.csv')
data

Below is the information for each of the columns:
1. **species**: Species of penguin [Adelie, Chinstrap, Gentoo]
2. **island**: Island where the penguin was found [Torgersen, Biscoe, Dream]
3. **culmen_length_mm**: Length of upper part of penguin's bill (millimeters)
4. **culmen_depth_mm**: Height of upper part of bill (millimeters)
5. **flipper_length_mm**: Length of penguin flipper (millimeters)
6. **body_mass_g**: Body mass of the penguin (grams)
7. **sex**: Biological sex of the penguin [MALE, FEMALE]


**Question:** Which of the columns are continuous? Which are categorical?


We will need to treat the numeric and categorical data differently in preprocessing.


## Missing Data Preprocessing

First, let's check to see if there are any missing values in the data set. Missing values are represented by `NaN`. 

**Question:** In this case, what do missing values stand for?

In [None]:
data.isnull().sum()

It is also possible to have missing values that are not stored as `NaN`. For example, let's take a look at the `sex` column.

In [None]:
data['sex'].unique()

In this case, the `.` represents a missing value, so let's replace those with `np.nan` objects.

In [None]:
data.replace('.', np.nan, inplace=True)

data['sex'].unique()
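
To see why the placeholder matters, here is a minimal sketch on a toy `DataFrame` (the values are made up for illustration and are not taken from the penguins file). It shows that `isnull()` only counts true `NaN` values until the placeholder has been replaced.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): '.' stands in for a missing entry
toy = pd.DataFrame({
    'sex': ['MALE', 'FEMALE', '.', np.nan],
    'body_mass_g': [3750.0, 3800.0, 4875.0, np.nan],
})

# isnull() only counts true NaN values, so the '.' placeholder is missed
before = toy.isnull().sum()

# After replacing the placeholder, both missing entries are counted
cleaned = toy.replace('.', np.nan)
after = cleaned.isnull().sum()

print('missing in sex before:', before['sex'])
print('missing in sex after: ', after['sex'])
```

Checking `.unique()` on each categorical column, as done above, is a simple way to spot placeholders like this.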

### Imputation

In the case of missing values, we have the option to fill in the missing values with the best guess. This is called **imputation**. Here we'll impute any missing values using the average, or mean, of all the data that does exist, as that's the best guess for a data point if all we have is the data itself. To do that we'll use the `SimpleImputer` to assign the mean to all missing values in the data.

There are also other strategies that can be used to impute missing data ([see documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)).

Let's see how the `SimpleImputer` works on a subset of the data. 

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan,
                        strategy='mean', 
                        copy=True)
imputed = imputer.fit_transform(data[['body_mass_g','flipper_length_mm']])


Now let's check that the previously null values have been filled in. 

In [None]:
print(imputed[data[data['body_mass_g'].isna()].index])
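
As a small worked example of mean imputation, the sketch below uses a toy array (hypothetical values) so that the filled-in numbers are easy to check by hand. The fitted column means are stored in the imputer's `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array (hypothetical values) with one missing entry per column
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

toy_imputer = SimpleImputer(strategy='mean')
X_filled = toy_imputer.fit_transform(X_toy)

# statistics_ holds the per-column means learned during fit,
# computed from the non-missing values only
print(toy_imputer.statistics_)
print(X_filled)
```

Each missing entry is replaced by the mean of the observed values in its own column: `(1 + 3) / 2` for the first column and `(10 + 20) / 2` for the second.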

### Dropping Null Values

Another option is to use `DataFrame.dropna()` to drop rows with null values from the `DataFrame`. This should almost always be used with the `subset` argument, which restricts the function to dropping only the rows that are null in the given column(s).

In [None]:
data = data.dropna(subset=['sex'])

# Now this line will return an empty dataframe
data[data['sex'].isna()]
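
The effect of the `subset` argument can be sketched on a toy `DataFrame` (hypothetical values). Without `subset`, a null in any column removes the row. With it, only nulls in the listed columns do.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one null in each column, on different rows
toy = pd.DataFrame({
    'sex': ['MALE', np.nan, 'FEMALE'],
    'body_mass_g': [3750.0, 3800.0, np.nan],
})

drop_any = toy.dropna()                # drops rows with a null in any column
drop_sex = toy.dropna(subset=['sex'])  # drops only rows where 'sex' is null

print(len(drop_any), len(drop_sex))
```

The row with a missing `body_mass_g` survives the second call, which leaves it available for imputation later.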

## Categorical Data Processing

As we saw earlier, the `penguins` dataset contains both categorical and continuous features, which will each need to be preprocessed in different ways. First, we want to transform the categorical variables from strings to **indicator variables**. Indicator variables have one column per level. For example, the island variable will change from Biscoe/Dream/Torgersen --> Biscoe (1/0), Dream (1/0), and Torgersen (1/0). For each set of indicator variables, there should be a 1 in exactly one column.

Let's make a list of the categorical variable names to be transformed into indicator variables.

In [None]:
# Define the variable names that are categorical for use later
cat_var_names = ['island', 'sex']
data_cat = data[cat_var_names]
data_cat.head()

### Categorical Variable Encoding (One-hot & Dummy)

Many machine learning algorithms require that categorical data be encoded numerically in some fashion. There are two main ways to do so:


- **One-hot-encoding**, which creates `k` new variables for a single categorical variable with `k` categories (or levels), where each new variable is coded with a `1` for the observations that contain that category, and a `0` for each observation that doesn't. 
- **Dummy encoding**, which creates `k-1` new variables for a categorical variable with `k` categories

However, when using some machine learning algorithms we can run into the so-called ["Dummy Variable Trap"](https://www.algosome.com/articles/dummy-variable-trap-regression.html) when using one-hot encoding on multiple categorical variables within the same set of features. This occurs because each set of one-hot-encoded variables can be added together across columns to create a single column of all `1`s, so the sets are multicollinear when multiple one-hot-encoded variables exist within a given model. This can lead to misleading results. 

To resolve this, we can simply add an intercept term to our model (which is all `1`s) and remove the first one-hot-encoded variable for each categorical variable, resulting in `k-1` so-called "Dummy Variables". 
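
The multicollinearity described above can be checked numerically. The sketch below is an illustration on toy data (hypothetical rows) and uses `pd.get_dummies` rather than the `sklearn` encoder, purely to keep it short. It compares the number of columns in each design matrix with its rank.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows) with two categorical variables
toy = pd.DataFrame({
    'island': ['Biscoe', 'Dream', 'Torgersen', 'Biscoe', 'Dream', 'Torgersen'],
    'sex': ['MALE', 'FEMALE', 'FEMALE', 'FEMALE', 'MALE', 'MALE'],
})

# One-hot: 3 + 2 = 5 columns; dummy: 2 + 1 = 3 columns
one_hot = pd.get_dummies(toy).to_numpy(dtype=float)
dummy = pd.get_dummies(toy, drop_first=True).to_numpy(dtype=float)

# Add an intercept column of ones to each design matrix
intercept = np.ones((len(toy), 1))
X_one_hot = np.hstack([intercept, one_hot])
X_dummy = np.hstack([intercept, dummy])

# A rank below the number of columns means some columns are redundant
print(X_one_hot.shape[1], np.linalg.matrix_rank(X_one_hot))
print(X_dummy.shape[1], np.linalg.matrix_rank(X_dummy))
```

The one-hot matrix has more columns than its rank, so some columns are linear combinations of others. The dummy-encoded matrix has full column rank.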

Luckily the `OneHotEncoder` from `sklearn` can perform both one-hot and dummy encoding simply by setting the `drop` parameter (`drop = 'first'` for Dummy Encoding and `drop = None` for One Hot Encoding). 

**Question:** How many total columns will there be in the output?

In [None]:
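from sklearn.preprocessing import OneHotEncoder

# `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2;
# on older versions, use `sparse=False` instead
dummy_e = OneHotEncoder(categories='auto', drop='first', sparse_output=False)
dummy_e.fit(data_cat)
dummy_e.categories_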

In [None]:
temp = dummy_e.transform(data_cat)
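
To compare the two settings of `drop` side by side, here is a minimal sketch on toy data. The `colour` and `size` columns are hypothetical and are not part of the penguins data. It relies on the encoder's default sparse output and converts it with `.toarray()`, which avoids the version-dependent `sparse`/`sparse_output` argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): 4 colour levels and 2 size levels
toy_cat = pd.DataFrame({
    'colour': ['red', 'green', 'blue', 'yellow'],
    'size': ['S', 'L', 'S', 'L'],
})

one_hot_toy_e = OneHotEncoder(drop=None)    # one-hot: k columns per variable
dummy_toy_e = OneHotEncoder(drop='first')   # dummy: k-1 columns per variable

# The default output is a sparse matrix; .toarray() makes it dense
one_hot_out = one_hot_toy_e.fit_transform(toy_cat).toarray()
dummy_out = dummy_toy_e.fit_transform(toy_cat).toarray()

print(one_hot_out.shape)
print(dummy_out.shape)
```

With `drop='first'`, the first category of each variable in sorted order (here `blue` and `L`) becomes the reference level and is represented by all zeros in that variable's columns.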

## Continuous Data Preprocessing

For numeric data, we don't need to create indicator variables. Instead, we need to normalize our variables, which helps improve the performance of many machine learning models.

Let's subset out the continuous variables to be normalized.

In [None]:
data_num = data.drop(columns=cat_var_names + ['species'])
data_num.head()

### Normalization

[Normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) is a transformation that puts data into some known "normal" scale. We use normalization to improve the performance of many machine learning algorithms (see [here](https://en.wikipedia.org/wiki/Feature_scaling)). There are many forms of normalization, but perhaps the most useful to machine learning algorithms is called the "z-score" also known as the standard score. 

To z-score normalize the data, we simply subtract the mean of the data, and divide by the standard deviation. This results in data with a mean of `0` and a standard deviation of `1`.

We'll use the `StandardScaler` from `sklearn` to do normalization.

In [None]:
from sklearn.preprocessing import StandardScaler
norm_e = StandardScaler()
norm_e.fit_transform(data_num).mean(axis=0)


To check that the normalization works, let's look at the mean and standard deviation of the resulting columns. 

**Question:** What should the mean and standard deviation be?

In [None]:
# Fit and transform once, then reuse the result for both checks
data_norm = norm_e.fit_transform(data_num)
print('mean:', data_norm.mean(axis=0))
print('std:', data_norm.std(axis=0))
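
The z-score formula described above can also be applied by hand. The sketch below, on a toy array with hypothetical values, checks that subtracting the column mean and dividing by the column standard deviation matches the output of `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy array (hypothetical values): two columns on very different scales
X_toy = np.array([[180.0, 3700.0],
                  [190.0, 3800.0],
                  [200.0, 4500.0],
                  [210.0, 5000.0]])

# Manual z-score: subtract the column mean, divide by the column std.
# NumPy's default std is the population std (ddof=0), which is also
# what StandardScaler uses.
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(z_manual, z_sklearn))
```

Note that `pandas` uses the sample standard deviation (`ddof=1`) by default, so calling `.std()` on a `DataFrame` of scaled values would give a result slightly different from `1`.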

---
## Challenge 1: Fitting preprocessing functions

The simple imputer, normalization and one-hot-encoding rely on sklearn functions that are fit to a data set. 

1) What is being fit for each of the three functions?
    1) One Hot Encoding
    2) Standard Scaler
    3) Simple Imputer
    
*YOUR ANSWER HERE*

When we are preprocessing data we have a few options: 
1) Fit on the whole data set
2) Fit on the training data
3) Fit on the testing data

Which of the above methods would you use and why?

*YOUR ANSWER HERE*

---


## Combine it all together

Now let's combine what we've learned to preprocess the entire dataset.

First we will reload the data set to start with a clean copy.

In [None]:
data = pd.read_csv('../data/penguins.csv')
data.replace('.', np.nan, inplace=True)
data = data.dropna(subset=['sex'])


In [None]:
# Perform the train-test split
y = data['species']
X = data.drop(columns='species')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, stratify=y)
print(X_train.shape)


We want to fit our preprocessing transformers (the encoder, imputer, and scaler) on the training data using `fit_transform`, then apply them to the test data using `transform`. This more closely resembles the workflow for brand new test data, which is not available when the transformers are fit.
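
As a minimal sketch of the fit-on-train, transform-on-test pattern, the example below uses two toy arrays (hypothetical values) in place of the real split. The scaler learns its mean and standard deviation from the training array only and reuses them on the test array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for a train/test split (hypothetical values)
train_toy = np.array([[1.0], [2.0], [3.0]])
test_toy = np.array([[4.0], [5.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train_toy)  # learns mean and std from training data
test_scaled = toy_scaler.transform(test_toy)        # reuses the training statistics

print(toy_scaler.mean_)      # mean learned from the training data only
print(test_scaled.ravel())   # test values expressed on the training scale
```

The scaled test values do not have a mean of `0`, and that is expected: they are measured relative to the training data.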

First, we will subset out the categorical and numerical features separately. 

In [None]:
# Get the categorical and numerical variable column names
cat_var = ['island', 'sex']
num_var = ['culmen_length_mm', 'culmen_depth_mm',
           'flipper_length_mm', 'body_mass_g']
# Subset the training data
X_train_cat = X_train[cat_var]
X_train_num = X_train[num_var]

# Subset the test data
X_test_cat = X_test[cat_var]
X_test_num = X_test[num_var]

Now, let's process the categorical data with **dummy encoding**.

In [None]:
warnings.filterwarnings('ignore')

# Categorical feature encoding
X_train_dummy = dummy_e.fit_transform(X_train_cat)
X_test_dummy = dummy_e.transform(X_test_cat)


# Check the shape
X_train_dummy.shape, X_test_dummy.shape

Now, let's process the numerical data by imputing any missing values and normalizing the results.

In [None]:
# Numerical feature standardization

# Impute the data
X_train_imp = imputer.fit_transform(X_train_num)
X_test_imp = imputer.transform(X_test_num)

# Check for missing values (both should print False)
print(np.isnan(X_train_imp).any(), np.isnan(X_test_imp).any())

# Normalize the imputed arrays, not the raw columns, so the imputation is used
X_train_norm = norm_e.fit_transform(X_train_imp)
X_test_norm = norm_e.transform(X_test_imp)

X_train_norm.shape, X_test_norm.shape

Now that we've processed the numerical and categorical data separately, we can put the two arrays back together.

In [None]:
X_train = np.hstack((X_train_dummy, X_train_norm))
X_test = np.hstack((X_test_dummy, X_test_norm))

X_train.shape, X_test.shape
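
The same steps can also be bundled into a single object. The sketch below is one possible alternative using `ColumnTransformer` and `Pipeline` from `sklearn`, shown on toy data with hypothetical values. It is not the approach used in the rest of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-ins for the training and test features (hypothetical values)
train_toy = pd.DataFrame({
    'island': ['Biscoe', 'Dream', 'Torgersen', 'Dream'],
    'sex': ['MALE', 'FEMALE', 'FEMALE', 'MALE'],
    'body_mass_g': [3750.0, np.nan, 3250.0, 4500.0],
})
test_toy = pd.DataFrame({
    'island': ['Dream', 'Biscoe'],
    'sex': ['FEMALE', 'MALE'],
    'body_mass_g': [np.nan, 4000.0],
})

# Numeric columns: impute first, then scale the imputed values
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])
preprocess = ColumnTransformer(
    [
        ('cat', OneHotEncoder(drop='first'), ['island', 'sex']),
        ('num', numeric_steps, ['body_mass_g']),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_out = preprocess.fit_transform(train_toy)  # fit on training data only
test_out = preprocess.transform(test_toy)        # reuse the fitted statistics

print(train_out.shape, test_out.shape)
```

The output columns follow the order of the transformer list: the dummy-encoded categorical columns first, then the scaled numeric column. This matches the order produced by `np.hstack` above.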

---
## Challenge 2: Order of Preprocessing

In the preprocessing we did the following steps: 

1) Null values
2) One-hot-encoding
3) Imputation
4) Normalization

Now, consider changing the order of the steps in the following ways. What effect might that have on the algorithms?

**Hint**: Try copying the code from above and trying it out!

- One-hot-encoding before dropping null values
- Normalization before dropping null values

**Bonus:** Are there any other switches in order that might affect preprocessing?

---

In [None]:
# YOUR CODE HERE

Finally, let's save our results as separate `.csv` files so we won't have to run the preprocessing again.

First we will convert the arrays to DataFrames and add column names, then write each one to a `.csv` file.

In [None]:
# Column names: dummy-encoded categorical features first, then numeric features
feature_names = ['Dream', 'Torgersen', 'Male',
                 'culmen_length_mm', 'culmen_depth_mm',
                 'flipper_length_mm', 'body_mass_g']

X_train = pd.DataFrame(X_train, columns=feature_names)
X_test = pd.DataFrame(X_test, columns=feature_names)

y_train = pd.DataFrame(y_train)
y_train.columns = ['species']

y_test = pd.DataFrame(y_test)
y_test.columns = ['species']

X_train.to_csv('../data/penguins_X_train.csv')
X_test.to_csv('../data/penguins_X_test.csv')
y_train.to_csv('../data/penguins_y_train.csv')
y_test.to_csv('../data/penguins_y_test.csv')


We will now move on to classification, but keep in mind that every choice made in the preprocessing pipeline affects the models trained on the resulting data.

---
## Challenge 3: Preprocessing and regularization

We are preprocessing data in preparation for a classification task down the line. However, preprocessing also applies to regression. 

Consider the regularization task applied in the previous notebook. How might the preprocessing steps affect the performance of regularization?

---

In [None]:
# YOUR CODE HERE