# Python Machine Learning: Preprocessing


## Load Data

Instead of being a built-in `sklearn` dataset, the `auto-mpg` dataset is stored in a `.csv` file that can be accessed from the UCI repository, so we'll use `pandas` to load in a local copy. This dataset will require some preprocessing, which we will do after performing some exploratory data analysis (EDA).

First, let's import some packages we'll need.

In [None]:
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data = pd.read_csv('data/auto-mpg.csv', index_col='car name')
data.head()

Below is the information for the variable types of each of the columns from the UCI machine learning repository's [website](https://archive.ics.uci.edu/ml/datasets/auto+mpg):
1. **mpg**: continuous
2. **cylinders**: multi-valued discrete
3. **displacement**: continuous
4. **horsepower**: continuous
5. **weight**: continuous
6. **acceleration**: continuous
7. **model year**: multi-valued discrete
8. **origin**: multi-valued discrete
9. **car name**: string (unique for each instance)

## Missing Data Preprocessing

Let's take a little more time to explore this dataset and perform any preprocessing necessary. One of the most important steps before we start any machine learning problem is to get a better understanding of the data at hand.

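First, we see that the original dataset has 398 rows and 9 columns (1 column to identify the unique cars, 1 column for the target variable, and 7 columns of independent variables). Because `car name` was used as the index when loading the file, `data.shape` reports 8 columns.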

In [None]:
data.shape

### Missing values

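Next, we want to check to see if there are any missing values. Note that `isna()` only flags values that `pandas` already treats as missing, so a placeholder string such as `'?'` in the `horsepower` column has to be found by inspecting the values and then converted to `NaN`.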

In [None]:
data.isna().any()

In [None]:
data['horsepower'].sort_values(ascending=False).unique()

In [None]:
data = data.replace('?', np.nan)
data = data.astype({'horsepower': 'float'})
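
As an alternative to replacing the placeholder after loading, `pd.read_csv` accepts an `na_values` argument. The sketch below uses a small hypothetical in-memory CSV (the rows and the `csv_text`, `raw`, and `clean` names are made up for illustration) rather than the real `data/auto-mpg.csv` file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the real CSV; only the '?' placeholder matters here.
csv_text = """mpg,horsepower,car name
18.0,130,chevrolet chevelle malibu
25.0,?,ford pinto
21.0,95,toyota corona
"""

# Without na_values, the '?' forces the whole column to be read as text.
raw = pd.read_csv(io.StringIO(csv_text), index_col='car name')
print(pd.api.types.is_numeric_dtype(raw['horsepower']))    # False

# With na_values='?', pandas parses the placeholder as NaN and keeps the column numeric.
clean = pd.read_csv(io.StringIO(csv_text), index_col='car name', na_values='?')
print(clean['horsepower'].dtype)                           # float64
print(clean['horsepower'].isna().sum())                    # 1
```

Either approach ends with a float column containing `NaN`, which is what the imputation step expects.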

In [None]:
data[data['horsepower'].isna()]
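
The cells that follow refer to `X_train_raw`, `X_test_raw`, `y_train_raw`, and `y_test_raw`, which are not created in this section. A minimal sketch of how such a split might be produced with `train_test_split` is shown below. The toy frame, the variable names ending in `_tr_raw` and `_te_raw`, and the `test_size` and `random_state` values are all assumptions for illustration, not the values used in this workshop.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the cleaned auto-mpg data.
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 36.0, 24.0, 31.0, 22.0, 27.0, 19.0, 33.0, 14.0],
    'cylinders': [8, 8, 4, 4, 4, 6, 4, 6, 4, 8],
    'horsepower': [130.0, 165.0, 58.0, np.nan, 65.0, 105.0, 88.0, 110.0, 70.0, 215.0],
})

# Separate the target from the input features.
X_toy = toy.drop(columns='mpg')
y_toy = toy['mpg']

# test_size and random_state are illustrative choices only.
X_tr_raw, X_te_raw, y_tr_raw, y_te_raw = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(X_tr_raw.shape, X_te_raw.shape)  # (8, 2) (2, 2)
```

Splitting before fitting any preprocessor is what allows the later steps to be fit on the train data alone.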

### Imputation

Imputation is the name given to the preprocessing step that fills in missing values. Here we'll impute any missing values using the average, or mean, of all the data that does exist, as that's a reasonable best guess for a data point if all we have is the data itself. To do that we'll use the `SimpleImputer` to assign the mean to all missing values, fitting it against the train data only.

There are also other strategies that can be used to impute missing data ([see documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)).
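
To make the behavior concrete, here is a small sketch on made-up arrays (`toy_train`, `toy_test`, and `toy_imputer` are hypothetical names). It shows that the column means are learned from the train array only and then reused to fill gaps in the test array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy train and test arrays with one missing value each.
toy_train = np.array([[1.0, 10.0],
                      [3.0, np.nan],
                      [5.0, 30.0]])
toy_test = np.array([[np.nan, 100.0]])

# Fit on the train array only: the learned statistics are the per-column means,
# here 3.0 for the first column and 20.0 for the second.
toy_imputer = SimpleImputer(strategy='mean')
toy_imputer.fit(toy_train)
print(toy_imputer.statistics_)

# The missing test value is filled with the train mean (3.0), not a test statistic.
print(toy_imputer.transform(toy_test))
```

Passing `strategy='median'` or `strategy='most_frequent'` instead changes only which statistic is learned during `fit`.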

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan,
                        strategy='mean', 
                        copy=True)
imputer.fit(X_train_raw);

## Categorical Data Processing

As we saw from the documentation, the `auto-mpg` dataset contains both categorical and continuous features, which will each need to be preprocessed in different ways. We'll want to transform the categorical variables into indicator variables (which are either 0 or 1) using a technique known as one-hot encoding.

Let's make a list of the categorical variable names to be transformed into indicator variables.

In [None]:
# Define the variable names that are categorical for use later
cat_var_names = ['cylinders', 'model year', 'origin']
X_train_raw_cat = X_train_raw[cat_var_names]
X_train_raw_cat.head()

### Categorical Variable Encoding (One-hot & Dummy)

Many machine learning algorithms require that categorical data be encoded numerically in some fashion. A common technique is one-hot encoding, which creates `k` new variables for a single categorical variable with `k` categories (or levels). Each new variable is coded with a `1` for the observations that contain that category and a `0` for those that don't.

However, when using some machine learning algorithms, such as linear regression, ridge regression, and elastic net regression (which we will use first), we can run into the so-called ["Dummy Variable Trap"](https://www.algosome.com/articles/dummy-variable-trap-regression.html) when using one-hot encoding on multiple categorical variables within the same set of features. This occurs because the one-hot-encoded columns for each categorical variable sum across columns to a single column of all `1`s, so the sets are multicollinear when more than one one-hot-encoded variable exists in a given model. This can lead to misleading results when using the aforementioned algorithms.

To resolve this, we can simply add an intercept term to our model (a column of all `1`s) and remove the first one-hot-encoded variable for each categorical variable, resulting in `k-1` so-called "dummy variables".
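
The redundancy can be checked numerically. The sketch below uses a made-up three-level variable (`levels` and the other names are hypothetical) and compares the rank of the design matrix with and without the first indicator column. A rank lower than the number of columns means the columns are linearly dependent.

```python
import numpy as np

# A toy categorical variable with k = 3 levels, coded 0, 1, 2.
levels = np.array([0, 1, 2, 0, 1, 2])

# Full one-hot encoding: one indicator column per level.
one_hot = np.eye(3)[levels]
intercept = np.ones((len(levels), 1))

# Each row of the one-hot block sums to 1, which duplicates the intercept column.
print(one_hot.sum(axis=1))

# Intercept plus all k indicators: 4 columns but only rank 3.
full = np.hstack([intercept, one_hot])
print(np.linalg.matrix_rank(full), full.shape[1])      # 3 4

# Intercept plus k-1 dummies: 3 columns and rank 3.
dummy = np.hstack([intercept, one_hot[:, 1:]])
print(np.linalg.matrix_rank(dummy), dummy.shape[1])    # 3 3
```

Dropping one level loses no information, because an observation with all dummies equal to `0` must belong to the dropped level.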

Luckily the `OneHotEncoder` from `sklearn` can perform both one-hot and dummy encoding simply by setting the `drop` parameter. Let's use it to transform the `cylinders`, `model year`, and `origin` variables into `k-1` dummy variables.

In [None]:
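from sklearn.preprocessing import OneHotEncoder

# drop='first' yields k-1 dummy columns per categorical variable.
# `sparse_output` replaced the `sparse` argument in scikit-learn 1.2; use `sparse=False` on older versions.
dummy_e = OneHotEncoder(categories='auto', drop='first', handle_unknown='ignore', sparse_output=False)
dummy_e.fit(X_train_raw_cat);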

Before using the dummy encoder, there are 21 total unique values (or possible variables) among the categorical variables. Because the dummy encoder drops one level from each of the three variables, the encoded output will have 18 columns.

In [None]:
num_unique = sum([len(cat) for cat in dummy_e.categories_])
print(f"{num_unique} total unique values among the categorical variables")

### [OPTIONAL] Using `pandas`

Optionally, you can use `pandas` to do one-hot encoding or dummy encoding. The problem with this, as we'll see in Day 3 of this workshop, is that it cannot be included in an `sklearn` pipeline, which will be a useful thing to do. Similar to the `drop` parameter of the `OneHotEncoder`, we can set the optional parameter `drop_first` to change the behavior of the function from one-hot encoding to dummy encoding.

In [None]:
X_train_raw_dummy = pd.get_dummies(X_train_raw, columns=cat_var_names, drop_first=True)
X_train_raw.shape, X_train_raw_dummy.shape
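
One practical consequence of `pd.get_dummies` having no separate fit step is sketched below on made-up frames (`toy_train`, `toy_test`, and `enc` are hypothetical names). When a level is absent from the test data, the two encoded frames end up with different columns, whereas a fitted `OneHotEncoder` keeps the column layout learned from the train data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: level 3 of `origin` appears in train but not in test.
toy_train = pd.DataFrame({'origin': [1, 2, 3, 1]})
toy_test = pd.DataFrame({'origin': [1, 2]})

# get_dummies encodes each frame independently, so the columns disagree.
train_dummies = pd.get_dummies(toy_train, columns=['origin'], drop_first=True)
test_dummies = pd.get_dummies(toy_test, columns=['origin'], drop_first=True)
print(list(train_dummies.columns))   # ['origin_2', 'origin_3']
print(list(test_dummies.columns))    # ['origin_2']

# A fitted encoder remembers the train levels and always emits the same columns.
# The default output is a sparse matrix, so convert it for display.
enc = OneHotEncoder(drop='first').fit(toy_train)
print(enc.transform(toy_test).toarray().shape)   # (2, 2)
```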

## Continuous Data Preprocessing

Preprocessing continuous data requires different steps than categorical data. We'll still want to impute continuous data, but here we use the mean, median, or even more complex methods to make guesses at the missing data values. We don't need to create indicator variables. Instead, we need to normalize our variables, which helps improve the performance of many machine learning models.

Let's subset out the continuous variables to be normalized.

In [None]:
X_train_raw_num = X_train_raw.drop(columns=cat_var_names)
X_train_raw_num.head()

### Normalization

[Normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) is a transformation that puts data into some known "normal" scale. We use normalization to improve the performance of many machine learning algorithms (see [here](https://en.wikipedia.org/wiki/Feature_scaling)). There are many forms of normalization, but perhaps the most useful to machine learning algorithms is the "z-score", also known as the standard score.

To z-score normalize the data, we simply subtract the mean of the data, and divide by the standard deviation. This results in data with a mean of `0` and a standard deviation of `1`.

We'll use the `StandardScaler` from `sklearn` to do normalization.

In [None]:
from sklearn.preprocessing import StandardScaler
norm_e = StandardScaler()
norm_e.fit(X_train_raw_num)
norm_e.mean_, norm_e.var_
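
As a quick check that `StandardScaler` matches the z-score formula described above, the sketch below compares it against a manual computation on a made-up column (`x`, `manual`, and `scaled` are hypothetical names). Note that the scaler uses the population standard deviation, which is also the default of `np.std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single toy feature column.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual z-score: subtract the mean, divide by the standard deviation (ddof=0).
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation via scikit-learn.
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))   # True

# The result has mean approximately 0 and standard deviation approximately 1.
print(scaled.mean(), scaled.std())
```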

## Combine it all together

Now let's combine what we've learned to preprocess the entire dataset. On Day 3, we'll learn how to do this using an `sklearn` object called a `Pipeline`. While these objects are extremely useful for preventing data leakage and keeping preprocessing structured, they require some setup, so we will use our preprocessors directly for now.

### Transform the `train` and `test` Input Data

Because we've already fit our preprocessors on the train data, we can be safe in the knowledge that we can use them to transform both the train and test data without any data leakage.

First, use the imputer to fill the missing values.

In [None]:
# Impute the data
X_train_imp = imputer.transform(X_train_raw)
X_test_imp = imputer.transform(X_test_raw)

# Check for missing values
np.isnan(X_train_imp).any(), np.isnan(X_test_imp).any()

Subset out the categorical and numerical features separately. 

In [None]:
# Get the categorical and numerical variable column indices
feature_map = {idx:feat for idx, feat in enumerate(imputer.feature_names_in_)}
cat_var_idx = [idx for idx, feat in feature_map.items() if feat in cat_var_names]
num_var_idx = [idx for idx, feat in feature_map.items() if feat not in cat_var_names]

# Slice the training array into categorical and numerical columns
X_train_cat = X_train_imp[:, cat_var_idx]
X_train_num = X_train_imp[:, num_var_idx]

# Slice the test array the same way
X_test_cat = X_test_imp[:, cat_var_idx]
X_test_num = X_test_imp[:, num_var_idx]

Apply the dummy encoder to the categorical variables and the normalizer to the numerical variables.

In [None]:
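# Suppress warnings, such as the feature-name warning that can appear when an
# encoder fit on a DataFrame is later given a plain NumPy array
warnings.filterwarnings('ignore')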

# Categorical feature encoding
X_train_dummy = dummy_e.transform(X_train_cat)
X_test_dummy = dummy_e.transform(X_test_cat)

X_train_dummy.shape, X_test_dummy.shape

In [None]:
# Numerical feature standardization
X_train_norm = norm_e.transform(X_train_num)
X_test_norm = norm_e.transform(X_test_num)

X_train_norm.shape, X_test_norm.shape

Finally, merge the categorical and numerical columns back into one array.

In [None]:
X_train = np.hstack((X_train_dummy, X_train_norm))
X_test = np.hstack((X_test_dummy, X_test_norm))

X_train.shape, X_test.shape

### Transform the `train` and `test` Outcome Variable

Just as we standardized the continuous variables in the input data, we will want to standardize the outcome (dependent) variable, `mpg`. Here, we'll use the `fit_transform` method on the train data, which performs both the `fit` and `transform` steps in a single call, since this scaler has not been fit beforehand. The test data is then transformed with the same fitted scaler.

In [None]:
mpg_scaler = StandardScaler()
y_train = mpg_scaler.fit_transform(y_train_raw.values.reshape(-1, 1))
y_test = mpg_scaler.transform(y_test_raw.values.reshape(-1, 1))
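
Because the outcome is now in standardized units, any predictions from a model trained on it will be too. A minimal sketch of mapping values back to the original scale with `inverse_transform` follows. The toy targets and the "predictions" are made up for illustration, and no model is actually fit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy mpg values standing in for the real train targets.
y_toy_train = np.array([18.0, 24.0, 30.0, 36.0]).reshape(-1, 1)

toy_scaler = StandardScaler()
y_toy_scaled = toy_scaler.fit_transform(y_toy_train)

# Pretend these are model predictions in standardized units.
fake_preds = np.array([[0.0], [1.0]])

# inverse_transform multiplies by the train standard deviation and adds the train mean,
# so a standardized 0.0 maps back to the train mean of 27.0 mpg.
preds_mpg = toy_scaler.inverse_transform(fake_preds)
print(preds_mpg.ravel())

# Round trip: scaling and then inverting recovers the original values.
print(np.allclose(toy_scaler.inverse_transform(y_toy_scaled), y_toy_train))  # True
```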

In scikit-learn, as soon as you have `X_train`, `X_test`, `y_train`, and `y_test`, everything else is a matter of choosing your model and its parameters. This should not be trivialized: selecting a model and that model's parameters is *very* important. While we will not cover it here, choosing the correct model and parameters is the core skill of applying machine learning algorithms, and can have dramatic effects on the performance of your predictions.