(ml-data)=
# Prepping data for Machine Learning

## Introduction

In this chapter, we're going to look at some issues around preparing data for machine learning. This chapter is enormously indebted to the [**scikit-learn**](https://scikit-learn.org/) documentation and Chris Albon's [Machine Learning Flashcards](https://machinelearningflashcards.com/).

The context here is that some machine learning algorithms are not *scale-free*, i.e. the units your measurements are in really matter, and you will get better or worse results depending on whether you have rescaled your data appropriately. One algorithm that benefits from rescaling is the Support Vector Machine. Scaling and pre-processing can help in different ways, but one key way is by easing convergence (such as with non-penalised logistic regression).

In this section, we'll also talk about some pitfalls with pre-processing—namely the risk of information leakage.

There are a few different ways to scale data, as we'll see, and, when scaling, you will need to remember to put your data back into the original "space" if you want to interpret predictions.

Of course, machine learning isn't the only context in which you may want to pre-process your data by rescaling it somehow, and these methods can be used in other scenarios too.

## Pre-processing

First, some imports:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

random_state = 42  # We'll use this throughout to make this page reproducible
prng = np.random.default_rng(random_state)

In [None]:
import matplotlib_inline.backend_inline

# Plot settings
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

# Set max rows displayed for readability
pd.set_option("display.max_rows", 6)

### Standardisation and data leakage

Standardisation is a common requirement for many machine learning estimators. These estimators might not be able to work at peak performance if the individual features do not more or less look like standard, normally distributed data: that is, a Gaussian with zero mean and unit variance.

In practice, we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
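The centring and scaling described above can be sketched by hand with **numpy**. This is a minimal illustration on made-up data (the array `data` and the helper name `safe_std` are assumptions for this example, not part of any library); the constant third feature is only centred, and its divisor is set to 1 to avoid dividing by zero.

```python
import numpy as np

# Hypothetical toy data: three features, the last one constant
data = np.array(
    [
        [1.0, 10.0, 5.0],
        [2.0, 20.0, 5.0],
        [3.0, 60.0, 5.0],
    ]
)

mean = data.mean(axis=0)  # per-feature mean
std = data.std(axis=0)  # per-feature population standard deviation (ddof=0)

# Constant features have zero standard deviation; divide those by 1 instead
safe_std = np.where(std == 0, 1.0, std)

standardised = (data - mean) / safe_std

print("Means after standardising:", standardised.mean(axis=0).round(3))
print("Stds after standardising:", standardised.std(axis=0).round(3))
```

The non-constant features end up with zero mean and unit standard deviation, while the constant feature becomes a column of zeros.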

**scikit-learn** provides tools for standardisation. We'll demonstrate, first creating some fake data with 2 features:

In [None]:
X = np.array([[8, 7, 9, 11, 12, 13, 15, 5, 20, 0, 0.43, 16.7],
              [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.3, 0.7, 0.88, 0.33, 0.22]]).T
print("The mean of X is:")
print(X.mean(axis=0).round(3))
print("The std of X is:")
print(X.std(axis=0).round(3))

In [None]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X)
scaler

Remember, everything is an object! We've created a scaler object. It has state:

In [None]:
print(scaler.mean_)
print(scaler.scale_)

And we can use it as a *function* to scale other data. Well, technically, we're using the `transform` *method*, which is available to scaler objects.

In [None]:
X_scaled = scaler.transform(X)
print("The mean of X_scaled is:")
print(X_scaled.mean(axis=0).round(3))
print("The std of X_scaled is:")
print(X_scaled.std(axis=0).round(3))
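To make the link between the scaler's state and its `transform` method explicit, here is a small self-contained sketch. It uses a made-up array (`X_demo` is an assumption for illustration) to check that the transform is just "subtract `mean_`, divide by `scale_`", and that `inverse_transform` takes scaled values back to the original space, which is what you need when interpreting predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical small dataset with two features on very different scales
X_demo = np.array([[8.0, 0.1], [7.0, 0.2], [9.0, 0.3], [11.0, 0.6]])

demo_scaler = StandardScaler().fit(X_demo)
X_demo_scaled = demo_scaler.transform(X_demo)

# Reproduce the transform by hand from the fitted state
manual = (X_demo - demo_scaler.mean_) / demo_scaler.scale_
print("Manual calculation matches transform:", np.allclose(manual, X_demo_scaled))

# Go back to the original units
X_demo_back = demo_scaler.inverse_transform(X_demo_scaled)
print("Round trip recovers original data:", np.allclose(X_demo_back, X_demo))
```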

Here we come to an important point: *your scaler should only be created from your training data*. Why? Because the mean and standard deviation are *global functions* that take information from the entire series. So, if you naively use the mean and std from the entire series, you are letting information from the test set into your scaling function, and this could (erroneously) inflate measured performance. So, typically, you'll be doing steps that look like this:

```python
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ... train `model` on X_train_scaled ...
X_test_scaled = scaler.transform(X_test)
y_pred_scaled = model.predict(X_test_scaled)
```

Not all pre-processing functions have this problem: it only affects *global* ones, which take information from the entire series. But most pre-processing functions are global, so you do need to take care.
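For a version of those steps that runs end to end, here is a minimal sketch. The synthetic data from `make_classification`, the choice of `LogisticRegression`, and the split sizes are all assumptions made purely for illustration; the point is only the order of operations, with the scaler fitted on the training rows alone.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (an assumption for this example)
X_all, y_all = make_classification(n_samples=200, n_features=4, random_state=42)

# Split first, so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.25, random_state=42
)

# Fit the scaler on the training data only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Train on the scaled training data
model = LogisticRegression().fit(X_train_scaled, y_train)

# Re-use the training-set mean and std to scale the test data
X_test_scaled = scaler.transform(X_test)
y_pred = model.predict(X_test_scaled)

print(f"Test accuracy: {(y_pred == y_test).mean():.3f}")
```

Note that the scaled test data will generally not have exactly zero mean and unit variance, because it was scaled with statistics from the training set. That is expected, and it is what prevents the leak.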

### Minmax scaling

Another popular 

## Pipelines

