# Feature engineering

In this notebook we cover examples of common feature engineering tasks on both numeric and categorical data. The goal isn't to be exhaustive, but to provide enough examples that you get the picture. As we will see in future lectures, good feature engineering can significantly improve model performance, but feature engineering, like machine learning in general, is "part art, part science". Let's see what we mean by that.

We begin with numeric features. First, let's read some data.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline

In [None]:
LocalFile = './data/auto-mpg.csv'
UCI_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
url = UCI_url
auto = pd.read_csv(url, sep = r'\s+', header = None, # raw string so that \s is not treated as a string escape
                   names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 
                            'acceleration', 'model year', 'origin', 'car_name'])

auto['cylinders'] = auto['cylinders'].astype('category')
auto.describe()

We saw some examples of **feature transformation** functions in the previous lesson. **Feature engineering**, in fact, consists of running similar feature transformations on the data and gradually modifying existing columns and adding new features to the data, with the goal of ending up with features that are more useful to the model than the original features we started with. What makes feature engineering so special is that we apply these transformations with an eye towards making the machine learning easier or more doable. Having good features (what feature engineering is all about) can significantly impact how well we do when we move on to machine learning.

A common type of feature transformation for numeric features is **feature normalization**. Note that **normalization** is a word that means something very different in relational databases than in machine learning, so be careful not to confuse the two. 

The general formula for **linear normalization** is:
$$xNorm = \dfrac{x - offset}{scale}$$
- **offset** is the amount subtracted from the original variable, which shifts its location
- **scale** is the amount the shifted variable is divided by, which shrinks or stretches its spread

The two most common ways to normalize features are **Z-normalization** and **min-max normalization**:

- **Z-normalization** consists of the following transformation, and results in most of the values for the transformed $x$ falling between -2 and 2 (about 95% of them when $x$ is roughly normally distributed).
$$x \rightarrow \dfrac{x - mean(x)}{std(x)}$$
- **min-max normalization** consists of the following transformation, and forces all the values of the transformed $x$ to be between 0 and 1: 
$$x \rightarrow \dfrac{x - min(x)}{max(x) - min(x)}$$  
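
As a small worked example of both formulas, take $x = (2, 4, 6)$. The three values are made up for illustration, and we assume here that $std(x)$ is the sample standard deviation (the one that divides by $n - 1$):

$$mean(x) = 4, \quad std(x) = \sqrt{\dfrac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3 - 1}} = 2$$
$$\text{Z-normalization: } (2, 4, 6) \rightarrow \left(\dfrac{2-4}{2}, \dfrac{4-4}{2}, \dfrac{6-4}{2}\right) = (-1, 0, 1)$$
$$\text{min-max normalization: } (2, 4, 6) \rightarrow \left(\dfrac{2-2}{6-2}, \dfrac{4-2}{6-2}, \dfrac{6-2}{6-2}\right) = (0, 0.5, 1)$$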

So what is the purpose of normalization? If we only have one feature, normalization might not serve a purpose. Normalization makes sense when we have many features and we want to **put them on the same scale**, which is why normalization is also sometimes called **rescaling** or **standardization**. **Some (but not all) ML algorithms only work properly if the data is normalized, otherwise the features that have larger numbers or large scales will dominate the model.** So normalization acts as a way of leveling the playing field among variables.
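
To make "features with larger scales dominate" a bit more concrete, here is a minimal sketch. It assumes a method that compares two cars by Euclidean distance, and the two cars, the feature minimums and maximums, and the helper `weight_share` are all made up for illustration (they are not taken from the auto-mpg data).

```python
import numpy as np

# Two hypothetical cars described by (weight in lbs, acceleration in seconds).
car_a = np.array([3500.0, 12.0])
car_b = np.array([3600.0, 20.0])

# Assumed minimum and maximum of each feature, used for min-max normalization.
feat_min = np.array([1500.0, 8.0])
feat_max = np.array([5500.0, 25.0])

def weight_share(a, b):
    """Fraction of the squared Euclidean distance contributed by the first feature (weight)."""
    sq_diff = (a - b) ** 2
    return sq_diff[0] / sq_diff.sum()

# On the raw scale, weight is measured in thousands and acceleration in tens.
raw_share = weight_share(car_a, car_b)

# After min-max normalization both features live on a 0-to-1 scale.
norm_a = (car_a - feat_min) / (feat_max - feat_min)
norm_b = (car_b - feat_min) / (feat_max - feat_min)
norm_share = weight_share(norm_a, norm_b)

print('Share of squared distance due to weight (raw):', raw_share)
print('Share of squared distance due to weight (normalized):', norm_share)
```

On the raw scale, a modest 100 lb difference in weight accounts for about 99% of the squared distance, even though the two cars differ far more in acceleration relative to its range. After normalization, the weight difference accounts for well under 1%.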

### Exercise (15 minutes)

- Normalize `mpg`, `displacement`, `weight` and `acceleration`. Instead of overwriting the original columns, add the normalized features as new columns and name each column using the column name and a `_norm` suffix. You are free to choose between Z-normalization and min-max normalization.

In [None]:
#Add code here


Unless your Python skills are improving by leaps and bounds, you probably normalized the features one at a time. What if we wanted to do it all at once? 

- Write a loop to iterate over the four columns and normalize each. To make it easier, we already put the column names in a list for you.

In [None]:
num_cols = ['mpg', 'displacement', 'weight', 'acceleration']
# your code goes here


There is an even better way to run our transformations all at once without writing a loop. First we have to write a function whose input is an array and whose output is an array of the same size with the values normalized. 

- Write such a function and use the cell below to test it and make sure it works.

In [None]:
def normalize(x):
    x_norm = x # Add code here:  modify here to write your function
    return x_norm

x_test = np.array([3, 5, 9, 11, 2, 0])
normalize(x_test)

- Apply the function to the data. HINT: use the `apply` method.
  - note that we need to limit the data to only the four columns we wish to transform
  - we need to use the `axis = 0` argument to let `apply` know that the transformation applies to columns (`axis = 1` would apply it to rows, which is not what we want here)
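
If the `axis` argument is confusing, the following sketch may help. It uses a small made-up `DataFrame` called `toy` and a hypothetical helper `value_range` (deliberately not a normalization function, so the exercise is still yours to solve) to show what `apply` passes to the function in each case.

```python
import pandas as pd

# A small made-up data set with two numeric columns.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def value_range(v):
    """Return the difference between the largest and smallest value in v."""
    return v.max() - v.min()

# axis = 0: the function receives one column at a time, so we get one result per column.
by_column = toy.apply(value_range, axis = 0)

# axis = 1: the function receives one row at a time, so we get one result per row.
by_row = toy.apply(value_range, axis = 1)

print(by_column)
print(by_row)
```

With `axis = 0` we get two results (one for `a`, one for `b`), and with `axis = 1` we get three results (one per row). If the function returns an array of the same length as its input instead of a single number, `apply` gives back a full `DataFrame`.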

In [None]:
# Add code here


- Check the results using `describe`. Then do an additional sanity check:
  - if your function is doing Z-normalization, then check the mean and standard deviation of your normalized columns to make sure they are 0 and 1 respectively
  - if your function is doing min-max normalization, then check the minimum and maximum values of your normalized columns to make sure they are 0 and 1 respectively

In [None]:
# Add code here


In our implementation of the `normalize` function above, we computed the mean and standard deviation (or min and max in the case of min-max normalization) **on the fly**. This means that any time we want to normalize new data, we compute the mean and standard deviation of the new data and then normalize it accordingly. In machine learning, this poses a problem: normalizing two different data sets using the mean and standard deviation of each means that they each get normalized slightly differently and we lose consistency (we will see why in future lectures). So instead, we want to learn the mean and standard deviation of one data set, and normalize **that data and any future data** using the same mean and standard deviation.

Modify the `normalize` function so that the mean and standard deviation are computed from one variable (`by`) and then used to normalize another variable (`x`). The test below prints the following results:
- normalize `x_test_1` using the normalization parameters from `x_test_1` 
- normalize `x_test_2` using the normalization parameters from `x_test_1`
- normalize `x_test_2` using the normalization parameters from `x_test_2`

In [None]:
# Add code here

def normalize(x, by):
    x_norm = x # modify here to write your function
    return x_norm

x_test_1 = np.array([3, 5, 9, 11, 2, 0])
x_test_2 = np.array([1, 2, 5, 13, 9, -4])

print('Normalization of x_test_1:', normalize(x=x_test_1, by=x_test_1))
print('Normalization of x_test_2 by parameters from x_test_1:', normalize(x=x_test_2, by=x_test_1))
print('Normalization of x_test_2 by parameters from x_test_2:', normalize(x=x_test_2, by=x_test_2))

This last part has important consequences in machine learning. As we will see next, this is something that is automatically handled by `sklearn`.

### End of exercise

Since normalization is a very common task, you shouldn't be surprised to find out that there are already functions for it. I hope you still found the previous exercise useful!

However, this time we have to go to the `sklearn` library to find our function. The `sklearn` library is Python's most common machine learning library and one that we will return to in future lectures. In addition to the machine learning algorithms we will learn about, `sklearn` also has functions for **pre-processing data**, which is a vague term that includes tasks such as missing-value imputation, feature engineering and so on. Let's see how we can use it to normalize our data.

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
num_cols_minmax = [c + '_minmax' for c in num_cols] # names of min-max-transformed columns

At first blush, the following code might look a little strange, but this pattern, as we will see, is very common in ML-related tasks in `sklearn`:
- initialize the process by choosing the function (with any arguments we wish)
- run `fit` first on the data to determine the parameters
- run `transform` to apply the parameters in the transformation

In [None]:
minmax_scaler = MinMaxScaler() # initialization / create an instance of the class
minmax_scaler.fit(auto[num_cols])
auto[num_cols_minmax] = minmax_scaler.transform(auto[num_cols])
auto[num_cols_minmax].head()

In [None]:
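print(minmax_scaler.data_min_) # per-column minimum learned by fit
print(minmax_scaler.data_max_) # per-column maximum learned by fit
print(1./minmax_scaler.scale_) # per-column range (max - min) under the default 0-to-1 output range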
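
The attributes printed above are all that `transform` needs. As a rough check, the sketch below fits a `MinMaxScaler` with default settings on a small made-up `DataFrame` called `toy` (so it runs without the auto data) and reproduces `transform` by hand from `data_min_` and `data_max_`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# A small made-up data set standing in for auto[num_cols].
toy = pd.DataFrame({'a': [1.0, 3.0, 5.0], 'b': [10.0, 20.0, 40.0]})

scaler = MinMaxScaler()
scaler.fit(toy)

# Apply the min-max formula by hand using the parameters learned by fit.
manual = (toy.to_numpy() - scaler.data_min_) / (scaler.data_max_ - scaler.data_min_)

# Compare with what transform returns.
same_as_transform = np.allclose(manual, scaler.transform(toy))
print('Manual formula matches transform:', same_as_transform)

# With the default output range of 0 to 1, 1 / scale_ is the per-column range (max - min).
print(scaler.data_range_)
print(1. / scaler.scale_)
```

Both of the last two lines print the same per-column ranges, which is why `1./minmax_scaler.scale_` plays the role of **scale** in the linear normalization formula.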

In [None]:
sns.pairplot(auto[num_cols_minmax]);

Here's the same example, but using Z-normalization.

In [None]:
num_cols_z = [c + '_z' for c in num_cols] # names of Z-transformed columns
znorm_scaler = StandardScaler()
znorm_scaler.fit(auto[num_cols])
auto[num_cols_z] = znorm_scaler.transform(auto[num_cols])
auto[num_cols_z].head()
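
One detail worth knowing if you run `describe` on the Z-normalized columns: `StandardScaler` divides by the population standard deviation (`ddof = 0`), while the `std` method in `pandas` uses the sample standard deviation (`ddof = 1`) by default. So the standard deviation reported by `pandas` will be slightly above 1. The sketch below shows this on a small made-up column `x`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A small made-up column standing in for one of the auto features.
toy = pd.DataFrame({'x': [2.0, 4.0, 6.0, 8.0]})

# fit_transform runs fit and transform on the same data in one call.
z = pd.Series(StandardScaler().fit_transform(toy)[:, 0])

pop_std = z.std(ddof = 0)   # population standard deviation: what StandardScaler sets to 1
sample_std = z.std()        # pandas default (ddof = 1): slightly larger than 1

print('mean:', z.mean())
print('std with ddof = 0:', pop_std)
print('std with ddof = 1:', sample_std)
```

With $n$ rows the two differ by a factor of $\sqrt{n / (n - 1)}$, so the gap is large for this four-row example but tiny for a data set with a few hundred rows.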

Let's look at the scatter plot matrix for the normalized features.

In [None]:
sns.pairplot(auto[num_cols_z]);

### Comment on Scatterplots and Histograms
In either case, it doesn't look like normalization changed anything in the scatter plot matrix. Do you notice what changed? The answer is that the **range of the data** is what changed. Just check the $x$ and $y$ axes and you'll see. Normalization is not supposed to change the shape of the distribution of the data, just put all features on the same scale.
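
We can also confirm this numerically. The sketch below uses two made-up features generated with `numpy` (loosely imitating `weight` and `mpg`, but not the real data) and a hypothetical helper `min_max`. Because normalization is a linear rescaling, the correlation between the two features is the same before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up, negatively related features on very different scales.
x = rng.normal(loc = 3000.0, scale = 800.0, size = 200)
y = 40.0 - 0.005 * x + rng.normal(scale = 2.0, size = 200)

def min_max(v):
    """Min-max normalize an array using its own min and max."""
    return (v - v.min()) / (v.max() - v.min())

x_norm = min_max(x)
y_norm = min_max(y)

corr_raw = np.corrcoef(x, y)[0, 1]
corr_norm = np.corrcoef(x_norm, y_norm)[0, 1]

print('Correlation before normalization:', corr_raw)
print('Correlation after normalization: ', corr_norm)
print('Range of x after normalization:  ', x_norm.min(), 'to', x_norm.max())
```

The two correlations agree (up to floating-point error), while the range of each feature is now 0 to 1.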

Let's look one last time at the code for normalizing the data using `sklearn`.

---
<font size="3">
    
`minmax_scaler = MinMaxScaler()`

`minmax_scaler.fit(auto[num_cols])`

`auto[num_cols_minmax] = minmax_scaler.transform(auto[num_cols])`

`auto[num_cols_minmax].head()`

</font>

---
You might be curious why we use `fit` followed by `transform`. What exactly happens when we run `fit`? Why should those two steps not be a single step? Here's a short answer using `MinMaxScaler` as our example:
  - When we run `fit` we find the min and max of each column and remember them.
  - When we run `transform` we apply the transformation using the min and max we found when we ran `fit`.

This means that we can learn the min and max once, and then apply the **same** transformation (with the same min and max) not just to the original data, but any future data. In machine learning, this has important consequences, but that's the topic of a future lecture.
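
Here is a minimal sketch of what that looks like in practice. The arrays `train` and `new` are made up for illustration: we run `fit` on `train` only, and then reuse the learned min and max to transform `new`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: one column, with "new" data arriving after the scaler was fit.
train = np.array([[0.0], [5.0], [10.0]])
new = np.array([[2.0], [12.0]])

# Learn the min (0) and max (10) from the original data only.
scaler = MinMaxScaler()
scaler.fit(train)

# Reuse the same min and max for the new data.
new_scaled = scaler.transform(new)

# For comparison: re-fitting on the new data uses its own min (2) and max (12) instead.
new_refit = MinMaxScaler().fit_transform(new)

print('Using the parameters learned from train:', new_scaled.ravel())
print('Re-fitting on the new data:', new_refit.ravel())
```

Using the parameters from `train`, the value 2 maps to 0.2 and the value 12 maps to 1.2, so new data can land outside the 0-to-1 range. Re-fitting maps the same two values to 0 and 1, which is no longer consistent with how `train` was scaled.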

# Assignment

In this assignment, we want to read the `retail-churn.csv` dataset that we examined in a previous assignment and begin to pre-process it. The goal of the assignment is to become familiar with some common pre-processing and feature engineering steps by implementing them.

Find your assignment in **Lesson_06_h_assignment.ipynb**