# Feature Engineering

**Feature engineering** is the process of using domain knowledge to create new features (variables) from raw data that make machine learning algorithms work better. While [Exploratory Data Analysis (EDA)](e2e020_eda.ipynb) focuses on *understanding* the data, feature engineering focuses on *transforming* it to improve model performance.

This is often one of the most impactful steps in a machine learning pipeline: well-engineered features can dramatically improve model accuracy, while poorly chosen features can limit what even sophisticated algorithms can achieve.

## Previous steps

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

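housing = pd.read_csv("./data/housing.csv")

# Bin median income into five categories so the split can be stratified on them
income_cat = pd.cut(
    housing["median_income"],
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5],
)

# 80/20 split that preserves the income-category proportions in both sets
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=income_cat, random_state=42
)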


## Combining attributes

It often makes sense to combine existing variables into new, more informative ones. For example, the total number of rooms is not very useful if we don't know how many households there are. What we really want is the number of rooms per household. Similarly, the total number of bedrooms alone is not very useful: we probably want to compare it with the total number of rooms. Population per household also seems like an interesting combination to look at.

We add these three variables as new columns of the dataframe and then check how each one correlates with "median_house_value".

In [8]:
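# Work on an explicit copy; depending on the pandas version, assigning columns
# to the result of train_test_split may otherwise raise a SettingWithCopyWarning
train_set = train_set.copy()
train_set["rooms_per_house"] = train_set["total_rooms"] / train_set["households"] # rooms per household
train_set["bedrooms_ratio"] = train_set["total_bedrooms"] / train_set["total_rooms"] # bedrooms as a fraction of all rooms
train_set["people_per_house"] = train_set["population"] / train_set["households"] # people per household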

In [9]:
corr_matrix = train_set.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False, key=np.abs)

median_house_value    1.000000
median_income         0.687151
bedrooms_ratio       -0.259952
rooms_per_house       0.146255
latitude             -0.142673
total_rooms           0.135140
housing_median_age    0.114146
households            0.064590
total_bedrooms        0.047781
longitude            -0.047466
population           -0.026882
people_per_house     -0.021991
Name: median_house_value, dtype: float64

The new attribute "bedrooms_ratio" is much more strongly correlated with "median_house_value" (about -0.26) than either of the raw counts it is built from, "total_rooms" (0.14) and "total_bedrooms" (0.05). The correlation is negative: houses with a lower bedroom/room ratio tend to be more expensive. "rooms_per_house" (0.146) is also slightly more strongly correlated than "total_rooms" (0.135), whereas "people_per_house" (-0.02) shows almost no linear correlation with the target.
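The ranking above is ordered by the strength of each correlation rather than its sign, because `sort_values` is called with `key=np.abs`. The short sketch below shows the difference on a made-up Series; the labels `a`, `b`, `c` and their values are purely illustrative and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Made-up correlation values, chosen only to illustrate the ordering
corr = pd.Series({"a": 0.10, "b": -0.30, "c": 0.20})

# Plain descending sort: the negative value ends up last
by_value = corr.sort_values(ascending=False)

# Sorting on the absolute value: the strongest correlation comes first,
# regardless of its sign (the original signed values are kept)
by_strength = corr.sort_values(ascending=False, key=np.abs)

print(list(by_value.index))     # ['c', 'a', 'b']
print(list(by_strength.index))  # ['b', 'c', 'a']
```

Without the `key` argument, a strong negative correlation such as that of "bedrooms_ratio" would be listed at the bottom, next to the weakest ones.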

Note that in a real pipeline, these transformations would be implemented as custom transformers to ensure they are applied consistently to both training and test data. See [Custom Transformers](e2e051_custom_transformers.ipynb) for how to create reusable transformers that can be integrated into scikit-learn pipelines.
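As a rough idea of what that could look like, here is a minimal sketch that wraps the three ratio features in scikit-learn's `FunctionTransformer`. The helper name `add_ratio_features` and the tiny `demo` dataframe are assumptions made for illustration only; this is not necessarily the approach taken in the linked notebook.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three ratio columns appended.

    Hypothetical helper: the name and signature are illustrative.
    """
    # DataFrame.assign returns a new dataframe, so the input is not modified
    return df.assign(
        rooms_per_house=df["total_rooms"] / df["households"],
        bedrooms_ratio=df["total_bedrooms"] / df["total_rooms"],
        people_per_house=df["population"] / df["households"],
    )


# The function is stateless, so FunctionTransformer is enough to make it
# usable as a step in a scikit-learn Pipeline
ratio_adder = FunctionTransformer(add_ratio_features)

# Tiny made-up dataframe, for illustration only (not rows from housing.csv)
demo = pd.DataFrame({
    "total_rooms": [100.0, 200.0],
    "total_bedrooms": [20.0, 50.0],
    "population": [300.0, 400.0],
    "households": [50.0, 100.0],
})

demo_out = ratio_adder.fit_transform(demo)
print(demo_out[["rooms_per_house", "bedrooms_ratio", "people_per_house"]])
```

Because the same `ratio_adder` object can be applied to `train_set` and later to `test_set`, both receive identical transformations, and neither dataframe is modified in place.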