This notebook is an adaptation of the [original by *Aurélien Géron*](https://github.com/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb), from his book [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781098125967/).

**Feature engineering** is the process of using domain knowledge to create new features (variables) from raw data that make machine learning algorithms work better. While [Exploratory Data Analysis (EDA)](e2e030_eda.ipynb) focuses on *understanding* the data, feature engineering focuses on *transforming* it to improve model performance.

This is often one of the most impactful steps in a machine learning pipeline: well-engineered features can dramatically improve model accuracy, while poorly chosen features can limit what even sophisticated algorithms can achieve.

## Previous steps

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

housing = pd.read_csv("./data/housing.csv") # Load the dataset

# Generation of training and test sets through stratified sampling by median income
train_set, test_set = train_test_split(housing, test_size=0.2,
    stratify=pd.cut(housing["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5]),
    random_state=42
    )

housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()
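
As a quick sanity check on the stratified split above, the sketch below compares income-category proportions in a full dataset and in its test set. It uses a synthetic `median_income` column (an assumption, so that the example runs without `housing.csv`); the names `toy`, `income_cat` and `comparison` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: log-normal incomes)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})

# Same income bins as in the split above
income_cat = pd.cut(toy["median_income"],
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])

toy_train, toy_test = train_test_split(toy, test_size=0.2,
                                       stratify=income_cat, random_state=42)

# Proportion of each income category overall vs. in the test set
overall = income_cat.value_counts(normalize=True).sort_index()
in_test = income_cat.loc[toy_test.index].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"overall": overall, "test": in_test})
print(comparison)
```

With stratification, the two columns should be nearly identical; the same comparison can be run on the real `housing` data by replacing `toy` with the loaded dataframe.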

## Combining attributes

It often makes sense to combine certain variables into new, more useful ones. For example, the total number of rooms is not very useful if we don't know how many households there are; what we really want is the number of rooms per household. Similarly, the total number of bedrooms alone is not very useful: we probably want to compare it with the number of rooms. Population per household also seems like an interesting attribute combination to look at.

We add these new variables as columns of the dataframe and then check their correlations with `median_house_value`.

In [None]:
housing["rooms_per_house"] = housing["total_rooms"] / housing["households"] # number of rooms per house
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"] # ratio of bedrooms to total rooms
housing["people_per_house"] = housing["population"] / housing["households"] # number of people per house

In [None]:
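# `housing` no longer contains the target column, so join the labels back in before computing correlations
corr_matrix = housing.join(housing_labels).corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False, key=np.abs)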

The new attribute `bedrooms_ratio` has a much stronger correlation (in absolute value) with `median_house_value` than the total number of rooms or bedrooms. The correlation is negative: houses with a lower bedroom/room ratio tend to be more expensive. The number of rooms per household also has a higher correlation than the total number of rooms or bedrooms.

Note that in a real pipeline, these transformations would be implemented as custom transformers to ensure they are applied consistently to both training and test data. See [Custom Transformers](e2e051_custom_transformers.ipynb) for how to create reusable transformers that can be integrated into scikit-learn pipelines.
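
As a minimal sketch of that idea, the three ratios can be wrapped in a scikit-learn `FunctionTransformer`. This is one possible approach, not necessarily the one used in the linked notebook; the names `add_ratio_features`, `ratio_adder` and `demo` are hypothetical, and the small dataframe is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three ratio columns added."""
    out = df.copy()  # leave the caller's dataframe untouched
    out["rooms_per_house"] = out["total_rooms"] / out["households"]
    out["bedrooms_ratio"] = out["total_bedrooms"] / out["total_rooms"]
    out["people_per_house"] = out["population"] / out["households"]
    return out

# Stateless transformer: nothing is learned in fit(), the function is just applied
ratio_adder = FunctionTransformer(add_ratio_features)

# Made-up values, only to show the transformer in action
demo = pd.DataFrame({
    "total_rooms": [1000.0, 500.0],
    "total_bedrooms": [200.0, 125.0],
    "population": [600.0, 400.0],
    "households": [250.0, 100.0],
})

demo_out = ratio_adder.fit_transform(demo)
print(demo_out[["rooms_per_house", "bedrooms_ratio", "people_per_house"]])
```

Because the function is stateless, the same `ratio_adder` can be applied unchanged to the test set or used as a step in a scikit-learn `Pipeline`.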