# **Feature Engineering & Data Preparation**

## Objectives

* Prepare the `cleaned_data.csv` dataset for machine learning (ML), following the modelling pipeline described in the project brief.
* Engineer the features identified during EDA as relevant predictors of `SalePrice` (e.g., GrLivArea, TotalBsmtSF, LotArea, GarageArea, OverallQual, and categorical quality ratings).
* Convert all categorical variables into numerical format using suitable encoding techniques.
* Apply data transformations informed by EDA - for example:
  * log-transform `SalePrice` if required
  * handle skewed numerical features
  * normalise or standardise features if necessary
* Create the final modelling dataset (`X` and `y`) and perform a train–test split to support model evaluation.
* Apply identical feature engineering steps to `inherited_houses.csv` so predictions can be generated consistently later in the dashboard.

## Inputs

* `data/processed/cleaned_data.csv`
  * Cleaned and fully validated Ames dataset produced in 02_data_cleaning.ipynb.
* `data/raw/inherited_houses.csv`
  * Client’s four inherited homes, used for aligning preprocessing and generating predictions later.
* Insights from 03_exploratory_analysis.ipynb:
  * Most predictive numerical features
  * Categorical variables requiring encoding
  * Level of skewness in key features
  * Confirmation that inherited houses fall within normal market ranges
* Project requirements and modelling expectations from the Project Plan:
  * Final model must achieve **R² ≥ 0.75**
  * Dataset must be prepared following a clear ML pipeline
  * Model will later be used in a Streamlit dashboard with prediction capability

## Outputs

* A fully engineered modelling dataset including:
  * Encoded categorical variables
  * Transformed numerical variables (if needed)
  * Final feature matrix `X` and target `y`
  * `X_train`, `X_test`, `y_train`, `y_test` split for modelling
* A processed version of the inherited dataset with exactly the same encoding and transformations applied.
* A list of final selected modelling features.
* Saved processed datasets in:
  * `data/processed/engineered_training_data.csv`
  * `data/processed/engineered_inherited_houses.csv`
* Clear justification for all feature engineering decisions (based on EDA).

## Additional Comments

* All feature engineering decisions must directly support the modelling requirements in the project specification and dashboard design (prediction page, feature insights page, etc.).
* No new external features will be added; only the original cleaned dataset will be transformed.
* This notebook ensures that the dataset used in Notebook 05 (Model Training) is clean, fully numeric, consistent, and suitable for supervised regression modelling.
* The engineered feature set will later support:
  * Model performance evaluation (R² score)
  * Price predictions for the 4 inherited houses
  * Real-time predictions in the Streamlit dashboard
* Every transformation applied here must be reproducible and applied identically during inference (prediction).


---

# Briefly inspect the cleaned data

### Step 1: Load cleaned datasets for feature engineering

In [1]:
import pandas as pd

# Load the cleaned Ames housing dataset
housing_df = pd.read_csv("data/processed/cleaned_data.csv")

# Load the inherited houses dataset
inherited_df = pd.read_csv("data/raw/inherited_houses.csv")

# Quick checks
housing_df.head()

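|   | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | GarageArea | GarageFinish | GarageYrBlt | ... | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 548 | RFn | 2003.0 | ... | 8450 | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | 460 | RFn | 1976.0 | ... | 9600 | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 608 | RFn | 2001.0 | ... | 11250 | 68.0 | 162.0 | 42 | 5 | 7 | 920 | 2001 | 2002 | 223500 |
| 3 | 961 | 0.0 | 3.0 | No | 216 | ALQ | 540 | 642 | Unf | 1998.0 | ... | 9550 | 60.0 | 0.0 | 35 | 5 | 7 | 756 | 1915 | 1970 | 140000 |
| 4 | 1145 | 0.0 | 4.0 | Av | 655 | GLQ | 490 | 836 | RFn | 2000.0 | ... | 14260 | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | 2000 | 2000 | 250000 |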


In this step, I load the `cleaned_data.csv` dataset, along with the client’s inherited properties (`inherited_houses.csv`). 

The `housing_df` DataFrame will be used to engineer the features needed for modelling the target variable `SalePrice`, while `inherited_df` will be kept in sync with the same transformations. This ensures that any encoding, scaling, or transformations applied during feature engineering can later be applied consistently when generating price predictions for the four inherited houses.
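Keeping the two datasets in sync presupposes that both contain the columns the later steps rely on. The following is a minimal sketch of such a check, not part of the original notebook: `missing_columns` is a hypothetical helper, and the toy frames use made-up values standing in for `housing_df` and `inherited_df`.

```python
import pandas as pd


def missing_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Toy frames standing in for housing_df and inherited_df (hypothetical values)
train_demo = pd.DataFrame({"GrLivArea": [1710], "LotArea": [8450], "SalePrice": [208500]})
inherit_demo = pd.DataFrame({"GrLivArea": [900], "LotArea": [11000]})

required = ["GrLivArea", "LotArea"]
print(missing_columns(train_demo, required))                     # []
print(missing_columns(inherit_demo, required + ["GarageArea"]))  # ['GarageArea']
```

Running a check like this straight after loading would surface a mismatch before any encoding or transformation is attempted.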

---

# Define Features (X) and Target (y)

In [2]:
# Step 2: Define Features (X) and Target (y)

# The target variable we want to predict
y = housing_df["SalePrice"]

# Features used for modelling (excluding SalePrice)
# Selected based on EDA findings and project business requirements
selected_features = [
    "GrLivArea",        # Above-ground living area
    "LotArea",          # Lot size
    "TotalBsmtSF",      # Basement size
    "GarageArea",       # Garage size
    "OverallQual",      # Overall material and finish quality
    "OverallCond",      # Overall condition rating
    "KitchenQual",      # Categorical: kitchen quality
    "BsmtExposure",     # Categorical: basement exposure
    "BsmtFinType1",     # Categorical: basement finish type
    "GarageFinish"      # Categorical: garage finish quality
]

# Create the feature matrix
X = housing_df[selected_features]

X.head()

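|   | GrLivArea | LotArea | TotalBsmtSF | GarageArea | OverallQual | OverallCond | KitchenQual | BsmtExposure | BsmtFinType1 | GarageFinish |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1710 | 8450 | 856 | 548 | 7 | 5 | Gd | No | GLQ | RFn |
| 1 | 1262 | 9600 | 1262 | 460 | 6 | 8 | TA | Gd | ALQ | RFn |
| 2 | 1786 | 11250 | 920 | 608 | 7 | 5 | Gd | Mn | GLQ | RFn |
| 3 | 1717 | 9550 | 756 | 642 | 7 | 5 | Gd | No | ALQ | Unf |
| 4 | 2198 | 14260 | 1145 | 836 | 8 | 5 | Gd | Av | GLQ | RFn |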


### Step 2: Inspecting the selected feature matrix (X)

The table above shows the first five rows of the feature matrix `X` after selecting the variables that will be used for modelling.

The columns include a mix of numerical and categorical predictors that our earlier EDA identified as important for explaining variation in `SalePrice`:

- **GrLivArea, LotArea, TotalBsmtSF, GarageArea** – continuous measures of size (living area, land, basement, and garage).
- **OverallQual, OverallCond** – numeric quality and condition ratings recorded on an ordinal scale (higher values = better quality/condition).
- **KitchenQual, BsmtExposure, BsmtFinType1, GarageFinish** – categorical descriptors of kitchen quality, basement exposure, basement finish type, and garage finish.

From this preview we can see that:

- All selected columns are present and correctly loaded from `housing_df`.
- The numerical features contain realistic values consistent with the ranges observed during EDA.
- The categorical features use the expected Ames coding scheme (e.g. `Gd`, `TA`, `GLQ`, `RFn`, `Unf`).

This confirms that the chosen predictors are correctly assembled into `X` and ready for the next steps of feature engineering, where the categorical variables will be encoded and numerical features transformed as needed for modelling.


--- 

# Numerical and Categorical variables

In [3]:
# Step 3: Identify numerical and categorical variables

# Numerical features (int or float)
numerical_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Categorical features (object/string)
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()

numerical_features, categorical_features

(['GrLivArea',
  'LotArea',
  'TotalBsmtSF',
  'GarageArea',
  'OverallQual',
  'OverallCond'],
 ['KitchenQual', 'BsmtExposure', 'BsmtFinType1', 'GarageFinish'])

### Step 3: Identify numerical and categorical variables

In this step, I identified which features are numerical and which are categorical within the selected feature set `X`.

From the selected modelling features:

**Numerical features:**  
- GrLivArea  
- LotArea  
- TotalBsmtSF  
- GarageArea  
- OverallQual  
- OverallCond  

**Categorical features:**  
- KitchenQual  
- BsmtExposure  
- BsmtFinType1  
- GarageFinish  

This classification will guide the next steps of feature engineering, where I apply appropriate encoding and transformations to prepare the dataset for modelling.


##### Choosing the appropriate encoding method for categorical variables

Before encoding the categorical features, it is important to decide whether each variable should use ordinal encoding or one-hot encoding. The choice depends on whether the categories have a meaningful order or are simply different labels with no ranking.

In this project, all four selected categorical variables describe **quality levels** or **finish types**, so their categories follow a natural order. **Ordinal encoding** is therefore the most appropriate approach.

**Variables and their justification:**

- **KitchenQual** – Kitchen quality, listed from best to worst (Ex > Gd > TA > Fa).  
  The categories form a clear quality ranking, which ordinal encoding preserves.

- **BsmtExposure** – Level of basement exposure (Gd > Av > Mn > No).  
  The categories progress from good exposure to none, so they are ordinal.

- **BsmtFinType1** – Basement finish type (GLQ > ALQ > BLQ > Rec > LwQ > Unf).  
  The categories run from the highest finish quality down to unfinished, so ordinal encoding is suitable.

- **GarageFinish** – Garage finish level (Fin > RFn > Unf).  
  The categories run from finished to unfinished and should therefore be treated as ordinal.
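To make the difference between the two encodings concrete, here is a small sketch on a toy `KitchenQual` column. The values are illustrative only, and `pd.get_dummies` is used as one common way to one-hot encode; the notebook itself uses ordinal encoding only.

```python
import pandas as pd

# Toy column with the four KitchenQual levels (illustrative values only)
demo = pd.DataFrame({"KitchenQual": ["TA", "Gd", "Ex", "Fa"]})

# Ordinal encoding: a single integer column that preserves Fa < TA < Gd < Ex
order = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
ordinal = demo["KitchenQual"].map(order)

# One-hot encoding: one indicator column per category, with no order implied
one_hot = pd.get_dummies(demo["KitchenQual"], prefix="KitchenQual")

print(ordinal.tolist())          # [2, 3, 4, 1]
print(one_hot.columns.tolist())  # ['KitchenQual_Ex', 'KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA']
```

Ordinal encoding keeps the feature count unchanged and carries the ranking into the model, whereas one-hot encoding would add four columns here and discard the ranking.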

---

# Encode Categorical Variables

In [4]:
# Inspect the unique categories before encoding
for col in categorical_features:
    print(col, sorted(housing_df[col].unique()))

KitchenQual ['Ex', 'Fa', 'Gd', 'TA']
BsmtExposure ['Av', 'Gd', 'Mn', 'No', 'No Basement']
BsmtFinType1 ['ALQ', 'BLQ', 'GLQ', 'LwQ', 'No Basement', 'Rec', 'Unf']
GarageFinish ['Fin', 'No Garage', 'RFn', 'Unf']


In [5]:
# Step 4: Ordinal encode categorical variables in both datasets

# Define ordinal mappings based on quality/order semantics

kitchen_mapping = {
    "Fa": 1,  # Fair
    "TA": 2,  # Typical/Average
    "Gd": 3,  # Good
    "Ex": 4   # Excellent
}

bsmt_exposure_mapping = {
    "No": 1,  # No exposure
    "Mn": 2,  # Minimum
    "Av": 3,  # Average
    "Gd": 4   # Good
}

bsmt_fin_type_mapping = {
    "Unf": 1,  # Unfinished
    "LwQ": 2,  # Low quality
    "Rec": 3,  # Average Rec room
    "BLQ": 4,  # Below average living quarters
    "ALQ": 5,  # Average living quarters
    "GLQ": 6   # Good living quarters
}

garage_finish_mapping = {
    "Unf": 1,  # Unfinished
    "RFn": 2,  # Rough finished
    "Fin": 3   # Finished
}


datasets = [housing_df, inherited_df]

for df in datasets:
    # Apply mappings to each categorical column
    df["KitchenQual"] = df["KitchenQual"].map(kitchen_mapping)
    df["BsmtExposure"] = df["BsmtExposure"].map(bsmt_exposure_mapping)
    df["BsmtFinType1"] = df["BsmtFinType1"].map(bsmt_fin_type_mapping)
    df["GarageFinish"] = df["GarageFinish"].map(garage_finish_mapping)

# Rebuild X from the updated housing_df to make sure it now holds encoded values
X = housing_df[selected_features]


X.head()

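|   | GrLivArea | LotArea | TotalBsmtSF | GarageArea | OverallQual | OverallCond | KitchenQual | BsmtExposure | BsmtFinType1 | GarageFinish |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1710 | 8450 | 856 | 548 | 7 | 5 | 3 | 1.0 | 6.0 | 2.0 |
| 1 | 1262 | 9600 | 1262 | 460 | 6 | 8 | 2 | 4.0 | 5.0 | 2.0 |
| 2 | 1786 | 11250 | 920 | 608 | 7 | 5 | 3 | 2.0 | 6.0 | 2.0 |
| 3 | 1717 | 9550 | 756 | 642 | 7 | 5 | 3 | 1.0 | 5.0 | 1.0 |
| 4 | 2198 | 14260 | 1145 | 836 | 8 | 5 | 3 | 3.0 | 6.0 | 2.0 |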


### Step 4: Ordinal encoding of categorical variables

In this step, I converted the selected categorical features into numeric form using ordinal encoding.  

The following mappings were applied:

- **KitchenQual**  
  - Fa → 1, TA → 2, Gd → 3, Ex → 4

- **BsmtExposure**  
  - No → 1, Mn → 2, Av → 3, Gd → 4

- **BsmtFinType1**  
  - Unf → 1, LwQ → 2, Rec → 3, BLQ → 4, ALQ → 5, GLQ → 6

- **GarageFinish**  
  - Unf → 1, RFn → 2, Fin → 3

These mappings were applied consistently to both:

- the main training dataset (`housing_df`)
- the client’s inherited houses (`inherited_df`),

so that the model can later be used to predict sale prices for the inherited properties using the same feature representation.

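After encoding, all four categorical features are numeric. Note, however, that the `No Basement` and `No Garage` categories listed in the `In [4]` output are not included in the mappings above, so `.map()` converts them to `NaN`; this is why `BsmtExposure`, `BsmtFinType1`, and `GarageFinish` display as floats in the preview. These missing values must be handled (for example, by giving the category its own code) before the features can be used in modelling.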
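One way to guard against categories that a mapping does not cover (such as `No Basement`) is to validate the result of `.map()` before continuing. The sketch below is an assumption rather than part of the notebook: `encode_ordinal` is a hypothetical helper, and coding `No Basement` as 0 (below `No`) is one possible choice that should be checked against the EDA.

```python
import pandas as pd

# Assumption: "No Basement" is treated as the lowest level and coded 0
bsmt_exposure_mapping_full = {"No Basement": 0, "No": 1, "Mn": 2, "Av": 3, "Gd": 4}


def encode_ordinal(series: pd.Series, mapping: dict) -> pd.Series:
    """Map categories to integers and fail loudly on unmapped values."""
    encoded = series.map(mapping)
    unmapped = series[encoded.isna()].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped categories in {series.name}: {unmapped}")
    return encoded.astype(int)


demo = pd.Series(["No", "Gd", "No Basement", "Av"], name="BsmtExposure")
print(encode_ordinal(demo, bsmt_exposure_mapping_full).tolist())  # [1, 4, 0, 3]
```

With the original four-level mapping, the same call would raise a `ValueError` naming `No Basement` instead of silently producing `NaN`.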


---

# Check Skewness of Numerical Features

In [6]:
# Step 5: Check skewness of numerical features

import numpy as np

# Check skewness for all numerical features in X
skew_values = X[numerical_features].skew().sort_values(ascending=False)
skew_values

LotArea        12.207688
TotalBsmtSF     1.524255
GrLivArea       1.366560
OverallCond     0.693067
OverallQual     0.216944
GarageArea      0.179981
dtype: float64

### Step 5: Interpreting Skewness of Numerical Features

The table above shows the skewness values for all numerical features in the modelling dataset. Skewness measures how asymmetric a distribution is, and understanding it helps determine whether any variables require transformation before modelling. As a guideline:

- **|skew| > 1** → highly skewed (log transformation recommended)  
- **0.5 < |skew| ≤ 1** → moderately skewed (transformation optional)  
- **|skew| ≤ 0.5** → fairly symmetrical (no transformation needed)

**Findings from the skewness results:**

- **LotArea (skew ≈ 12.21)**  
  Extremely right-skewed. This indicates strong outliers and a long tail, so a log transformation is strongly recommended.

- **TotalBsmtSF (skew ≈ 1.52)**  
  Highly skewed. A log transformation will help normalise the distribution.

- **GrLivArea (skew ≈ 1.37)**  
  Also highly skewed. A log transformation is recommended.

- **OverallCond (skew ≈ 0.69)**  
  Moderately skewed, but this is an ordinal rating (1–9), not a continuous measurement. Log transformation is not appropriate.

- **OverallQual (skew ≈ 0.22)** and **GarageArea (skew ≈ 0.18)**  
  Both show low skewness and do not need transformation.


Only three features require log transformation due to their high skewness: `LotArea`, `TotalBsmtSF`, and `GrLivArea`. Reducing skew in these variables can help stabilise variance and typically benefits linear regression models, which are sensitive to extreme values.
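The guideline thresholds can also be applied programmatically. The sketch below reuses the skewness values reported in the Step 5 output; `classify_skew` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd


def classify_skew(skew: float) -> str:
    """Label a skewness value using the |skew| thresholds from the guideline."""
    magnitude = abs(skew)
    if magnitude > 1:
        return "high"
    if magnitude > 0.5:
        return "moderate"
    return "low"


# Skewness values copied from the Step 5 output
reported_skew = pd.Series({
    "LotArea": 12.207688,
    "TotalBsmtSF": 1.524255,
    "GrLivArea": 1.366560,
    "OverallCond": 0.693067,
    "OverallQual": 0.216944,
    "GarageArea": 0.179981,
})

labels = reported_skew.apply(classify_skew)
to_transform = labels[labels == "high"].index.tolist()
print(to_transform)  # ['LotArea', 'TotalBsmtSF', 'GrLivArea']
```

`OverallCond` is labelled "moderate" by this rule, but as an ordinal rating it is excluded from log transformation regardless of its skewness.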


In [7]:
# Step 5.2: Apply log transformations to highly skewed features

# Apply log1p (log(1 + x)) to avoid issues with zeros
housing_df["LotArea_log"] = np.log1p(housing_df["LotArea"])
housing_df["TotalBsmtSF_log"] = np.log1p(housing_df["TotalBsmtSF"])
housing_df["GrLivArea_log"] = np.log1p(housing_df["GrLivArea"])

# Apply the same transformations to inherited houses
inherited_df["LotArea_log"] = np.log1p(inherited_df["LotArea"])
inherited_df["TotalBsmtSF_log"] = np.log1p(inherited_df["TotalBsmtSF"])
inherited_df["GrLivArea_log"] = np.log1p(inherited_df["GrLivArea"])


housing_df[["LotArea_log", "TotalBsmtSF_log", "GrLivArea_log"]].head()


|   | LotArea_log | TotalBsmtSF_log | GrLivArea_log |
|---|---|---|---|
| 0 | 9.04204 | 6.753438 | 7.444833 |
| 1 | 9.169623 | 7.141245 | 7.141245 |
| 2 | 9.328212 | 6.82546 | 7.488294 |
| 3 | 9.164401 | 6.629363 | 7.448916 |
| 4 | 9.565284 | 7.044033 | 7.695758 |


### Step 5.2: Apply log transformations to highly skewed features

The table above shows the first few rows of the newly created log-transformed features:

- `LotArea_log`
- `TotalBsmtSF_log`
- `GrLivArea_log`

These values confirm that the `np.log1p()` transformation was applied correctly. For example:

- An original `LotArea` value of 8450 became approximately **9.04**
- An original `TotalBsmtSF` value of 856 became approximately **6.75**
- An original `GrLivArea` value of 1710 became approximately **7.44**

The transformed values are now on a much smaller, compressed scale, which reduces the influence of extreme values and supports more stable regression modelling. Whether the distributions are now closer to normal should be confirmed by re-checking skewness on the `*_log` columns. The same transformations were also applied to the inherited dataset to ensure consistency when generating predictions later in the project.
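As a small illustration of what `np.log1p` does to a right-skewed variable, the sketch below uses hypothetical lot sizes that double at each step (they are not taken from the dataset). It also shows `np.expm1`, the inverse of `np.log1p`, which is what would be needed to return to original units if a log-transformed quantity ever had to be converted back.

```python
import numpy as np
import pandas as pd

# Hypothetical lot sizes that double at each step (not taken from the dataset)
lot_area = pd.Series([2000, 4000, 8000, 16000, 32000, 64000, 128000])

# log1p computes log(1 + x), so it is also defined when x = 0
lot_area_log = np.log1p(lot_area)

raw_skew = lot_area.skew()
log_skew = lot_area_log.skew()
print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")

# np.expm1 inverts np.log1p and recovers the original units
restored = np.expm1(lot_area_log)
print(np.allclose(restored, lot_area))  # True
```

Because the toy values grow geometrically, their logarithms are almost evenly spaced, so the skewness drops from well above 1 to roughly zero. Real features will rarely behave this cleanly.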


---

# Feature Scaling

In [13]:
# Step 6: Select numerical features to scale 

from sklearn.preprocessing import StandardScaler

# Select numerical features to scale
numerical_to_scale = [
    "GrLivArea_log",
    "TotalBsmtSF_log",
    "LotArea_log",
    "GarageArea",
    "OverallQual",
    "OverallCond"
]


In [14]:
# Build a new modelling dataframe with transformed features (replace original skewed features)
X_model = X.copy()

X_model["GrLivArea_log"] = housing_df["GrLivArea_log"]
X_model["TotalBsmtSF_log"] = housing_df["TotalBsmtSF_log"]
X_model["LotArea_log"] = housing_df["LotArea_log"]

# Drop the original skewed features
X_model = X_model.drop(["GrLivArea", "TotalBsmtSF", "LotArea"], axis=1)

X_model.head()

|   | GarageArea | OverallQual | OverallCond | KitchenQual | BsmtExposure | BsmtFinType1 | GarageFinish | GrLivArea_log | TotalBsmtSF_log | LotArea_log |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 548 | 7 | 5 | 3 | 1.0 | 6.0 | 2.0 | 7.444833 | 6.753438 | 9.04204 |
| 1 | 460 | 6 | 8 | 2 | 4.0 | 5.0 | 2.0 | 7.141245 | 7.141245 | 9.169623 |
| 2 | 608 | 7 | 5 | 3 | 2.0 | 6.0 | 2.0 | 7.488294 | 6.82546 | 9.328212 |
| 3 | 642 | 7 | 5 | 3 | 1.0 | 5.0 | 1.0 | 7.448916 | 6.629363 | 9.164401 |
| 4 | 836 | 8 | 5 | 3 | 3.0 | 6.0 | 2.0 | 7.695758 | 7.044033 | 9.565284 |


### Step 6: Interpretation of the Modelling Feature Preview - X_model.head()

This table shows the first five rows of the modelling dataset (`X_model`) after the encoding and log-transformation steps. It confirms that the expected 10 features are included and structured appropriately for ML.

Key details:

- The original skewed variables (`GrLivArea`, `LotArea`, `TotalBsmtSF`) have been replaced with their log-transformed versions (`*_log`), which helps stabilise variance and reduce the impact of extreme values.
- Ordinal categorical variables (`KitchenQual`, `BsmtExposure`, `BsmtFinType1`, `GarageFinish`) are now represented as numerical codes, allowing them to be used directly by regression and tree-based models.
- All remaining numerical features (`GarageArea`, `OverallQual`, `OverallCond`) appear correctly as integers or floats.

This preview confirms that the modelling dataset has been prepared correctly and is ready for train–test splitting and scaling in the next step.
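The upcoming split-and-scale step could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's actual next cell: the toy frame stands in for `X_model`, and `test_size=0.25` and `random_state=42` are arbitrary choices. The key point is that the scaler is fitted on the training rows only and then reused unchanged for the test rows (and, later, for the inherited houses).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for X_model and y (illustrative values only)
X_demo = pd.DataFrame({
    "GarageArea": [548, 460, 608, 642, 836, 480, 636, 484],
    "OverallQual": [7, 6, 7, 7, 8, 5, 8, 7],
})
y_demo = pd.Series([208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000])

# Assumed split parameters: 25% test set, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_scaled.shape, X_test_scaled.shape)  # (6, 2) (2, 2)
```

Fitting on the training split alone avoids leaking information from the test rows into the scaling parameters, and the fitted `scaler` object is what would need to be reused at prediction time.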