# **Feature Engineering & Data Preparation**

## Objectives

* Prepare the `cleaned_data.csv` dataset for machine learning (ML), following the modelling pipeline described in the project brief.
* Engineer the features identified during EDA as relevant predictors of `SalePrice` (e.g., `GrLivArea`, `TotalBsmtSF`, `LotArea`, `GarageArea`, `OverallQual`, and categorical quality ratings).
* Convert all categorical variables into numerical format using suitable encoding techniques.
* Apply data transformations informed by EDA, for example:
  * log-transform `SalePrice` if required
  * handle skewed numerical features
  * normalise or standardise features if necessary
* Create the final modelling dataset (`X` and `y`) and perform a train–test split to support model evaluation.
* Apply identical feature engineering steps to `inherited_houses.csv` so predictions can be generated consistently later in the dashboard.
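
The transformation, encoding, and splitting objectives above can be sketched roughly as follows. This is a minimal illustration, not the notebook's final implementation: the toy rows are invented, `KitchenQual` is only an assumed example of a categorical quality rating, and `SKEW_THRESHOLD` is a hypothetical cut-off that EDA would need to confirm.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for cleaned_data.csv; all values are invented for illustration.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    "OverallQual": [7, 6, 7, 7, 8, 5, 8, 7],
    "KitchenQual": ["Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})

# Log-transform the target; np.expm1 reverses this at prediction time.
y = np.log1p(df["SalePrice"])
X = df.drop(columns="SalePrice")

# Log-transform numeric features whose absolute skewness exceeds the cut-off.
SKEW_THRESHOLD = 0.75  # hypothetical value, not taken from the project brief
numeric_cols = X.select_dtypes(include="number").columns
skewed_cols = [c for c in numeric_cols if abs(X[c].skew()) > SKEW_THRESHOLD]
for col in skewed_cols:
    X[col] = np.log1p(X[col])

# One-hot encode the categorical column so the matrix is fully numeric.
X = pd.get_dummies(X, columns=["KitchenQual"], dtype=int)

# Hold out a test set for later model evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```

Fixing `random_state` keeps the split reproducible between notebook runs; the 75/25 ratio here is an assumption rather than a project requirement.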

## Inputs

* `data/processed/cleaned_data.csv`
  * Cleaned and fully validated Ames dataset produced in `02_data_cleaning.ipynb`.
* `data/raw/inherited_houses.csv`
  * Client’s four inherited homes, used for aligning preprocessing and generating predictions later.
* Insights from `03_exploratory_analysis.ipynb`:
  * Most predictive numerical features
  * Categorical variables requiring encoding
  * Level of skewness in key features
  * Confirmation that inherited houses fall within normal market ranges
* Project requirements and modelling expectations from the Project Plan:
  * Final model must achieve **R² ≥ 0.75**
  * Dataset must be prepared following a clear ML pipeline
  * Model will later be used in a Streamlit dashboard with prediction capability
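
For reference, the R² threshold can be checked with a few lines of plain Python. The sketch below spells out the standard definition, R² = 1 − SS_res / SS_tot; the helper names `r2_score` and `meets_requirement` and the sample numbers are illustrative only (scikit-learn's `sklearn.metrics.r2_score` computes the same quantity).

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


R2_TARGET = 0.75  # threshold stated in the project requirements


def meets_requirement(y_true, y_pred, target=R2_TARGET):
    """Return True when the predictions reach the required R²."""
    return r2_score(y_true, y_pred) >= target


# Invented example values, purely to exercise the functions.
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
score = r2_score(actual, predicted)
passed = meets_requirement(actual, predicted)
```

If `SalePrice` ends up log-transformed, the score differs depending on whether it is computed on the log scale or after converting predictions back to prices, so the choice should be stated alongside the result.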

## Outputs

* A fully engineered modelling dataset including:
  * Encoded categorical variables
  * Transformed numerical variables (if needed)
  * Final feature matrix `X` and target `y`
  * `X_train`, `X_test`, `y_train`, `y_test` split for modelling
* A processed version of the inherited dataset with exactly the same encoding and transformations applied.
* A list of final selected modelling features.
* Saved processed datasets in:
  * `data/processed/engineered_training_data.csv`
  * `data/processed/engineered_inherited_houses.csv`
* Clear justification for all feature engineering decisions (based on EDA).

## Additional Comments

* All feature engineering decisions must directly support the modelling requirements in the project specification and dashboard design (prediction page, feature insights page, etc.).
* No new external features will be added; only the original cleaned dataset will be transformed.
* This notebook ensures that the dataset used in Notebook 05 (Model Training) is clean, fully numeric, consistent, and suitable for supervised regression modelling.
* The engineered feature set will later support:
  * Model performance evaluation (R² score)
  * Price predictions for the four inherited houses
  * Real-time predictions in the Streamlit dashboard
* Every transformation applied here must be reproducible and applied identically during inference (prediction).
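
One possible way to guarantee identical transformations at inference is to wrap them in a single scikit-learn preprocessing object that is fitted once on the training data and then reused unchanged. The sketch below assumes this approach; the toy rows, the choice of `GrLivArea` and `KitchenQual` as example columns, and the log-then-scale steps are illustrative assumptions, not the notebook's confirmed design.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Invented toy rows standing in for the training data and the inherited houses.
train = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 2198],
    "KitchenQual": ["Gd", "TA", "Gd", "Ex"],
})
inherited = pd.DataFrame({
    "GrLivArea": [896, 1329],
    "KitchenQual": ["TA", "Gd"],
})

# Numeric branch: log-transform, then standardise.
numeric = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

# Categorical branch: one-hot encode; unseen categories become all-zero rows.
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["GrLivArea"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["KitchenQual"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training data only, then reuse the fitted object as-is.
X_train = preprocess.fit_transform(train)
X_inherited = preprocess.transform(inherited)
```

Because the scaler statistics and category list are learned only from the training data, the inherited houses always receive the same columns in the same order. The fitted `preprocess` object could be persisted (for example with `joblib.dump`) so the dashboard loads it rather than refitting.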


---