# **Feature Engineering Notebook**

## Objectives

* Engineer features for Regression

## Inputs

* outputs/datasets/cleaned/TrainSetCleaned.csv
* outputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

* A list of variables to engineer

## Overview

This notebook covers the full feature engineering process applied to the cleaned housing dataset. Feature engineering is a critical step in preparing the data for machine learning, as it transforms raw variables into a format that improves the model’s ability to learn patterns. 

We will address:
- Encoding of categorical variables
- Normalization of skewed numerical features
- Management of multicollinearity
- Evaluation of transformation effects via visualization


---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed before running the rest of the notebook in the editor

We need to change the working directory from its current folder to its parent folder
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
print("New working directory set to:", os.getcwd())
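Re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the project root can be recognised by a marker folder. The helper name `find_project_root` and the `outputs` marker are illustrative and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="outputs"):
    """Walk up from `start` until a folder containing `marker` is found."""
    start = Path(start).resolve()
    # Check the starting folder first, then each parent in turn
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}'")
```

With this helper, `os.chdir(find_project_root(os.getcwd()))` should land in the same folder however many times the cell is run.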

---

# Load Cleaned Data

Train Set

In [None]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head(3)

Test Set

In [None]:
test_set_path = 'outputs/datasets/cleaned/TestSetCleaned.csv'
TestSet = pd.read_csv(test_set_path)
TestSet.head(3)

# Data Exploration

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()

---

# Impute Missing Values

Before any modeling or transformation can occur, it's essential to address missing data. This ensures model compatibility and avoids distortions during encoding and scaling. 

We begin by identifying variables with missing values and apply suitable imputation strategies based on the variable type and domain knowledge. Where relevant, we also add binary indicators to flag imputed values for potential predictive insight.

`GarageArea` is 0 in 64 rows and `GarageYrBlt` is missing in 64 rows, which suggests that a missing `GarageYrBlt` means the house has no garage.
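The inference above can be checked row by row rather than by comparing counts. The sketch below uses a toy frame with made-up values in place of `TrainSet`; on the real data the same two masks would be built from `TrainSet['GarageArea']` and `TrainSet['GarageYrBlt']`.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for TrainSet (values are made up for illustration)
toy = pd.DataFrame({
    "GarageArea": [548, 0, 460, 0],
    "GarageYrBlt": [2003.0, np.nan, 1976.0, np.nan],
})

no_garage = toy["GarageArea"] == 0
missing_year = toy["GarageYrBlt"].isna()

# Cross-tabulate the two conditions; mismatching rows would appear off the diagonal
overlap = pd.crosstab(no_garage, missing_year)
same_rows = bool((no_garage == missing_year).all())
print(overlap)
print("Missing year matches zero area on every row:", same_rows)
```

If `same_rows` is `False` on the real data, equal counts would be a coincidence and filling `GarageYrBlt` with `0` would need a second look.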

The imputation steps are:

- Creating binary flags for missing values (e.g. `GarageYrBlt_missing`) before any value is filled
- Imputing `LotFrontage` with the median computed on the train set
- Using constant values (like `0` or `"None"`) for features where missingness signals absence
- Applying `SimpleImputer` to the remaining features, with strategies chosen from the metadata

In [None]:
# Add missingness indicators
TrainSet['GarageYrBlt_missing'] = TrainSet['GarageYrBlt'].isna().astype(int)
TrainSet['LotFrontage_missing'] = TrainSet['LotFrontage'].isna().astype(int)

TestSet['GarageYrBlt_missing'] = TestSet['GarageYrBlt'].isna().astype(int)
TestSet['LotFrontage_missing'] = TestSet['LotFrontage'].isna().astype(int)

# Impute LotFrontage with the train-set median (reused for the test set to avoid leakage)
lotfrontage_median = TrainSet['LotFrontage'].median()
TrainSet['LotFrontage'] = TrainSet['LotFrontage'].fillna(lotfrontage_median)
TestSet['LotFrontage'] = TestSet['LotFrontage'].fillna(lotfrontage_median)


## Imputation Plan

|Feature | Metadata Insight | Strategy  | Notes|
|--------|------------------|----------------------------|------|
|`LotFrontage`  | Linear feet of street connected to property | Median | Train-set median, applied to both sets |
|`GarageYrBlt`  | Year garage was built (1900–2010); missing if no garage | Fill with `0` | 0 marks "no garage"; the `GarageYrBlt_missing` flag records which rows were filled|
|`2ndFlrSF`     | Square footage of second floor (0–2065); 0 is common | Fill with `0` | Zero is a valid value (no second floor)|
|`MasVnrArea`   | Masonry veneer area (0–1600); 0 means no veneer | Fill with `0` | 0 is semantically meaningful|
|`BedroomAbvGr` | Bedrooms above ground; 0–8 range | Median | Could also test for mode; median is fine|
|`BsmtExposure` | Exposure rating or `"None"` for no basement | Fill with `"None"` | Use `"None"` instead of "Missing" to match domain encoding|
|`BsmtFinType1` | Finish type or `"None"` if no basement | Fill with `"None"` | `"None"` is an actual category in metadata|
|`GarageFinish` | Garage interior finish or `"None"` if no garage | Fill with `"None"` | Use `"None"` for clarity and alignment with domain semantics|

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# LotFrontage is already handled
numerical_impute_zero = ['2ndFlrSF', 'MasVnrArea', 'GarageYrBlt']
numerical_impute_median = ['BedroomAbvGr']
categorical_fill_none = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish']

zero_imputer = SimpleImputer(strategy='constant', fill_value=0)
median_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='constant', fill_value='None')

imputer_transformer = ColumnTransformer(transformers=[
    ('num_zero', zero_imputer, numerical_impute_zero),
    ('num_median', median_imputer, numerical_impute_median),
    ('cat_fill', cat_imputer, categorical_fill_none)
], remainder='passthrough')

TrainSet_imputed = pd.DataFrame(
    imputer_transformer.fit_transform(TrainSet),
    columns=numerical_impute_zero + numerical_impute_median + categorical_fill_none +
            [col for col in TrainSet.columns if col not in numerical_impute_zero + numerical_impute_median + categorical_fill_none]
)

TestSet_imputed = pd.DataFrame(
    imputer_transformer.transform(TestSet),
    columns=TrainSet_imputed.columns
)


### Preview Results After Imputation Step

In [None]:
TrainSet_imputed.head()

### Check for remaining missing values

In [None]:
print("Remaining missing values in TrainSet:")
print(TrainSet_imputed.isnull().sum().sort_values(ascending=False).head())
print("Remaining missing values in TestSet:")
print(TestSet_imputed.isnull().sum().sort_values(ascending=False).head())

### Confirm new flags exist

In [None]:
print("\nColumns added as missingness flags:")
print([col for col in TrainSet_imputed.columns if '_missing' in col])

### Confirm data types and number of columns

In [None]:
print("\nData types after transformation:")
print(TrainSet_imputed.dtypes.value_counts())

---

## Post-Imputation Correlation & PPS Check

Now that we've imputed missing values and added flags, we reassess the feature relationships.

This helps us:
- Re-confirm top predictors of `SalePrice`
- Detect any new multicollinearity
- Spot newly valuable features (e.g. missingness indicators)

We'll examine both:
- **Pearson/Spearman correlation** (for linear and monotonic relationships)
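- **Predictive Power Score (PPS)** (for general predictive strength)

Both checks need numeric inputs, so we first review the column data types produced by the imputation step.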

In [None]:
# 1. Preview column data types
print(" Column data types after imputation:")
print(TrainSet_imputed.dtypes.value_counts())

# 2. Get object columns for review
object_cols = TrainSet_imputed.select_dtypes(include='object').columns.tolist()
print(f"\n Object-type columns to review ({len(object_cols)}):")
print(object_cols)

# 3. Preview unique values in a few columns
print("\n Sample values from first few object columns:")
for col in object_cols[:5]:
    print(f"- {col}: {TrainSet_imputed[col].unique()[:5]}")


## Encoding Categorical Features

With missing data handled, we turn to encoding. Machine learning algorithms require numeric input features, so we must convert all categorical features into numerical representations. In this project, we use **ordinal encoding**, which keeps a single numeric column per variable. Category order is preserved only when it is passed to the encoder explicitly; by default scikit-learn's `OrdinalEncoder` sorts categories alphabetically.


1. Almost every column is now stored as `object`, because `ColumnTransformer` returned a single mixed-type array. This includes numeric-looking values like:

- `'2ndFlrSF'`: [0.0, 772.0, …]

- `'BedroomAbvGr'`: [3.0, 2.0, …]

    These need to be converted to numeric types.

2. Truly categorical variables (with labels) include:

- `'BsmtExposure'`: ['No', 'Av', 'Gd', 'Mn', 'None']

- `'BsmtFinType1'`, `'GarageFinish'`, `'KitchenQual'`

    These need encoding (Ordinal or One-Hot depending on model preference; we'll use Ordinal for now to keep things simple for correlation and PPS).

3. `SalePrice` does not appear in the sample above, but it must be numeric and present. So we double-check this:

In [None]:
print("SalePrice type:", TrainSet['SalePrice'].dtype)

4. Convert numeric-looking object columns to numbers. Each object column is converted if every value parses as a number; otherwise it is left as object.

In [None]:

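for col in TrainSet_imputed.select_dtypes(include='object').columns:
    try:
        # Convert both sets first so a failure leaves neither half-converted
        train_numeric = pd.to_numeric(TrainSet_imputed[col])
        test_numeric = pd.to_numeric(TestSet_imputed[col])
    except (ValueError, TypeError):
        # Column holds real labels (e.g. 'Gd'); keep it as object
        continue
    TrainSet_imputed[col] = train_numeric
    TestSet_imputed[col] = test_numeric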

5. Encode remaining categorical variables

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Manually specify true categorical features
categorical_features = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

# Fit and apply ordinal encoder
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
TrainSet_imputed[categorical_features] = encoder.fit_transform(TrainSet_imputed[categorical_features])
TestSet_imputed[categorical_features] = encoder.transform(TestSet_imputed[categorical_features])
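Without a `categories` argument, `OrdinalEncoder` assigns integers in alphabetical order, so `'Av'` would rank below `'Gd'` only by coincidence. A hedged sketch of an ordered alternative follows. The worst-to-best orderings are assumptions inferred from the category labels and should be checked against the dataset metadata before use.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed worst-to-best orderings (verify against the metadata)
category_orders = {
    "BsmtExposure": ["None", "No", "Mn", "Av", "Gd"],
    "BsmtFinType1": ["None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    "GarageFinish": ["None", "Unf", "RFn", "Fin"],
    "KitchenQual": ["Po", "Fa", "TA", "Gd", "Ex"],
}
features = list(category_orders)

ordered_encoder = OrdinalEncoder(
    categories=[category_orders[f] for f in features],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

# Toy rows; values are made up for illustration
toy = pd.DataFrame({
    "BsmtExposure": ["Gd", "None"],
    "BsmtFinType1": ["GLQ", "Unf"],
    "GarageFinish": ["Fin", "RFn"],
    "KitchenQual": ["Ex", "TA"],
})
encoded = ordered_encoder.fit_transform(toy)
print(encoded)
```

Each integer is the position of the label in its list, so higher codes mean better quality and correlation with `SalePrice` becomes easier to interpret.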


In [None]:
print("🔄 Updated data types after conversion + encoding:")
print(TrainSet_imputed.dtypes.value_counts())

# Confirm SalePrice is numeric in the imputed frame
print("SalePrice dtype:", TrainSet_imputed['SalePrice'].dtype)
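Once every column is numeric, the correlation part of the post-imputation check can be run. The sketch below ranks features by Spearman correlation with `SalePrice` on a toy frame with made-up values. The helper name `rank_by_correlation` is illustrative, and PPS is not shown because it needs a separate package.

```python
import pandas as pd

# Toy numeric frame standing in for TrainSet_imputed (values are made up)
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 1200, 2000, 1800],
    "OverallQual": [5, 6, 5, 8, 7],
    "GarageYrBlt_missing": [1, 0, 0, 0, 0],
    "SalePrice": [120000, 180000, 150000, 300000, 250000],
})


def rank_by_correlation(df, target="SalePrice", method="spearman"):
    """Return features ordered by absolute correlation with the target."""
    corr = df.corr(method=method)[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


ranking = rank_by_correlation(toy)
print(ranking)
```

Sorting by absolute value keeps strongly negative features, such as a missingness flag, near the top of the ranking instead of at the bottom.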


---

# Evaluate Distribution Transformations

Now that all features are numeric and encoded, we assess whether **distribution transformations** (e.g. log, Yeo-Johnson) could help normalize features and benefit certain algorithms (like linear models or KNN).

We'll use a custom utility `FeatureEngineeringAnalysis` to preview the effect of various transformations.


In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder  # shadows the scikit-learn OrdinalEncoder imported earlier
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')

# %matplotlib inline


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != 'ordinal_encoder':
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df_feat_eng, x=col, color='#432371',
                  order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', variables=[
                                 f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    # Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked
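The `outlier_winsorizer` option caps values with the IQR rule and `fold=1.5`. As a rough pandas-only illustration of that rule (not the library's implementation), values are clipped to the range from Q1 minus 1.5 times the IQR to Q3 plus 1.5 times the IQR. The helper name `cap_with_iqr` and the toy values are hypothetical.

```python
import pandas as pd


def cap_with_iqr(series, fold=1.5):
    """Clip values outside [Q1 - fold*IQR, Q3 + fold*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - fold * iqr, upper=q3 + fold * iqr)


# Toy values with one extreme observation
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="OpenPorchSF")
capped = cap_with_iqr(toy)
print(capped.tolist())
```

Here Q1 is 2 and Q3 is 4, so the upper bound is 7 and only the extreme value is pulled in.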


## Step 1: Select Features to Analyze
We prioritize:

A. Features highly correlated with `SalePrice`
From the correlation analysis, these are likely candidates:

|  Variable   |  Why Analyze?                                               |
|-------------|-------------------------------------------------------------|
| GrLivArea   | Often skewed right, strong correlation                      |
| GarageArea  | May benefit from log transformation                         |
| TotalBsmtSF | Can vary widely and often right-skewed                      |
|  1stFlrSF   | Similar to above                                            |
| OverallQual | Ordinal, might not need transformation but worth visualizing|

B. New or imputed variables

|Variable    | Why Analyze?                            |
|------------|-----------------------------------------|
|LotFrontage | We imputed it, so we check its shape    |
|MasVnrArea  | Many zeroes; investigate transformation |
|YearBuilt   | Time-based but numeric                  |
|OpenPorchSF | Could have long tail                    |

## Step 2: Create Subsets to pass into the function

In [None]:
top_numerical_features = [
    'GrLivArea',
    'GarageArea',
    'TotalBsmtSF',
    '1stFlrSF',
    'LotFrontage',
    'MasVnrArea',
    'OpenPorchSF',
    'YearBuilt'
]

# Run the tool on a few at a time (to avoid overload)
FeatureEngineeringAnalysis(df=TrainSet_imputed[['MasVnrArea']], analysis_type='numerical')
FeatureEngineeringAnalysis(df=TrainSet_imputed[['OpenPorchSF']], analysis_type='numerical')
FeatureEngineeringAnalysis(df=TrainSet_imputed[['YearBuilt']], analysis_type='numerical')


### How We Chose Which Transformations to Apply

Each numerical variable was evaluated across multiple transformation methods using three visual diagnostics:
- **Histogram**: checked for symmetry and bell-shaped distribution
- **QQ Plot**: checked alignment with the diagonal (indicating normality)
- **Boxplot**: checked for outlier compression and spread

We selected the transformation that yielded the best visual improvement while preserving interpretability. Below is a summary of the selected transformations:

| Variable        | Transformation Applied | Reasoning                                                                 |
|-----------------|------------------------|---------------------------------------------------------------------------|
| `GrLivArea`     | `log_e`                | Reduced right-skew and improved normality visually                        |
| `GarageArea`    | `yeo_johnson`          | Handles zero values, improved symmetry                                   |
| `1stFlrSF`      | `log_10`               | Significantly improved QQ plot and histogram                             |
| `TotalBsmtSF`   | `power`                | Strong visual improvement in distribution shape                          |
| `LotFrontage`   | `yeo_johnson`          | Smoothed outliers and improved bell-shaped symmetry                      |
| `OverallQual`   | None                   | Ordinal variable, distribution already clean and interpretable           |
| `YearBuilt`     | None                   | Discrete time-based variable; transformations distorted interpretability |

Transformations not selected (like `reciprocal` or `box_cox`) were either less interpretable or introduced new artifacts.
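The visual comparison can be complemented with a number. A minimal sketch, assuming sample skewness is an acceptable summary of symmetry: it compares candidate transformations on a synthetic right-skewed sample, since the real columns are not reproduced here. The `power` entry assumes an exponent of 0.5.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for a feature such as GrLivArea
values = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=1000))

candidates = {
    "raw": values,
    "log_e": np.log(values),
    "log_10": np.log10(values),
    "power": np.sqrt(values),  # power transformation, exponent 0.5 assumed
}

# Skewness close to 0 indicates a more symmetric distribution
skew_report = pd.Series({name: s.skew() for name, s in candidates.items()})
print(skew_report.round(3))
```

`log_e` and `log_10` differ only by a constant factor, so they produce the same skewness and the same shape in the plots; choosing between them is a matter of interpretability.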


### SmartCorrelatedSelection Variables

To improve model stability and interpretability, we aim to reduce multicollinearity — the presence of strong correlations between independent variables. We use domain knowledge, Spearman correlation analysis, and variance-based filtering to remove redundant features when necessary.

* Step 1: Select variable(s)
    - This transformer needs no manual selection, because it evaluates all variables together

* Step 2: Create a separate DataFrame, with your variable(s)

In [None]:
df_engineering = TrainSet_imputed.copy()
df_engineering.head(5)

* Step 3: Create engineered variable(s) applying the transformation(s)

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.8, selection_method="variance")

corr_sel.fit_transform(df_engineering)
corr_sel.correlated_feature_sets_

In [None]:
corr_sel.features_to_drop_
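To see which pairs drive the result, the correlation matrix can be inspected directly. The sketch below is a pandas-only approximation of the first step, listing pairs above the threshold; it is not the library's algorithm. The helper name `correlated_pairs` and the toy values are illustrative.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8, method="spearman"):
    """List feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr(method=method).abs()
    # Keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy frame; values are made up for illustration
toy = pd.DataFrame({
    "GarageArea": [200, 400, 600, 800, 1000],
    "GarageYrBlt": [1950, 1970, 1990, 2000, 2010],
    "OpenPorchSF": [40, 0, 60, 10, 30],
})
pairs = correlated_pairs(toy)
print(pairs)
```

`SmartCorrelatedSelection` goes one step further: it groups such pairs into sets and, with `selection_method="variance"`, keeps the member with the highest variance from each set.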

# Conclusions and Next Steps

### Key Outcomes
- All missing values were imputed using context-appropriate strategies.
- Categorical variables were encoded using ordinal encoding to preserve interpretability.
- Several numerical features were assigned transformations (`log_e`, `log_10`, `power`, and `yeo_johnson`) to reduce skewness and improve normality.
- Visual diagnostics (histogram, QQ plot, boxplot) guided the selection of transformations.
- Highly correlated features were grouped with `SmartCorrelatedSelection`, and the lower-variance members of each group were flagged for removal.

### Next Steps
- Finalize the modeling dataset by consolidating transformed features and dropping unused variants.
- Conduct feature scaling if required (depending on the chosen algorithm).
- Train baseline models and compare performance (e.g., Linear Regression, Random Forest).
- Apply feature importance analysis post-modeling to validate choices made during feature engineering.