
# Data Cleaning Notebook – Heritage Housing Prices

## Objectives

This notebook prepares the dataset for machine learning modeling by:

- Identifying and evaluating missing values
- Dropping sparse or non-predictive features
- Splitting the data into training and testing sets
- Saving cleaned datasets for modeling

## Inputs

- outputs/datasets/collection/house_prices_records.csv

## Outputs

- Generate cleaned Train and Test sets, both saved under outputs/datasets/cleaned



## Roadmap

1. Load raw collected dataset  
2. Identify missing values and assess feature completeness  
3. Split into training and testing sets   
4. Drop sparse or low-value features
5. Save cleaned data  


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

Then we confirm the new current directory by printing it with `os.getcwd()` again.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("New working directory set to:", os.getcwd())

---

# Load Collected Data

In [None]:
import pandas as pd

df_raw_path = "outputs/datasets/collection/house_prices_records.csv"
df = pd.read_csv(df_raw_path)
df.head(5)

# Data Exploration

In this section, we are interested in checking the distribution and shape of variables with missing data.

So we list all variables with missing data:

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data


Then we create a profile with the variables with missing data

In [None]:
from ydata_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")

# Correlation and PPS Analysis

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps

%matplotlib inline


def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman", numeric_only=True)
    df_corr_pearson = df.corr(method="pearson", numeric_only=True)

    import warnings

    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning) # Ignore FutureWarning for ppscore to improve readability
        pps_matrix_raw = pps.matrix(df)
        pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

        pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
        print("\nPPS threshold - check PPS score IQR to decide threshold for heatmap \n")
        print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print(f"PPS detects linear or non-linear relationships between two columns.\n"
          f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)


Calculate Correlations and Predictive Power Score

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

Display Heatmaps

In [None]:
DisplayCorrAndPPS(df_corr_pearson = df_corr_pearson,
                  df_corr_spearman = df_corr_spearman, 
                  pps_matrix = pps_matrix,
                  CorrThreshold = 0.4, PPS_Threshold =0.2,
                  figsize=(12,10), font_annot=10)

### Top Predictors of SalePrice

All three methods (Pearson, Spearman and PPS) consistently highlight the following variables as highly predictive of house prices:

- `OverallQual` (Quality of materials/finish)

- `GrLivArea` (Above-ground living area)

- `GarageArea`

- `TotalBsmtSF` (Basement area)

- `YearBuilt`

- `GarageYrBlt` (PPS identified this as particularly strong)

These features seem like good candidates to retain for model development.

### Multicollinearity Considerations

Some variables are strongly correlated with each other, which may cause multicollinearity in linear models:

- `GrLivArea` ↔ `1stFlrSF` (0.69)

- `1stFlrSF` ↔ `TotalBsmtSF` (0.82)

- `GarageArea` ↔ `GrLivArea` (0.57)

These relationships may lead to redundancy.
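
Reading pairs off a heatmap is error-prone, so the correlated pairs can also be listed programmatically. The sketch below is one possible way to do it, not part of the original analysis: `list_correlated_pairs` and the `0.6` threshold are hypothetical, and the toy DataFrame stands in for the housing data.

```python
import numpy as np
import pandas as pd


def list_correlated_pairs(corr, threshold=0.6):
    """Return feature pairs whose absolute correlation is at or above threshold."""
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna().reset_index()
    pairs.columns = ["feature_a", "feature_b", "corr"]
    strong = pairs[pairs["corr"].abs() >= threshold]
    # Strongest relationships first, regardless of sign
    strong = strong.sort_values("corr", key=lambda s: s.abs(), ascending=False)
    return strong.reset_index(drop=True)


# Toy data for illustration only (not the housing dataset)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})
toy_pairs = list_correlated_pairs(toy.corr(method="pearson"), threshold=0.6)
print(toy_pairs)
```

In the notebook, the same function could be called on `df_corr_pearson` or `df_corr_spearman` to get a ranked table of candidate redundant features.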

---

# Data Cleaning

## Assessing Missing Data Levels

- Custom function to display missing data levels in a DataFrame. It shows the absolute count, the percentage of the dataset and the data type for each affected column.

In [None]:
def evaluate_missing_data(df):
    missing_abs = df.isnull().sum()
    missing_pct = round(missing_abs / len(df) * 100, 2)
    df_missing_data = (pd.DataFrame(
                            data={"RowsWithMissingData": missing_abs,
                                   "PercentageOfDataset": missing_pct,
                                   "DataType": df.dtypes}
                                    )
                          .sort_values(by=['PercentageOfDataset'], ascending=False)
                          .query("PercentageOfDataset > 0")
                          )

    return df_missing_data

Check missing data levels for the collected dataset.

In [None]:
evaluate_missing_data(df)

## Dealing with missing data

### Split Train and Test Set

We split the dataset before cleaning to avoid data leakage.

This ensures that:

- All cleaning decisions (like which variables to drop) are based solely on the training data

- The test set remains a realistic “unseen” sample to evaluate model performance

- We simulate what would happen in a real-world deployment, where new data is cleaned using a process built on the training set

In [None]:
from sklearn.model_selection import train_test_split

TrainSet, TestSet = train_test_split(
    df,
    test_size=0.2,
    random_state=42
)

print(f"TrainSet shape: {TrainSet.shape}")
print(f"TestSet shape: {TestSet.shape}")
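
Because the point of splitting first is to keep the test rows unseen, it can be worth confirming that the two subsets share no rows. The following is a minimal sketch under stated assumptions: `check_split` is a hypothetical helper, it relies on the DataFrame index being unique, and a plain pandas sample stands in for `train_test_split` so the example runs without scikit-learn.

```python
import pandas as pd


def check_split(df, train, test):
    """Return True if train and test are disjoint and together cover df."""
    no_overlap = train.index.intersection(test.index).empty
    full_cover = len(train) + len(test) == len(df)
    return no_overlap and full_cover


# Toy frame; a pandas sample stands in for train_test_split here
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})
toy_train = toy.sample(frac=0.8, random_state=42)
toy_test = toy.drop(toy_train.index)
print(check_split(toy, toy_train, toy_test))
```

In the notebook, the equivalent call would be `check_split(df, TrainSet, TestSet)`.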

###  Re-Evaluate Missing Data in Train Set
Now we check missing data only in the training set, which we will use to guide cleaning decisions.

In [None]:
df_missing_data = evaluate_missing_data(TrainSet)
print(f"* There are {df_missing_data.shape[0]} variables with missing data \n")
df_missing_data

### Data Cleaning Decision: Dropping Sparse Features

Based on the profiling report, we reviewed each variable with missing values and made decisions grounded in their:

- Missing percentage
- Predictive value potential
- Domain relevance

#### Features to Drop:

- **`EnclosedPorch`** – 90.7% missing  
  Too sparse to be useful. Even if imputed, it would contribute noise rather than signal.

- **`WoodDeckSF`** – 89.4% missing  
  Very low coverage and low variability among non-missing values. Similar to `EnclosedPorch`, better removed.

These features are dropped to simplify the dataset and avoid bias or overfitting due to poor-quality data.
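
The drop list above was chosen by reviewing each variable, but the missing-percentage criterion can be expressed as a simple filter on the training set. This is only a sketch: `find_sparse_columns` and the 80% cutoff are hypothetical and were not used in the notebook, and the toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd


def find_sparse_columns(train_df, max_missing_pct=80.0):
    """List columns whose share of missing values exceeds max_missing_pct."""
    missing_pct = train_df.isnull().mean() * 100
    return missing_pct[missing_pct > max_missing_pct].index.to_list()


# Toy training set: 90%, 50% and 0% missing respectively
toy_train = pd.DataFrame({
    "mostly_missing": [np.nan] * 9 + [1.0],
    "half_missing": [np.nan] * 5 + [1.0] * 5,
    "complete": list(range(10)),
})
print(find_sparse_columns(toy_train, max_missing_pct=80.0))
```

A filter like this only covers the missing-percentage criterion; predictive value and domain relevance still need manual review.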

### Test Dropping Variables on the Training Set

We start by applying the drop transformation **only to the training set** and saving the result to a temporary DataFrame. This lets us assess the effect before committing to the change.


In [None]:
from feature_engine.selection import DropFeatures

variables_to_drop = ['EnclosedPorch', 'WoodDeckSF']

dropper = DropFeatures(features_to_drop=variables_to_drop)
dropper.fit(TrainSet)

# Preview effect of dropping columns
TrainSet_preview = dropper.transform(TrainSet)
TrainSet_preview.head(5)
TrainSet_preview.shape, TrainSet.shape

### Assess the Effect of Dropping Columns

We are removing columns, not rows, so the number of samples stays the same.  
We still want to confirm how many columns are being dropped and whether they were meaningful.

In this case, `EnclosedPorch` and `WoodDeckSF` were sparse and mostly zero, with little to no predictive power based on prior correlation and EDA analysis.


In [None]:
print(f"Before drop: {TrainSet.shape[1]} columns")
print(f"After drop: {TrainSet_preview.shape[1]} columns")
print(f"Dropped columns: {variables_to_drop}")

### Apply the Transformation to Train and Test Sets

Now that we’re satisfied, we apply the same dropper to both sets.

In [None]:
dropper = DropFeatures(features_to_drop=variables_to_drop)
dropper.fit(TrainSet)

TrainSet = dropper.transform(TrainSet)
TestSet = dropper.transform(TestSet)
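
After transforming both sets, a quick check can confirm that the dropped columns are gone and that train and test still share the same schema. The sketch below is an illustration only: `verify_drop` is a hypothetical helper, and `DataFrame.drop` stands in for the fitted `DropFeatures` transformer so the example has no dependency on feature_engine.

```python
import pandas as pd


def verify_drop(train, test, dropped):
    """Check that dropped columns are gone and both sets share one schema."""
    remaining = set(train.columns) | set(test.columns)
    gone = not (set(dropped) & remaining)
    same_schema = list(train.columns) == list(test.columns)
    return gone and same_schema


# Toy sets; DataFrame.drop stands in for dropper.transform here
toy_train = pd.DataFrame({"keep": [1, 2], "sparse": [None, 1.0]})
toy_test = pd.DataFrame({"keep": [3], "sparse": [None]})
cols = ["sparse"]
toy_train_clean = toy_train.drop(columns=cols)
toy_test_clean = toy_test.drop(columns=cols)
print(verify_drop(toy_train_clean, toy_test_clean, cols))
```

In the notebook, the equivalent call would be `verify_drop(TrainSet, TestSet, variables_to_drop)`.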


### Re-Evaluate Missing Data

We check for remaining missing values after removing sparse features.  
Any remaining columns with missing data will be handled in the modeling notebook during imputation.

In [None]:
evaluate_missing_data(TrainSet)

# Save Cleaned Train and Test Sets

Here we create the outputs/datasets/cleaned folder and save both sets as CSV files.

In [None]:
import os

try:
    os.makedirs("outputs/datasets/cleaned", exist_ok=True)
    TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)
    TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)
except Exception as e:
    print(f"Error creating directories or saving files: {e}")
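
Since the modeling notebook will read these files back, a round-trip check can catch problems such as a stray index column. This is a minimal sketch using a temporary directory and a toy DataFrame; the file name `ToySetCleaned.csv` is hypothetical and nothing is written to the project's `outputs` folder.

```python
import os
import tempfile

import pandas as pd

# Toy frame with one missing value, mimicking a cleaned set before imputation
toy = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, None, 1.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ToySetCleaned.csv")
    toy.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

print(reloaded.shape == toy.shape)
print(list(reloaded.columns) == list(toy.columns))
```

The same comparison of shape and column names could be applied to `TrainSet` and `TestSet` after reading back the saved CSV files.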

# Push cleaned data to Repo

You can now push the changes to your GitHub repository, using the Git commands (git add, git commit, git push).

## Conclusions and Next Steps

- Identified variables with missing data and evaluated their type and proportion.
- Dropped `EnclosedPorch` and `WoodDeckSF` due to sparsity and low predictive potential.
- Split dataset into training and testing subsets before applying any modeling logic.
- Saved cleaned datasets for reuse in the upcoming modeling notebook.

In the next notebook, we’ll:
- Impute remaining missing values
- Encode categorical features
- Perform feature scaling and model training
