# Housing Price Study Notebook

## Objectives
- Answer business requirement 1:
    - The client is interested in discovering how the house attributes correlate with the sale price. Therefore, the client expects data visualisations of the correlated variables against the sale price to show that.

## Inputs

- outputs/datasets/cleaned/HousingPrices.csv

## Outputs
- Generate code that answers business requirement 1 and can be used to build the Streamlit App

---

## Change working directory
Change current working directory to its parent

In [None]:
import os 
cwd = os.getcwd()
cwd

In [None]:
os.chdir(os.path.dirname(cwd))
print("You set a new current working directory")

In [None]:
cwd = os.getcwd()
cwd

---

## Load Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/cleaned/HousingPrices.csv")
df.head()

## Data Exploration

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

## Correlation Study

### Spearman and Pearson Methods on Numerical Variables

In [None]:
numeric_features = df.select_dtypes(include=['number'])
corr_spearman = numeric_features.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson = numeric_features.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)

In [None]:
corr_pearson

In [None]:
corr_spearman
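
Pearson measures linear association, while Spearman correlates ranks and so captures any monotonic relationship. The following is a minimal sketch on synthetic data of why both are worth computing; the `toy` frame and its columns `x` and `y` are illustrative assumptions and are not part of the housing dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly increasing but non-linear relationship (illustrative only)
toy = pd.DataFrame({"x": np.arange(1, 21)})
toy["y"] = np.exp(toy["x"] / 4)

# Same DataFrame.corr call as used on the housing data above
pearson_r = toy.corr(method="pearson").loc["x", "y"]
spearman_r = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Because `y` always rises with `x`, the rank-based Spearman coefficient is 1, while Pearson is lower because the relationship is not a straight line. A feature that ranks much higher under Spearman than under Pearson may therefore have a non-linear relationship with SalePrice.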

In [None]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

In [None]:
vars_to_study = ['AboveGradeSF', 'GarageArea', 'GrLivArea', 'HouseAge', 'OverallQual', 'TotalSF']

### Group Analysis and Box Plots on Categorical Variables

In [None]:
%matplotlib inline

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt



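def auto_order_encode_plot(df, target, cat_cols):
    """Box-plot `target` per category (ordered by median) and add an ordinal-encoded column.

    Note: adds a `<col>_encoded` column to `df` in place and returns it.
    """
    for col in cat_cols:
        # Order categories from lowest to highest median target value
        median_order = df.groupby(col)[target].median().sort_values()
        order = list(median_order.index)

        plt.figure(figsize=(10, 5))
        sns.boxplot(x=col, y=target, data=df, order=order)
        plt.title(f"{target} distribution by {col} (ordered by median)")
        plt.xticks(rotation=45)
        plt.show()

        # Map each category to its rank in the median ordering (1 = lowest median)
        mapping = {category: rank for rank, category in enumerate(order, 1)}
        new_col_name = col + '_encoded'
        df[new_col_name] = df[col].map(mapping)

        print(f"Encoded '{col}' as '{new_col_name}' with mapping:\n{mapping}\n")

        # Mean target per category, for comparison with the median ordering
        print(df.groupby(col)[target].mean().sort_values())
    return df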


In [None]:
categorical_features = df.select_dtypes(include=['object']).columns
df = auto_order_encode_plot(df, target='SalePrice', cat_cols=categorical_features.to_list())

Observations:
- The relationship between categories of BsmtFinType1 and SalePrice isn't monotonic
- BsmtExposure, GarageFinish and KitchenQual have a monotonic relationship with SalePrice and should be included for further investigation
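
Because the box plots are sorted by median, the medians rise from left to right by construction, so the monotonicity judgement presumably refers to each variable's natural quality ordering. A minimal sketch of such a check follows. The helper name `is_monotonic_by_order` and the toy data are hypothetical, and the ordering Fa < TA < Gd < Ex is assumed from the KitchenQual row order used later in this notebook.

```python
import pandas as pd

def is_monotonic_by_order(df, col, target, natural_order):
    """Return True if the median of `target` never decreases along `natural_order`."""
    medians = df.groupby(col)[target].median()
    # Keep only categories present in the data, arranged in their natural order
    present = [c for c in natural_order if c in medians.index]
    return bool(medians.reindex(present).is_monotonic_increasing)

# Toy data for illustration only; not taken from the housing dataset
toy = pd.DataFrame({
    "KitchenQual": ["Fa", "Fa", "TA", "TA", "Gd", "Gd", "Ex", "Ex"],
    "SalePrice":   [100, 110, 140, 150, 200, 210, 320, 330],
})

result = is_monotonic_by_order(toy, "KitchenQual", "SalePrice", ["Fa", "TA", "Gd", "Ex"])
print(result)
```

Applied to the real data, a variable for which this check fails (as observed for BsmtFinType1) would be a weaker candidate for ordinal encoding.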

In [None]:
vars_to_study += ['BsmtExposure', 'GarageFinish', 'KitchenQual']
vars_to_study

## EDA on selected variables

In [None]:
num_vars = ['AboveGradeSF', 'GarageArea', 'GrLivArea', 'HouseAge', 'OverallQual', 'TotalSF']
cat_vars = ['BsmtExposure', 'GarageFinish', 'KitchenQual']

### Numerical Variables

In [None]:
for var in num_vars:
    plt.figure(figsize=(8, 5))
    sns.scatterplot(x=var, y='SalePrice', data=df)
    sns.regplot(x=var, y='SalePrice', data=df, scatter=False, color='red')
    plt.title(f'SalePrice vs {var}')
    plt.show()

In [None]:
import numpy as np

corr = df[num_vars + ['SalePrice']].corr()
threshold = 0.0
mask = np.abs(corr) < threshold
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f", mask=mask, cbar=True)
plt.title(f"Correlation Matrix (|corr| >= {threshold})")
plt.show()

Observations:
- OverallQual has the strongest correlation with SalePrice, followed by TotalSF
- HouseAge has an inverse relationship with SalePrice


### Categorical Variables

In [None]:
print("KitchenQual counts: ", df['KitchenQual'].value_counts())
print("BsmtExposure counts: ", df['BsmtExposure'].value_counts())
print("GarageFinish counts: ", df['GarageFinish'].value_counts())

In [None]:
pivot_table = df.pivot_table(
    values='SalePrice',
    index='KitchenQual',
    columns='BsmtExposure',
    aggfunc='median'
)

row_order = ['Ex', 'Gd', 'TA', 'Fa']  # KitchenQual
col_order = ['Missing', 'No', 'Mn', 'Av', 'Gd'] # BsmtExposure  

pivot_table_ordered = pivot_table.loc[row_order, col_order]

plt.figure(figsize=(10, 7))
sns.heatmap(pivot_table_ordered, annot=True, fmt=".0f", cmap='Blues')
plt.title('Median SalePrice by KitchenQual and BsmtExposure')
plt.show()

In [None]:
pivot_table = df.pivot_table(
    values='SalePrice',
    index='KitchenQual',
    columns='GarageFinish',
    aggfunc='median'
)

row_order = ['Ex', 'Gd', 'TA', 'Fa']  # KitchenQual
col_order = ['Missing', 'Unf', 'RFn', 'Fin'] # GarageFinish  

pivot_table_ordered = pivot_table.loc[row_order, col_order]

plt.figure(figsize=(10, 7))
sns.heatmap(pivot_table_ordered, annot=True, fmt=".0f", cmap='Blues')
plt.title('Median SalePrice by KitchenQual and GarageFinish')
plt.show()

In [None]:
pivot_table = df.pivot_table(
    values='SalePrice',
    index='BsmtExposure',
    columns='GarageFinish',
    aggfunc='median'
)

row_order = ['Gd', 'Av', 'Mn', 'No', 'Missing'] # BsmtExposure  
col_order = ['Missing', 'Unf', 'RFn', 'Fin'] # GarageFinish  

pivot_table_ordered = pivot_table.loc[row_order, col_order]

plt.figure(figsize=(10, 7))
sns.heatmap(pivot_table_ordered, annot=True, fmt=".0f", cmap='Blues')
plt.title('Median SalePrice by BsmtExposure and GarageFinish')
plt.show()
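
The three pivot cells above repeat the same steps, and `.loc` with a fixed label list raises a `KeyError` if a listed category is absent from the data. One possible refactoring is sketched below. The helper name `ordered_median_pivot` and the toy data are hypothetical, and using `reindex` instead of `.loc` is a swapped-in choice rather than what the cells above do.

```python
import pandas as pd

def ordered_median_pivot(df, index, columns, row_order, col_order, values="SalePrice"):
    """Median of `values` by two categorical columns, arranged in a fixed order."""
    table = df.pivot_table(values=values, index=index, columns=columns, aggfunc="median")
    # reindex fills categories that are absent from the data with NaN
    return table.reindex(index=row_order, columns=col_order)

# Toy data for illustration only; not taken from the housing dataset
toy = pd.DataFrame({
    "KitchenQual":  ["Ex", "Gd", "Gd", "TA"],
    "GarageFinish": ["Fin", "RFn", "RFn", "Unf"],
    "SalePrice":    [300, 200, 220, 150],
})

table = ordered_median_pivot(
    toy, "KitchenQual", "GarageFinish",
    row_order=["Ex", "Gd", "TA", "Fa"],
    col_order=["Missing", "Unf", "RFn", "Fin"],
)
print(table)
```

The returned table can be passed to `sns.heatmap` exactly as in the cells above; cells with no data are left blank in the heatmap.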

## PPS Matrix for all features

In [None]:
import ppscore as pps

pps_matrix = pps.matrix(df)

pps_target = pps_matrix[pps_matrix['y'] == 'SalePrice'].sort_values(by='ppscore', ascending=False)

print(pps_target[['x', 'ppscore']])

## Conclusions

- OverallQual 
    - Pearson correlation: 0.79
    - Spearman correlation: 0.81
    - PPS: 0.44
    - Homes with higher overall quality ratings achieve higher sale prices.
- TotalSF
    - Pearson: 0.77
    - Spearman: 0.8
    - PPS: 0.27
    - Larger total living area (including basement) is strongly linked to higher sale prices.
- KitchenQual
    - Boxplots: monotonic relationship
    - PPS: 0.26
    - Higher kitchen quality ratings predict higher sale prices. This suggests kitchen condition is a major factor for buyers.
- GrLivArea 
    - Pearson: 0.71
    - Spearman: 0.73
    - PPS: 0.1
    - Larger above-ground space is strongly correlated with higher prices, but its lower PPS suggests it is less predictive on its own than total size.
- GarageArea
    - Pearson: 0.62
    - Spearman: 0.65
    - PPS: 0.19
    - Larger garages are associated with higher sale prices. The correlation is weaker than for GrLivArea and TotalSF, which suggests garage space matters less than living area.
- HouseAge
    - Pearson: -0.62
    - Spearman: -0.65
    - PPS: 0.2
    - Newer houses generally sell for more. Age is negatively related to price.
- RemodAge
    - Pearson: -0.51
    - Spearman: -0.57
    - PPS: 0.14
    - Homes more recently remodeled tend to be priced higher. 
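
The Pearson and Spearman figures listed above can be gathered into a single table rather than read off separate outputs. The sketch below shows one way to do that. The helper name `correlation_summary` is hypothetical, and the synthetic `toy` data only mimics the signs of the relationships; it does not reproduce the values reported above.

```python
import numpy as np
import pandas as pd

def correlation_summary(df, target, features):
    """Pearson and Spearman correlation of each feature with `target`, side by side."""
    numeric = df[features + [target]]
    summary = pd.DataFrame({
        "pearson": numeric.corr(method="pearson")[target],
        "spearman": numeric.corr(method="spearman")[target],
    }).drop(index=target)
    # Rank by absolute Spearman so strong negative relationships are not pushed down
    return summary.sort_values("spearman", key=abs, ascending=False).round(2)

# Synthetic data for illustration only; not taken from the housing dataset
rng = np.random.default_rng(0)
size = rng.uniform(50, 300, 200)
age = rng.uniform(0, 100, 200)
toy = pd.DataFrame({
    "TotalSF": size,
    "HouseAge": age,
    "SalePrice": 1000 * size - 2000 * age + rng.normal(0, 20000, 200),
})

summary = correlation_summary(toy, "SalePrice", ["TotalSF", "HouseAge"])
print(summary)
```

On the real data, the same call with `features=num_vars` would produce the Pearson and Spearman columns; the PPS values would still come from the `pps_target` frame computed earlier.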