# Housing Price Study Notebook

## Objectives
- Answer business requirement 1:
    - The client wants to know how the house attributes correlate with the sale price, and expects data visualisations of the most correlated variables plotted against the sale price.

## Inputs

- outputs/datasets/cleaned/HousingPrices.csv

## Outputs
- Generate code that answers business requirement 1 and can be used to build the Streamlit App

---

## Change working directory
Change the current working directory to its parent folder so that relative paths resolve from the project root.

In [1]:
import os 
cwd = os.getcwd()
cwd

'/workspaces/heritage-housing/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(cwd))
print("You set a new current working directory")

You set a new current working directory


In [3]:
cwd = os.getcwd()
cwd

'/workspaces/heritage-housing'
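
Re-running the `os.chdir` cell above moves one level further up each time. A minimal sketch of a variant that is safe to re-run follows. It assumes the notebooks folder is named `jupyter_notebooks`, as in the path printed above; `move_to_project_root` is a hypothetical helper, not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move to the parent directory only when the current directory is the
    notebooks folder, so running the cell twice does not climb twice.

    `notebook_dir_name` is an assumption taken from the path printed above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    # Return the directory we ended up in, whether or not we moved
    return Path.cwd()
```

Calling `move_to_project_root()` in place of the two cells above should leave the working directory at the project root no matter how many times the cell is executed.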

---

## Load Data

In [7]:
import pandas as pd
df = pd.read_csv("outputs/datasets/cleaned/HousingPrices.csv")
df.head()

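| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | GarageArea | GarageFinish | GrLivArea | ... | SalePrice | HouseAge | RemodAge | GarageAge | TotalSF | AboveGradeSF | IsRemodeled | Has2ndFlr | HasPorch | HasDeck |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 548 | RFn | 1710 | ... | 208500 | 22 | 22 | 22.0 | 2566.0 | 1710.0 | 0 | 1 | 0 | 0 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | 460 | RFn | 1262 | ... | 181500 | 49 | 49 | 49.0 | 2524.0 | 1262.0 | 0 | 0 | 0 | 0 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 608 | RFn | 1786 | ... | 223500 | 24 | 23 | 24.0 | 2706.0 | 1786.0 | 1 | 1 | 0 | 0 |
| 3 | 961 | 0.0 | 3.0 | No | 216 | ALQ | 540 | 642 | Unf | 1717 | ... | 140000 | 110 | 55 | 27.0 | 1717.0 | 961.0 | 1 | 0 | 0 | 0 |
| 4 | 1145 | 0.0 | 4.0 | Av | 655 | GLQ | 490 | 836 | RFn | 2198 | ... | 250000 | 25 | 25 | 25.0 | 2290.0 | 1145.0 | 0 | 0 | 0 | 0 |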
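
The `Unnamed: 0` column in the preview is what pandas typically produces when a CSV that was saved with its index is read back without `index_col`. If that is the case here (an assumption, since the file itself is not shown), the column is a leftover row counter that would also be picked up as a numeric feature later on. The sketch below uses a small in-memory CSV built from three rows of the preview to show the effect of `index_col=0`.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the CSV, using three rows from the preview above.
# The empty first header cell mimics a file written with its index included.
csv_text = """,1stFlrSF,GrLivArea,SalePrice
0,856,1710,208500
1,1262,1262,181500
2,920,1786,223500
"""

# Default read: the saved index comes back as a data column named "Unnamed: 0"
with_extra_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_extra_column.columns.tolist())
print(clean.columns.tolist())
```

An alternative with the same effect is to write the file with `index=False` in the notebook that produces it.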


## Data Exploration

In [8]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()


## Correlation Study

### Spearman and Pearson Methods on Numerical Variables

In [18]:
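# Keep numeric columns only; text columns cannot be correlated directly
numeric_features = df.select_dtypes(include=['number'])
# Top 10 features by absolute correlation with SalePrice; SalePrice itself (self-correlation of 1) is dropped by name
corr_spearman = numeric_features.corr(method='spearman')['SalePrice'].drop('SalePrice').sort_values(key=abs, ascending=False).head(10)
corr_pearson = numeric_features.corr(method='pearson')['SalePrice'].drop('SalePrice').sort_values(key=abs, ascending=False).head(10)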

In [19]:
corr_pearson

OverallQual     0.790982
TotalSF         0.772116
GrLivArea       0.708624
AboveGradeSF    0.704381
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
HouseAge       -0.522897
RemodAge       -0.507101
MasVnrArea      0.472614
Name: SalePrice, dtype: float64

In [20]:
corr_spearman

OverallQual     0.809829
TotalSF         0.804128
GrLivArea       0.731310
AboveGradeSF    0.721304
HouseAge       -0.652682
GarageArea      0.649379
TotalBsmtSF     0.602725
1stFlrSF        0.575408
RemodAge       -0.571159
OpenPorchSF     0.477561
Name: SalePrice, dtype: float64
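
The two rankings differ in places: `HouseAge`, for example, scores -0.522897 with Pearson and -0.652682 with Spearman. Pearson measures linear association, while Spearman measures monotonic association on ranks, so a relationship that is consistent in direction but curved tends to score higher with Spearman. The sketch below illustrates this on synthetic data (made-up numbers, not the housing dataset).

```python
import pandas as pd

# Synthetic data: y always increases with x, but along a curve, not a straight line
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

# Same DataFrame.corr call as in the notebook, one method at a time
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Because the ranks of `x` and `y` agree perfectly, Spearman reports a perfect association while Pearson is penalised by the curvature. This is one reason to look at both methods before choosing the variables to study.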

In [21]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

{'AboveGradeSF',
 'GarageArea',
 'GrLivArea',
 'HouseAge',
 'OverallQual',
 'TotalSF'}
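
The set above can also be derived in a reusable form, which may be convenient when the same logic is needed in the Streamlit app. This is a sketch under stated assumptions: `top_correlated` and `vars_to_study` are hypothetical helper names, and the toy frame contains made-up numbers only to show the call.

```python
import pandas as pd


def top_correlated(df, target="SalePrice", method="pearson", n=5):
    """Return the n numeric columns with the largest absolute correlation to target."""
    numeric = df.select_dtypes(include="number")
    # Drop the target by name so the result does not rely on it sorting first
    corr = numeric.corr(method=method)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False).head(n)


def vars_to_study(df, target="SalePrice", n=5):
    """Union of the top-n Pearson and Spearman columns, sorted for a stable order."""
    pearson = top_correlated(df, target, "pearson", n)
    spearman = top_correlated(df, target, "spearman", n)
    return sorted(set(pearson.index) | set(spearman.index))


# Toy frame with made-up numbers, only to demonstrate the call
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 260, 330],
    "Area": [50, 70, 95, 120, 160],
    "Age": [40, 30, 22, 10, 3],
    "Noise": [3, 1, 4, 1, 5],
    "Label": ["a", "b", "c", "d", "e"],  # non-numeric, ignored by select_dtypes
})
print(vars_to_study(toy, n=2))
```

Sorting the union gives a deterministic list, whereas the display order of a plain Python `set` is not guaranteed.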

In [22]:
num_vars_to_study = ['AboveGradeSF', 'GarageArea', 'GrLivArea', 'HouseAge', 'OverallQual', 'TotalSF']