# House Price Correlation Study

## Objectives

* Answer business requirement 1:
    * The client is interested in discovering how the house attributes correlate with the sale price. Therefore, the client expects data visualisations of the correlated variables against the sale price to show that.

## Inputs

* `outputs/datasets/cleaned/CompleteSetCleaned.csv`

## Outputs

* Generate initial insights that answer business requirement 1 and can be used to build the Streamlit application.

## Additional Comments




---

# Change working directory

The notebooks for this project are stored in a subfolder called `jupyter_notebooks`, so when running a notebook the working directory needs to be changed to the parent folder.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory of: ", os.getcwd())
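Running the cell above more than once moves the working directory up one level each time. A minimal sketch of a guard against that, assuming the notebook folder is named `jupyter_notebooks` as described above; the helper name `resolve_project_root` is hypothetical and not part of the project:

```python
import os

def resolve_project_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent of current_dir only when current_dir is the notebooks folder.

    Hypothetical helper: any other directory is returned unchanged, so
    re-running the cell does not keep climbing the directory tree.
    """
    normalised = os.path.normpath(current_dir)
    if os.path.basename(normalised) == notebooks_folder:
        return os.path.dirname(normalised)
    return normalised

# Illustrative paths only; they are not read from disk
inside_notebooks = os.path.join("workspace", "project", "jupyter_notebooks")
already_at_root = os.path.join("workspace", "project")

print(resolve_project_root(inside_notebooks))
print(resolve_project_root(already_at_root))
```

In the notebook this could be used as `os.chdir(resolve_project_root(os.getcwd()))`.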

---

# Load Collected Data

In [None]:
import pandas as pd
df_raw_path = "outputs/datasets/cleaned/CompleteSetCleaned.csv"
df = pd.read_csv(df_raw_path)
df.head()

## Data Exploration

In [None]:
df.info()

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

---

# Correlation Study

Feature-engine offers several options for encoding categorical variables, such as One Hot Encoding and Ordinal Encoding. However, I have decided to map the categorical features by hand because they have an inherent order. The unique values of each categorical feature are listed first:

In [None]:
unique_values_dict = {}

list_of_ordinal_features = df.select_dtypes(include=['object']).columns.tolist()

for item in list_of_ordinal_features:
    unique_values_dict[item] = df[item].unique().tolist()
    print(f"Feature: {item} - {unique_values_dict[item]}")

In [None]:
# Define an ordinal mapping for each categorical feature
BsmtExposure_mapping = {
    'Missing': -1,
    'None': 0,
    'No': 1,
    'Mn': 2,
    'Av': 3,
    'Gd': 4
}

BsmtFinType1_mapping = {
    'Missing': -1,
    'None': 0,
    'Unf': 1,
    'LwQ': 2,
    'Rec': 3,
    'BLQ': 4,
    'ALQ': 5,
    'GLQ': 6,
}

GarageFinish_mapping = {
    'Missing': -1,
    'None': 0,
    'Unf': 1,
    'RFn': 2,
    'Fin': 3,
}

KitchenQual_mapping = {
    'Missing': -1,
    'Po': 0,
    'Fa': 1,
    'TA': 2,
    'Gd': 3,
    'Ex': 4,
}


# Apply each mapping to its column
df['BsmtExposure'] = df['BsmtExposure'].map(BsmtExposure_mapping)
df['BsmtFinType1'] = df['BsmtFinType1'].map(BsmtFinType1_mapping)
df['GarageFinish'] = df['GarageFinish'].map(GarageFinish_mapping)
df['KitchenQual'] = df['KitchenQual'].map(KitchenQual_mapping)
df
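One thing to watch with hand-written mappings is that `Series.map` returns `NaN` for any category missing from the dictionary, without raising an error. The sketch below shows one way to check for that before encoding. The helper `map_ordinal` and the toy series are illustrative assumptions, not part of the notebook; the mapping repeats the `KitchenQual` dictionary defined above.

```python
import pandas as pd

def map_ordinal(series, mapping):
    """Map categories to integers, failing loudly on categories not in the mapping.

    Hypothetical helper: a plain Series.map would turn unmapped categories into NaN.
    """
    unmapped = set(series.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"Unmapped categories in {series.name}: {sorted(unmapped)}")
    return series.map(mapping)

kitchen_mapping = {'Missing': -1, 'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4}

# Toy data for illustration only
toy = pd.Series(['TA', 'Gd', 'Ex', 'Fa'], name='KitchenQual')
encoded = map_ordinal(toy, kitchen_mapping)
print(encoded.tolist())
```

A quick alternative check on the real data frame is to run `df.isna().sum()` again after mapping and compare it with the counts before mapping.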

In [None]:
df.info()

In [None]:
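# Rank features by absolute Spearman correlation with SalePrice;
# [1:] drops SalePrice itself, which correlates perfectly with itself
corr_spearman = df.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman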

In [None]:
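# Rank features by absolute Pearson correlation with SalePrice;
# [1:] drops SalePrice itself, which correlates perfectly with itself
corr_pearson = df.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson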
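The two rankings above can differ because Pearson measures linear association, while Spearman works on ranks and so captures any monotonic relationship. A small sketch on made-up data (the `toy` frame is an assumption for illustration, not project data) shows the difference for a relationship that is monotonic but not linear:

```python
import pandas as pd

# Toy data: y always increases with x, but not along a straight line
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

# Same DataFrame.corr call used in the notebook, applied to the toy frame
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman is 1 because the ranks agree exactly; Pearson falls below 1
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Taking the top features from both methods, as the next cell does, keeps variables that are strongly related to the sale price in either sense.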

In [24]:
top_n = 5
top_vars = set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

## EDA

In [None]:
vars_to_study = list(top_vars)
vars_to_study

In [None]:
df_eda = df.filter(vars_to_study + ['SalePrice'])
df_eda.head(3)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))    
    sns.regplot(data=df, x=col, y=target_var, scatter_kws={'s':5}, line_kws={"color":"green"})
    plt.title(f"{col} vs {target_var}", fontsize=20, y=1.05)
    plt.show()

target_var = 'SalePrice'
for col in vars_to_study:
    plot_numerical(df_eda, col, target_var)
    print("\n\n")