# **House prices study of the Ames, Iowa area.**

## Objectives

* Answer business requirement 1:
    * The client is interested in understanding how the average sale price of houses in the Ames, Iowa area varies with their respective features.

## Inputs

* outputs/datasets/collection/house_prices_ames_iowa_cleaned.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App 


---

## Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/house_prices_ames_iowa_cleaned.csv"))
df.head(3)

---

# Data Exploration

We want to become more familiar with the dataset: check variable types and distributions, the level of missing data, and what these variables mean in a business context.

In [None]:
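# Note: recent releases of this package are distributed as ydata-profiling (import name ydata_profiling)
from pandas_profiling import ProfileReport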
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

# Correlation Study

Observing how many variables are categorical:

In [None]:
df.info()

Categorical features:
- BsmtExposure
- BsmtFinType1
- GarageFinish
- KitchenQual

One-hot encoding these categorical features helps the correlation analysis: the correlation methods need numeric variables, and OneHotEncoder converts each category into its own numeric (0/1) column.

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)
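
As a rough illustration of what the encoding step produces, the sketch below one-hot encodes a tiny hand-made frame. It uses `pd.get_dummies` as a stand-in for feature_engine's `OneHotEncoder`, and the `toy` frame and its prices are invented for the example, not taken from the Ames data.

```python
import pandas as pd

# Invented toy data (assumption): one categorical feature and a numeric target
toy = pd.DataFrame({
    "KitchenQual": ["Ex", "TA", "Gd", "TA"],
    "SalePrice": [300000, 150000, 210000, 140000],
})

# pd.get_dummies stands in for OneHotEncoder here: one 0/1 column per category
toy_ohe = pd.get_dummies(toy, columns=["KitchenQual"], dtype=int)

# Once every column is numeric, corr() can include the former categories
toy_corr = toy_ohe.corr(method="pearson")["SalePrice"]
print(toy_ohe)
print(toy_corr)
```

Each category becomes its own column, which is why names such as `KitchenQual_Ex` can show up in the correlation rankings.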

I use `spearman` and `pearson` methods, and investigate the top 10 correlations
* I know this command returns a pandas series and the first item is the correlation between SalePrice and SalePrice, which happens to be 1, so I exclude that with `[1:]`
* I sort values considering the absolute value, by setting `key=abs`

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman
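
The ranking logic above can be checked on a small example. This is a minimal sketch with invented numbers: the `toy` frame and its column names are assumptions. It also shows `drop("SalePrice")` as an alternative to `[1:]`, which does not depend on the target ending up in first position when another feature ties with it at an absolute correlation of 1.

```python
import pandas as pd

# Invented toy data (assumption): Area mostly rises with price, Age mostly falls
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "Area": [50, 70, 65, 120, 140],
    "Age": [30, 40, 25, 5, 10],
    "Noise": [3, 1, 4, 2, 5],
})

ranked = toy.corr(method="spearman")["SalePrice"].sort_values(key=abs, ascending=False)

# Same slicing as in the notebook: skip the target's correlation with itself
top_by_slice = ranked[1:]

# Alternative: remove the target by label instead of by position
top_by_label = ranked.drop("SalePrice")

print(top_by_slice)
```

Sorting with `key=abs` keeps strongly negative correlations (such as `Age` here) near the top instead of pushing them to the bottom of the list.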

We do the same for `pearson`

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson
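
To see why the two methods can rank features differently, the sketch below compares them on an invented relationship that is monotonic but not linear. The `toy` frame is an assumption made for illustration only.

```python
import pandas as pd

# Invented toy data (assumption): y always grows with x, but not along a line
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

# Spearman works on ranks, so a perfectly monotonic relationship scores 1
spearman = toy.corr(method="spearman").loc["x", "y"]

# Pearson measures linear association, so the curvature lowers the score
pearson = toy.corr(method="pearson").loc["x", "y"]

print(f"spearman={spearman:.3f}, pearson={pearson:.3f}")
```

Features whose relationship with SalePrice is curved rather than straight can therefore sit higher in the Spearman ranking than in the Pearson one.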

The two methods return almost the same features as highly correlated, with some differences in the ranking of YearBuilt and 1stFlrSF. I therefore decided to take the top 9 features from each method for further analysis, since their absolute correlation with SalePrice was above 0.50.

In [None]:
top_n = 9
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

We will therefore study the following features in the dataset and investigate whether:
* A house price is higher if the first floor, garage, basement and ground living area are larger.
* A house price is higher if the kitchen quality and overall quality are higher, as our first hypothesis implies.
* A house price is higher if the year it was built or remodeled is more recent, as our second hypothesis implies.

In [None]:
vars_to_study = ['1stFlrSF',
                 'GarageArea',
                 'GarageYrBlt',
                 'GrLivArea',
                 'KitchenQual_Ex',
                 'KitchenQual_TA',
                 'OverallQual',
                 'TotalBsmtSF',
                 'YearBuilt',
                 'YearRemodAdd']
vars_to_study


---

# EDA on selected variables

In [None]:
# Filter the one-hot encoded frame: KitchenQual_Ex and KitchenQual_TA only exist in df_ohe
df_eda = df_ohe.filter(vars_to_study + ['SalePrice'])
df_eda.head(10)

Import the libraries needed to plot the features and analyse them.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

In [None]:
def scatter_plot_for_eda(df, col, target_var):
    plt.figure(figsize=(12, 6))
    sns.scatterplot(data=df, x=col, y=target_var)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


target_var = 'SalePrice'
for col in vars_to_study:
    scatter_plot_for_eda(df_eda, col, target_var)
    print("\n\n")


---

# PPS Analysis

In [None]:
# Libraries not yet imported
import numpy as np
import ppscore as pps


def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        # np.bool was removed in NumPy 1.24; the built-in bool is equivalent
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        # np.bool was removed in NumPy 1.24; the built-in bool is equivalent
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman")
    df_corr_pearson = df.corr(method="pearson")

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(
        columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query(
        "ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold,
                 figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold,
                 figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print(f"PPS detects linear or non-linear relationships between two columns.\n"
          f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold,
                figsize=figsize, font_annot=font_annot)


Calculate the correlations and the Predictive Power Score

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)
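
The reshaping done inside `CalculateCorrAndPPS` can be illustrated without running ppscore. The sketch below assumes a long-format table with `x`, `y` and `ppscore` columns, as filtered in the function above. The scores in `long_scores` are invented placeholders, not results from this dataset.

```python
import pandas as pd

# Hypothetical scores (assumption), one row per (x, y) pair
long_scores = pd.DataFrame({
    "x": ["OverallQual", "OverallQual", "SalePrice", "SalePrice"],
    "y": ["OverallQual", "SalePrice", "OverallQual", "SalePrice"],
    "ppscore": [1.0, 0.44, 0.35, 1.0],
})

# Same pivot as in CalculateCorrAndPPS: predictors as columns, targets as rows
wide_scores = long_scores.pivot(columns="x", index="y", values="ppscore")

# Same filter as in CalculateCorrAndPPS: drop perfect scores before summarising
off_diagonal = long_scores.query("ppscore < 1")["ppscore"]

print(wide_scores)
print(off_diagonal.describe().round(3))
```

Unlike a correlation matrix, the resulting matrix is not symmetric: how well `x` predicts `y` can differ from how well `y` predicts `x`, which is why `heatmap_pps` does not mask the upper triangle.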

Display the heatmaps

In [None]:
DisplayCorrAndPPS(df_corr_pearson=df_corr_pearson,
                  df_corr_spearman=df_corr_spearman,
                  pps_matrix=pps_matrix,
                  CorrThreshold=0.4, PPS_Threshold=0.2,
                  figsize=(12, 10), font_annot=10)
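
The effect of `CorrThreshold` on the correlation heatmaps can be checked on a small matrix without plotting. This is a sketch under assumptions: the 3x3 `toy` matrix is invented, and the mask is built with explicit `.to_numpy()` calls rather than exactly as in `heatmap_corr`.

```python
import numpy as np
import pandas as pd

# Invented symmetric correlation matrix (assumption)
toy = pd.DataFrame(
    [[1.0, 0.7, 0.1],
     [0.7, 1.0, -0.5],
     [0.1, -0.5, 1.0]],
    index=["a", "b", "c"], columns=["a", "b", "c"],
)
threshold = 0.4

# True means "hide this cell" in the heatmap
mask = np.zeros_like(toy.to_numpy(), dtype=bool)
mask[np.triu_indices_from(mask)] = True              # hide diagonal and upper triangle
mask[(toy.abs() < threshold).to_numpy()] = True      # hide weak correlations

# Cells that would still be drawn; hidden cells become NaN
visible = toy.where(~mask)
print(visible)
```

Only the lower-triangle cells whose absolute value reaches the threshold remain, so raising `CorrThreshold` gives a sparser and easier to read heatmap.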


---

# Conclusions

From the analysis performed I have gathered that:
- Houses that are larger in area across various features are also higher in value. This seems to be the strongest correlation at the moment, although it was not one of the initial hypotheses.
- Houses that are in better condition and with higher quality building features are higher in value, confirming hypothesis 1.
- Houses which are newer or more recently renovated are higher in value, confirming hypothesis 2.

These findings will be the foundation for the modelling step of the process.