# **Correlation Study Notebook**

Part of CRISP-DM **Data Understanding**

## Objectives

* Investigate the relationship between house attributes and sale price to address business requirement 1.

## Approach

* Perform exploratory data analysis using a ProfileReport to understand the distribution of variables and identify correlations.
* Conduct correlation and Predictive Power Score (PPS) analysis to quantify the relationships between variables.
* Create informative plots to visualize the correlations and facilitate understanding.

## Inputs

* Cleaned dataset: outputs/datasets/cleaned/house_prices_cleaned.csv

## Outputs

* Correlation plots and analysis that can be used to build the Streamlit App and provide insights into the relationships between house attributes and sale price.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")
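
As a quick illustration of the call used above, the sketch below shows what os.path.dirname() returns for a made-up path. The path /workspace/project/jupyter_notebooks is an assumption for illustration only, and nothing here changes the real working directory.

```python
import os

# Hypothetical notebook folder, used only for illustration
example_dir = "/workspace/project/jupyter_notebooks"

# dirname() strips the last path component, which gives the parent folder
parent_dir = os.path.dirname(example_dir)
print(parent_dir)
```

Because the real cell moves up one level each time it runs, running it twice would climb one folder too far, which is why the confirmation cell that follows is worth keeping.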

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Cleaned Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/cleaned/house_prices_cleaned.csv")
df.head(5)

---

# Exploratory Data Analysis (EDA)

First we want to get familiar with the dataset. Using ProfileReport we can look at the variable types, distribution, missing data levels, etc.

In [None]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df=df, minimal=True)
profile.to_notebook_iframe()

The report above reveals that:

* 9 features have missing values.
* EnclosedPorch and WoodDeckSF have 90.7% and 89.4% missing values respectively.
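
The missing-data levels reported above can also be checked directly with pandas. The sketch below uses a small made-up frame (only the column names come from the dataset, the values are hypothetical) and lists the share of missing values per column.

```python
import pandas as pd

# Toy data: the values are invented for illustration
toy = pd.DataFrame({
    "EnclosedPorch": [0.0, None, None, None],
    "WoodDeckSF": [None, 120.0, None, 80.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# isna().mean() gives the fraction of missing rows per column
missing_pct = toy.isna().mean().mul(100).round(1)

# Keep only the columns that actually have missing values, highest first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Run against the real df, the same three lines would give a compact table to cross-check against the ProfileReport.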

## Correlation and PPS Analysis

To prepare our dataset for correlation analysis, we need to encode the categorical variables. This involves converting the categorical variables into numerical variables that can be used to calculate correlation coefficients.
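
To make the encoding step concrete, here is a minimal sketch on invented data. It uses pandas' get_dummies rather than the feature_engine encoder used in this notebook, so treat it as an illustration of the idea and not of the exact transformer; the values are hypothetical.

```python
import pandas as pd

# Toy data: the values are invented for illustration
toy = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "TA", "Ex"],
    "SalePrice": [140000, 210000, 155000, 320000],
})

# Each category becomes its own 0/1 column named <variable>_<category>
toy_ohe = pd.get_dummies(toy, columns=["KitchenQual"], dtype=int)
print(toy_ohe.columns.to_list())
print(toy_ohe["KitchenQual_TA"].to_list())
```

Column names such as KitchenQual_TA, which appear later in this notebook, come from this kind of expansion.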

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head()

After encoding the categorical variables, our dataset now has 37 columns, including the original variables and the new encoded columns. We can now proceed with calculating the correlations and creating heatmaps to visualize the relationships between the variables.

In the following cell we define several functions to calculate the correlations, create heatmaps, and display the results. These functions will also save the heatmaps to a folder for later use in the documentation of this project.

Our goal is to analyze how the target variable for our machine learning models correlates with the other variables. We also want to examine multicollinearity, that is, how the features are correlated among themselves.

We use the Spearman correlation coefficient to evaluate monotonic relationships between variables, and the Pearson correlation coefficient to evaluate the linear relationship between two continuous variables. Additionally, we use the Predictive Power Score (PPS) to detect linear or non-linear relationships between two columns.
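
The difference between the two coefficients is easiest to see on a relationship that is monotonic but not linear. The sketch below is plain Python on made-up numbers (not the house price data), written out by hand so the arithmetic is visible; in the notebook itself DataFrame.corr does this work. The helper names are hypothetical, and the ranking helper assumes there are no tied values.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation is Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Invented data: y always grows with x, but along a curve, not a line
x = list(range(1, 11))
y = [value ** 3 for value in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson: {r_pearson:.3f}, Spearman: {r_spearman:.3f}")
```

Spearman reaches 1 here because the ordering of x and y is identical, while Pearson stays below 1 because the points do not lie on a straight line. That gap is the reason for looking at both heatmaps.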

We create heatmaps to visualize the correlations and PPS scores, and save them to a folder for later use. The heatmaps provide a clear and concise way to visualize the relationships between the variables and identify areas of high correlation.

In [None]:
import numpy as np
import ppscore as pps
import seaborn as sns
sns.set(style="whitegrid")
import matplotlib.pyplot as plt

def heatmap_corr(df,threshold, figsize=(20,12), font_annot = 8):
  """
  Function to create heatmap using correlations.
  """
  if len(df.columns) > 1:
    # np.bool was removed in NumPy 1.24; the builtin bool is the replacement
    mask = np.zeros_like(df, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    mask[abs(df) < threshold] = True

    fig, axes = plt.subplots(figsize=figsize)
    sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                linewidth=0.5
                     )
    axes.set_yticklabels(df.columns, rotation = 0)
    plt.ylim(len(df.columns),0)
    # Save heatmaps to docs folder, creating the folder if it does not exist
    os.makedirs('docs/plots', exist_ok=True)
    if df.name == "corr_spearman":
      plt.savefig('docs/plots/heatmap_corr_spearman.png', bbox_inches='tight')
    else:
      plt.savefig('docs/plots/heatmap_corr_pearson.png', bbox_inches='tight')
    plt.show()


def heatmap_pps(df,threshold, figsize=(20,12), font_annot = 8):
    """
    Function to create heatmap using pps.
    """
    if len(df.columns) > 1:

      # np.bool was removed in NumPy 1.24; the builtin bool is the replacement
      mask = np.zeros_like(df, dtype=bool)
      mask[abs(df) < threshold] = True

      fig, ax = plt.subplots(figsize=figsize)
      ax = sns.heatmap(df, annot=True, xticklabels=True,yticklabels=True,
                       mask=mask,cmap='rocket_r', annot_kws={"size": font_annot},
                       linewidth=0.05,linecolor='grey')
      
      plt.ylim(len(df.columns),0)
      # Save heatmap to docs folder, creating the folder if it does not exist
      os.makedirs('docs/plots', exist_ok=True)
      plt.savefig('docs/plots/heatmap_pps.png', bbox_inches='tight')
      plt.show()


def CalculateCorrAndPPS(df):
  """
  Function to calculate correlations and pps.
  """
  df_corr_spearman = df.corr(method="spearman")
  df_corr_spearman.name = 'corr_spearman'
  df_corr_pearson = df.corr(method="pearson")
  df_corr_pearson.name = 'corr_pearson'

  pps_matrix_raw = pps.matrix(df)
  pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

  pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
  print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
  print(pps_score_stats.round(3))

  return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix,CorrThreshold,PPS_Threshold,
                      figsize=(20,12), font_annot=8 ):
  """
  Function to display the correlations and pps.
  """

  print("\n")
  print("* Analyze how the target variable for your ML models is correlated with other variables (features and target)")
  print("* Analyze multicollinearity, that is, how the features are correlated among themselves")

  print("\n")
  print("*** Heatmap: Spearman Correlation ***")
  print("It evaluates monotonic relationship \n")
  heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Pearson Correlation ***")
  print("It evaluates the linear relationship between two continuous variables \n")
  heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Predictive Power Score (PPS) ***")
  print(f"PPS detects linear or non-linear relationships between two columns.\n"
        f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
  heatmap_pps(df=pps_matrix,threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)
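
The masking logic inside heatmap_corr is the least obvious part of the cell above. As a minimal sketch on an invented 3x3 correlation matrix, the code below builds the same kind of mask: True hides a cell, so the upper triangle (including the diagonal) and every coefficient whose absolute value is below the threshold are hidden.

```python
import numpy as np
import pandas as pd

# Invented correlation matrix, for illustration only
corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, -0.5],
     [0.1, -0.5, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)
threshold = 0.4

mask = np.zeros(corr.shape, dtype=bool)
# Hide the upper triangle and the diagonal (they duplicate the lower triangle)
mask[np.triu_indices_from(mask)] = True
# Hide weak correlations, judged by absolute value
mask[(corr.abs() < threshold).to_numpy()] = True
print(mask)
```

Only two cells stay visible in this toy case: the 0.8 between b and a, and the -0.5 between c and b. The PPS heatmap skips the triangle step because a PPS matrix is not symmetric.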

We use the CalculateCorrAndPPS function to calculate the correlations and the Predictive Power Score.

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)
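
For intuition about the scores summarised by the cell above: as far as we understand the ppscore documentation, the score for a numeric target compares the mean absolute error (MAE) of a model against a naive baseline that always predicts the median, as 1 - MAE_model / MAE_naive, floored at 0. The sketch below works through that arithmetic on made-up numbers. The "model" predictions are invented, whereas the library fits and evaluates its own model, so this is an illustration of the normalisation only.

```python
from statistics import median

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented target values and invented model predictions
y_true = [100, 150, 200, 250, 300]
y_model = [110, 140, 210, 240, 290]

# Naive baseline: always predict the median of the target
y_naive = [median(y_true)] * len(y_true)

mae_naive = mean_absolute_error(y_true, y_naive)
mae_model = mean_absolute_error(y_true, y_model)

# Normalised score: 0 means no better than the baseline, 1 means perfect
score = max(0.0, 1 - mae_model / mae_naive)
print(round(score, 3))
```

Read this way, a PPS threshold of 0.2 keeps only pairs where one column reduces the baseline error by at least a fifth.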

In [None]:
DisplayCorrAndPPS(df_corr_pearson = df_corr_pearson,
                  df_corr_spearman = df_corr_spearman, 
                  pps_matrix = pps_matrix,
                  CorrThreshold = 0.4, PPS_Threshold =0.2,
                  figsize=(12,10), font_annot=10)

## Variables to Study

We now calculate and list the highest correlation values for our target variable ['SalePrice'].

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

In [None]:
plt.bar(x=corr_spearman[:5].index, height=corr_spearman[:5])
plt.title("Spearman Correlation", fontsize=20, y=1.05)
plt.show()

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

In [None]:
plt.bar(x=corr_pearson[:5].index, height=corr_pearson[:5])
plt.title("Pearson Correlation", fontsize=20, y=1.05)
plt.show()

We merge the results of both correlation methods by taking the union of the top 8 variables from each list, which selects the variables with coefficient scores of 0.5 and above.

In [None]:
top_n = 8
vars_to_study = set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())
vars_to_study
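
The merge above is a set union of the two top-n lists. A minimal sketch with invented rankings (the variable names come from the dataset, the ordering is hypothetical) shows that a variable appearing in both lists is kept once, and that the result can hold more than top_n names.

```python
top_n = 3

# Hypothetical rankings, for illustration only
pearson_top = ["OverallQual", "GrLivArea", "GarageArea", "TotalBsmtSF"]
spearman_top = ["OverallQual", "GrLivArea", "YearBuilt", "GarageArea"]

# Union of the first top_n names from each ranking; duplicates collapse
merged = set(pearson_top[:top_n] + spearman_top[:top_n])
print(sorted(merged))
```

Sets are unordered, so sorted() is used here only to get a stable display order.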

### EDA of chosen variables

In [None]:
df_eda = df_ohe.filter(list(vars_to_study) + ['SalePrice'])
df_eda.head(3)

## Plotting Against Target

### Target Variable Analysis

We take a look at the range and distribution of our target ['SalePrice'].

In [None]:
sns.set_style('whitegrid')
target_var = 'SalePrice'

def plot_target_hist(df, target_var):
  """
  Function to plot a histogram of the target and
  save the figure to folder.
  """
  plt.figure(figsize=(12, 5))
  sns.histplot(data=df, x=target_var, kde=True)
  plt.title(f"Distribution of {target_var}", fontsize=20)
  # plt.savefig(f'docs/plots/hist_plot_{target_var}.png', bbox_inches='tight')        
  plt.show()

plot_target_hist(df, target_var)

We now visually analyse how our chosen variables correlate with our target. (Business Requirement 1)

In [None]:
target_var = 'SalePrice'
time = ['YearBuilt', 'YearRemodAdd']

def corr_line_plot(df, col, target_var):
  """
  Line plots of target variable vs time variables (years)
  Figures are saved to folder.
  """
  fig, axes = plt.subplots(figsize=(10, 5))
  sns.lineplot(data=df, x=col, y=target_var)
  plt.title(f"{col}", fontsize=20, y=1.05)
  plt.savefig(f'docs/plots/line_plot_price_by_{col}.png', bbox_inches='tight')        
  plt.show()

def corr_box_plot(df, col, target_var):
  """
  Box plots of target variable vs categorical variables
  Figures are saved to folder.
  """
  fig, axes = plt.subplots(figsize=(10, 5))
  sns.boxplot(data=df, x=col, y=target_var) 
  plt.title(f"{col}", fontsize=20, y=1.05)
  plt.savefig(f'docs/plots/box_plot_price_by_{col}.png', bbox_inches='tight')
  plt.show()

def corr_lm_plot(df, col, target_var):
  """
  Linear regression plots of target variable vs continuous features
  Figures are saved to folder.
  """
  sns.lmplot(data=df, x=col, y=target_var, height=6, aspect=1.5)
  plt.title(f"{col}", fontsize=20, y=1.05)
  plt.savefig(f'docs/plots/lm_plot_price_by_{col}.png', bbox_inches='tight')        
  plt.show()


for col in vars_to_study:
  if len(df_eda[col].unique()) <= 10:
    corr_box_plot(df_eda, col, target_var)
    print("\n\n")
  else:
    if col in time:
      corr_line_plot(df_eda, col, target_var)
      print("\n\n")
    else:
      corr_lm_plot(df_eda, col, target_var)
      print("\n\n")
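
The rule the loop above applies can be pulled out into a small helper, which makes it easy to check in isolation. This is a sketch only: choose_plot_kind is a hypothetical name that is not part of the notebook, the unique-value counts passed in are invented, and the cut-off of 10 mirrors the loop.

```python
def choose_plot_kind(n_unique, col, time_cols, max_categories=10):
    """Return 'box', 'line' or 'lm', following the rule used in the loop."""
    if n_unique <= max_categories:
        # Few distinct values: treat the variable as categorical
        return "box"
    if col in time_cols:
        # Year variables: show the price trend over time
        return "line"
    # Everything else: continuous feature with a regression line
    return "lm"

time_cols = ["YearBuilt", "YearRemodAdd"]

# The unique-value counts below are invented for illustration
print(choose_plot_kind(10, "OverallQual", time_cols))
print(choose_plot_kind(112, "YearBuilt", time_cols))
print(choose_plot_kind(861, "GrLivArea", time_cols))
```

Note that the cardinality check comes first, so a year column with 10 or fewer distinct values would get a box plot and not a line plot, exactly as in the loop.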

We also plot each chosen variable against sale price, using overall quality as the hue.

In [None]:
def correlation_to_sale_price_scat(df, vars_to_study):
    """  scatterplots of variables vs SalePrice """
    target_var = 'SalePrice'
    for col in vars_to_study:
        fig, axes = plt.subplots(figsize=(10, 5))
        axes = sns.scatterplot(data=df, x=col, y=target_var, hue='OverallQual')
        plt.title(f"{col}", fontsize=20, y=1.05)
        plt.show()
        print("\n\n")

correlation_to_sale_price_scat(df_eda, vars_to_study)

---

# Conclusion

Our analysis reveals that:

* Larger properties tend to have higher sale prices. (Size Hypothesis)
    * Related variables: ['1stFlrSF', 'GarageArea', 'GrLivArea', 'TotalBsmtSF']
* Higher quality ratings are associated with higher sale prices.
    * Related variables: ['KitchenQual_TA', 'OverallQual']
* Recently built houses and those with recent renovations tend to have higher sale prices.
    * Related variables: ['YearBuilt', 'YearRemodAdd']
