# Correlation Study Notebook
## Objectives
- Business requirement 1:
    - The client is interested in discovering how the house attributes correlate with the sale price. Therefore, the client expects data visualisations of the correlated variables against the sale price to show that.

## Inputs
- outputs/datasets/cleaned/TrainSetCleaned.csv

## Outputs
- outputs/datasets/cleaned/TrainSetCleaned.csv

## Conclusions
- The price of a property is directly correlated with its quality and size, as well as its construction date.

---

# Change working directory
We need to change the working directory from the notebook's current folder to its parent folder, so that the relative paths used below resolve from the project root.

In [None]:
import os

current_path = os.getcwd()
os.chdir(os.path.dirname(current_path))
current_path = os.getcwd()
current_path
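
One caveat with the cell above: every re-run moves the working directory up one more level. As a hedged alternative, here is a minimal sketch that walks upwards until it finds a marker folder. The helper name `find_project_root` and the choice of `outputs` as the marker are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "outputs") -> Path:
    """Return the closest directory (start or one of its parents) containing `marker`.

    `marker` is assumed to be a folder that only exists in the project root.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Usage in the notebook (commented out so the sketch has no side effects):
# import os
# os.chdir(find_project_root(Path.cwd()))
```

Because the search starts from the current directory itself, running it a second time from the project root leaves the working directory unchanged.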

# Load Cleaned Data

In [None]:
import pandas as pd
TrainSet = pd.read_csv("outputs/datasets/cleaned/TrainSetCleaned.csv")
TrainSet.head()

# Data Exploration

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()

---

## Correlation and PPS Analysis
We need to find the correlation between the different features and the sale price. Let's start with the Pearson correlation.

## Pearson Correlation

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

pearson_corr = TrainSet.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)

# Exclude the SalePrice correlation with itself
pearson_corr = pearson_corr.drop('SalePrice')

plt.figure(figsize=(10, 8))
sns.barplot(x=pearson_corr.index, y=pearson_corr.values, palette='viridis')
plt.xticks(rotation=90)
plt.title('Pearson Correlation with SalePrice')
plt.xlabel('Features')
plt.ylabel('Correlation Coefficient')

plt.show()

## Spearman Correlation

In [None]:
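# Sort by absolute strength, as in the Pearson cell, so strong negative correlations are not pushed to the end
spearman_corr = TrainSet.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)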

# Exclude the SalePrice correlation with itself
spearman_corr = spearman_corr.drop('SalePrice')

plt.figure(figsize=(10, 8))
sns.barplot(x=spearman_corr.index, y=spearman_corr.values, palette='viridis')
plt.xticks(rotation=90)
plt.title('Spearman Correlation with SalePrice')
plt.xlabel('Features')
plt.ylabel('Correlation Coefficient')

plt.show()

## Comparing results
The Pearson and Spearman correlations give almost the same results. Let's compare them side by side.


In [None]:
# Define the threshold
threshold = 0.5

correlation_comparison = pd.DataFrame({
    'Pearson': pearson_corr,
    'Spearman': spearman_corr
})

plt.figure(figsize=(10, 8))
sns.scatterplot(x='Pearson', y='Spearman', data=correlation_comparison)
plt.title('Comparison of Pearson vs. Spearman Correlations')
plt.xlabel('Pearson Correlation Coefficient')
plt.ylabel('Spearman Correlation Coefficient')

plt.axhline(threshold, color='red', linestyle='--', linewidth=1)
plt.axvline(threshold, color='red', linestyle='--', linewidth=1)

# Label each point with its feature name (iterating over rows avoids positional lookups on a string index)
for feature, row in correlation_comparison.iterrows():
    plt.text(row['Pearson'] + 0.01, row['Spearman'], feature,
             horizontalalignment='left', size='medium', color='black', weight='semibold')

plt.grid(True)
plt.show()
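
Points far from the diagonal in the scatter plot would indicate features where the two methods disagree. As a reminder of why that can happen, here is a minimal sketch on synthetic data (not taken from the housing dataset): Pearson measures linear association, while Spearman only looks at rank order. The column names `x`, `linear` and `cubic` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data, unrelated to the housing dataset
x = np.arange(1, 21, dtype=float)
demo = pd.DataFrame({
    "x": x,
    "linear": 3.0 * x + 2.0,  # exactly linear in x
    "cubic": x ** 3,          # monotonic in x, but strongly non-linear
})

pearson_demo = demo.corr(method="pearson")["x"]
spearman_demo = demo.corr(method="spearman")["x"]

# Spearman is 1 for both columns because the ranks match perfectly;
# Pearson is 1 only for the linear column and lower for the cubic one
print(pd.DataFrame({"Pearson": pearson_demo, "Spearman": spearman_demo}))
```

If a feature sat clearly above the diagonal in our plot, a monotonic but non-linear relationship like the cubic one here would be one possible explanation.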

In [None]:
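# Keep features whose absolute correlation exceeds the threshold
pearson_features = pearson_corr[pearson_corr.abs() > threshold]
spearman_features = spearman_corr[spearman_corr.abs() > threshold]

# Union of the features selected by either method
result = list(set(pearson_features.index.to_list()) | set(spearman_features.index.to_list()))
result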

Comparing the two, we can see that these variables have the strongest correlation with the sale price, with an absolute coefficient above the 0.5 threshold for at least one of the two methods:
- OverallQual
- GrLivArea
- KitchenQual
- GarageArea
- YearBuilt
- TotalBsmtSF
- GarageFinish
- YearRemodAdd
- 1stFlrSF
- GarageYrBlt

Moving forward, we can drop all the other variables.
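
The selection rule used above (keep a feature if either coefficient exceeds the threshold in absolute value) can be sketched on its own as follows. The helper name `select_features` and the toy coefficients are hypothetical and only illustrate the rule; they are not results from the housing data.

```python
import pandas as pd


def select_features(pearson, spearman, threshold=0.5):
    """Return the sorted union of features whose |coefficient| exceeds `threshold`."""
    strong_pearson = pearson[pearson.abs() > threshold].index
    strong_spearman = spearman[spearman.abs() > threshold].index
    return sorted(set(strong_pearson) | set(strong_spearman))


# Made-up coefficients for four hypothetical features
toy_pearson = pd.Series({"A": 0.80, "B": 0.45, "C": -0.60, "D": 0.10})
toy_spearman = pd.Series({"A": 0.75, "B": 0.55, "C": -0.40, "D": 0.20})

selected = select_features(toy_pearson, toy_spearman)
print(selected)  # ['A', 'B', 'C']
```

Here `B` is kept only because of its Spearman coefficient and `C` only because of its (negative) Pearson coefficient, which is why the union is used rather than the intersection.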

# Save changes to the Train Set

In [None]:
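# Keep the selected features plus the target ('SalePrice' is not in `result` because it was dropped from both correlation series)
columns_to_keep = result + ['SalePrice']
df_corr = TrainSet.drop(columns=[col for col in TrainSet.columns if col not in columns_to_keep])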

df_corr.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)

---

# Conclusions
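- The sale price correlates most strongly with a property's quality (e.g. OverallQual, KitchenQual), its size (e.g. GrLivArea, TotalBsmtSF, GarageArea), and its construction or remodelling date (e.g. YearBuilt, YearRemodAdd).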