# **Data Exploration and Correlation Studies**

## Objectives

* **Business Requirement 1:**
    * Investigate the relationship between house attributes and sale prices.
    * Conduct a detailed correlation study using both Spearman and Pearson methods.
    * Visualize the strength of correlations with sale price.
    * Generate plots for the variables that show the highest correlations with sale price to derive actionable insights.

## Inputs

* The cleaned datasets from the previous notebook:
  * `outputs/datasets/collection/HousePricing.csv`
  * `outputs/datasets/collection/InheritedHouses.csv`

## Outputs

* Visualizations showing relationships between key house attributes and `SalePrice`.
* Correlation matrices (Pearson and Spearman) to identify the most relevant variables.
* Summary of insights derived from the data.


---

## Install Python packages in the Notebook

In [None]:
%pip install -r /workspace/HeritageHousing/requirements-notebooks.txt

## Change working directory

Before starting, we need to change the working directory from the notebook's current folder to its parent folder.

We first access the current directory using `os.getcwd()`.

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.

* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

## Libraries Import

In [5]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

## Load the Data

Load the datasets

In [6]:
df = pd.read_csv('outputs/datasets/collection/HousePricing.csv')
inherited_houses = pd.read_csv('outputs/datasets/collection/InheritedHouses.csv')

Display the first few rows to ensure the data loaded correctly

In [None]:
df.head()

## Basic Summary of the Data

Summary of the dataframe

In [None]:
df.info()

Descriptive statistics

In [None]:
df.describe()

## Check for Missing Values

Check for missing values in the dataframe

In [None]:
df.isnull().sum().sort_values(ascending=False)
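
Raw counts are easier to judge next to the share of rows they represent. The sketch below uses a hypothetical helper (`missing_summary` is not part of this notebook) on a small made-up frame; it assumes only pandas and NumPy, and the same call could be pointed at `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data with invented values, used only to illustrate the helper
toy = pd.DataFrame({
    "EnclosedPorch": [np.nan, np.nan, np.nan, 112.0],
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

report = missing_summary(toy)
print(report)
```

A column that is mostly empty may call for a different treatment than one with a handful of gaps, which is why the percentage is worth seeing before choosing a fill strategy.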

## Handling Missing Values

**Handle missing values based on the nature of the data in each column.**

* EnclosedPorch and WoodDeckSF:
    * Missing values likely indicate the absence of these features. Replace with 0.

In [None]:
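# Assign the result back; newer pandas versions warn about fillna(inplace=True) on a column selection
df['EnclosedPorch'] = df['EnclosedPorch'].fillna(0)
df['WoodDeckSF'] = df['WoodDeckSF'].fillna(0)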

* LotFrontage:
    * This is a critical numerical feature. Replace missing values with the median of LotFrontage.

In [None]:
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())

* GarageFinish and GarageYrBlt:
    * Missing values likely indicate no garage. Replace GarageFinish with 'No Garage' and GarageYrBlt with 0.

In [None]:
df['GarageFinish'] = df['GarageFinish'].fillna('No Garage')
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0)

* BsmtFinType1 and BsmtExposure:
    * Missing values likely indicate no basement. Replace with 'No Basement' and 'No Exposure'.

In [None]:
df['BsmtFinType1'] = df['BsmtFinType1'].fillna('No Basement')
df['BsmtExposure'] = df['BsmtExposure'].fillna('No Exposure')

* BedroomAbvGr:
    * Unusual to have missing values. Fill with the mode (most common number of bedrooms).

In [None]:
df['BedroomAbvGr'] = df['BedroomAbvGr'].fillna(df['BedroomAbvGr'].mode()[0])

* 2ndFlrSF:
    * Missing values likely indicate no second floor. Replace with 0.

In [None]:
df['2ndFlrSF'] = df['2ndFlrSF'].fillna(0)

* MasVnrArea:
    * Missing values likely indicate no masonry veneer. Replace with 0.

In [None]:
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)
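
The column-by-column fills above could also be gathered into a single dictionary passed to `DataFrame.fillna`. The following is a minimal sketch on a toy frame with invented values, not the notebook's data; `fill_values` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values standing in for a few of the real columns
toy = pd.DataFrame({
    "EnclosedPorch": [np.nan, 0.0, 112.0],
    "GarageFinish": ["Unf", None, "Fin"],
    "LotFrontage": [60.0, np.nan, 80.0],
})

# One entry per column: a constant, a label, or a statistic computed from the column
fill_values = {
    "EnclosedPorch": 0,
    "GarageFinish": "No Garage",
    "LotFrontage": toy["LotFrontage"].median(),
}

# fillna with a dict fills each listed column with its own value
toy = toy.fillna(value=fill_values)
print(toy)
```

Keeping the rules in one place makes it easier to review which assumption was applied to which column.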

## Feature Engineering

Since we are looking for correlations between house attributes and sale price, it makes sense to add a total square footage column that sums the 1st floor, 2nd floor, and basement square footage.

Create a new feature for total square footage

In [18]:
df['TotalSF'] = df['1stFlrSF'] + df['2ndFlrSF'] + df['TotalBsmtSF']

Inspect the new feature

In [None]:
df[['TotalSF', 'SalePrice']].head()

Before proceeding to the correlation tests, we first check which columns hold categorical data.

In [None]:
categorical_cols = df.select_dtypes(include=['object']).columns
print("Categorical columns:", categorical_cols)


We will convert the categorical variables into numeric codes. Because the categories have an inherent order (e.g., Gd > Av > Mn > No), we define explicit ordinal mappings rather than relying on arbitrary label codes, so the numeric values preserve that order.
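
One caveat is worth illustrating here: scikit-learn's `LabelEncoder` numbers classes in sorted (alphabetical) order, so on its own it does not preserve a quality ranking. The sketch below compares it with a hand-written mapping on a toy series; the series and the variable names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy series holding the four exposure levels, listed from lowest to highest
exposure = pd.Series(["No", "Mn", "Av", "Gd"])

# LabelEncoder sorts the classes alphabetically before numbering them
encoder = LabelEncoder()
alphabetical_codes = dict(zip(exposure, encoder.fit_transform(exposure)))

# An explicit mapping keeps the intended order No < Mn < Av < Gd
ordinal_mapping = {"No": 0, "Mn": 1, "Av": 2, "Gd": 3}
ordinal_codes = dict(zip(exposure, exposure.map(ordinal_mapping)))

print("LabelEncoder:", alphabetical_codes)
print("Explicit map:", ordinal_codes)
```

With the alphabetical codes, `Av` would rank below `Mn`, which would distort any rank-based correlation computed on the encoded column.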

List of categorical columns

In [21]:
categorical_cols = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

Check unique values in each categorical column before encoding

In [None]:
for col in categorical_cols:
    print(f"Unique values in {col}: {df[col].unique()}")


Define mappings for each categorical column

In [23]:
# 'No Exposure' marks a missing basement, so it ranks below 'No' (a basement without exposure)
bsmt_exposure_mapping = {'No Exposure': 0, 'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}
bsmt_fin_type_mapping = {'No Basement': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}
garage_finish_mapping = {'No Garage': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}
kitchen_qual_mapping = {'Fa': 0, 'TA': 1, 'Gd': 2, 'Ex': 3}

Apply the mappings

In [24]:
df['BsmtExposure'] = df['BsmtExposure'].map(bsmt_exposure_mapping)
df['BsmtFinType1'] = df['BsmtFinType1'].map(bsmt_fin_type_mapping)
df['GarageFinish'] = df['GarageFinish'].map(garage_finish_mapping)
df['KitchenQual'] = df['KitchenQual'].map(kitchen_qual_mapping)
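
`Series.map` returns `NaN` for any value that has no key in the mapping, so a quick check after encoding can catch categories the dictionaries missed. This is a minimal sketch on a toy series; the `'Po'` entry is included purely to trigger the case and is not a claim about what the real column contains.

```python
import pandas as pd

kitchen_qual_mapping = {"Fa": 0, "TA": 1, "Gd": 2, "Ex": 3}

# Toy series; 'Po' is deliberately absent from the mapping above
toy = pd.Series(["Gd", "TA", "Po", "Ex"], name="KitchenQual")

encoded = toy.map(kitchen_qual_mapping)

# Original labels that turned into NaN because the mapping had no key for them
unmapped = sorted(toy[encoded.isna()].unique())
print("Unmapped categories:", unmapped)
```

An empty list means every category was covered; anything else points to a mapping that needs another entry.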

Check the DataFrame after mapping

In [None]:
df.head()

In [None]:
# Save the cleaned and feature-engineered data
df.to_csv('outputs/datasets/collection/HousePricing_cleaned.csv', index=False)

print("Cleaned and feature-engineered data saved successfully.")


## Exploratory Data Analysis (EDA) - Correlation Matrices

Calculate and visualize the Pearson and Spearman correlation matrices.

* **Pearson Correlation Matrix:**
    * Measures the linear relationship between variables.

In [None]:
plt.figure(figsize=(12, 8))
pearson_corr = df.corr(method='pearson')
sns.heatmap(pearson_corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Pearson Correlation Matrix')
plt.show()

* **Spearman Correlation Matrix:**
    * Measures the monotonic relationship between variables (whether linear or not).

In [None]:
plt.figure(figsize=(12, 8))
spearman_corr = df.corr(method='spearman')
sns.heatmap(spearman_corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Spearman Correlation Matrix')
plt.show()
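
To make the difference between the two methods concrete, the sketch below builds a toy relationship that is perfectly monotonic but not linear (`y = exp(x)`). The data is synthetic and unrelated to the housing dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: y always increases with x, but far from a straight line
x = np.arange(1, 11)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Spearman works on ranks, so it reaches 1 whenever the ordering is preserved, while Pearson is pulled down by the curvature. A noticeable gap between the two for a housing attribute is therefore a hint that its relationship with `SalePrice` is monotonic but not linear.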

## Pearson Correlation (Linear Relationships):

#### Positive Correlation

* **OverallQual:** Correlation of 0.79 - The overall quality of the house has a strong positive linear relationship with sale price.

* **GrLivArea:** Correlation of 0.71 - The above-ground living area also shows a strong positive linear correlation with sale price.

* **TotalSF:** Correlation of 0.77 - The total square footage (1st floor, 2nd floor, and basement combined) is also strongly correlated with sale price.

* **GarageArea:** Correlation of 0.62 - The size of the garage has a moderate positive linear correlation with sale price.

* **YearBuilt:** Correlation of 0.59 - Newer houses tend to have a higher sale price.

## Spearman Correlation (Monotonic Relationships):

#### Positive Correlation

* **OverallQual:** Correlation of 0.80 - Overall quality has a strong monotonic association with sale price, whether or not the relationship is strictly linear.

* **GrLivArea:** Correlation of 0.73 - This shows a strong monotonic relationship between living area and sale price.

* **TotalSF:** Correlation of 0.80 - This confirms the strong relationship between total square footage and sale price.

* **GarageArea:** Correlation of 0.66 - The garage area remains moderately correlated with sale price in a monotonic sense.

* **YearBuilt:** Correlation of 0.59 - Again, newer houses tend to have higher sale prices; the rank-based measure gives the same value as Pearson.
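
Reading individual coefficients off a large heatmap is error-prone, so it can help to extract just the `SalePrice` column from both matrices and sort it. The sketch below uses a hypothetical helper (`rank_correlations` is not part of this notebook) on a toy frame with invented values.

```python
import pandas as pd


def rank_correlations(frame: pd.DataFrame, target: str = "SalePrice") -> pd.DataFrame:
    """Return Pearson and Spearman correlations with the target, strongest first."""
    numeric = frame.select_dtypes(include="number")
    table = pd.DataFrame({
        method: numeric.corr(method=method)[target].drop(target)
        for method in ("pearson", "spearman")
    })
    # Sort by absolute Spearman value so strong negative relationships also surface
    order = table["spearman"].abs().sort_values(ascending=False).index
    return table.loc[order]


# Toy data with invented values, used only to illustrate the helper
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 10],
    "EnclosedPorch": [30, 0, 120, 0, 60, 10],
    "SalePrice": [120_000, 150_000, 185_000, 240_000, 330_000, 450_000],
})

ranking = rank_correlations(toy)
print(ranking.round(2))
```

Calling `rank_correlations(df)` on the encoded frame would give one table with both coefficients side by side, which could then be passed to a bar plot.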

**Let's plot these attributes against `SalePrice` to get a better understanding of the findings.**

In [None]:
key_attributes = ['GrLivArea', 'TotalSF', 'GarageArea', 'YearBuilt']

plt.figure(figsize=(15, 12))

for i, attribute in enumerate(key_attributes, 1):
    plt.subplot(3, 2, i)
    sns.scatterplot(x=df[attribute], y=df['SalePrice'])
    plt.title(f'SalePrice vs {attribute}')
    plt.xlabel(attribute)
    plt.ylabel('SalePrice')

plt.tight_layout()
plt.show()

In [None]:

plt.figure(figsize=(15, 12))

# Bar chart of mean SalePrice for each OverallQual rating
plt.subplot(3, 2, 1)
sns.barplot(x=df['OverallQual'], y=df['SalePrice'], errorbar=None)
plt.title('SalePrice vs OverallQual')
plt.xlabel('OverallQual')
plt.ylabel('SalePrice')

plt.tight_layout()
plt.show()

## Pearson Correlation (Linear Relationships):

#### Weak Correlation

* **BsmtFinType1:** Correlation of 0.01 - Indicates almost no linear relationship with the sale price.

* **BsmtUnfSF:** Correlation of 0.18 - Very weak correlation with the sale price.

* **EnclosedPorch:** Correlation of 0.05 - Very little correlation with the sale price.

* **BedroomAbvGr:** Correlation of 0.16 - Weak correlation with sale price.

* **LotFrontage:** Correlation of 0.26 - Weak correlation with sale price.

## Spearman Correlation (Monotonic Relationships):

#### Weak Correlation

* **EnclosedPorch:** Correlation of 0.05 - Almost no monotonic relationship with the sale price.

* **BsmtFinType1:** Correlation of 0.06 - Very weak monotonic relationship with the sale price.

* **BsmtUnfSF:** Correlation of 0.18 - Weak monotonic relationship with the sale price.

* **LotFrontage:** Correlation of 0.29 - Weak monotonic relationship with the sale price.

* **BedroomAbvGr:** Correlation of 0.47 - Moderate monotonic relationship with the sale price, noticeably stronger than its Pearson value of 0.16.

**Let's plot these attributes against `SalePrice` to get a better understanding of the findings.**

In [None]:

# Filter out rows where EnclosedPorch is 0
filtered_df = df[df['EnclosedPorch'] != 0]

# Define the attributes
negative_attributes = ['EnclosedPorch', 'BsmtFinType1', 'BsmtUnfSF', 'LotFrontage', 'BedroomAbvGr']

plt.figure(figsize=(15, 12))

# First plot: Scatter plot for EnclosedPorch after filtering
plt.subplot(3, 2, 1)
sns.scatterplot(x=filtered_df['EnclosedPorch'], y=filtered_df['SalePrice'])
plt.title('SalePrice vs EnclosedPorch')
plt.xlabel('EnclosedPorch')
plt.ylabel('SalePrice')

# Second plot: Bar plot for BsmtFinType1
plt.subplot(3, 2, 2)
sns.barplot(x=df['BsmtFinType1'], y=df['SalePrice'], errorbar=None)
plt.title('SalePrice vs BsmtFinType1')
plt.xlabel('BsmtFinType1')
plt.ylabel('SalePrice')

# Third plot: Scatter plot for BsmtUnfSF
plt.subplot(3, 2, 3)
sns.scatterplot(x=df['BsmtUnfSF'], y=df['SalePrice'])
plt.title('SalePrice vs BsmtUnfSF')
plt.xlabel('BsmtUnfSF')
plt.ylabel('SalePrice')

# Fourth plot: Scatter plot for LotFrontage
plt.subplot(3, 2, 4)
sns.scatterplot(x=df['LotFrontage'], y=df['SalePrice'])
plt.title('SalePrice vs LotFrontage')
plt.xlabel('LotFrontage')
plt.ylabel('SalePrice')

# Fifth plot: Bar plot for BedroomAbvGr
plt.subplot(3, 2, 5)
sns.barplot(x=df['BedroomAbvGr'], y=df['SalePrice'], errorbar=None)
plt.title('SalePrice vs BedroomAbvGr')
plt.xlabel('BedroomAbvGr')
plt.ylabel('SalePrice')

plt.tight_layout()
plt.show()


### **Summary of Key Takeaways from Correlation Studies and Scatter Plots**

#### **1. Strongly Correlated Attributes:**
- **OverallQual (Overall Quality):**
  - **Pearson Correlation:** 0.79
  - **Spearman Correlation:** 0.80
  - **Takeaway:** Among the attributes reported above, this one has the highest Pearson correlation with `SalePrice` and, at 0.80, matches `TotalSF` on Spearman. Higher-quality homes tend to sell for significantly higher prices. The bar plot shows a clear upward trend: as the quality rating increases, so does the average sale price.

- **GrLivArea (Above Ground Living Area):**
  - **Pearson Correlation:** 0.71
  - **Spearman Correlation:** 0.73
  - **Takeaway:** The living area square footage is a strong predictor of a higher sale price. Larger homes generally command higher prices. The scatter plot demonstrates a positive linear relationship, with some outliers where large homes do not fetch high prices, potentially due to other factors.

- **TotalSF (Total Square Footage):**
  - **Pearson Correlation:** 0.77
  - **Spearman Correlation:** 0.80
  - **Takeaway:** The total square footage of the house (including basements) is a strong indicator of a higher sale price. The scatter plot shows a clear positive relationship, similar to `GrLivArea`.

- **GarageArea (Garage Size):**
  - **Pearson Correlation:** 0.62
  - **Spearman Correlation:** 0.66
  - **Takeaway:** The size of the garage also contributes positively to a higher sale price, though the correlation is not as strong as `OverallQual` or `GrLivArea`. The scatter plot shows that larger garages tend to be associated with higher sale prices, though there is more variability here.

- **YearBuilt (Year the House was Built):**
  - **Pearson Correlation:** 0.59
  - **Spearman Correlation:** 0.59
  - **Takeaway:** Newer homes tend to sell for higher prices. The scatter plot reveals that while newer homes generally fetch higher prices, there are instances where older homes are valued highly, likely due to factors like location or renovations.

#### **2. Weakly Correlated Attributes:**
- **EnclosedPorch (Enclosed Porch Area):**
  - **Pearson Correlation:** 0.05
  - **Spearman Correlation:** 0.05
  - **Takeaway:** This attribute shows almost no correlation with `SalePrice`. The scatter plot reveals no discernible pattern, suggesting that the enclosed porch area is not a significant factor in determining house prices.

- **BsmtFinType1 (Basement Finish Type):**
  - **Pearson Correlation:** 0.01
  - **Spearman Correlation:** 0.06
  - **Takeaway:** The type of basement finish has a very weak correlation with `SalePrice`. The bar plot indicates that the different types of basement finishes do not have a consistent impact on the sale price, likely due to the varied importance of basements to different buyers.

- **BsmtUnfSF (Unfinished Basement Area):**
  - **Pearson Correlation:** 0.18
  - **Spearman Correlation:** 0.18
  - **Takeaway:** The unfinished square footage in the basement shows a weak correlation with the sale price. The scatter plot reveals some minor positive trends, but overall, this attribute does not heavily influence the sale price.

- **LotFrontage (Linear Feet of Street Connected to Property):**
  - **Pearson Correlation:** 0.26
  - **Spearman Correlation:** 0.29
  - **Takeaway:** Lot frontage has a weak positive correlation with `SalePrice`. The scatter plot shows a lot of variability, suggesting that this attribute alone does not strongly influence house prices.

- **BedroomAbvGr (Number of Bedrooms Above Grade):**
  - **Pearson Correlation:** 0.16
  - **Spearman Correlation:** 0.47
  - **Takeaway:** The number of bedrooms above grade has a moderate Spearman correlation but a weak Pearson correlation, indicating that while more bedrooms may relate to higher sale prices, the relationship is not linear. The bar plot of average sale price per bedroom count suggests that other factors likely influence the sale price more than the number of bedrooms.

### **Conclusion:**
- **Strong Predictors:** Attributes like `OverallQual`, `GrLivArea`, and `TotalSF` are strong predictors of a higher `SalePrice`. These should be central to any predictive modeling or valuation efforts.
  
- **Weak Predictors:** Attributes like `EnclosedPorch`, `BsmtFinType1`, and `BsmtUnfSF` show little to no correlation with `SalePrice` and may have limited utility in predictive models.

- **Mixed Results:** Attributes like `BedroomAbvGr` and `LotFrontage` show some correlation but are not as influential as the top predictors. These may contribute to price under certain conditions but are not as consistently strong.

This analysis highlights the importance of focusing on the attributes most strongly associated with sale price, while recognizing that some features, even if relevant in other contexts, may not significantly influence the price of homes in this dataset. Correlation alone does not establish that an attribute causes higher prices.
