# **EDA**

## Objectives

* Answer Business Requirement 1: “The client is interested to understand the patterns from the house sales dataset, to learn the most relevant variables that are correlated to house sale prices.”
    * Explore the main patterns in the housing dataset.
    * Identify the most relevant variables correlated with house sale prices.
    * Generate visualizations to support insights.
    * Prepare insights for use in the Streamlit app answering Business Requirement 1.

## Inputs

* `outputs/datasets/collection/house_prices_records.csv`: cleaned and curated dataset with house sale records.
* `outputs/datasets/collection/inherited_houses.csv`: dataset containing inherited properties the client owns.

## Outputs

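* Printed lists of the variables most correlated with `SalePrice`, using Pearson and Spearman correlation.
* Visualizations (scatter plots, box plots) of the most correlated features.
* Optional parallel categories plot for multidimensional categorical visualization.
* `outputs/datasets/cleaned/df_main_for_cleaning.csv`: copy of the main dataset saved for the Data Cleaning Notebook.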

## Additional Comments
- We decided not to combine the two CSVs. Instead, we analyze the main dataset (`house_prices_records.csv`) and keep `inherited_houses.csv` for prediction later.

---

# Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Data

In [None]:
import pandas as pd

# Load the pre-processed datasets from outputs
df_main = pd.read_csv("outputs/datasets/collection/house_prices_records.csv")
df_client = pd.read_csv("outputs/datasets/collection/inherited_houses.csv")

print("🏠 Main dataset:")
display(df_main.head())

print("🏘️ Inherited properties:")
display(df_client.head())

# Data Exploration

In [None]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df_main, minimal=True)
profile.to_notebook_iframe()

# Temporary Encoding of Categorical Features

In [None]:
from sklearn.preprocessing import LabelEncoder

df_encoded = df_main.copy()
categorical_cols = df_encoded.select_dtypes(include=['object']).columns

le = LabelEncoder()
for col in categorical_cols:
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
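
A caveat on this temporary encoding: label encoding assigns integer codes in sorted label order, so a correlation computed on those codes depends on how the category names happen to sort. The sketch below illustrates this on toy data. The prices and the `quality_order` mapping are hypothetical, and pandas category codes are used as a stand-in because they follow the same sorted order as `LabelEncoder`.

```python
import pandas as pd

# Toy data (hypothetical values): a quality grade and a price that rises with quality.
toy = pd.DataFrame({
    "KitchenQual": ["Fa", "TA", "Gd", "Ex", "TA", "Gd"],
    "SalePrice": [100_000, 150_000, 210_000, 320_000, 140_000, 200_000],
})

# Assumed domain order from worst to best (not taken from the dataset documentation).
quality_order = {"Fa": 0, "TA": 1, "Gd": 2, "Ex": 3}

codes = pd.DataFrame({
    # Sorted label order: Ex=0, Fa=1, Gd=2, TA=3
    "alphabetical": toy["KitchenQual"].astype("category").cat.codes,
    # Codes that respect the assumed quality ranking
    "ordinal": toy["KitchenQual"].map(quality_order),
    "SalePrice": toy["SalePrice"],
})

spearman = codes.corr(method="spearman")["SalePrice"]
print(spearman.round(2))
```

On this toy data the alphabetical codes give a negative rank correlation while the ordinal codes give a strongly positive one, so correlations for label-encoded categorical columns are best read as rough indications only.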


# Correlation Study

In [None]:
# Pearson correlation
corr_pearson = df_encoded.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:11]
print("📊 Top Pearson correlations:\n", corr_pearson)

# Spearman correlation
corr_spearman = df_encoded.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:11]
print("📊 Top Spearman correlations:\n", corr_spearman)
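
Both methods are used because they measure different things: Pearson captures linear association, while Spearman captures any monotonic association through ranks. The following sketch, on synthetic data that is not part of the housing dataset, shows how the two can disagree for a monotonic but nonlinear relationship.

```python
import numpy as np
import pandas as pd

# Synthetic data: one linear and one exponential function of x.
x = np.arange(1, 11)
toy = pd.DataFrame({
    "x": x,
    "y_linear": 3 * x + 2,
    "y_exp": np.exp(x),
})

pearson = toy.corr(method="pearson")["x"]
spearman = toy.corr(method="spearman")["x"]

# Both methods agree on the linear column; only Spearman stays at 1 for the exponential one.
print(pd.DataFrame({"pearson": pearson, "spearman": spearman}).round(3))
```

A variable that ranks noticeably higher under Spearman than under Pearson may therefore have a curved relationship with `SalePrice`.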

In [None]:
# Combine top variables from both methods
top_n = 5
top_vars = list(set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list()))
print("Variables to investigate:", top_vars)
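
Note that `set` does not keep the ranking order, so the printed list may come out in a different order between runs. A minimal alternative, assuming a stable order is wanted, is to deduplicate with `dict.fromkeys`. The correlation values below are hypothetical placeholders, not results from this notebook.

```python
import pandas as pd

# Hypothetical correlation rankings, already sorted from strongest to weakest.
corr_a = pd.Series({"OverallQual": 0.79, "GrLivArea": 0.71, "GarageArea": 0.62})
corr_b = pd.Series({"OverallQual": 0.81, "GrLivArea": 0.73, "YearBuilt": 0.65})

top_n = 3
candidates = corr_a.index[:top_n].tolist() + corr_b.index[:top_n].tolist()

# dict keys keep first-seen order, so duplicates are dropped without reshuffling.
ordered_vars = list(dict.fromkeys(candidates))
print(ordered_vars)
```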

## Visualising Relationships with SalePrice

## Scatter & Box Plots for Top Features

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')

def plot_categorical(df_main, col, target_var):
    plt.figure(figsize=(12, 5))
    sns.boxplot(data=df_main, x=col, y=target_var)
    plt.xticks(rotation=90)
    plt.title(f"{col} vs {target_var}")
    plt.show()

def plot_numerical(df_main, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.scatterplot(data=df_main, x=col, y=target_var)
    plt.title(f"{col} vs {target_var}")
    plt.show()

target_var = 'SalePrice'

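for col in top_vars:
    # Check the dtype in df_main: df_encoded is fully numeric after label
    # encoding, so it would never select the box plot branch
    if df_main[col].dtype == 'object':
        plot_categorical(df_main, col, target_var)
    else:
        plot_numerical(df_main, col, target_var)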
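
The choice between a box plot and a scatter plot depends on each column's dtype in the raw data. As a small sketch of that dispatch, the hypothetical helper `split_by_kind` below separates columns with `pandas.api.types.is_numeric_dtype`. The toy values are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def split_by_kind(df, cols):
    """Return (numeric_cols, categorical_cols) based on the dtypes in df."""
    numeric = [c for c in cols if is_numeric_dtype(df[c])]
    categorical = [c for c in cols if not is_numeric_dtype(df[c])]
    return numeric, categorical

# Toy frame with one numeric and one text column (hypothetical values).
toy = pd.DataFrame({
    "GrLivArea": [1200, 1710, 2198],
    "KitchenQual": ["TA", "Gd", "Ex"],
    "SalePrice": [140_000, 208_500, 250_000],
})

numeric_cols, categorical_cols = split_by_kind(toy, ["GrLivArea", "KitchenQual"])
print("scatter plots:", numeric_cols)
print("box plots:", categorical_cols)
```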

---

# Parallel Plot

In [None]:
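import numpy as np
import plotly.express as px
from feature_engine.discretisation import ArbitraryDiscretiser

# Bin 'OverallQual' if present; otherwise fall back to the first top variable
var_to_bin = 'OverallQual' if 'OverallQual' in df_encoded.columns else top_vars[0]

# Bin edges (np.inf replaces np.Inf, which was removed in NumPy 2.0)
quality_map = [-np.inf, 4, 6, 8, np.inf]
disc = ArbitraryDiscretiser(binning_dict={var_to_bin: quality_map})

# Plot the top variables, making sure the binned variable and the target are included
plot_cols = list(dict.fromkeys(top_vars + [var_to_bin, 'SalePrice']))
df_parallel = disc.fit_transform(df_encoded[plot_cols].copy())

# Rename bins; intervals are assumed right-closed (pandas.cut default),
# so for integer scores bin 0 holds values up to and including 4
labels_map = {
    0: "<=4", 1: "5-6", 2: "7-8", 3: "9+"
}
df_parallel[var_to_bin] = df_parallel[var_to_bin].replace(labels_map)

fig = px.parallel_categories(df_parallel, color="SalePrice", color_continuous_scale='Viridis')
fig.show(renderer='notebook')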
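
To make the binning step easier to check, the sketch below applies the same edges to the integer scores 1 to 10 with `pd.cut`. This is a stand-in for `ArbitraryDiscretiser`, on the assumption that it behaves like `pd.cut` with right-closed intervals and integer bin codes; the scores are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic quality scores 1..10 and the same edges as quality_map.
scores = pd.Series(range(1, 11), name="OverallQual")
edges = [-np.inf, 4, 6, 8, np.inf]

# labels=False returns the integer bin index; intervals are right-closed by default,
# so a score of 4 falls in bin 0 and a score of 8 falls in bin 2.
bin_codes = pd.cut(scores, bins=edges, labels=False)

print(pd.DataFrame({"score": scores, "bin": bin_codes}))
```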

---

# Conclusion

## Conclusions and Next Steps

- We identified strong correlations between several numerical features and `SalePrice`, such as `OverallQual`, `GrLivArea`, and `GarageArea`.
- Several categorical variables like `Neighborhood`, `KitchenQual`, and `GarageFinish` are likely important but need proper encoding.
- We will now move to the **Data Cleaning Notebook**, where we will:
  - Handle missing values
  - Drop or transform irrelevant or problematic features
  - Prepare the dataset for modeling

Outputs from this notebook:
- A good understanding of variable relationships
- `df_main` ready for cleaning and transformation in the next step


---

# Push files to Repo

In [None]:
import os

try:
    os.makedirs(name='outputs/datasets/cleaned', exist_ok=True)
except Exception as e:
    print(e)

# Save the main dataset (not combined with inherited_houses.csv) to be cleaned in the next notebook
df_main.to_csv("outputs/datasets/cleaned/df_main_for_cleaning.csv", index=False)