## 📂 Dataset Overview: Melbourne Housing Dataset

This dataset contains historical property sales data from Melbourne, Australia, and serves as a medium-complexity, real-world dataset for supervised regression tasks.

**Use Case:** Predict housing prices (`Price`) based on a mix of structural, locational, and historical attributes.

---

### 🔢 Dataset Composition

- **Total observations:** 13,834 rows
- **Total features:** 21 columns
- **Numerical features:** 13
- **Categorical features:** 8
- **Columns with missing values:** 4

| Column         | Missing Values | Data Type   | Notes                                        |
|----------------|----------------|-------------|----------------------------------------------|
| `BuildingArea` | 6,450          | float64     | Nearly half of rows missing; candidate for imputation or column removal |
| `YearBuilt`    | 5,375          | float64     | Likely an important predictor; the imputation strategy matters given the large gap |
| `CouncilArea`  | 1,369          | object      | Categorical; may influence location pricing |
| `Car`          | 62             | float64     | Low volume of missing values; suitable for median imputation |
| `Suburb`       | 0              | object      | No missing values; high-cardinality category |
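
A table like the one above can be produced directly from a DataFrame. The following is a minimal sketch on a tiny made-up frame (the values and the helper name `missing_profile` are illustrative assumptions, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with made-up values, not the real Melbourne data.
toy = pd.DataFrame({
    "BuildingArea": [120.0, np.nan, 95.0, np.nan],
    "YearBuilt": [1990.0, 2005.0, np.nan, 1970.0],
    "Suburb": ["Richmond", "Carlton", "Richmond", "Kew"],
})


def missing_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing count, missing share (%), and dtype for each column."""
    profile = pd.DataFrame({
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "dtype": df.dtypes.astype(str),
    })
    # Columns with the most gaps come first.
    return profile.sort_values("missing", ascending=False)


profile = missing_profile(toy)
print(profile)
```

Reporting the share of missing rows next to the raw count makes it easier to decide between imputing a column and dropping it.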

---

### 🏡 Target Variable

- **Target:** `Price`
- **Type:** Continuous (Regression)
- **Distribution:** Positively skewed (to be confirmed visually in EDA)
- **Potential Issues:**
  - Outliers likely due to luxury properties
  - May require log transformation for linear models
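
To illustrate why a log transformation can help with a positively skewed target, here is a small sketch on synthetic, log-normally distributed prices. The distribution parameters are assumptions chosen for illustration; the real `Price` column may behave differently.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (assumed log-normal), not the real data.
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=13.7, scale=0.5, size=500)), name="Price")

# log1p compresses the long right tail; expm1 is its exact inverse.
log_prices = np.log1p(prices)

skew_raw = prices.skew()
skew_log = log_prices.skew()
print(f"Skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# Predictions made on the log scale must be mapped back before reporting errors in AUD.
restored = np.expm1(log_prices)
```

If a model is trained on `log1p(Price)`, metrics such as MAE should be computed after applying `expm1` to the predictions so they remain interpretable in dollars.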

---

### 🔍 Modeling Challenges Anticipated

- **Missing Data:** Extensive in key features like `BuildingArea` and `YearBuilt`
- **Categorical Encoding:** High-cardinality variables such as `Suburb` and `CouncilArea` require thoughtful encoding (e.g., frequency or target encoding)
- **Feature Correlation:** Potential multicollinearity in size-related features (e.g., `Rooms`, `BuildingArea`, `Car`)
- **Non-linear Patterns:** Likely present; ensemble models like Random Forests are expected to capture these more effectively than linear baselines
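
As one possible reading of the frequency-encoding idea mentioned above, the sketch below replaces each category with its share of rows. The suburb names and the choice to map unseen categories to `0.0` are illustrative assumptions.

```python
import pandas as pd

# Made-up training rows; only the column name comes from the dataset.
train = pd.DataFrame({
    "Suburb": ["Richmond", "Carlton", "Richmond", "Kew", "Richmond", "Carlton"],
})

# Learn category shares from the training data only, to avoid leaking test information.
freq = train["Suburb"].value_counts(normalize=True)

# Replace each category with its relative frequency.
train["Suburb_freq"] = train["Suburb"].map(freq)

# Categories unseen during fitting map to NaN; here they are filled with 0.0 (an assumption).
new_rows = pd.Series(["Kew", "Brunswick"])
new_encoded = new_rows.map(freq).fillna(0.0)

print(train)
print(new_encoded.tolist())
```

Frequency encoding keeps a high-cardinality column as a single numeric feature, whereas one-hot encoding would add one column per category.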

---

### ✅ Summary

This dataset provides an ideal testbed for intermediate machine learning techniques, including:
- Robust preprocessing via pipelines
- Thoughtful imputation strategies
- Feature engineering and dimensionality reduction
- Non-linear modeling with tree-based ensembles
- Evaluation through cross-validation and metric tradeoffs

It is particularly suited to demonstrating the value of human-centered model development, where interpretability, pipeline design, and iterative refinement play key roles in building trustworthy, maintainable ML systems.

---

## 🔍 Initial Data Audit

To build an effective model, it’s critical to begin with an informed understanding of the dataset’s structure, quality, and idiosyncrasies.

Below, we perform a systematic audit of the data using Pandas:

- **Schema inspection** to assess data types and missingness
- **Statistical summaries** to understand scale, spread, and central tendency
- **Missing value profiles** to guide imputation strategy
- **First-look anomaly detection** (e.g., impossible values, likely outliers)

```python
import numpy as np
import pandas as pd
from IPython.display import display

# Load the raw housing data
df = pd.read_csv("../data/housing.csv")

# Schema overview: dtypes and non-null counts per column
df.info()

# display() renders every DataFrame, not just the cell's last expression
print("\n--- Descriptive Statistics ---")
display(df.describe())

print("\n--- First 5 Rows ---")
display(df.head())

# Missing values per column, largest gaps first
print("\n--- Missing Values ---")
display(df.isnull().sum().sort_values(ascending=False))
```

### Key Observations:
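- `BuildingArea` and `YearBuilt` have significant missingness (roughly 47% and 39% of rows respectively, based on the counts in the overview), so dropping incomplete rows would discard a large share of the data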
- `Price` (the target) has no missing values
- Numerical columns like `Landsize`, `Distance`, and `Car` show wide ranges and potential outliers
- Some columns such as `Postcode` or `Method` may require domain interpretation before inclusion
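
A first-look anomaly check along these lines could be sketched as follows. The toy values, the 1.5 × IQR rule, and the year threshold of 1800 are assumptions for illustration, not thresholds taken from the original analysis.

```python
import pandas as pd

# Made-up values that mimic the kinds of anomalies an audit might surface.
toy = pd.DataFrame({
    "Landsize": [450.0, 520.0, 0.0, 610.0, 480.0, 43000.0],
    "YearBuilt": [1990.0, 2005.0, 1196.0, 1970.0, 2012.0, 1985.0],
})


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the interquartile range."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Statistical outliers: unusually small or large land sizes.
toy["landsize_outlier"] = iqr_outliers(toy["Landsize"])

# Implausible values: a hypothetical cutoff for build years (assumption).
toy["yearbuilt_implausible"] = toy["YearBuilt"] < 1800

print(toy)
```

Flagging rows rather than deleting them keeps the decision about how to treat each anomaly explicit and reversible.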

---

## 📉 Baseline Model Recap: Block 3 Performance Reference

This notebook builds upon the foundation established in Block 3, where I created a basic supervised regression model using housing sales data from Melbourne. The primary goal of Block 3 was to load, clean, and model the dataset using straightforward methods with minimal preprocessing. That model now serves as the benchmark for all model improvements in Block 4.

---

### 🔧 Block 3 Workflow Summary

| Step                | Approach                                         |
|---------------------|--------------------------------------------------|
| **Target**          | `Price` (continuous variable)                   |
| **Features Used**   | All numerical predictors (categoricals dropped) |
| **Missing Data**    | Rows with missing values were dropped entirely  |
| **Categorical Data**| Not used; filtered out during preprocessing     |
| **Model**           | `DecisionTreeRegressor` (scikit-learn)          |
| **Validation**      | Hold-out split (train/test)                     |
| **Performance Metric** | Mean Absolute Error (MAE)                   |
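
The workflow in this table can be approximated with the sketch below. It runs on synthetic stand-in data, and the split ratio and `random_state` values are assumptions, since the Block 3 settings are not restated here; it is not the original Block 3 code.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (made-up relationship), not the Melbourne dataset.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Rooms": rng.integers(1, 6, n).astype(float),
    "Distance": rng.uniform(1, 30, n),
    "Suburb": rng.choice(["A", "B", "C"], n),
})
toy["Price"] = (
    200_000 + 150_000 * toy["Rooms"] - 5_000 * toy["Distance"]
    + rng.normal(0, 20_000, n)
)
toy.loc[::10, "Distance"] = np.nan  # inject some missing values

# Block 3-style preprocessing: drop incomplete rows, keep numeric columns only.
clean = toy.dropna()
numeric = clean.select_dtypes(include="number")
X = numeric.drop(columns="Price")
y = numeric["Price"]

# Single hold-out split (test_size and random_state are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Decision tree with default hyperparameters.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Hold-out MAE: {mae:,.0f}")
```

Note how `dropna()` silently shrinks the training set and `select_dtypes` discards `Suburb`; both effects are revisited in the limitations below.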

---

### 📊 Results from Block 3

From the final model cell in Block 3, the model’s predictive performance was:

- **MAE (Mean Absolute Error):** **$2,621.79 AUD**
- **R² Score:** 0.592

This Decision Tree Regressor was built using default hyperparameters and trained only on numerical features, with no imputation, encoding, or scaling. Because rows with missing values were dropped and categorical columns were discarded, the model cannot score incomplete records and ignores location information, so this result is unlikely to generalize well to real-world data pipelines.

---

### 🧠 Lessons & Limitations

- **Overfitting risk:** Decision Trees are prone to high variance without depth constraints or ensemble averaging.
- **No pipeline used:** Preprocessing steps (dropping NaNs, filtering numerics) were not encapsulated in a reproducible structure.
- **Categoricals ignored:** Location-based insights (e.g., `Suburb`, `CouncilArea`) were excluded, missing key predictive signals.
- **No cross-validation:** Only a single test split was used, which limits confidence in performance reliability.

---

### 🎯 Block 4 Goals (Building from Here)

- Integrate robust preprocessing using `Pipeline` and `ColumnTransformer`
- Handle missing values with appropriate imputation strategies
- Include and encode categorical variables
- Introduce ensemble learning with `RandomForestRegressor`
- Evaluate via cross-validation and multiple metrics (MAE, RMSE, R²)
- Track performance improvements in a structured comparison report
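
A rough sketch of how these goals might fit together is shown below. It uses synthetic data with placeholder category codes, and the imputation strategy, encoder, `n_estimators`, and fold count are assumptions rather than the final Block 4 configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data; "m1".."m3" are placeholder codes, not real values.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, n).astype(float),
    "Car": rng.integers(0, 4, n).astype(float),
    "Method": rng.choice(["m1", "m2", "m3"], n),
})
y = 200_000 + 150_000 * X["Rooms"] + 30_000 * X["Car"] + rng.normal(0, 20_000, n)
X.loc[::15, "Car"] = np.nan  # inject missing values after computing the target

numeric_cols = ["Rooms", "Car"]
categorical_cols = ["Method"]

# Impute numeric gaps with the median; one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Preprocessing and model live in one object, so each CV fold refits both.
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])

scores = cross_validate(
    model, X, y, cv=5,
    scoring={
        "mae": "neg_mean_absolute_error",
        "rmse": "neg_root_mean_squared_error",
        "r2": "r2",
    },
)

# scikit-learn negates error metrics so that higher is always better.
mae = -scores["test_mae"].mean()
rmse = -scores["test_rmse"].mean()
r2 = scores["test_r2"].mean()
print(f"MAE: {mae:,.0f}  RMSE: {rmse:,.0f}  R2: {r2:.3f}")
```

Wrapping the imputer and encoder inside the pipeline means they are fitted only on each training fold, which avoids leaking information from the validation fold into preprocessing.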

---