# 🏡 Housing Prices – Regression Model Training

This notebook explores a supervised learning regression task using housing data.  
We compare Decision Tree and Linear Regression models to predict house prices based on numeric features.  
Later, we refine the models with better feature selection and introduce Random Forests to improve generalization.

## 1. Loading Data

We begin by importing libraries, loading the dataset, and performing an initial inspection.

In [None]:
# Import core libraries
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Load the housing dataset
file_path = 'data/housing.csv'  # Adjust if needed
home_data = pd.read_csv(file_path)

# Show dataset shape and preview top rows
print("Shape:", home_data.shape)
home_data.head()

In [None]:
# Display summary statistics
home_data.describe()

In [None]:
# List all available columns
print("Columns:", home_data.columns.tolist())

## 2. Feature Selection

We identify relevant numeric features to use as inputs and define the target variable (`Price`).
Categorical features will be explored in later stages.

In [None]:
# Define the target variable
y = home_data['Price']

# Select numeric features likely to impact house price
feature_names = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Car']
X = home_data[feature_names]

# Preview selected features
X.head()

### Notes

- Selected features are all numeric and directly related to property size or structure.
- These features are simple to work with initially and avoid preprocessing complexity.

## 3. Train/Test Split

We split the dataset into training and validation subsets. Because `test_size` is not set, `train_test_split` uses its default 75/25 ratio.
This allows us to evaluate model performance on unseen data.

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Confirm the split sizes
print("Training set size:", train_X.shape)
print("Validation set size:", val_X.shape)
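As a quick check on the split ratio, the sketch below uses a small synthetic frame (hypothetical data, not the housing dataset) to show that `train_test_split` holds out 25% of rows when `test_size` is not given, and that an 80/20 split needs `test_size=0.2`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: 100 rows, 2 features)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=100),
    "Landsize": rng.uniform(100, 900, size=100),
})
demo_y = pd.Series(rng.uniform(3e5, 2e6, size=100), name="Price")

# Default behaviour: test_size is unset, so 25% of rows are held out
d_train, d_val, _, _ = train_test_split(demo_X, demo_y, random_state=1)

# Explicit 80/20 split
e_train, e_val, _, _ = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=1
)

print("Default split:", len(d_train), len(d_val))
print("80/20 split:  ", len(e_train), len(e_val))
```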

## 4. Decision Tree Regressor

We train a baseline `DecisionTreeRegressor` using the selected numeric features.
Performance is evaluated using Mean Absolute Error (MAE) on the validation set.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Create and train the model
tree_model = DecisionTreeRegressor(random_state=1)
tree_model.fit(train_X, train_y)

# Predict on validation data
tree_preds = tree_model.predict(val_X)

# Evaluate model performance
tree_mae = mean_absolute_error(val_y, tree_preds)
print(f"Decision Tree MAE: {tree_mae:,.0f}")

### Notes

- MAE measures average absolute prediction error.
- With no depth limit, the Decision Tree can fit the training data almost exactly, so it may not generalize well.
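To make the metric concrete, here is a small worked example with made-up prices (illustrative values only). MAE is the mean of |actual - predicted|, and the manual calculation should match `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted prices (not taken from the dataset)
actual = np.array([500_000, 750_000, 1_200_000])
predicted = np.array([550_000, 700_000, 1_000_000])

# MAE = mean(|actual - predicted|)
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)

print(f"Manual MAE:  {manual_mae:,.0f}")
print(f"sklearn MAE: {sklearn_mae:,.0f}")
```

The absolute errors here are 50,000, 50,000, and 200,000, so both lines report 100,000.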

## 5. Linear Regression

We train a `LinearRegression` model after handling any missing values.  
This model assumes linear relationships between input features and the target.

In [None]:
from sklearn.linear_model import LinearRegression

# Drop rows with missing values from training and validation sets
train_X_clean = train_X.dropna()
train_y_clean = train_y.loc[train_X_clean.index]

val_X_clean = val_X.dropna()
val_y_clean = val_y.loc[val_X_clean.index]

# Train Linear Regression model
linreg_model = LinearRegression()
linreg_model.fit(train_X_clean, train_y_clean)

# Predict and evaluate
linreg_preds = linreg_model.predict(val_X_clean)
linreg_mae = mean_absolute_error(val_y_clean, linreg_preds)

print(f"Linear Regression MAE: {linreg_mae:,.0f}")

### Notes

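- Rows with missing values are dropped because `LinearRegression` cannot handle NaN inputs.
- Linear Regression produced a lower MAE than the Decision Tree, which is consistent with less overfitting. The two scores are not strictly comparable, though: the tree was evaluated on every validation row, while Linear Regression was evaluated only on rows without missing values.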
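To compare both models on identical rows, incomplete rows can be dropped before splitting rather than after. The sketch below does this on synthetic data; the frame `demo`, its columns, and the 10% missing rate are assumptions for illustration, not properties of the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption)
rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=n).astype(float),
    "BuildingArea": rng.uniform(50, 300, size=n),
})
demo["Price"] = (150_000 * demo["Rooms"] + 3_000 * demo["BuildingArea"]
                 + rng.normal(0, 50_000, size=n))

# Blank out roughly 10% of BuildingArea to mimic missing values
demo.loc[rng.random(n) < 0.1, "BuildingArea"] = np.nan

# Drop incomplete rows once, before the split, so both models share rows
complete = demo.dropna()
Xc, yc = complete[["Rooms", "BuildingArea"]], complete["Price"]
tr_X, va_X, tr_y, va_y = train_test_split(Xc, yc, random_state=1)

scores = {}
for label, model in [("Decision Tree", DecisionTreeRegressor(random_state=1)),
                     ("Linear Regression", LinearRegression())]:
    model.fit(tr_X, tr_y)
    scores[label] = mean_absolute_error(va_y, model.predict(va_X))
    print(f"{label} MAE on {len(va_X)} shared rows: {scores[label]:,.0f}")
```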

## 6. Model Comparison

We compare validation MAE for both models.  
Lower MAE indicates better average predictive accuracy.

In [None]:
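import os
import matplotlib.pyplot as plt

# Create the output folder if it does not exist so savefig does not fail
os.makedirs('plots', exist_ok=True)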

# Bar chart comparing model performance
mae_values = [tree_mae, linreg_mae]
models = ['Decision Tree', 'Linear Regression']

plt.figure(figsize=(6, 4))
plt.bar(models, mae_values, color='skyblue')
plt.ylabel('Mean Absolute Error')
plt.title('Model Comparison: Validation MAE')
plt.grid(axis='y')
plt.tight_layout()
plt.savefig('plots/housing_model_comparison.png')
plt.show()

### Observations

| Model             | MAE       |
|------------------|-----------|
| Decision Tree     | 417,118   |
| Linear Regression | 319,812   |

- **Linear Regression** scored about 97K lower in validation MAE, likely because a simpler model generalizes better here.
- **Decision Tree** likely overfitted the training data, since the default settings place no limit on tree depth.
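A common way to check for the overfitting suspected above is to compare training and validation MAE, and to see how a size limit such as `max_leaf_nodes` changes the gap. This is a sketch on synthetic data (an assumption, not the notebook's dataset), and the cap of 20 leaves is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: linear signal plus noise (assumption)
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = (100_000 * X_demo[:, 0] + 50_000 * X_demo[:, 1]
          + rng.normal(0, 80_000, size=500))

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)

gaps = {}
for label, leaves in [("unrestricted", None), ("max_leaf_nodes=20", 20)]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=1)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    gaps[label] = (train_mae, val_mae)
    print(f"{label:>18}: train MAE {train_mae:,.0f} | validation MAE {val_mae:,.0f}")
```

A training MAE far below the validation MAE is the usual sign that the tree has memorized noise in the training rows.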

## 7. Observations & Next Steps

### Summary of Initial Models

- **Decision Tree Regressor**
  - Trained on the six selected numeric features with default settings.
  - Validation MAE: ~417K
  - Likely overfitted: sensitive to data noise and outliers.

- **Linear Regression**
  - Trained after dropping rows with missing values.
  - Validation MAE: ~320K
  - Outperformed the tree model, suggesting linear relationships exist in the data.

### Next Steps

- Try refined feature sets to reduce noise.
- Explore Random Forests to balance overfitting and underfitting.
- Experiment with categorical features and data imputation strategies.

## 8. Model Refinement

We now evaluate how different feature sets affect Decision Tree performance.
This helps identify features that may add noise or improve predictive power.

### 8.1 Feature Set Tuning

In [None]:
# Compare performance across multiple feature combinations
feature_sets = {
    "Basic": ['Rooms', 'Bathroom', 'Landsize'],
    "No Landsize": ['Rooms', 'Bathroom', 'BuildingArea'],
    "With Car + YearBuilt": ['Rooms', 'Bathroom', 'Car', 'YearBuilt'],
    "All Available": ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Car']
}

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

mae_results = {}

for name, features in feature_sets.items():
    X = home_data[features]
    y = home_data['Price']
    
    # Drop rows with missing values
    X = X.dropna()
    y = y.loc[X.index]

    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
    
    model = DecisionTreeRegressor(random_state=1)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds)
    
    print(f"{name} MAE: {mae:,.0f}")
    mae_results[name] = mae

In [None]:
# Visualize feature set performance
plt.figure(figsize=(8, 5))
plt.bar(mae_results.keys(), mae_results.values(), color='skyblue')
plt.ylabel("Validation MAE")
plt.title("Feature Set Comparison – Housing Regression")
plt.xticks(rotation=15)
plt.tight_layout()
plt.savefig("plots/housing_feature_set_comparison.png")
plt.show()

#### Observations

| Feature Set              | Validation MAE |
|--------------------------|----------------|
| Basic                    | 444,666        |
| No Landsize              | 398,598        |
| With Car + YearBuilt     | 339,981        |
| All Available            | 372,995        |

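- **"With Car + YearBuilt"** produced the lowest MAE, making it a strong candidate for further modeling.
- **"No Landsize"** outperformed **"Basic"**. Because it swaps `Landsize` for `BuildingArea`, this suggests that `BuildingArea` is the more informative of the two, or that `Landsize` adds noise.
- Each feature set drops a different set of rows in `dropna()`, so the validation sets are not identical across runs.
- More features do not guarantee better performance: a well-chosen subset can outperform a larger set.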

### 8.2 Random Forest Regression

We now apply a `RandomForestRegressor`, an ensemble method that averages multiple decision trees.  
This typically improves generalization and reduces overfitting.
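To illustrate the averaging step, the sketch below fits a small forest on synthetic data (an assumption for illustration) and checks that the forest's prediction equals the mean of its individual trees' predictions, which are exposed through the fitted `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): 200 rows, 4 numeric features
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 10, size=(200, 4))
y_demo = (50_000 * X_demo[:, 0] + 20_000 * X_demo[:, 1]
          + rng.normal(0, 10_000, size=200))

forest = RandomForestRegressor(n_estimators=25, random_state=1)
forest.fit(X_demo, y_demo)

# Each fitted tree predicts on its own; the forest averages those predictions
sample = X_demo[:5]
per_tree = np.stack([tree.predict(sample) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("Forest prediction:", np.round(forest.predict(sample)))
print("Mean of trees:    ", np.round(manual_average))
```

Because each tree is trained on a different bootstrap sample, their individual errors partly cancel when averaged, which is where the variance reduction comes from.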

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Use the best-performing feature set: 'With Car + YearBuilt'
X = home_data[['Rooms', 'Bathroom', 'Car', 'YearBuilt']]
y = home_data['Price']

# Drop missing values
X = X.dropna()
y = y.loc[X.index]

# Split into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Train Random Forest
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)

# Predict and evaluate
rf_preds = rf_model.predict(val_X)
rf_mae = mean_absolute_error(val_y, rf_preds)

print(f"Random Forest MAE: {rf_mae:,.0f}")

In [None]:
# Plot feature importances
importances = rf_model.feature_importances_
features = X.columns

pd.Series(importances, index=features).sort_values().plot(
    kind='barh', title="Random Forest Feature Importances"
)
plt.xlabel("Relative Importance")
plt.tight_layout()
plt.savefig("plots/housing_rf_feature_importance.png")
plt.show()

#### Observations

- **Random Forest MAE:** 277,538, the lowest so far.
- Feature importances show `YearBuilt` and `Car` are strong contributors.
- The ensemble reduced validation MAE by about 62K relative to a single Decision Tree on the same feature set (339,981).

## 9. Conclusion

This notebook developed and refined machine learning models to predict house prices using structured tabular data.

### Summary of Results

| Model               | MAE       |
|--------------------|-----------|
| Decision Tree       | 417,118   |
| Linear Regression   | 319,812   |
| Random Forest       | 277,538   |

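- The **Decision Tree Regressor** overfit the training data, resulting in the highest MAE.
- **Linear Regression** generalized better, lowering MAE by roughly 97K.
- **Random Forest**, with a well-chosen subset of features (`Rooms`, `Bathroom`, `Car`, `YearBuilt`), delivered the best performance overall.
- These scores come from different row subsets (and, for the Random Forest, a different feature set), so the table is indicative rather than a strict like-for-like comparison.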

### Key Learnings

- **Feature selection** plays a critical role: with a Decision Tree, the four-feature subset without `Landsize` and `BuildingArea` scored better than the full set (339,981 vs. 372,995).
- **Simpler models** like Linear Regression can outperform more complex ones when data is clean and relationships are linear.
- **Random Forests** improve generalization and reduce variance through ensemble learning.
- **Visualization** (MAE comparison, feature importance) is essential for interpreting model behavior and guiding refinement.

### Next Steps

- Perform **hyperparameter tuning** (e.g., `n_estimators`, `max_depth`) for Random Forests.
- Explore **categorical encoding** (e.g., `Suburb`, `Type`).
- Try **cross-validation** to ensure robust performance estimates.
- Implement **data imputation** to avoid dropping valuable rows with missing values.
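Several of these steps can be combined in one scikit-learn pipeline. The sketch below is one possible arrangement, not a tuned solution: it uses synthetic data with injected missing values, median imputation via `SimpleImputer`, and a deliberately tiny `GridSearchCV` grid scored with 3-fold cross-validated MAE. The data, the grid values, and the step names (`impute`, `forest`) are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the housing data (assumption)
rng = np.random.default_rng(4)
n = 300
demo_X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=n).astype(float),
    "YearBuilt": rng.integers(1900, 2020, size=n).astype(float),
})
demo_y = (200_000 * demo_X["Rooms"] + 1_000 * (demo_X["YearBuilt"] - 1900)
          + rng.normal(0, 40_000, size=n))

# Inject roughly 15% missing values into YearBuilt
demo_X.loc[rng.random(n) < 0.15, "YearBuilt"] = np.nan

# Impute inside the pipeline so each CV fold learns its own median
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("forest", RandomForestRegressor(random_state=1)),
])

# Small illustrative grid; a real search would cover more values
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [5, None],
}

search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(demo_X, demo_y)

# Scores are negated MAE, so flip the sign for reporting
print("Best parameters:", search.best_params_)
print(f"Cross-validated MAE: {-search.best_score_:,.0f}")
```

Keeping the imputer inside the pipeline means the median is computed from each training fold only, which avoids leaking validation rows into preprocessing.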