[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%2011%20Notebooks/GDAN%205400%20-%20Week%2011%20Notebooks%20%28I%29%20-%20Hyperparameter%20Tuning.ipynb)

This notebook provides a mini-tutorial on *hyperparameter tuning*.

---

# Read in the Usual Packages and Set Up Environment

In [1]:
import numpy as np
import pandas as pd

#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)  #Set PANDAS to show all columns in DataFrame
pd.set_option('max_colwidth', 500)

# Load the Housing Prices `Training` Dataset
I have uploaded the training and test datasets onto the class GitHub repository.

In [2]:
train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Housing_Prices/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

# of rows in training dataset: 1460 



Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500


# Fill in Missing Values for `LotFrontage`
- The `LotFrontage` column contains missing values that must be filled before modeling.  
- Use the **median** value to replace missing values, as it is less affected by outliers.  
- After filling in the missing values, verify that `LotFrontage` no longer has any missing entries.  

In [5]:
print("Missing values in LotFrontage column:", train["LotFrontage"].isnull().sum())

Missing values in LotFrontage column: 0


In [6]:
train['LotFrontage'] = train['LotFrontage'].fillna(train["LotFrontage"].median())
print("Missing values in LotFrontage column:", train["LotFrontage"].isnull().sum())

Missing values in LotFrontage column: 0


# Prepare the Data for Modeling
- Select the **predictor variables (`X`)** and the **target variable (`y`)**.  
- Set `LotArea`, `LotFrontage`, `YearBuilt`, `1stFlrSF`, and `2+ Car Garage` as the features for prediction (`X`).  
- Split the data into **training (`X_train, y_train`)** and **validation (`X_val, y_val`)** sets using a standard 80/20 split.  
- Set `random_state=42` in your `train_test_split` command to ensure reproducibility.  

---

I will show the process here on four of the variables from the assignment. 

In [7]:
features = ['LotArea', 'LotFrontage', 'YearBuilt', '1stFlrSF']
train[features].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
LotArea,1460.0,10516.828082,9981.264932,1300.0,7553.5,9478.5,11601.5,215245.0
LotFrontage,1460.0,69.863699,22.027677,21.0,60.0,69.0,79.0,313.0
YearBuilt,1460.0,1971.267808,30.202904,1872.0,1954.0,1973.0,2000.0,2010.0
1stFlrSF,1460.0,1162.626712,386.587738,334.0,882.0,1087.0,1391.25,4692.0


In [8]:
X = train[features]
y = train['SalePrice']
print(X.shape, y.shape)

# Splitting training data into train and validation sets
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Print out number of rows and columns in each of the four dataframes we just generated
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(1460, 4) (1460,)
(1168, 4) (292, 4) (1168,) (292,)


# Train and Evaluate Different Models

Here I will expand the code shown in Week 10 by looping over *8* different models. We will test multiple machine learning models to compare their performance in predicting house prices. Instead of relying only on linear regression, we are evaluating a mix of **linear models (Linear Regression, Ridge, Lasso, ElasticNet), tree-based models (DecisionTree, RandomForest, XGBoost), and a kernel-based model (SVR)**.  

By training each model on the same dataset and computing the **Root Mean Squared Logarithmic Error (RMSLE)** for validation predictions, we can determine which model generalizes best. This process helps us **identify the most accurate and robust approach** for this specific problem, guiding model selection for final predictions.  

We will be using the same **train-validation split and features** (`LotArea`, `LotFrontage`, `YearBuilt`, `1stFlrSF`), so we will not re-run those parts of the code.  
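
To make the metric concrete, here is a minimal sketch of RMSLE computed by hand and checked against scikit-learn's `mean_squared_log_error`. The prices are made-up illustrative values, not rows from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up prices, for illustration only
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 120000.0, 300000.0])

# RMSLE by hand: square root of the mean squared difference of log(1 + value)
rmsle_manual = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Same quantity via scikit-learn
rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true, y_pred))

print(rmsle_manual, rmsle_sklearn)
```

Because the errors are measured on a log scale, RMSLE reflects relative rather than absolute misses: being off by \$10,000 on a \$100,000 house costs more than being off by \$10,000 on a \$500,000 house.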

---


We will test the following **eight machine learning models**:  

#### **1️⃣ Linear Regression (The Straight-Line Approach)**  
- **How it works**: Assumes that house prices change in a **straight-line relationship** with features (e.g., if `YearBuilt` goes up, price increases by a fixed amount).  
- **Pros**: Simple, interpretable.  
- **Cons**: Can't capture complex patterns.  

---

#### **2️⃣ Ridge Regression (Prevents Overfitting)**  
- **How it works**: A linear regression model that **prevents overfitting** by reducing extreme coefficient values.  
- **Why it's useful**: Helps stabilize predictions when features are highly correlated.  

---

#### **3️⃣ Lasso Regression (Feature Selection Model)**  
- **How it works**: Similar to Ridge Regression, but **removes less important features** by shrinking some coefficients to zero.  
- **Why it's useful**: Automatically selects the most important features, simplifying the model.  
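
As a rough illustration of the difference between Ridge and Lasso, the sketch below fits both to synthetic data in which only the first of five features actually matters. The data, the `alpha` value of 0.5, and the variable names are assumptions chosen for the demonstration, not part of the housing analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # five synthetic features
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 drives y

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Coefficients set exactly to zero by Lasso:", int(np.sum(lasso.coef_ == 0)))
```

Ridge keeps every feature with a small, non-zero weight, while Lasso drops the irrelevant ones entirely.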

---

#### **4️⃣ ElasticNet (Balanced Regularization Model)**  
- **How it works**: A combination of **Lasso and Ridge Regression** that balances feature selection and coefficient shrinkage.  
- **Why it's useful**: Helps when **some features should be removed** while others need **regularization**.  

---

#### **5️⃣ Decision Tree (The Rule-Based Approach)**  
- **How it works**: Think of this model as a series of **Yes/No questions** that split the data into groups based on features.  
  - Example: *Is the house built after 2000?* → If yes, go to the next rule.  
- **Why it's useful**: Can handle **non-linear relationships** in the data.  
- **Cons**: Can overfit if the tree is too deep.  

---

#### **6️⃣ Random Forest (The Team Decision Tree)**  
- **How it works**: Instead of using just one decision tree, this model **combines multiple decision trees** and takes the average of their predictions.  
- **Why it's useful**: More stable, less prone to overfitting than a single tree, and captures **complex relationships** in the data.  
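
The averaging step can be checked directly. This sketch, on synthetic data (an assumption for illustration), predicts with each fitted tree in `forest.estimators_` and compares the mean with what `forest.predict` returns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))                                   # synthetic features
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)   # synthetic target

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predict with each tree separately, then average across trees
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("Average of individual trees:", np.round(manual_average, 3))
print("forest.predict:             ", np.round(forest.predict(X[:5]), 3))
```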

---

#### **7️⃣ XGBoost (The Smartest Tree Model)**  
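- **How it works**: Like Random Forest, it combines many trees, but it builds them **one after another**: each new tree is trained to correct the errors left by the trees before it, so the model **learns from mistakes** step by step.  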
- **Why it's useful**: Often one of the **most powerful models** for structured data.  
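
The "learning from mistakes" idea can be sketched with plain scikit-learn trees. This is a deliberately simplified illustration of boosting on squared error, not XGBoost's actual algorithm (which adds regularization and other refinements); the data, `learning_rate`, and number of rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())            # start from a constant guess
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction                    # the current "mistakes"
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X) # nudge predictions toward the target
    mse_history.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting:", round(mse_history[0], 4))
print("Training MSE after 20 rounds:", round(mse_history[-1], 4))
```

Each small tree fixes part of what the previous trees got wrong, which is why `n_estimators` and `learning_rate` are usually tuned together.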

---

#### **8️⃣ Support Vector Regression (SVR)**  
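- **How it works**: Instead of penalizing every error, SVR fits a function with a **tolerance band (margin)** around it: errors that fall inside the band are ignored, and only points outside it influence the fit.  
- **Why it's useful**: With a nonlinear kernel (such as `rbf`), it can capture **nonlinear relationships** that standard regression cannot.  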
- **Cons**: Can be slower on large datasets and performed worst in our analysis.  

---


#### **📊 Summary of Models**
| Model | Purpose |
|-------|---------|
| **Linear Regression** | Baseline model, assumes a linear relationship | 
| **Ridge Regression** | Prevents overfitting by **shrinking coefficients** |
| **Lasso Regression** | Shrinks **and removes** irrelevant features | 
| **ElasticNet** | Balances **feature selection (L1) and shrinkage (L2)** | 
| **Decision Tree** | Splits data using **rule-based conditions** | 
| **Random Forest** | Uses **multiple decision trees** to improve stability | 
| **XGBoost** | Learns from **previous mistakes** to improve predictions | 
| **SVR** | Uses a **flexible margin** instead of a single line | 

---


## Key Question: Which Model Will Perform Best?

---


### Code Explanation

1. **Define a dictionary of models (`models`)**  
   - Several regression models are stored in a dictionary with their names as keys and model objects as values.  
   - Models include **Linear Regression, Ridge, Lasso, ElasticNet, Decision Tree, Random Forest, XGBoost, and SVR**.  

2. **Create an empty list (`results`)**  
   - This list will store the performance of each model.  

3. **Loop through each model, train it, and evaluate it**  
   - For each model:  
     - It is trained using `X_train` and `y_train`.  
     - It makes predictions on `X_val` (validation data).  
     - The **Root Mean Squared Logarithmic Error (RMSLE)** is calculated to measure model performance.  
     - The model's name and its RMSLE score are added to the `results` list.  

4. **Convert results into a Pandas DataFrame (`results_df`)**  
   - The results are stored in a DataFrame and sorted in **ascending order by RMSLE**, so the best-performing model appears first.  
   
#### **Why Are We Doing This?**  
This approach allows us to **compare multiple models efficiently** and determine which one gives the best predictions for house prices. It helps us make an informed decision on **which model to use in the final analysis**. 

In [21]:
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR


from sklearn.metrics import mean_squared_log_error

# Define RMSLE scoring function
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(y_pred, 0)))  # Ensure predictions are non-negative

# Define models with default parameters
models = {
    'LinearRegression': LinearRegression(),
    'Ridge': RidgeCV(),
    'LASSO': LassoCV(),
    'ElasticNet': ElasticNetCV(),      
    'DecisionTree': DecisionTreeRegressor(), 
    'RandomForest': RandomForestRegressor(),
    'XGBoost': XGBRegressor(), 
    'SVR': SVR(),                   
}

# Create empty list for storing results
results = []

# Train each model in a loop, saving the model name and RMSLE score into the results list
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    rmsle_score = rmsle(y_val, y_pred)
    results.append({'Model': name, 'RMSLE': rmsle_score})

# Convert results to DataFrame; sort by RMSLE and output
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('RMSLE')
results_df

Training LinearRegression...
Training Ridge...
Training LASSO...
Training ElasticNet...
Training DecisionTree...
Training RandomForest...
Training XGBoost...
Training SVR...


Unnamed: 0,Model,RMSLE
5,RandomForest,0.229732
6,XGBoost,0.231524
0,LinearRegression,0.274103
1,Ridge,0.274103
2,LASSO,0.283017
4,DecisionTree,0.295314
3,ElasticNet,0.381148
7,SVR,0.432187


# Assessment

Above we can see that the `RandomForest` model performs best, with an *RMSLE* score of 0.229732.

Building on what we discussed in Weeks 9 and 10, though, we can improve on these scores in several ways. One of the methods you should try is `hyperparameter tuning`.

---

# What are Hyperparameters?

Hyperparameters are adjustable settings in a machine learning model that control how it learns from data and must be tuned to optimize performance.

Think of **hyperparameters** like the settings on a washing machine. When you do laundry, you choose settings like **water temperature, spin speed, and cycle time** to get the best results. These settings don’t change during the wash, but they affect how well your clothes get cleaned.  

Similarly, in machine learning, hyperparameters are **adjustable settings** that control how a model learns from data. They aren’t learned automatically but need to be chosen carefully to make the model work well. For example:  
- In a decision tree, you might set **how deep the tree can grow** (like choosing a wash cycle length).  
- In Ridge or Lasso regression, you adjust **how much to penalize complexity** (like deciding how much detergent to use).  

Finding the right hyperparameters helps the model make better predictions—just like picking the right wash settings keeps your clothes clean and fresh!
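
A small sketch of the distinction, using synthetic data (an assumption for illustration): `max_depth` is the setting we choose before training, while the tree that actually gets grown, including its depth and number of leaves, is what the model learns from the data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

grown = {}
for depth in [2, 5, None]:   # None lets the tree keep splitting until it cannot
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    # max_depth is set by us; the structure below is learned during fit()
    grown[depth] = (tree.get_depth(), tree.get_n_leaves())
    print(f"max_depth={depth}: grown depth={grown[depth][0]}, leaves={grown[depth][1]}")
```

The unrestricted tree grows far more leaves than the shallow ones, which is the overfitting risk that tuning `max_depth` is meant to control.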


---

## **Brief Overview of Each of the 8 Models**
Below is a summary of each model, including **pros, cons, and hyperparameters**.

### **1️⃣ Linear Models**
---
#### **🔹 Linear Regression**
✅ **Pros:** Simple, interpretable, efficient for small datasets.  
❌ **Cons:** Assumes a linear relationship, sensitive to outliers.  
🔧 **Hyperparameters:** None that typically need tuning.

#### **🔹 Ridge Regression (L2 Regularization)**
✅ **Pros:** Handles multicollinearity, prevents overfitting.  
❌ **Cons:** Doesn’t eliminate irrelevant features, just shrinks them.  
🔧 **Hyperparameter:** `alpha` (higher values shrink coefficients more).  

#### **🔹 LASSO Regression (L1 Regularization)**
✅ **Pros:** Feature selection by forcing some coefficients to zero.  
❌ **Cons:** Can be unstable if features are highly correlated.  
🔧 **Hyperparameter:** `alpha` (higher values remove more features).  

#### **🔹 ElasticNet (L1 + L2 Regularization)**
✅ **Pros:** Combines Ridge and LASSO, balances shrinkage & feature selection.  
❌ **Cons:** Requires tuning two parameters.  
🔧 **Hyperparameters:** `alpha` (regularization strength), `l1_ratio` (mix of L1 vs. L2).

---

### **2️⃣ Tree-Based Models**
---
#### **🔹 Decision Tree Regressor**
✅ **Pros:** Captures non-linearity, interpretable.  
❌ **Cons:** Easily overfits if `max_depth` is too large.  
🔧 **Hyperparameters:** `max_depth`, `min_samples_split` (controls splits).

#### **🔹 Random Forest Regressor**
✅ **Pros:** Reduces overfitting, works well on structured data.  
❌ **Cons:** Slower than linear models, hard to interpret.  
🔧 **Hyperparameters:** `n_estimators` (number of trees), `max_depth`.

---

### **3️⃣ Gradient Boosting Models**
---
#### **🔹 XGBoost (Extreme Gradient Boosting)**
✅ **Pros:** Handles missing values, strong performance in Kaggle competitions.  
❌ **Cons:** Can overfit if not tuned properly.  
🔧 **Hyperparameters:** `n_estimators`, `learning_rate`, `max_depth`.

Other gradient-boosting models include `LightGBM` (Light Gradient Boosting Machine) and `CatBoost` (Categorical Boosting).

---

### **4️⃣ Support Vector Regression (SVR)**
---
#### **🔹 SVR (Support Vector Machine for Regression)**
✅ **Pros:** Works well for small datasets with complex relationships.  
❌ **Cons:** Slow for large datasets, sensitive to outliers.  
🔧 **Hyperparameters:** `C` (regularization), `epsilon` (tolerance), `kernel` (`linear`, `rbf`).

---




# How Can We Incorporate Hyperparameter Tuning?

You could do this manually: set the parameters, run the code, evaluate RMSLE, reset the parameters, re-run the code, and so on. But that is inefficient. A more systematic approach is to use `GridSearchCV`.

---

`GridSearchCV` automates this search using **cross-validation**. Specifically, it tries every combination of the hyperparameter values you list, scores each combination with cross-validation, and selects the best one(s) for each model.

GridSearchCV significantly improves the comparison by:

1. **Finding optimal hyperparameters** for each model rather than using default or manually selected values
2. **Providing more robust evaluation** through cross-validation instead of a single validation split
3. **Ensuring fair comparison** between models at their best performance
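
To show what the search does behind the scenes, the sketch below scores each candidate by hand with `ParameterGrid` and `cross_val_score`, then compares the result with `GridSearchCV`. The synthetic data and the choice of `Ridge` with `neg_mean_squared_error` are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=150)

param_grid = {"alpha": [0.1, 1, 10, 100]}

# Manual version: score every candidate with 5-fold cross-validation
manual_scores = {}
for params in ParameterGrid(param_grid):
    scores = cross_val_score(Ridge(**params), X, y, cv=5, scoring="neg_mean_squared_error")
    manual_scores[params["alpha"]] = scores.mean()
best_alpha_manual = max(manual_scores, key=manual_scores.get)

# GridSearchCV does the same bookkeeping, then refits the best candidate on all of X, y
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error").fit(X, y)

print("Manual best alpha:      ", best_alpha_manual)
print("GridSearchCV best alpha:", grid.best_params_["alpha"])
```

The number of model fits is the number of parameter combinations times the number of folds, so large grids get expensive quickly.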

Below is a modified version of our loop on these 8 models that incorporates a number of key changes.


### 🔧 1. **Model Tuning with GridSearchCV**

#### 🔍 What’s new:
- Uses `GridSearchCV` for all models *except* `LinearRegression`.
- Specifies **explicit hyperparameter grids** for each model.

#### 💡 Why it matters:
- This lets you **systematically search** for the best parameters using cross-validation, leading to better performance and fairer comparisons.

```python
GridSearchCV(model, param_grid=params, scoring=rmsle_scorer, cv=5, n_jobs=-1)
```

---

### 🧠 2. **Custom Scoring Metric with `make_scorer`**

#### 🔍 What’s new:
```python
from sklearn.metrics import make_scorer
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)
```

#### 💡 Why it matters:
- `GridSearchCV` expects a scoring function. Wrapping your `rmsle()` into `make_scorer()` allows seamless integration while signaling that **lower is better**.
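
One consequence of `greater_is_better=False` is easy to miss: the scorer returns the *negated* RMSLE, so that "higher is better" still holds inside scikit-learn. The sketch below demonstrates this on synthetic data (an assumption for illustration), reusing the same `rmsle` definition as the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_log_error

def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(y_pred, 0)))

rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(100, 2))
y = 50 + 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=2, size=100)  # positive targets

model = LinearRegression().fit(X, y)

plain_rmsle = rmsle(y, model.predict(X))
scorer_value = rmsle_scorer(model, X, y)   # scorers are called as scorer(estimator, X, y)

print("rmsle():      ", plain_rmsle)
print("rmsle_scorer: ", scorer_value)
```

For the same reason, a `best_score_` reported by a grid search that uses this scorer is a negative number; multiply it by -1 to read it as an RMSLE.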

---

### 🧼 3. **Cleaner Model Dictionary with Hyperparameters**

#### 🔍 What’s new:
Each model entry is now a **tuple** of:
```python
(model_instance, param_grid)
```

Example:
```python
'Ridge': (Ridge(), {'alpha': [0.1, 1, 10, 100]})
```

#### 💡 Why it matters:
- This is a scalable and clean way to organize models and their tuning spaces.
- It allows your loop to easily apply `GridSearchCV` when a param grid is present.

---

### 📊 4. **Storing Best Parameters**

#### 🔍 What’s new:
After model training, your code also stores the **best hyperparameters** found by the grid search:

```python
best_params = grid_search.best_params_
```

#### 💡 Why it matters:
- This gives visibility into *why* a model performed well and can guide future choices.

---

### 🛑 5. **Suppressing Warnings for Cleaner Output**

```python
import warnings
warnings.filterwarnings("ignore")
```

#### 💡 Why it matters:
- Keeps the console output tidy — especially helpful when doing lots of cross-validation with verbose models like XGBoost or SVR.
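
Silencing everything can also hide messages you would want to see. As a hedged alternative, this sketch suppresses only one category of warning inside a limited block; `noisy_step` is a hypothetical stand-in for a model fit, and `record=True` is used here only so the example can show which warnings got through.

```python
import warnings
from sklearn.exceptions import ConvergenceWarning

def noisy_step():
    # Hypothetical stand-in for a model fit that emits two kinds of warnings
    warnings.warn("solver did not converge", ConvergenceWarning)
    warnings.warn("something else worth reading", RuntimeWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                                 # record everything...
    warnings.filterwarnings("ignore", category=ConvergenceWarning)  # ...except this category
    noisy_step()

print([w.category.__name__ for w in caught])
```

Filters set inside `warnings.catch_warnings()` are undone when the block ends, so the rest of the notebook keeps its normal warning behavior.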

---

### ✅ Summary Table of Improvements

| Feature                      | Original Code     | New Code      | Benefit |
|-----------------------------|--------------|---------------|---------|
| Hyperparameter tuning       | ❌ Default only | ✅ `GridSearchCV` | Optimizes performance |
| Custom scoring for CV       | ❌ No         | ✅ `make_scorer(rmsle)` | Enables RMSLE as CV metric |
| Param grid per model        | ❌ No         | ✅ Yes         | Scalable and readable |
| Best parameters tracking    | ❌ No         | ✅ Yes         | Transparent model selection |
| Warning suppression         | ❌ No         | ✅ Yes         | Cleaner output |

---

### Key Takeaways

> In the original code, you’re using each model with its default settings. That’s a good starting point, but models like Ridge, Lasso, Random Forest, and XGBoost have hyperparameters that greatly affect performance.
>
> The new version shows how to tune each model more effectively using `GridSearchCV`. It also uses a custom error metric (RMSLE), so you’re measuring what matters most for your specific prediction problem.
>
> This approach is a more professional, production-grade template that you can use in real-world machine learning work!

In [66]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.metrics import mean_squared_log_error, make_scorer

import warnings
# Suppress warnings
warnings.filterwarnings("ignore")


# Define RMSLE scoring function
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(y_pred, 0)))  # Ensure predictions are non-negative

rmsle_scorer = make_scorer(rmsle, greater_is_better=False)  # Lower is better for error metrics


# Define models and hyperparameter grids
models = {
    'LinearRegression': (LinearRegression(), {}),
    # Regularized Linear Models
    'Ridge': (Ridge(), {'alpha': [0.1, 1, 10, 100]}),
    'Lasso': (Lasso(), {'alpha': [0.0001, 0.001, 0.01, 0.1, 1]}),
    'ElasticNet': (ElasticNet(), {'alpha': [0.0001, 0.001, 0.01, 0.1, 1], 'l1_ratio': [0.1, 0.5, 0.9]}),
    # Tree-based Models
    'DecisionTree': (DecisionTreeRegressor(), {'max_depth': [3, 5, 10, 20], 'min_samples_split': [2, 5, 10]}),
    'RandomForest': (RandomForestRegressor(), {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 20]}),
    # Gradient Boosting Models
    'XGBoost': (XGBRegressor(), {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1, 0.3], 'max_depth': [3, 6, 9]}),
    # Support Vector Regression
    'SVR': (SVR(), {'C': [0.1, 1], 'epsilon': [0.01, 0.1], 'kernel': ['linear', 'rbf']}),
}


trained_models = {}  # Dictionary to store trained models
# List to collect one result row per model (converted to a DataFrame below)
results = []

for name, (model, params) in models.items():
    print(f"Training {name}...")

    if params:  # Perform GridSearchCV for models with hyperparameters
        grid_search = GridSearchCV(model, param_grid=params, scoring=rmsle_scorer, cv=5, n_jobs=-1)
        grid_search.fit(X_train, y_train)
        best_model = grid_search.best_estimator_
        best_params = grid_search.best_params_
        y_pred = best_model.predict(X_val)
        best_score = rmsle(y_val, y_pred)
    else:  # Directly fit models with no hyperparameter grid (here, LinearRegression)
        best_model = model.fit(X_train, y_train)
        best_params = "N/A"
        y_pred = best_model.predict(X_val)
        best_score = rmsle(y_val, y_pred)

    # Append results
    results.append({'Model': name, 'Best Parameters': best_params, 'RMSLE': best_score})

    trained_models[name] = best_model
    
    
# Convert results to DataFrame
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('RMSLE')
results_df

Training LinearRegression...
Training Ridge...
Training Lasso...
Training ElasticNet...
Training DecisionTree...
Training RandomForest...
Training XGBoost...
Training SVR...


Unnamed: 0,Model,Best Parameters,RMSLE
5,RandomForest,"{'max_depth': 10, 'n_estimators': 200}",0.228803
6,XGBoost,"{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}",0.23553
4,DecisionTree,"{'max_depth': 5, 'min_samples_split': 2}",0.251423
0,LinearRegression,,0.274103
2,Lasso,{'alpha': 1},0.274103
1,Ridge,{'alpha': 100},0.274103
3,ElasticNet,"{'alpha': 1, 'l1_ratio': 0.1}",0.274111
7,SVR,"{'C': 0.1, 'epsilon': 0.01, 'kernel': 'linear'}",0.281761


### Model Comparison Overview

The above table summarizes the performance of different regression models using *Root Mean Squared Logarithmic Error (RMSLE)* as the evaluation metric. Lower RMSLE values indicate better performance. All models were trained on the same data using consistent validation and tuning procedures.

Note that the RMSLE of our best model, `RandomForest`, has improved slightly after hyperparameter tuning (from 0.229732 to 0.228803).

---

### Interpretation & Key Takeaways
- **RandomForest** achieved the best performance with a tuned combination of `max_depth` and `n_estimators`. Note above that we tried the following hyperparameters for RandomForest: `max_depth`: [5, 10, 20] and `n_estimators`: [50, 100, 200].  
  The best combination selected by cross-validation was `max_depth=10` and `n_estimators=200`.  
  For RandomForest the gain from tuning was small (RMSLE 0.229732 to 0.228803), and because no `random_state` was set, part of that difference may be run-to-run variation. Tuning mattered far more for DecisionTree (0.295314 to 0.251423) and SVR (0.432187 to 0.281761).


- **XGBoost**, another ensemble method, also performed very well and is often preferred in real-world machine learning competitions.
- The **DecisionTree** model performed decently but was outperformed by its ensemble counterpart (RandomForest), highlighting the value of averaging multiple trees.
- The **linear models** (LinearRegression, Ridge, Lasso, ElasticNet) performed similarly. This suggests that regularization had little effect on this dataset, and the relationship may not be fully linear.
- **SVR** was the worst performer in this case, likely due to its sensitivity to hyperparameters and scaling (or potentially underfitting).
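
One way to address SVR's sensitivity to scaling, sketched here as a suggestion rather than something this notebook does, is to put a `StandardScaler` in front of the model with a `Pipeline`. The synthetic data and the small `C` grid are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (think lot area vs. year built)
X = np.column_stack([rng.uniform(1000, 20000, 200), rng.uniform(1900, 2010, 200)])
y = 0.01 * X[:, 0] + 2.0 * (X[:, 1] - 1900) + rng.normal(scale=5, size=200)

pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR())])

# Pipeline parameters are addressed as <step name>__<parameter>
grid = GridSearchCV(pipe, {"svr__C": [1, 10, 100]}, cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
```

Keeping the scaler inside the pipeline means it is re-fit on the training portion of each cross-validation fold, so no information from the held-out fold leaks into the scaling.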

---

### 📚 Learning Point

> This exercise demonstrates the importance of **model tuning** and **performance evaluation**. Ensemble models like Random Forest and XGBoost often outperform simpler models, but tuning hyperparameters is essential. Don't just default to one model—test multiple approaches and validate them carefully. 

### Select the Best Model for Generating Updated Submission File
Retrieve a specific model from `trained_models` manually, by name

In [97]:
# Example: Use the trained RandomForest model
rf_model = trained_models['RandomForest']
val_predictions = rf_model.predict(X_val)
print("RandomForest RMSLE (manual re-use):", rmsle(y_val, val_predictions))

RandomForest RMSLE (manual re-use): 0.22880293177239436


Extract best model from `results_df` dynamically

In [98]:
# Extract the best model from results_df based on lowest RMSLE
best_model_name = results_df.loc[results_df['RMSLE'].idxmin(), 'Model']
print(f"Best Model: {best_model_name}")

best_model = trained_models[best_model_name] # No re-fitting, just using the stored model
#best_model.fit(X_train, y_train)  #If you want to retrain the model, uncomment this line

# Generate predictions on validation set
val_predictions = best_model.predict(X_val) # Model is already trained, just predict
#print(f"{best_model_name} RMSLE (from trained_models):", rmsle(y_val, val_predictions))
print("RMSLE:", rmsle(y_val, val_predictions))

Best Model: RandomForest
RMSLE: 0.22880293177239436


### Make Predictions on `test.csv` and Generate Submission File
- Use your **best-performing model** to predict house prices in the Kaggle test dataset and generate a correctly formatted submission file.
- Load the **test dataset** from Kaggle’s *House Prices - Advanced Regression Techniques* competition (for ease of use, you can access the version I have provided on GitHub). 
- Use *the same feature selection* as in training. 
- Handle any missing values in these columns using the median values.
- Use the best trained model to make *predictions on the test set*.  
    - Load and apply the best model from your training in Task 9.
- Ensure that all predictions are *non-negative* (house prices cannot be negative). 
- Save the predictions as *`submission.csv`* in the required format:  
    - The submission file must contain *two columns:* `Id` and `SalePrice`.  
    - *Important:* The submission file *must match Kaggle’s format exactly*—every `Id` must have a prediction, and no values can be missing.     
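
Before uploading, the format requirements above can be checked in code. This is a minimal sketch: `check_submission` is a hypothetical helper (not part of pandas or Kaggle's tooling), and the three rows are made-up values standing in for real predictions.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Hypothetical helper: raise AssertionError if the file would be malformed."""
    assert list(submission.columns) == ["Id", "SalePrice"], "unexpected columns"
    assert submission["SalePrice"].notna().all(), "missing predictions"
    assert (submission["SalePrice"] >= 0).all(), "negative predictions"
    assert submission["Id"].tolist() == list(expected_ids), "Ids do not match the test set"
    return True

# Made-up three-row example; real Ids would come from test.csv
toy_submission = pd.DataFrame({"Id": [1461, 1462, 1463],
                               "SalePrice": [147000.0, 176000.0, 216000.0]})
print(check_submission(toy_submission, [1461, 1462, 1463]))
```

With the real data, the same call would take the final submission DataFrame and the `Id` column of the test set.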

In [99]:
test_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Housing_Prices/test.csv'
test = pd.read_csv(test_url)
print(len(test))
test.head()

1459


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,GasA,TA,Y,SBrkr,896,0,0,896,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,TA,TA,CBlock,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,GasA,TA,Y,SBrkr,1329,0,0,1329,0.0,0.0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,GasA,Gd,Y,SBrkr,928,701,0,1629,0.0,0.0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997.0,Fin,2.0,482.0,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,6,6,1998,1998,Gable,CompShg,VinylSd,VinylSd,BrkFace,20.0,TA,TA,PConc,TA,TA,No,GLQ,602.0,Unf,0.0,324.0,926.0,GasA,Ex,Y,SBrkr,926,678,0,1604,0.0,0.0,2,1,3,1,Gd,7,Typ,1,Gd,Attchd,1998.0,Fin,2.0,470.0,TA,TA,Y,360,36,0,0,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,Inside,Gtl,StoneBr,Norm,Norm,TwnhsE,1Story,8,5,1992,1992,Gable,CompShg,HdBoard,HdBoard,,0.0,Gd,TA,PConc,Gd,TA,No,ALQ,263.0,Unf,0.0,1017.0,1280.0,GasA,Ex,Y,SBrkr,1280,0,0,1280,0.0,0.0,2,0,2,1,Gd,5,Typ,0,,Attchd,1992.0,RFn,2.0,506.0,TA,TA,Y,0,82,0,0,144,0,,,,0,1,2010,WD,Normal


In [100]:
test['LotFrontage'] = test['LotFrontage'].fillna(test["LotFrontage"].median())
print("Missing values in LotFrontage column:", test["LotFrontage"].isnull().sum())

Missing values in LotFrontage column: 0


In [106]:
test[features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   LotArea      1459 non-null   int64  
 1   LotFrontage  1459 non-null   float64
 2   YearBuilt    1459 non-null   int64  
 3   1stFlrSF     1459 non-null   int64  
dtypes: float64(1), int64(3)
memory usage: 45.7 KB


In [102]:
# Select the same predictor variables as in training
X_test = test[features]
X_test[:5]

Unnamed: 0,LotArea,LotFrontage,YearBuilt,1stFlrSF
0,11622,80.0,1961,896
1,14267,81.0,1958,1329
2,13830,74.0,1997,928
3,9978,78.0,1998,926
4,5005,43.0,1992,1280


In [104]:
# Generate predictions for Kaggle test set using the best trained model
test_predictions = best_model.predict(X_test)
print('# of predictions:', len(test_predictions))
test_predictions[:5]

# of predictions: 1459


array([147674.54432663, 175834.70514362, 215891.93486456, 212958.69472185,
       191867.59222523])

In [112]:
# Ensure predictions are non-negative (house prices cannot be negative)
print(f"Min SalePrice: {test_predictions.min()}")
print(f"Max SalePrice: {test_predictions.max()}")

# If any predictions are negative, uncomment and run the following line:
#test_predictions = np.maximum(test_predictions, 0)

Min SalePrice: 75616.84221014459
Max SalePrice: 465705.3025159118


In [109]:
# Add predictions to test dataset
test['SalePrice'] = test_predictions

In [110]:
#Create submission file
submission_df = test[['Id', 'SalePrice']]
submission_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         1459 non-null   int64  
 1   SalePrice  1459 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 22.9 KB


In [None]:
#Save file
submission_df.to_csv("submission.csv", index=False)
print("Submission file saved as 'submission.csv'")