# AutoGluon Ensemble for California Housing

This notebook uses AutoGluon to train and ensemble multiple models (LightGBM, XGBoost, CatBoost, Random Forest) on the California Housing dataset.
It covers both weighted ensembling (a weighted average of model predictions) and stacking.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularDataset, TabularPredictor



## 1. Data Loading and Preprocessing

In [2]:
# Load dataset
data = fetch_california_housing()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['MedHouseVal'] = data['target']

print("Dataset Shape:", df.shape)
df.head()

Dataset Shape: (20640, 9)


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [3]:
# Split into train and test sets (80% train, 20% test)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Train size: {train_df.shape}, Test size: {test_df.shape}")

Train size: (16512, 9), Test size: (4128, 9)
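The sizes printed above follow directly from the 80/20 arithmetic. Here is a minimal sketch, assuming the test share is rounded up and the remainder goes to training (which is how scikit-learn behaves when only `test_size` is given). The helper name `split_sizes` is hypothetical and exists only for illustration.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size."""
    # Assumption: the test share is rounded up, the rest is used for training.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(20640, 0.2)
print(f"Train rows: {n_train}, Test rows: {n_test}")
```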


## 2. AutoGluon Training

We will configure AutoGluon to train specific models:
- GBM (LightGBM)
- CAT (CatBoost)
- XGB (XGBoost)
- RF (Random Forest)

We also set `presets='best_quality'` to enable stacking and bagging. AutoGluon then builds a weighted ensemble on top of the trained models automatically.
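To make the idea of a weighted ensemble concrete, here is a minimal sketch that blends per-model predictions with a weighted average. The predictions and weights are invented for illustration, and `weighted_ensemble` is a hypothetical helper; AutoGluon chooses its own weights from validation predictions, so this shows only the final combination step.

```python
def weighted_ensemble(predictions, weights):
    """Blend per-model predictions with a weighted average.

    predictions: dict of model name -> list of predictions (same length each)
    weights:     dict of model name -> non-negative weight
    """
    total = sum(weights.values())
    n_rows = len(next(iter(predictions.values())))
    blended = []
    for i in range(n_rows):
        row_sum = sum(weights[name] * predictions[name][i] for name in predictions)
        blended.append(row_sum / total)  # normalise in case weights do not sum to 1
    return blended


# Made-up predictions from three base models for two rows
base_preds = {
    "LightGBM": [2.0, 3.0],
    "CatBoost": [2.4, 3.4],
    "XGBoost": [1.6, 2.6],
}
# Made-up weights; a better-scoring model would typically get a larger weight
weights = {"LightGBM": 0.25, "CatBoost": 0.5, "XGBoost": 0.25}

blended = weighted_ensemble(base_preds, weights)
print(blended)
```

Stacking goes one step further: instead of fixed weights, a second-layer model is trained on the base models' predictions.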

In [4]:
# Define the label column
label = 'MedHouseVal'

# Specify hyperparameters to ensure we get the desired base models
# 'GBM' is LightGBM, 'CAT' is CatBoost, 'XGB' is XGBoost, 'RF' is Random Forest
hyperparameters = {
    'GBM': {},
    'CAT': {},
    'XGB': {},
    'RF': {}
}

# Define the predictor path
save_path = 'ag_models_california_housing'

# Initialize Predictor
predictor = TabularPredictor(label=label, path=save_path, problem_type='regression', eval_metric='mean_squared_error')

# Fit models
# presets='best_quality' enables Bagging and Stacking (L2/L3 ensembles)
# time_limit can be adjusted as needed (e.g., 600 seconds for a quick run, or remove for full training)
predictor.fit(
    train_data=TabularDataset(train_df),
    presets='best_quality',
    hyperparameters=hyperparameters,
    time_limit=600  # Set a time limit for demonstration (10 minutes)
)

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.5.0
Python Version:     3.11.14
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.26200
CPU Count:          32
Pytorch Version:    2.9.1+cpu
CUDA Version:       CUDA is not available
Memory Avail:       74.84 GB / 127.76 GB (58.6%)
Disk Space Avail:   852.19 GB / 1861.70 GB (45.8%)
Presets specified: ['best_quality']
Setting dynamic_stacking from 'auto' to True. Reason: Enable dynamic_stacking when use_bag_holdout is disabled. (use_bag_holdout=False)
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
DyStack is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
	This is used to identify the optimal `num_stack_levels` value. Copies of AutoGluon will be fit on subsets of the data. Then holdout validation data is used to detect stacked overfitting.

(_ray_fit pid=56132) [1000]	valid_set's l2: 0.222258

(_dystack pid=55320) 	-0.201	 = Validation score   (-mean_squared_error)
(_dystack pid=55320) 	2.35s	 = Training   runtime
(_dystack pid=55320) 	0.24s	 = Validation runtime
(_dystack pid=55320) Fitting model: RandomForest_BAG_L1 ... Training model for up to 82.28s of the 128.31s of remaining time.
(_dystack pid=55320) 	Fitting 1 model on all data (use_child_oof=True) | Fitting with cpus=32, gpus=0, mem=0.1/72.9 GB
(_dystack pid=55320) 	-0.2541	 = Validation score   (-mean_squared_error)
(_dystack pid=55320) 	1.57s	 = Training   runtime
(_dystack pid=55320) 	0.38s	 = Validation runtime
(_dystack pid=55320) Fitting model: CatBoost_BAG_L1 ... Training model for up to 79.99s of the 126.02s of remaining time.
(_dystack pid=55320) 	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=4, gpus=0, memory=0.53%)

(_ray_fit pid=55904) [2000]	valid_set's l2: 0.195191 [repeated 8x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)

(_ray_fit pid=61720) 	Ran out of time, early stopping on iteration 9443.
(_dystack pid=55320) 	-0.1884	 = Validation score   (-mean_squared_error)
(_dystack pid=55320) 	64.2s	 = Training   runtime
(_dystack pid=55320) 	0.03s	 = Validation runtime
(_dystack pid=55320) Fitting model: XGBoost_BAG_L1 ... Training model for up to 12.22s of the 58.25s of remaining time.
(_dystack pid=55320) 	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=4, gpus=0, memory=0.02%)
(_dystack pid=55320) 	-0.2058	 = Validation score   (-mean_squared_error)
(_dystack pid=55320) 	1.95s	 = Training   runtime
(_dystack pid=55320) 	0.06s	 = Validation runtime
(_dystack pid=55320) Fitting model: WeightedEnsemble_L2 ... Training model for up to 138.02s of the 52.58s of remaining time.
(_dystack pid=55320) 	Fitting 1 model on all data | Fitting with cpus=32, gpus=0

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x2592bc32850>
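The log reports `num_bag_folds=8` and "Fitting 8 child models (S1F1 - S1F8)", which reflects k-fold bagging: each child model is trained with one fold held out, and its predictions on that fold become out-of-fold (OOF) predictions. The sketch below illustrates the idea with a toy "predict the training mean" model and 4 folds. All helper names are hypothetical, and the round-robin fold assignment is a simplification, not AutoGluon's actual splitting logic.

```python
def kfold_indices(n_rows, k):
    """Assign rows to k folds round-robin: fold f holds rows f, f+k, f+2k, ..."""
    return [list(range(f, n_rows, k)) for f in range(k)]


def fit_mean_model(targets):
    """Toy 'model' that always predicts the mean of its training targets."""
    mean = sum(targets) / len(targets)
    return lambda: mean


def out_of_fold_predictions(y, k):
    """Train one child model per fold and predict only on the held-out rows."""
    oof = [None] * len(y)
    for held_out in kfold_indices(len(y), k):
        held = set(held_out)
        train_targets = [y[i] for i in range(len(y)) if i not in held]
        model = fit_mean_model(train_targets)
        for i in held_out:
            oof[i] = model()  # this row was never seen by this child model
    return oof


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
oof = out_of_fold_predictions(y, k=4)
print(oof)
```

Because every row receives a prediction from a model that did not train on it, OOF predictions can serve as inputs to a stacked layer without leaking the target.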

## 3. Evaluation and Leaderboard

In [5]:
# Display the Leaderboard
# This shows individual models (GBM, CAT, etc.) and Ensembles (WeightedEnsemble, Stacking)
leaderboard = predictor.leaderboard(test_df)
print(leaderboard)

                 model  score_test  score_val         eval_metric  \
0  WeightedEnsemble_L3   -0.176687  -0.185744  mean_squared_error   
1      CatBoost_BAG_L1   -0.178364  -0.189618  mean_squared_error   
2  WeightedEnsemble_L2   -0.178425  -0.187358  mean_squared_error   
3      LightGBM_BAG_L2   -0.178680  -0.189710  mean_squared_error   
4      CatBoost_BAG_L2   -0.178698  -0.187564  mean_squared_error   
5       XGBoost_BAG_L2   -0.180692  -0.192206  mean_squared_error   
6      LightGBM_BAG_L1   -0.183305  -0.200409  mean_squared_error   
7  RandomForest_BAG_L2   -0.186302  -0.197647  mean_squared_error   
8       XGBoost_BAG_L1   -0.188759  -0.204334  mean_squared_error   
9  RandomForest_BAG_L1   -0.251320  -0.251288  mean_squared_error   

   pred_time_test  pred_time_val   fit_time  pred_time_test_marginal  \
0        1.694330       1.387620  82.391284                 0.003068   
1        0.161831       0.039309  69.945071                 0.161831   
2        1.308043       

In [6]:
# Evaluate on Test Data
performance = predictor.evaluate(test_df)
print("Test Metrics:", performance)

Test Metrics: {'mean_squared_error': -0.17668715217642764, 'root_mean_squared_error': np.float64(-0.4203417088232235), 'mean_absolute_error': -0.2690656527236728, 'r2': 0.8651664059768475, 'pearsonr': 0.9302012429813988, 'median_absolute_error': -0.1696379127502441}
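Note that the error metrics above are negative. AutoGluon reports scores in a "higher is better" form, so error metrics such as MSE are sign-flipped. A minimal sketch of recovering the conventional values and checking that RMSE matches the square root of MSE, using the numbers copied from the output above:

```python
import math

# Scores copied from the evaluate() output above (error metrics are negated)
scores = {
    "mean_squared_error": -0.17668715217642764,
    "root_mean_squared_error": -0.4203417088232235,
    "mean_absolute_error": -0.2690656527236728,
}

# Flip the sign to get the conventional, non-negative error values
errors = {name: -value for name, value in scores.items()}

# Sanity check: RMSE should be the square root of MSE
rmse_from_mse = math.sqrt(errors["mean_squared_error"])

print(f"MSE:  {errors['mean_squared_error']:.4f}")
print(f"RMSE: {errors['root_mean_squared_error']:.4f} (sqrt of MSE: {rmse_from_mse:.4f})")
print(f"MAE:  {errors['mean_absolute_error']:.4f}")
```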


## 4. Predictions and Feature Importance

In [7]:
# Feature Importance
feature_importance = predictor.feature_importance(test_df)
print(feature_importance)

Computing feature importance via permutation shuffling for 8 features using 4128 rows with 5 shuffle sets...
	65.01s	= Expected runtime (13.0s per shuffle set)
	13.29s	= Actual runtime (Completed 5 of 5 shuffle sets)


            importance    stddev       p_value  n  p99_high   p99_low
Latitude      2.120570  0.020930  1.138578e-09  5  2.163664  2.077476
Longitude     1.946148  0.026099  3.880311e-09  5  1.999886  1.892410
MedInc        0.333517  0.007865  3.709104e-08  5  0.349712  0.317322
AveOccup      0.169404  0.003114  1.369990e-08  5  0.175817  0.162992
AveRooms      0.146086  0.006502  4.696100e-07  5  0.159473  0.132699
HouseAge      0.049270  0.003330  2.487598e-06  5  0.056126  0.042414
AveBedrms     0.012792  0.001733  3.946142e-05  5  0.016360  0.009223
Population    0.009867  0.000852  6.609502e-06  5  0.011622  0.008113
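The log above says the importances were computed "via permutation shuffling": a feature's values are shuffled across rows, and the resulting drop in score indicates how much the model relies on it. The sketch below shows the idea with a toy model and made-up data, measuring the average increase in MSE over several shuffles. The helper names and the toy model are hypothetical, and this is not AutoGluon's implementation.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(model, rows, y_true, feature, n_shuffles=5, seed=0):
    """Average increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mse(y_true, [model(r) for r in rows])
    increases = []
    for _ in range(n_shuffles):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild the rows with only the chosen feature permuted
        permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
        increases.append(mse(y_true, [model(r) for r in permuted]) - baseline)
    return sum(increases) / n_shuffles


def toy_model(row):
    """Toy model that depends on feature 'a' and ignores feature 'b'."""
    return 2.0 * row["a"]


rows = [{"a": float(i), "b": float(i % 3)} for i in range(20)]
y_true = [toy_model(r) for r in rows]

imp_a = permutation_importance(toy_model, rows, y_true, "a")
imp_b = permutation_importance(toy_model, rows, y_true, "b")
print(f"importance of a: {imp_a:.3f}, importance of b: {imp_b:.3f}")
```

A feature the model ignores gets zero importance, which parallels the small values for `AveBedrms` and `Population` in the table, while `Latitude` and `Longitude` dominate.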


In [8]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Make predictions
y_pred = predictor.predict(test_df)
y_test = test_df[label]

# Recompute the metrics with scikit-learn as a cross-check against predictor.evaluate()

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R2: {r2:.4f}")

MSE: 0.1767
MAE: 0.2691
R2: 0.8652
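For reference, the three metrics can also be computed directly from their definitions. This is a minimal sketch on made-up values (not the notebook's predictions); `regression_metrics` is a hypothetical helper.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) computed from their textbook definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse_value = ss_res / n
    mae_value = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    r2_value = 1.0 - ss_res / ss_tot
    return mse_value, mae_value, r2_value


# Made-up targets and predictions for illustration
toy_mse, toy_mae, toy_r2 = regression_metrics(
    [1.0, 2.0, 3.0, 4.0],
    [1.5, 2.0, 2.5, 4.0],
)
print(f"MSE: {toy_mse:.4f}")
print(f"MAE: {toy_mae:.4f}")
print(f"R2: {toy_r2:.4f}")
```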
