## Instructions

Download the dataset here: https://archive.ics.uci.edu/ml/datasets/Abalone

The data and variable names are in different files; you will likely need both. The goal here is to predict the age of the abalone from the other variables in the dataset, because the traditional method for aging these organisms is boring and tedious.

There are two challenges (in my opinion):

1. You should try to build the best, stacking-based model(s) to predict age.

2. The UC Irvine Machine Learning Repository classifies this dataset as a "classification" dataset, but age is stored as a numeric (albeit discrete-valued) variable. So, I think it is reasonable to treat this as a regression problem. It's up to you!

How does your work here compare to your results with bagging?!

## Data Import

In [13]:
# Packages
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor, StackingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR, LinearSVR
from xgboost import XGBRegressor
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


In [14]:
# Data
abalone_df = pd.read_csv('Data/abalone.data', header=None)
abalone_df.columns = [
    'Sex',
    'Length',
    'Diameter',
    'Height',
    'Whole_weight',
    'Shucked_weight',
    'Viscera_weight',
    'Shell_weight',
    'Rings'
]
abalone_df.head()

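|   | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |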


## Data Prep

In [15]:
# Rings +1.5 gives the age in years
abalone_df["Age"] = abalone_df["Rings"] + 1.5
abalone_df.drop(columns=["Rings"], inplace=True)
abalone_df.head()

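|   | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Age |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-----|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 16.5 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 8.5 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 10.5 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 11.5 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 8.5 |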


## Stacked Ensemble Modeling

The stacking regressors below use two different sets of base models to compare their effectiveness in predicting abalone age. The first stack uses tree-based models: Random Forest, Extra Trees, and XGBoost (inspiration taken from https://github.com/vecxoz/vecstack/blob/master/examples/01_regression.ipynb). The second uses a non-tree mix: K-Nearest Neighbors, SVR, and LinearSVR. Each stack is paired in turn with three candidate final estimators (Ridge, Random Forest, and Linear Regression) and wrapped in a scikit-learn pipeline with preprocessing. `GridSearchCV` tunes the Ridge meta-learner's `alpha` (the other two final estimators are fit with fixed settings), and model performance is evaluated using test set MSE and R².
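
As a rough illustration of what stacking does under the hood, the sketch below builds out-of-fold predictions by hand and fits a Ridge meta-learner on them. It uses synthetic data from `make_regression` rather than the abalone data, and the names `X_demo`, `y_demo`, `base_learners`, and `oof_predictions` are made up for this example. `StackingRegressor` handles this bookkeeping internally; with `passthrough=True`, as in this notebook, it also appends the preprocessed input features to the meta-learner's inputs, which this sketch leaves out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration, not the abalone data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Two arbitrary base learners chosen only to keep the example fast
base_learners = [
    KNeighborsRegressor(),
    DecisionTreeRegressor(max_depth=4, random_state=0),
]

# Each column holds one base learner's out-of-fold predictions, so the
# meta-learner never sees a prediction made on data the base learner trained on
oof_predictions = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5) for model in base_learners
])

# The meta-learner learns how to weight the base learners' predictions
meta_learner = Ridge(alpha=1.0).fit(oof_predictions, y_demo)
stack_weights = meta_learner.coef_
print(oof_predictions.shape, stack_weights.shape)
```

Using out-of-fold predictions is the key design choice here: fitting the meta-learner on in-sample base predictions would reward whichever base model overfits the most.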

### Splitting and Preprocessing

In [16]:
X = abalone_df.drop(columns='Age')
y = abalone_df['Age']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing
categorical_features = ['Sex']
categorical_transformer = OneHotEncoder(drop='first')
numeric_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', numeric_transformer, [col for col in X.columns if col not in categorical_features])
    ]
)

### Base Models and Final Estimators
Some Inspiration:

https://machinelearningmastery.com/xgboost-for-regression/

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html

In [17]:
# Two base model sets
base_models_1 = [
    ('rf', RandomForestRegressor(n_estimators=200, random_state=42)), # 200 was best from PA: Bagging!
    ('et', ExtraTreesRegressor(n_estimators=200, random_state=42)),
    ('xgb', XGBRegressor(n_estimators=200, random_state=42, verbosity=0))
]

base_models_2 = [
    ('knn', KNeighborsRegressor()),
    ('svr', SVR()),
    ('linsvr', LinearSVR(random_state=42, max_iter=10000))
]

# Final estimators with optional param grids
final_estimators = [
    ('Ridge', Ridge(), {'regressor__final_estimator__alpha': [0.01, 0.1, 1.0]}),
    ('RandomForest', RandomForestRegressor(n_estimators=50, random_state=42), {}),
    ('LinearRegression', LinearRegression(), {})
]
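
The grid key `regressor__final_estimator__alpha` follows scikit-learn's double-underscore convention for nested parameters: pipeline step name, then the stacker's attribute, then the estimator's own parameter. The sketch below uses a stripped-down pipeline (`demo_pipe`, a hypothetical stand-in with a single KNN base learner) to list the names that `GridSearchCV` would accept.

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in for the notebook's pipeline (one base learner only)
demo_pipe = Pipeline(steps=[
    ('preprocessor', StandardScaler()),
    ('regressor', StackingRegressor(
        estimators=[('knn', KNeighborsRegressor())],
        final_estimator=Ridge(),
    )),
])

all_param_names = demo_pipe.get_params()

# Parameters of the meta-learner, reachable through the 'regressor' step
meta_param_names = sorted(
    name for name in all_param_names
    if name.startswith('regressor__final_estimator__')
)

# Base learners are addressed by the name given in the estimators list
knn_param_names = sorted(
    name for name in all_param_names
    if name.startswith('regressor__knn__')
)
print(meta_param_names)
```

The same pattern would allow tuning base learners as well, for example through a key such as `regressor__knn__n_neighbors`, at the cost of a larger search.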

### Grid Search on Combos

In [19]:
results = []

# Loop through each base model set and final estimator combo
for idx, base_models in enumerate([base_models_1, base_models_2], start=1):
    for final_name, final_est, final_params in final_estimators:
        # Define stacking regressor
        stack = StackingRegressor(
            estimators=base_models,
            final_estimator=final_est,
            passthrough=True,  # meta-learner also receives the preprocessed features
            n_jobs=-1
        )

        # Wrap in pipeline
        pipe = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('regressor', stack)
        ])

        # Grid search
        grid_search = GridSearchCV(pipe, final_params, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
        grid_search.fit(X_train, y_train)

        # Evaluate best model
        best_model = grid_search.best_estimator_
        y_pred = best_model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)

        # Store results
        results.append({
            'Base Model Set': f'Set {idx}',
            'Final Estimator': final_name,
            'Best Params': grid_search.best_params_,
            'Test MSE': round(mse, 3),
            'Test R²': round(r2, 3)
        })


In [20]:
# Results!
pd.DataFrame(results)

|   | Base Model Set | Final Estimator | Best Params | Test MSE | Test R² |
|---|----------------|-----------------|-------------|----------|---------|
| 0 | Set 1 | Ridge | {'regressor__final_estimator__alpha': 1.0} | 4.609 | 0.574 |
| 1 | Set 1 | RandomForest | {} | 5.26 | 0.514 |
| 2 | Set 1 | LinearRegression | {} | 4.606 | 0.574 |
| 3 | Set 2 | Ridge | {'regressor__final_estimator__alpha': 1.0} | 4.541 | 0.581 |
| 4 | Set 2 | RandomForest | {} | 4.964 | 0.541 |
| 5 | Set 2 | LinearRegression | {} | 4.554 | 0.579 |


Set 2 with Ridge as the final estimator performed best overall, with the lowest test MSE (4.541) and the highest R² (0.581). LinearRegression was nearly identical in this configuration (MSE = 4.554, R² = 0.579), indicating that a simple linear model was sufficient to combine the base predictions. In contrast, RandomForest as the final estimator in Set 2 performed worse (MSE = 4.964, R² = 0.541), possibly because the more flexible meta-learner overfit the base predictions and passed-through features.

Set 1, which used only tree-based models as base learners, showed solid performance with both Ridge and LinearRegression as final estimators (MSE ≈ 4.61, R² = 0.574). With RandomForest as the final estimator, performance dropped noticeably (MSE = 5.26, R² = 0.514), making it the weakest configuration overall. These results are consistent with the idea that stacking benefits from model diversity at the base level and simplicity at the meta level, although the gap between the two base sets under a linear meta-learner is small (test MSE of about 4.54 versus 4.61) and comes from a single train/test split.
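
Since the target is age in years, taking the square root of each test MSE gives an error on the same scale as the target, which may be easier to interpret. The short sketch below converts the MSE values copied from the results table; the dictionary name `reported_mse` is just a label for this example.

```python
import math

# Test MSE values copied from the results table above
reported_mse = {
    ('Set 1', 'Ridge'): 4.609,
    ('Set 1', 'RandomForest'): 5.26,
    ('Set 1', 'LinearRegression'): 4.606,
    ('Set 2', 'Ridge'): 4.541,
    ('Set 2', 'RandomForest'): 4.964,
    ('Set 2', 'LinearRegression'): 4.554,
}

# RMSE is in years, the same unit as the Age target
rmse_years = {combo: round(math.sqrt(mse), 2) for combo, mse in reported_mse.items()}

best_combo = min(reported_mse, key=reported_mse.get)
worst_combo = max(reported_mse, key=reported_mse.get)

for combo, rmse in sorted(rmse_years.items(), key=lambda item: item[1]):
    print(combo, rmse)
```

On this scale the best and worst configurations differ by roughly 0.16 years (about 2.13 versus 2.29), which puts the size of the differences between stacks in perspective.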