# Abalone Age Prediction Using Machine Learning

In this notebook, we implement a simple yet effective machine learning model to predict the age of abalones from their physical attributes. The dataset contains measurements such as length, diameter, height, and several weights; the target column, `Rings`, serves as the proxy for age.

## Approach and Guidelines

- **Model Simplicity**: For this first version, we will use a **Linear Regression** model. This is a straightforward model that helps capture linear relationships between the input features and the target variable (age).
- **Experiment Tracking with MLflow**: Throughout our experimentation, we use **MLflow** to track different model runs, hyperparameters, and performance metrics such as MSE and R². This allows us to compare different models and easily monitor the progress of our experiments.
- **Repository Guidelines**: While we track our experiments using MLflow, **no MLflow data will be pushed** to the repository. Only the code used for running the experiments will be versioned for future reference.
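
As a reminder of what the two tracked metrics measure, here is a minimal sketch that computes MSE and R² by hand with NumPy. The ring counts are the first five `Rings` values shown later in this notebook; the predictions are hypothetical and chosen only for illustration.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return float(1.0 - ss_res / ss_tot)


y_true = [15, 7, 9, 10, 7]               # Rings values from the first five rows
y_pred = [12.0, 8.0, 10.0, 10.0, 7.0]    # hypothetical predictions

print(f"MSE: {mse(y_true, y_pred):.3f}")
print(f"R^2: {r2(y_true, y_pred):.3f}")
```

A lower MSE and an R² closer to 1 both indicate a better fit; `sklearn.metrics.mean_squared_error` and `r2_score`, used in the cells below, compute the same quantities.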

## Model Selection and Results

After trying several models, including tree ensembles (Random Forest, Extra Trees, Gradient Boosting, XGBoost, LightGBM), **ElasticNet**, **Ridge Regression**, and a grid-searched **Extended Ridge**, we selected **Linear Regression**. Its test MSE (4.891) and R² (0.548) match Ridge to three decimal places and are better than those of every other model tried, so we kept the simpler model.

However, it is worth mentioning that we also explored **Extended Ridge Regression** with optimized parameters. Below are the details of the best configuration:

- **Best Parameters for Extended Ridge**:
  - `alpha`: 1.0
  - `fit_intercept`: True
  - `solver`: lsqr
- **Best MSE for Extended Ridge** (5-fold cross-validation on the training set): 4.889
- **Test MSE for Extended Ridge**: 4.891
- **Test R² for Extended Ridge**: 0.548

In conclusion, **Linear Regression** was retained: it is simple and interpretable, and its test performance is on par with the tuned Ridge model.
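
To make the comparison easier to scan, the sketch below ranks the models by test MSE. The numbers are copied from the cell outputs further down in this notebook and rounded to four decimals; the dictionary keys are just display labels.

```python
# Test-set (MSE, R^2) pairs copied from the cell outputs in this notebook, rounded
results = {
    "Linear Regression": (4.8912, 0.5482),
    "XGBoost": (5.4372, 0.4977),
    "LightGBM": (5.0601, 0.5326),
    "Random Forest": (5.1075, 0.5282),
    "Gradient Boosting": (5.0959, 0.5293),
    "Extra Trees": (4.9958, 0.5385),
    "Ridge (alpha=1.0)": (4.8911, 0.5482),
    "ElasticNet": (7.0931, 0.3448),
}

# Sort ascending by MSE (lower is better)
ranked = sorted(results.items(), key=lambda item: item[1][0])

for name, (mse, r2) in ranked:
    print(f"{name:<20} MSE={mse:.4f}  R2={r2:.4f}")
```

Ridge and Linear Regression differ only in the fourth decimal place of the MSE, which is why simplicity rather than accuracy decides between them.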


# Imports

In [19]:
import os
import kagglehub  # required by kagglehub.dataset_download in the Data section
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, ElasticNet, LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from mlflow.models.signature import infer_signature

# Data

In [20]:
# Download the dataset
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")
csv_path = os.path.join(path, "abalone.csv") 

df = pd.read_csv(csv_path)
df.head()

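|   | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |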


# Modelling

In [21]:
# One-hot encoding the categorical variables
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
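
To illustrate what `drop_first=True` does, here is a small sketch on a toy frame (the `toy` and `encoded` names are illustrative and not part of the notebook). With three levels in `Sex`, only two indicator columns are kept; the dropped level becomes the baseline where both indicators are zero.

```python
import pandas as pd

# Toy frame with the three Sex levels found in the dataset
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [0.455, 0.530, 0.330, 0.440],
})

# drop_first=True removes the first level in sorted order, avoiding a redundant column
encoded = pd.get_dummies(toy, columns=["Sex"], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

Dropping one level avoids perfectly collinear indicator columns, which matters for an unregularized linear model with an intercept.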

In [22]:
# Separating the features (X) from the target variable (y)
X = df.drop(columns=['Rings'])
y = df['Rings']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
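
As a sanity check on the split sizes, the sketch below applies the same `test_size` and `random_state` to a plain index array. The row count of 4177 is an assumption (the usual size of the Abalone dataset; it is not printed in this notebook), but it is consistent with the 3341 training points that LightGBM reports further down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4177  # assumption: standard Abalone row count, not stated in this notebook
indices = np.arange(n_rows)

# Same split settings as the notebook cell above
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

print(f"train rows: {len(train_idx)}, test rows: {len(test_idx)}")
```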

In [23]:
# Standardizing the features (the scaler is fitted on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
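
An alternative way to keep the "fit on train, apply to test" discipline is to bundle the scaler and the model in a scikit-learn `Pipeline`. This is only a sketch on synthetic data from `make_regression`, not the approach used in this notebook; the `*_demo` names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the abalone features and target
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() learns the scaling statistics from the training data only;
# score()/predict() reuse those statistics on new data
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_tr, y_tr)
r2_demo = pipeline.score(X_te, y_te)

print(f"R^2 on held-out synthetic data: {r2_demo:.3f}")
```

A pipeline also works directly inside `GridSearchCV`, so the scaler is refitted on each training fold rather than on the full training set.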

In [24]:
# Defining the linear regression model
model = LinearRegression()

# Track the experiment using MLflow
with mlflow.start_run(run_name="Linear Regression Experiment"):

    # Fit the model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions on the test set
    y_pred = model.predict(X_test_scaled)
    
    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Log parameters and metrics to MLflow
    mlflow.log_param("model", "Linear Regression")
    mlflow.log_param("scaling", "StandardScaler")
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2_score", r2)

    # Create an input example from the first 5 rows of the scaled test set
    input_example = pd.DataFrame(X_test_scaled[:5], columns=X.columns)

    # Infer model signature (schema)
    signature = infer_signature(X_test_scaled, y_pred)

    # Log the model with input example and signature
    mlflow.sklearn.log_model(model, "linear_regression_model", signature=signature, input_example=input_example)

    print(f"Mean Squared Error: {mse}")
    print(f"R^2 Score: {r2}")

Mean Squared Error: 4.891232447128579
R^2 Score: 0.5481628137889263


In [25]:
model = XGBRegressor(random_state=42)

with mlflow.start_run(run_name="XGBoost Regressor Experiment"):

    # Fit the model
    model.fit(X_train_scaled, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test_scaled)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Log parameters and metrics to MLflow
    mlflow.log_param("model", "XGBoost Regressor")
    mlflow.log_param("scaling", "StandardScaler")
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2_score", r2)

    # Create an input example
    input_example = pd.DataFrame(X_test_scaled[:5], columns=X.columns)

    # Infer model signature (schema)
    signature = infer_signature(X_test_scaled, y_pred)

    # Log the model with input example and signature
    mlflow.sklearn.log_model(model, "xgboost_regressor_model", signature=signature, input_example=input_example)

    print(f"Mean Squared Error: {mse}")
    print(f"R^2 Score: {r2}")

Mean Squared Error: 5.437235685045327
R^2 Score: 0.49772469428649535


In [26]:
# LGBM Regressor
model = LGBMRegressor(random_state=42)

with mlflow.start_run(run_name="LGBM Regressor Experiment"):
    
    # Fit the model
    model.fit(X_train_scaled, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test_scaled)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Log parameters and metrics to MLflow
    mlflow.log_param("model", "LGBM Regressor")
    mlflow.log_param("scaling", "StandardScaler")
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2_score", r2)

    # Create an input example
    input_example = pd.DataFrame(X_test_scaled[:5], columns=X.columns)

    # Infer model signature (schema)
    signature = infer_signature(X_test_scaled, y_pred)

    # Log the model with input example and signature
    mlflow.sklearn.log_model(model, "lgbm_regressor_model", signature=signature, input_example=input_example)

    print(f"Mean Squared Error: {mse}")
    print(f"R^2 Score: {r2}")


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001161 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1298
[LightGBM] [Info] Number of data points in the train set: 3341, number of used features: 9
[LightGBM] [Info] Start training from score 9.944627


Mean Squared Error: 5.0601262594226855
R^2 Score: 0.5325609167741505


In [27]:
# RandomForest Regressor
model = RandomForestRegressor(random_state=42)

with mlflow.start_run(run_name="RandomForest Regressor Experiment"):
    
    # Fit the model
    model.fit(X_train_scaled, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test_scaled)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Log parameters and metrics to MLflow
    mlflow.log_param("model", "RandomForest Regressor")
    mlflow.log_param("scaling", "StandardScaler")
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2_score", r2)

    # Create an input example
    input_example = pd.DataFrame(X_test_scaled[:5], columns=X.columns)

    # Infer model signature (schema)
    signature = infer_signature(X_test_scaled, y_pred)

    # Log the model with input example and signature
    mlflow.sklearn.log_model(model, "randomforest_regressor_model", signature=signature, input_example=input_example)

    print(f"Mean Squared Error: {mse}")
    print(f"R^2 Score: {r2}")

Mean Squared Error: 5.107539234449761
R^2 Score: 0.5281810502563149


In [28]:
# GradientBoosting Regressor
model = GradientBoostingRegressor(random_state=42)

with mlflow.start_run(run_name="GradientBoosting Regressor Experiment"):
    
    # Fit the model
    model.fit(X_train_scaled, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test_scaled)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Log parameters and metrics to MLflow
    mlflow.log_param("model", "GradientBoosting Regressor")
    mlflow.log_param("scaling", "StandardScaler")
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2_score", r2)

    # Create an input example
    input_example = pd.DataFrame(X_test_scaled[:5], columns=X.columns)

    # Infer model signature (schema)
    signature = infer_signature(X_test_scaled, y_pred)

    # Log the model with input example and signature
    mlflow.sklearn.log_model(model, "gradientboosting_regressor_model", signature=signature, input_example=input_example)

    print(f"Mean Squared Error: {mse}")
    print(f"R^2 Score: {r2}")

Mean Squared Error: 5.095921650201711
R^2 Score: 0.5292542473766624


In [29]:
# Extra Trees Regressor
model = ExtraTreesRegressor(random_state=42)

with mlflow.start_run(run_name="Extra Trees Regressor Experiment"):
    
    # Fit the model
    model.fit(X_train_scaled, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test_scaled)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Log parameters and metrics to MLflow
    mlflow.log_param("model", "Extra Trees Regressor")
    mlflow.log_param("scaling", "StandardScaler")
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2_score", r2)

    # Create an input example
    input_example = pd.DataFrame(X_test_scaled[:5], columns=X.columns)

    # Infer model signature (schema)
    signature = infer_signature(X_test_scaled, y_pred)

    # Log the model with input example and signature
    mlflow.sklearn.log_model(model, "extra_trees_regressor_model", signature=signature, input_example=input_example)

    print(f"Mean Squared Error: {mse}")
    print(f"R^2 Score: {r2}")

Mean Squared Error: 4.995821172248803
R^2 Score: 0.5385012252673118


In [30]:
# Ridge Regressor
model = Ridge(alpha=1.0)

with mlflow.start_run(run_name="Ridge Regressor Experiment"):
    
    # Fit the model
    model.fit(X_train_scaled, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test_scaled)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Log parameters and metrics to MLflow
    mlflow.log_param("model", "Ridge Regressor")
    mlflow.log_param("scaling", "StandardScaler")
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2_score", r2)

    # Create an input example
    input_example = pd.DataFrame(X_test_scaled[:5], columns=X.columns)

    # Infer model signature (schema)
    signature = infer_signature(X_test_scaled, y_pred)

    # Log the model with input example and signature
    mlflow.sklearn.log_model(model, "ridge_regressor_model", signature=signature, input_example=input_example)

    print(f"Mean Squared Error: {mse}")
    print(f"R^2 Score: {r2}")

Mean Squared Error: 4.8910510674343675
R^2 Score: 0.5481795690937609


In [31]:
# ElasticNet Regressor
model = ElasticNet(alpha=1.0, l1_ratio=0.5)

with mlflow.start_run(run_name="ElasticNet Regressor Experiment"):
    
    # Fit the model
    model.fit(X_train_scaled, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test_scaled)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Log parameters and metrics to MLflow
    mlflow.log_param("model", "ElasticNet Regressor")
    mlflow.log_param("scaling", "StandardScaler")
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2_score", r2)

    # Create an input example
    input_example = pd.DataFrame(X_test_scaled[:5], columns=X.columns)

    # Infer model signature (schema)
    signature = infer_signature(X_test_scaled, y_pred)

    # Log the model with input example and signature
    mlflow.sklearn.log_model(model, "elasticnet_regressor_model", signature=signature, input_example=input_example)

    print(f"Mean Squared Error: {mse}")
    print(f"R^2 Score: {r2}")

Mean Squared Error: 7.093149719884988
R^2 Score: 0.34475638901844075


In [32]:
# Extended parameter grid for Ridge Regression (without 'normalize')
# Note: Ridge accepts solver='lbfgs' only when positive=True, so the 'lbfgs'
# candidates fail under the default positive=False and are scored as NaN.
extended_ridge_param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],  # Wider range of alpha
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sag', 'lbfgs'],  # Different solvers for optimization
    'fit_intercept': [True, False]  # Whether to calculate the intercept for this model
}

# Initialize the Ridge Regressor
ridge_extended = Ridge()

# Use GridSearchCV for tuning the hyperparameters
ridge_extended_grid_search = GridSearchCV(
    estimator=ridge_extended,
    param_grid=extended_ridge_param_grid,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1,
)

# Fit the model
ridge_extended_grid_search.fit(X_train_scaled, y_train)

# Best parameters and performance
best_extended_ridge_model = ridge_extended_grid_search.best_estimator_
print(f"Best parameters for Extended Ridge: {ridge_extended_grid_search.best_params_}")
print(f"Best MSE for Extended Ridge: {-ridge_extended_grid_search.best_score_}")

# Evaluate on the test set
y_pred_extended_ridge = best_extended_ridge_model.predict(X_test_scaled)
mse_extended_ridge = mean_squared_error(y_test, y_pred_extended_ridge)
r2_extended_ridge = r2_score(y_test, y_pred_extended_ridge)

print(f"Test MSE for Extended Ridge: {mse_extended_ridge}")
print(f"Test R² for Extended Ridge: {r2_extended_ridge}")

Best parameters for Extended Ridge: {'alpha': 1.0, 'fit_intercept': True, 'solver': 'lsqr'}
Best MSE for Extended Ridge: 4.889262037659363
Test MSE for Extended Ridge: 4.891051067336059
Test R² for Extended Ridge: 0.5481795691028423
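
The fit counts in the warning that follows this output can be reproduced from the grid itself. The sketch below enumerates the candidates with `ParameterGrid`; attributing all failures to the `lbfgs` solver is our reading of the numbers (Ridge rejects `lbfgs` unless `positive=True`), not something the visible part of the traceback states.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
grid = {
    "alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
    "solver": ["auto", "svd", "cholesky", "lsqr", "sag", "lbfgs"],
    "fit_intercept": [True, False],
}
cv_folds = 5

candidates = list(ParameterGrid(grid))
n_candidates = len(candidates)                 # 7 alphas * 6 solvers * 2 intercept options
total_fits = n_candidates * cv_folds           # one fit per candidate per fold
lbfgs_fits = sum(1 for p in candidates if p["solver"] == "lbfgs") * cv_folds

print(n_candidates, total_fits, lbfgs_fits)    # 84 420 70
```

Removing `'lbfgs'` from the solver list, or passing `error_score='raise'` while debugging, would make the search run without silently discarded candidates.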


70 fits failed out of a total of 420.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
70 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\user\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\user\anaconda3\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\user\anaconda3\Lib\site-packages\sklearn\linear_model\_ridge.py", line 1142, in fit
    return super().fit(X, y, sample_weight=sample_weight)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c