# Abalone Dataset Workbook

by Bryan Carr

2 August 2023

This notebook attempts to predict the age of abalone from their physical attributes. Age can be determined by counting the rings on the abalone's shell, but this is typically much more time-consuming than taking a few measurements.

The notebook will include some exploratory data analysis.

The dataset is available from UC Irvine here:
https://archive.ics.uci.edu/dataset/1/abalone

One major side goal is to learn how to use MLflow. MLflow is a package that helps track and compare models' results, such as when trying different model types or tuning hyperparameters. I have never used MLflow, but I have heard of it and can imagine it would be very useful for tracking model performance.

In [2]:
# import key libraries
import numpy as np
import pandas as pd

import plotly.express as px

In [94]:
# import data from kaggle
! pip install kaggle

from google.colab import files

files.upload()



{}

In [6]:
# -p avoids an error if the directory already exists
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json


In [8]:
! kaggle datasets list -s abalone

ref                                                        title                                            size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------------------------------  -----------------------------------------------  ----  -------------------  -------------  ---------  ---------------  
rodolfomendes/abalone-dataset                              Abalone Dataset                                  57KB  2018-07-19 05:31:02          16325        136  1.0              
hurshd0/abalone-uci                                        Abalone UCI                                      52KB  2019-01-08 23:32:54           1898         19  0.7058824        
sandeepmajumdar/abalone-age-prediction                     Abalone Age Prediction                           57KB  2022-09-04 10:42:31            253         18  1.0              
maik3141/abalone                                           Abalone                                       

In [9]:
# download data
! kaggle datasets download -d rodolfomendes/abalone-dataset

#unzip data in main directory /content/
! unzip /content/abalone-dataset.zip

Downloading abalone-dataset.zip to /content
  0% 0.00/57.3k [00:00<?, ?B/s]
100% 57.3k/57.3k [00:00<00:00, 91.2MB/s]


In [12]:
# read in the data
df = pd.read_csv('/content/abalone.csv')

In [13]:
df

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,15
1,M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7
2,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9
3,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10
4,I,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,7
...,...,...,...,...,...,...,...,...,...
4172,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11
4173,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10
4174,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9
4175,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10


## 2. EDA

In [17]:
# Print Info -- check for Nulls, and note the Data Types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


In [18]:
# Print summary statistics
df.describe()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
count,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0
mean,0.523992,0.407881,0.139516,0.828742,0.359367,0.180594,0.238831,9.933684
std,0.120093,0.09924,0.041827,0.490389,0.221963,0.109614,0.139203,3.224169
min,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,0.45,0.35,0.115,0.4415,0.186,0.0935,0.13,8.0
50%,0.545,0.425,0.14,0.7995,0.336,0.171,0.234,9.0
75%,0.615,0.48,0.165,1.153,0.502,0.253,0.329,11.0
max,0.815,0.65,1.13,2.8255,1.488,0.76,1.005,29.0


In [21]:
# Plot Scatter Matrix

fig = px.scatter_matrix(df,
                        color='Sex')
fig.show()

We likely want to identify the age as early as possible, for example to avoid harvesting (and killing) the infants, so that they can both grow larger and reproduce. We should therefore focus on the Whole weight variable and drop the other weights (Shucked, Viscera and Shell), which can only be measured after harvesting. All of the weights are positively correlated with one another in the scatter matrix, so little information is likely to be lost.
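The correlation between the weight columns can also be checked numerically rather than read off the scatter matrix. The sketch below is only an illustration: it uses the five rows shown in the `df` preview above instead of the full file, and the names `sample`, `weight_cols` and `weight_corr` are introduced here for convenience. On the full data, the same `.corr()` call could be applied to `df[weight_cols]`.

```python
import pandas as pd

# First five rows of the dataset, copied from the df preview above
sample = pd.DataFrame({
    "Whole weight":   [0.5140, 0.2255, 0.6770, 0.5160, 0.2050],
    "Shucked weight": [0.2245, 0.0995, 0.2565, 0.2155, 0.0895],
    "Viscera weight": [0.1010, 0.0485, 0.1415, 0.1140, 0.0395],
    "Shell weight":   [0.1500, 0.0700, 0.2100, 0.1550, 0.0550],
})

weight_cols = list(sample.columns)  # hypothetical helper name

# Pairwise Pearson correlation between the four weight columns
weight_corr = sample[weight_cols].corr()
print(weight_corr.round(3))
```

Five rows are far too few to draw conclusions from; the point is only to show the call that would back up the visual impression.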

In [23]:
df2 = df.drop(labels=['Shucked weight', 'Viscera weight', 'Shell weight'], axis=1)

In [25]:
# Plot Scatter Matrix

fig = px.scatter_matrix(df2,
                        color='Sex')
fig.show()

In [26]:
# Split the data

from sklearn.model_selection import train_test_split

In [27]:
# Set reusable Random Seed
seed = 5678

In [28]:
x_train, x_test, y_train, y_test = train_test_split(
    df2.drop(labels=['Rings'], axis=1),
    df2['Rings'],
    test_size = 0.2,
    random_state = seed
)

## 3. Early Modelling

We will create a basic model, to evaluate feature importance.

We will need to encode the categorical variable, Sex, as something numerical. One Hot Encoding each sex is the simplest and clearest option, so I will do that. Ordinal Encoding might also be appropriate, but it is not clear what the order/ranking of M, F and I should be, so any ordering would be arbitrary and harder to interpret.
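To make the difference between the two encodings concrete, here is a small sketch on a toy `Sex` column (the array `sex` is made up for illustration and is not taken from the dataset). Both encoders sort the categories alphabetically by default, so the order is F, I, M.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column with the three Sex categories (illustrative values only)
sex = np.array([["M"], ["F"], ["I"], ["M"]])

# One Hot Encoding: one 0/1 column per category, in the order F, I, M
onehot = OneHotEncoder().fit_transform(sex).toarray()

# Ordinal Encoding: a single column of integer codes (F=0, I=1, M=2),
# which implies an ordering F < I < M that has no real meaning here
ordinal = OrdinalEncoder().fit_transform(sex)

print(onehot)
print(ordinal.ravel())
```

The ordinal codes suggest that M is "twice" I, which is the arbitrary ranking the text above wants to avoid.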

In [35]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [66]:
# Adding MLFlow
!pip install mlflow

Collecting mlflow
  Downloading mlflow-2.5.0-py3-none-any.whl (18.2 MB)
Collecting databricks-cli<1,>=0.8.7 (from mlflow)
  Downloading databricks-cli-0.17.7.tar.gz (83 kB)
  Preparing metadata (setup.py) ... done
Collecting gitpython<4,>=2.1.0 (from mlflow)
  Downloading GitPython-3.1.32-py3-none-any.whl (188 kB)
Collecting alembic!=1.10.0,<2 (from mlflow)
  Downloading alembic-1.11.1-py3-none-any.whl (224 kB)
Collecting docker<7,>=4.0.0 (from mlflow)
  Downloading docker-6.1.3-py3-none-

NameError: ignored

In [67]:
# Turn on MLflow autologging - this is the simplest way to use MLflow, but a bit ugly.
import mlflow
mlflow.autolog()

2023/08/02 21:55:17 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


In [68]:
# Build the basic model

forest_reg = RandomForestRegressor(
    n_estimators = 100,
    max_depth = 5,
    n_jobs = -1,
    random_state = seed
)

In [69]:
# build the encoder and transformer
ohe = OneHotEncoder(handle_unknown='error')

ct = ColumnTransformer(
    [("onehot", ohe, ['Sex'])],
    remainder='passthrough'
)

In [70]:
# Apply the Column Transformer
x_train_transf = ct.fit_transform(x_train)
x_test_transf = ct.transform(x_test)

In [71]:
forest_reg.fit(x_train_transf, y_train)

2023/08/02 21:55:25 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '2425143e85b748e4bf29a156506f7f42', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


In [72]:
# compute R2 score for Train
forest_reg.score(x_train_transf, y_train)

0.45772088475782813

In [81]:
# compute R2 score for Test
forest_reg.score(x_test_transf, y_test)

0.386768737597181

### 3.1 New Models

In [84]:
forest_reg10 = RandomForestRegressor(
    n_estimators = 100,
    max_depth = 10,
    n_jobs = -1,
    random_state = seed
)


In [85]:
forest_reg10.fit(x_train_transf, y_train)


2023/08/02 22:40:35 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '15b1822322f3421298353d652f559bbb', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


In [86]:
# compute R2 score for Train
forest_reg10.score(x_train_transf, y_train)

0.6871161295650494

In [87]:
# compute R2 score for Test
forest_reg10.score(x_test_transf, y_test)

0.37523589432217375

## 4. Gradient Boosting

Gradient boosting is typically a robust method, so let's try it. For simplicity, I will use the scikit-learn Gradient Boosting Regressor rather than XGBoost or LightGBM (typically more powerful packages), because I am mainly testing MLflow with these experiments.

In [88]:
from sklearn.ensemble import GradientBoostingRegressor

In [89]:
gbr = GradientBoostingRegressor(
    loss = 'squared_error',
    learning_rate = 0.1,
    n_estimators = 100,
    max_depth = 3
)

In [90]:
gbr.fit(x_train_transf, y_train)

2023/08/02 22:45:11 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'b83afdeea59d49dd86d844a6a8aa3d38', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


In [91]:
#Training Score
gbr.score(x_train_transf, y_train)

0.5043552872868744

In [92]:
# Testing Score
gbr.score(x_test_transf, y_test)

0.3926588374251301

In [96]:
gbr_feature_importance = gbr.feature_importances_

print(gbr_feature_importance)
print(df2.columns)

[8.08700504e-03 4.64543336e-02 2.72859262e-04 5.93641683e-02
 2.01955488e-01 4.92178031e-01 1.91688115e-01]
Index(['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Rings'], dtype='object')


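Note that the importances follow the column order produced by the Column Transformer, not the order of `df2.columns` printed above: the first three values belong to the one-hot Sex columns, followed by Length, Diameter, Height and Whole weight. So Height (0.49), Diameter (0.20) and Whole weight (0.19) are the key features here, while Sex and Length contribute comparatively little.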

In [101]:
# Look at what's inside the Col Transformer
ct.output_indices_

{'onehot': slice(0, 3, None), 'remainder': slice(3, 7, None)}

In [102]:
# Look at what's inside the Col Transformer
ct.transformers_

[('onehot', OneHotEncoder(), ['Sex']),
 ('remainder', 'passthrough', [1, 2, 3, 4])]

In [104]:
# List the transformed column names (note the parentheses: this is a method call)
ct.get_feature_names_out()

## 5. MLFLOW UI Access

To access the MLflow UI, we need to tunnel into the Colab VM. This can be done with the ngrok utility.

1) Install ngrok (via pyngrok)

2) Open a tunnel to the VM's localhost:5000 port

3) Run the MLflow UI command

4) Access the tunneled URL

In [82]:
# install NGrok to access the UI site
!pip install pyngrok



In [83]:
from pyngrok import ngrok

ngrok.kill() #kill any existing tunnels

# set auth token
# get a token from https://dashboard.ngrok.com/auth, then uncomment the next line and paste it in
# NGROK_AUTH_TOKEN = "TOKEN_GOES_HERE"
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

#open tunnel
ngrok_tunnel = ngrok.connect(addr="5000",
                             proto="http",
                             bind_tls=True)

print("MLFlow tracking ui: ", ngrok_tunnel.public_url)



MLFlow tracking ui:  https://cf7d-35-245-90-28.ngrok-free.app


In [93]:
! mlflow ui

[2023-08-02 22:47:15 +0000] [45729] [INFO] Starting gunicorn 20.1.0
[2023-08-02 22:47:15 +0000] [45729] [INFO] Listening at: http://127.0.0.1:5000 (45729)
[2023-08-02 22:47:15 +0000] [45729] [INFO] Using worker: sync
[2023-08-02 22:47:15 +0000] [45734] [INFO] Booting worker with pid: 45734
[2023-08-02 22:47:15 +0000] [45735] [INFO] Booting worker with pid: 45735
[2023-08-02 22:47:15 +0000] [45736] [INFO] Booting worker with pid: 45736
[2023-08-02 22:47:15 +0000] [45737] [INFO] Booting worker with pid: 45737

[2023-08-02 22:48:24 +0000] [45729] [INFO] Handling signal: int
Aborted!
[2023-08-02 22:48:24 +0000] [45737] [INFO] Worker exiting (pid: 45737)
[2023-08-02 22:48:24 +0000] [45734] [INFO] Worker exiting (pid: 45734)
[2023-08-02 22:48:24 +0000] [45735] [INFO] Worker exiting (pid: 45735)
[2023-08-02 22:48:25 +0000] [45736] [INFO] Worker exiting (pid: 45736)
[2023-08-02 22:48:25 +0000] [45729] [INFO] Shutting down: Master


## 6. Conclusions

The autologger is quite ugly: it does not make scores easy to access or compare in the MLflow UI. It would be better to try something more robust, such as explicit Runs. I will investigate that in my next notebook.

## 7. Random Forest Regressor for Feature Importance

I want to go back and analyse the initial dataframe, to determine the importance of the various Weight features.

In [105]:
# Split the data from the initial dataframe, with all Weights

x_train, x_test, y_train, y_test = train_test_split(
    df.drop(labels=['Rings'], axis=1),
    df['Rings'],
    test_size = 0.2,
    random_state = seed
)

In [106]:
# Apply the Column Transformer
x_train_transf = ct.fit_transform(x_train)
x_test_transf = ct.transform(x_test)

In [107]:
rand_forest_full = RandomForestRegressor()

In [108]:
rand_forest_full.fit(x_train_transf, y_train)

2023/08/03 00:02:12 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '0805b4ff09b443a486e0b311820398d6', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


In [109]:
#training score
rand_forest_full.score(x_train_transf, y_train)

0.9378864174070912

In [110]:
#testing score
rand_forest_full.score(x_test_transf, y_test)

0.5354878653969988

In [111]:
#get feature importance
rand_forest_full.feature_importances_

array([0.00584289, 0.0207642 , 0.00557941, 0.04570987, 0.05391068,
       0.05210847, 0.08006078, 0.1547235 , 0.07147667, 0.50982353])

In [113]:
df.columns

Index(['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight', 'Rings'],
      dtype='object')

We can see that this model performs much better, with a test R² of about 0.54 (versus roughly 0.39 for the earlier models), although the training R² of 0.94 shows that it overfits heavily. The feature importances are listed in the Column Transformer's output order (three one-hot Sex columns first, then the numeric columns), and they give some important hints: Shell weight is the most important feature, and it was not kept for the other models. It accounts for over 50% of the importance on its own. Shucked weight is second, at about 15%. Both outweigh the Length, Diameter and Height measurements (approx. 5% each) and Whole weight (8%).
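The value returned by `score` on a scikit-learn regressor is the coefficient of determination R², not a classification accuracy. The short sketch below computes it by hand and compares it with `sklearn.metrics.r2_score`; the `y_true` values are the Rings of the five previewed rows, while the predictions in `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([15, 7, 9, 10, 7])   # Rings from the five previewed rows
y_pred = np.array([12, 8, 9, 10, 8])   # made-up predictions

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```

An R² of 0.54 therefore means the model explains about 54% of the variance in Rings on the test set, not that 54% of predictions are correct.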

Clearly we should include Shell weight and Shucked weight in the model if we want the best predictive performance. That model may be less useful from a practical, conservation-minded standpoint, though, since those weights can only be measured after harvesting.

Another possible approach, on the conservation side, would be to predict only whether an abalone is an Infant or not.
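One possible reading of this idea is a binary target built from the Sex column, where `I` marks infants. The sketch below shows only that step, on the five rows from the `df` preview; the column name `is_infant` is an assumption, and whether Sex should then be removed from the features is left open.

```python
import pandas as pd

# Sex and Rings for the five rows shown in the df preview
sample = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Rings": [15, 7, 9, 10, 7],
})

# Binary target: 1 for infants (Sex == 'I'), 0 for adults (M or F)
sample["is_infant"] = (sample["Sex"] == "I").astype(int)
print(sample)
```

A classifier trained on the physical measurements could then be scored on this target instead of on Rings.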