# AdaBoost (Regression)

Data Source: [Abalone](http://archive.ics.uci.edu/ml/datasets/Abalone)

**Data Set Information**

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

From the original data, examples with missing values were removed (most of them lacked the value to be predicted), and the ranges of the continuous values were scaled for use with an ANN (by dividing by 200).
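As a small illustration of the scaling described above, the sketch below maps a scaled value back to its original unit. It assumes the divide-by-200 factor applies uniformly to every continuous column, as the description states; the helper name `to_original_units` is hypothetical.

```python
# Assumption: every continuous column was divided by 200, as described above.
SCALE = 200


def to_original_units(scaled_value: float) -> float:
    """Undo the divide-by-200 scaling (mm for lengths, grams for weights)."""
    return scaled_value * SCALE


# 0.455 is the scaled length of the first record in the raw file
length_mm = to_original_units(0.455)
print(round(length_mm, 1))
```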

**Attribute Information**

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem.


| Name | Data Type | Measurement Unit | Description |
| --- | --- | --- | --- |
| Sex | nominal | -- | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | -- | +1.5 gives the age in years |
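The rings-to-age rule in the last attribute can be written as a one-line conversion. This is only a sketch of that rule; the function name `rings_to_age` is hypothetical, and the ring counts are taken from the first few records of the dataset.

```python
def rings_to_age(rings: int) -> float:
    """Age in years is the ring count plus 1.5, per the attribute description."""
    return rings + 1.5


# Ring counts from the first three records of the raw file
ages = [rings_to_age(r) for r in (15, 7, 9)]
print(ages)
```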

In [1]:
# Import the necessary packages
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, RandomForestRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import train_test_split

In [2]:
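# Loading and reading the data
# Note: the raw file has no header row, so read_csv treats the first record
# (M, 0.455, ..., 15) as column names; passing header=None would keep all 4177 rows
abalone_orig = pd.read_csv("./abalone/abalone.txt")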
abalone_orig.head()

Unnamed: 0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
0,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
1,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


In [3]:
# Column rename as per data source
abalone_orig.columns = ["sex", "length", "diameter", "height", "wholeweight", "shuckedweight",
                       "visceraweight", "shellweight", "rings"]
abalone_orig.head()

Unnamed: 0,sex,length,diameter,height,wholeweight,shuckedweight,visceraweight,shellweight,rings
0,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
1,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
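Because the raw file has no header line, reading it with default settings turns the first record into column names and drops it from the data. The sketch below shows one way to avoid that by passing `header=None` together with `names`. It uses an in-memory copy of the first three records instead of the real file, so the `raw` buffer is a stand-in, not the notebook's actual input.

```python
import io

import pandas as pd

# Three records copied from the raw file (which has no header line)
raw = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
)

columns = ["sex", "length", "diameter", "height", "wholeweight",
           "shuckedweight", "visceraweight", "shellweight", "rings"]

# Default behaviour: the first record is consumed as the header
default_read = pd.read_csv(raw)

# header=None keeps every record; names assigns the column labels in one step
raw.seek(0)
named_read = pd.read_csv(raw, header=None, names=columns)

print(len(default_read), len(named_read))
```

With the real file, the same two arguments would replace the separate rename step.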


In [4]:
# Export this to .csv data
# abalone_orig.to_csv("./abalone/abalone.csv", index = None)

In [5]:
# Reading the .csv dataset
abalone = pd.read_csv("./abalone/abalone.csv")
abalone.head()

Unnamed: 0,sex,length,diameter,height,wholeweight,shuckedweight,visceraweight,shellweight,rings
0,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
1,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


In [6]:
# Display the characteristics of dataset
print("Dimensions of the dataset are: ", abalone.shape)
print("The variables present in the dataset are: \n", abalone.columns)

Dimensions of the dataset are:  (4176, 9)
The variables present in the dataset are: 
 Index(['sex', 'length', 'diameter', 'height', 'wholeweight', 'shuckedweight',
       'visceraweight', 'shellweight', 'rings'],
      dtype='object')


In [7]:
# Dataset information
abalone.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4176 entries, 0 to 4175
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   sex            4176 non-null   object 
 1   length         4176 non-null   float64
 2   diameter       4176 non-null   float64
 3   height         4176 non-null   float64
 4   wholeweight    4176 non-null   float64
 5   shuckedweight  4176 non-null   float64
 6   visceraweight  4176 non-null   float64
 7   shellweight    4176 non-null   float64
 8   rings          4176 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


In [8]:
# For the sake of simplicity, let's drop the sex column
abalone.drop("sex", axis = 1, inplace = True)
abalone.head()

Unnamed: 0,length,diameter,height,wholeweight,shuckedweight,visceraweight,shellweight,rings
0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
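Dropping `sex` keeps the example simple, but the column could also be kept by one-hot encoding it. The sketch below shows that alternative with `pd.get_dummies` on a three-row sample built from values in the dataset; it is an optional variation, not what the rest of this notebook does.

```python
import pandas as pd

# Small sample with one row per sex category (values taken from the dataset)
sample = pd.DataFrame({
    "sex": ["M", "F", "I"],
    "length": [0.35, 0.53, 0.33],
    "rings": [7, 9, 7],
})

# Replace the nominal column with one 0/1 indicator column per category
encoded = pd.get_dummies(sample, columns=["sex"], dtype=int)
print(encoded)
```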


In [9]:
# Seed NumPy's global random number generator so the train-test split below is reproducible
np.random.seed(3000)

In [10]:
# Train-Test Split
# Dependent Variable - rings
training, test = train_test_split(abalone, test_size = 0.3)

x_trg = training.drop("rings", axis = 1)
y_trg = training["rings"]

x_test = test.drop("rings", axis = 1)
y_test = test["rings"]
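An alternative to seeding NumPy globally is to pass `random_state` directly to `train_test_split`, which ties reproducibility to the call itself. The sketch below uses a small made-up frame (`demo`) as a stand-in for `abalone`, and the seed value 3000 simply mirrors the one used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the abalone frame
demo = pd.DataFrame({"length": np.linspace(0.1, 0.8, 20), "rings": np.arange(20)})
X = demo.drop("rings", axis=1)
y = demo["rings"]

# Splitting features and target together with an explicit seed
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=3000)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=3000)

# The same seed yields the same partition on every call
same_split = X_te1.index.equals(X_te2.index)
print(same_split)
```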

### Creating AdaBoost model

In [11]:
# Model building - AdaBoost
abalone_ada = AdaBoostRegressor()

# Fit the model
abalone_ada.fit(x_trg, y_trg)
# Note: score() on a regressor returns the coefficient of determination (R^2), not an accuracy
print("R^2 of AdaBoost model on training set is: ", abalone_ada.score(x_trg, y_trg))
print("R^2 of AdaBoost model on test set is: ", abalone_ada.score(x_test, y_test))

# Prediction using AdaBoost model
abalone_ada_pred = abalone_ada.predict(x_test)

# Compute RMSE of AdaBoost model
abalone_ada_rmse = sqrt(mean_squared_error(y_test, abalone_ada_pred))
print("RMSE value of AdaBoost model is: ", abalone_ada_rmse)

R^2 of AdaBoost model on training set is:  0.2354689688138678
R^2 of AdaBoost model on test set is:  0.1509188953985222
RMSE value of AdaBoost model is:  2.9958820220046696
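For a scikit-learn regressor, `score` returns the coefficient of determination (R²) of the predictions, so the two score values printed above are R² values. The sketch below checks this against `r2_score` on synthetic data; the data and the `random_state` values are assumptions made only so the example is self-contained.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: one informative feature plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 10 * X[:, 0] + rng.normal(0, 1, size=200)

model = AdaBoostRegressor(random_state=0).fit(X, y)
pred = model.predict(X)

r2_from_score = model.score(X, y)    # what the notebook prints
r2_from_metric = r2_score(y, pred)   # explicit R^2 computation
rmse = np.sqrt(mean_squared_error(y, pred))

print(r2_from_score, r2_from_metric, rmse)
```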


#### Create a new AdaBoost model with grid search

In [12]:
# Import the necessary package
from sklearn.model_selection import GridSearchCV

In [13]:
# Setting the parameters
param_grid = {"n_estimators" : [50,100,200], "learning_rate" : [0.5,0.7,0.9,1]}

abalone_ada_grid = AdaBoostRegressor()
abalone_ada_CV = GridSearchCV(estimator = abalone_ada_grid, param_grid = param_grid, cv = 5)

# Fit the model
abalone_ada_result = abalone_ada_CV.fit(x_trg, y_trg)
print("Best Parameters are: \n", abalone_ada_CV.best_params_)

Best Parameters are: 
 {'learning_rate': 0.5, 'n_estimators': 50}
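By default, `GridSearchCV` ranks parameter combinations using the estimator's own `score` method, which is R² for a regressor. Since the models here are compared on RMSE, one option is to search on RMSE directly through the `scoring` argument. The sketch below does this on synthetic data with a reduced grid; the data, the grid values and `cv=3` are assumptions chosen to keep it quick.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(150, 3))
y = 10 * X[:, 0] + rng.normal(0, 1, size=150)

search = GridSearchCV(
    estimator=AdaBoostRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20], "learning_rate": [0.5, 1.0]},
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=3,
)
search.fit(X, y)

best_rmse = -search.best_score_        # flip the sign back to a plain RMSE
best_model = search.best_estimator_    # already refit on all of X, y (refit=True)
print(search.best_params_, best_rmse)
```

Because `refit=True` is the default, `best_estimator_` can be used for prediction without building and fitting a new model from `best_params_`.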


#### Creating the new AdaBoost model with best parameters

In [14]:
# Model building
abalone_ada_best = AdaBoostRegressor(n_estimators = abalone_ada_result.best_params_["n_estimators"],
                            learning_rate = abalone_ada_result.best_params_["learning_rate"])

#### Evaluating the model considering best parameters

In [15]:
# Fit the model
abalone_ada_best.fit(x_trg, y_trg)
# Note: score() on a regressor returns the coefficient of determination (R^2), not an accuracy
print("R^2 of new AdaBoost model on training set is: ", abalone_ada_best.score(x_trg, y_trg))
print("R^2 of new AdaBoost model on test set is: ", abalone_ada_best.score(x_test, y_test))

# Predict using new AdaBoost model
abalone_ada_pred_2 = abalone_ada_best.predict(x_test)

# Compute the RMSE of new AdaBoost model
abalone_ada_rmse_2 = sqrt(mean_squared_error(y_test, abalone_ada_pred_2))
print("RMSE value of new AdaBoost model is: ", abalone_ada_rmse_2)

R^2 of new AdaBoost model on training set is:  0.405858161839737
R^2 of new AdaBoost model on test set is:  0.3199258662623835
RMSE value of new AdaBoost model is:  2.6811940515838404


#### Create Extra Tree model

In [16]:
# Model Building - Extra Tree Model
abalone_extratree = ExtraTreesRegressor()

# Fit the model
abalone_extratree.fit(x_trg, y_trg)
# Note: score() on a regressor returns the coefficient of determination (R^2), not an accuracy
print("R^2 of Extra Tree model on training set is: ", abalone_extratree.score(x_trg, y_trg))
print("R^2 of Extra Tree model on test set is: ", abalone_extratree.score(x_test, y_test))

# Predict using Extra Tree model
abalone_extratree_pred = abalone_extratree.predict(x_test)

# Compute RMSE of Extra Tree model
abalone_extratree_rmse = sqrt(mean_squared_error(y_test, abalone_extratree_pred))
print("RMSE value of Extra Tree model is: ", abalone_extratree_rmse)

R^2 of Extra Tree model on training set is:  1.0
R^2 of Extra Tree model on test set is:  0.5299793825666006
RMSE value of Extra Tree model is:  2.228992324733101


#### Create Random Forest model

In [17]:
# Model building
abalone_forest = RandomForestRegressor(random_state = 0)

# Fit the model
abalone_forest.fit(x_trg, y_trg)
# Note: score() on a regressor returns the coefficient of determination (R^2), not an accuracy
print("R^2 of Random Forest model on training set is: ", abalone_forest.score(x_trg, y_trg))
print("R^2 of Random Forest model on test set is: ", abalone_forest.score(x_test, y_test))

# Predict using Random Forest model
abalone_forest_pred = abalone_forest.predict(x_test)

# Compute RMSE of Random Forest model
abalone_forest_rmse = sqrt(mean_squared_error(y_test, abalone_forest_pred))
print("RMSE value of Random Forest model is: ", abalone_forest_rmse)

R^2 of Random Forest model on training set is:  0.9355857362667112
R^2 of Random Forest model on test set is:  0.525049212814444
RMSE value of Random Forest model is:  2.2406520720319425


#### Create Bagging model

In [18]:
# Model building
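# The base estimator is left at its default (a decision tree). The keyword
# base_estimator was renamed to estimator in newer scikit-learn releases,
# so it is omitted here to keep the call valid across versions.
abalone_bag = BaggingRegressor(n_estimators = 10, max_samples = 1.0,
                              max_features = 1.0, bootstrap = True)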

# Fit the model
abalone_bag.fit(x_trg, y_trg)
# Note: score() on a regressor returns the coefficient of determination (R^2), not an accuracy
print("R^2 of Bagging model on training set is: ", abalone_bag.score(x_trg, y_trg))
print("R^2 of Bagging model on test set is: ", abalone_bag.score(x_test, y_test))

# Predict using Bagging model
abalone_bag_pred = abalone_bag.predict(x_test)

# Compute RMSE of Bagging model
abalone_bag_rmse = sqrt(mean_squared_error(y_test, abalone_bag_pred))
print("RMSE value of Bagging model is: ", abalone_bag_rmse)

R^2 of Bagging model on training set is:  0.9138279776200258
R^2 of Bagging model on test set is:  0.4672721306872275
RMSE value of Bagging model is:  2.3730278071828845
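If an explicit base learner is wanted for bagging, passing it as the first positional argument sidesteps the keyword rename between scikit-learn versions. The sketch below bags depth-limited trees on synthetic data; `max_depth=5`, the data and the seeds are illustrative assumptions, not tuned choices.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 4))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, size=120)

# The base learner goes first positionally, so the keyword name is never needed
bag = BaggingRegressor(DecisionTreeRegressor(max_depth=5), n_estimators=10, random_state=0)
bag.fit(X, y)

pred = bag.predict(X)
print(len(bag.estimators_), pred.shape)
```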


#### Create Decision Tree model

In [19]:
# Model building
abalone_tree = DecisionTreeRegressor(random_state = 0)

# Fit the model
abalone_tree.fit(x_trg, y_trg)
# Note: score() on a regressor returns the coefficient of determination (R^2), not an accuracy
print("R^2 of Decision Tree model on training set is: ", abalone_tree.score(x_trg, y_trg))
print("R^2 of Decision Tree model on test set is: ", abalone_tree.score(x_test, y_test))

# Predict using Decision Tree model
abalone_tree_pred = abalone_tree.predict(x_test)

# Compute RMSE of Decision Tree model
abalone_tree_rmse = sqrt(mean_squared_error(y_test, abalone_tree_pred))
print("RMSE value of Decision Tree model is: ", abalone_tree_rmse)

R^2 of Decision Tree model on training set is:  1.0
R^2 of Decision Tree model on test set is:  0.0802553151489882
RMSE value of Decision Tree model is:  3.118054932206196


The test RMSE values of the various models, rounded to two decimals, are listed below. The values in parentheses are R² scores (what `score` returns for a scikit-learn regressor), not classification accuracies:
- AdaBoost : 3.00 (Train R² - 0.24 and Test R² - 0.15)
- new AdaBoost : 2.68 (Train R² - 0.41 and Test R² - 0.32)
- Extra Tree : 2.23 (Train R² - 1.00 and Test R² - 0.53)
- Random Forest : 2.24 (Train R² - 0.94 and Test R² - 0.53)
- Bagging : 2.37 (Train R² - 0.91 and Test R² - 0.47)
- Decision Tree : 3.12 (Train R² - 1.00 and Test R² - 0.08)
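A summary like the one above can also be assembled in code by looping over the models and collecting the metrics into a DataFrame. The sketch below does so on synthetic data from `make_regression`, which is a stand-in for the abalone split; the reduced `n_estimators` and the seeds are assumptions to keep it fast and repeatable.

```python
from math import sqrt

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; with the notebook's data, use x_trg, y_trg, x_test, y_test
X, y = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "AdaBoost": AdaBoostRegressor(n_estimators=20, random_state=0),
    "Extra Trees": ExtraTreesRegressor(n_estimators=20, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=20, random_state=0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "train_r2": model.score(X_tr, y_tr),
        "test_r2": model.score(X_te, y_te),
        "test_rmse": sqrt(mean_squared_error(y_te, model.predict(X_te))),
    })

# Sort so the model with the lowest test RMSE comes first
results = pd.DataFrame(rows).sort_values("test_rmse").reset_index(drop=True)
print(results.round(2))
```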

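Extra Tree has the lowest test RMSE (2.23) of all the models tried, narrowly ahead of Random Forest (2.24), so on this train-test split Extra Tree is the best of these models. The margin over Random Forest is small, and both models fit the training set far better than the test set (training R² of 1.00 and 0.94 against about 0.53 on test), which indicates overfitting.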
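A ranking drawn from a single train-test split can shift with a different split, so one way to check it is k-fold cross-validation. The sketch below compares two of the models by cross-validated RMSE. It runs on synthetic data as a stand-in for the abalone features, and the fold count, `n_estimators` and seeds are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (illustrative only)
X, y = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)

candidates = {
    "Extra Trees": ExtraTreesRegressor(n_estimators=20, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=20, random_state=0),
}

cv_rmse = {}
for name, model in candidates.items():
    # The scorer returns negated RMSE, one value per fold
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = (-scores.mean(), scores.std())

for name, (mean_rmse, std_rmse) in cv_rmse.items():
    print(f"{name}: RMSE {mean_rmse:.2f} +/- {std_rmse:.2f}")
```

If the difference between the mean RMSE values is smaller than the spread across folds, the two models are hard to separate on this evidence alone.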