The data and variable names are in separate files; you will likely need both. The goal here is to predict the age of an abalone from the other variables in the dataset, because the traditional method for aging these organisms is tedious.

There are two challenges (in my opinion):

1. You should try to build the best bagging-based model (this includes random forests) to predict age.

2. The UC Irvine Machine Learning Repository classifies this dataset as a "classification" dataset, but age is stored as a numeric (albeit discrete-valued) variable, so I think it is reasonable to treat this as a regression problem instead. It's up to you!

Submit your HTML or .ipynb file here.

In [4]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, roc_auc_score, f1_score, roc_curve,
                             classification_report, cohen_kappa_score)
from sklearn.model_selection import (train_test_split, GridSearchCV, cross_val_score,
                                     RepeatedStratifiedKFold, RepeatedKFold)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier

# suppress convergence warnings and pandas chained-assignment warnings
warnings.simplefilter("ignore", ConvergenceWarning)
pd.options.mode.chained_assignment = None

In [5]:
# load in the data

column_names = [
    'Sex', 'Length', 'Diameter', 'Height',
    'WholeWeight', 'ShuckedWeight',
    'VisceraWeight', 'ShellWeight', 'Rings'
]
abalone = pd.read_csv("abalone.data", header=None, names=column_names)
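
Note that the target column loaded above is `Rings`, not age. The dataset's names file states that adding 1.5 to the ring count gives the age in years, so a derived age column could be sketched as below. The tiny `sample` frame and the `Age` column name are illustrative stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the abalone frame loaded above.
sample = pd.DataFrame({"Sex": ["M", "F", "I"], "Rings": [15, 7, 9]})

# Per the dataset's names file, age in years is the ring count plus 1.5.
sample["Age"] = sample["Rings"] + 1.5

print(sample)
```

Since the shift is a constant, a model that predicts `Rings` well predicts age equally well; only the reported units change.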


## Bagging Classification

In [None]:
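# One-hot encode Sex (dropping the first level) and use Rings as the target
X = pd.get_dummies(abalone.drop("Rings", axis=1), drop_first=True)
y = abalone["Rings"]

# Bagged decision trees with default settings; each ring count is its own class
model = BaggingClassifier()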

# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))



Accuracy: 0.233 (0.021)
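
Exact-match accuracy is a strict yardstick here, because every distinct ring count is treated as its own class: a prediction that is off by one ring is penalized the same as one that is off by ten. One way to see how close the misses are is a tolerance-based accuracy. The helper `within_k_accuracy` below is a hypothetical illustration (neither the notebook nor scikit-learn defines it), and the toy arrays are made up.

```python
import numpy as np

def within_k_accuracy(y_true, y_pred, k=1):
    """Fraction of predictions within k rings of the true value."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred) <= k))

# Toy ring counts, made up for illustration only.
y_true = [9, 10, 7, 12, 8]
y_pred = [10, 10, 9, 11, 8]

exact = within_k_accuracy(y_true, y_pred, k=0)  # exact matches only
near = within_k_accuracy(y_true, y_pred, k=1)   # off by at most one ring
print(exact, near)
```

In the notebook, the same function could be applied to out-of-fold predictions from the bagging classifier to judge whether the low accuracy reflects near misses or large errors.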


In [11]:
# Refit on the full dataset and average the in-sample predictions
model.fit(X, y)
predictions = model.predict(X)
avg_pred = np.mean(predictions)
avg_pred

np.float64(9.915250179554704)

## Bagging Regressor

In [8]:
# define the model
model2 = BaggingRegressor()
# evaluate the model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# scikit-learn returns negated MAE for this scorer, so flip the sign to report a positive error
n_scores = -cross_val_score(model2, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('MAE: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

MAE: 1.619 (0.074)
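
Challenge 1 asks for the best bagging-based model, random forests included. A minimal sketch of tuning a random forest regressor with `GridSearchCV` might look like the following. The data are synthetic stand-ins generated with `make_regression` (in the notebook, the `X` and `y` built earlier would take their place), and the grid values are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data; replace with the notebook's X and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Small illustrative grid (assumed values, not recommendations).
param_grid = {"max_features": [2, 4, 8], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

# best_score_ is the negated MAE, so flip the sign before reporting it.
best_mae = -search.best_score_
print(search.best_params_, round(best_mae, 3))
```

Setting `max_features` to the full number of columns makes the forest behave much like plain bagging of trees, so a grid like this compares the two approaches within one search.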


In [10]:

# Refit on the full dataset and average the in-sample predictions
model2.fit(X, y)
predictions = model2.predict(X)
avg_pred = np.mean(predictions)
avg_pred

np.float64(9.948288245152023)

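For the bagging regression model, the average predicted ring count is approximately 9.95. Because the model is refit on all of the data and then asked to predict that same data, this is an in-sample average.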
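
An average of in-sample predictions mostly tracks the average of the target, so it says little about how well individual abalone are predicted. A held-out error is usually more informative. The sketch below illustrates the idea on synthetic stand-in data; the names `X_demo`, `y_demo`, and the data-generating formula are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one informative feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 10 + 2 * X_demo[:, 0] + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

bagger = BaggingRegressor(random_state=0).fit(X_train, y_train)

# Compare the in-sample error with the error on data the model has not seen.
train_mae = mean_absolute_error(y_train, bagger.predict(X_train))
test_mae = mean_absolute_error(y_test, bagger.predict(X_test))

print(f"mean prediction: {bagger.predict(X_train).mean():.2f}, mean target: {y_train.mean():.2f}")
print(f"train MAE: {train_mae:.2f}, held-out MAE: {test_mae:.2f}")
```

The in-sample error of bagged trees is typically optimistic, which is why the cross-validated MAE is the better number to compare across models.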