<a href="https://colab.research.google.com/github/bwood06/EE8603_Final_Project/blob/main/final_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EE8603 Final Project - Abalone Regression
### Created by: Brendan Wood - November 29, 2023

Run the following cells in order to build a regression model that predicts the age of abalone snails.

## Import libraries
First, install all required Python libraries in the notebook environment.

In [None]:
!pip install uci-dataset
!pip install pycaret
!pip install pandas
!pip install matplotlib
!pip install scikit-learn
!pip install seaborn
!pip install scipy

## Load Dataset
Next, load the dataset. The cell prints the head of the dataset to confirm that it loaded correctly, summarizes the samples and features, and plots a heat map of the Pearson correlations between the numeric columns to better describe the dataset.
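
For reference, each cell of the heat map holds a Pearson correlation coefficient. The sketch below computes one by hand in pure Python; it is only an illustration, not the pandas implementation, and the function name `pearson_r` and the sample values are made up for the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and the two sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Made-up values for illustration, not taken from the abalone dataset
shell_weight = [0.10, 0.20, 0.30, 0.40]
rings = [5.0, 9.0, 13.0, 17.0]  # increases linearly with shell_weight

r = pearson_r(shell_weight, rings)
print(round(r, 3))
```

A coefficient close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.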

In [None]:
import seaborn as sns
from uci_dataset import load_abalone
from matplotlib import pyplot as plt
dataset = load_abalone()
print(f'Below is the head of the dataset to ensure it loaded correctly. \n')
print(dataset.head())
print('\n----------\n')
print(f'There are {dataset.shape[0]} samples.')
print('\n----------\n')
print(f'There are {dataset.shape[1] - 1} features that can be used for prediction. The features and their data types are listed below: \n')
print(dataset.dtypes)
# Drop the target ('Rings') first so it is not counted as a numeric feature
print(f'\nThere are {dataset.drop(columns="Rings").select_dtypes(exclude=object).shape[1]} numeric features and {dataset.select_dtypes(include=object).shape[1]} non-numeric feature. '
      f'There are {len(dataset["Sex"].unique())} options for the categorical variable.')
print(f'After one hot encoding, there will be a total of {dataset.shape[1] - 1 + 2} features that can be used for prediction.')
print('\n----------\n')
print('Below is a Pearson correlation coefficient heat map of the numeric columns. From it we can predict which features will be the most important.')
print('Since a larger coefficient means a stronger correlation, it can be hypothesized that the weight of the shell will be the strongest predictor. \n')
sns.heatmap(dataset.corr(numeric_only=True, method='pearson'), square=True, cmap='RdYlGn')
plt.suptitle('Pearson Correlation Coefficient Heat Map of Abalone Numeric Features')
plt.show()

## Preprocessing
The code below creates an experiment object with the preprocessing settings. It specifies the target variable, the proportion of samples to use for training, which categorical features to one-hot encode, and the number of folds to use in k-fold cross-validation. It also adds a normalized MSE metric to the experiment so that results can be compared directly with the literature.
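
The normalized MSE is the ordinary MSE divided by the variance of the target, so a perfect model scores 0 and a model that always predicts the mean scores close to 1. The sketch below illustrates this in pure Python. It assumes the sample variance (n - 1 denominator), which is what the pandas `std()` default produces; the function name `normalized_mse` and the toy ages are hypothetical.

```python
from statistics import variance

def normalized_mse(y_true, y_pred, target_variance):
    """Mean squared error divided by the variance of the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse / target_variance

# Toy ages for illustration only, not taken from the abalone dataset
ages = [8.5, 10.5, 12.5, 14.5]
target_variance = variance(ages)  # sample variance (n - 1 denominator)

# A perfect model reproduces every target exactly
perfect = normalized_mse(ages, ages, target_variance)

# A baseline that always predicts the mean age
mean_age = sum(ages) / len(ages)
baseline = normalized_mse(ages, [mean_age] * len(ages), target_variance)

print(perfect, baseline)
```

With the sample variance, the mean-predicting baseline scores (n - 1) / n, which approaches 1 as the number of samples grows.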

In [None]:
from pycaret.regression import RegressionExperiment
from sklearn.metrics import mean_squared_error

# Convert the target from ring count to age in years (age = rings + 1.5).
# Run this cell only once per loaded dataset; re-running it adds 1.5 again.
dataset['Rings'] += 1.5

# Next build the experiment object with the required arguments
exp = RegressionExperiment()
exp.setup(dataset,
          target='Rings',
          train_size=0.8,
          categorical_features=['Sex'],
          fold=5,
          session_id=123)

# The cited papers report a normalized mean squared error, so define a metric that divides
# sklearn's mean_squared_error by the variance of the target over the full dataset.
target_variance = dataset['Rings'].var()

def norm_mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred) / target_variance

# Add the metric to the experiment and call it Normalized MSE
_ = exp.add_metric('norm_mse', 'Normalized MSE', norm_mse, greater_is_better=False)

## Train the Model
This cell trains every regression model available in PyCaret and selects the best-performing one, ranked by Normalized MSE, to use as the final model. Note that the results table highlights the highest Normalized MSE. This is a display error: lower is better for this metric, and the best Normalized MSE is in the top row of the table.
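
Because Normalized MSE is an error metric, candidates are ranked in ascending order and the smallest value wins. The sketch below shows that ranking on its own; the model names and scores are placeholders for illustration, not results from this notebook.

```python
# Placeholder scores for illustration only; not results from this notebook
scores = {"model_a": 0.52, "model_b": 0.47, "model_c": 0.61}

# Lower is better for an error metric, so sort ascending by score
ranking = sorted(scores.items(), key=lambda item: item[1])

# The first entry of the ascending ranking is the best model
best_name, best_score = ranking[0]
print(best_name, best_score)
```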

In [None]:
best = exp.compare_models(sort='Normalized MSE')

## Analyzing the Selected Model

Below is a series of plots illustrating the performance of the best model.

### Residuals
The plot below shows the model's residuals as a function of the predicted value.
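
As a minimal sketch of what is being plotted, the snippet below computes residuals for a few made-up predictions. It assumes the convention residual = actual - predicted; some plotting libraries use the opposite sign, which flips the plot vertically. The function name `residuals` and the values are hypothetical.

```python
def residuals(y_true, y_pred):
    """Residuals under the convention actual minus predicted."""
    return [t - p for t, p in zip(y_true, y_pred)]

# Made-up ages for illustration, not model output from this notebook
actual_age = [8.5, 10.5, 12.5]
predicted_age = [9.0, 10.5, 11.0]

res = residuals(actual_age, predicted_age)

# Each plotted point pairs a predicted value (x axis) with its residual (y axis)
points = list(zip(predicted_age, res))
print(points)
```

Residuals scattered evenly around zero across the range of predictions suggest that the model has no systematic bias.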

In [None]:
exp.plot_model(best, plot='residuals')

### Feature Importance
The plot below shows which features have the highest importance when predicting the age of an abalone. All features are shown.

In [None]:
exp.plot_model(best, plot='feature_all')

### Error Plot
The final plot shows the predicted age as a function of actual age. A perfect model would have all points along the identity line.

In [None]:
exp.plot_model(best, plot='error')