<a href="https://colab.research.google.com/github/bruceMacLeod/COS475-575/blob/main/Lab/Abalone-Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center>Lab 9</center></h1>
<h1><center>Neural Networks on Tabular Data</center></h1>

Neural networks are not usually the highest-performing machine learning technique on tabular data. Recent papers suggest that tree-based algorithms with boosting perform well (https://arxiv.org/abs/2110.01889, https://www.sciencedirect.com/science/article/abs/pii/S1566253521002360). Nonetheless, we will apply neural networks to the problem of predicting the age of an abalone to develop our foundational deep learning skills.

Goal: Predict the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.

Attribute Information:

For each attribute we give the name, data type, measurement unit, and a brief description. The number of rings is the value to predict, either as a continuous value (regression) or as a class label (classification).

Name / Data Type / Measurement Unit / Description

- Sex / nominal / -- / M, F, and I (infant)
- Length / continuous / mm / Longest shell measurement
- Diameter / continuous / mm / perpendicular to length
- Height / continuous / mm / with meat in shell
- Whole weight / continuous / grams / whole abalone
- Shucked weight / continuous / grams / weight of meat
- Viscera weight / continuous / grams / gut weight (after bleeding)
- Shell weight / continuous / grams / after being dried
- Rings / integer / -- / +1.5 gives the age in years
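
As a rough illustration of the two ways to frame the target, the sketch below builds a regression target (age in years) and a classification target (age bucket) from ring counts. The ring values and the cut points 8 and 11 are assumptions made up for this example; they do not come from the dataset or from this lab.

```python
import numpy as np

# Hypothetical ring counts for five abalone (not taken from the dataset).
rings = np.array([4, 7, 9, 12, 20])

# Regression target: age in years, following the "+1.5" rule in the table.
age_years = rings + 1.5

# Classification target: bucket the ring counts into three classes.
# The cut points 8 and 11 are arbitrary assumptions for illustration.
bin_edges = [8, 11]
age_class = np.digitize(rings, bin_edges)  # 0: rings < 8, 1: 8 <= rings < 11, 2: rings >= 11

print(age_years)
print(age_class)
```

This lab uses the regression framing, so only the `+ 1.5` conversion is needed later on.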

In [None]:
import pandas as pd
import requests
import io
import os

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

from sklearn.impute import SimpleImputer 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer 
from pylab import cm

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math

import numpy as np


import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
print(keras.__version__)

from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential


In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'

abalone_df = pd.read_csv(url, header=None)
abalone_df.columns = ['Sex', 'Length', 'Diameter', 'Height',
                      'Whole Weight', 'Shucked Weight',
                      'Viscera Weight', 'Shell Weight',
                      'Rings']

In [None]:
abalone_df.head()

#### Data Cleaning 


In [None]:
abalone_df[abalone_df['Height']<=0]

In [None]:
abalone_df = abalone_df[abalone_df['Height']>0]

#### Setup train/test data 

In [None]:
X = abalone_df.drop('Rings', axis=1)
y = abalone_df.Rings + 1.5

In [None]:
# set aside 20% of the data as a test set for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, shuffle = True, random_state = 42)


#### Normalize the data for analysis 


In [None]:
num_attribs = ['Length','Diameter','Height','Whole Weight','Shucked Weight','Viscera Weight','Shell Weight']
cat_attribs = ["Sex"]

In [None]:
num_pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy="median")),
            ('std_scaler', StandardScaler()),
        ])

In [None]:
full_pipeline = ColumnTransformer([
             ("num", num_pipeline, num_attribs),
             ("cat", OneHotEncoder(), cat_attribs),
         ])
abalone_train = full_pipeline.fit_transform(X_train)
abalone_test = full_pipeline.transform(X_test)

In [None]:
abalone_train.shape
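
The transformed matrix should have 10 columns: the 7 scaled numeric attributes plus 3 one-hot columns for Sex. To make clear what the pipeline does, here is a small NumPy-only sketch of the same two ideas: standardizing with statistics learned from the training data only, and one-hot encoding a categorical column. The numbers are made up for illustration, and the sketch is a stand-in for `StandardScaler` and `OneHotEncoder`, not their implementation.

```python
import numpy as np

# Hypothetical shell lengths; these values are invented for illustration.
train_length = np.array([0.35, 0.45, 0.53, 0.44, 0.33])
test_length = np.array([0.40, 0.60])

# "Fit": learn the mean and standard deviation from the training data only.
mean = train_length.mean()
std = train_length.std()  # population standard deviation (ddof=0)

# "Transform": apply the same training statistics to both sets.
train_scaled = (train_length - mean) / std
test_scaled = (test_length - mean) / std

# One-hot encoding: each Sex category becomes its own 0/1 column.
sex = np.array(["M", "F", "I", "M", "F"])
categories = np.unique(sex)  # sorted unique values: F, I, M
one_hot = (sex[:, None] == categories).astype(float)

print(train_scaled.mean(), train_scaled.std())
print(one_hot)
```

Reusing the training mean and standard deviation on the test values is the reason the notebook calls `fit_transform` on the training data but only `transform` on the test data.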

#### Linear regression model 

The mean absolute error indicates that our linear regression predictions are off by approximately 1.6 years on average.
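
For reference, the mean absolute error is just the average of the absolute differences between the true and predicted values. The sketch below computes it by hand on four made-up ages (these are not model outputs from this lab) and compares it with the root mean squared error, which penalizes large misses more heavily.

```python
import numpy as np

# Hypothetical true and predicted ages in years (invented for illustration).
y_true = np.array([10.5, 8.5, 12.5, 9.5])
y_pred = np.array([11.5, 8.0, 10.5, 9.5])

# Absolute error per sample: 1.0, 0.5, 2.0, 0.0
abs_errors = np.abs(y_true - y_pred)

# Mean absolute error: (1.0 + 0.5 + 2.0 + 0.0) / 4 = 0.875
mae = abs_errors.mean()

# Root mean squared error, for comparison.
rmse = np.sqrt(((y_true - y_pred) ** 2).mean())

print(mae, rmse)
```

Because the target is age in years, the MAE reads directly as "years off, on average".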

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(abalone_train, y_train)
yp = lin_reg.predict(abalone_test)

print(mean_absolute_error(y_test, yp))


#### Exercise : Develop a neural network and test the performance using a validation set

Steps : 
- Neural networks have a lot of hyperparameters, so we start by creating a validation set to compare our models. Let's hold out 15% of the training set. Note that this immediately puts the neural network at a disadvantage by reducing the amount of data available for training. I do this step in the cell below.

- Using the code on Pg 308 of the book, develop a neural network and evaluate its performance on the validation set with the mean absolute error metric

- Plot the training and validation loss

- Experiment with adding layers and changing the size of the layers, then choose the best model

- Complete your modeling by evaluating on the test dataset. Did you manage to beat the linear regression?
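
The steps above can be sketched roughly as follows. This is a minimal starting point, not the book's code and not a tuned solution: it trains on synthetic stand-in data with 10 features (the same width as the pipeline output), and the layer size, optimizer, and epoch count are arbitrary assumptions you are expected to change.

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in data with 10 features per row (an assumption for this sketch).
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(200, 10)).astype("float32")
y_tr = (X_tr.sum(axis=1) + 10.0).astype("float32")
X_va = rng.normal(size=(50, 10)).astype("float32")
y_va = (X_va.sum(axis=1) + 10.0).astype("float32")

# One hidden layer and a single linear output unit for regression.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(30, activation="relu"),  # 30 units is an arbitrary choice
    keras.layers.Dense(1),
])

# Optimize mean squared error and track mean absolute error as a metric.
model.compile(loss="mse", optimizer="adam", metrics=["mae"])

# Passing validation_data records validation loss and MAE after every epoch.
history = model.fit(X_tr, y_tr, epochs=5,
                    validation_data=(X_va, y_va), verbose=0)

val_loss, val_mae = model.evaluate(X_va, y_va, verbose=0)
print(f"validation MAE: {val_mae:.3f}")
print(sorted(history.history.keys()))
```

In the lab itself, replace the synthetic arrays with `abalone_train`, `y_train`, `abalone_val`, and `y_val`. The per-epoch values in `history.history["loss"]` and `history.history["val_loss"]` are what you plot for the training and validation curves.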


In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state= 8) 
abalone_train = full_pipeline.transform(X_train)
abalone_val = full_pipeline.transform(X_val)
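
To see how much data each split leaves for fitting, the arithmetic below mirrors the two calls to `train_test_split`. It assumes 4,175 rows remain after the cleaning step (check `len(abalone_df)` in your own run) and that a fractional `test_size` is rounded up to a whole number of rows.

```python
import math

# Assumed row count after data cleaning; verify with len(abalone_df).
n_total = 4175

# First split: 20% held out as the test set.
n_test = math.ceil(0.20 * n_total)
n_train_full = n_total - n_test

# Second split: 15% of the remaining rows held out for validation.
n_val = math.ceil(0.15 * n_train_full)
n_train = n_train_full - n_val

print(n_train, n_val, n_test)
print(f"fraction of all rows used for fitting: {n_train / n_total:.2f}")
```

Under these assumptions the network is fit on roughly two thirds of the rows, while the linear regression above was fit on the full 80% training split. Keep that in mind when comparing the two.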

#### Graduate students/Extra Credit 

Add TensorBoard visualization to help guide your neural network model building. See the section in the book starting on Pg 317.