## Lesson 7 - Age of Abalone
### Author: Ana Javed

### Workplace Scenario

Kennedy's oceanographic institute client pulled into port the other day with a ton (literally) of collected samples and corresponding data to process. Some of these data tasks are being distributed to others to work on; you've got the abalone (marine snails) data to classify and determine the age from physical characteristics. 

##### Background

Age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. Other measurements, which are easier to obtain, could be used to predict the age. According to the data provider, original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled (by dividing by 200) for use with machine learning algorithms such as SVMs.
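As a quick illustration of the scaling described above, the sketch below multiplies a scaled value by 200 to recover the measurement on its original scale. The helper name `unscale` is hypothetical, and the original unit is not stated in this lesson, so the result is left unitless.

```python
# Minimal sketch: undo the divide-by-200 scaling described by the data provider.
SCALE_FACTOR = 200  # taken from the data provider's note above


def unscale(value, factor=SCALE_FACTOR):
    """Return a measurement on its original scale (hypothetical helper)."""
    return value * factor


# Scaled length of the first abalone shown in the dataset preview
original_length = unscale(0.455)
print(round(original_length, 3))
```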

The target field is “Rings”. Since the output is continuous, the problem can be handled with Support Vector Regression, or it can be turned into a binary Support Vector Classification problem by assigning examples that are younger than 11 years old to class ‘0’ and older examples to class ‘1’.

### Instructions

Using the Abalone csv file (location: https://library.startlearninglabs.uw.edu/DATASCI420/2019/Datasets/Abalone.csv), create a new notebook to build an experiment using a support vector machine classifier and a support vector regression model. Perform each of the following tasks and answer the questions:

- Convert the continuous output value from continuous to binary (0,1) and build an SVC
- Using your best guess for hyperparameters and kernel, what is the percentage of correctly classified results?
- Test different kernels and hyperparameters or consider using sklearn.model_selection.GridSearchCV. Which kernel performed best with what settings?
- Show recall, precision and f-measure for the best model
- Using the original data, with rings as a continuous variable, create an SVR model
- Report on the predicted variance and the mean squared error





In [1]:
## Importing Necessary Libraries & Packages 
import matplotlib.pyplot as plt
import pandas as pd 
import numpy as np
import datetime as dt
import csv
import sklearn 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Declaring inline visualizations 
%matplotlib inline


In [2]:
## Reading data file into Dataframe 
url = 'https://library.startlearninglabs.uw.edu/DATASCI420/2019/Datasets/Abalone.csv'
df = pd.read_csv(url, sep=",")

## First 5 Rows from Dataframe
df.head()

| | Sex | Length | Diameter | Height | Whole Weight | Shucked Weight | Viscera Weight | Shell Weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


In [3]:
## Conducting Exploratory Data Analysis: 
print(df.shape)  # (4177, 9)
print(df.dtypes) 
print(df.describe()) 

(4177, 9)
Sex                object
Length            float64
Diameter          float64
Height            float64
Whole Weight      float64
Shucked Weight    float64
Viscera Weight    float64
Shell Weight      float64
Rings               int64
dtype: object
            Length     Diameter       Height  Whole Weight  Shucked Weight  \
count  4177.000000  4177.000000  4177.000000   4177.000000     4177.000000   
mean      0.523992     0.407881     0.139516      0.828742        0.359367   
std       0.120093     0.099240     0.041827      0.490389        0.221963   
min       0.075000     0.055000     0.000000      0.002000        0.001000   
25%       0.450000     0.350000     0.115000      0.441500        0.186000   
50%       0.545000     0.425000     0.140000      0.799500        0.336000   
75%       0.615000     0.480000     0.165000      1.153000        0.502000   
max       0.815000     0.650000     1.130000      2.825500        1.488000   

       Viscera Weight  Shell Weight    

### 1. Convert the continuous output value from continuous to binary (0,1) and build an SVC

In [4]:
# Assigning examples that are younger than 11 years old to class: ‘0’ and those that are older (class: ‘1’).
# The threshold is applied to the "Rings" column; labels are kept as strings.

## Creating the binary "ring_class" column in one vectorized step:
df["ring_class"] = np.where(df["Rings"] >= 11, '1', '0')  # '1' = Older, '0' = Younger
        
print("\nAfter:")
print(df.loc[:, "ring_class"].value_counts())



After:
0    2730
1    1447
Name: ring_class, dtype: int64


In [5]:
## Making "Sex" One-hot encoded columns 
df_expanded = pd.get_dummies(df, columns = ["Sex"]) 


In [6]:
## Final Dataframe 
df_expanded.head()

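| | Length | Diameter | Height | Whole Weight | Shucked Weight | Viscera Weight | Shell Weight | Rings | ring_class | Sex_F | Sex_I | Sex_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 | 1 | 0 | 0 | 1 |
| 1 | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 | 0 | 0 | 0 | 1 |
| 2 | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 | 0 | 1 | 0 | 0 |
| 3 | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 | 0 | 0 | 0 | 1 |
| 4 | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 | 0 | 0 | 1 | 0 |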


### 2. Using your best guess for hyperparameters and kernel, what is the percentage of correctly classified results?

In [7]:
## Separating out Target Variable & Test/Train Sets 

col_names_list = list(df_expanded.columns)
col_names_list.remove("ring_class")
col_names_list.remove("Rings")

X = df_expanded.loc[:, col_names_list]
Y = df_expanded.loc[:, "ring_class"]

X_train, X_test, y_train, y_test = train_test_split(X, Y, 
                    test_size = 0.25, random_state = 99)

In [8]:
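cost = .9 # penalty parameter 
gamma = 5 # kernel coefficient; ignored by the linear kernel used below

from sklearn import svm, metrics
from sklearn.metrics import classification_report

# Test linear kernel
clf = svm.SVC(gamma=gamma, kernel='linear', C=cost).fit(X_train, y_train)
# NOTE: classification_report expects (y_true, y_pred). The arguments are reversed here,
# so the precision and recall columns in the output are swapped and support counts predictions.
print("Linear Kernel")
print(classification_report(clf.predict(X_test), y_test))

Linear Kernel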
              precision    recall  f1-score   support

           0       0.90      0.76      0.83       807
           1       0.47      0.72      0.57       238

    accuracy                           0.75      1045
   macro avg       0.69      0.74      0.70      1045
weighted avg       0.80      0.75      0.77      1045



The share of correctly classified results (accuracy) is 0.75, or 75%.
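scikit-learn's metric functions take the true labels first and the predictions second. The sketch below uses toy labels, invented for illustration, to show that swapping the two exchanges precision and recall while leaving accuracy unchanged. This is worth keeping in mind when reading a report like the one above.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels, invented for illustration only (not abalone results)
y_true = ['0', '0', '0', '1', '1']
y_pred = ['0', '0', '1', '1', '1']

# Conventional order: (y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label='1')
recall = recall_score(y_true, y_pred, pos_label='1')
accuracy = accuracy_score(y_true, y_pred)

# Reversed order: precision and recall trade places
swapped_precision = precision_score(y_pred, y_true, pos_label='1')
swapped_recall = recall_score(y_pred, y_true, pos_label='1')
swapped_accuracy = accuracy_score(y_pred, y_true)

print("precision:", round(precision, 3), "recall:", round(recall, 3))
print("swapped precision:", round(swapped_precision, 3),
      "swapped recall:", round(swapped_recall, 3))
print("accuracy:", accuracy, "reversed accuracy:", swapped_accuracy)
```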

### 3. Test different kernels and hyperparameters or consider using sklearn.model_selection.GridSearchCV. Which kernel performed best with what settings?

In [9]:
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
parameters = {'kernel':('linear', 'poly'), 'C':[.7, .8, .9, 1], 'gamma':[5, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(X_train, y_train)
clf.predict(X_test)


array(['0', '1', '0', ..., '0', '1', '0'], dtype=object)

In [10]:
## The Best Parameters: 
print("Best parameters identified: ")
print(clf.best_params_)

print("\n")

## Some Results of the GridSearchCV Output 
results = pd.DataFrame(clf.cv_results_)
print(results.head(15))



Best parameters identified: 
{'C': 0.7, 'gamma': 10, 'kernel': 'poly'}


    mean_fit_time  std_fit_time  mean_score_time  std_score_time param_C  \
0        0.185732      0.008743         0.038187        0.000794     0.7   
1        2.404163      0.212576         0.036837        0.001945     0.7   
2        0.182182      0.003174         0.037997        0.000442     0.7   
3       20.037556      2.785862         0.036111        0.001944     0.7   
4        0.182693      0.003560         0.037234        0.000570     0.8   
5        2.337878      0.206666         0.036232        0.001848     0.8   
6        0.181574      0.005814         0.038001        0.001256     0.8   
7       24.088684      3.467405         0.035141        0.000614     0.8   
8        0.183876      0.006564         0.038716        0.001069     0.9   
9        2.832718      0.410373         0.036512        0.002961     0.9   
10       0.182851      0.004464         0.038089        0.001102     0.9   
11      28.427
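Because the linear kernel ignores `gamma`, a single grid that crosses every kernel with every `gamma` value fits some linear models more than once. One possible variation, sketched here on synthetic data rather than the abalone file, splits the grid by kernel and scales the features inside a pipeline. The grid values, the `rbf` kernel, and the variable names are illustrative assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 8 numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Scaling inside the pipeline is refit on each cross-validation fold
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Separate grids so gamma is only searched for a kernel that uses it
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
     "svc__gamma": ["scale", 0.1, 1]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print("CV accuracy:", round(search.best_score_, 3))
print("Test accuracy:", round(search.score(X_te, y_te), 3))
```

`best_score_` is the mean cross-validated score of the winning combination, so it can be compared with the held-out test accuracy as a rough check for overfitting to the search.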

### 4. Show recall, precision and f-measure for the best model

In [9]:
## Updating Model to use Parameters from the GridSearchCV
cost = 0.7 # penalty parameter 
gamma = 10 # defines the influence of input vectors on the margins

from sklearn import svm, metrics
from sklearn.metrics import classification_report

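# SVC Classifier with "Best" Parameters
clf = svm.SVC(gamma=gamma, kernel='poly', C=cost).fit(X_train, y_train)
# NOTE: classification_report expects (y_true, y_pred). The arguments are reversed here,
# so the precision and recall columns in the output are swapped and support counts predictions.
print("Poly Kernel")
print(classification_report(clf.predict(X_test), y_test))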

Poly Kernel
              precision    recall  f1-score   support

           0       0.89      0.79      0.84       770
           1       0.55      0.73      0.63       275

    accuracy                           0.77      1045
   macro avg       0.72      0.76      0.73      1045
weighted avg       0.80      0.77      0.78      1045



With the polynomial kernel, cost 0.7, and gamma 10, accuracy rose by 2 percentage points to 77%. Although this is an improvement, the classifier still leaves room for better results.

### 5. Using the original data, with rings as a continuous variable, create an SVR model

In [13]:
X = df_expanded.loc[:, col_names_list].copy()  # copy so scaling does not modify df_expanded
Y = df_expanded.loc[:, "Rings"]  # Only selecting "Rings" Continuous Column for target 


## Standardizing Data Columns that are Not Binary (each column is scaled independently):
numeric_cols = ['Length', 'Diameter', 'Height', 'Whole Weight', 'Shucked Weight', 
                'Viscera Weight', 'Shell Weight']
X[numeric_cols] = StandardScaler().fit_transform(X[numeric_cols])


## Splitting Testing /Training Data: 
X_train, X_test, y_train, y_test = train_test_split(X, Y, 
                    test_size = 0.25, random_state = 99)

## SVR 
regr = svm.SVR()
regr.fit(X_train, y_train) ## Training Model
y_predict = regr.predict(X_test) ## Testing Model 


### 6. Report on the predicted variance and the mean squared error



In [14]:
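from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_predict)
print("MSE: {:.4f}".format(mse))
print("RMSE: {:.4f}".format(np.sqrt(mse)))

# R^2 is computed on the training data here; regr.score(X_test, y_test) would give the held-out value
rsqd = regr.score(X_train, y_train)
print("R^2 (training set): ", rsqd)
# (1 - R^2) * Var(y_test): an estimate of the variance left unexplained by the model
print("Predicted Variance: ", (1-rsqd) * np.var(y_test))

MSE: 4.5464
RMSE: 2.1322
R^2 (training set):  0.5619097059750076
Predicted Variance:  4.3959934413831805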
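The quantities asked for in this step can be computed directly with `sklearn.metrics`. The sketch below uses made-up values rather than the abalone predictions, and it reads "predicted variance" as the explained variance score, which is an assumption about the assignment's wording.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([4.0, 5.0, 6.0, 10.0])

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)  # mean of squared errors
rmse_demo = np.sqrt(mse_demo)                            # same units as the target
r2_demo = r2_score(y_true_demo, y_pred_demo)             # 1 - SS_res / SS_tot
ev_demo = explained_variance_score(y_true_demo, y_pred_demo)  # 1 - Var(residuals) / Var(y)

print("MSE:", mse_demo)
print("RMSE:", round(rmse_demo, 4))
print("R^2:", round(r2_demo, 4))
print("Explained variance:", round(ev_demo, 4))
```

R-squared and explained variance differ only when the residuals have a non-zero mean, as they do in this toy example.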


#### Summary

In this lesson, a few different Support Vector Machine techniques were applied to predict the age of abalone. First, a support vector machine classifier was fit with best-guess hyperparameters (cost = 0.9, gamma = 5, linear kernel), which gave an accuracy of 75%. Next, GridSearchCV was used to identify the best combination of parameters for the classifier. The search selected cost = 0.7, gamma = 10, and a polynomial kernel, which achieved 77% accuracy.

Lastly, a support vector regression model was fit to the dataset with rings as a continuous target. This resulted in a mean squared error of about 4.55 (an RMSE of about 2.13 rings) and an R-squared of 0.56 on the training data, which could likely be improved by tuning the hyperparameters. A parameter search was attempted for this model, but it was computationally expensive and did not finish after an hour of processing, so the default parameters were kept.
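One way to keep an SVR parameter search affordable is to run a small grid on a subsample and then refit the chosen settings on all of the training data. The sketch below uses synthetic data; the subsample size, grid values, and variable names are illustrative assumptions, not settings from the run described above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: 8 numeric features, continuous target)
X_demo, y_demo = make_regression(n_samples=400, n_features=8,
                                 noise=10.0, random_state=0)

# Search on a subsample to limit the cost of each fit
X_sub, y_sub = X_demo[:200], y_demo[:200]

pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])
param_grid = {"svr__C": [1, 10, 100], "svr__epsilon": [0.1, 1.0]}

# 6 combinations x 3 folds = 18 fits, plus one refit of the best combination
search = GridSearchCV(pipe, param_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X_sub, y_sub)

# Refit the best pipeline on the full training data
best_model = search.best_estimator_.fit(X_demo, y_demo)
print(search.best_params_)
```

`GridSearchCV` also accepts an `n_jobs` argument to run fits in parallel, which may shorten the wall-clock time further on a multi-core machine.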