# Lesson 07 
# Aeden Jameson

## Best Practices for Assignments & Milestones

- <b>Break the assignment into sections - one section per numbered requirement.</b> Each assignment has numbered requirements/instructions e.g. "1. Read the CIFAR-10 dataset". Each requirement should have at least one markdown cell and at least one code cell. Feel free to combine sections or make other sensible changes if that makes sense for your code and is still clear. The intent is to give you a useful structure and to make sure you get full credit for your work.

- <b>Break the milestone into sections - one section for each item in the rubric.</b> Each milestone has rubric items e.g. "5. Handle class imbalance problem". Each rubric item should have at least one markdown cell and at least one code cell. Feel free to combine sections or make other sensible changes if that makes sense for your code and is still clear. The intent is to give you a useful structure and to make sure you get full credit for your work.

- <b>Include comments, with block comments preferred over in-line comments.</b> A good habit is to start each code cell with comments.

The above put into a useful pattern:

<b>Markdown cell:</b> Requirement #1: Read the CIFAR-10 dataset<br>
<b>Code cell:</b>: Comments followed by code<br>
<b>Markdown cell:</b> Requirement #2: Explore the data<br>
<b>Code cell:</b>: Comments followed by code<br>
<b>Markdown cell:</b> Requirement #3: Preprocess the data and prepare for classification<br>
<b>Code cell:</b>: Comments followed by code<br>

For more information:
- A good notebook example: [DataFrame Basics](https://github.com/Tanu-N-Prabhu/Python/blob/master/Pandas/Pandas_DataFrame.ipynb) 
- More example notebooks: [A gallery of interesting Jupyter Notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks#pandas-for-data-analysis)
- [PEP 8 on commenting](https://www.python.org/dev/peps/pep-0008/)
- [PEP 257 - docstrings](https://www.python.org/dev/peps/pep-0257/)

Occasionally an assignment or milestone will ask you to do something other than write Python code e.g. ask you turn in a .docx file. In which case, please use logical structuring, but the specific notes above may not apply.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from pandas.plotting import scatter_matrix
import scipy.stats as ss

plt.rc('font', size=14) 
sns.set(style="ticks", color_codes=True)

## Step 0: Read & Explore the Dataset

In [4]:
def prepare(fileName = "https://library.startlearninglabs.uw.edu/DATASCI420/2019/Datasets/Abalone.csv"):
    data = pd.read_csv(fileName)
    
    data['Sex'] = data['Sex'].astype('category')

    return data

abalones = prepare()
print('Prepared...')

Prepared...


In [5]:
abalones.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole Weight,Shucked Weight,Viscera Weight,Shell Weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [6]:
abalones.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Sex             4177 non-null   category
 1   Length          4177 non-null   float64 
 2   Diameter        4177 non-null   float64 
 3   Height          4177 non-null   float64 
 4   Whole Weight    4177 non-null   float64 
 5   Shucked Weight  4177 non-null   float64 
 6   Viscera Weight  4177 non-null   float64 
 7   Shell Weight    4177 non-null   float64 
 8   Rings           4177 non-null   int64   
dtypes: category(1), float64(7), int64(1)
memory usage: 265.4 KB


In [7]:
abalones.isnull().any(axis = 0)

Sex               False
Length            False
Diameter          False
Height            False
Whole Weight      False
Shucked Weight    False
Viscera Weight    False
Shell Weight      False
Rings             False
dtype: bool

## Step 1: Convert Rings to Binary Classifier (0,1) and Build an SVC

### Conversion

In [8]:
abalones['Class'] = pd.cut(abalones['Rings'], bins=[0, 11, float("inf")], labels=[0,1]) 
abalones.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole Weight,Shucked Weight,Viscera Weight,Shell Weight,Rings,Class
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15,1
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7,0
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9,0
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10,0
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7,0


In [9]:
abalones['Class'].value_counts()

0    3217
1     960
Name: Class, dtype: int64

### Build an SVC

In [10]:
!pip install category_encoders



In [11]:
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.svm import SVC


X_train, X_test, y_train, y_test = train_test_split(abalones.drop(columns = ["Rings", "Class"]), abalones["Class"], 
                                                    test_size = 0.20, random_state = 42)
X_train = X_train.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)

onehoter =  ce.OneHotEncoder(return_df = True, 
                             cols = ["Sex"], 
                             drop_invariant = True,
                             use_cat_names = True, 
                             handle_missing = 'value', 
                             handle_unknown = 'value')

X_train_featurized = onehoter.fit_transform(X_train)
X_test_featurized = onehoter.fit_transform(X_test)

X_train_featurized.head()

Unnamed: 0,Sex_I,Sex_F,Sex_M,Length,Diameter,Height,Whole Weight,Shucked Weight,Viscera Weight,Shell Weight
0,1,0,0,0.55,0.445,0.125,0.672,0.288,0.1365,0.21
1,1,0,0,0.475,0.355,0.1,0.5035,0.2535,0.091,0.14
2,0,1,0,0.305,0.225,0.07,0.1485,0.0585,0.0335,0.045
3,1,0,0,0.275,0.2,0.065,0.1165,0.0565,0.013,0.035
4,0,0,1,0.495,0.38,0.135,0.6295,0.263,0.1425,0.215


In [12]:
svmc = SVC(probability=True, gamma = 'scale', cache_size = 4096) # cache size can improve performance
svmc.fit(X_train_featurized, y_train)

SVC(C=1.0, break_ties=False, cache_size=4096, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
    verbose=False)

### Accuracy of Our SVC

In [13]:
from sklearn.metrics import accuracy_score

y_hat_test = svmc.predict(X_test_featurized)


print("Accuracy is : {}%".format(accuracy_score(y_test, y_hat_test)*100))

Accuracy is : 80.98086124401914%


## Step 3: Hyper Parameter Search

**NOTE:** I ran GridViewSearch with the following hyperparameter space 
```
hyper_parameter_space = {'kernel': ['poly','linear','rbf','sigmoid'], 
                         'degree': [2, 5],
                         'C': [3, 10], 
                         'gamma': [1, 10]}
```

for about eight hours on an i9 with 8 cores and 64GB of RAM and it still didn't finish. The gamma value seems to be what eats a lot of time. Unfortunately I don't have the time to play with it to that extent.

In [29]:
from sklearn.model_selection import GridSearchCV
    
hyper_parameter_space = {'kernel': ['poly','linear','rbf','sigmoid'], 
                         'degree': [2, 5],
                         'C':[0.1, 1, 10, 100],
                        }

svc = SVC(gamma='scale', probability = False, cache_size = 4096)
clf = GridSearchCV(svc, 
                   hyper_parameter_space, 
                   cv = 5, 
                   refit = True, 
                   return_train_score = True, 
                   n_jobs=-1, 
                   verbose=10)

clf.fit(X_train_featurized, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:    2.9s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    3.6s
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:    4.6s
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:    6.8s
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:    7.8s
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed:    8.8s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:   41.7s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=4096,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': [0.1, 1, 10, 100], 'degree': [2, 5],
                         'kernel': ['poly', 'linear', 'rbf', 'sigmoid']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring=None, verbose=10)

In [30]:
print(f'Best Params: {clf.best_params_}')

Best Params: {'C': 100, 'degree': 2, 'kernel': 'rbf'}


In [31]:
print(f'Best Params: {clf.best_estimator_}')

Best Params: SVC(C=100, break_ties=False, cache_size=4096, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=2, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)


## Step 4: Show recall, precision and f-measure for the best mod

In [32]:
from sklearn.metrics import classification_report
labels = ['Younger than 11','11 or Older']

y_hat_test = clf.predict(X_test_featurized)
print(classification_report(y_test, y_hat_test, target_names=labels))

                 precision    recall  f1-score   support

Younger than 11       0.87      0.92      0.90       653
    11 or Older       0.66      0.52      0.58       183

       accuracy                           0.83       836
      macro avg       0.76      0.72      0.74       836
   weighted avg       0.83      0.83      0.83       836



Looks like the SVC classifier with the hyper parameters `{'C': 100, 'degree': 2, 'kernel': 'rbf'}` performed better than our initial random guess in step two.

## Step 5: Using the original data, with rings as a continuous variable, create an SVR model

In [41]:
Xsvr_train, Xsvr_test, ysvr_train, ysvr_test = train_test_split(abalones.drop(columns = ["Rings", "Class"]), abalones["Rings"], 
                                                    test_size = 0.20, random_state = 42)
Xsvr_train = Xsvr_train.reset_index(drop = True)
Xsvr_test = Xsvr_test.reset_index(drop = True)

onehoter =  ce.OneHotEncoder(return_df = True, 
                             cols = ["Sex"], 
                             drop_invariant = True,
                             use_cat_names = True, 
                             handle_missing = 'value', 
                             handle_unknown = 'value')

Xsvr_train_featurized = onehoter.fit_transform(Xsvr_train)
Xsvr_test_featurized = onehoter.fit_transform(Xsvr_test)

In [44]:
from sklearn.svm import SVR

hyper_parameter_space = {'kernel': ['rbf','sigmoid'], 
                         'degree': [2, 5],
                         #'C':[0.1, 1, 10, 100],
                         'C': np.logspace(np.log10(0.01), np.log10(100), num=10),
                         'gamma':np.logspace(np.log10(0.001), np.log10(2), num=20)
                        }

svr = SVR(cache_size = 4096)
svr_clf = GridSearchCV(svr, 
                   hyper_parameter_space, 
                   cv = 5, 
                   refit = True, 
                   return_train_score = True, 
                   n_jobs=-1, 
                   verbose=10)

svr_clf.fit(X_train_featurized, y_train)
svr_predictions = svr_clf.predict(Xsvr_test_featurized)

Fitting 5 folds for each of 800 candidates, totalling 4000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    2.9s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:    7.1s
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed:    8.3s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:    9.4s
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed:   

## Step 6: Report on the predicted variance and the mean squared error

In [45]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score

### SVC

In [46]:
print("Mean Squared Error is : {}%".format(mean_squared_error(y_test, y_hat_test)))
print("Explain Variance is : {}%".format(explained_variance_score(y_test, y_hat_test)))

Mean Squared Error is : 0.16507177033492823%
Explain Variance is : 0.046653110067866765%


### SVR

In [47]:
print("Mean Squared Error is : {}%".format(mean_squared_error(ysvr_test, svr_predictions)))
print("Explain Variance is : {}%".format(explained_variance_score(ysvr_test, svr_predictions)))

Mean Squared Error is : 103.69919681669526%
Explain Variance is : 0.07858155082099783%
