# Train, Validate $\rightarrow$ Train, Test

In this exercise, you will empirically compare the results of a ten-fold cross-validated model with those of a fully trained model.

## Notes and Guidelines
* Read a dataset from disk and use it for a classification task.
* Construct a Gaussian Naive Bayes classifier and fit it to the phoneme dataset provided.
* Save and re-load a trained classifier.
* Compare K-fold cross-validation scores with the success rate of a fully-trained model.


### Dataset
* Dataset acquired from [KEEL](http://sci2s.ugr.es/keel/dataset.php?cod=105), an excellent resource for finding 'toy' datasets (and a few more serious ones).
    * A description of the dataset is provided at the above link - **read it.**
    * Excerpt: 
    *The aim of this dataset is to distinguish between nasal (class 0) and oral sounds (class 1).
    The class distribution is 3,818 samples in class 0 and 1,586 samples in class 1.
    The phonemes are transcribed as follows: sh as in she, dcl as in dark, iy as the vowel in she, aa as the vowel in dark, and ao as the first vowel in water.*
    
* It is not necessary to fully understand the nature or context of the values in the dataset, only that there are five columns of input (feature) data and one column of output (class) data.

## Handling imports and dataset inclusion

In [2]:
import os
import pandas as pd
import numpy as np
import sklearn.model_selection
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split

# locate dataset
DATASET = '/dsa/data/all_datasets/phoneme.csv'  # phoneme classification dataset
assert os.path.exists(DATASET)  # check if the file actually exists

## Constructing DataFrame from raw dataset

<span style="background:yellow">**Note**</span>: Variable `dataset` should be used for the dataframe.

In [3]:

dataset = pd.read_csv(DATASET, header=0).sample(frac=1)

# verify dataset shape
print("Dataset shape: ", dataset.shape)

Dataset shape:  (5404, 6)


In [4]:
# show first few lines of the dataset
dataset.head()

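| | Aa | Ao | Dcl | Iy | Sh | Class |
|---|---|---|---|---|---|---|
| 3467 | 0.435 | 2.454 | 1.028 | 0.507 | -0.295 | 0 |
| 968 | 2.954 | 0.951 | 0.729 | 0.326 | -0.126 | 0 |
| 3627 | 0.612 | 3.149 | -0.747 | 0.85 | -0.389 | 1 |
| 866 | 0.252 | 1.166 | 3.149 | -0.726 | -0.404 | 0 |
| 5220 | 1.024 | 2.403 | -0.505 | 0.258 | 0.434 | 0 |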


## Splitting data into training and test sets

Split the dataset into training (80%) and testing (20%) sets.

The snippet below is only necessary if you want to visualize
the data or provide neatly labeled output within the program.

```python
# extract labels from column headers
phonemes = dataset.columns[0:5].tolist()  # Feature labels
labels = {0: 'Nasal', 1: 'Oral'}  # Class labels
```

In [5]:
# extract features and class data from primary data frame


X = dataset.iloc[:,:-1]
y = np.array(dataset.Class)  

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("Training shapes (X, y): ", X_train.shape, y_train.shape)
print("Testing shapes (X, y): ", X_test.shape, y_test.shape)

Training shapes (X, y):  (4323, 5) (4323,)
Testing shapes (X, y):  (1081, 5) (1081,)
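
Because the classes are imbalanced (3,818 vs. 1,586 samples), a plain random split can shift the class ratio slightly between the two sets. The following is a minimal sketch, assuming synthetic stand-in data rather than the phoneme file, of how the `stratify` and `random_state` parameters of `train_test_split` (neither is used in the cell above) keep the ratio stable and make the split reproducible. The names `X_demo`, `y_demo`, `X_tr`, and so on are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phoneme data (assumption): 5 features and the
# class counts quoted in the dataset description (3,818 vs. 1,586).
rng = np.random.default_rng(0)
y_demo = np.array([0] * 3818 + [1] * 1586)
X_demo = rng.normal(size=(len(y_demo), 5))

# stratify=y_demo keeps the class ratio the same in both splits;
# random_state fixes the shuffle so the split can be reproduced.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("Train shape:", X_tr.shape, "class-1 share:", round(y_tr.mean(), 3))
print("Test shape: ", X_te.shape, "class-1 share:", round(y_te.mean(), 3))
```

With a fixed seed, repeated runs give the same split, which makes the final test score easier to compare across experiments.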


## Constructing the classifier and running automated cross-validation

* Run a 10-fold cross validation with `GaussianNB` classifier
* Print the accuracy scores for these 10 folds

In [6]:
# Your code below this line (Question #E101)
# --------------------------

model = GaussianNB()

cv_scores = sklearn.model_selection.cross_val_score(model, X, y, scoring="accuracy", cv=10)
cv_scores


array([0.75970425, 0.77634011, 0.74676525, 0.76340111, 0.76851852,
       0.71481481, 0.7962963 , 0.77222222, 0.75925926, 0.74814815])

## Training the classifier and pickling to disk
* Fit the model on all the training instances and store it to disk

In [7]:
# Your code below this line (Question #E102)
# --------------------------

import joblib

# fit on the training split only
model.fit(X_train, y_train)

# persist the fitted model to disk (the file name is arbitrary)
joblib.dump(model, 'GaussianIris.pkl')

['GaussianIris.pkl']

## Unpickling the model and making predictions

* Load the saved model 
* Make predictions for the testing set


In [8]:
# Your code below this line (Question #E103)
# --------------------------

# load pickled model
loaded_model = joblib.load('GaussianIris.pkl')

# make predictions with freshly loaded model
y_pred = loaded_model.predict(X_test)

# verify input and output shape are appropriate
print("Input vs. output shape:")
print(X_test.shape, y_pred.shape)




Input vs. output shape:
(1081, 5) (1081,)
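
Matching shapes show that the reloaded model runs, but not that it is the same model. One way to check the round trip, sketched here on synthetic data and with a hypothetical file name in a temporary directory, is to compare predictions before and after `joblib.dump` / `joblib.load`:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption), not the phoneme dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
fitted = GaussianNB().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(fitted, path)      # serialize the fitted estimator
    restored = joblib.load(path)   # read it back into a new object

# The restored estimator should give exactly the same predictions.
same_predictions = np.array_equal(fitted.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same_predictions)
```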


## Performing final performance comparison

In [10]:
# tally up right + wrong 'guesses' by model
true, false = 0, 0
for i, j in zip(y_test, y_pred):
    # print(i, j)
    if i == j:
        true += 1
    else:
        false += 1

# report results numerically and by percentage
true_percent = true / (true + false) * 100
print("Correct guesses: " + str(true) + "\nIncorrect guesses: " + str(false))
print("Percent correct: " + str(true_percent))

# compare to average of cross-validation scores
avg_cv = np.sum(cv_scores) / len(cv_scores) * 100
print("Percent cross-validation score (10 folds, average): " + str(avg_cv))

Correct guesses: 843
Incorrect guesses: 238
Percent correct: 77.98334875115633
Percent cross-validation score (10 folds, average): 76.05469980146505
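
A single mean hides how much the fold scores vary. The sketch below reuses the fold accuracies and the test-set tally printed above (copied in as literals, so it runs on its own) to check whether the test accuracy falls inside the range of the ten folds. The variable names are illustrative.

```python
from statistics import mean, stdev

# Fold accuracies and test-set tally copied from the outputs above.
cv_scores_demo = [0.75970425, 0.77634011, 0.74676525, 0.76340111, 0.76851852,
                  0.71481481, 0.7962963, 0.77222222, 0.75925926, 0.74814815]
test_accuracy = 843 / (843 + 238)

cv_mean = mean(cv_scores_demo)
cv_spread = stdev(cv_scores_demo)  # sample standard deviation across folds

print(f"CV mean accuracy: {cv_mean:.4f} (std {cv_spread:.4f})")
print(f"CV range:         {min(cv_scores_demo):.4f} to {max(cv_scores_demo):.4f}")
print(f"Test accuracy:    {test_accuracy:.4f}")
print("Test accuracy within CV range:",
      min(cv_scores_demo) <= test_accuracy <= max(cv_scores_demo))
```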


## Measure performance using Scikit Learn modules 

Compute and display the following:
 1. Confusion matrix
 1. Accuracy
 1. Precision
 1. Recall
 1. $F_1$-Score
 
Add additional cells if required. 

In [11]:
# Your code below this line  (Question #E104)
# --------------------------
from sklearn.metrics import confusion_matrix, classification_report


confusion_matrix(y_test, model.predict(X_test))


array([[599, 158],
       [ 80, 244]])

In [14]:
print(classification_report(y_test, model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.88      0.79      0.83       757
           1       0.61      0.75      0.67       324

    accuracy                           0.78      1081
   macro avg       0.74      0.77      0.75      1081
weighted avg       0.80      0.78      0.79      1081
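
To see where the report's numbers come from, the class 1 (oral) row can be recomputed by hand from the confusion matrix. This sketch copies the four counts from the output above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix counts copied from the output above
# (rows: true class, columns: predicted class).
tn, fp = 599, 158   # true class 0: predicted 0, predicted 1
fn, tp = 80, 244    # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # of everything predicted as class 1, how much was right
recall_1 = tp / (tp + fn)      # of all true class 1 samples, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"Accuracy:            {accuracy:.2f}")
print(f"Precision (class 1): {precision_1:.2f}")
print(f"Recall (class 1):    {recall_1:.2f}")
print(f"F1 (class 1):        {f1_1:.2f}")
```

The rounded values agree with the class 1 row and the accuracy line of the classification report.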



## Conclusions

How did your trained model perform relative to your expectations based on the cross-validation?
Provide your answer in the cell below.

In [None]:
# Add your answer below this comment  (Question #E105)
# -----------------------------------

# The trained model performed about as the cross-validation suggested.
# Its test accuracy (77.98%) is slightly above the mean cross-validation accuracy (76.05%)
# and falls within the range of the ten fold scores (71.48% to 79.63%).





# Save your notebook!  Then `File > Close and Halt`