# Train, Validate --> Train, Test

In this exercise, you will empirically compare the results of a ten-fold cross-validated model with those of a fully trained model evaluated on a held-out test set.

## Notes and Guidelines
* Read a dataset from disk and use it for a classification task.
* Construct a Gaussian Naive Bayes classifier and fit it to the phoneme dataset provided.
* Save and re-load a trained classifier.
* Compare K-fold cross-validation scores with the success rate of a fully-trained model.


### Dataset
* The dataset comes from [KEEL](http://sci2s.ugr.es/keel/dataset.php?cod=105), an excellent resource for finding 'toy' datasets (and a few more serious ones).
    * A description of the dataset is provided at the link above; **read it.**
    * Excerpt: 
    *The aim of this dataset is to distinguish between nasal (class 0) and oral sounds (class 1).
    The class distribution is 3,818 samples in class 0 and 1,586 samples in class 1.
    The phonemes are transcribed as follows: sh as in she, dcl as in dark, iy as the vowel in she, aa as the vowel in dark, and ao as the first vowel in water.*
    
* You do not need to fully understand the nature or context of the values in the dataset. You only need to know that there are five columns of input (feature) data and one column of output (class) data.
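
The class distribution quoted in the excerpt implies a useful reference point: a model that always predicts the majority class (class 0) would already be right about 70% of the time. The following is a minimal sketch of that arithmetic, using only the counts from the excerpt; the variable names are illustrative and not part of the lab.

```python
# Class counts taken from the KEEL excerpt
n_class0 = 3818  # nasal
n_class1 = 1586  # oral

n_total = n_class0 + n_class1

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(n_class0, n_class1) / n_total

print(f"Total samples: {n_total}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any accuracy reported later in this notebook is easier to interpret when compared against this baseline.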

## Handling imports and dataset inclusion

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
import joblib as jb  # sklearn.externals.joblib was removed in scikit-learn 0.23

# locate dataset
DATASET = '/dsa/data/all_datasets/phoneme.csv'  # phoneme classification dataset
assert os.path.exists(DATASET)  # check if the file actually exists

## Constructing DataFrame from raw dataset

<span style="background:yellow">**Note**</span>: Use the variable name `dataset` for the DataFrame.

In [None]:
# read_table() is like read_csv(), but generalized for delimited file formats
# and calling .sample() with a fraction of 1 shuffles the data
dataset = pd.read_table(DATASET, sep=',', engine='c', header=0).sample(frac=1)

# verify dataset shape
print("Dataset shape: ", dataset.shape)

## Splitting data into training and test sets

In [None]:
"""
The below is only necessary if you are interested in visualizing
the data or providing neatly-labeled output within the program.

# extract labels from column headers
phonemes = dataset.columns[0:5].tolist()  # Feature labels
labels = {0: 'Nasal', 1: 'Oral'}  # Class labels
"""

# extract target data from primary data frame
X = dataset.iloc[:, 0:5]  # iloc is used here for numeric indexing
y = dataset.loc[:, "Class"]  # loc can be used for label-based indexing

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("Training shapes (X, y): ", X_train.shape, y_train.shape)
print("Testing shapes (X, y): ", X_test.shape, y_test.shape)
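
Because the classes are imbalanced, a purely random split can give the training and test sets slightly different class proportions, and the result changes on every run. The sketch below shows one way to make a split reproducible and stratified. It is an optional variation, not the lab's required approach: the `demo_` arrays are synthetic, and the `stratify` and `random_state` arguments are additions assumed here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 5 features, 70/30 class imbalance
demo_X = np.arange(500).reshape(100, 5)
demo_y = np.array([0] * 70 + [1] * 30)

# stratify=demo_y keeps the 70/30 ratio in both subsets;
# random_state fixes the shuffle so the split is repeatable
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=0
)

print("Train class counts:", np.bincount(demo_y_train))
print("Test class counts: ", np.bincount(demo_y_test))
```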

## Constructing the classifier and running automated cross-validation
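
As a point of reference, the sketch below shows the general shape of ten-fold cross-validation with `cross_val_score` on synthetic data. It is an illustration, not the lab solution: the `demo_` names and the `make_classification` data are assumptions made so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a five-feature, two-class dataset (not the phoneme data)
demo_X, demo_y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# cross_val_score clones the estimator, fits it on 9 folds, and scores it
# on the held-out fold, repeating once per fold
demo_scores = cross_val_score(GaussianNB(), demo_X, demo_y, cv=10)

print("Per-fold accuracy:", demo_scores.round(3))
print("Mean accuracy:", demo_scores.mean().round(3))
```

Note that `cross_val_score` works on clones, so the estimator passed in is left unfitted; a separate call to `fit` is still needed before the model can make predictions.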

In [None]:
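# Your code below this line
# (store the 10 cross-validation scores in a variable named `scores`;
#  the final performance comparison cell relies on that name)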
# --------------------------








## Training the classifier and pickling to disk
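
The sketch below illustrates the save-and-reload round trip on a tiny synthetic dataset, using the standalone `joblib` package that scikit-learn itself depends on. It is not the lab solution: the data, the `demo_` names, and the file name `demo_model.pkl` are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny synthetic dataset: two well-separated clusters (not the phoneme data)
rng = np.random.default_rng(0)
demo_X = np.vstack([rng.normal(0.0, 1.0, (50, 5)), rng.normal(3.0, 1.0, (50, 5))])
demo_y = np.array([0] * 50 + [1] * 50)

demo_model = GaussianNB().fit(demo_X, demo_y)

# Write the fitted model to a temporary file, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(demo_model, demo_path)
    demo_restored = joblib.load(demo_path)

# The restored model should reproduce the original model's predictions
same = np.array_equal(demo_model.predict(demo_X), demo_restored.predict(demo_X))
print("Predictions identical after reload:", same)
```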

In [None]:
# Your code below this line
# --------------------------





## Unpickling the model and making predictions

#### Replace each `# ???` with the appropriate code

In [None]:
# Your code below this line
# --------------------------

# load pickled model
loaded_model = # ???

# make predictions with freshly loaded model
predictions = # ???

# verify input and output shape are appropriate
print("Input vs. output shape:")
print(X_test.shape, predictions.shape)




## Performing final performance comparison

In [None]:
# tally up right + wrong 'guesses' by model
correct, incorrect = 0, 0
for actual, predicted in zip(y_test, predictions):
    if actual == predicted:
        correct += 1
    else:
        incorrect += 1

# report results numerically and by percentage
percent_correct = correct / (correct + incorrect) * 100
print(f"Correct guesses: {correct}\nIncorrect guesses: {incorrect}")
print(f"Percent correct: {percent_correct}")

# compare to average of cross-validation scores
# (`scores` must hold the 10 fold scores from the cross-validation cell above)
avg_cv = np.mean(scores) * 100
print("Percent cross-validation score (10 folds, average): " + str(avg_cv))
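
The tally loop above computes plain accuracy. The same figure can be obtained without a loop by comparing the two label arrays element-wise and taking the mean. This is a minimal sketch on hypothetical labels; the `demo_` arrays are made up for illustration.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only
demo_y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
demo_y_pred = np.array([0, 1, 1, 1, 0, 0, 0, 0])

# Element-wise comparison yields booleans; their mean is the fraction correct
demo_matches = demo_y_true == demo_y_pred
demo_correct = int(demo_matches.sum())
demo_accuracy = demo_matches.mean() * 100

print(f"Correct: {demo_correct} of {demo_matches.size}")
print(f"Percent correct: {demo_accuracy:.1f}")
```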

## Measuring performance using scikit-learn modules
#### (see Module 1 labs)
Compute and display the following:
 1. Confusion matrix
 1. Accuracy
 1. Precision
 1. $F_1$-score
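
The sketch below shows how these four quantities can be obtained from `sklearn.metrics` for a small set of hypothetical labels. The `demo_` lists are invented for illustration; in the lab, the inputs would be the test labels and the model's predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
)

# Hypothetical true labels and predictions, for illustration only
demo_y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
demo_y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
demo_cm = confusion_matrix(demo_y_true, demo_y_pred)

# For binary labels, precision and F1 treat class 1 as the positive class by default
demo_acc = accuracy_score(demo_y_true, demo_y_pred)
demo_prec = precision_score(demo_y_true, demo_y_pred)
demo_f1 = f1_score(demo_y_true, demo_y_pred)

print("Confusion matrix:\n", demo_cm)
print(f"Accuracy:  {demo_acc:.3f}")
print(f"Precision: {demo_prec:.3f}")
print(f"F1-score:  {demo_f1:.3f}")
```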

In [None]:
# Your code below this line
# --------------------------







## Conclusions

How did your trained model perform relative to your expectations based on the cross-validation?
Provide your answer in the cell below.

# Save your notebook!