# Practice: Train, Validate --> Train, Test
In this practice you will revisit the cross-validation process to evaluate, and then fully train, a classification model on the digits dataset (`load_digits`) using the `GaussianNB` classifier.

## Required resources and imports

In [1]:
# Add your code below this comment (Question #P2101)
# ----------------------------------
# note: you will need a few other things here - come back and import them later
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from collections import OrderedDict
from sklearn.metrics import classification_report

## Model construction

In [2]:
# Add your code below this comment (Question #P2102)
# ----------------------------------
# load dataset (we're working with the digits dataset)
raw = load_digits()

# TODO: split raw into X (features) and y (labels)
X = raw.data
y = raw.target

# TODO: perform a validation split, reserving 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# we'll use the Gaussian Naive Bayes classifier
classifier = GaussianNB()
help(load_digits)

Help on function load_digits in module sklearn.datasets._base:

load_digits(*, n_class=10, return_X_y=False, as_frame=False)
    Load and return the digits dataset (classification).
    
    Each datapoint is a 8x8 image of a digit.
    
    Classes                         10
    Samples per class             ~180
    Samples total                 1797
    Dimensionality                  64
    Features             integers 0-16
    
    Read more in the :ref:`User Guide <digits_dataset>`.
    
    Parameters
    ----------
    n_class : int, default=10
        The number of classes to return. Between 0 and 10.
    
    return_X_y : bool, default=False
        If True, returns ``(data, target)`` instead of a Bunch object.
        See below for more information about the `data` and `target` object.
    
        .. versionadded:: 0.18
    
    as_frame : bool, default=False
        If True, the data is a pandas DataFrame including columns with
        appropriate dtypes (numeric). The ta
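
The split in the cell above is shuffled differently on every run, so the scores printed later in this notebook will not reproduce exactly. A minimal sketch of a repeatable split follows, assuming you are free to pass `random_state` and `stratify` to `train_test_split`; the seed value `0` is an arbitrary choice and not part of the original exercise.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# return_X_y=True hands back (data, target) directly instead of a Bunch
X, y = load_digits(return_X_y=True)

# random_state fixes the shuffle so the split is repeatable (0 is an arbitrary seed);
# stratify=y keeps the class proportions similar in the train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```

Using a fixed seed changes which samples land in the test set, so the numbers will differ from the outputs shown in this notebook.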

## Cross-validation

In [3]:
# Add your code below this comment (Question #P2103)
# ----------------------------------
# perform 10-fold *automated* cross-validation on the data
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print(scores) # this should print out a 10-item array

[0.85714286 0.82539683 0.88888889 0.82539683 0.81746032 0.77777778
 0.80952381 0.848      0.752      0.84      ]
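
Ten separate fold scores are easier to compare as a single summary. The sketch below copies the scores printed above into an array and reports their mean and standard deviation; in the notebook itself you would call the same methods on `scores` directly.

```python
import numpy as np

# fold accuracies copied from the cross_val_score output above
scores = np.array([
    0.85714286, 0.82539683, 0.88888889, 0.82539683, 0.81746032,
    0.77777778, 0.80952381, 0.848, 0.752, 0.84,
])

# the mean estimates expected accuracy; the standard deviation shows fold-to-fold spread
print(f"mean accuracy: {scores.mean():.3f}")
print(f"std deviation: {scores.std():.3f}")
```

For this run the mean comes out at roughly 0.82, which is in line with the 0.84 test accuracy reported further down.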


## Fully train the model

In [4]:
# Add your code below this comment (Question #P2104)
# ----------------------------------
# TODO: re-fit the model on the full training set
classifier.fit(X_train, y_train)

GaussianNB()

## Test final model performance

In [5]:
# GaussianNB.predict() returns class labels (integers)
y_pred = classifier.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        54
           1       0.69      0.90      0.78        60
           2       0.93      0.80      0.86        51
           3       0.91      0.88      0.89        48
           4       0.87      0.73      0.80        56
           5       0.96      0.87      0.91        61
           6       0.98      0.96      0.97        53
           7       0.65      0.96      0.78        55
           8       0.67      0.65      0.66        48
           9       0.97      0.67      0.79        54

    accuracy                           0.84       540
   macro avg       0.87      0.84      0.85       540
weighted avg       0.87      0.84      0.85       540
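
The report shows that digits 1, 7, and 8 have low precision, but not which digits they are being confused with. A confusion matrix can show that. The sketch below is self-contained and uses its own seeded split (an assumption, seed `0`), so its counts will not match the report above exactly.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0  # arbitrary seed, not from the original run
)

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# rows are true digits, columns are predicted digits;
# passing labels explicitly guarantees a 10x10 matrix
cm = confusion_matrix(y_test, y_pred, labels=list(range(10)))
print(cm)
```

Off-diagonal entries in a column show which true digits are being mislabeled as that column's digit.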



## Save model to disk via pickling

In [7]:
# Add your code below this comment (Question #P2105)
# ----------------------------------
# TODO: pickle model to disk as "GaussianDigits.pkl"
# (this will require an import - use either of the modules discussed in lab)
import joblib
joblib.dump(classifier, 'GaussianDigits.pkl')

['GaussianDigits.pkl']
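
The cell above uses `joblib`. Assuming the other module discussed in lab is the standard-library `pickle`, the same step might look like the sketch below. The file name and temporary directory are hypothetical, chosen so the example does not overwrite `GaussianDigits.pkl`.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
clf = GaussianNB().fit(X, y)

# hypothetical path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "GaussianDigits_pickle.pkl")

# pickle works on open binary file handles rather than file names
with open(path, "wb") as f:
    pickle.dump(clf, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(type(restored).__name__)
```

With either module, only load pickle files from sources you trust, since unpickling can execute arbitrary code.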

## Load and re-test model

In [8]:
# Add your code below this comment (Question #P2106)
# ----------------------------------
# TODO: load/unpickle model
new_clf = joblib.load('GaussianDigits.pkl')
y_pred = new_clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        54
           1       0.69      0.90      0.78        60
           2       0.93      0.80      0.86        51
           3       0.91      0.88      0.89        48
           4       0.87      0.73      0.80        56
           5       0.96      0.87      0.91        61
           6       0.98      0.96      0.97        53
           7       0.65      0.96      0.78        55
           8       0.67      0.65      0.66        48
           9       0.97      0.67      0.79        54

    accuracy                           0.84       540
   macro avg       0.87      0.84      0.85       540
weighted avg       0.87      0.84      0.85       540
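
Comparing two printed reports by eye is error-prone. A more direct check is to confirm that the reloaded model predicts exactly what the in-memory model does. The sketch below is self-contained, with a hypothetical temporary file path and an arbitrary seed.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0  # arbitrary seed for repeatability
)
clf = GaussianNB().fit(X_train, y_train)

# round-trip the fitted model through disk (hypothetical temporary path)
path = os.path.join(tempfile.mkdtemp(), "roundtrip.pkl")
joblib.dump(clf, path)
reloaded = joblib.load(path)

# element-wise comparison of the predictions from both models
same = np.array_equal(clf.predict(X_test), reloaded.predict(X_test))
print(same)
```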

