# 15.2 Case Study: Classification with k-Nearest Neighbors and the Digits Dataset, Part 1

**This file contains Sections 15.2 and 15.3 and all of their subsections and Self Check exercises**

### Classification Problems
### Our Approach

![Self Check Exercises check mark image](files/art/check.png)
## 15.2 Self Check

**1. _(Fill-In)_** `________` classification divides samples into two distinct classes, and `________`-classification divides samples into many distinct classes.

**Answer:** binary, multi. 

## 15.2.1 k-Nearest Neighbors Algorithm
### Hyperparameters and Hyperparameter Tuning

![Self Check Exercises check mark image](files/art/check.png)
## 15.2.1 Self Check
**1. _(True/False)_** In machine learning, a model implements a machine-learning algorithm. In scikit-learn, models are called estimators.

**Answer:** True.

**2. _(Fill-In)_** The process of choosing the best value of *k* for the k-nearest neighbors algorithm is called `________`.

**Answer:** hyperparameter tuning.
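
To make the voting idea behind k-nearest neighbors concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation. The function name `knn_predict` and the tiny two-feature dataset are hypothetical, and distances are assumed to be Euclidean.

```python
from collections import Counter
from math import dist

def knn_predict(samples, labels, query, k=3):
    """Predict query's class by majority vote among its k nearest samples."""
    # Sort sample indices by Euclidean distance to the query point
    nearest = sorted(range(len(samples)),
                     key=lambda i: dist(samples[i], query))[:k]
    # Count the labels of the k nearest samples and return the most common
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters labeled 'a' and 'b'
samples = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ['a', 'a', 'a', 'b', 'b']

print(knn_predict(samples, labels, query=(1.1, 1.0), k=3))
print(knn_predict(samples, labels, query=(4.8, 5.1), k=3))
```

With `k=5`, every sample votes, so the second query would be classified as `'a'` instead of `'b'`. That sensitivity to *k* is why its value is tuned.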

## 15.2.2 Loading the Dataset

**We added `%matplotlib inline` to enable Matplotlib in this notebook.**

In [None]:
%matplotlib inline
from sklearn.datasets import load_digits

In [None]:
digits = load_digits()

### Displaying the Description

In [None]:
print(digits.DESCR)

### Checking the Sample and Target Sizes

In [None]:
digits.target[::100]

In [None]:
digits.data.shape

In [None]:
digits.target.shape

### A Sample Digit Image

In [None]:
digits.images[13]

### Preparing the Data for Use with Scikit-Learn

In [None]:
digits.data[13]
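
The following sketch checks how the two representations relate, assuming (as in scikit-learn's bundled Digits dataset) that each row of `digits.data` is the corresponding 8-by-8 image flattened row by row. The names `flattened` and `all_flattened` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each 8-by-8 image flattens to one 64-element row of digits.data
flattened = digits.images[13].ravel()
print(flattened.shape)
print(np.array_equal(flattened, digits.data[13]))

# The same flattening applied to every image at once
all_flattened = digits.images.reshape(len(digits.images), -1)
print(all_flattened.shape)
```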

![Self Check Exercises check mark image](files/art/check.png)
## 15.2.2 Self Check

**1. _(Fill-In)_** A `Bunch` object’s `________` and `________` attributes are NumPy arrays containing the dataset’s samples and labels, respectively.

**Answer:**  `data`, `target`.

**2. _(True/False)_** A scikit-learn `Bunch` object contains only a dataset’s data.

**Answer:** False. A scikit-learn `Bunch` object contains a dataset’s data and information about the dataset (called metadata), available through the `DESCR` attribute.

**3. _(IPython Session)_** For sample number `22` in the Digits dataset, display the 8-by-8 image data and numeric value of the digit the image represents.

**Answer:** 

In [None]:
digits.images[22]

In [None]:
digits.target[22]

## 15.2.3 Visualizing the Data
### Creating the Diagram 

In [None]:
import matplotlib.pyplot as plt

In [None]:
figure, axes = plt.subplots(nrows=4, ncols=6, figsize=(6, 4))

### Displaying Each Image and Removing the Axes Labels 

for ax, image, target in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap=plt.cm.gray_r)
    ax.set_xticks([])  # remove x-axis tick marks
    ax.set_yticks([])  # remove y-axis tick marks
    ax.set_title(target)
plt.tight_layout()

In [None]:
# This placeholder cell was added because we had to combine
# this section's snippets 12-13 for the visualization to work in Jupyter
# and we want the subsequent snippet numbers to match the book

![Self Check Exercises check mark image](files/art/check.png)
## 15.2.3 Self Check
**1. _(Fill-In)_** The process of familiarizing yourself with your data is called `________`.

**Answer:**  data exploration.

**2. _(IPython Session)_** Display the image for sample number `22` of the Digits dataset. 

**Answer:** 

In [None]:
axes = plt.subplot()

image = plt.imshow(digits.images[22], cmap=plt.cm.gray_r)

xticks = axes.set_xticks([])

yticks = axes.set_yticks([])

In [None]:
# placeholder due to merge of prior cells

In [None]:
# placeholder due to merge of prior cells

In [None]:
# placeholder due to merge of prior cells

## 15.2.4 Splitting the Data for Training and Testing 

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
     digits.data, digits.target, random_state=11)

### Training and Testing Set Sizes

In [None]:
X_train.shape

In [None]:
X_test.shape

![Self Check Exercises check mark image](files/art/check.png)
## 15.2.4 Self Check
**1. _(True/False)_** You should typically use all of a dataset’s data to train a model.

**Answer:** False. It’s important to set aside a portion of your data for testing, so you can evaluate a model’s performance using data that the model has not yet seen. 

**2. _(Discussion)_** For the Digits dataset, what numbers of samples would the following statement reserve for training and testing purposes? 

```python
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.40)
```
**Answer:** 1078 and 719.
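
The arithmetic behind that answer can be sketched as follows, assuming the Digits dataset's 1797 samples and assuming `train_test_split` rounds the test-set size up and gives the remaining samples to the training set when only `test_size` is specified. The helper name `split_sizes` is hypothetical.

```python
from math import ceil

def split_sizes(n_samples, test_size):
    """Return (train, test) sample counts for a fractional test_size."""
    n_test = ceil(test_size * n_samples)  # test set rounds up
    n_train = n_samples - n_test          # training set gets the rest
    return n_train, n_test

print(split_sizes(1797, 0.40))  # (1078, 719)
print(split_sizes(1797, 0.25))  # (1347, 450), the default 75%/25% split
```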

## 15.2.5 Creating the Model 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier()

## 15.2.6 Training the Model 

In [None]:
knn.fit(X=X_train, y=y_train)

![Self Check Exercises check mark image](files/art/check.png)
## 15.2.6 Self Check
**1. _(Fill-In)_** The `KNeighborsClassifier` is said to be `________` because its work is performed only when you use it to make predictions.

**Answer:** lazy.

**2. _(True/False)_** Each scikit-learn estimator’s `fit` method simply loads a dataset.

**Answer:** False. For most scikit-learn estimators, the `fit` method loads the data into the estimator, then uses that data to perform complex calculations behind the scenes that learn from the data and train the model.

## 15.2.7 Predicting Digit Classes 

In [None]:
predicted = knn.predict(X=X_test)

In [None]:
expected = y_test

In [None]:
predicted[:20]

In [None]:
expected[:20]

In [None]:
wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]

In [None]:
wrong

![Self Check Exercises check mark image](files/art/check.png)
## 15.2.7 Self Check
**1. _(IPython Session)_** Using the `predicted` and `expected` arrays, calculate and display the prediction accuracy percentage.

**Answer:** 

In [None]:
print(f'{(len(expected) - len(wrong)) / len(expected):.2%}')
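
An equivalent calculation compares the arrays element by element. The sketch below uses small hypothetical arrays, `sample_predicted` and `sample_expected`, rather than the model's actual results.

```python
import numpy as np

# Hypothetical predictions and true labels, for illustration only
sample_predicted = np.array([0, 4, 9, 9, 3, 1, 4, 1, 5, 0])
sample_expected = np.array([0, 4, 9, 9, 3, 1, 4, 1, 5, 6])

# The comparison yields a Boolean array; its mean is the fraction
# of True values, that is, the fraction of correct predictions
accuracy = (sample_predicted == sample_expected).mean()
print(f'{accuracy:.2%}')  # 90.00%
```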

**2. _(IPython Session)_** Rewrite the list comprehension in snippet `[29]` using a for loop. Which coding style do you prefer?

**Answer:** 

In [None]:
wrong = []

In [None]:
for p, e in zip(predicted, expected):
    if p != e:
        wrong.append((p, e))

In [None]:
wrong

# 15.3 Case Study: Classification with k-Nearest Neighbors and the Digits Dataset, Part 2
## 15.3.1 Metrics for Model Accuracy 
### Estimator Method `score`

In [None]:
print(f'{knn.score(X_test, y_test):.2%}')

### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion = confusion_matrix(y_true=expected, y_pred=predicted)

In [None]:
confusion
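
To see what the matrix's entries mean, this pure-Python sketch tallies a confusion matrix by hand for a hypothetical three-class problem. Rows correspond to the expected class and columns to the predicted class, the same layout `confusion_matrix` produces. All names and labels here are illustrative.

```python
# Hypothetical labels for a three-class problem, for illustration only
sample_expected = [0, 0, 1, 1, 1, 2, 2, 2, 2]
sample_predicted = [0, 1, 1, 1, 2, 2, 2, 2, 0]

n_classes = 3
matrix = [[0] * n_classes for _ in range(n_classes)]

# Row = expected class, column = predicted class
for e, p in zip(sample_expected, sample_predicted):
    matrix[e][p] += 1

for row in matrix:
    print(row)
```

Correct predictions fall on the principal diagonal. Every nonzero entry off the diagonal is a misclassification.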

### Classification Report

In [None]:
from sklearn.metrics import classification_report

In [None]:
names = [str(digit) for digit in digits.target_names]

In [None]:
print(classification_report(expected, predicted, 
       target_names=names))

### Visualizing the Confusion Matrix

In [None]:
import pandas as pd

In [None]:
confusion_df = pd.DataFrame(confusion, index=range(10),
     columns=range(10))

In [None]:
import seaborn as sns

In [None]:
axes = sns.heatmap(confusion_df, annot=True, 
                    cmap='nipy_spectral_r')

![Self Check Exercises check mark image](files/art/check.png)
## 15.3.1 Self Check
**1. _(Fill-In)_** A Seaborn `________` displays values as colors, often with values of higher magnitude displayed as more intense colors.

**Answer:** heat map.

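**2. _(True/False)_** In a classification report, the precision specifies the total number of correct predictions for a class divided by the total number of samples for that class. 

**Answer:** False. That ratio is the recall. The precision is the total number of correct predictions for a class divided by the total number of predictions made for that class.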

**3. _(Discussion)_** Explain row 3 of the confusion matrix presented in this section:

```
[ 0,  0,  0, 42,  0,  1,  0,  1,  0,  0]
```
**Answer:** The number `42` in column index 3 indicates that 42 `3`s were correctly predicted as `3`s. The number `1` at column indices 5 and 7 indicates that one `3` was incorrectly classified as a `5` and one was incorrectly classified as a `7`. 
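
The row also yields the recall for the digit 3: the correct predictions divided by the row total. A brief sketch follows. Precision for 3 would require column 3 of the full matrix, which this exercise does not show.

```python
row_3 = [0, 0, 0, 42, 0, 1, 0, 1, 0, 0]

correct = row_3[3]     # 3s predicted as 3
total_3s = sum(row_3)  # all samples that are really 3s
recall = correct / total_3s
print(f'recall for 3: {recall:.2%}')  # recall for 3: 95.45%
```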

## 15.3.2 K-Fold Cross-Validation
### `KFold` Class

In [None]:
from sklearn.model_selection import KFold

In [None]:
kfold = KFold(n_splits=10, random_state=11, shuffle=True)

### Using the `KFold` Object with Function `cross_val_score` 

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score(estimator=knn, X=digits.data, 
     y=digits.target, cv=kfold)

In [None]:
scores

In [None]:
print(f'Mean accuracy: {scores.mean():.2%}')

In [None]:
print(f'Accuracy standard deviation: {scores.std():.2%}')

![Self Check Exercises check mark image](files/art/check.png)
## 15.3.2 Self Check
**1.  _(True/False)_** Randomizing the data by shuffling it before splitting it into folds is particularly important if the samples might be ordered or grouped. 

**Answer:** True.

**2. _(True/False)_** When you call `cross_val_score` to perform k-fold cross-validation, the function returns the best score produced while testing the model with each fold.

**Answer:** False. The function returns an array containing the scores for each fold. The mean of those scores is the estimator’s overall score. 
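
The fold mechanics can be sketched without scikit-learn. This minimal version, built around the hypothetical helper `kfold_indices`, shuffles the sample indices once and then lets each fold serve as the test set exactly once. `KFold` may differ in details, such as how it shuffles and how it sizes uneven folds.

```python
import random

def kfold_indices(n_samples, n_splits, seed=11):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into n_splits folds
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test in enumerate(folds):
        # Training set is every fold except the current test fold
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

for train, test in kfold_indices(n_samples=10, n_splits=5):
    print(f'train={sorted(train)} test={sorted(test)}')
```

Each sample index appears in exactly one test fold, so every sample is used for testing once and for training in the other folds.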

## 15.3.3 Running Multiple Models to Find the Best One 

In [None]:
from sklearn.svm import SVC

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
estimators = {
     'KNeighborsClassifier': knn, 
     'SVC': SVC(gamma='scale'),
     'GaussianNB': GaussianNB()}

In [None]:
for estimator_name, estimator_object in estimators.items():
     kfold = KFold(n_splits=10, random_state=11, shuffle=True)
     scores = cross_val_score(estimator=estimator_object, 
         X=digits.data, y=digits.target, cv=kfold)
     print(f'{estimator_name:>20}: ' + 
           f'mean accuracy={scores.mean():.2%}; ' +
           f'standard deviation={scores.std():.2%}')

### Scikit-Learn Estimator Diagram

![Self Check Exercises check mark image](files/art/check.png)
## 15.3.3 Self Check
**1. _(True/False)_** You should choose the best estimator before performing your machine learning study.

**Answer:** False. It’s difficult to know in advance which machine learning model(s) will perform best for a given dataset, especially when they hide the details of how they operate from their users. For this reason, you should run multiple models to determine which is the best for your study. 

**2. _(Discussion)_** How would you modify the code in this section so that it would also test a `LinearSVC` estimator?

**Answer:** You’d import the `LinearSVC` class, add a key–value pair to the `estimators` dictionary (`'LinearSVC': LinearSVC()`), then execute the `for` loop, which tests every estimator in the dictionary.
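
That modification might look like the sketch below, which is self-contained rather than a continuation of the session. `LinearSVC()` uses its default settings, as in the answer. On this unscaled data it may emit convergence warnings. The `results` dictionary is an illustrative addition for keeping each estimator's scores.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC

digits = load_digits()

estimators = {
    'KNeighborsClassifier': KNeighborsClassifier(),
    'SVC': SVC(gamma='scale'),
    'GaussianNB': GaussianNB(),
    'LinearSVC': LinearSVC()}  # the added key-value pair

results = {}  # illustrative: keeps each estimator's fold scores
for estimator_name, estimator_object in estimators.items():
    kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    scores = cross_val_score(estimator=estimator_object,
        X=digits.data, y=digits.target, cv=kfold)
    results[estimator_name] = scores
    print(f'{estimator_name:>20}: '
          f'mean accuracy={scores.mean():.2%}; '
          f'standard deviation={scores.std():.2%}')
```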

## 15.3.4 Hyperparameter Tuning 

In [None]:
for k in range(1, 20, 2):
     kfold = KFold(n_splits=10, random_state=11, shuffle=True)
     knn = KNeighborsClassifier(n_neighbors=k)
     scores = cross_val_score(estimator=knn, 
         X=digits.data, y=digits.target, cv=kfold)
     print(f'k={k:<2}; mean accuracy={scores.mean():.2%}; ' +
           f'standard deviation={scores.std():.2%}')

![Self Check Exercises check mark image](files/art/check.png)
## 15.3.4 Self Check
**1. _(True/False)_** When you create an estimator object, the default hyperparameter values that scikit-learn uses are generally the best ones for every machine learning study. 

**Answer:** False. The default hyperparameter values make it easy for you to test estimators quickly. In real-world machine learning studies, you’ll want to use hyperparameter tuning to choose hyperparameter values that produce the best possible predictions.
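
The loop in this section can also be expressed with scikit-learn's `GridSearchCV`, which the section itself does not use. The sketch below is one possible alternative, assuming the same odd values of *k* and the same 10-fold setup. It reports whichever *k* gave the highest mean cross-validation accuracy.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
k_values = list(range(1, 20, 2))  # same odd k values as the loop

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={'n_neighbors': k_values},
    cv=KFold(n_splits=10, random_state=11, shuffle=True))
search.fit(digits.data, digits.target)

print(f'best k: {search.best_params_["n_neighbors"]}')
print(f'mean accuracy: {search.best_score_:.2%}')
```

The highest mean accuracy alone may not settle the choice. As with the loop's output, the standard deviation and the extra computation for larger *k* are also worth weighing.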

In [None]:
##########################################################################
# (C) Copyright 2019 by Deitel & Associates, Inc. and                    #
# Pearson Education, Inc. All Rights Reserved.                           #
#                                                                        #
# DISCLAIMER: The authors and publisher of this book have used their     #
# best efforts in preparing the book. These efforts include the          #
# development, research, and testing of the theories and programs        #
# to determine their effectiveness. The authors and publisher make       #
# no warranty of any kind, expressed or implied, with regard to these    #
# programs or to the documentation contained in these books. The authors #
# and publisher shall not be liable in any event for incidental or       #
# consequential damages in connection with, or arising out of, the       #
# furnishing, performance, or use of these programs.                     #
##########################################################################
