In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
%matplotlib inline

# Module 7.2: Classifiers

Now that you've gotten your feet wet with classifiers, we'll formalize the key terminology and concepts of classification. By the end
of this lecture guide, you'll be able to create your own classifiers, and apply them to a wide range of problems.

Six videos make up this notebook, for a total runtime of 89:56.

1. [Terminology](#section1) *1 video, total runtime 7:48*
2. [Breast Cancer Dataset](#section2) *1 video, total runtime 11:59*
3. [Nearest Neighbors](#section3) *2 videos, total runtime 36:34*
4. [Evaluation](#section4) *1 video, total runtime 13:34*
5. [Decision Boundaries](#section5) *1 video, total runtime 20:01*
6. [Check for Understanding](#section6)

Textbook readings:
- [Chapter 17.2: Training and Testing](https://www.inferentialthinking.com/chapters/17/2/Training_and_Testing.html)
- [Chapter 17.3: Rows of a Table](https://www.inferentialthinking.com/chapters/17/3/Rows_of_Tables.html)
- [Chapter 17.4: Implementing the Classifier](https://www.inferentialthinking.com/chapters/17/4/Implementing_the_Classifier.html)
- [Chapter 17.5: The Accuracy of the Classifier](https://www.inferentialthinking.com/chapters/17/5/Accuracy_of_the_Classifier.html)

<a id='section1'></a>

## 1. Terminology

Before we go any further, Professor Wagner will introduce the terminology commonly used in classification problems.
He'll also sketch out how classifiers learn the patterns and associations contained in the data.

In [None]:
YouTubeVideo('-7haErHby0U')

<a id='section2'></a>

## 2. Breast Cancer Dataset

The next lecture introduces a dataset which we will use throughout the rest of the notebook. It contains various
attributes describing benign and malignant tumor samples collected from breast tissue.

The data is loaded in the code cell below the video. Take a few minutes to explore it before moving on
to Section 3. 

In [None]:
YouTubeVideo('fpBy5ckZFCA')

In [None]:
# load the data
patients = Table.read_table('https://www.inferentialthinking.com/data/breast-cancer.csv').drop('ID')
patients.show(5)

In [None]:
# explore here
...

<a id='section3'></a>

## 3. Nearest Neighbors

You should have a strong qualitative understanding of the nearest neighbors classifier by now. The next two videos
will provide you with the mathematical concepts and programming tools needed to actually build one on your own.

In [None]:
YouTubeVideo('R-08BgOxtsI')

Compute the distance between the first and last patient in the `patients` table. Don't forget to remove the `Class` label!

In [None]:
def distance(x, y):
    ...

def row_distance(row1, row2):
    ...

...

<details>
    <summary>Solution</summary>
    
    def distance(x, y):
        return np.sqrt(np.sum((x - y)**2))

    def row_distance(row1, row2):
        return distance(np.array(list(row1)), np.array(list(row2)))

    classless_patients = patients.drop('Class')

    row_distance(classless_patients.row(0), classless_patients.row(classless_patients.num_rows - 1))
</details>
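
To see what the `distance` function computes, here is the same calculation worked by hand on two short, made-up attribute lists (the values are illustrative and are not taken from `patients`). This is a minimal sketch that uses only the Python standard library; `euclidean` is a hypothetical helper name, not part of the course code.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length sequences of numbers."""
    squared_diffs = [(a - b) ** 2 for a, b in zip(x, y)]
    return math.sqrt(sum(squared_diffs))

# Made-up attribute values for two hypothetical patients.
first = [1, 2, 3]
second = [4, 6, 3]

# Differences are -3, -4, 0; squares are 9, 16, 0; their sum is 25.
print(euclidean(first, second))  # 5.0
```

The distance from any row to itself is 0, and swapping the two arguments gives the same result.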

In [None]:
YouTubeVideo('l7m4stmyncA')

Practice makes perfect! Try to code the classifier from the above recording on your own.

*The code used in this video is a little outdated. Make sure to use `make_array` instead of `[...]`, and `np.append` instead of `append`.*

In [None]:
def distances(training, example):
    ...
    
def closest(training, example, k):
    ...
    
def majority_class(neighbors):
    ...

def classify(training, example, k):
    ...

In [None]:
# test your classifier here
...

<details>
    <summary>Solution</summary>
    
    def distances(training, example):
        dists = make_array()
        attributes = training.drop('Class')
        for row in attributes.rows:
            dist = row_distance(row, example)
            dists = np.append(dists, dist)
        return training.with_columns('Distance', dists)

    def closest(training, example, k):
        return distances(training, example).sort('Distance').take(np.arange(k))

    def majority_class(neighbors):
        return neighbors.group('Class').sort('count', descending = True).column('Class').item(0)

    def classify(training, example, k):
        nearest_neighbors = closest(training, example, k)
        return majority_class(nearest_neighbors)
</details>
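
If you want to check your intuition on data small enough to verify by eye, the sketch below applies the same idea (sort by distance, keep the `k` closest, take the majority label) to six made-up points. It is written in plain Python with the standard library rather than with `Table` methods; `knn_classify` and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_classify(points, labels, example, k):
    """Majority label among the k points closest to example."""
    # Sort (point, label) pairs by their distance to the example.
    by_distance = sorted(
        zip(points, labels),
        key=lambda pair: math.dist(pair[0], example),
    )
    k_labels = [label for _, label in by_distance[:k]]
    # most_common(1) returns [(label, count)] for the most frequent label.
    return Counter(k_labels).most_common(1)[0][0]

# Made-up training data: class 0 clusters near (1, 1), class 1 near (8, 8).
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_classify(points, labels, (2, 2), 3))  # 0
print(knn_classify(points, labels, (7, 7), 3))  # 1
```

With two classes, an odd `k` avoids tied votes. With an even `k`, a tie is possible and the result then depends on how the tie happens to be broken.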

<a id='section4'></a>

## 4. Evaluation

Now that you know how to code a nearest neighbors classifier, you'll learn how to evaluate how accurate it is.
Professor Wagner also explains the importance of splitting your data into a training and test set.

In [None]:
YouTubeVideo('_Y3-ZItTs3M')

Fit 10 nearest neighbors classifiers to the `patients` data for $k = 1, \ldots, 10$.
Randomly assign 80% of the data to the training set, and the rest to the test set.
Which model would you prefer to use if you had to classify a new patient's tumor?

In [None]:
# split the data into training and test sets
shuffled = ...
trainset = ...
testset = ...

# define a function to evaluate classifier accuracy
def evaluate_accuracy(training, test, k):
    ...


# evaluate the classifiers' accuracies
classifier_accuracy = make_array()
for k in np.arange(1, 11):
    classifier_accuracy = ...

classifier_accuracy

<details>
    <summary>Solution</summary>
    
    Code:
    shuffled = patients.sample(with_replacement = False)
    trainset = shuffled.take(range(0, round(683*0.8)))
    testset = shuffled.take(range(round(683*0.8), 683))

    def evaluate_accuracy(training, test, k):
        test_attributes = test.drop('Class')
        num_correct = 0
        for i in np.arange(0, test.num_rows):
            test_patient = test_attributes.row(i)
            c = classify(training, test_patient, k)
            if c == test.column('Class').item(i):
                num_correct = num_correct + 1

        return num_correct / test.num_rows


    classifier_accuracy = make_array()

    for k in np.arange(1, 11):
        classifier_accuracy = np.append(classifier_accuracy, evaluate_accuracy(trainset, testset, k))

    classifier_accuracy
    
Interpretation: How do we select the appropriate number of neighbors to use when fitting a nearest neighbors classifier? By selecting the model which maximizes the accuracy over the test set.
</details>
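
The arithmetic behind an 80/20 split and an accuracy score can be sketched without any table at all. The example below assumes 683 rows (the row count used in the solution code) and works on row indices only. `split_indices`, `accuracy`, and the fixed `seed` are hypothetical choices made so the sketch is reproducible.

```python
import random

def split_indices(num_rows, train_fraction, seed=0):
    """Shuffle row indices and cut them into disjoint train and test lists."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    cutoff = round(num_rows * train_fraction)
    return indices[:cutoff], indices[cutoff:]

def accuracy(predicted, actual):
    """Proportion of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

train_idx, test_idx = split_indices(683, 0.8)
print(len(train_idx), len(test_idx))  # 546 137

# Three of these four made-up predictions match the made-up labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

Because the indices are shuffled once and then cut at a single point, every row lands in exactly one of the two sets.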

<a id='section5'></a>

## 5. Decision Boundaries

In this last lecture recording, Professor Wagner will provide some more insight into how nearest
neighbor classifiers work. Don't worry about re-creating the fancy plots; they serve only to build
your intuition.

In [None]:
YouTubeVideo('ctSif908tFI')

<a id='section6'></a>

## 6. Check for Understanding

**A. True or false? To evaluate how accurate a classifier is, it suffices to train it on a dataset, and then evaluate the
accuracy of its predictions using the same dataset.**

<details>
    <summary>Solution</summary>
    False! If we were to evaluate the performance of a classifier using the data on which it was trained, we would end up with
    overly optimistic results. Instead, we should randomly split our data into a training and a testing set, train the algorithm
    on the training set, and then evaluate it on the testing set. This will provide us with a more reasonable measure of the
    classifier's performance.
</details>
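
A quick way to see the optimism described in Question A: with $k = 1$, every training point is its own nearest neighbor (distance 0), so scoring on the training data gives perfect accuracy no matter how noisy the labels are. The sketch below uses made-up points and a hypothetical `nearest_label` helper. It assumes no two training points share identical attributes.

```python
import math

def nearest_label(points, labels, example):
    """Label of the single training point closest to example (k = 1)."""
    best = min(range(len(points)), key=lambda i: math.dist(points[i], example))
    return labels[best]

# Made-up training data: two attributes per point, labels 0 and 1.
train_points = [(1, 1), (2, 3), (3, 1), (6, 5), (7, 7), (4, 4)]
train_labels = [0, 0, 1, 1, 1, 0]

# Score the classifier on the very points it was trained on.
hits = sum(
    nearest_label(train_points, train_labels, p) == label
    for p, label in zip(train_points, train_labels)
)
print(hits / len(train_points))  # 1.0
```

The perfect score says nothing about how the classifier would handle a point it has not seen, which is why a held-out test set is needed.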

**B. True or false? If the attributes used in a nearest neighbors classifier have different variances or units,
they should be standardized before training the algorithm.**

<details>
    <summary>Solution</summary>
    True! The nearest neighbor algorithm measures the similarity between observations based on distance. If the values
    of one attribute are more spread out or larger than those of the other attributes, then, even if this attribute isn't
    that informative, it will be the driving force of the similarity measure. Instead, we should compare like with like
    to get an accurate measure of similarity.
</details>
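
The effect described in Question B can be seen on a tiny example. The sketch below converts two made-up attributes to standard units (subtract the mean, divide by the standard deviation) and compares the distance between the first two rows before and after. The attribute values and the `standard_units` helper as written here are assumptions for illustration, using only the standard library.

```python
import math
import statistics

def standard_units(values):
    """Convert values to standard units: subtract the mean, divide by the SD."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD, matching np.std's default
    return [(v - mean) / sd for v in values]

# Hypothetical attributes on very different scales.
attr_small = [30, 40, 50, 60]
attr_large = [20000, 90000, 40000, 70000]

raw = list(zip(attr_small, attr_large))
scaled = list(zip(standard_units(attr_small), standard_units(attr_large)))

# Raw distance between the first two rows is almost entirely the large-scale gap.
print(math.dist(raw[0], raw[1]))
# After standardizing, both attributes contribute on a comparable scale.
print(math.dist(scaled[0], scaled[1]))
```

In the raw data the gap of 10 in `attr_small` is invisible next to the gap of 70000 in `attr_large`. After standardizing, each attribute has mean 0 and SD 1, so neither dominates simply because of its units.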

## You did it!

Congratulations, you just completed the last module of the semester! 