<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 35: Classifiers

Associated Textbook Sections: [17.4 - 17.6](https://inferentialthinking.com/chapters/17/4/Implementing_the_Classifier.html)

## Outline

* [Classifiers](#Classifiers)
* [Rows](#Rows)
* [Distance](#Distance)
* [Nearest Neighbors](#Nearest-Neighbors)
* [Evaluation](#Evaluation)
* [Before Classifying](#Before-Classifying)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from mpl_toolkits.mplot3d import Axes3D

ckd = Table.read_table('data/ckd.csv')
ckd = ckd.relabeled('Blood Glucose Random', 'Glucose').select('Glucose', 'Hemoglobin', 'White Blood Cell Count', 'Class')

patients = Table.read_table('data/breast-cancer.csv').drop('ID')

def randomize_column(a):
    return a + np.random.normal(0.0, 0.09, size=len(a))

jittered = Table().with_columns([
        'Bland Chromatin (jittered)', 
        randomize_column(patients.column('Bland Chromatin')),
        'Single Epithelial Cell Size (jittered)', 
        randomize_column(patients.column('Single Epithelial Cell Size')),
        'Class',
        patients.column('Class')
    ])

## Classifiers

### Training a Classifier

<img src="img/lec35_training_classifier.png" width=80%>

### Nearest Neighbor Classifier

<img src="img/lec35_nn_classifier.png" width=80%>

### The Google Science Fair

<img src="img/lec35_google_fair.png" width=60%>

[Brittany Wenger](https://edu.google.com/case-studies/brittany-wenger/), then a 17-year-old high school student, won in 2012 by building a breast cancer classifier with 99% accuracy.


### Demo: Google Science Fair

Load the `breast-cancer.csv` data set, which contains various cell measurements and a `Class` label of benign (`0`) or malignant (`1`). The data was produced manually by medical professionals who assigned a value to each attribute after visually inspecting images.

In [None]:
patients = Table.read_table('data/breast-cancer.csv').drop('ID')
patients.show(5)

In [None]:
patients.group('Class')

Visualize the relationship between `Bland Chromatin` and `Single Epithelial Cell Size` to see whether these two attributes help classify a cell. Notice that the scatterplot appears to contain far fewer than the hundreds of data points in the table.

In [None]:
patients.scatter('Bland Chromatin', 'Single Epithelial Cell Size', group='Class')

An issue with the data is that each attribute takes only a few distinct numerical values (they behave more like categories or rankings), so many data points are repeated and are plotted on top of one another. The `jittered` table adds a small amount of random noise to each value so that the overlapping points become visible.

In [None]:
jittered.scatter(0, 1, group='Class')
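To see why jittering helps, the following minimal sketch uses a small made-up array of rating-like values (not the actual `patients` data) and adds the same kind of normal noise as `randomize_column`. The sample values and the seed are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical rating-like values with many repeats (not the real data)
ratings = np.array([1, 1, 2, 2, 2, 3, 1, 3])

# Same idea as randomize_column: add small normal noise to every value
rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
jittered_ratings = ratings + rng.normal(0.0, 0.09, size=len(ratings))

# Before jittering, repeated values collapse onto a few distinct points
print(len(np.unique(ratings)))
# After jittering, every value is (almost surely) distinct
print(len(np.unique(jittered_ratings)))
```

The noise is small relative to the gap between neighboring ratings, so each jittered point stays close to its original position.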

## Rows

### Rows of Tables

Each row contains all the data for one individual.
* `t.row(i)` evaluates to the `i`th row of table `t`
* `t.row(i).item(j)` is the value of column `j` in row `i`
* If all values are numbers, then `np.array(t.row(i))` evaluates to an array of all the numbers in the row
* To consider each row individually, use a loop of the form `for row in t.rows:` with a body such as `... row.item(j) ...`
* `t.exclude(i)` evaluates to the table `t` without its `i`th row
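
The `Table` methods above need the `datascience` library and a loaded table. As a rough analogy only, the sketch below uses a plain NumPy array as a stand-in for a table whose values are all numbers. The array `data` and its values are made up for illustration.

```python
import numpy as np

# Made-up numeric "table": each inner list is one individual's row
data = np.array([[5, 1, 1],
                 [5, 4, 4],
                 [3, 1, 1]])

i, j = 1, 2
row_i = data[i]        # plays the role of np.array(t.row(i))
value = data[i][j]     # plays the role of t.row(i).item(j)

# Visit each row individually, like `for row in t.rows:`
row_sums = [int(row.sum()) for row in data]

# All rows except row i, like t.exclude(i)
without_i = np.delete(data, i, axis=0)
```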


## Distance

### Pythagoras' Formula

<img src="img/lec35_pyth.png" width=20%>

For a right triangle with legs $a, b$ and hypotenuse $c$, the following relationship is always true: $$a^2 + b^2 = c^2.$$


### Distance Between Two Points

One way to calculate the distance between two points applies Pythagoras' formula to a right triangle. The formula gains one squared term for each attribute the points have.
* If $D$ represents the distance between points $(x_0, y_0)$ and $(x_1, y_1)$, then $$D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2}$$
* If $D$ represents the distance between points $(x_0, y_0, z_0)$ and $(x_1, y_1, z_1)$, then $$D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2 + (z_0 - z_1)^2}$$
* etc.
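
As a worked example (the points are chosen only for illustration), the distance between $(1, 2)$ and $(4, 6)$ is $$D = \sqrt{(1 - 4)^2 + (2 - 6)^2} = \sqrt{9 + 16} = 5,$$ and the distance between $(0, 0, 0)$ and $(1, 2, 2)$ is $$D = \sqrt{(0 - 1)^2 + (0 - 2)^2 + (0 - 2)^2} = \sqrt{1 + 4 + 4} = 3.$$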

### Demo: Distance

Create a function to calculate the distance between 2 points (represented as arrays). Use that function to calculate the distance between two rows of numeric data from a table.

In [None]:
def distance(pt1, pt2):
    """Return the distance between two points, represented as arrays"""
    return ...

In [None]:
def row_distance(row1, row2):
    """Return the distance between two numerical rows of a table"""
    return ...
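
One possible way to fill in the two blanks, offered as a sketch rather than the official solution: sum the squared coordinate differences and take the square root, then convert each row to an array before measuring. Plain tuples stand in for table rows here so that the example runs without a table.

```python
import numpy as np

def distance(pt1, pt2):
    """Return the distance between two points, represented as arrays"""
    return np.sqrt(np.sum((pt1 - pt2) ** 2))

def row_distance(row1, row2):
    """Return the distance between two numerical rows of a table"""
    # np.array turns a row (or any numeric sequence) into an array
    return distance(np.array(row1), np.array(row2))

# Tuples used as stand-ins for rows; the values are made up
print(row_distance((1, 2), (4, 6)))        # legs 3 and 4, hypotenuse 5
print(row_distance((0, 0, 0), (1, 2, 2)))
```

Because the sum runs over every coordinate, the same function works for points with any number of attributes.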

Clean up the `patients` data to be able to apply the `row_distance` function to it.

In [None]:
attributes = patients.drop('Class')
attributes.show(3)

In [None]:
row_distance(attributes.row(0), attributes.row(1))

In [None]:
row_distance(attributes.row(0), attributes.row(2))

In [None]:
row_distance(attributes.row(2), attributes.row(2))

## Nearest Neighbors

### Finding the `k` Nearest Neighbors

To find the `k` nearest neighbors of an example:
* Find the distance between the example and each example in the training set
* Augment the training data table with a column containing all the distances
* Sort the augmented table in increasing order of the distances
* Take the top `k` rows of the sorted table
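
The four steps can be sketched with NumPy arrays standing in for the training table. The function name `k_nearest`, the array names, and the data are all hypothetical; the lecture's own version works on `Table` objects instead.

```python
import numpy as np

def k_nearest(train_points, example, k):
    """Return the indices of the k training points closest to example."""
    # Steps 1-2: one distance per training row (the "distance column")
    dists = np.sqrt(np.sum((train_points - example) ** 2, axis=1))
    # Step 3: order the row indices by increasing distance
    order = np.argsort(dists)
    # Step 4: keep the first k of them
    return order[:k]

# Made-up training points with two attributes each
train_points = np.array([[1, 1], [2, 2], [8, 8], [9, 9], [1, 2]])
example = np.array([1, 1.4])
nearest = k_nearest(train_points, example, 3)
print(nearest)
```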

### The Classifier

To classify a point:
* Find its `k` nearest neighbors
* Take a majority vote of the `k` nearest neighbors to see which of the two classes appears more often
* Assign the point the class that wins the majority vote
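
A minimal sketch of the voting step, assuming the classes of the `k` nearest neighbors have already been collected into an array. The names `majority_vote` and `neighbor_classes` and the values are made up for illustration.

```python
import numpy as np

def majority_vote(neighbor_classes):
    """Return the class that appears most often among the neighbors."""
    classes, counts = np.unique(neighbor_classes, return_counts=True)
    return classes[np.argmax(counts)]

# Classes of 5 hypothetical nearest neighbors: 0 = benign, 1 = malignant
neighbor_classes = np.array([1, 0, 1, 1, 0])
predicted = majority_vote(neighbor_classes)
print(predicted)
```

With two classes, choosing an odd `k` guarantees that the vote cannot end in a tie.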

### Demo: The Classifier

The following function calculates the distance between an example row (an individual patient) and every row in a training set of `patients` data.

In [None]:
def distances(training, example):
    """
    Compute distance between example and every row in training.
    Return training augmented with a 'Distance_to_ex' column
    """
    distances = make_array()
    attributes_only = training.drop('Class')
    
    for row in attributes_only.rows:
        distances = np.append(distances, row_distance(row, example))
    
#   ^ SAME AS DOING:
#
#   for i in np.arange(attributes_only.num_rows):
#       row = attributes_only.row(i)
#       distances = np.append(distances, row_distance(row, example))
        
    return training.with_column('Distance_to_ex', distances)

Measure the distance between row 21 and every other row in the data set.

In [None]:
example = attributes.row(21)
example

In [None]:
distances(patients.exclude(21), example).sort('Distance_to_ex')

Create a function that finds the `k` closest rows to the example row. Apply that function to the situation above.

In [None]:
def closest(training, example, k):
    """
    Return a table of the k closest neighbors to example
    """
    return ...

In [None]:
closest(patients.exclude(21), example, 5)

Create a function or functions to report the majority class for the nearest `k` rows to the example row.

In [None]:
closest(patients.exclude(21), example, 5).group('Class').sort('count', descending=True)

In [None]:
def majority_class(topk):
    """
    Return the class with the highest count
    """
    return ...

In [None]:
def classify(training, example, k):
    """
    Return the majority class among the 
    k nearest neighbors of example
    """
    return ...

In [None]:
classify(patients.exclude(21), example, 5)

In [None]:
patients.take(21)

Apply this technique to other example rows from the data set.

In [None]:
new_example = attributes.row(10)
classify(patients.exclude(10), new_example, 5)

In [None]:
patients.take(10)

In [None]:
another_example = attributes.row(15)
classify(patients.exclude(15), another_example, 5)

In [None]:
patients.take(15)

### Review of the Steps

- `distance(pt1, pt2)`: Returns the distance between the arrays `pt1` and `pt2`
- `row_distance(row1, row2)`: Returns the distance between the rows `row1` and `row2`
- `distances(training, example)`: Returns a table that is `training` with an additional column `'Distance_to_ex'` that contains the distance between `example` and each row of `training`
- `closest(training, example, k)`: Returns a table of the rows corresponding to the `k` smallest distances
- `majority_class(topk)`: Returns the majority class in the `'Class'` column
- `classify(training, example, k)`: Returns the predicted class of `example` based on a `k` nearest neighbors classifier using the historical sample `training`

## Evaluation

### Accuracy of a Classifier

* The accuracy of a classifier on a labeled data set is the proportion of examples that are labeled correctly
* Need to compare classifier predictions to true labels
* If the labeled data set is sampled at random from a population, then we can infer accuracy on that population

<img src="img/lec35_train_test_split.png" width=50%>
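
As a small illustration with made-up predictions and labels, accuracy is the fraction of positions where the two arrays agree:

```python
import numpy as np

# Hypothetical classifier output and true labels for 8 examples
predictions = np.array([1, 0, 0, 1, 1, 0, 1, 0])
true_labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Comparing elementwise gives True/False; the mean of that is a proportion
accuracy = np.mean(predictions == true_labels)
print(accuracy)  # 6 of the 8 predictions match, so 0.75
```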

### Demo: Accuracy of a Classifier

Split up the `patients` data into two tables where approximately 80% of the data is used for training and 20% is used for testing.

In [None]:
round(patients.num_rows * 0.8)

In [None]:
shuffled = patients.sample(with_replacement=False) # Randomly permute the rows
training_set = shuffled.take(np.arange(round(patients.num_rows * 0.8)))
test_set  = shuffled.take(np.arange(round(patients.num_rows * 0.8), patients.num_rows))
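
The same shuffle-then-slice idea, sketched on row indices with NumPy instead of a `Table`. The row count of 10 is a made-up stand-in for `patients.num_rows`, and the seed is there only to make the sketch repeatable.

```python
import numpy as np

num_rows = 10                        # stand-in for patients.num_rows
num_train = round(num_rows * 0.8)    # size of the training set

rng = np.random.default_rng(0)
shuffled_indices = rng.permutation(num_rows)   # random order of 0..num_rows-1

train_indices = shuffled_indices[:num_train]   # first 80% of the shuffle
test_indices = shuffled_indices[num_train:]    # remaining 20%
```

Because the two slices do not overlap, no row is used for both training and testing.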

Create a function that evaluates the accuracy by returning the proportion of correctly classified examples in the test set.

In [None]:
def evaluate_accuracy(training, test, k):
    """Return the proportion of correctly classified examples 
    in the test set"""
    test_attributes = ...
    num_correct = 0
    for i in np.arange(...):
        c = ...
        num_correct = ...
    return ...

In [None]:
evaluate_accuracy(training_set, test_set, 5)

In [None]:
evaluate_accuracy(training_set, test_set, 3)

In [None]:
evaluate_accuracy(training_set, test_set, 11)

In [None]:
evaluate_accuracy(training_set, test_set, 1)

## Before Classifying

### Dog or Wolf?

<img src="img/lec35_dog_wolf.png" width=80%>

### Start with a Representative Sample

* Both the training and test sets must accurately represent the population on which you use your classifier
* Overfitting happens when a classifier does very well on the training set, but can't do as well on the test set
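
One way to see the gap between training and test performance, sketched with synthetic data (everything here is made up, including the helper `predict_1nn`): a 1-nearest-neighbor classifier evaluated on its own training points finds each point as its own nearest neighbor, so its training accuracy is perfect as long as no two training points coincide. The labels below are random noise, so that perfect score says nothing about performance on new data.

```python
import numpy as np

def predict_1nn(train_points, train_classes, example):
    """Predict the class of example from its single nearest training point."""
    dists = np.sqrt(np.sum((train_points - example) ** 2, axis=1))
    return train_classes[np.argmin(dists)]

rng = np.random.default_rng(1)  # seeded only for repeatability
# Synthetic attributes with labels assigned at random (pure noise)
train_points = rng.normal(size=(40, 2))
train_classes = rng.integers(0, 2, size=40)
test_points = rng.normal(size=(40, 2))
test_classes = rng.integers(0, 2, size=40)

train_acc = np.mean([predict_1nn(train_points, train_classes, p) == c
                     for p, c in zip(train_points, train_classes)])
test_acc = np.mean([predict_1nn(train_points, train_classes, p) == c
                    for p, c in zip(test_points, test_classes)])
print(train_acc, test_acc)
```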

### Standardize if Necessary

In [None]:
Table().with_columns(
    'Glucose', make_array(117, 70, 380, 157),
    'Hemoglobin', make_array(11.2, 9.5, 10.8, 5.6),
    'White Blood Cell Count', make_array(6700, 12100, 4500, 11000),
    'Class', make_array(1, 1, 1, 1)
)

* If the attributes are on very different numerical scales, the attribute with the largest scale can dominate the distance
* In such a situation, it is a good idea to convert all the variables to standard units 
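
Using the four rows from the table above, the sketch below shows that the raw distance between the first two patients is almost entirely due to `White Blood Cell Count`, and that standard units put the three attributes on a comparable scale. The array names are made up for illustration, and the helper mirrors the `standard_units` function used in the demo that follows.

```python
import numpy as np

def standard_units(x):
    """Convert an array to standard units: mean 0, SD 1."""
    return (x - np.average(x)) / np.std(x)

glucose = np.array([117, 70, 380, 157])
hemoglobin = np.array([11.2, 9.5, 10.8, 5.6])
wbc = np.array([6700, 12100, 4500, 11000])

# Raw distance between the first two rows
raw = np.sqrt((glucose[0] - glucose[1]) ** 2
              + (hemoglobin[0] - hemoglobin[1]) ** 2
              + (wbc[0] - wbc[1]) ** 2)
print(raw)  # about 5400.2, barely more than the WBC difference of 5400

# Distance between the same rows after standardizing each attribute
g, h, w = standard_units(glucose), standard_units(hemoglobin), standard_units(wbc)
standardized = np.sqrt((g[0] - g[1]) ** 2 + (h[0] - h[1]) ** 2 + (w[0] - w[1]) ** 2)
print(standardized)
```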

### Demo: Standardize if Necessary

Explore what can happen to the classifier's accuracy when the units are not standardized.

In [None]:
def standard_units(x):
    return (x - np.average(x)) / np.std(x)

In [None]:
ckd_new = ckd.select('Class').with_columns(
    'Glucose_su', standard_units(ckd.column('Glucose')),
    'Hemoglobin_su', standard_units(ckd.column('Hemoglobin')),
    'WBC_su', standard_units(ckd.column('White Blood Cell Count'))
)

In [None]:
ckd_new

In [None]:
shuffled = ckd_new.sample(with_replacement=False) 
training_set = shuffled.take(np.arange(74))
test_set  = shuffled.take(np.arange(74, 148))

In [None]:
evaluate_accuracy(training_set, test_set, 3)

In [None]:
shuffled = ckd.sample(with_replacement=False) 
training_set = shuffled.take(np.arange(74))
test_set  = shuffled.take(np.arange(74, 148))

In [None]:
evaluate_accuracy(training_set, test_set, 3)

<footer>
    <hr>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>