In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
import matplotlib.pyplot as plots
from mpl_toolkits.mplot3d import Axes3D
plots.style.use('fivethirtyeight')
%matplotlib inline

# Module 7.1: Introduction to Classification

In this module, you'll learn how to construct a machine learning algorithm for classifying binary outcomes. 

3 videos make up this notebook, for a total run time of 43:03.

1. [Introduction](#section1) *1 video, total runtime 14:24*
2. [Classifiers](#section2) *2 videos, total runtime 28:39*
3. [Check for Understanding](#section3)

Textbook readings:
- [Chapter 17](https://www.inferentialthinking.com/chapters/17/Classification.html)
- [Chapter 17.1: Nearest Neighbors](https://www.inferentialthinking.com/chapters/17/1/Nearest_Neighbors.html)

<a id='section1'></a>

## 1. Introduction

In the first lecture recording of this module, Professor Wagner provides a high-level overview of machine learning algorithms with a 
particular focus on classifiers.

In [None]:
YouTubeVideo('D_E4b0QG_fY')

<a id='section2'></a>

## 2. Classifiers

Now that you've gained some familiarity with classifiers, you'll see how they can be used in a medical setting. This lecture will also
introduce you to a simple yet powerful classification technique called *nearest neighbors*.

In [None]:
YouTubeVideo('AqOPmNjjA6A')

The cell below loads the Chronic Kidney Disease dataset. Use it to follow along as Professor Wagner performs an exploratory analysis.

In [None]:
# load the data
ckd = Table.read_table('https://www.inferentialthinking.com/data/ckd.csv').relabeled('Blood Glucose Random', 'Glucose')
ckd.show(5)

In [None]:
...

The next video introduces another problem that classifiers can solve: fraud detection. In particular, you'll
see how machine learning algorithms can be used to identify counterfeit bank notes.

In [None]:
YouTubeVideo('eHdDpemodVc')

Use the `banknotes` table loaded in the cell below to follow along.

In [None]:
# load the data
banknotes = Table.read_table('https://www.inferentialthinking.com/data/banknote.csv')
banknotes.show(5)

In [None]:
...

<a id='section3'></a>

## 3. Check for Understanding



**Consider the Chronic Kidney Disease data from lecture video 30.1. In addition to recording whether each
patient had chronic kidney disease, it includes variables indicating whether a patient has anemia, hypertension, diabetes, etc.**

**A. Based on the scatter plot below, do you think we could use the nearest neighbor algorithm to predict whether a patient is
anemic based on their blood glucose and white blood cell count?**

In [None]:
ckd.scatter('White Blood Cell Count', 'Glucose', colors = 'Anemia')

<details>
    <summary>Solution</summary>
    No, it doesn't look like this algorithm would do a good job of classifying patients. The observations' colors appear to
    be randomly assigned in this scatter plot, so assigning a new observation to the class of its nearest neighbor seems unlikely
    to produce accurate results.
</details>
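
The nearest neighbor rule mentioned in the solution above can be sketched in a few lines. This is a minimal illustration, not the implementation from the lecture or the textbook: the function name `nearest_neighbor_class` is hypothetical, and the points and labels are made-up values, not rows from `ckd`.

```python
import math

def nearest_neighbor_class(training_points, training_labels, new_point):
    """Return the label of the training point closest to new_point."""
    # Euclidean distance from the new point to every training point
    distances = [math.dist(point, new_point) for point in training_points]
    # Index of the smallest distance, i.e. the nearest neighbor
    closest_index = distances.index(min(distances))
    return training_labels[closest_index]

# Made-up, illustrative values (assumption: not taken from the ckd table)
toy_points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
toy_labels = [0, 0, 1, 1]

prediction_a = nearest_neighbor_class(toy_points, toy_labels, (2.0, 1.5))
prediction_b = nearest_neighbor_class(toy_points, toy_labels, (7.0, 9.0))
print(prediction_a, prediction_b)
```

In this toy data the two classes form separate clusters, so each new point takes the label of the cluster it sits next to. When the colors in a scatter plot are mixed together, as in the anemia plot, the nearest neighbor's label carries little information about the new point.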

**B. What two variables could we use to create a nearest neighbor classifier to predict if a patient is hypertensive?**

In [None]:
ckd.scatter(..., ..., colors = 'Hypertension')

<details>
    <summary>Solution</summary>
    There are a few possibilities, though none of them are perfect:
    
    ckd.scatter('Hemoglobin', 'Red Blood Cell Count', colors = 'Hypertension')
    ckd.scatter('Potassium', 'Red Blood Cell Count', colors = 'Hypertension')
    ckd.scatter('Sodium', 'Packed Cell Volume', colors = 'Hypertension')
</details>

**C. Consider the `banknotes` dataset from lecture 30.3. Suppose we fit a nearest neighbor classifier using the variables
in the figure below, and use our algorithm to predict whether the bills in a new set are counterfeit. Note that the counterfeit and
non-counterfeit bills in this plot are completely separated.**

**Will our algorithm identify the counterfeits in our new set of bank notes with 100% accuracy?**

In [None]:
fig = plots.figure(figsize = (8, 8))
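# add_subplot attaches the 3D axes to the figure; recent matplotlib versions no longer do this for Axes3D(fig)
ax = fig.add_subplot(projection = '3d')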
ax.scatter(
    banknotes.column('WaveletSkew'),
    banknotes.column('WaveletVar'),
    banknotes.column('WaveletCurt'),
    c = banknotes.column('Class'),
    s = 20
);

<details>
    <summary>Solution</summary>
    We can't say whether our algorithm will be 100% accurate for a new set of bills. Any counterfeit bank notes in our new
    set of bills may have been created using different techniques than those used for the bills on which our classifier was trained.
    In other words, we don't know if our algorithm will generalize well to a new dataset. 
</details>
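
One common way to get a rough estimate of how a classifier generalizes is to hold out part of the data, fit on the rest, and measure accuracy on the held-out part. The sketch below shows the idea under stated assumptions: the helper names `nearest_neighbor_class` and `holdout_accuracy` are hypothetical, and the points are made-up values rather than rows from `banknotes`. Even a perfect held-out score only describes bills similar to those in the data; it says nothing about counterfeits produced with different techniques.

```python
import math
import random

def nearest_neighbor_class(training_points, training_labels, new_point):
    """Return the label of the training point closest to new_point."""
    distances = [math.dist(point, new_point) for point in training_points]
    return training_labels[distances.index(min(distances))]

def holdout_accuracy(points, labels, test_fraction=0.25, seed=0):
    """Shuffle the rows, hold out a fraction as a test set, and return
    the proportion of held-out rows the classifier labels correctly."""
    rng = random.Random(seed)
    indices = list(range(len(points)))
    rng.shuffle(indices)
    n_test = max(1, int(len(points) * test_fraction))
    test_indices, train_indices = indices[:n_test], indices[n_test:]
    train_points = [points[i] for i in train_indices]
    train_labels = [labels[i] for i in train_indices]
    # Classify each held-out row using only the training rows
    correct = sum(
        nearest_neighbor_class(train_points, train_labels, points[i]) == labels[i]
        for i in test_indices
    )
    return correct / n_test

# Made-up, well-separated clusters (assumption: not the banknotes data)
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

accuracy = holdout_accuracy(toy_points, toy_labels)
print(accuracy)
```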