# Lesson 5: Classification

Today:
1. Classification
    + Classification vs. Regression
    + Examples of classification tasks
    + Building a Classifier
    + Hands-on example: cancer data
        + Explore and visualize the training data
        + Brainstorm a simple classification model
2. Decision Trees

## Class Starter

1. Which of the following statements is/are true about the purposes of training and test datasets?

    1. We use the training dataset to find patterns that help us build a model.  The test dataset is then used to assess the model.
    2. The training dataset must consist of the first half of all rows of a dataset; the test dataset consists of the remaining rows.
    3. In order to assess a model, we should not use the dataset that was already used to create the model.
    4. A model that fits the training dataset well will definitely fit the test dataset well also.
    5. None of the above

Respond on PollEV: https://pollev.com/fshum

Answer: 

## 1. Classification

<img src='images/classification.png' width=800>

### 1.1 Classification vs. Regression

So far we have looked at:

<img src='images/models-bigpicture_2.jpg' width=800>

A model that predicts the value of a **numerical variable** is called a **regression model**.

Now we will look at:

<img src='images/models-bigpicture_5b.jpg' width=800>

A model that predicts the value of a **categorical variable** is called a **classification model**.

### 1.2 Examples of classification tasks

- **Robot or Human?**
<table><tr>
    <td><img src='images/captcha1.png' width=400></td>
<td><img src='images/captcha2.gif' width=400></td>
</tr></table>
<sup><p>image source: https://webmasters.googleblog.com/2014/12/are-you-robot-introducing-no-captcha.html </p>
Credit: Alexey Bezrodny/iStock/Getty Images Plus

</sup>

- **0 or 1 or 2 or … or 9?**

<img src='images/digitrecognition.png' width=400>

<sup>image source: http://neuralnetworksanddeeplearning.com/chap1.html

</sup>

- **Image recognition**

<img src='images/image_rec.png' width=600>

- **Spam or not spam?**

<img src='images/TextClassificationExample.png' width=600>

<sup>image source: https://developers.google.com/machine-learning/guides/text-classification/images/TextClassificationExample.png

</sup>

- **COVID-19 or not (from a cough recording)?**

<img src='images/MIT_covid19_coughclassifier.png' width=600>

<sup><p>Source: https://news.mit.edu/2020/covid-19-cough-cellphone-detection-1029</p>
Paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=920879

</sup>

### 1.3 Building a Classifier: Big Picture

**Example: Spam or not spam?**

<img src='images/bigpicture-classifier1.png' width=800>

<img src='images/classifier-unlabelleddata.png' width=800>

**Example: 0 or 1 or 2 or … or 9?**

<table>
    <tr>
        <td>
            <p> In classification, we are trying to predict the category (known as a <b>class</b>) that a sample belongs to.</p>
            <p> In this example: </p>
<p>Classes: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9</p>
<p>Number of classes: 10
</p>
        </td>
        <td><img src='images/digitrecognition.png' width=400>

<sup>image source: http://neuralnetworksanddeeplearning.com/chap1.html

</sup></td>
    </tr>
</table>
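
When the labels sit in a column of a dataset, the classes and their count can be read off directly. A minimal sketch, assuming a small made-up set of digit labels (`digit_labels` is a hypothetical name, not part of any dataset used in this lesson):

```python
import pandas as pd

# Made-up labels for a handful of handwritten-digit images
digit_labels = pd.Series([3, 7, 0, 9, 3, 1, 4, 8, 2, 6, 5, 7, 0])

# The classes are the distinct values that appear in the label column
classes = sorted(digit_labels.unique().tolist())

print('Classes:', classes)
print('Number of classes:', len(classes))
```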

### 1.4 Hands-on Example

#### Example: Cancer Detection

Question: Is this tumor benign or malignant (cancerous)?

We are given a labelled dataset:
+ Each **observation** corresponds to a **tumor** detected in a patient
+ Several columns hold attributes/information about the tumor
+ A column called `Class` holds the label:
    + 0 indicates that the tumor is not cancer (benign)
    + 1 indicates that the tumor is cancer (malignant)

<img src='images/bigpicture-classifier2.png' width=800>

Our goal today:
+ Split the dataset into training and test data
+ Explore and visualize the training data
+ Brainstorm a classification model

In [None]:
# import packages
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
# read in cancer dataset
cancerdata = pd.read_csv('../../../shared/datasets/cancer.csv')

Remarks:
+ Unlike in linear regression, a straight line is not a suitable model here: the response `Class` takes only the values 0 and 1, so it is a category rather than a number on a continuous scale.
+ We need a new model, that is, a new way of thinking about this prediction question.

In [None]:
# split into training and test

# you could do it manually as well, here we do it using sklearn
from sklearn.model_selection import train_test_split

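# predictors: the columns at positions 1 through 9 (position 0 is skipped)
X = cancerdata.iloc[:, 1:10]
# response: the label column (0 = not cancer, 1 = cancer)
y = cancerdata['Class']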

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
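
The comment in the cell above notes that the split could also be done manually. Here is a minimal sketch of one way to do that: shuffle the rows, then keep the first 70% for training and the rest for testing. The DataFrame `toy` and its values are made up to stand in for the cancer data, and this approach will not pick the same rows as `train_test_split`.

```python
import pandas as pd

# Made-up stand-in for the cancer data (values are illustrative only)
toy = pd.DataFrame({
    'Clump Thickness':   [5, 3, 8, 1, 10, 2, 7, 4, 6, 1],
    'Marginal Adhesion': [1, 1, 7, 1, 8, 1, 4, 1, 10, 1],
    'Class':             [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Shuffle the rows so the split does not depend on the original row order
shuffled = toy.sample(frac=1, random_state=1)

# First 70% of the shuffled rows -> training, remaining 30% -> test
n_train = round(0.7 * len(shuffled))
manual_train = shuffled.iloc[:n_train]
manual_test = shuffled.iloc[n_train:]

print(len(manual_train), len(manual_test))
```

Shuffling first matters: if the file happens to be sorted (for example by `Class`), taking the first rows without shuffling would give training and test sets that look very different from each other.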

In [None]:
# put the training predictors and labels back into one table for plotting
training_data = pd.concat([X_train, y_train], axis=1)

In [None]:
sns.scatterplot(data = training_data, x='Clump Thickness', y='Marginal Adhesion', style='Class', hue='Class')

In [None]:
sns.swarmplot(data=training_data, y='Clump Thickness', x='Class')

In [None]:
sns.displot(data=training_data, x='Clump Thickness', hue='Class')

Suppose we decide that the variables `Clump Thickness` and `Marginal Adhesion` are fairly good at distinguishing malignant from benign tumors:
- Small values of `Clump Thickness` together with small values of `Marginal Adhesion`: likely **not malignant**
- Otherwise: more likely to be **malignant**
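
The rule of thumb above can be written down as a tiny hand-made classifier. In this sketch the function name `simple_rule` and the cutoff values (5 and 3) are hypothetical, chosen only for illustration; they are not derived from the cancer data, and the tumors being classified are made up.

```python
import pandas as pd

def simple_rule(clump_thickness, marginal_adhesion,
                thickness_cutoff=5, adhesion_cutoff=3):
    """Return 0 (benign) if both values are small, otherwise 1 (malignant)."""
    if clump_thickness <= thickness_cutoff and marginal_adhesion <= adhesion_cutoff:
        return 0
    return 1

# Made-up tumors (not from the cancer dataset)
new_tumors = pd.DataFrame({
    'Clump Thickness':   [2, 9, 4, 6],
    'Marginal Adhesion': [1, 8, 7, 2],
})

# Apply the rule to every row
new_tumors['Predicted Class'] = [
    simple_rule(t, a)
    for t, a in zip(new_tumors['Clump Thickness'], new_tumors['Marginal Adhesion'])
]
print(new_tumors)
```

Choosing the cutoffs by eye is the weak point of this approach; a good question to keep in mind is how a computer could pick them from the training data.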

#### Cancer Detection: Building Models and Making Predictions

- Fitting a straight line through the data points no longer works!

<table>
    <tr>
        <td>
            <p>Suppose that we have four more data points (A, B, C, D in the figure to the right).  Based on this information alone, <b>what predictions would you make</b> for each of them (malignant or benign)?</p>
<p>How might you <b>quantify</b> elements of your prediction process, so that you can tell a computer to do it for you?
(What is the "rule" or "recipe" that you use in making your prediction?)</p>
        </td>
        <td>
        <img src='images/lec20-knn-illustration2.png' width=600>
        </td>
    </tr>
</table>

**Idea 1: Draw lines or curves that separate the blue dots from the red dots**

<img src='images/first_classifier.png' width=800>
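
One way to turn Idea 1 into something a computer can follow is to describe the separating line with an equation and check which side of the line each point falls on. A minimal sketch, assuming a hypothetical boundary `Clump Thickness + Marginal Adhesion = 8` and made-up points and labels (none of these numbers come from the cancer data or from the figure):

```python
import numpy as np

# Made-up points: columns are (Clump Thickness, Marginal Adhesion)
points = np.array([[2, 1], [3, 2], [8, 6], [10, 9], [5, 1], [4, 8]])
labels = np.array([0, 0, 1, 1, 0, 1])   # made-up true classes

# Hypothetical boundary: the line  Clump Thickness + Marginal Adhesion = 8
# Points on or above the line are predicted malignant (1), points below benign (0)
boundary = 8
predictions = (points.sum(axis=1) >= boundary).astype(int)

print(predictions)
print('predictions agreeing with the labels:',
      (predictions == labels).sum(), 'of', len(labels))
```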

## 2. Decision Trees

<img src='images/decision_tree.png' width=600>

### What is a decision tree?

<table>
    <tr>
        <td> <p>A decision tree is a way to visualize a decision-making process when multiple possibilities are involved.</p>

It is called a <b>tree</b> because the diagram involves branches. It is essentially a particular type of “flowchart”.

- Each node is a True/False question.
- Each node branches into two other nodes based on the response to the question.
</td>
<td><img src='images/tree.png' width=400></td>
    </tr>
</table>
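
To make the idea of a chain of True/False questions concrete, here is a minimal sketch of a two-question tree written in plain Python. The questions, the cutoffs (5 and 3), and the names `tree` and `predict` are hypothetical and chosen by hand for illustration; this is not a tree learned from the cancer data.

```python
# A hypothetical two-question tree, stored as nested dictionaries.
# Each internal node asks a True/False question; each leaf holds a prediction.
tree = {
    'question': lambda tumor: tumor['Clump Thickness'] <= 5,
    True: {
        'question': lambda tumor: tumor['Marginal Adhesion'] <= 3,
        True: {'prediction': 'benign'},
        False: {'prediction': 'malignant'},
    },
    False: {'prediction': 'malignant'},
}

def predict(node, tumor):
    """Follow the True/False answers from the root until a leaf is reached."""
    while 'prediction' not in node:
        answer = node['question'](tumor)
        node = node[answer]
    return node['prediction']

# Made-up tumors
print(predict(tree, {'Clump Thickness': 2, 'Marginal Adhesion': 1}))
print(predict(tree, {'Clump Thickness': 4, 'Marginal Adhesion': 9}))
print(predict(tree, {'Clump Thickness': 8, 'Marginal Adhesion': 1}))
```

Making a prediction means starting at the top node and following one branch per answer until no questions remain.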