**Chapter 3 – Classification**

_This notebook contains all the sample code and solutions to the exercises in chapter 3._

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ageron/handson-ml2/blob/master/03_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ageron/handson-ml2/blob/master/03_classification.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

# Setup

First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated, so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20.

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Is this notebook running on Colab or Kaggle?
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# MNIST

**Warning:** since Scikit-Learn 0.24, `fetch_openml()` returns a Pandas `DataFrame` by default. To avoid this and keep the same code as in the book, we use `as_frame=False`.

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.keys()

In [None]:
X, y = mnist["data"], mnist["target"]
X.shape

In [None]:
y.shape

In [None]:
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.binary,
               interpolation="nearest")
    plt.axis("off")

In [None]:
y = y.astype(np.uint8)
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

In [None]:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

In [None]:
y_test_5

In [None]:
some_digit = X[0]

# Multiclass Classification

* Binary classifiers distinguish between two classes
* Multiclass classifiers distinguish between more than two classes

* Some algorithms are natively able to handle multiclass classification
    * e.g., Logistic regression, Random Forest, naive Bayes
* Other algorithms are strictly binary classifiers
    * However, we can combine multiple binary classifiers to handle multiclass classification problems

#### One-versus-the-rest (OvR)
* For example, if we want to classify images of digits into 10 classes (0 to 9)
* We could create 10 binary classifiers
    * a 0-detector, a 1-detector, a 2-detector, ... up to a 9-detector
    * Then, to classify an unseen image
        * We could run it through each of our 10 classifiers
        * And select the one that produced the highest decision score
        * This is called one-versus-the-rest (also called one-versus-all)
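
The selection step described above can be sketched in a few lines. The scores below are made-up numbers standing in for the output of ten hypothetical digit detectors, and `predict_ovr` is an illustrative helper, not a scikit-learn function. No classifier is trained here.

```python
# Hypothetical decision scores from ten binary "digit d versus the rest"
# detectors for a single image (made-up numbers, not real classifier output).
detector_scores = {
    0: -3.1, 1: -7.4, 2: -2.0, 3: 0.8, 4: -5.6,
    5: 2.3, 6: -4.9, 7: -6.2, 8: -1.1, 9: -3.8,
}

def predict_ovr(scores):
    """Return the class whose detector produced the highest decision score."""
    return max(scores, key=scores.get)

predicted_digit = predict_ovr(detector_scores)
print(predicted_digit)
```

Only the highest score matters: the 3-detector also fires with a positive score here, but the 5-detector is more confident, so 5 is selected.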

#### One-versus-one (OvO)
* Alternatively, we could create binary classifiers for every pair of digits
    * e.g., 0-1, 0-2, 0-3,... 0-9, 1-2, 1-3,... 1-9, 2-3...
    * This is called the one-versus-one strategy
    * For N classes, we would need N x (N-1)/2 classifiers
    * So for MNIST with 10 classes, we would need 10 x 9/2 = 45 binary classifiers!
    * To classify an image, we would have to run it through all 45 classifiers
    * And see which class wins the most duels

* An advantage of OvO is that each classifier is only trained on part of the training set (the part for the two classes being considered by that classifier)

* Some algorithms do not scale well with the size of the training set, so in some cases OvO may be preferred
* However, for most binary classification algorithms, OvR is preferred
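
As a rough illustration of the one-versus-one bookkeeping, the sketch below enumerates the class pairs and counts duel wins. `predict_ovo` and `toy_duel` are hypothetical helpers invented for this example. In practice each duel would be decided by a trained binary classifier.

```python
from collections import Counter
from itertools import combinations

classes = list(range(10))

# One binary classifier per unordered pair of classes: N * (N - 1) / 2 of them.
pairs = list(combinations(classes, 2))
print(len(pairs))

def predict_ovo(duel_winner, classes):
    """Run every pairwise duel and return the class with the most wins.

    `duel_winner(a, b)` is a hypothetical stand-in for a trained binary
    classifier; it returns either a or b for the image being classified.
    """
    wins = Counter(duel_winner(a, b) for a, b in combinations(classes, 2))
    return wins.most_common(1)[0][0]

def toy_duel(a, b):
    """Toy rule: 5 wins every duel it takes part in, otherwise the smaller digit wins."""
    return 5 if 5 in (a, b) else min(a, b)

print(predict_ovo(toy_duel, classes))
```

With this toy rule, the digit 5 wins all nine of its duels, which is more than any other class can reach, so it is the predicted class.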

#### Scikit-learn, OvR, and OvO
* The good news is that scikit-learn will detect when you try to use a binary classification algorithm for a multiclass task
* And it will automatically run OvR or OvO, depending on the algorithm


For example, below we will create a Support Vector Machine classifier using sklearn.svm.SVC: 

In [None]:
from sklearn.svm import SVC

svm_clf = SVC(gamma="auto", random_state=42)
svm_clf.fit(X_train[:1000], y_train[:1000]) # y_train, not y_train_5
svm_clf.predict([some_digit])

In this case, scikit-learn chose to use an OvO strategy: it trained 45 binary classifiers, got their decision scores for the image, and selected the class that won the most duels.

We can call the `decision_function()` method to look at this.

It returns 10 scores per instance (instead of just 1):  one score per class (it's the number of won duels plus or minus a small amount to break ties based on the binary classifier scores).

In [None]:
some_digit_scores = svm_clf.decision_function([some_digit])
some_digit_scores

In [None]:
np.argmax(some_digit_scores)

In [None]:
svm_clf.classes_

When we train a classifier, it stores a list of the classes in its .classes_ attribute.

In this case, we are lucky that the class at index 5 is also the digit 5.

Things will not always work out so nicely.

In [None]:
svm_clf.classes_[5]
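
To see why the lookup through `.classes_` matters, here is a small sketch with made-up values. The arrays `classes` and `scores` are hypothetical stand-ins for a fitted classifier's `.classes_` attribute and one row of `decision_function()` output, assuming a training set that held only the digits 3, 5, and 8.

```python
import numpy as np

# Hypothetical stand-ins for a fitted classifier's attributes:
# `classes` plays the role of `.classes_` (the sorted labels seen in training),
# `scores` plays the role of one row of decision_function() output.
classes = np.array([3, 5, 8])
scores = np.array([-0.4, 2.1, 0.3])

best_index = np.argmax(scores)     # position of the highest score
predicted = classes[best_index]    # label stored at that position
print(best_index, predicted)
```

Here the index of the highest score is 1, but the predicted digit is 5, so the index and the label no longer coincide.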

If we wanted, we could force scikit-learn to use a particular strategy using the `OneVsOneClassifier` or `OneVsRestClassifier` classes.

In [None]:
from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SVC(gamma="auto", random_state=42))
ovr_clf.fit(X_train[:1000], y_train[:1000])
ovr_clf.predict([some_digit])

### If not specified, scikit-learn will select the strategy

Let's look at what happens when we use an `SGDClassifier`.

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])

Oops... it thought the 5 was a 3.  Oh well... our classifier is not perfect.

In this example, scikit-learn trained 10 binary classifiers, one per class, using the OvR strategy.

The decision_function() now returns one value per class:

In [None]:
sgd_clf.decision_function([some_digit])

In the output above, we can see that most of the scores are negative.

However, the score for class 3 is 1823.

Since this was the highest score, it was the predicted class.

To get a more complete evaluation of the classifier, we could use cross-validation.

**Warning**: the following two cells may take close to 30 minutes to run, or more depending on your hardware.

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

Not bad!  Over 85% on all test folds.  A random classifier would get 10% accuracy.

However, we can do better.

For example, scaling the inputs will help:

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
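
For intuition about what the scaling step does, here is a minimal NumPy sketch on a toy matrix. It assumes the default standardization (subtract each column's mean, divide by its standard deviation) and guards against constant columns. The array `X_toy` is made up for illustration.

```python
import numpy as np

# Toy feature matrix: 4 instances, 3 features. The last column is constant,
# like the always-blank border pixels in MNIST.
X_toy = np.array([[0.0, 10.0, 7.0],
                  [2.0, 20.0, 7.0],
                  [4.0, 30.0, 7.0],
                  [6.0, 40.0, 7.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)
std_safe = np.where(std == 0, 1.0, std)   # avoid dividing by zero for constant features

X_toy_scaled = (X_toy - mean) / std_safe
print(X_toy_scaled.mean(axis=0))
print(X_toy_scaled.std(axis=0))
```

After scaling, every non-constant column has mean 0 and standard deviation 1, so no single feature dominates the gradient updates just because of its units.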

# Error Analysis

Once we have found a promising model, we can look for ways to improve it.

One way to do this is to look at where the classifier is making errors.

Our first step will be to look at the confusion matrix.

Since we are now doing multiclass classification, each class could be classified as any of the classes, so we get an NxN matrix:

(recall that the rows are the actual classes and the columns are the predicted classes)

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx
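
To make the row and column convention concrete, the following sketch builds a confusion matrix by hand for a tiny made-up example. `confusion_counts` is an illustrative helper, not the scikit-learn function, and it assumes the labels are the integers 0 to `n_classes - 1`.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count instances for each (actual class, predicted class) pair."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        counts[actual, predicted] += 1   # row = actual, column = predicted
    return counts

# Made-up labels and predictions for 7 instances and 3 classes.
y_true_toy = [0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0, 2]

toy_conf_mx = confusion_counts(y_true_toy, y_pred_toy, n_classes=3)
print(toy_conf_mx)
```

Correct predictions land on the main diagonal, and every off-diagonal cell counts one kind of mistake.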

In [None]:
plt.matshow(conf_mx, cmap=plt.cm.gray)
save_fig("confusion_matrix_plot", tight_layout=False)
plt.show()

But the matrix and plots above are showing us raw numeric counts.

These counts depend on how many images of each class the dataset contains.

For example, the 5s are a bit darker in the plot.  This could mean:
* the classifier does not perform as well on 5s
* there are not as many 5s in the dataset


#### Focus on the errors

Instead of looking at raw counts, we will divide each value by the number of images in the corresponding class.

This will give us error rates instead of absolute numbers.

In [None]:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
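
A small worked example shows why dividing by the row sums matters. The matrix `toy_counts` is made up, with class 2 deliberately much rarer than the other two classes.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows = actual, columns = predicted).
# Class 2 is much rarer than the others.
toy_counts = np.array([[90,  5,  5],
                       [10, 85,  5],
                       [ 2,  2, 16]])

row_sums = toy_counts.sum(axis=1, keepdims=True)   # number of images per actual class
toy_rates = toy_counts / row_sums                  # each row now sums to 1

print(toy_rates)
```

In this made-up matrix, class 2 has the smallest raw error counts (2 and 2), yet its error rate of 20% is the highest of the three classes once each row is divided by its class size.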

The items on the main diagonal are situations where the classifier got the prediction correct.

To help us focus on the errors, we will block out the main diagonal with zeros.

And we can plot our error rates.

In [None]:
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
save_fig("confusion_matrix_errors_plot", tight_layout=False)
plt.show()

We can see several interesting things from the plot above:
* the 8s column is bright, meaning that many images get misclassified as 8s
* however, the 8s row is fairly dark, so most actual 8s get correctly classified as 8s
* 3s and 5s are often confused (in both directions)
    * lighter grey boxes at 3-5 and 5-3
    
    
These insights could help us think of ways to improve our classifier:
* We could add more training data for digits that look like 8s but are not
* We could add new features to help distinguish 8s, 5s, and 3s
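
Reading the brightest cells off a plot can also be done programmatically. The sketch below ranks the cells of a made-up error-rate matrix (`toy_errors`, with its diagonal already zeroed) from most to least confused.

```python
import numpy as np

# Made-up error-rate matrix with the diagonal already zeroed
# (rows = actual class, columns = predicted class).
toy_errors = np.array([[0.00, 0.02, 0.08],
                       [0.01, 0.00, 0.12],
                       [0.03, 0.05, 0.00]])

# All cell indices sorted from highest to lowest error rate;
# the zeroed diagonal lands at the end of the ranking.
flat_order = np.argsort(toy_errors, axis=None)[::-1]
rows, cols = np.unravel_index(flat_order, toy_errors.shape)

for actual, predicted in list(zip(rows, cols))[:3]:
    print(f"actual {actual} predicted as {predicted}: {toy_errors[actual, predicted]:.2f}")
```

The top entries of such a ranking point to the class pairs where extra training data or new features are most likely to pay off.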

# Multilabel Classification

Some classification problems may involve recognizing multiple things from one input.

For example, we might want to know if Alice, Bob, and Charlie are in a picture.
* Outputting [0,1,1] would indicate that Bob and Charlie are in the picture, but not Alice
* Outputting [1,0,1] would indicate that Alice and Charlie are in the picture, but not Bob
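
The output format can be decoded with a short helper. `decode_labels` is a hypothetical function written for this illustration. It simply pairs each output entry with the corresponding name.

```python
people = ["Alice", "Bob", "Charlie"]

def decode_labels(output, names):
    """Return the names whose corresponding output entry is 1."""
    return [name for name, flag in zip(names, output) if flag]

print(decode_labels([0, 1, 1], people))
print(decode_labels([1, 0, 1], people))
```

Each position in the output is an independent yes/no answer, which is what separates multilabel classification from multiclass classification, where exactly one class is chosen.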

In the example below, we will look at a simple multilabel classifier.

We will create `y_multilabel` which will contain two labels for each instance in our dataset.
* the first label will indicate if the image contains a large digit (7, 8, or 9)
* the second label will indicate if the image contains a digit that is an odd number

In [None]:
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
y_multilabel

Then we can train a `KNeighborsClassifier` on the training data with our multilabels.

In [None]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

Predictions made with this classifier will now return two labels in the output:

In [None]:
knn_clf.predict([some_digit])

There are many ways to evaluate a multilabel classifier.

One approach is to compute the F<sub>1</sub> score for each individual label and then compute an average score across all labels.

This assumes that all labels are equally important.

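Alternatively, we could use a weighted average, for example giving each label a weight equal to its support (the number of instances with that label) by setting `average="weighted"`.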

**Warning**: the following cell may take a very long time (possibly hours depending on your hardware).

In [None]:
from sklearn.metrics import f1_score  # not imported earlier in this notebook
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
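
The difference between the macro and weighted averages can be worked through by hand on a tiny made-up example. `f1_binary` is an illustrative helper that uses the standard formula F1 = 2·TP / (2·TP + FP + FN). The targets and predictions are invented for this sketch.

```python
def f1_binary(y_true, y_pred):
    """F1 score for one binary label: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up targets and predictions: one row per instance, one column per label.
y_true_toy = [[1, 0], [1, 1], [0, 1], [1, 0], [1, 0], [0, 0]]
y_pred_toy = [[1, 0], [1, 0], [0, 1], [1, 1], [0, 0], [0, 0]]

n_labels = len(y_true_toy[0])
true_cols = [[row[j] for row in y_true_toy] for j in range(n_labels)]
pred_cols = [[row[j] for row in y_pred_toy] for j in range(n_labels)]

per_label_f1 = [f1_binary(t, p) for t, p in zip(true_cols, pred_cols)]
support = [sum(col) for col in true_cols]   # number of positive instances per label

macro_f1 = sum(per_label_f1) / n_labels
weighted_f1 = sum(f * s for f, s in zip(per_label_f1, support)) / sum(support)
print(per_label_f1, macro_f1, weighted_f1)
```

In this example the first label has twice the support of the second and a higher F1 score, so the weighted average ends up above the macro average.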

## Count Vectorizer

In [None]:
# from: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [None]:
corpus

In [None]:
corpus[0]

#### Create a feature vector where every word is a feature

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
words = vectorizer.get_feature_names_out()


In [None]:
X

In [None]:
X.toarray()

In [None]:
words

In [None]:
import pandas as pd

df = pd.DataFrame(X.toarray(), columns=words)
df

In [None]:
newdoc = vectorizer.transform(['one document is first and second'])

In [None]:
pd.DataFrame(newdoc.toarray(), columns=words)
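
To show what the vectorizer is doing, here is a pure-Python sketch of the same bag-of-words counting. It assumes `CountVectorizer`'s default settings (lowercasing, and tokens of two or more word characters). `tokenize` and `count_vector` are illustrative helpers, not scikit-learn functions.

```python
import re
from collections import Counter

docs = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token in the corpus, sorted alphabetically.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

def count_vector(text):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

print(vocabulary)
for doc in docs:
    print(count_vector(doc))
print(count_vector('one document is first and second'))
```

Word order is discarded: each document becomes a vector with one count per vocabulary word, which is why the second document has a 2 in the `document` column.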

## Exercise 3.1 -- Classifying movie reviews

In this exercise, you will build a binary classifier to determine if movie reviews from the Internet Movie Database (IMDB) are positive or negative.

We will use a dataset of 1000 movie reviews from IMDB.

The dataset is in the text file:  ../data/imdb_labelled.txt

It originally comes from: https://github.com/microsoft/ML-Server-Python-Samples/blob/master/microsoftml/202/data/sentiment_analysis/imdb_labelled.txt

Each line in the file consists of:
* the text of the review 
* a tab character (\t)
* a label of 0 (negative review) or 1 (positive review)
* a newline character (\n)

For example:

```
A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  	0
Saw the movie today and thought it was a good effort, good messages for kids.  	1
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  	0
Buy it, play it, enjoy it, love it.  	1
```
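
As a hint for handling this format, the sketch below parses a single line. It is one possible approach, not the required solution. `parse_line` is a hypothetical helper, and the cleaning rule (keep only a-z and spaces) is an assumption about what counts as "clean" text.

```python
import re

def parse_line(line):
    """Split one 'review<TAB>label' line into (cleaned_text, label)."""
    text, label = line.rstrip("\n").rsplit("\t", 1)
    cleaned = re.sub(r"[^a-z ]", "", text.lower())   # keep only a-z and spaces
    return cleaned.strip(), int(label)

sample = "Buy it, play it, enjoy it, love it.  \t1\n"
print(parse_line(sample))
```

Splitting from the right with `rsplit("\t", 1)` keeps the parse correct even if a review happens to contain a tab character of its own.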

Below are suggested steps for creating and evaluating your classifier.

When you have finished your program, go to Canvas --> Assignments --> Exercise 3.1 and submit ONE file with your Python code (e.g., copy and save the code you write in the cell below).


In [None]:
# Step 0: a few suggested things to import
import string
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score


# Step 1: Read the file into a list where each list item is one instance (one line) from the file
with open("../data/imdb_labelled.txt") as infile:
    lines = infile.read().splitlines()  # splitlines() avoids a trailing empty string

# Step 2: For each line do steps (2a through 2e)

    # Step 2a: split the line to get the review text and the label

    
    # Step 2b: clean up the review text by
    #   converting all the characters to lower case
    #   removing any characters that are not a-z or a space character
    #   note: you may wish to use regular expressions for this (import re)

    
    # Step 2c: convert the label from a string to a number

    
    # Step 2d: append the cleaned review text to a list

    
    # Step 2e: append the label to another list

    
    
# Step 3: Instantiate a CountVectorizer and use it to convert the cleaned review lines to vectors


# Step 4: Divide the data into training and testing sets


# Step 5: Instantiate a SGDClassifier and train it on the training data


# Step 6: Make up a new review text and see what the classifier predicts for it


# Step 7: Use 10-fold cross-validation to evaluate the classifier accuracy
    


