# Classification

The most common supervised learning tasks are regression (predicting values) and classification (predicting classes).

## MNIST

A set of 70,000 small images of handwritten digits.<br><br>
Scikit-Learn provides helper functions to download popular datasets.

In [None]:
>>> from sklearn.datasets import fetch_openml
>>> mnist = fetch_openml('mnist_784', version=1)
>>> mnist.keys()

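Datasets loaded by Scikit-Learn generally have a similar dictionary structure. Extract the data and the targets. Here they come back as pandas objects, which is why later cells call to_numpy() where a NumPy array is needed.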

In [None]:
>>> X, y = mnist["data"], mnist["target"]
>>> X.shape

In [None]:
>>> y.shape

70,000 images with 784 features. Each image is 28x28 pixels and each feature represents one pixel's intensity from 0 (white) to 255 (black). 
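To make the 784-feature layout concrete, here is a small sketch of how a flat vector maps onto a 28x28 grid. The `flat` array is a hypothetical stand-in for one row of `X`, not real MNIST data.

```python
import numpy as np

# Hypothetical flat image: 784 values standing in for one row of X.
flat = np.arange(784)

# Row-major reshape: the first 28 values become the first row of pixels.
image = flat.reshape(28, 28)

row, col = 3, 5
# The pixel at (row, col) comes from flat index row * 28 + col.
same_pixel = image[row, col] == flat[row * 28 + col]
print(image.shape, same_pixel)
```

This is the same reshape the plotting cell applies to `some_digit` before calling `imshow()`.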

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt

some_digit = X.to_numpy()[0]
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()

Looks like a 5, and the label says it's a 5:

In [None]:
>>> y.to_numpy()[0]

The labels are strings, so cast them to integers:

In [None]:
>>> import numpy as np
>>> y = y.astype(np.uint8)

MNIST is already split into a training set (first 60,000 images) and a test set (last 10,000 images). The training set is already shuffled.

In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

## Training a Binary Classifier

Try to classify only 1 digit, the number 5. Create the target vectors for the task:

In [None]:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

Start with a _Stochastic Gradient Descent_ (SGD) classifier. It handles very large datasets efficiently because it deals with training instances independently, one at a time (which also makes it well suited to online learning).

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

# requires randomness hence stochastic
# set random state for reproducible results
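A minimal sketch of what "one instance at a time" means, assuming a hinge-loss linear model on hypothetical toy data. This is not Scikit-Learn's implementation: the real SGDClassifier also shuffles the data, applies regularization, and adapts the learning rate, all of which are left out here.

```python
# Toy 2-D training set: (features, label) with labels +1 / -1 (hypothetical data).
samples = [
    ([2.0, 1.0], 1),
    ([1.5, 2.0], 1),
    ([-1.0, -1.5], -1),
    ([-2.0, -0.5], -1),
]

weights = [0.0, 0.0]
bias = 0.0
learning_rate = 0.1


def score(x):
    """Linear decision score w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


for epoch in range(20):
    for x, label in samples:          # one training instance at a time
        if label * score(x) < 1:      # hinge loss is active for this instance
            for i, xi in enumerate(x):
                weights[i] += learning_rate * label * xi
            bias += learning_rate * label

predictions = [1 if score(x) > 0 else -1 for x, _ in samples]
print(predictions)
```

Each update touches only the current instance, which is why the same loop could keep running as new data arrives.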

In [None]:
# detect images of 5
>>> sgd_clf.predict([some_digit])

## Performance Measures

Evaluating a classifier is often trickier than evaluating a regressor.

### Measuring Accuracy Using Cross-Validation

#### Implementing Cross-Validation
Occasionally you will need more control over the cross-validation process than Scikit-Learn's built-in features offer. The following does roughly the same thing as cross_val_score():

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    # X_train and y_train_5 are pandas objects here, so select rows by position
    X_train_folds = X_train.iloc[train_index]
    y_train_folds = y_train_5.iloc[train_index]
    X_test_fold = X_train.iloc[test_index]
    y_test_fold = y_train_5.iloc[test_index]
    
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))

__StratifiedKFold__ class performs stratified sampling to produce folds that contain a representative ratio of each class. At each iteration the code creates a clone of the classifier, trains that clone on the training folds, and makes predictions on the test fold. Then it counts the number of correct predictions and outputs the ratio of correct predictions.
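The idea of stratified folds can be sketched in plain Python. `stratified_folds` is a hypothetical helper written for illustration; unlike StratifiedKFold with shuffle=True it does not shuffle, it only deals the indices of each class evenly across the folds.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign each index to a fold so every fold keeps a similar class ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Hypothetical skewed labels: 3 positives among 30 instances (10%).
labels = [i % 10 == 0 for i in range(30)]
folds = stratified_folds(labels, n_splits=3)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print([len(fold) for fold in folds], positives_per_fold)
```

Every fold ends up with the same share of positives as the full set, which matters when one class is rare.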

Use cross_val_score() to evaluate the SGDClassifier model, using K-fold cross-validation with 3 folds. K-fold cross-validation means splitting the training set into K folds, then making predictions and evaluating them on each fold using a model trained on the remaining folds.

In [None]:
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Over 90% accuracy on all folds! But wait, there's a catch. Look at a dummy classifier that classifies every single image as "not 5":

In [None]:
from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)

In [None]:
>>> never_5_clf = Never5Classifier()
>>> cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Over 90% accuracy. This is because only about 10% of the images are 5s, so if you always guess that an image is not a 5, you will be right about 90% of the time. <br><br>
This demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially with _skewed datasets_ (where some classes are much more frequent than others).
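The effect is easy to reproduce without MNIST. A small sketch on hypothetical labels with the same 10% positive rate:

```python
# Hypothetical skewed labels: 1,000 instances, 10% positive ("is a 5").
y_true = [i % 10 == 0 for i in range(1000)]

# A "classifier" that never predicts the positive class.
y_pred = [False] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
detected = sum(t and p for t, p in zip(y_true, y_pred))
print(accuracy, detected)  # 0.9 0
```

The accuracy is 90% even though not a single positive instance is detected.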

### Confusion Matrix

A much better way to evaluate the performance of a classifier. The general idea is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused images of 5s with 3s, you would look at row 5 and column 3 of the confusion matrix (counting rows and columns from 0, so they line up with the digit classes).



To compute the confusion matrix, you first need a set of predictions so that they can be compared to the actual targets.

In [None]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

Like cross_val_score(), cross_val_predict() performs K-fold cross-validation but instead of returning the evaluation scores, it returns the predictions made on each test fold

In [None]:
>>> from sklearn.metrics import confusion_matrix
>>> confusion_matrix(y_train_5, y_train_pred)

Each row represents an _actual class_ while each column represents a _predicted class_. <br>
The first row considers non-5 images (the _negative class_): cell [0, 0] counts those correctly classified as non-5s (_true negatives_), cell [0, 1] those wrongly classified as 5s (_false positives_). <br>
The second row considers the images of 5s (the _positive class_): cell [1, 0] counts those wrongly classified as non-5s (_false negatives_), cell [1, 1] those correctly classified as 5s (_true positives_). <br>
A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right).

_Precision_ of the classifier: precision = TP / (TP + FP) <br>
TP = number of true positives, FP = number of false positives

Precision is typically used with _recall_, also called _sensitivity_ or the _true positive rate_ (TPR): ratio of positive instances that are correctly detected by the classifier <br>
recall = TP / (TP + FN) <br>
FN = number of false negatives
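A worked example of the two formulas, using hypothetical counts rather than the MNIST results:

```python
def precision(tp, fp):
    """Share of positive predictions that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that are detected."""
    return tp / (tp + fn)


# Hypothetical counts: 80 true positives, 20 false positives, 120 false negatives.
tp, fp, fn = 80, 20, 120
print(precision(tp, fp), recall(tp, fn))  # 0.8 0.4
```

With these counts the classifier is right 80% of the time when it claims a positive, but it only finds 40% of the positives.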

### Precision and Recall

Scikit-Learn provides functions to compute classifier metrics

In [None]:
>>> from sklearn.metrics import precision_score, recall_score
>>> precision_score(y_train_5, y_train_pred)


In [None]:
>>> recall_score(y_train_5, y_train_pred)

The classifier does not look as good as its accuracy suggested.

Combine precision and recall into one metric called _F1 score_ if you need a simple way to compare two classifiers. F1 score is the _harmonic mean_ of precision and recall. Harmonic mean gives much more weight to low values. Classifier will only get a high F1 score if both recall and precision are high.

F1 = TP / (TP + ((FN + FP) / 2))
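A quick numeric check, on hypothetical counts, that the harmonic mean of precision and recall matches the count-based formula above, and that it sits below the arithmetic mean when the two values differ:

```python
# Hypothetical counts, not the MNIST results.
tp, fp, fn = 80, 20, 120

precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1_harmonic = 2 / (1 / precision + 1 / recall)
f1_counts = tp / (tp + (fn + fp) / 2)
arithmetic_mean = (precision + recall) / 2

print(f1_harmonic, f1_counts, arithmetic_mean)
```

The low recall pulls F1 well below the arithmetic mean, which is the "more weight to low values" behavior.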

In [None]:
>>> from sklearn.metrics import f1_score
>>> f1_score(y_train_5, y_train_pred)

F1 favors classifiers that have similar precision and recall. This is not always what you want: in some cases you care more about precision, in others more about recall.

You can't have it both ways: increasing precision reduces recall, and vice versa. This is called the _precision/recall trade-off_.

### Precision/Recall Trade-Off

SGDClassifier is based on a _decision function_. If the score is greater than a threshold, it assigns the instance to the positive class; otherwise it assigns it to the negative class. By increasing the threshold, it increases the precision but lowers recall because it increases the number of false negatives. By decreasing the threshold, it increases the recall but lowers the precision because it increases the number of false positives.
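The effect of moving the threshold can be sketched on a handful of hypothetical scores and labels (the helper `precision_recall_at` is made up for this illustration):

```python
# Hypothetical decision scores and true labels, sorted by score for readability.
scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]
labels = [False, False, True, False, True, True]


def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score > threshold."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and t for p, t in zip(predicted, labels))
    fp = sum(p and not t for p, t in zip(predicted, labels))
    fn = sum((not p) and t for p, t in zip(predicted, labels))
    return tp / (tp + fp), tp / (tp + fn)


low = precision_recall_at(0.0)    # precision 0.75, recall 1.0
high = precision_recall_at(1.5)   # precision 1.0, recall 2/3
print(low, high)
```

Raising the threshold from 0.0 to 1.5 removes the one false positive but also turns a true positive into a false negative.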

Use SGDClassifier's decision_function() method to return a score for each instance, and then use any threshold you want to make predictions based on those scores:

In [None]:
>>> y_scores = sgd_clf.decision_function([some_digit])
>>> y_scores

In [None]:
>>> threshold = 0
>>> y_some_digit_pred = (y_scores > threshold)
>>> y_some_digit_pred

Let's raise the threshold

In [None]:
>>> threshold = 8000
>>> y_some_digit_pred = (y_scores > threshold)
>>> y_some_digit_pred

How to decide which threshold to use? First, use the cross_val_predict() function to get the scores of all instances in the training set, but this time specify that you want to return decision scores instead of predictions:

In [None]:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")

Use precision_recall_curve() to compute precision and recall for all possible thresholds

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel('Threshold')
    plt.legend()

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

Another way to select a good precision/recall trade-off is to plot precision directly against recall.

Suppose you aim for 90% precision. You can search for the lowest threshold that gives you at least 90% precision:

In [None]:
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
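The one-liner relies on np.argmax() returning the index of the first True in a boolean array. A small sketch with hypothetical values (as with precision_recall_curve(), the thresholds are in increasing order and there is one more precision value than thresholds):

```python
import numpy as np

# Hypothetical values, not the MNIST results.
thresholds = np.array([-2.0, 0.0, 1.5, 3.0])
precisions = np.array([0.60, 0.75, 0.92, 0.95, 1.0])

mask = precisions >= 0.90          # [False, False, True, True, True]
first_index = np.argmax(mask)      # index of the first True
threshold_90 = thresholds[first_index]
print(first_index, threshold_90)   # 2 1.5

# Caveat: if no entry is True, argmax still returns 0, so check mask.any() first.
never = np.argmax(precisions >= 2.0)
print(never)                       # 0
```

Because the thresholds are sorted in increasing order, the first True corresponds to the lowest qualifying threshold.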

In [None]:
# make predictions by thresholding the scores instead of calling the classifier's predict()
y_train_pred_90 = (y_scores >= threshold_90_precision)

In [None]:
>>> precision_score(y_train_5, y_train_pred_90)


In [None]:
>>> recall_score(y_train_5, y_train_pred_90)

Now we have a 90% precision classifier. However, a high precision classifier is not very useful when its recall is too low

### ROC Curve

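_Receiver operating characteristic_ (ROC) curve is another common tool used with binary classifiers. It plots the _true positive rate_ (another name for recall) against the _false positive rate_ (FPR). The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to 1 - the _true negative rate_ (TNR), the ratio of negative instances that are correctly classified as negative. TNR is also called _specificity_. Hence the ROC curve plots _sensitivity_ (recall) versus 1 - _specificity_.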

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0,1], [0,1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (Recall)')

plot_roc_curve(fpr, tpr)
plt.show()

The higher the recall, the more false positives (FPR) the classifier produces. Dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner). 

Compute _area under the curve_ (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC of 0.5.
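One way to build intuition for these two reference values: the ROC AUC equals the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative one (ties count as half). A plain-Python sketch on hypothetical scores, not a replacement for roc_auc_score():

```python
# Hypothetical scores and labels.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [False, False, True, True]

positives = [s for s, is_pos in zip(scores, labels) if is_pos]
negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]

# Count the (positive, negative) pairs that are ranked correctly.
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p in positives
    for n in negatives
)
auc = wins / (len(positives) * len(negatives))
print(auc)  # 0.75
```

A perfect ranking wins every pair (AUC of 1), while random scores win about half of them (AUC of 0.5).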

In [None]:
>>> from sklearn.metrics import roc_auc_score
>>> roc_auc_score(y_train_5, y_scores)

Prefer the precision/recall (PR) curve whenever the positive class is rare or when you care more about the false positives than the false negatives. Otherwise, use the ROC curve.

Train a RandomForestClassifier and compare its ROC curve and ROC AUC score to those of the SGDClassifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")

The roc_curve() function expects labels and scores, but instead of scores you can give it class probabilities. Use the positive class's probability as the score.

In [None]:
y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)

In [None]:
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "RandomForest")
plt.legend(loc="lower right")
plt.show()

The RandomForestClassifier's ROC curve looks better than the SGDClassifier's (it comes closer to the top-left corner), so its ROC AUC score should be higher.

In [None]:
>>> roc_auc_score(y_train_5, y_scores_forest)

## Multiclass Classification

_Multiclass classifiers_ (_multinomial classifiers_) distinguish between more than two classes
<br><br>
Some algorithms (such as Logistic Regression, Random Forest, and naive Bayes classifiers) are capable of handling multiple classes natively. Others (such as SGD or Support Vector Machine classifiers) are strictly binary classifiers.
<br><br>
There are various strategies that you can use to perform multiclass classification with multiple binary classifiers. 
<br><br>
_One-versus-the-rest_ (OvR) or _one-versus-all_ strategy: train one binary classifier per class (10 for the digits 0 to 9: a 0-detector, a 1-detector, and so on). To classify an image, get the decision score from each classifier for that image and select the class whose classifier outputs the highest score.
<br><br>
_One-versus-one_ (OvO) strategy: train a binary classifier for every pair of classes (0 vs 1, 0 vs 2, 1 vs 2, and so on). If there are N classes, you must train N x (N-1) / 2 classifiers (45 for MNIST). The main advantage is that each classifier only needs to be trained on the part of the training set for the two classes it must distinguish.
<br><br>
Some algorithms (like SVM classifiers) scale poorly with the size of the training set. For these algorithms OvO is preferred because it is faster to train many classifiers on small training sets than to train few classifiers on large training sets. For most binary classifiers, however, OvR is preferred. Scikit-Learn automatically runs OvR or OvO depending on the algorithm.
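The number of binary classifiers each strategy needs for the ten digit classes can be counted directly. A small sketch (the task tuples are only illustrative labels, not Scikit-Learn objects):

```python
from itertools import combinations

classes = list(range(10))                    # the ten digit classes

ovr_tasks = [(c, "rest") for c in classes]   # one detector per class
ovo_tasks = list(combinations(classes, 2))   # one duel per pair of classes

print(len(ovr_tasks), len(ovo_tasks))        # 10 45
```

OvO trains more classifiers, but each one only sees the images of its two classes.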

In [None]:
>>> from sklearn.svm import SVC
>>> svm_clf = SVC()
>>> svm_clf.fit(X_train, y_train) # y_train, not y_train_5
>>> svm_clf.predict([some_digit])

Scikit-Learn used OvO: trained 45 binary classifiers, got their decision scores for the image, and selected the class that won the most duels
<br><br>
decision_function() now returns 10 scores per instance (instead of 1): one score per class (the number of won duels plus or minus a small tweak to break ties)

In [None]:
>>> some_digit_scores = svm_clf.decision_function([some_digit])
>>> some_digit_scores

In [None]:
>>> np.argmax(some_digit_scores)

In [None]:
>>> svm_clf.classes_

In [None]:
>>> svm_clf.classes_[5]

To force Scikit-Learn to use OvO or OvR, use the built-in classes OneVsOneClassifier or OneVsRestClassifier

In [None]:
>>> from sklearn.multiclass import OneVsRestClassifier
>>> ovr_clf = OneVsRestClassifier(SVC())
>>> ovr_clf.fit(X_train, y_train)
>>> ovr_clf.predict([some_digit])

In [None]:
>>> len(ovr_clf.estimators_)

Train SGDClassifier

In [None]:
>>> sgd_clf.fit(X_train, y_train)
>>> sgd_clf.predict([some_digit])

Used OvR under the hood: it trained 10 binary classifiers

In [None]:
>>> sgd_clf.decision_function([some_digit])

The score for class 3 is low. A single prediction says little about overall performance, so use cross-validation to evaluate the classifier

In [None]:
>>> cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

A random classifier would get about 10% accuracy. Scaling the inputs increases the accuracy

In [None]:
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
>>> cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")


## Error Analysis

In a real project, follow the Machine Learning project checklist (Appendix B). Explore data preparation options, try multiple models (shortlisting the best ones and fine-tuning their hyperparameters using GridSearchCV), and automate as much as possible. Here, assume we have a promising model and we want to improve it. One way to do that is to analyze the types of errors it makes.
<br><br>
First look at confusion matrix

In [None]:
>>> y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
>>> conf_mx = confusion_matrix(y_train, y_train_pred)
>>> conf_mx

In [None]:
# plot confusion matrix
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

In [None]:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

In [None]:
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

Now the errors the classifier makes are clearly visible. Rows represent actual classes, columns represent predicted classes. The column for class 8 is bright, meaning many images get misclassified as 8s. However, the row for class 8 is not that bad, telling you that actual 8s in general get properly classified as 8s.
<br><br>
The confusion matrix is telling us to reduce false 8s. Could try to gather more training data for digits that look like 8s (but are not) so the classifier can learn to distinguish them from real 8s. Or engineer new features to help the classifier, for example writing an algorithm to count the number of closed loops (e.g., 8 has two, 6 has one, 5 has none). Could also preprocess the images to make some patterns, such as closed loops, stand out more.
<br><br>
Analyzing individual errors can help gain insights. Plot examples of 3s and 5s

In [None]:
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]
plt.figure(figsize=(8,8))
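# plot_digits() is a plotting helper assumed to be defined elsewhere in the notebook
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)  # 3s classified as 3s
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)  # 3s classified as 5s
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)  # 5s classified as 3s
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)  # 5s classified as 5s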
plt.show()

Some errors are understandable due to poor handwriting (even a human would have trouble classifying them). Others are obvious but the model still gets them wrong. This is because we used a simple SGDClassifier, which is a linear model. It assigns a weight per class to each pixel, and when it sees a new image it just sums up the weighted pixel intensities to get a score for each class. Since 3s and 5s differ only by a few pixels, this model will easily confuse them.

Main difference between 3 and 5 is the small line that joins the top line to the bottom arc. The classifier is quite sensitive to image shifting and rotation. One way to reduce the confusion would be to preprocess the images to ensure that they are well centered and not too rotated. 
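The "weighted sum of pixel intensities per class" can be sketched with NumPy. The four-pixel image, the weights, and the biases below are hypothetical values chosen for illustration, not weights learned by the SGDClassifier.

```python
import numpy as np

# Hypothetical tiny "image" with 4 pixels and a model with 3 classes.
pixels = np.array([0.0, 1.0, 1.0, 0.0])
weights = np.array([
    [1.0, -1.0, -1.0, 1.0],    # class 0
    [-1.0, 1.0, 1.0, -1.0],    # class 1
    [0.5, 0.5, -0.5, -0.5],    # class 2
])
biases = np.zeros(3)

scores = weights @ pixels + biases        # one weighted sum per class
predicted_class = int(np.argmax(scores))  # class with the highest score
print(scores, predicted_class)
```

Two classes whose images differ in only a few pixels get similar weighted sums, so a small shift or rotation can be enough to flip the argmax.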

## Multilabel Classification

Until now, each instance has been assigned to just one class. In some cases, you want your classifier to output multiple classes for each instance. Consider a face-recognition classifier trained to recognize 3 faces: Alice, Bob, and Charlie. Giving it a picture with Alice and Charlie should yield an output of [1, 0, 1] (Alice yes, Bob no, Charlie yes). A classification system that outputs multiple binary tags is called a _multilabel classification_ system.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

Creates a y_multilabel array containing 2 target labels for each digit: first indicates whether it is large (7, 8, 9), second indicates whether it is odd. Then we train a KNeighborsClassifier using the multiple target array. Now we make a prediction:

In [None]:
>>> knn_clf.predict([some_digit])

Many ways to evaluate multilabel classifier, and selecting the right metric depends on the project. One approach is to measure F1 score for each individual label (or any other binary classifier metric) then compute the average score:

In [None]:
>>> y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
>>> f1_score(y_multilabel, y_train_knn_pred, average="macro")

This assumes all labels are equally important, which may not be the case. If there are more pictures of Alice than of Bob or Charlie, you may want to give more weight to the classifier's score on pictures of Alice. One option is to give each label a weight equal to its _support_ (the number of instances with that target label). To do this, set average="weighted" in the preceding code.
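The difference between the two averaging modes can be shown by hand. The per-label F1 scores and supports below are hypothetical, not results from the KNN classifier:

```python
# Hypothetical per-label F1 scores and supports (number of positive instances).
f1_per_label = [0.9, 0.5]
support = [300, 100]

# Macro average: every label counts the same.
macro = sum(f1_per_label) / len(f1_per_label)

# Weighted average: each label counts in proportion to its support.
weighted = sum(f * s for f, s in zip(f1_per_label, support)) / sum(support)

print(macro, weighted)
```

The weighted average (about 0.8) is pulled toward the score of the better-represented label, while the macro average (about 0.7) treats both labels equally.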

## Multioutput Classification

A generalization of multilabel classification where each label can be multiclass (it can have more than two possible values).

Build a system to remove noise from images. The input is a noisy digit image, and the output is a clean digit image, represented as an array of pixel intensities. The classifier's output is multilabel (one label per pixel) and each label can have multiple values (pixel intensity ranges from 0 to 255).

Create training and test set and add noise to the pixel intensities with NumPy's randint()

In [None]:
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test

Train the classifier and make it clean the image

In [None]:
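knn_clf.fit(X_train_mod, y_train_mod)
some_index = 0  # assumed: position of the test image to clean (any valid position works)
clean_digit = knn_clf.predict(X_test_mod.iloc[[some_index]])  # select the row by position
plot_digit(clean_digit)  # plot_digit() is a helper assumed to be defined elsewhere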

# Exercises

1)

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': [3,4,5], 'weights': ['uniform', 'distance']}
grid_search = GridSearchCV(knn_clf, param_grid)
grid_search.fit(X_train, y_train)

In [None]:
>>> grid_search.best_params_

In [None]:
>>> grid_search.best_score_

In [None]:
from sklearn.metrics import accuracy_score

y_pred = grid_search.predict(X_test)
accuracy_score(y_test, y_pred)