# Chapter 3: Classification

In [1]:
# Pydata stack
import numpy as np
import pandas as pd

## MNIST <a name="mnist"></a> 

The MNIST dataset is a classic set of 70,000 images of handwritten digits, often used as a benchmark for classification algorithms.

## Training a Binary Classifier <a name="binaryclassifier"></a>

**Binary Classifier:** A classifier that distinguishes between 2 classes.

**Stochastic Gradient Descent:** An efficient approach to fitting linear classifiers and regressors such as linear SVMs and logistic regression.
- Works well for large and sparse datasets (efficient)
- Technically an optimisation technique rather than a model - the loss function being minimised (the `loss` parameter in Scikit-Learn) determines which model is fitted
- Sensitive to feature scaling and the order of training data (so shuffle it!)
- `SGDClassifier` and `SGDRegressor` from `sklearn.linear_model`
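
The points above can be sketched as follows. This is only an illustration under assumptions: the data is synthetic (from `make_classification`, not MNIST), the names `X_demo`, `y_demo` and `sgd_clf` are made up for the example, and a `StandardScaler` is placed in front of the classifier because SGD is sensitive to feature scaling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (an assumption for illustration, not MNIST)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Scale the features first; loss="hinge" makes the fitted model a linear SVM.
# shuffle=True (the default) reshuffles the training data after each epoch.
sgd_clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", shuffle=True, random_state=42),
)
sgd_clf.fit(X_demo, y_demo)

train_accuracy = sgd_clf.score(X_demo, y_demo)
print("Training accuracy: {:.2f}".format(train_accuracy))
```

Swapping the `loss` argument changes the model being fitted while the optimiser stays the same.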

## Performance Measures <a name="performancemeasures"></a>

**Accuracy:** The number of correctly classified samples as a proportion of the total.
- Can be unhelpful as a metric - particularly if the dataset is skewed
- A dataset is *skewed* if some classes are much more frequent than others.
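
A small sketch of why accuracy misleads on skewed data, assuming a made-up dataset with 90 negatives and 10 positives and a "classifier" that always predicts the majority class:

```python
import numpy as np

# Hypothetical skewed labels: 90 negatives and 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# A "classifier" that always predicts the majority (negative) class
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
print("Accuracy: {:.2f}".format(accuracy))  # Accuracy: 0.90
```

The accuracy looks respectable even though not a single positive sample was found.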

**Confusion Matrix:** A matrix whose $(A, B)$ entry is the number of times a sample of class $A$ was classified as class $B$
- In a binary classification problem, row 0 gives how negative samples were classified, row 1 gives how positive samples were classified
    - (0, 0): True negatives
    - (0, 1): False positives
    - (1, 0): False negatives
    - (1, 1): True positives
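
The layout above can be reproduced by hand with NumPy. The labels and predictions here are made up for illustration; `sklearn.metrics.confusion_matrix` uses the same convention (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Hypothetical true labels and predictions for a binary problem
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 0])

# Entry (a, b) counts samples of true class a that were predicted as class b
conf_mx = np.zeros((2, 2), dtype=int)
np.add.at(conf_mx, (y_true, y_pred), 1)

tn, fp = conf_mx[0]  # row 0: how the negative samples were classified
fn, tp = conf_mx[1]  # row 1: how the positive samples were classified
print(conf_mx)
# [[4 1]
#  [2 3]]
```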

**Precision:** True positives as a proportion of things you classified as positive
- A high value means that if you say something's positive it probably is - few false positives
- Measures quality: a classifier can achieve high precision by calling very few things positive
- Ex: In information retrieval, precision is the fraction of retrieved documents that are relevant to the query
- You can see this as $P(Y=1\mid \hat{Y}=1)$

**Recall:** True positives as a proportion of things that are *actually* positive
- A high value means that you rarely call something negative when it is actually positive - few false negatives
- Also called *sensitivity*
- Measures quantity: a classifier can achieve perfect recall by calling everything positive
- Ex: In information retrieval, recall is the fraction of the relevant documents that are successfully retrieved
- You can see this as $P(\hat{Y}=1\mid Y=1)$
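
As a worked example of the two definitions, assuming hypothetical counts of 3 true positives, 1 false positive and 2 false negatives:

```python
# Hypothetical counts from a binary confusion matrix
tp, fp, fn = 3, 1, 2

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found

print("Precision: {:.2f}, Recall: {:.2f}".format(precision, recall))
# Precision: 0.75, Recall: 0.60

# Calling everything positive (5 positives, 5 negatives): perfect recall, poor precision
tp_all, fp_all, fn_all = 5, 5, 0
precision_all = tp_all / (tp_all + fp_all)
recall_all = tp_all / (tp_all + fn_all)
print("Precision: {:.2f}, Recall: {:.2f}".format(precision_all, recall_all))
# Precision: 0.50, Recall: 1.00
```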

**F1-Score:** The harmonic mean of precision and recall:

\begin{equation}
    F_1 
        = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} 
        = 2 \cdot \frac{\text{precision}\cdot\text{recall}}{\text{precision} + \text{recall}} 
        = \frac{\text{TP}}{\text{TP}+\frac{\text{FN}+\text{FP}}{2}}
\end{equation}
- This favours classifiers that have similar precision and recall and penalises those where one is much higher
- More generally, $F_{\beta}$-score ($\beta > 0$) is used if recall is considered $\beta$ times more important than precision:

\begin{equation}
    F_{\beta}
        = (1+\beta^2) \cdot \frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision} + \text{recall}} 
        = \frac{\text{TP}}{\text{TP}+\frac{\beta^2\text{FN}+\text{FP}}{1+\beta^2}}
\end{equation}
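
The two forms of the formula can be checked against each other numerically. This is a minimal sketch with hypothetical counts; the helper names `f_beta` and `f_beta_from_pr` are made up for the example.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from raw counts (the count-based form of the formula)."""
    b2 = beta ** 2
    return tp / (tp + (b2 * fn + fp) / (1 + b2))


def f_beta_from_pr(precision, recall, beta=1.0):
    """F-beta score from precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision = 0.75, recall = 0.6
tp, fp, fn = 3, 1, 2
precision, recall = tp / (tp + fp), tp / (tp + fn)

f1 = f_beta(tp, fp, fn)
f2 = f_beta(tp, fp, fn, beta=2.0)
print("F1: {:.3f}, F2: {:.3f}".format(f1, f2))  # F1: 0.667, F2: 0.625
```

Here $F_2 < F_1$ because recall is the lower of the two values and $F_2$ weights recall more heavily.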

**Decision Function:** An SGD classifier computes a score using a *decision function* and assigns the positive class if the score is greater than a *decision threshold*
- Can be valuable to plot precision against recall for different threshold values using `precision_recall_curve`
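
To illustrate the effect of the threshold, here is a sketch with made-up decision scores. The helper `precision_recall_at` is hypothetical; `precision_recall_curve` computes such pairs for every candidate threshold.

```python
import numpy as np

# Hypothetical decision scores and true labels
scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0])
y_true = np.array([0, 0, 1, 0, 1, 1, 1])


def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score > threshold."""
    y_pred = scores > threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)


for threshold in (-1.5, -0.75, 0.75):
    p, r = precision_recall_at(threshold)
    print("threshold={:+.2f}: precision={:.2f}, recall={:.2f}".format(threshold, p, r))
```

In this example, raising the threshold increases precision and eventually lowers recall.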

**Receiver Operating Characteristic (ROC) Curve:** Plots true positive rate against false positive rate for different threshold values
- Use `roc_curve` to plot
- False positive rate is 1 minus true negative rate (called *specificity*: $P(\hat{Y}=0\mid Y=0)$)
- Want high true positive rate and low false positive rate
- A random classifier gives the diagonal from (0, 0) to (1, 1); we want the curve to pass as close to the top-left corner as possible
- The *area under the curve* (AUC) (using `roc_auc_score`) is a single value used to compare classifiers (greater is better), but it doesn't capture the full trade-off between true positives and false positives
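
One way to build intuition for the AUC: it equals the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one (ties counting half). The sketch below computes it that way on four made-up samples, along with one point on the ROC curve. It does not call `roc_auc_score`, which should give the same value on this data.

```python
import numpy as np

# Hypothetical labels and scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
print("AUC: {:.2f}".format(auc))  # AUC: 0.75

# One point on the ROC curve, at threshold 0.375
y_pred = scores > 0.375
tpr = np.sum(y_pred & (y_true == 1)) / np.sum(y_true == 1)
fpr = np.sum(y_pred & (y_true == 0)) / np.sum(y_true == 0)
print("TPR: {:.2f}, FPR: {:.2f}".format(tpr, fpr))  # TPR: 0.50, FPR: 0.50
```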

**ROC vs Precision/Recall Curve:**
- The PR Curve is generally preferred when the positive class is rare or when the positive class is more interesting than the negative class
- [Here](https://stats.stackexchange.com/questions/7207/roc-vs-precision-and-recall-curves) is a rough explanation of why

## Multiclass Classification

**Multiclass Classifiers:** (or Multinomial Classifiers) can distinguish between more than 2 classes
- Some classifiers can handle multiclass natively, for others you need to build multiclass out of multiple binary classifiers
- *One-Versus-The-Rest (OvR):* (or one-versus-all) for each class you train a binary classifier on 'is the sample of this class'
- *One-Versus-One (OvO):* train binary classifiers for each pair of classes (e.g. is this digit a 4 or a 5) and then choose the class with the most pairwise 'wins'
- Draws in OvO are broken based on the total classification confidence among all pairs
- For OvO you have to train many more classifiers, but each on a much smaller training set since you only include samples for the two classes being compared
- Some algorithms (e.g. SVCs) scale badly with the size of the training set, so OvO is preferred for them; for most algorithms OvR is preferred (Scikit-Learn automatically chooses the most appropriate, but you can override it).
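
The difference in the number of classifiers can be seen by forcing each strategy explicitly. This sketch assumes the small 8x8 digits dataset bundled with Scikit-Learn (`load_digits`) as a stand-in for MNIST; with 10 classes, OvR trains 10 classifiers and OvO trains $10 \cdot 9 / 2 = 45$.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Small digits dataset bundled with Scikit-Learn (10 classes), standing in for MNIST
X_digits, y_digits = load_digits(return_X_y=True)

# Wrap the same binary classifier in each strategy
ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42)).fit(X_digits, y_digits)
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(X_digits, y_digits)

print(len(ovr_clf.estimators_), len(ovo_clf.estimators_))  # 10 45
```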

## Error Analysis

**Confusion Matrix**
- Use confusion matrix to investigate numbers of errors by class and which classes are being misclassified as what
- Can plot using `plt.matshow`
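- Valuable to normalise each row by the number of samples in that class so you're looking at error rates, not absolute numbers, and to zero out the diagonal, since the rates of correct classification there would otherwise dominate the plot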
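
A sketch of the normalisation step, assuming a made-up 3-class confusion matrix:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class)
conf_mx = np.array([
    [50, 2, 3],
    [4, 40, 6],
    [1, 9, 20],
])

# Divide each row by the number of samples in that true class to get rates
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

# Zero the diagonal so that only the errors remain
np.fill_diagonal(norm_conf_mx, 0)

worst = np.unravel_index(np.argmax(norm_conf_mx), norm_conf_mx.shape)
print("Most common error (true class, predicted class):", worst)
```

Here the largest error rate is class 2 being misclassified as class 1 (9 of 30 samples), which absolute counts alone would not make obvious. The resulting `norm_conf_mx` can be passed to `plt.matshow`.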

**Individual Errors**
- Good (but time-consuming) to look at individual errors
- E.g. for two classes A and B, look at 10 examples each of: A correctly classified, A misclassified as B, B correctly classified, and B misclassified as A

## Multilabel Classification

**Multilabel Classification:** Where a classifier outputs multiple binary labels for each input.
- E.g. Identifying faces in a photo
- This can be transformed into multiple binary classification problems, e.g. train a binary classifier for each potential label and output all those labels with a positive result
- Some models support multilabel natively, e.g. K-Nearest Neighbours classification
- To evaluate, you can calculate your favourite metric (e.g. F1-Score) for each label and then average. You may wish to do a weighted average according to how many samples have a particular label (the *support* of the label)
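
The averaging described above can be sketched by hand. The targets and predictions are made up, and the helper `f1_per_label` is hypothetical; it assumes every label has at least one true or predicted positive, otherwise the division is undefined. In Scikit-Learn, `f1_score` with `average="macro"` or `average="weighted"` plays the same role.

```python
import numpy as np

# Hypothetical multilabel data: one row per sample, one column per label
Y_true = np.array([[1, 0], [1, 1], [1, 0], [1, 0], [0, 1], [1, 0]])
Y_pred = np.array([[1, 0], [1, 1], [1, 0], [1, 1], [0, 0], [1, 0]])


def f1_per_label(y_true, y_pred):
    """F1 score computed separately for each label (column)."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    return tp / (tp + (fn + fp) / 2)


f1_scores = f1_per_label(Y_true, Y_pred)
support = Y_true.sum(axis=0)  # number of samples carrying each label

macro_f1 = f1_scores.mean()
weighted_f1 = np.average(f1_scores, weights=support)
print("Per label:", f1_scores, "Macro: {:.3f}, Weighted: {:.3f}".format(macro_f1, weighted_f1))
```

The weighted average is higher here because the well-predicted label has the larger support.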

## Multioutput Classification

**Multioutput classification:** A generalisation of multilabel classification where the (multiple) labels can take more than 2 possible values.
- Ex: Removing noise from images - the output is many pixels, each of which can take many values

## Code Samples <a name="codesamples"></a>

### Manual Cross-Validation

The following code allows you to do cross-validation from scratch. We use the example of logistic regression.

In [2]:
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Sample data
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

# Initialise classifier
clf = LogisticRegression()

# Each split gives train and test indices for that fold, preserving the proportion of 0s and 1s in y
skfolds = StratifiedKFold(n_splits=3)  # n_splits is the number of folds for cross-validation

for train_ix, test_ix in skfolds.split(X, y):  # Cycles through folds
    clone_clf = clone(clf)  # clone to avoid changing the actual classifier

    print("Training indices: ", train_ix, "Test indices: ", test_ix)
    X_train_fold, y_train_fold = X[train_ix], y[train_ix]
    X_test_fold, y_test_fold = X[test_ix], y[test_ix]

    # Fit the clone and print proportion correct
    clone_clf.fit(X_train_fold, y_train_fold)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print("Proportion correct: {:.1f}".format(n_correct / len(y_test_fold)))

Training indices:  [1 2 4 5] Test indices:  [0 3]
Proportion correct: 0.5
Training indices:  [0 2 3 5] Test indices:  [1 4]
Proportion correct: 0.0
Training indices:  [0 1 3 4] Test indices:  [2 5]
Proportion correct: 0.5


### Prediction Through Cross-Validation

**`cross_val_predict`:** Uses cross-validation to generate predictions
- E.g. If using K-Fold it will split the data into K disjoint folds (so each observation lies in precisely one) and for each fold, train the model on the data without that fold and make predictions for that fold
- You get predictions for the whole training set, but none of the observations are in their own training set
- Of course this is slow because you have to train many times
- You can also specify `method` to, e.g., use the decision function or get prediction probabilities, instead of just generating predictions

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Sample data
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

# Initialise classifier
clf = LogisticRegression()

y_train_pred = cross_val_predict(clf, X, y, cv=3)
y_train_pred

array([0, 1, 0, 0, 0, 0])
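
The `method` argument mentioned in the notes above can be sketched on the same toy data. The variable names `y_proba` and `y_scores` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Same sample data as above
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()

# One row per sample, one column per class; each row is an out-of-fold probability estimate
y_proba = cross_val_predict(clf, X, y, cv=3, method="predict_proba")

# For a binary problem, decision_function gives one score per sample
y_scores = cross_val_predict(clf, X, y, cv=3, method="decision_function")

print(y_proba.shape, y_scores.shape)  # (6, 2) (6,)
```

Scores obtained this way are what functions such as `precision_recall_curve` and `roc_curve` expect as input.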

### Execution Time Decorator

Decorators are functions that add additional functionality to functions.

In [12]:
import time

start_time = time.perf_counter()
time.sleep(1)
end_time = time.perf_counter()
print("Run time: {:0.1f} seconds".format(end_time - start_time))

Run time: 1.0 seconds


Here we are adding additional functionality to the `sleep` function. We can do this using a decorator.

In [13]:
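def execution_time(func):
    """Decorator that prints how long each call to func takes."""

    def inner(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()

        print("Run time: {:0.1f} seconds".format(end - start))
        return result  # pass through the wrapped function's return value

    return inner


def sleep_copy(seconds):
    return time.sleep(seconds)


# Decorating by hand: wrap sleep_copy in the timing logic
sleep_decorated = execution_time(sleep_copy)
sleep_decorated(1)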

Run time: 1.0 seconds


There is a shortcut for defining and decorating a function all in one step.

In [14]:
@execution_time
def sleep_decorated(seconds):
    return time.sleep(seconds)


sleep_decorated(1)

Run time: 1.0 seconds
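
An optional refinement, sketched here as a variation rather than part of the original notes: `functools.wraps` copies the decorated function's name and docstring onto the wrapper, and returning the result from `inner` lets the decorator wrap functions whose return value matters. The function `add` is a made-up example.

```python
import functools
import time


def execution_time(func):
    @functools.wraps(func)  # copy func's name and docstring onto the wrapper
    def inner(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print("Run time: {:0.1f} seconds".format(end - start))
        return result  # pass the wrapped function's return value through

    return inner


@execution_time
def add(a, b):
    """Add two numbers."""
    return a + b


total = add(1, 2)
print(total, add.__name__)
```

Without `functools.wraps`, `add.__name__` would report `inner`, which makes debugging and documentation harder.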
