## Evaluating Classifiers

A key task when applying a classifier is to estimate how well it will make predictions on data it has not seen before. One way to do this is to divide the full dataset into two sets using a "hold-out strategy":
1. *Training set*: A set of examples used to build the classification model.
2. *Test set*: A separate set of examples that is withheld from the classifier during training, and is used afterwards to evaluate the model.
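
As a rough illustration of the hold-out idea, the sketch below splits a small made-up dataset by shuffling example indices with NumPy. The arrays `X` and `y`, the seed, and the 20% test fraction are assumptions chosen for the example only.

```python
import numpy as np

# Hypothetical toy dataset: 10 examples with 2 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = np.array([1, -1, 1, 1, -1, -1, 1, -1, 1, -1])

# Shuffle the example indices, then withhold the first 20% as the test set
indices = rng.permutation(len(X))
n_test = int(0.2 * len(X))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(X_train.shape, X_test.shape)
```

No example appears in both sets, so the test examples play no part in building the model.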

### Evaluating Simple Train/Test Splits

To demonstrate how to evaluate classifiers in scikit-learn, we will randomly generate an artificial dataset with 400 examples described by 10 features, annotated with 2 classes: Positive (1) and Negative (-1).

In [None]:
from sklearn.datasets import make_hastie_10_2
data, target = make_hastie_10_2(400)

In [None]:
print(data.shape)
data[0,:]

Each item in this artificial dataset has a label:

In [None]:
print(target)

We can easily split the complete dataset at random into a training set and a test set. We will specify that 20% (0.2) of the data will be used for the test set.

In [None]:
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.2)

In [None]:
print("Training set has %d examples" % data_train.shape[0] )
print("Test set has %d examples" % data_test.shape[0] )

Now we will build a KNN classifier ($k=3$) as we have seen previously. Note that we only use the training data to build the model:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(data_train, target_train)
print(model)

In [None]:
predicted = model.predict(data_test)
print("Target",target_test)
print("Predictions",predicted)

Manually comparing the target labels for the test data with our predictions can be misleading. Instead, we want to determine the extent to which the classifier made the following correct/incorrect predictions:
- *True Positives* (TP) are examples predicted as ``1`` which are actually ``1``
- *False Positives* (FP) are examples predicted as ``1`` which are actually ``-1``
- *True Negatives* (TN) are examples predicted as ``-1`` which are actually ``-1``
- *False Negatives* (FN) are examples predicted as ``-1`` which are actually ``1``
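
To make the four outcomes concrete, here is a minimal sketch that counts them directly with NumPy. The arrays `toy_actual` and `toy_predicted` are hypothetical values invented for this illustration, not output from the model above.

```python
import numpy as np

# Hypothetical actual labels and predictions for eight test examples
toy_actual = np.array([1, 1, 1, -1, -1, -1, 1, -1])
toy_predicted = np.array([1, -1, 1, -1, 1, -1, 1, -1])

# Count each outcome, treating 1 as the positive class
tp = int(np.sum((toy_predicted == 1) & (toy_actual == 1)))
fp = int(np.sum((toy_predicted == 1) & (toy_actual == -1)))
tn = int(np.sum((toy_predicted == -1) & (toy_actual == -1)))
fn = int(np.sum((toy_predicted == -1) & (toy_actual == 1)))
print("TP=%d FP=%d TN=%d FN=%d" % (tp, fp, tn, fn))
```

The four counts always sum to the number of test examples.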

We can do this by creating a confusion matrix for the results. The result is a NumPy array, with predictions on the columns and actual labels on the rows. When the label order is given as `[1,-1]`, the values correspond to:

    [ [TP FN]
      [FP TN] ]

A perfect classifier with 100% accuracy would produce a pure diagonal matrix, with all the test examples predicted in their correct class. In our case, we see that we have many false negatives (i.e. examples predicted as -1 which are actually 1).

In [None]:
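# import the scikit-learn evaluation functions used in this section
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, classification_report)
# build the confusion matrix, listing the positive class (1) first
cm = confusion_matrix(target_test, predicted, labels=[1, -1])
print(cm)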

An overall *accuracy* score for the predictions, defined as the fraction of correct predictions, can be calculated as shown below. This will return a value between 0 (completely wrong) and 1 (predictions are 100% accurate):

In [None]:
print("Accuracy = %.2f" % accuracy_score(target_test, predicted) )

Measures from information retrieval (search engines) can be used in ML evaluation. Note that these are calculated with respect to a particular class (e.g. the positive class labelled as "1").
- *Precision*: proportion of retrieved results that are relevant = TP/(TP+FP)
- *Recall*: proportion of relevant results that are retrieved = TP/(TP+FN)
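
As a sanity check on these two definitions, the sketch below computes precision and recall by hand for a small set of made-up labels and compares them with scikit-learn's functions. The lists `toy_actual` and `toy_predicted` are hypothetical and exist only for this example.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical actual labels and predictions for eight test examples
toy_actual = [1, 1, 1, 1, -1, -1, -1, -1]
toy_predicted = [1, 1, -1, -1, 1, -1, -1, -1]

# Count the outcomes needed for the positive class (1)
pairs = list(zip(toy_actual, toy_predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
fp = sum(1 for a, p in pairs if a == -1 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == -1)

manual_precision = tp / (tp + fp)   # TP/(TP+FP)
manual_recall = tp / (tp + fn)      # TP/(TP+FN)

print("Manual:       precision=%.2f recall=%.2f" % (manual_precision, manual_recall))
print("scikit-learn: precision=%.2f recall=%.2f" % (
    precision_score(toy_actual, toy_predicted, pos_label=1),
    recall_score(toy_actual, toy_predicted, pos_label=1)))
```

Here the classifier predicts positive three times and is right twice (precision 2/3), but it finds only two of the four actual positives (recall 1/2).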

In [None]:
# Note that we indicate that we are interested in the Positive class here, which is labelled as "1"
print("Precision (Positive) = %.2f" % precision_score(target_test, predicted, pos_label=1) )
print("Recall (Positive) = %.2f" % recall_score(target_test, predicted, pos_label=1) )

Note that there is often a trade-off between precision and recall. We can combine precision and recall into a single score using the *F1 Measure*, which is the harmonic mean of the precision and recall. The F1 Measure reaches its best value at 1 and worst at 0.

    F1 = 2 * (precision * recall) / (precision + recall)
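
The formula above is a harmonic mean, so the result is pulled toward the smaller of the two values. The sketch below contrasts it with a plain arithmetic mean; the helper `f1_from` and the example precision/recall pairs are hypothetical, introduced only for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# Compare the arithmetic mean with F1 for a balanced and an unbalanced case
for p, r in [(0.8, 0.8), (1.0, 0.1)]:
    print("precision=%.2f recall=%.2f mean=%.2f F1=%.2f"
          % (p, r, (p + r) / 2, f1_from(p, r)))
```

In the second case the arithmetic mean is 0.55, while F1 is roughly 0.18, which reflects the poor recall far more strongly.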

In [None]:
print("F1 (Positive) = %.2f" % f1_score(target_test, predicted, pos_label=1) )

We can quickly compute a summary of these statistics using scikit-learn's provided convenience function:

In [None]:
print(classification_report(target_test, predicted, target_names=["negative","positive"]))

### Cross Validation

A problem with simply randomly splitting a dataset into two sets is that each random split might give different results. We are also never training on the portion of the dataset that is held out for testing. One way to address this is to use *k-fold cross-validation* to evaluate a classifier:
1. Divide the data into k disjoint subsets - “folds” (e.g. k=5).
2. For each of k experiments, use k-1 folds for training and the one remaining fold for testing.
3. Repeat so that each of the k folds is used for testing exactly once, then average the accuracy/error rates.
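
The steps above can be sketched as an explicit loop. This is a minimal illustration using scikit-learn's `KFold` with k=5 on a freshly generated dataset; the variable names, the seeds, and the choice of plain (unstratified) shuffled folds are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Generate a dataset like the one used earlier (seed fixed for repeatability)
X, y = make_hastie_10_2(n_samples=400, random_state=0)

# Step 1: divide the data into k disjoint folds
k = 5
folds = KFold(n_splits=k, shuffle=True, random_state=0)

# Steps 2 and 3: train on k-1 folds, test on the remaining fold, repeat
fold_scores = []
for train_idx, test_idx in folds.split(X):
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(fold_scores)
print("Mean accuracy = %.2f" % np.mean(fold_scores))
```

Every example is used for testing exactly once and for training k-1 times, so no part of the dataset is ignored.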

While this is a relatively complex process, scikit-learn allows us to achieve it using a single command. Let's do a 2-fold cross-validation of the KNN classifier:

In [None]:
# create a single classifier
model = KNeighborsClassifier(n_neighbors=3)
# apply 2-fold cross-validation, measuring accuracy each time
from sklearn.model_selection import cross_val_score
acc_scores = cross_val_score(model, data, target, cv=2, scoring="accuracy")
print(acc_scores)

Similarly, for 10-fold cross validation we get an array with 10 accuracy scores, one for each fold:

In [None]:
acc_scores =  cross_val_score(model, data, target, cv=10, scoring="accuracy")
print(acc_scores)

Calculate the average accuracy across all folds:

In [None]:
print("KNN: Mean cross-validation accuracy = %.2f" % acc_scores.mean() )

We can use this approach to compare different classifiers on the same data, such as a logistic regression classifier or a Support Vector Machine (SVM) classifier.

In [None]:
from sklearn import linear_model
model = linear_model.LogisticRegression(solver='liblinear')
acc_scores =  cross_val_score(model, data, target, cv=10, scoring="accuracy")
print("Logistic Regression: Mean cross-validation accuracy = %.2f" % acc_scores.mean() )

In [None]:
from sklearn.svm import SVC
model = SVC(gamma='auto')
acc_scores = cross_val_score(model, data, target, cv=10, scoring="accuracy")
print("SVM: Mean cross-validation accuracy = %.2f" % acc_scores.mean() )