In [1]:
import numpy
import matplotlib.pyplot as plt 
from matplotlib import cm
import pandas
import mglearn
import os
import scipy

import sklearn
# Import submodules explicitly; "import sklearn" alone does not load them
import sklearn.datasets
import sklearn.dummy
import sklearn.ensemble
import sklearn.feature_selection
import sklearn.linear_model
import sklearn.model_selection
import sklearn.neural_network
import sklearn.tree
from sklearn.cluster import KMeans

import graphviz
import mpl_toolkits.mplot3d as plt3dd

import time

When selecting a metric, you should always have the end goal of the machine learning application in mind. In practice, we are usually interested not just in making accurate predictions, but in using these predictions as part of a larger decision-making process. Before picking a machine learning metric, you should think about the high-level goal of the application, often called the *business metric*.

When choosing a model or adjusting parameters, you should pick the model or parameter values that have the most positive influence on the business metric. Often this is hard, as assessing the business impact of a particular model might require putting it in production in a real-life system.


In the early stages of development, and for adjusting parameters, it is often infeasible to put models into production just for testing purposes, because of the high business or personal risks that can be involved.

# Metrics for binary classification

Binary classification is arguably the most common and conceptually simple application of machine learning in practice. However, there are still a number of caveats in evaluating even this simple task.

Remember that for binary classification, we often speak of a positive class and a negative class, with the understanding that the positive class is the one we are looking for.

Often, accuracy is not a good measure of predictive performance, as the number of mistakes we make does not contain all the information we are interested in.
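
As a small illustration of why a single error count can hide important differences, the sketch below uses made-up labels (the arrays `y_true`, `pred_a`, and `pred_b` are hypothetical and not taken from any dataset used here). Both prediction vectors reach the same accuracy, but one misses every positive sample while the other raises false alarms.

```python
import numpy

# Hypothetical ground truth: 2 positives (True) among 10 samples
y_true = numpy.array([True, True, False, False, False,
                      False, False, False, False, False])

# Classifier A never predicts the positive class (two false negatives)
pred_a = numpy.zeros(10, dtype=bool)

# Classifier B finds both positives but raises two false alarms (two false positives)
pred_b = numpy.array([True, True, True, True, False,
                      False, False, False, False, False])

for name, pred in [("A", pred_a), ("B", pred_b)]:
    accuracy = numpy.mean(pred == y_true)
    false_negatives = int(numpy.sum(y_true & ~pred))
    false_positives = int(numpy.sum(~y_true & pred))
    print("{}: accuracy={:.2f}, false negatives={}, false positives={}".format(
        name, accuracy, false_negatives, false_positives))
```

Which of the two error types matters more depends on the application, and accuracy alone cannot tell them apart.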

## Kinds of errors: Imbalanced datasets

In [10]:
digits = sklearn.datasets.load_digits()

# Turn the 10-class problem into an imbalanced binary one: "nine" vs. "not nine"
y = digits.target == 9
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(digits.data, y, random_state=0)

In [12]:
# Baseline that always predicts the most frequent class ("not nine")
dummy_majority = sklearn.dummy.DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

In [18]:
pred_most_frequent = dummy_majority.predict(X_test)

print("Unique predicted labels: {}".format(numpy.unique(pred_most_frequent)))
print("Test score: {:4.2f} %".format(100*dummy_majority.score(X_test, y_test)))

Unique predicted labels: [False]
Test score: 89.56 %


We obtained close to 90% accuracy without learning anything. This might seem striking, but think about it for a minute. Imagine someone telling you their model is 90% accurate. You might think they did a very good job. But depending on the problem, that might be possible by just predicting one class! Let’s compare this against using an actual classifier.
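
One way to see where the roughly 90% comes from: a classifier that always predicts "not nine" is right exactly as often as "not nine" occurs in the test set. The sketch below recomputes the same split so that it runs on its own (the names `majority_fraction`, `baseline`, and `baseline_score` are introduced here for illustration); the two printed values should agree.

```python
import numpy
import sklearn.datasets
import sklearn.dummy
import sklearn.model_selection

digits = sklearn.datasets.load_digits()
y = digits.target == 9
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    digits.data, y, random_state=0)

# Fraction of test samples that belong to the majority ("not nine") class
majority_fraction = numpy.mean(~y_test)

# Constant predictor that always outputs the most frequent training label
baseline = sklearn.dummy.DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)

print("Fraction of 'not nine' in test set: {:4.2f} %".format(100 * majority_fraction))
print("Most-frequent baseline accuracy:    {:4.2f} %".format(100 * baseline_score))
```

Checking the class balance like this before interpreting an accuracy figure gives a quick reference point for what "no learning at all" would score.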

In [20]:
tree = sklearn.tree.DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
pred_tree = tree.predict(X_test)
print("Test score: {:4.2f} %".format(100*tree.score(X_test, y_test)))

Test score: 91.78 %


According to accuracy, the DecisionTreeClassifier is only slightly better than the constant predictor. This could indicate either that something is wrong with how we used DecisionTreeClassifier, or that accuracy is in fact not a good measure here.

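For comparison purposes, let’s evaluate two more classifiers, LogisticRegression and the default DummyClassifier. In older scikit-learn releases the default strategy was "stratified", which makes random predictions but produces classes with the same proportions as in the training set. Since version 0.24 the default is "prior", which always predicts the most frequent class, so the dummy score below matches the majority-class baseline: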

In [26]:
# Default strategy ("prior" in recent scikit-learn releases)
dummy = sklearn.dummy.DummyClassifier().fit(X_train, y_train)
pred_dummy = dummy.predict(X_test)
print("dummy score: {:4.2f} %".format(100*dummy.score(X_test, y_test)))

logreg = sklearn.linear_model.LogisticRegression(C=0.1, max_iter=1000).fit(X_train, y_train)
pred_logreg = logreg.predict(X_test)
print("logreg score: {:4.2f} %".format(100*logreg.score(X_test, y_test)))

dummy score: 89.56 %
logreg score: 98.44 %
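
Because the default strategy of DummyClassifier has changed between scikit-learn releases, it can be safer to state the strategy explicitly. The following sketch assumes a scikit-learn version that supports the "prior" strategy and uses an arbitrary `random_state=0` for reproducibility; the `results` dictionary is a name introduced here. No scores are quoted, since the "stratified" result depends on the seed.

```python
import numpy
import sklearn.datasets
import sklearn.dummy
import sklearn.model_selection

digits = sklearn.datasets.load_digits()
y = digits.target == 9
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    digits.data, y, random_state=0)

results = {}
for strategy in ["most_frequent", "prior", "stratified"]:
    # random_state matters only for the randomized "stratified" strategy here
    clf = sklearn.dummy.DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    results[strategy] = pred
    print("{:>13}: predicted labels {}, accuracy {:4.2f} %".format(
        strategy, numpy.unique(pred), 100 * clf.score(X_test, y_test)))
```

On imbalanced data such as this, the "stratified" baseline would typically be expected to score lower on accuracy than the constant ones, since its randomly placed positive predictions rarely coincide with the true positives.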


## Confusion matrices