Threshold Selection for Classifiers
Nina Zumel
WVPlots has a variety of visualizations that help modelers design
classifiers best suited to their goals. In particular, ThresholdPlot
is a tool for selecting classifier thresholds that give the best tradeoffs
among relevant performance metrics. This note demonstrates how to use
WVPlots to evaluate and design classifiers.
Example Data
Here we create an example synthetic data set, where score is the score
produced by a trained model, and y is the actual outcome in the
evaluation data set (TRUE or FALSE).
library(WVPlots)

## Loading required package: wrapr
set.seed(1452225)
# data with two different regimes of behavior
d <- rbind(
  data.frame(
    score = rnorm(1000),
    y = sample(c(TRUE, FALSE), prob = c(0.02, 0.98), size = 1000, replace = TRUE)),
  data.frame(
    score = rnorm(200) + 5,
    y = sample(c(TRUE, FALSE), size = 200, replace = TRUE))
)

One can create different classifiers from this single model, with
different performance metrics, by varying the decision threshold. Datums
that score higher than the threshold are predicted to be in the class of
interest (TRUE).
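For example, a candidate threshold can be applied by hand and the resulting predictions tabulated against the outcome. This is just a sketch in base R; the threshold value 2.5 is examined later in this note.

# Sketch: turn the score into a classification rule at one candidate threshold
# and cross-tabulate the predictions against the actual outcomes.
threshold <- 2.5
predicted <- d$score > threshold
table(predicted = predicted, actual = d$y)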
The ROC plot gives this model an AUC of 0.91, which seems pretty high.
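As a rough cross-check of that figure, the AUC can also be computed from its rank interpretation: the probability that a randomly chosen true instance outscores a randomly chosen false instance, counting ties as one half. The sketch below uses only base R and should approximately agree with ROCPlot.

# Sketch: AUC as the probability that a random TRUE instance scores higher
# than a random FALSE instance, with ties counted as 1/2.
pos <- d$score[d$y]
neg <- d$score[!d$y]
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))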
ROCPlot(d, "score", "y", truthTarget=TRUE, title="Model ROC")

The double density plot shows that in general, true instances score higher than most false instances. It further suggests that a threshold in the “valley” of the plot, say around a value of 2.5, would achieve good separation between true and false instances. But is that separation good enough to achieve project goals?
DoubleDensityPlot(d, "score", "y", title="Distribution of scores as a function of outcome")

A big disadvantage of both the ROC and double density plots is that they
hide the fact that the classes are unbalanced; this is because the TRUE
and FALSE distributions are presented separately. We can show the actual
class prevalences with ShadowHist:
ShadowHist(d, "score", "y", title="Distribution of scores as a function of outcome")

From the ShadowHist we can see that datums that score above a threshold
of 2.5 may not be majority true instances. So while this model may look
pretty good initially, we still aren’t sure if we can pick a threshold
that produces a classification rule that meets project goals.
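The underlying class prevalence is also easy to check numerically; this minimal sketch just tabulates the outcome column.

# Sketch: how unbalanced are the classes in the evaluation data?
table(d$y)
mean(d$y)   # fraction of TRUE instances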
PRTPlot: Plotting Precision vs Recall
For a given model, PRTPlot plots the precision and recall for
different choices of threshold. As expected, higher thresholds give
higher precision, at the cost of lower recall.
PRTPlot(d, "score", "y", truthTarget=TRUE, title="precision and recall as a function of threshold")

The plot suggests that a threshold of 2.5 produces a classifier with about 87% recall, but only 50% precision. Depending on the project goals, this may or may not be good enough. Unfortunately this model can’t achieve higher precision without drastically impairing recall, so if higher simultaneous precision and recall are needed, it may be time to go back to the drawing board.
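As a cross-check, the precision and recall at a threshold of 2.5 can be computed directly from their definitions; this base R sketch should roughly match what the plot shows.

# Sketch: precision = TP / (TP + FP), recall = TP / (TP + FN) at one threshold.
threshold <- 2.5
pred <- d$score > threshold
c(precision = sum(pred & d$y) / sum(pred),
  recall    = sum(pred & d$y) / sum(d$y))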
ThresholdPlot: Plotting Other Metrics by Threshold
If precision/recall aren’t the performance metrics for your application,
ThresholdPlot produces similar plots for a variety of classifier
metrics. See the documentation for all the metrics that can be plotted;
here are a few examples.
# replicate PRTPlot. Looks a little different because ThresholdPlot does different smoothing
ThresholdPlot(d, "score", "y", title="Reproduce PRTPlot",
  truth_target=TRUE, # default
  metrics = c("precision", "recall"))

## Warning: Removed 1 row(s) containing missing values (geom_path).
# default: sensitivity/specificity
ThresholdPlot(d, "score", "y", title="Sensitivity and Specificity as a Function of Threshold")

“Unrolling” the ROC
One useful application of ThresholdPlot is to “unroll” an ROC plot: if
the ROC shows that your model can meet an acceptable trade-off of true
positive rate and false positive rate, then ThresholdPlot can tell you
which threshold achieves that goal.
ThresholdPlot(d, "score", "y", title="ROC 'unrolled'",
metrics = c("true_positive_rate", "false_positive_rate"))Our example model with a threshold of 2.5 achieves a true positive rate (or recall) of about 87%, with a false positive rate of about 12%.
Diagnostics on Data Distribution
ThresholdPlot can also be used to show some possibly useful
diagnostics on score distribution. fraction measures how much of the
data scores above a given threshold value. cdf is 1 - fraction, or
the CDF of the scores (how much of the data is below a given threshold
value).
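Both quantities can be spot-checked at a single threshold with base R; whether the comparison is strict may differ slightly from ThresholdPlot's internal tie handling.

# Sketch: fraction of the data scoring above a threshold, and the empirical CDF.
threshold <- 2.5
c(fraction = mean(d$score > threshold),
  cdf      = mean(d$score <= threshold))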
ThresholdPlot(d, "score", "y", title="Score distribution",
metrics = c("fraction", "cdf"))MetricPairPlot
MetricPairPlot provides another way of visualizing the tradeoffs
between a pair of complementary performance metrics, by plotting them
against each other. For instance, plotting true_positive_rate vs
false_positive_rate gives you the equivalent of the ROC.
MetricPairPlot(d, 'score', 'y', title='ROC equivalent',
  x_metric = "false_positive_rate", # default
  y_metric = "true_positive_rate")  # default

The above plot is just an ad-hoc version of ROCPlot.
# Plot ROCPlot for comparison
ROCPlot(d, 'score', 'y', truthTarget=TRUE, title='ROC example')

You can plot other pairs as well, for instance precision vs. recall:
MetricPairPlot(d, 'score', 'y', title='recall/precision', x_metric = 'recall', y_metric = 'precision')

## Warning: Removed 2 row(s) containing missing values (geom_path).
MetricPairPlot takes the same metrics as ThresholdPlot. See the
documentation for details.
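For example, assuming the sensitivity and specificity metric names shown for ThresholdPlot above carry over (as sharing the same metrics implies), a sensitivity/specificity trade-off plot would look something like this sketch:

# Sketch (assumes the metric names 'specificity' and 'sensitivity' are accepted,
# as suggested by the ThresholdPlot default above).
MetricPairPlot(d, 'score', 'y', title='sensitivity vs specificity',
  x_metric = 'specificity',
  y_metric = 'sensitivity')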