Nina Zumel
WVPlots
has a variety of visualizations that help modelers to design
classifiers best suited to their goals. In particular, ThresholdPlot
is a tool to select classifier thresholds that give the best tradeoffs
among relevant performance metrics. This note demonstrates an example of
using WVPlots
to evaluate and design classifiers.
Here we create an example synthetic data set, where score
is the score
produced by a trained model, and y
is the actual outcome in the
evaluation data set (TRUE
or FALSE
).
library(WVPlots)
## Loading required package: wrapr
set.seed(1452225)
# data with two different regimes of behavior
d <- rbind(
data.frame(
score = rnorm(1000),
y = sample(c(TRUE, FALSE), prob = c(0.02, 0.98), size = 1000, replace = TRUE)),
data.frame(
score = rnorm(200) + 5,
y = sample(c(TRUE, FALSE), size = 200, replace = TRUE))
)
One can create different classifiers from this single model, with
different performance metrics, by varying the decision threshold. Datums
that score higher than the threshold are predicted to be in the class of
interest (TRUE
).
The ROC plot gives this model an AUC of 0.91, which seems pretty high.
ROCPlot(d, "score", "y", truthTarget=TRUE, title="Model ROC")
The double density plot shows that in general, true instances score higher than most false instances. It further suggests that a threshold in the “valley” of the plot, say around a value of 2.5, would achieve good separation between true and false instances. But is that separation good enough to achieve project goals?
DoubleDensityPlot(d, "score", "y", title="Distribution of scores as a function of outcome")
A big disadvantage of both the ROC and double density plots is that they
hide the fact that the classes are unbalanced; this is more obvious when
the TRUE and FALSE distributions are separated. We can show the actual
class prevalences with ShadowHist
:
ShadowHist(d, "score", "y", title="Distribution of scores as a function of outcome")
Notice that from the ShadowHist
we can see that datums that score
above a threshold of 2.5 may not be majority true instances. So while
this model may look pretty good initially, we still aren’t sure if we
can pick a threshold that produces a classification rule that meets
project goals.
For a given model, PRTPlot
plots the precision and recall for
different choices of threshold. As is expected, higher thresholds give
higher precision, at the cost of lower recall.
PRTPlot(d, "score", "y", truthTarget=TRUE, title="precision and recall as a function of threshold")
The plot suggests that a threshold of 2.5 produces a classifier with about 87% recall, but only 50% precision. Depending on the project goals, this may or may not be good enough. Unfortunately this model can’t achieve higher precision without drastically impairing recall, so if higher simultaneous precision and recall are needed, it may be time to go back to the drawing board.
If precision/recall aren’t the performance metrics for your application,
ThresholdPlot
produces similar plots for a variety of classifier
metrics. See the
documentation
for all the metrics that can be plotted; here are a few examples.
# replicate PRTPlot. Looks a little different because ThresholdPlot does different smoothing
ThresholdPlot(d, "score", "y", title="Reproduce PRTPlot",
truth_target=TRUE, # default
metrics = c("precision", "recall"))
## Warning: Removed 1 row(s) containing missing values (geom_path).
# default: sensitivity/specificity
ThresholdPlot(d, "score", "y", title="Sensitivity and Specificity as a Function of Threshold")
One useful application of ThresholdPlot
is to “unroll” an ROC plot: if
the ROC shows that your model can meet an acceptable trade-off of true
positive rate and false positive rate, then ThresholdPlot
can tell you
which threshold achieves that goal.
ThresholdPlot(d, "score", "y", title="ROC 'unrolled'",
metrics = c("true_positive_rate", "false_positive_rate"))
Our example model with a threshold of 2.5 achieves a true positive rate (or recall) of about 87%, with a false positive rate of about 12%.
ThresholdPlot
can also be used to show some possibly useful
diagnostics on score distribution. fraction
measures how much of the
data scores above a given threshold value. cdf
is 1 - fraction
, or
the CDF of the scores (how much of the data is below a given threshold
value).
ThresholdPlot(d, "score", "y", title="Score distribution",
metrics = c("fraction", "cdf"))
MetricPairPlot
provides another way of visualizing the tradeoffs
between a pair of complementary performance metrics, by plotting them
against each other. For instance, plotting true_positive_rate vs
false_positive_rate gives you the equivalent of the ROC.
MetricPairPlot(d, 'score', 'y', title='ROC equivalent',
x_metric = "false_positive_rate", # default
y_metric = "true_positive_rate") # default
The above plot is just an ad-hoc version of ROCPlot
.
# Plot ROCPlot for comparison
ROCPlot(d, 'score', 'y', truthTarget=TRUE, title='ROC example')
You can plot other pairs as well, for instance precision vs. recall:
MetricPairPlot(d, 'score', 'y', title='recall/precision', x_metric = 'recall', y_metric = 'precision')
## Warning: Removed 2 row(s) containing missing values (geom_path).
MetricPairPlot
takes the same metrics as ThresholdPlot
. See the
documentation
for details.