ROCket was primarily build for ROC curve estimation in the presence of aggregated data. Nevertheless, it can also handle raw samples. In general, aggregating data can be very beneficial when dealing with large datasets. Whenever a dataset consists of categorical or discretized continuous features, the data size can be effectively reduced by calculating sufficient statistics for each constellation of feature values. This saves memory and reduces the time needed to train classification models. Also model accuracy assessment can be done on aggregated data, which yields similar benefits. To this end, ROCket provides functions for ROC curve estimation and AUC calculation.
# From CRAN
install.packages("ROCket")
# From GitHub
# install.packages("devtools")
devtools::install_github("da-zar/ROCket")
The easiest way to get started is to prepare a dataset containing all distinct predicted score values together with their count and the number of positive cases. Your dataset could look like this:
nrow(data_agg)
#> [1] 11
head(data_agg)
#> score totals positives
#> 1: 1 62840 38499
#> 2: 0 62309 23985
#> 3: 2 30143 24092
#> 4: -1 30282 6072
#> 5: -2 6509 597
#> 6: 3 6642 6082
You can now pass this data to the rkt_prep
function in order to create
an object that will be later used for estimating ROC curves (possibly
with several different algorithms).
prep_data_agg <- rkt_prep(
scores = data_agg$score,
positives = data_agg$positives,
totals = data_agg$totals
)
It is not necessary to use an aggregated dataset. It’s also possible to
have each single observation in a separate row – in this case the
positives
argument is the regular indicator (a numeric vector is
required) for positive observations and the totals
argument is not
needed anymore (default is 1).
You can print the object, to get some information about the content, or plot it:
prep_data_agg
#> .:: ROCket Prep Object
#> Positives (pos_n): 100000
#> Negatives (neg_n): 100000
#> Pos ECDF (pos_ecdf): rkt_ecdf function
#> Neg ECDF (neg_ecdf): rkt_ecdf function
plot(prep_data_agg)
Estimates of the ROC curve can be calculated with the rkt_roc
function. It takes two arguments. The first one is the rkt_prep
object, which contains all the needed data, and the second one is an
integer saying which method of estimation should be used. A list of
implemented methods can be retrieved with the show_methods
function.
show_methods()
#> nr desc
#> 1: 1 ROC Curve (empirical)
#> 2: 2 ROC Function (empirical)
#> 3: 3 ROC Function (placement values)
#> 4: 4 ROC Function (binormal)
In ROCket, we distinguish two types of ROC curve representations:
- parametric curves – TPR and FPR are functions of a parameter (the score),
- functions – TPR is a function of FPR.
In the first case we estimate two functions, the x and y coordinates of the ROC curve (FPR, TPR). These two functions are returned as a list. In the second case the output is a regular function.
Let’s now calculate estimates of the ROC curve using all available methods.
roc_list <- list()
for (i in 1:4){
roc_list[[i]] <- rkt_roc(prep_data_agg, method = i)
}
The output of rkt_roc
can be used to plot the ROC curve estimates and
calculate the AUC.
par(mfrow = c(2, 2))
for (i in 1:4){
plot(
roc_list[[i]],
main = show_methods()[i, desc],
sub = sprintf('AUC: %f', auc(roc_list[[i]]))
)
}