You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
library(evabic)
library(magrittr)
set.seed(1)
n <- 10000
fake_data <- data.frame(name = paste("species", 1:n, sep = "_"),
value = (1:n)/n,
true_status = sample(c(T, F), size = n, replace = T))
## Appel à ebc_tidy_by_threshold pour calculer l'AUC (rapide jusqu'à n = 1000)
tictoc::tic()
run_1 <- ebc_tidy_by_threshold(detection_values = with(fake_data, setNames(value, name)),
true = with(fake_data, name[true_status]),
all = with(fake_data, name),
measures = c('TPR', "FPR", "FDR"),
direction = "<=")
tictoc::toc() ## 49.75 s
## méthode directe sans passer par ebc_tidy
## Prototypée uniquement pour direction = "<" et measures = c('TPR', "FPR", "FDR") mais s'adapte facilement aux autres cas
## Par flemme, j'utilise quelques fonctions de dplyr mais ce n'est pas strictement nécessaire et je ne reprend pas le préprocessing
my_ebc_tidy_by_threshold <- function(detection_values, true, all) {
N <- length(detection_values)
N_true <- length(true)
d <- data.frame(
ID = names(detection_values),
threshold = detection_values,
status = names(detection_values) %in% true
) %>%
## Sort the data to compute TP / FP / FN / TN iteratively
dplyr::arrange(threshold) %>% ## desc(threshold) si direction = ">" ou ">!"
dplyr::mutate(TP = cumsum(status), ## Number of TP when using the current threshold, ajouter -1 si direction = "<" au lieu de "<="
FP = 1:N - TP, ## Number of FP when using the current threshold
FN = N_true - TP, ## Number of FN when using the current threshold
TN = N - TP - FP - FN, ## Number of TN when using the current threshold
FDR = FP / pmax((FP + TP), 1),
TPR = TP / (TP + FN), ## Recall / sensitivity
FPR = FP / (TN + FP) ## 1 - specificity
)
## Rajouter du code si on demande d'autres mesures
## Remove rows corresponding to duplicate scores by keeping only the last one (use rev twice to keep last one instead of first one)
rows_to_exclude <- d$threshold %>% rev() %>% duplicated() %>% rev()
d <- d[!rows_to_exclude, ]
d
}
tictoc::tic()
run_2 <- my_ebc_tidy_by_threshold(detection_values = with(fake_data, setNames(value, name)),
true = with(fake_data, name[true_status]),
all = with(fake_data, name))
tictoc::toc() # 0.043 s
Probably because there's a lot a useless computation going on when using lapply and computing many related quantities. The prototype my_ebc_tidy_by_threshold() is a proof of concept for a faster implementation that computes basic quantities (TP, TN, FP, FN) efficiently for each threshold by sorting the data and then computes derived metrics from TP, TN, FP, FN. It does not handle border cases yet, i.e. when computing AUC, one should add c(0, 0) and c(1, 1) to c(TPR, FPR), but it's much faster.
The text was updated successfully, but these errors were encountered:
Probably because there's a lot a useless computation going on when using lapply and computing many related quantities. The prototype
my_ebc_tidy_by_threshold()
is a proof of concept for a faster implementation that computes basic quantities (TP, TN, FP, FN) efficiently for each threshold by sorting the data and then computes derived metrics from TP, TN, FP, FN. It does not handle border cases yet, i.e. when computing AUC, one should add c(0, 0) and c(1, 1) to c(TPR, FPR), but it's much faster.The text was updated successfully, but these errors were encountered: