ebc_tidy_by_threshold() is very slow for large vectors #6

Open
mahendra-mariadassou opened this issue Jan 31, 2021 · 0 comments
library(evabic)
library(magrittr)

set.seed(1)
n <- 10000
fake_data <- data.frame(name        = paste("species", 1:n, sep = "_"), 
                        value       = (1:n)/n, 
                        true_status = sample(c(T, F), size = n, replace = T))

## Call ebc_tidy_by_threshold() to compute the AUC (fast up to n = 1000)
tictoc::tic()
run_1 <- ebc_tidy_by_threshold(detection_values = with(fake_data, setNames(value, name)), 
                              true              = with(fake_data, name[true_status]), 
                              all               = with(fake_data, name),
                              measures          = c('TPR', "FPR", "FDR"), 
                              direction         = "<=")
tictoc::toc() ## 49.75 s

## Direct method, without going through ebc_tidy
## Prototyped only for direction = "<=" and measures = c('TPR', "FPR", "FDR"), but easily adapted to the other cases
## Out of laziness I use a few dplyr functions, but they are not strictly necessary, and I do not redo the preprocessing
my_ebc_tidy_by_threshold <- function(detection_values, true, all) {
  N <- length(detection_values)
  N_true <- length(true)
  d <- data.frame(
    ID        = names(detection_values), 
    threshold = detection_values, 
    status    = names(detection_values) %in% true
  ) %>% 
    ## Sort the data to compute TP / FP / FN / TN iteratively
    dplyr::arrange(threshold)  %>% ## desc(threshold) if direction = ">" or ">="
    dplyr::mutate(TP  = cumsum(status),          ## Number of TP when using the current threshold; subtract 1 if direction = "<" instead of "<="
                  FP  = 1:N - TP,                ## Number of FP when using the current threshold
                  FN  = N_true - TP,             ## Number of FN when using the current threshold
                  TN  = N - TP - FP - FN,        ## Number of TN when using the current threshold
                  FDR = FP / pmax((FP + TP), 1),
                  TPR = TP / (TP + FN),          ## Recall / sensitivity
                  FPR = FP / (TN + FP)           ## 1 - specificity
    )
  ## Add more code here if other measures are requested
  ## Remove rows corresponding to duplicate scores by keeping only the last one (use rev twice to keep last one instead of first one)
  rows_to_exclude <- d$threshold %>% rev() %>% duplicated() %>% rev()
  d <- d[!rows_to_exclude, ]
  d
}

tictoc::tic()
run_2 <- my_ebc_tidy_by_threshold(detection_values = with(fake_data, setNames(value, name)), 
                                  true              = with(fake_data, name[true_status]), 
                                  all               = with(fake_data, name))
tictoc::toc() # 0.043 s
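
The comments in the prototype hint at how to handle the other directions. A minimal sketch of that adaptation (the function name `my_ebc_tidy_by_threshold_desc` is hypothetical, and the sketch only computes the raw counts, assuming ">=" semantics; subtract `status` from `TP` for strict ">"):

```r
## Hypothetical sketch of the direction = ">=" variant suggested by the
## comments above: sort in DESCENDING order so that cumsum(status) counts
## the true positives among items called positive at each threshold.
my_ebc_tidy_by_threshold_desc <- function(detection_values, true) {
  N <- length(detection_values)
  N_true <- length(true)
  d <- data.frame(
    ID        = names(detection_values),
    threshold = detection_values,
    status    = names(detection_values) %in% true
  )
  d <- d[order(d$threshold, decreasing = TRUE), ]  ## desc(threshold)
  d$TP <- cumsum(d$status)          ## TP at the current threshold (">=")
  d$FP <- seq_len(N) - d$TP         ## FP at the current threshold
  d$FN <- N_true - d$TP             ## FN at the current threshold
  d$TN <- N - d$TP - d$FP - d$FN    ## TN at the current threshold
  d
}
```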

This is probably because a lot of useless computation goes on when using lapply and recomputing many related quantities. The prototype my_ebc_tidy_by_threshold() is a proof of concept for a faster implementation: it sorts the data once, computes the basic quantities (TP, TN, FP, FN) efficiently for each threshold, and then derives the other metrics from them. It does not yet handle edge cases (e.g. when computing the AUC, one should add the points (FPR, TPR) = (0, 0) and (1, 1)), but it is much faster.
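
For illustration, the edge-case fix mentioned above could be sketched as follows (`auc_trapezoid` is a hypothetical helper, not part of evabic; it adds the two endpoints and applies the trapezoidal rule):

```r
## Hypothetical helper (not part of evabic): trapezoidal AUC over an ROC
## curve, after adding the (FPR, TPR) = (0, 0) and (1, 1) endpoints
## mentioned above.
auc_trapezoid <- function(fpr, tpr) {
  fpr <- c(0, fpr, 1)
  tpr <- c(0, tpr, 1)
  o <- order(fpr, tpr)               ## sort by FPR, ties broken by TPR
  fpr <- fpr[o]; tpr <- tpr[o]
  ## Trapezoidal rule: sum of segment widths times mean segment heights
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

## e.g. auc_trapezoid(run_2$FPR, run_2$TPR)
```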
