ebc_tidy_by_threshold() is very slow for large vectors #6

Open
mahendra-mariadassou opened this issue Jan 31, 2021 · 0 comments
library(evabic)
library(magrittr)

set.seed(1)
n <- 10000
fake_data <- data.frame(name        = paste("species", 1:n, sep = "_"), 
                        value       = (1:n)/n, 
                        true_status = sample(c(T, F), size = n, replace = T))

## Call ebc_tidy_by_threshold() to compute the AUC (fast up to n = 1000)
tictoc::tic()
run_1 <- ebc_tidy_by_threshold(detection_values = with(fake_data, setNames(value, name)), 
                              true              = with(fake_data, name[true_status]), 
                              all               = with(fake_data, name),
                              measures          = c('TPR', "FPR", "FDR"), 
                              direction         = "<=")
tictoc::toc() ## 49.75 s

## Direct method, without going through ebc_tidy
## Prototyped only for direction = "<=" and measures = c('TPR', "FPR", "FDR"), but easily adapted to the other cases
## Out of laziness I use a few dplyr functions, but they are not strictly necessary, and I do not redo the preprocessing
my_ebc_tidy_by_threshold <- function(detection_values, true, all) {
  N <- length(detection_values)
  N_true <- length(true)
  d <- data.frame(
    ID        = names(detection_values), 
    threshold = detection_values, 
    status    = names(detection_values) %in% true
  ) %>% 
    ## Sort the data to compute TP / FP / FN / TN iteratively
    dplyr::arrange(threshold)  %>% ## desc(threshold) if direction = ">" or ">="
    dplyr::mutate(TP  = cumsum(status),          ## Number of TP when using the current threshold; subtract 1 if direction = "<" instead of "<="
                  FP  = 1:N - TP,                ## Number of FP when using the current threshold
                  FN  = N_true - TP,             ## Number of FN when using the current threshold
                  TN  = N - TP - FP - FN,        ## Number of TN when using the current threshold
                  FDR = FP / pmax((FP + TP), 1),
                  TPR = TP / (TP + FN),          ## Recall / sensitivity
                  FPR = FP / (TN + FP)           ## 1 - specificity
    )
  ## Add more code here if other measures are requested
  ## Remove rows corresponding to duplicate scores by keeping only the last one (use rev twice to keep last one instead of first one)
  rows_to_exclude <- d$threshold %>% rev() %>% duplicated() %>% rev()
  d <- d[!rows_to_exclude, ]
  d
}

tictoc::tic()
run_2 <- my_ebc_tidy_by_threshold(detection_values = with(fake_data, setNames(value, name)), 
                                  true              = with(fake_data, name[true_status]), 
                                  all               = with(fake_data, name))
tictoc::toc() # 0.043 s
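
The comments in the prototype hint at how to handle the other directions. A minimal sketch of that adaptation (the function name `my_ebc_tidy_by_threshold_desc` is hypothetical, and the sketch only computes the raw counts, assuming ">=" semantics; subtract `status` from `TP` for strict ">"):

```r
## Hypothetical sketch of the direction = ">=" variant suggested by the
## comments above: sort in DESCENDING order so that cumsum(status) counts
## the true positives among items called positive at each threshold.
my_ebc_tidy_by_threshold_desc <- function(detection_values, true) {
  N <- length(detection_values)
  N_true <- length(true)
  d <- data.frame(
    ID        = names(detection_values),
    threshold = detection_values,
    status    = names(detection_values) %in% true
  )
  d <- d[order(d$threshold, decreasing = TRUE), ]  ## desc(threshold)
  d$TP <- cumsum(d$status)          ## TP at the current threshold (">=")
  d$FP <- seq_len(N) - d$TP         ## FP at the current threshold
  d$FN <- N_true - d$TP             ## FN at the current threshold
  d$TN <- N - d$TP - d$FP - d$FN    ## TN at the current threshold
  d
}
```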

This is probably because a lot of useless computation goes on when using lapply and recomputing many related quantities. The prototype my_ebc_tidy_by_threshold() is a proof of concept for a faster implementation: it sorts the data once, computes the basic quantities (TP, TN, FP, FN) efficiently for each threshold, and then derives the other metrics from them. It does not yet handle edge cases (e.g. when computing the AUC, one should add the points (FPR, TPR) = (0, 0) and (1, 1)), but it is much faster.
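
For illustration, the edge-case fix mentioned above could be sketched as follows (`auc_trapezoid` is a hypothetical helper, not part of evabic; it adds the two endpoints and applies the trapezoidal rule):

```r
## Hypothetical helper (not part of evabic): trapezoidal AUC over an ROC
## curve, after adding the (FPR, TPR) = (0, 0) and (1, 1) endpoints
## mentioned above.
auc_trapezoid <- function(fpr, tpr) {
  fpr <- c(0, fpr, 1)
  tpr <- c(0, tpr, 1)
  o <- order(fpr, tpr)               ## sort by FPR, ties broken by TPR
  fpr <- fpr[o]; tpr <- tpr[o]
  ## Trapezoidal rule: sum of segment widths times mean segment heights
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

## e.g. auc_trapezoid(run_2$FPR, run_2$TPR)
```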
