
Question on the performance of anydate() #132

Closed

etiennebacher opened this issue Feb 24, 2024 · 7 comments

@etiennebacher

Hello Dirk, I saw the announcement of chronos on Mastodon, and when I tried to reproduce the benchmarks I was a bit surprised by the performance of anytime (without comparing it to chronos). It seems to me that applying anydate() / anytime() to the unique values only and then matching them back to the original vector could lead to time gains (at the expense of memory usage) in some contexts, mostly for dates, since duplicated values are more likely there than for datetimes.

Here's a small benchmark comparing the current anydate() with the alternative of applying it to unique values only. I ran this with 10k to 70k values (in steps of 10k):

library(anytime)
library(ggplot2)
library(patchwork)

res <- bench::press(
  n = c(10000, 20000, 30000, 40000, 50000, 60000, 70000),
  {
    dat <- paste(
      sample(1:31, n, TRUE), sample(month.abb, n, TRUE),
      sample(2000:2024, n, TRUE)
    )
    print(paste("Number of unique obs:", length(unique(dat))))
    bench::mark(
      orig = anytime::anydate(dat),
      new = {
        uniques <- unique(dat)
        converted <- anytime::anydate(uniques)
        out <- converted[match(dat, uniques)] |> 
          as.Date()
        attr(out, "tzone") <- "Europe/Paris"
        out
      }
    )
  }
)
#> Running with:
#>       n
#> 1 10000
#> [1] "Number of unique obs: 6178"
#> 2 20000
#> [1] "Number of unique obs: 8234"
#> 3 30000
#> [1] "Number of unique obs: 8929"
#> 4 40000
#> [1] "Number of unique obs: 9178"
#> 5 50000
#> [1] "Number of unique obs: 9253"
#> 6 60000
#> [1] "Number of unique obs: 9284"
#> 7 70000
#> [1] "Number of unique obs: 9295"

res$expr <- as.character(res$expression)

time <- res |> 
  ggplot(aes(n, median, color = expr)) +
  geom_point() +
  geom_line() +
  labs(title = "Time")

mem <- res |> 
  ggplot(aes(n, mem_alloc, color = expr)) +
  geom_point() +
  geom_line() +
  labs(title = "Memory")

time + mem

Of course I'm aware of the difficulties of producing benchmarks that accurately reproduce real-world situations, and I'm not arguing that my alternative is better. I'm sure I also left aside many important details, such as time zones, missing values, etc. I was just curious, and since I found these results I'm wondering whether this is something you had already considered.

@eddelbuettel
Owner

eddelbuettel commented Feb 24, 2024

(Aside: That's godawfully formatted code. But that's just me and 25+ years of ESS use.)

I have the feeling this has come up before. Did you check old issues?

Could you also please measure the overhead of computing unique values at those sizes for vectors that are in fact unique, without replicates?

@eddelbuettel
Owner

Maybe add a third column using this value:

calcUnique: A logical value with a default value of ‘FALSE’ that tells
          the function to perform the ‘anytime()’ or ‘anydate()’
          calculation only once for each unique value in the ‘x’
          vector. It results in no difference in inputs or outputs, but
          can result in a significant speed increase for long vectors
          where each timestamp appears more than once. However, it will
          result in a slight slow down for input vectors where each
          timestamp appears only once.
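
In practice that means a single extra argument; a minimal sketch (x here is just an illustrative vector of repeated date strings):

library(anytime)

x <- rep(c("24 Feb 2024", "25 Feb 2024"), times = 5000)

# same result as anydate(x), but each unique string is parsed only once
anydate(x, calcUnique = TRUE)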

@etiennebacher
Author

I only saw #109 and calcUnique now... Sorry for the duplicated (and already fixed) issue.


Results with calcUnique = TRUE for future visitors:
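
For anyone who prefers to rerun the comparison, a minimal sketch of the three-way benchmark (single n, synthetic data as above, days capped at 28 so every string parses):

library(anytime)

set.seed(2024)
n   <- 50000
dat <- paste(
  sample(1:28, n, TRUE), sample(month.abb, n, TRUE), sample(2000:2024, n, TRUE)
)

bench::mark(
  orig = anydate(dat),
  new = {
    # convert unique values once, then map back onto the full vector
    uniques <- unique(dat)
    anydate(uniques)[match(dat, uniques)]
  },
  calcUnique = anydate(dat, calcUnique = TRUE)
)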

@eddelbuettel
Owner

eddelbuettel commented Feb 24, 2024

@schochastics Please see above -- @etiennebacher did some digging and touches upon an issue that may matter for your benchmarks too. I have the default for calcUnique set to 'off' because where I come from (my former field of high-ish frequency finance) timestamps do indeed tend to be unique (and by now the field is of course more occupied with nanosecond resolution, so POSIXct is of limited usefulness; that was different when I wrote anytime). And for dates it is definitely an issue, as values clash much more easily.

@etiennebacher We could think about a data.table-like heuristic here. Maybe if N > someValue, say 10k, we sample 100 values and see if we have replication. Or maybe sample ten blocks of ten? This would require some thinking, but you do document that the gain could be substantial. Worth doing as a heuristic?
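
A rough sketch of the sampling idea, in R for illustration only (names and thresholds are made up; inside anytime the check would presumably live on the C++ side, and the blockwise variant would look similar):

# hypothetical helper: probe a small random sample for duplicates to decide
# whether the calcUnique path is likely to pay off
likely_has_duplicates <- function(x, min_n = 10000L, probe_size = 100L) {
  if (length(x) <= min_n) return(FALSE)  # small inputs: not worth the detour
  probe <- sample(x, probe_size)         # sample 100 values without replacement
  anyDuplicated(probe) > 0L              # TRUE if the probe contains any repeat
}

# illustration: 50k date strings drawn from only a few hundred distinct values
set.seed(123)
dat <- paste(sample(1:28, 50000, TRUE), sample(month.abb, 50000, TRUE), 2024)

if (likely_has_duplicates(dat)) {
  res <- anytime::anydate(dat, calcUnique = TRUE)  # dedup path
} else {
  res <- anytime::anydate(dat)
}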

@etiennebacher
Author

> Worth doing as a heuristic?

Could be, but I'm not an active user of anytime as I rarely have a use case for it, so I don't think my opinion matters much here. I was simply intrigued by the benchmarks of @schochastics and explored a bit to see if there was some low-hanging fruit.

@schochastics

Upfront, I need to clarify that my benchmark is not that fair yet, because I just return a character vector so far, not a POSIXct-formatted object. calcUnique doesn't make a difference in my current benchmark since I used a vector of unique dates.

I'll try to remember to report back here once I've done some more rigorous testing with chronos and a better interface.

@eddelbuettel
Owner

No need to report back here then if you also use unique values.
