
Question on the performance of anydate() #132

Closed

etiennebacher opened this issue Feb 24, 2024 · 7 comments

@etiennebacher

Hello Dirk, I saw the announcement of chronos on Mastodon, and when I tried to reproduce the benchmarks I was a bit surprised by the performance of anytime (without comparing it to chronos). It seems to me that applying anydate() / anytime() to the unique values only and then matching them back to the original vector could lead to time gains (at the expense of memory usage) in some contexts, mostly for dates, since duplicated values are more likely there than for datetimes.

Here's a small benchmark comparing the current anydate() with the alternative of applying it to unique values only. I ran this with 10k to 70k values (in steps of 10k):

library(anytime)
library(ggplot2)
library(patchwork)

res <- bench::press(
  n = c(10000, 20000, 30000, 40000, 50000, 60000, 70000),
  {
    dat <- paste(
      sample(1:31, n, TRUE), sample(month.abb, n, TRUE),
      sample(2000:2024, n, TRUE)
    )
    print(paste("Number of unique obs:", length(unique(dat))))
    bench::mark(
      orig = anytime::anydate(dat),
      new = {
        uniques <- unique(dat)
        converted <- anytime::anydate(uniques)
        out <- converted[match(dat, uniques)] |> 
          as.Date()
        attr(out, "tzone") <- "Europe/Paris"
        out
      }
    )
  }
)
#> Running with:
#>       n
#> 1 10000
#> [1] "Number of unique obs: 6178"
#> 2 20000
#> [1] "Number of unique obs: 8234"
#> 3 30000
#> [1] "Number of unique obs: 8929"
#> 4 40000
#> [1] "Number of unique obs: 9178"
#> 5 50000
#> [1] "Number of unique obs: 9253"
#> 6 60000
#> [1] "Number of unique obs: 9284"
#> 7 70000
#> [1] "Number of unique obs: 9295"

res$expr <- as.character(res$expression)

time <- res |> 
  ggplot(aes(n, median, color = expr)) +
  geom_point() +
  geom_line() +
  labs(title = "Time")

mem <- res |> 
  ggplot(aes(n, mem_alloc, color = expr)) +
  geom_point() +
  geom_line() +
  labs(title = "Memory")

time + mem

Of course I'm aware of the difficulties of producing benchmarks that accurately reproduce real-world situations, and I'm not arguing that my alternative is better. I'm sure I also left aside many important details, such as time zones, missing values, etc. I was just curious, and since I found these results I'm wondering whether this is something you had already considered.

@eddelbuettel
Owner

eddelbuettel commented Feb 24, 2024

(Aside: That's godawfully formatted code. But that's just me and 25+ years of ESS use.)

I have the feeling this has come up before. Did you check old issues?

Could you also please measure the overhead of computing unique values at those sizes for vectors that are in fact unique, without replicates?

@eddelbuettel
Owner

Maybe add a third column using this value:

calcUnique: A logical value with a default value of ‘FALSE’ that tells
          the function to perform the ‘anytime()’ or ‘anydate()’
          calculation only once for each unique value in the ‘x’
          vector. It results in no difference in inputs or outputs, but
          can result in a significant speed increase for long vectors
          where each timestamp appears more than once. However, it will
          result in a slight slow down for input vectors where each
          timestamp appears only once.
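
In practice that means a single extra argument; a minimal sketch (x here is just an illustrative vector of repeated date strings):

library(anytime)

x <- rep(c("24 Feb 2024", "25 Feb 2024"), times = 5000)

# same result as anydate(x), but each unique string is parsed only once
anydate(x, calcUnique = TRUE)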

@etiennebacher
Author

I only saw #109 and calcUnique now... Sorry for the duplicated (and already fixed) issue.


Results with calcUnique = TRUE for future visitors:
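
For anyone who prefers to rerun the comparison, a minimal sketch of the three-way benchmark (single n, synthetic data as above, days capped at 28 so every string parses):

library(anytime)

set.seed(2024)
n   <- 50000
dat <- paste(
  sample(1:28, n, TRUE), sample(month.abb, n, TRUE), sample(2000:2024, n, TRUE)
)

bench::mark(
  orig = anydate(dat),
  new = {
    # convert unique values once, then map back onto the full vector
    uniques <- unique(dat)
    anydate(uniques)[match(dat, uniques)]
  },
  calcUnique = anydate(dat, calcUnique = TRUE)
)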

@eddelbuettel
Owner

eddelbuettel commented Feb 24, 2024

@schochastics Please see above -- @etiennebacher did some digging and touches upon an issue that may matter for your benchmarks too. I have the default for calcUnique set to 'off' because where I come from (my former field of high-ish frequency finance) timestamps do indeed tend to be unique (and by now the field is of course more occupied with nanosecond resolution, so POSIXct is of limited usefulness; that was different when I wrote anytime). And for dates it is definitely an issue, as values clash much more easily.

@etiennebacher We could think about a data.table-like heuristic here. Maybe if N > someValue, say 10k, we sample 100 values and see if we have replication. Or maybe sample ten blocks of ten? This would require some thinking, but you do document that the gain could be substantial. Worth doing as a heuristic?
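
A rough sketch of the sampling idea, in R for illustration only (names and thresholds are made up; inside anytime the check would presumably live on the C++ side, and the blockwise variant would look similar):

# hypothetical helper: probe a small random sample for duplicates to decide
# whether the calcUnique path is likely to pay off
likely_has_duplicates <- function(x, min_n = 10000L, probe_size = 100L) {
  if (length(x) <= min_n) return(FALSE)  # small inputs: not worth the detour
  probe <- sample(x, probe_size)         # sample 100 values without replacement
  anyDuplicated(probe) > 0L              # TRUE if the probe contains any repeat
}

# illustration: 50k date strings drawn from only a few hundred distinct values
set.seed(123)
dat <- paste(sample(1:28, 50000, TRUE), sample(month.abb, 50000, TRUE), 2024)

if (likely_has_duplicates(dat)) {
  res <- anytime::anydate(dat, calcUnique = TRUE)  # dedup path
} else {
  res <- anytime::anydate(dat)
}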

@etiennebacher
Author

> Worth doing as a heuristic?

Could be, but I'm not an active user of anytime as I rarely have a use case for it, so I don't think my opinion matters much here. I was simply intrigued by the benchmarks of @schochastics and explored a bit to see if there was some low-hanging fruit.

@schochastics

Upfront, I need to clarify that my benchmark is not that fair yet, because I just return a character vector so far, not a POSIXct-formatted object. calcUnique doesn't make a difference in my current benchmark since I used a vector of unique dates.

I'll try to remember to report back here once I've done some more rigorous testing with chronos and a better interface.

@eddelbuettel
Owner

No need to report back here then if you also use unique values.
