Process only unique values for speed, please :) #109
Comments
Hadn't thought of that as a speedup. It could maybe work as a simple wrapper. But ideally without extra dependencies, so dplyr and magrittr are a bit of a no-no there. If you wanted to implement this using Rcpp and BH at the C++ level, or in base R without new dependencies, we might have something here.
For me it would be base R ... the one thing I don't know how to do is a fast n*log(n) merge in base R ... I learned to do that in data.table. Let me research it and see what I can come up with.
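For what it's worth, the unique-then-map idea can be sketched in base R with `unique()` and `match()`. This is a hypothetical illustration, not the package's code; `slow_parse` stands in for any expensive vectorised parser:

```r
# Parse only the distinct values, then map results back with match().
# slow_parse is a stand-in for an expensive vectorised parser.
slow_parse <- function(x) as.Date(x, format = "%Y-%m-%d")

parse_unique <- function(x) {
  ux <- unique(x)            # typically far shorter than x
  parsed <- slow_parse(ux)   # expensive work done once per distinct value
  parsed[match(x, ux)]       # hash-based lookup back to the original order
}

x <- rep(c("2020-01-01", "2020-01-02"), each = 3)
identical(parse_unique(x), slow_parse(x))   # TRUE: same result, fewer parses
```

Because `match()` uses hashing internally, the mapping step scales well even for large inputs.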
Apparently ...
Same -- I do so love data.table. Maybe we can get by with a Suggests: if we make the functionality an option and then test for the package's availability.
The result is quite quick ...
What percentage of values are non-unique in that case? (The reason I am asking is that for most of my cases the data was chronologically ordered and hence had unique timestamps -- but your suggestion has clear merit ...)
99.96% are non-unique. This is a sample of data from a ~10,000 modem panel (that was the number we requested), so we would expect 1 in 10k, or 99.99%. In other words, it is exceedingly common to get several interwoven time series in one input file when you're asking for a time series for a statistically significant number of different cases. For a real-world case from an operating company I would expect 5, 6, or 7 nines (100k, 1M, 10M). These timestamps came in in text form that wasn't always perfect, so I'm definitely thankful for this package. My intuition is that the crossover point -- the percentage of non-unique values below which this technique stops being faster -- is well under 50%. --Stephen
That's what they call Yuge. We should support that. (The only other optimization I had been thinking about, but did not need myself, was 'memorizing' which parsing expression was used last so as not to search again. Then again, I have the feeling I made that 'index' variable static and enabled this many moons ago ...)
I only just glanced through your C code to see if I could find you boiling it down to unique instances, so I can't tell, but I did notice the large number of potential formats you hand-coded ... thanks again for creating this package and your work on behalf of the community.
Yes. See how it 'holds' a vector of format candidates? It then loops until it finds one that does not generate NA. I meant to double-check that the index is indeed kept static. As for the 'cache for unique', what are you thinking? Make it an option for (any|utc)(time|date), i.e. add it somewhere to the processing code? Or make it a wrapper (potentially four times :-/)?
Honestly, it's just a super simple refactoring (more speed) with no other changes. While I'm not nearly as wise in creating user experiences as you are, I don't see any reason to do anything but put it "under the hood" when the time is right and in the right place ... which I think would be in the R (S3?) methods just before the Rcpp call. I will do a binary search to see where the tradeoff is. I don't even think it's worth adding a `calc_unique = TRUE` flag to the function call ... that's the kind of complication one only cares about when parallelizing something.
I am mostly worried about overall behaviour and consistency. This is a functional change, so my preference would be opt-in, that is, an option defaulting to FALSE which one needs to turn on to get the behaviour you seek. We could also keep it simple and just make it a demo and/or a new vignette for now.
In that context, I think I'm getting the hint that you'd like me to clone or fork the code, put in the changes, and also at least draft the blog post in blogdown?
:-) I am still trying to think through how best to approach this, and it is not obvious. If you look at anytime.R, many/most/all of the paths end by eventually calling anytime_cpp. Or one just wraps an outer function around it that takes a vector, finds the unique values, and maps them back to the original ones. Basically three or so lines.
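One possible shape for such an outer wrapper, as a base-R sketch (the name `with_unique` is made up, and any vectorised parser could be plugged in):

```r
# Turn any vectorised parser into one that parses unique values only.
# with_unique is a hypothetical name, not part of the package.
with_unique <- function(parser) {
  function(x, ...) {
    ux <- unique(x)
    parser(ux, ...)[match(x, ux)]   # parse once per distinct value, expand back
  }
}

# Hypothetical usage: fast_anytime <- with_unique(anytime::anytime)
# Demonstrated here with a base R parser instead:
fast_date <- with_unique(function(x) as.Date(x, format = "%Y-%m-%d"))
fast_date(c("2020-01-01", "2020-01-02", "2020-01-01"))
```

The appeal of this shape is that it leaves every existing code path untouched; the wrapper is purely additive.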
It feels like you're thinking yourself in circles at this point ... give it a few days ... something will decide it for you. A little later today, I'll present what I'm thinking in terms of code, given the speed of base R's `unique()` and `match()`. If you want to do it in C, I would take a look at the underlying source for those commands and see what C (hopefully not Fortran) libraries they call upon in the first place.
Well, I don't want to add code to each of the S3 methods. Maybe it will work at the respective entry points. But yes, it's your itch to scratch, so please do make a proposal in code and then we'll see where we are.
Here is the diff I propose. The only modification to each class is the addition of the calcUnique variable to the function definition. As we discussed, only the calls to anytime_cpp (or the internal call to anytime) are changed. I haven't tested it just yet, and I can't commit to any testing until well into next week, but this is the outline of my proposal.
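In spirit, the change amounts to a small branch just before the backend call. A simplified mock of that idea (not the actual diff; a base R parser stands in for anytime_cpp here):

```r
# Mock of the proposed calcUnique branch; backend stands in for anytime_cpp.
anytime_mock <- function(x, calcUnique = FALSE) {
  backend <- function(v) as.Date(v, format = "%Y-%m-%d")
  if (calcUnique) {
    ux <- unique(x)
    backend(ux)[match(x, ux)]   # parse distinct values, expand back
  } else {
    backend(x)                  # existing behaviour, unchanged by default
  }
}

x <- rep(c("2020-01-01", "2020-01-02"), times = 4)
identical(anytime_mock(x), anytime_mock(x, calcUnique = TRUE))   # TRUE
```

Defaulting `calcUnique` to FALSE keeps the change opt-in, matching the preference stated earlier in the thread.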
That's pretty good and concise. I was musing for a moment whether one could get by with just one call.
Do you want to fork, commit your change and propose it as a pull-request? |
It's my first rodeo and your repository ... so whichever you want.
I am easy. If you want to "earn your first stripes" with an actual commit and PR, I am happy to walk you through it. If that is more 'bah humbug' to you, I can also take your (clear enough) diff and apply it, and still give you credit in the ChangeLog etc. Happy to help either way.
I would indeed like to learn to do it correctly ... but I want to test the code first ... and that will likely be tomorrow or Wednesday, as I need to concentrate on actually getting some conclusions out of the data today.
That's part of a PR too :) The existing unit tests help, and it is considered best style to add new tests for new functionality. I.e. we could mock something up by having a vector with repeated values. No rush.
Fixed with the 0.3.7 release.
Thanks for catching that. I usually remember to add a closing tag to the commit message. And thanks again for the PR.
I'm happy to ... it was a great learning experience and demystified a lot for me. |
I'm working with a dataset that has ~90 million observations, all measured over the same daily window, and this was a massive speedup, so thank you both!
While anytime regularly helps me out with non-standard timestamps, it is of course a bit slow ... as such, I have taken to wrapping it in a function that computes only the unique values and then merges/joins them back into the input data.table.
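The wrapper itself is not shown above; a base R stand-in for the same merge/join idea (using `merge()` on a data.frame rather than a keyed data.table join, with `as.Date` standing in for anytime) might look like:

```r
# Join-based variant: build a lookup table of unique values, parse it
# once, then merge it back and restore the original row order.
parse_via_join <- function(ts, parser) {
  lookup <- data.frame(ts = unique(ts), stringsAsFactors = FALSE)
  lookup$parsed <- parser(lookup$ts)                 # one parse per distinct value
  merged <- merge(data.frame(ts = ts, ord = seq_along(ts),
                             stringsAsFactors = FALSE),
                  lookup, by = "ts", sort = FALSE)
  merged$parsed[order(merged$ord)]                   # back to input order
}

ts <- rep(c("2020-01-01", "2020-01-02"), each = 2)
parse_via_join(ts, function(x) as.Date(x, format = "%Y-%m-%d"))
```

A data.table version would replace the `merge()` with a keyed join, which avoids the explicit row-order bookkeeping.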
Would it be too much to ask that this be done within the package?