Process only unique values for speed, please :) #109
Comments
Hadn't thought of that as a speedup. It could maybe work as a simple wrapper. But ideally without extra dependencies, so dplyr and magrittr are a bit of a no-no there. If you wanted to implement this using Rcpp and BH at the C++ level, or in base R without new dependencies, we might have something here.
For me it would be base R ... the one thing I don't know how to do is a fast n*log(n) merge in base R ... I learned to do that in data.table. Let me research it and see what I can come up with.
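For what it's worth, the unique-then-map idea can be sketched in base R with `unique()` and `match()`. This is a hypothetical illustration, not the package's code; `slow_parse` stands in for any expensive vectorised parser:

```r
# Parse only the distinct values, then map results back with match().
# slow_parse is a stand-in for an expensive vectorised parser.
slow_parse <- function(x) as.Date(x, format = "%Y-%m-%d")

parse_unique <- function(x) {
  ux <- unique(x)            # typically far shorter than x
  parsed <- slow_parse(ux)   # expensive work done once per distinct value
  parsed[match(x, ux)]       # hash-based lookup back to the original order
}

x <- rep(c("2020-01-01", "2020-01-02"), each = 3)
identical(parse_unique(x), slow_parse(x))   # TRUE: same result, fewer parses
```

Because `match()` uses hashing internally, the mapping step scales well even for large inputs.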
Apparently ...
Same -- I do so love data.table. Maybe we can get by with a Suggests: if we make the functionality an option and then test for the package's availability.
The result is quite quick ...
What percentage of values are non-unique in that case? (The reason I am asking is that for most of my cases the data was chronologically ordered and hence had unique timestamps -- but your suggestion has clear merit ...)
99.96% are non-unique. This is a sample of data from a ~10,000 modem panel (that was the number we requested), so we would expect 1 in 10k, or 99.99%. In other words, it is exceedingly common to get several interwoven time series in one input file when you're asking for a time series for a statistically significant number of different cases. For a real-world case from an operating company I would expect 5, 6, or 7 nines (100k, 1M, 10M). These timestamps came in in text form that wasn't always perfect, so I'm definitely thankful for this package. My intuition is that the crossover point -- the percentage of non-unique values below which this technique stops being faster -- is well under 50%. --Stephen
That's what they call Yuge. We should support that. (The only other optimization I had been thinking about, but did not need myself, was 'memorizing' which parsing expression was used last so as not to search again. Then again, I have the feeling I made that 'index' variable static and enabled this many moons ago ...)
I only just glanced through your C code to see if I could find you boiling it down to unique instances, so I can't tell, but I did notice the large number of potential formats you hand-coded ... thanks again for creating this package and your work on behalf of the community.
Yes. See how it 'holds' a vector of format candidates? It then loops until it finds one that does not generate NA. I meant to double-check that the index is indeed kept static. As for the 'cache for unique', what are you thinking? Make it an option for (any|utc)(time|date), i.e. add it somewhere to the processing code? Or make it a wrapper (potentially four times :-/)?
Honestly, it's just a super simple refactoring (more speed) with no other changes. While I'm not nearly as wise in creating user experiences as you are, I don't see any reason to do anything but put it "under the hood" when the time is right and in the right place ... which I think would be in the R (S3?) methods just before the Rcpp call. I will do a binary search to see where the tradeoff is. I don't even think it's worth adding a `calc_unique = TRUE` flag to the function call ... that's the kind of complication one only cares about when parallelizing something.
I am mostly worried about overall behaviour and consistency. This is a functional change, so my preference would be opt-in, that is, an option defaulting to FALSE which one needs to turn on to get the behaviour you seek. We could also keep it simple and just make it a demo and/or a new vignette for now.
In that context, I think I'm getting the hint that you'd like me to clone or fork the code, put in the changes, and also at least draft the blog post in blogdown?
:-) I am still trying to think through how best to approach this, and it is not obvious. If you look at anytime.R, many/most/all of the paths end by eventually calling anytime_cpp. Or one just wraps an outer function around it that takes a vector, finds the unique values, and maps them back to the original ones. Basically three or so lines.
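One possible shape for such an outer wrapper, as a base-R sketch (the name `with_unique` is made up, and any vectorised parser could be plugged in):

```r
# Turn any vectorised parser into one that parses unique values only.
# with_unique is a hypothetical name, not part of the package.
with_unique <- function(parser) {
  function(x, ...) {
    ux <- unique(x)
    parser(ux, ...)[match(x, ux)]   # parse once per distinct value, expand back
  }
}

# Hypothetical usage: fast_anytime <- with_unique(anytime::anytime)
# Demonstrated here with a base R parser instead:
fast_date <- with_unique(function(x) as.Date(x, format = "%Y-%m-%d"))
fast_date(c("2020-01-01", "2020-01-02", "2020-01-01"))
```

The appeal of this shape is that it leaves every existing code path untouched; the wrapper is purely additive.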
It feels like you're thinking yourself in circles at this point ... give it a few days ... something will decide it for you. A little later today, I'll present what I'm thinking in terms of code, given the speed of base R's `unique()` and `match()`. If you want to do it in C, I would take a look at the underlying source for those commands and see what C (hopefully not Fortran) libraries they call upon in the first place.
Well, I don't want to add code to each of the S3 methods. Maybe it will work at the respective entry points. But yes, it's your itch to scratch, so please do make a proposal in code and then we'll see where we are.
Here is the diff I propose. The only modification to each class is the addition of the calcUnique variable to the function definition. As we discussed, only the calls to anytime_cpp (or the internal call to anytime) are changed. I haven't tested it just yet, and I can't commit to any testing until well into next week, but this is the outline of my proposal.
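In spirit, the change amounts to a small branch just before the backend call. A simplified mock of that idea (not the actual diff; a base R parser stands in for anytime_cpp here):

```r
# Mock of the proposed calcUnique branch; backend stands in for anytime_cpp.
anytime_mock <- function(x, calcUnique = FALSE) {
  backend <- function(v) as.Date(v, format = "%Y-%m-%d")
  if (calcUnique) {
    ux <- unique(x)
    backend(ux)[match(x, ux)]   # parse distinct values, expand back
  } else {
    backend(x)                  # existing behaviour, unchanged by default
  }
}

x <- rep(c("2020-01-01", "2020-01-02"), times = 4)
identical(anytime_mock(x), anytime_mock(x, calcUnique = TRUE))   # TRUE
```

Defaulting `calcUnique` to FALSE keeps the change opt-in, matching the preference stated earlier in the thread.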
That's pretty good and concise. I was musing for a moment whether one could get by with just one call.
Do you want to fork, commit your change and propose it as a pull-request? |
It's my first rodeo and your repository ... so whichever you want.
I am easy. If you want to "earn your first stripes" with an actual commit and PR, I am happy to walk you through it. If that is more 'bah humbug' to you, I can also take your (clear enough) diff and apply it, and still give you credit in the ChangeLog etc. Happy to help either way.
I would indeed like to learn to do it correctly ... but I want to test the code first ... and that will likely be tomorrow or Wednesday, as I need to concentrate on actually getting some conclusions out of the data today.
That's part of a PR too :) The existing unit tests help, and it is considered best style to add new tests for new functionality. I.e. we could mock something up by having a vector with repeated values. No rush.
Fixed with the 0.3.7 release.
Thanks for catching that. I usually remember to add a closing tag to the commit message. And thanks again for the PR.
I'm happy to ... it was a great learning experience and demystified a lot for me. |
I'm working with a dataset that has ~90 million observations, all measured over the same daily window, and this was a massive speedup, so thank you both!
While anytime regularly helps me out with non-standard timestamps, it is of course a bit slow ... as such, I have taken to wrapping it in a function that computes only the unique values and then merges/joins them back into the input data.table.
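The wrapper itself is not shown above; a base R stand-in for the same merge/join idea (using `merge()` on a data.frame rather than a keyed data.table join, with `as.Date` standing in for anytime) might look like:

```r
# Join-based variant: build a lookup table of unique values, parse it
# once, then merge it back and restore the original row order.
parse_via_join <- function(ts, parser) {
  lookup <- data.frame(ts = unique(ts), stringsAsFactors = FALSE)
  lookup$parsed <- parser(lookup$ts)                 # one parse per distinct value
  merged <- merge(data.frame(ts = ts, ord = seq_along(ts),
                             stringsAsFactors = FALSE),
                  lookup, by = "ts", sort = FALSE)
  merged$parsed[order(merged$ord)]                   # back to input order
}

ts <- rep(c("2020-01-01", "2020-01-02"), each = 2)
parse_via_join(ts, function(x) as.Date(x, format = "%Y-%m-%d"))
```

A data.table version would replace the `merge()` with a keyed join, which avoids the explicit row-order bookkeeping.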
Would it be too much to ask that this be done within the package?