Provide aggregation functionality for use with data.table #64
The functionality described in Rdatatable/data.table#1390 allows for aggregation with a neat and concise syntax within data.table:

```r
DT[, .(hourly_mean = mean(x)), .(iperiod = periodize(idate, itime, "hours"))]
```

But there is an issue if a "period" used for aggregation spans an interval that has no observation in the data. @jangorecki: is there a way to provide

The aggregation functions we are developing are also much more general, as they can handle arbitrary "periods" that are not necessarily a ceiling or a floor, that are not necessarily contiguous, that can be overlapping, etc.
Yes, I am aware that gaps in time series are not materialized by use of the IPeriod class. IMO that should be the default, because we want to be memory conservative. It makes perfect sense to have a time series expanded to be contiguous, but in many cases it won't be feasible, as observations can be sparsely located. In such cases there is a risk of freezing the user's machine due to swapping and OOM.

```r
library(data.table)
id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date=as.IDate(id), value=c(1,2,3,4,5,6))
x[.(min(x$date):max(x$date)), on="date", .(date=as.IDate(date), value)]
#          date value
#        <IDat> <num>
# 1: 1970-01-01     1
# 2: 1970-01-02     2
# 3: 1970-01-03     3
# 4: 1970-01-04    NA
# 5: 1970-01-05    NA
# 6: 1970-01-06     4
# 7: 1970-01-07     5
# 8: 1970-01-08    NA
# 9: 1970-01-09     6
```

Another reason why I think the default should not materialize gaps is that nanotime offers high resolution. The higher the resolution, the smaller the periods, the more likely it is to have a zero-count period, and the more memory is wasted.
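A middle ground between the two positions above is to keep the raw observations sparse and materialize gaps only in the aggregated result, which is typically much smaller. A minimal sketch of that idea (not part of the package; the column names and the `sum` aggregation are illustrative assumptions):

```r
library(data.table)
id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date = as.IDate(id), value = c(1,2,3,4,5,6))

# Aggregate only over the observed dates (stays sparse).
agg = x[, .(total = sum(value)), by = date]

# Materialize the gaps in the aggregated result only, by joining
# it onto the full contiguous sequence of periods.
full = data.table(date = as.IDate(min(id):max(id)))
agg[full, on = "date"]
# empty periods appear with total = NA
```

The memory cost of the dense representation is then bounded by the number of periods, not by the resolution of the underlying observations.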
It's a very important topic. "So far" I think we have all mostly used dense chunks (for efficiency) even with irregular time deltas. Using a true sparse representation may be an alternative, but I fear we may all end up rewriting a lot of code. Or am I missing a few magic bullets here?
Seems reasonable. We'll then implement the same functionality here for these functions as
implement floor and ceiling; closes #64
Replicate functionality described in Rdatatable/data.table#1390