Provide aggregation functionality for use with data.table #64

lsilvest · 2020-04-04T04:43:26Z

Replicate functionality described in Rdatatable/data.table#1390

lsilvest · 2020-04-08T06:02:03Z

The functionality described in Rdatatable/data.table#1390 allows for aggregation with a neat and concise syntax within the data.table framework:

DT[, .(hourly_mean = mean(x)), .(iperiod = periodize(idate, itime, "hours"))]

But there is an issue if the "period" used for aggregation spans an interval that has no observations in DT: such an interval is dropped from the result. In some situations that may be what is wanted, but if the operation is a count rather than a mean, for example, the result is simply incorrect, since the empty interval should be reported with a count of 0. So it does not seem desirable for this to be the default behavior.

@jangorecki: is there a way to provide data.table with a grouping such that periods with 0 observations are taken into account?
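To make the question concrete, here is a minimal sketch of the dropped-interval problem using only existing data.table features. The table, the `hour` column and the variable names are made up for illustration, and a plain integer hour index stands in for the proposed `periodize()` grouping.

```r
library(data.table)

# Hypothetical data: observations fall in hours 0, 1 and 4; hours 2 and 3 are empty.
DT <- data.table(hour = c(0L, 0L, 1L, 4L), x = c(1, 2, 3, 4))

# Plain grouping only produces rows for hours that occur in the data,
# so the empty hours 2 and 3 are silently missing from the counts.
counts <- DT[, .(n = .N), by = hour]
print(counts)       # 3 rows: hours 0, 1, 4

# Joining against the complete sequence and evaluating j once per row of i
# (by = .EACHI) keeps the empty hours and reports them with a count of 0.
all_hours <- data.table(hour = 0:4)
full_counts <- DT[all_hours, on = "hour", .(n = .N), by = .EACHI]
print(full_counts)  # 5 rows: n is 2, 1, 0, 0, 1
```

The second form requires materializing the full sequence of periods, which is exactly the memory trade-off discussed in the replies.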

The aggregation functions we are developing are also much more general as they can handle arbitrary "periods" that are not necessarily a ceiling or a floor, that are not necessarily contiguous, that can be overlapping, etc.

jangorecki · 2020-04-08T09:56:05Z

Yes, I am aware that the IPeriod class does not materialize gaps in a time series. IMO that should be the default, because we want to be memory-conservative. It makes perfect sense to expand a time series so that it is contiguous, but in many cases that won't be feasible because observations can be sparsely located. In such cases there is a risk of freezing the user's machine through swapping and OOM.
Moreover, irregular periods can be handled gracefully by the user. For example, a function like the one described in Rdatatable/data.table#3241 would allow computing rolling statistics on sparse periods without having to materialize the gaps, potentially saving a lot of memory.
The current solution in data.table is basically to explode the dataset by joining it to the complete sequence, and then use the result as input. This is memory-inefficient, but it can be combined with a rolling join, which gives an easy way to fill missing values in the gaps with LOCF (just by adding roll=TRUE!) or NOCB.

library(data.table)
id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date=as.IDate(id), value=c(1,2,3,4,5,6))
x[.(min(x$date):max(x$date)), on="date", .(date=as.IDate(date), value)]
#         date value
#       <IDat> <num>
#1: 1970-01-01     1
#2: 1970-01-02     2
#3: 1970-01-03     3
#4: 1970-01-04    NA
#5: 1970-01-05    NA
#6: 1970-01-06     4
#7: 1970-01-07     5
#8: 1970-01-08    NA
#9: 1970-01-09     6
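The rolling-join variant mentioned above can be sketched as follows. This is an illustration built on the same toy data, not code from the thread; the names `full`, `locf` and `nocb` are made up here. `roll = TRUE` carries the last observation forward, and `roll = -Inf` carries the next observation backward.

```r
library(data.table)

id <- c(0L, 1L, 2L, 5L, 6L, 8L)
x <- data.table(date = as.IDate(id), value = c(1, 2, 3, 4, 5, 6))

# Complete daily sequence covering the observed range (illustrative name).
full <- data.table(date = as.IDate(min(id):max(id)))

# LOCF: each missing date takes the value of the previous observed date.
locf <- x[full, on = "date", roll = TRUE]
print(locf$value)  # 1 2 3 3 3 4 5 5 6

# NOCB: each missing date takes the value of the next observed date.
nocb <- x[full, on = "date", roll = -Inf]
print(nocb$value)  # 1 2 3 4 4 4 5 6 6
```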

Another reason why I think the default should not materialize gaps is that nanotime offers high resolution. The higher the resolution, the smaller the periods, the more likely it is to have a period with a count of 0, and the more memory is wasted.
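A back-of-the-envelope calculation illustrates the point. The sketch below assumes a single day of data materialized as one 8-byte timestamp per period and ignores every other column and overhead; the variable names are illustrative only.

```r
# Number of periods in one day at different resolutions.
seconds_per_day <- 24 * 60 * 60
periods_per_second <- c(second = 1, millisecond = 1e3,
                        microsecond = 1e6, nanosecond = 1e9)
n_periods <- seconds_per_day * periods_per_second

# Memory needed just for the timestamps, assuming 8 bytes each, in GiB.
gib <- n_periods * 8 / 2^30

print(data.frame(n_periods = n_periods, GiB = signif(gib, 3)))
```

At one-second resolution a dense day is only 86,400 rows, but each factor of 1000 in resolution multiplies the row count by 1000, so a fully materialized grid quickly stops fitting in memory.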

eddelbuettel · 2020-04-08T12:58:06Z

It's a very important topic. "So far" I think we all mostly used dense chunks (for efficiency) even with irregular time deltas. Using a true sparse representation may be an alternative, but I fear we may all end up rewriting a lot of code. Or am I missing a few magic bullets here?

lsilvest · 2020-04-10T05:12:22Z

Seems reasonable. We'll then implement here the same functionality for these functions as IPeriod and xts provide. We'll also finalise the current alignment functions in their most general form. Then we can revisit how best to take advantage of the existing functionality of data.table.
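As a rough illustration of what floor/ceiling alignment gives when used as a data.table grouping, here is a minimal sketch. The helpers `floor_to()` and `ceiling_to()` are hypothetical stand-ins written for this example, not the functions implemented in this package, and they operate on plain numeric seconds rather than nanotime values.

```r
library(data.table)

# Hypothetical helpers (not this package's API): align numeric timestamps,
# in seconds since the epoch, to a fixed-width grid.
floor_to   <- function(t, width) t - (t %% width)
ceiling_to <- function(t, width) -floor_to(-t, width)

# Made-up observations spread over three hours.
DT <- data.table(
  time = c(10, 1800, 3600, 3700, 7300),  # seconds since the epoch
  x    = c(1, 3, 10, 20, 5)
)

# Group by the start of the hour each observation falls into.
hourly <- DT[, .(hourly_mean = mean(x)),
             by = .(hour_start = floor_to(time, 3600))]
print(hourly)  # hour_start 0, 3600, 7200 with means 2, 15, 5

# The ceiling variant labels each observation with the end of its hour.
print(ceiling_to(DT$time, 3600))  # 3600 3600 3600 7200 10800
```

As with IPeriod, only hours that contain observations appear in the result; empty hours are not materialized.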

implement floor and ceiling; closes #64

lsilvest mentioned this issue Apr 4, 2020

Time intervals integer data type IPeriod Rdatatable/data.table#1390

Closed

lsilvest added a commit that referenced this issue Apr 26, 2020

implement floor and ceiling; closes #64

75d0a44

eddelbuettel closed this as completed in 5d18004 Apr 27, 2020

eddelbuettel added a commit that referenced this issue Apr 27, 2020

Merge pull request #65 from eddelbuettel/feature/rounding

fe1d20d

implement floor and ceiling; closes #64
