Provide aggregation functionality for use with data.table #64

Closed
lsilvest opened this issue Apr 4, 2020 · 4 comments

Comments

@lsilvest
Collaborator

lsilvest commented Apr 4, 2020

Replicate functionality described in Rdatatable/data.table#1390

@lsilvest
Collaborator Author

lsilvest commented Apr 8, 2020

The functionality described in Rdatatable/data.table#1390 allows for aggregation with a neat and concise syntax within the data.table framework:

DT[, .(hourly_mean = mean(x)), .(iperiod = periodize(idate, itime, "hours"))]

There is an issue, however, when the periods used for aggregation include an interval that has no observations in DT: such an interval is dropped from the result. In some situations that could be what is wanted, but if the operation is a count rather than a mean, for example, the result is incorrect because the zero-count interval is missing. Dropping empty intervals therefore does not seem desirable as the default behavior.
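
To make the dropped-interval problem concrete, here is a minimal sketch using plain data.table. `floor_hour()` is a hypothetical stand-in for `periodize()` (which is only proposed in Rdatatable/data.table#1390), and the toy data are invented for illustration.

```r
library(data.table)

# Toy data (invented): observations fall in hours 0, 1 and 3; hour 2 has none.
DT <- data.table(
  time = as.POSIXct(c(10, 20, 3700, 11000), origin = "1970-01-01", tz = "UTC"),
  x    = c(1, 2, 3, 4)
)

# Hypothetical stand-in for periodize(): floor each timestamp to its hour.
floor_hour <- function(t) {
  as.POSIXct(floor(as.numeric(t) / 3600) * 3600,
             origin = "1970-01-01", tz = "UTC")
}

# Count observations per hour. Grouping only sees hours that occur in DT,
# so the 02:00 hour is absent from the result instead of showing n = 0.
counts <- DT[, .(n = .N), by = .(hour = floor_hour(time))]
print(counts)
```

For a mean, a missing row and an `NA` row carry much the same information; for a count, the missing row hides a genuine zero.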

@jangorecki: is there a way to provide data.table with a grouping such that periods with zero observations are taken into account?

The aggregation functions we are developing are also much more general as they can handle arbitrary "periods" that are not necessarily a ceiling or a floor, that are not necessarily contiguous, that can be overlapping, etc.
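
As a rough illustration of what aggregating over arbitrary periods means, the sketch below uses a data.table non-equi join. It is not the implementation being developed here: the tables `obs` and `periods`, the integer time axis, and the half-open `[start, end)` convention are all assumptions made for the example.

```r
library(data.table)

# Invented observations on an integer time axis.
obs <- data.table(t = c(1L, 2L, 5L, 6L, 8L), x = c(10, 20, 30, 40, 50))

# Hypothetical periods [start, end): the first two overlap, there is a gap
# between 7 and 20, and the third period contains no observation at all.
periods <- data.table(start = c(0L, 2L, 20L), end = c(6L, 7L, 30L))

# Non-equi join: one result row per period, including the empty one,
# because by = .EACHI evaluates j once for every row of `periods`.
agg <- obs[periods, on = .(t >= start, t < end),
           .(n = .N, total = sum(x, na.rm = TRUE)), by = .EACHI]

# The two join columns both come back named "t"; rename them for clarity.
setnames(agg, c("start", "end", "n", "total"))
print(agg)
```

An observation that falls into two overlapping periods contributes to both, and the driver table, rather than the data, decides which periods appear in the output.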

@jangorecki

jangorecki commented Apr 8, 2020

Yes, I am aware that gaps in time series are not materialized when using the IPeriod class. IMO that should be the default, because we want to be memory conservative. It makes perfect sense to have a time series expanded to be contiguous, but in many cases that won't be feasible because observations can be sparsely located. In such cases there is a risk of freezing the user's machine due to swapping and OOM.
Moreover, irregular periods can be handled gracefully by the user. For example, a function like the one described in Rdatatable/data.table#3241 would allow computing rolling statistics on sparse periods without having to materialize the gaps, potentially saving a lot of memory.
The current solution in data.table is basically to explode the dataset by joining to a complete sequence, and then use the result as input. This is memory inefficient, but it can be combined with a rolling join, giving an easy way to fill missing values in gaps with LOCF (just by adding roll=TRUE!) or NOCB.

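library(data.table)
# Sparse daily series: days 3, 4 and 7 (counted from the epoch) are missing
id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date=as.IDate(id), value=c(1,2,3,4,5,6))
# Join to the complete day sequence; days without a match get value = NA
x[.(min(x$date):max(x$date)), on="date", .(date=as.IDate(date), value)]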
#         date value
#       <IDat> <num>
#1: 1970-01-01     1
#2: 1970-01-02     2
#3: 1970-01-03     3
#4: 1970-01-04    NA
#5: 1970-01-05    NA
#6: 1970-01-06     4
#7: 1970-01-07     5
#8: 1970-01-08    NA
#9: 1970-01-09     6
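
Building on the same toy data, the following sketch shows the rolling-join variants mentioned above, plus a zero-aware count. The helper table `full` is a name introduced here for illustration; `roll = TRUE`, `roll = -Inf` and `by = .EACHI` are standard data.table arguments.

```r
library(data.table)

id <- c(0L, 1L, 2L, 5L, 6L, 8L)
x <- data.table(date = as.IDate(id), value = c(1, 2, 3, 4, 5, 6))

# Complete daily sequence covering the observed range (materializes the gaps).
full <- data.table(date = as.IDate(min(id):max(id)))

# LOCF: each missing day takes the last observed value before it.
locf <- x[full, on = "date", roll = TRUE]

# NOCB: each missing day takes the next observed value after it.
nocb <- x[full, on = "date", roll = -Inf]

# Zero-aware count: by = .EACHI evaluates j once per row of `full`,
# and .N is 0 for days that have no matching observation.
counts <- x[full, on = "date", .(n = .N), by = .EACHI]

print(locf)
print(nocb)
print(counts)
```

All three results have one row per day in `full`, which is exactly the memory cost being discussed: the output grows with the length of the sequence, not with the number of observations.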

Another reason why I think the default should not materialize gaps is that nanotime offers high resolution. The higher the resolution, the smaller the periods, the more likely a period is to have a count of 0, and the more memory is wasted.
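
A back-of-the-envelope calculation illustrates the scaling. It assumes a single 8-byte column (nanotime values are stored as 64-bit integers) and one contiguous day of periods; real tables with more columns would need correspondingly more.

```r
# Length of one day and of each candidate period, in nanoseconds.
day_ns <- 24 * 3600 * 1e9
period_ns <- c(hour = 3600e9, second = 1e9, millisecond = 1e6,
               microsecond = 1e3, nanosecond = 1)

# Number of periods needed to cover one day without gaps.
n_periods <- day_ns / period_ns

# Lower bound on memory for one 8-byte column, in GiB.
gib <- n_periods * 8 / 2^30

print(data.frame(period = names(period_ns), n_periods = n_periods, GiB = gib))
```

Hourly and per-second periods are negligible, but each finer step multiplies the footprint by 1000, so a fully materialized day at microsecond resolution already needs hundreds of GiB for the index column alone.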

@eddelbuettel
Owner

It's a very important topic. "So far" I think we have all mostly used dense chunks (for efficiency), even with irregular time deltas. Using a true sparse representation may be an alternative, but I fear we may all end up rewriting a lot of code. Or am I missing a few magic bullets here?

@lsilvest
Collaborator Author

Seems reasonable. We'll then implement here the same functionality for these functions as IPeriod and xts provide. We'll also get the current alignment functions finalised in their most general form. Then we can revisit how best to take advantage of some of the existing functionality of data.table.

lsilvest added a commit that referenced this issue Apr 26, 2020
eddelbuettel added a commit that referenced this issue Apr 27, 2020
implement floor and ceiling; closes #64
3 participants