Serialization of list columns #20

Open
derekholmes opened this issue Feb 5, 2017 · 3 comments

@derekholmes

This is a great package. It cut my data loading time in half, though with some effort. I frequently group data into lists (e.g. a time series "dataset" with data in a data.table, inventory in a small data.frame, and xts dates/representations) of the form mydata=list("x"=data.table(..), "y"=data.table, "z" = chr) etc.

I was able to write a wrapper around these that writes the component data.tables to separate .fst files, but it would be great if read and write were extended to more general data structures. Eventually, I think this could really be a replacement for save and load.

@MarcusKlik
Collaborator

Thanks, it's good to see that many people benefit from the fst package in their work! Your request is somewhat similar to the requests made in issue #12. I believe a first step could be to use R's internal serialization mechanism for serializing 'complex types', but with the LZ4 and ZSTD compressors instead of the default compressors for speed. In that case, you would still have random row access to elements in list-type columns. Later, I could optimize further by using fst serialization for list elements of known types inside the list-type columns (recursively), which would increase speed again.
Thanks for the request, it's definitely on the list for one of the next versions.
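
As a rough illustration of that first step (not the package's implementation), the sketch below serializes an arbitrary R object with base R's `serialize()` and then compresses the resulting raw vector. It uses `fst::compress_fst()` with ZSTD when fst is installed and falls back to base R's gzip otherwise; the helper names `pack_object()` and `unpack_object()` are hypothetical and exist only for this example.

```r
# Serialize any R object to a raw vector, then compress it.
# pack_object()/unpack_object() are hypothetical helpers for illustration only.
pack_object <- function(obj) {
  raw_bytes <- serialize(obj, connection = NULL)
  if (requireNamespace("fst", quietly = TRUE)) {
    # fst's in-memory compressor (ZSTD)
    list(codec = "fst-zstd",
         data = fst::compress_fst(raw_bytes, compressor = "ZSTD"))
  } else {
    # Stand-in when fst is not installed: base R gzip
    list(codec = "gzip", data = memCompress(raw_bytes, type = "gzip"))
  }
}

unpack_object <- function(packed) {
  raw_bytes <- if (packed$codec == "fst-zstd") {
    fst::decompress_fst(packed$data)
  } else {
    memDecompress(packed$data, type = "gzip")
  }
  unserialize(raw_bytes)
}

# Round trip a mixed list similar to the one described in the request
mydata <- list(x = data.frame(a = 1:3), z = "label")
restored <- unpack_object(pack_object(mydata))
```

Packing each list element separately, rather than the whole list at once, is what would keep individual elements addressable.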

@MarcusKlik MarcusKlik self-assigned this Feb 13, 2017
@MarcusKlik MarcusKlik added this to the Format complete milestone Apr 16, 2017
@MarcusKlik MarcusKlik modified the milestones: Fst package v0.8.0, Fst package v0.9.0 Oct 12, 2017
@derekholmes
Author

derekholmes commented Jan 20, 2018

FWIW, here is the code I wrote to do this. I have a function called cAssign() which takes a possible list of dataframes and either assigns them to a different environment and/or saves them to disk. (This is self-rolled persistence.) The name is passed in as a string, and if one of the data frames is large enough, it is split into a separate file using fst.

cAssign <- function(x, dbg=TRUE, silent=FALSE, copysilent=FALSE, trace=FALSE,
                    dpath=datapath, nbig=10000, title="", usefst=TRUE) {
  # x is a character vector of object names; each object is saved to <dpath><name>.RD
  ppp <- lapply(x, function(y) {
    fname <- paste0(dpath, y, ".RD")
    if (usefst) {
      # Look up the object by name, three frames up the call stack
      cadtmp <- get(y, pos=parent.frame(n=3))
      if ("list" %in% class(cadtmp)) {
        # Use the element names, or A1, A2, ... when the list is unnamed
        listonames <- c(names(cadtmp), paste0("A", seq_along(cadtmp)))[seq_along(cadtmp)]
        for (i in seq_along(listonames)) {
          if ("data.frame" %in% class(cadtmp[[i]]) && nrow(cadtmp[[i]]) >= nbig) {
            # Write large data frames to their own .fst file and keep only the file name
            message(" Splitting ", listonames[[i]], " from ", y)
            newfilename <- paste0(y, "_", listonames[[i]], ".fst")
            write.fst(cadtmp[[i]], paste0(dpath, newfilename), compress=20)
            cadtmp[[i]] <- newfilename
          }
        }
      }
      # Save the (possibly slimmed-down) object under its original name
      e1 <- new.env()
      assign(y, cadtmp, envir=e1)
      save(list=c(y), envir=e1, file=fname)
    } else {
      save(list=c(y), file=fname)
    }
    if (!copysilent) {
      message("GlobalAssign and Saving ", y, " to disk as ", y, ".RD (filesize:", file.size(fname), ")")
    }
  })
}

@MarcusKlik
Collaborator

Hi @derekholmes, thanks for sharing! So basically you need to store a list with several components, the largest of which are data.tables. You would like fst to be able to store a list and, if a list element is a data.table (or vector), still have random access to that structure?

Supporting lists would certainly be possible. To store a table with random access inside that list, the fst format would need to support nested structures. That would be a very interesting and useful feature, I think. The current format could be maintained as is: when you need a list, you can use a single-column data.table whose one column is of the list type. The same holds for vectors.

The speed of a nested list structure would probably be lower due to additional file-pointer jumps, but when the data.table elements are comparatively large, the effect would be small.

Thanks for your feature request. When the list type is implemented, I'll make sure that the format is prepared for recursive structures as well!
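
To make the random-access idea and the cost of file-pointer jumps concrete, here is a minimal base R sketch. It is not the fst format: it writes each list element as its own serialized blob, records byte offsets in an index, and reads a single element back with one `seek()`. The function names `write_list_blobs()` and `read_list_element()` and the file layout are assumptions made for this example.

```r
# Write each list element as a separate serialized blob and record where it
# starts, so one element can be read without loading the others.
# Hypothetical helpers; assumes a named list.
write_list_blobs <- function(lst, path) {
  blobs <- lapply(lst, serialize, connection = NULL)
  sizes <- vapply(blobs, length, integer(1))
  con <- file(path, "wb")
  on.exit(close(con))
  for (b in blobs) writeBin(b, con)
  # Offsets are 0-based byte positions of each blob in the file
  data.frame(name = names(lst),
             offset = cumsum(c(0, head(sizes, -1))),
             size = sizes)
}

read_list_element <- function(path, index, i) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = index$offset[i])  # the file-pointer jump
  unserialize(readBin(con, what = "raw", n = index$size[i]))
}

mydata <- list(x = data.frame(a = 1:5),
               y = data.frame(b = letters[1:3]),
               z = "label")
path <- tempfile(fileext = ".bin")
index <- write_list_blobs(mydata, path)
y_only <- read_list_element(path, index, 2)  # reads only element "y"
```

Each element costs one extra jump, which is why the overhead matters less when the elements are large relative to the seek.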

@MarcusKlik MarcusKlik changed the title Great package, would love to see a generalization Allow serialization of list columns Nov 16, 2022
@MarcusKlik MarcusKlik changed the title Allow serialization of list columns Serialization of list columns Nov 16, 2022