Serialization of list columns #20

derekholmes · 2017-02-05T16:12:59Z

This is a great package. It cut my data loading time in half, though with some effort. I frequently group data into lists (e.g. a time series "dataset" with data in a data.table, inventory in a small data.frame, and xts dates/representations) of the form mydata=list("x"=data.table(..), "y"=data.table, "z" = chr), etc.

I was able to write a wrapper around these that writes the component data.tables to separate .fst files, but it would be great if you generalized read and write to more general data structures. Eventually, I think this could really be a replacement for save and load.

MarcusKlik · 2017-02-05T17:46:32Z

Thanks, it's good to see that many people benefit from the fst package in their work! Your request is somewhat similar to the requests made in issue #12. I believe a first step could be to use R's internal serialization mechanism for serializing 'complex types', but with the LZ4 and ZSTD compressors instead of the default compressors for speed. In that case, you would still have random row access to elements in list-type columns. Later, I could optimize further by using fst serialization for list elements of known types inside the list-type columns (recursively), increasing speed further.
Thanks for the request, it's definitely on the list for one of the next versions.
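As a rough illustration of that first step, here is a minimal sketch that pairs R's own serialize() with the ZSTD compressor that fst exposes through compress_fst() and decompress_fst(). It assumes fst 0.8.0 or later is installed; the object `obj` and the compression level are made up for the example. It handles the object as a whole, so it does not provide the random row access discussed above.

```r
# Example object: a list mixing a data.frame and a character value (illustrative only)
obj <- list(x = data.frame(a = 1:5, b = letters[1:5]), z = "label")

# Serialize to a raw vector with R's internal mechanism
raw_vec <- serialize(obj, connection = NULL)

# Compress the raw vector with ZSTD instead of R's default compressors
packed <- fst::compress_fst(raw_vec, compressor = "ZSTD", compression = 50)

# Round trip: decompress, then unserialize back into an R object
restored <- unserialize(fst::decompress_fst(packed))
identical(restored, obj)
```

Passing compressor = "LZ4" instead would trade compression ratio for speed.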

derekholmes · 2018-01-20T14:44:05Z

FWIW, here is the code I wrote to do this. I have a function called cAssign() which takes a possible list of dataframes and either assigns them to a different environment and/or saves them to disk. (This is self-rolled persistence.) The name is passed in as a string, and if one of the data frames is large enough, it is split into a separate file using fst.

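cAssign <- function(x, dbg=TRUE, silent=FALSE, copysilent=FALSE, trace=FALSE,
                    dpath=datapath, nbig=10000, title="", usefst=TRUE) {
  # x: character vector of object names to persist
  # dpath: output directory, must end with a path separator (defaults to the global 'datapath')
  # nbig: data frames with at least this many rows are split off into their own .fst file
  ppp <- lapply(x, function(y) {
    fname <- paste0(dpath, y, ".RD")
    if (usefst) {
      # Look the object up by name in the frame that called cAssign()
      cadtmp <- get(y, pos=parent.frame(n=3))
      if ("list" %in% class(cadtmp)) {
        # Use the list's names, or A1, A2, ... when the list is unnamed
        listonames <- c(names(cadtmp), paste0("A", seq_along(cadtmp)))[seq_along(cadtmp)]
        for (i in seq_along(listonames)) {
          if ("data.frame" %in% class(cadtmp[[i]]) && nrow(cadtmp[[i]]) >= nbig) {
            message(" Splitting ", listonames[[i]], " from ", y)
            newfilename <- paste0(y, "_", listonames[[i]], ".fst")
            write_fst(cadtmp[[i]], paste0(dpath, newfilename), compress=20)
            # Replace the large element with the name of the file that now holds it
            cadtmp[[i]] <- newfilename
          }
        }
      }
      # Save the (possibly slimmed-down) object under its original name
      e1 <- new.env()
      assign(y, cadtmp, envir=e1)
      save(list=c(y), envir=e1, file=fname)
    } else {
      save(list=c(y), file=fname)
    }
    if (!copysilent) {
      message("GlobalAssign and Saving ", y, " to disk as ", y, ".RD (filesize:", file.size(fname), ")")
    }
  })
}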
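The reading side is not shown in the thread. A minimal sketch of what it might look like follows, assuming the layout that cAssign() produces: an .RD file in which each large element has been replaced by the bare name of its .fst file. The function name cLoad() and the small demo around it are hypothetical.

```r
# Hypothetical counterpart to cAssign(); the name cLoad and its arguments are assumptions
cLoad <- function(name, dpath) {
  e1 <- new.env()
  load(paste0(dpath, name, ".RD"), envir = e1)
  obj <- get(name, envir = e1)
  if (is.list(obj) && !is.data.frame(obj)) {
    for (i in seq_along(obj)) {
      el <- obj[[i]]
      # Treat an element as a reference only if it is a single string ending in
      # ".fst" and that file really exists next to the .RD file
      is_ref <- is.character(el) && length(el) == 1 && !is.na(el) &&
        grepl("\\.fst$", el) && file.exists(paste0(dpath, el))
      if (is_ref) obj[[i]] <- fst::read_fst(paste0(dpath, el))
    }
  }
  obj
}

# Demo: mimic the files cAssign() would leave behind, then load them again
dpath <- paste0(tempdir(), "/")
big <- data.frame(a = 1:20000, b = rnorm(20000))
fst::write_fst(big, paste0(dpath, "mydata_x.fst"))
mydata <- list(x = "mydata_x.fst", z = "label")
save(mydata, file = paste0(dpath, "mydata.RD"))
restored <- cLoad("mydata", dpath)
```

read_fst() returns a plain data.frame by default; passing as.data.table = TRUE would return a data.table, which is closer to the original objects described in this thread.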

MarcusKlik · 2018-01-20T22:46:17Z

Hi @derekholmes, thanks for sharing! So basically you need to store a list with several components, the largest of which are data.tables. You would like fst to be able to store a list and, if a list element is a data.table (or vector), still have random access to that structure?

Supporting lists would certainly be possible. To store a table with random access inside that list, the fst format would need to support nested structures. That would be a very interesting and useful feature, I think. The current format could be maintained as is, but when you need a list, you could use a single-column data.table whose one column is of the list type. The same holds for vectors.

The speed of a nested list structure would probably be lower due to additional file-pointer jumps, but when the data.table elements are comparatively large, the effect would be small.

Thanks for your feature request. When the list type is implemented, I'll make sure that the format is prepared for recursive structures as well!
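To make the single-column idea mentioned above concrete, here is a small base R sketch of wrapping an arbitrary list in a one-column table with a list-type column. It only shows the in-memory shape; writing such a column with fst was not supported at the time of this thread (see #71). The names `items` and `wrapped` are illustrative.

```r
# A heterogeneous list like the one described in the first comment (illustrative)
items <- list(x = data.frame(a = 1:3), y = c(2.5, 3.5), z = "label")

# Start from a zero-column table with one row per list element,
# then attach the list itself as a single list-type column
wrapped <- data.frame(row.names = seq_along(items))
wrapped$value <- items

# Each row holds one element of the original list
first_element <- wrapped$value[[1]]
```

With data.table, the equivalent would be a table with one list column; base R is used here only to keep the sketch free of dependencies.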

MarcusKlik self-assigned this Feb 13, 2017

MarcusKlik added the feature request label Feb 22, 2017

MarcusKlik added this to the Format complete milestone Apr 16, 2017

MarcusKlik mentioned this issue Jul 11, 2017

Cannot save tibble where one of the columns is a list of some kind #71

Open

MarcusKlik modified the milestones: Fst package v0.8.0, Fst package v0.9.0 Oct 12, 2017

MarcusKlik mentioned this issue Aug 24, 2019

write_fstrds and read_fstrds functions #210

Closed

MarcusKlik changed the title ~~Great package, would love to see a generalization~~ Allow serialization of list columns Nov 16, 2022

MarcusKlik changed the title ~~Allow serialization of list columns~~ Serialization of list columns Nov 16, 2022
