Feature request: Support Tibbles #12

jeroenjanssens · 2017-01-27T17:09:04Z

Excellent package. I read that it supports data.tables. Would it be possible to also add support for reading fst files as tibbles?

MarcusKlik · 2017-01-27T18:05:17Z

Thanks for the feature request. Would you like to be able to write tibbles to an fst file as well as read them back? Or just get the data returned as a tibble?

jeroenjanssens · 2017-01-31T19:22:52Z

I think that both writing and reading would be great. Would there be a way to store these additional classes (or data.table when appropriate) in the fst file so that, when reading, the correct class is returned? Perhaps after checking whether the tibble or data.table package is available?

MarcusKlik · 2017-01-31T22:25:46Z

Returning the data as a tibble would very well be possible. The challenge with tibbles, however, is the non-basic data types that are allowed in list-type columns (such as S3 classes). But we could use R's native serialization for non-basic types. So the C++ equivalent of:

some_list <- as.list(1:10000)

# Serialize
some_file <- file("test.bin", "wb")
lapply(some_list, serialize, some_file)
close(some_file)

# Unserialize
some_file <- file("test.bin", "rb")
res <- lapply(1:10000, function(x) { unserialize(some_file)})
close(some_file)

In this case each list element is serialized by R's native serialization mechanism. In that way, data could still be accessed randomly and could even be compressed by the LZ4 or ZSTD compressors that fst uses (much faster than the default compressors). Is random (row-) access something that you are interested in for your use cases?

dselivanov · 2017-02-01T16:41:30Z

Data.frames and data.tables also allow complex columns, so there is nothing specific to "tibbles" here. A tibble is just a class attribute which enables pretty printing and a few other minor things.
@MarcusKlik please don't introduce dependency on it :-)

MarcusKlik · 2017-02-01T18:55:24Z

Indeed @dselivanov, there is no intrinsic difference between a data.table and a tibble, except for the class attribute. And the fst package currently doesn't support list-type columns for a data.table either, you are absolutely right! But would support for these columns be interesting (for a tibble or data.table)? The read and write speed for list-type columns would be comparatively low, but it would allow for random access of 'custom types'. I could also just serialize simple attributes (such as the class), so a tibble would be returned as a tibble without creating a dependency on the tibble package.
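To make the idea of serializing the class attribute concrete, here is a minimal base-R sketch. It is only an illustration, not how fst stores anything: the variable names (`saved_class`, `columns`, `restored`) are hypothetical, and the "write" and "read" steps are simulated in memory rather than against an fst file.

```r
# Hypothetical sketch: keep the class attribute as metadata, restore it on read.
# A data.frame carrying the tibble classes; the tibble package is never loaded.
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
class(df) <- c("tbl_df", "tbl", "data.frame")

# "Write": split the table into plain columns plus the class vector.
saved_class <- class(df)
columns <- as.list(df)  # named list of column vectors, no class attribute

# "Read": rebuild a data.frame from the columns, then reapply the stored class.
restored <- as.data.frame(columns, stringsAsFactors = FALSE)
class(restored) <- saved_class
```

The restored object inherits from `tbl_df` without tibble being attached; the tibble print method would only apply once that package is loaded.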

dselivanov · 2017-02-01T19:00:35Z

Support for complex columns would definitely be a very nice feature (though I realize that speed will suffer a lot). Actually, I asked for similar functionality in feather here.
I think it is a very good idea to use the built-in serializer; I thought it would be much harder to support complex types.

MarcusKlik · 2017-02-01T19:34:46Z

Yes, nice. I can use R's internal serialize method to serialize each list element to a raw vector and compress with LZ4 and ZSTD from there (and then write to a fst file). Thanks!
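As a rough sketch of that approach in plain R: each element is serialized to a raw vector and compressed on its own, and an offset index allows any single element to be read back without touching the others. This assumes base R's `memCompress()` with gzip as a stand-in for LZ4/ZSTD, and the layout (`payload`, `offsets`, `sizes`, `read_element`) is hypothetical, not the fst format.

```r
# Hypothetical sketch: per-element serialization with an offset index.
some_list <- list(1:3, letters, list(a = 1, b = "x"), as.Date("2017-02-01"))

# Serialize each element to a raw vector and compress it independently.
# gzip stands in for the LZ4/ZSTD compressors that fst uses.
blobs <- lapply(some_list, function(el) {
  memCompress(serialize(el, connection = NULL), type = "gzip")
})

# Concatenate the compressed blobs and remember where each one starts.
sizes <- lengths(blobs)
offsets <- cumsum(c(0, head(sizes, -1)))
payload <- do.call(c, blobs)

# Random access: decompress and unserialize only the requested element.
read_element <- function(i) {
  chunk <- payload[(offsets[i] + 1):(offsets[i] + sizes[i])]
  unserialize(memDecompress(chunk, type = "gzip"))
}

third <- read_element(3)
```

Compressing elements one by one keeps them independently readable, at the cost of a worse compression ratio than compressing the whole column at once.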

jeroenjanssens · 2017-02-03T14:56:26Z

I must confess that I'm not in a position to properly comment on this. Perhaps @hadley, the author of tibble, has some thoughts?

MarcusKlik · 2017-11-11T23:45:45Z

Hi @jeroenjanssens, over the last few months the core C++ code of the fst package has moved to a separate and completely independent library (fstlib). The fstlib library has no dependencies on the R API, and therefore specific R classes like data.table, tibble or data.frame have no special meaning in the fst format.

For that reason, I can't honor your request for storing the specific table type inside the fst format at this point. However, you won't lose any speed over this, because converting a data.frame to a tibble with the as.tibble method is very efficient, both in terms of speed and memory:

library(pryr)
library(tibble)

mem_used()
#> 34.2 MB

df <- data.frame(x = 1:100000000)  # 400 MB vector
mem_used()
#> 434 MB

df_tibble <- tibble::as.tibble(df)
mem_used()
#> 435 MB

You can see that the cast to tibble is just a shallow copy of the actual data.frame. While the tibble gets a new memory address, the actual data doesn't:

address(df)
#> [1] "0x17ad2850"
address(df_tibble)
#> [1] "0x105706e0"

x_vec <- df$x
x_vec2 <- df_tibble$x

address(x_vec)
#> [1] "0x7ff5e7cb0010"
address(x_vec2)
#> [1] "0x7ff5e7cb0010"

Also, in terms of speed, the cast is very efficient:

library(microbenchmark)

median(microbenchmark(
  df <- as.tibble(df)
)$time)
#> [1] 3285

That's just over 3 microseconds for the cast (microbenchmark reports times in nanoseconds), which is very fast. To make a long story short, you can effectively get a tibble from fst by using:

library(fst)

write_fst(df, "df.fst")
df_tibble <- as.tibble(read_fst("df.fst"))

Hope that will be sufficient for your purposes, thanks a lot for filing your feature request!

jeroenjanssens · 2017-11-13T09:47:35Z

Thanks for getting back to this. This is a very reasonable solution. Thanks!

MarcusKlik self-assigned this Feb 1, 2017

fstpackage added this to the Binary format milestone Feb 1, 2017

MarcusKlik mentioned this issue Feb 5, 2017

Serialization of list columns #20

Open

MarcusKlik mentioned this issue Feb 21, 2017

data.table Object Class not Saved #27

Closed

MarcusKlik added the feature request label Mar 10, 2017

MarcusKlik modified the milestones: Interface, Binary format Apr 16, 2017

MarcusKlik mentioned this issue May 1, 2017

Restore data.table class on read #57

Closed

MarcusKlik mentioned this issue Jul 11, 2017

Cannot save tibble where one of the columns is a list of some kind #71

Open

MarcusKlik modified the milestones: Fst package v0.8.0, Interface Jul 11, 2017

jeroenjanssens closed this as completed Nov 13, 2017

rgayler mentioned this issue Jan 8, 2018

Documentation: add explicit comment to write_fst to say what it won't do? #120

Open
