Feature request: Support Tibbles #12

Closed
jeroenjanssens opened this issue Jan 27, 2017 · 10 comments

@jeroenjanssens

Excellent package. I read that it supports data.tables. Would it be possible to also add support for reading fst files as tibbles?

@MarcusKlik
Collaborator

Thanks for the feature request. Would you like to be able to write a tibble to a fst file as well as read one back? Or just get the data returned as a tibble?

@jeroenjanssens
Author

I think that both writing and reading would be great. Would there be a way to store these additional classes (or data.table when appropriate) in the fst file so that, when reading, the correct class is returned? Perhaps by checking whether the tibble or data.table package is available?

@MarcusKlik
Collaborator

Returning the data as a tibble would very well be possible. The challenge with tibbles, however, is the non-basic data types that are allowed in list-type columns (such as S3 classes). But we could use R's native serialization for non-basic types. So the C++ equivalent of:

some_list <- as.list(1:10000)

# Serialize
some_file <- file("test.bin", "wb")
lapply(some_list, serialize, some_file)
close(some_file)

# Unserialize
some_file <- file("test.bin", "rb")
res <- lapply(1:10000, function(x) { unserialize(some_file)})
close(some_file)

In this case each list element is serialized by R's native serialization mechanism. In that way, data could still be accessed randomly and could even be compressed by the LZ4 or ZSTD compressors that fst uses (much faster than the default compressors). Is random (row-) access something that you are interested in for your use cases?

@dselivanov
Contributor

dselivanov commented Feb 1, 2017

Data.frames and data.tables also allow complex columns, so there is nothing specific to "tibbles". A tibble is just a class attribute which allows pretty printing and a few other minor things.
@MarcusKlik please don't introduce a dependency on it :-)

@MarcusKlik
Collaborator

MarcusKlik commented Feb 1, 2017

Indeed @dselivanov, there is no intrinsic difference between a data.table and a tibble, except for the class attribute. And the fst package currently also doesn't support list-type columns for a data.table, you are absolutely right! But would support for these columns be interesting (for a tibble or data.table)? The read and write speed for list-type columns would be comparatively low, but it would allow for random access of 'custom types'. I could also just serialize simple attributes (such as the class), so a tibble would be returned as a tibble without creating a dependency on the tibble package.
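
A minimal base R sketch of that last idea, storing the class attribute and restoring it on read. It only illustrates the principle and is not how fst stores anything; the variable names are made up for illustration, and the class vector is the one the tibble package assigns to its tables:

```r
# A data.frame carrying tibble's class vector, built without loading tibble
tbl_like <- structure(
  data.frame(x = 1:3),
  class = c("tbl_df", "tbl", "data.frame")
)

# "Write": keep the class attribute next to the plain column data
stored_class <- class(tbl_like)
stored_data <- tbl_like
class(stored_data) <- "data.frame"

# "Read": rebuild the table and put the stored class back
restored <- stored_data
class(restored) <- stored_class

inherits(restored, "tbl_df")
```

The restored object only prints as a tibble when the tibble package happens to be installed; otherwise it falls back to the data.frame methods.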

@dselivanov
Contributor

dselivanov commented Feb 1, 2017

Support for complex columns would definitely be a very nice feature (but I realize that speed will suffer a lot). Actually I asked for similar functionality in feather here.
I think it is a very good idea to use the built-in serializer; I thought it would be much harder to support complex types.

@MarcusKlik
Collaborator

MarcusKlik commented Feb 1, 2017

Yes, nice. I can use R's internal serialize method to serialize each list element to a raw vector and compress with LZ4 and ZSTD from there (and then write to a fst file). Thanks!
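
A rough base R sketch of this per-element approach, assuming gzip (via memCompress) as a stand-in for the LZ4 and ZSTD compressors, which base R does not provide. pack_element and unpack_element are hypothetical helper names, not part of fst:

```r
# Serialize one list element to a raw vector, then compress it
pack_element <- function(x) {
  memCompress(serialize(x, NULL), type = "gzip")
}

# Reverse the two steps for a single stored element
unpack_element <- function(blob) {
  unserialize(memDecompress(blob, type = "gzip"))
}

some_list <- list(
  1:10,
  letters,
  structure(list(a = 1), class = "my_s3")  # custom S3 object
)

# Each element becomes its own independent compressed block
blobs <- lapply(some_list, pack_element)

# Random access: only the third block is decompressed and unserialized
third <- unpack_element(blobs[[3]])
class(third)
```

Because every element is packed on its own, a reader can fetch a single row of a list-type column without touching the other blocks, at the cost of a lower compression ratio than compressing the column as a whole.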

MarcusKlik self-assigned this Feb 1, 2017
fstpackage added this to the Binary format milestone Feb 1, 2017

@jeroenjanssens
Author

I must confess that I'm not in a position to properly comment on this. Perhaps @hadley, the author of tibble, has some thoughts?

@MarcusKlik
Collaborator

Hi @jeroenjanssens, over the last months, the core C++ code of the fst package has moved to a separate and completely independent library (fstlib). The fstlib library has no dependencies on the R API and therefore specific R classes like data.table, tibble or data.frame have no special meaning in the fst format.

For that reason, I can't honor your request for storing the specific table type inside the fst format at this point. However, you won't lose any speed over this, because converting a data.frame to a tibble with the as.tibble method is very efficient, both in terms of speed and memory:

library(pryr)
library(tibble)

mem_used()
#> 34.2 MB

df <- data.frame(x = 1:100000000)  # 400 MB vector
mem_used()
#> 434 MB

df_tibble <- tibble::as.tibble(df)
mem_used()
#> 435 MB

You can see that the cast to tibble is just a shallow copy of the actual data.frame. While the tibble gets a new memory address, the actual data doesn't:

address(df)
#> [1] "0x17ad2850"
address(df_tibble)
#> [1] "0x105706e0"

x_vec <- df$x
x_vec2 <- df_tibble$x

address(x_vec)
#> [1] "0x7ff5e7cb0010"
address(x_vec2)
#> [1] "0x7ff5e7cb0010"

Also, in terms of speed, the cast is very efficient:

library(microbenchmark)

median(microbenchmark(
  df <- as.tibble(df)
)$time)
#> [1] 3285

That's just 3 microseconds for the cast (microbenchmark reports times in nanoseconds), very fast. To make a long story short, you can effectively get a tibble from fst by using:

library(fst)

write_fst(df, "df.fst")
df_tibble <- as.tibble(read_fst("df.fst"))

Hope that will be sufficient for your purposes, thanks a lot for filing your feature request!

@jeroenjanssens
Author

Thanks for getting back to this. This is a very reasonable solution. Thanks!
