
Feature request: Append (rows) = T #9

Open · cawthm opened this issue Jan 24, 2017 · 3 comments

Comments


cawthm commented Jan 24, 2017

An append = TRUE option for adding rows to an existing file is standard with utils::write.csv and readr::write_csv.

Great FAST package here, thank you!
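
For reference, a minimal sketch of the requested behaviour: the readr calls below are real, while the fst append argument is the hypothetical feature being requested here, not existing API:

```r
library(readr)

# Appending rows is a one-argument affair for csv writers:
write_csv(head(iris, 3), "data.csv")                 # header + first rows
write_csv(tail(iris, 3), "data.csv", append = TRUE)  # rows only, no header

# The requested equivalent for fst (hypothetical, not implemented):
# fst::write.fst(tail(iris, 3), "data.fst", append = TRUE)
```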

MarcusKlik (Collaborator) commented Jan 24, 2017

Thanks a lot! The csv format is row-oriented on disk, while the fst format is a column-oriented binary format, so it's somewhat easier to append data to a csv file than to a format like fst.
I could implement an append method, but that would mean storing multiple blocks of column-based data one after another, which carries a performance hit if append is called many times (say, every 10 rows). I can see three options:

  • Allow appends unconditionally; the user should understand that each append carries a performance hit.
  • Let the user specify a maximum number of appends when the file is first written; in that case the performance hit stays small (the default would be 0 appends).
  • The user writes the data to separate fst files, and read.fst accepts a vector of filenames which are then read as a single table (a sketch of this workflow follows after this list).

Which option would best suit your use case?
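
A minimal sketch of how the third option can be approximated today, assuming the current write.fst/read.fst API; the single-call read.fst(files) variant is the proposed feature, not existing behaviour:

```r
library(fst)

# Write each chunk of data to its own fst file:
chunks <- split(iris, rep(1:3, length.out = nrow(iris)))
files  <- sprintf("chunk_%d.fst", seq_along(chunks))
Map(write.fst, chunks, files)

# Read them back and combine into a single table; the proposed
# read.fst(files) with a vector of filenames would do this in one call:
combined <- do.call(rbind, lapply(files, read.fst))
```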

@fstpackage added this to the "Multiple chunks in binary format" milestone on Feb 1, 2017
edwindj commented Feb 3, 2017

Thanks for your wonderful package!
I have the same request: appending to an fst file, to support the following workflow:

  • Process large csv/text files that do not fit in RAM (I'm the author of the R package chunked) and store the result in an fst file.
  • Because fst allows for partial reading, this would be a nice option.

What about a fourth option:

  • Write to multiple fst files, and add a function that merges multiple fst files with identical metadata into a single fst file (see the sketch after this list).
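
A minimal sketch of that workflow under stated assumptions: the chunk loop uses base R, the file names and chunk size are illustrative, and merge_fst is the hypothetical merge function proposed here, not part of the package:

```r
library(fst)

# Read a large csv in pieces, process each piece, and write each
# result to its own fst file:
con <- file("big.csv", open = "r")
header <- readLines(con, n = 1)
i <- 0
repeat {
  lines <- readLines(con, n = 1e5)  # next 100k data rows
  if (length(lines) == 0) break
  chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
  # ... process chunk here ...
  i <- i + 1
  write.fst(chunk, sprintf("part_%03d.fst", i))
}
close(con)

# The proposed merge function (hypothetical, not part of fst):
# merge_fst(list.files(pattern = "^part_.*\\.fst$"), "combined.fst")
```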

MarcusKlik (Collaborator) commented

Thanks for your feature request @edwindj. I can definitely see the added value of append functionality for your use case. Perhaps a streaming connection object (issue #15) would also suit your scenario, or, alternatively, an apply-like method (issue #18). The latter would require multiple csv files, or a single csv file with random access (like the functionality provided in your package and in the LaF package from @djvanderlaan). Your multiple-fst-file option is also very interesting (see issue #14); for your scenario with a very large csv file, it means that you can process and write data with multiple threads. I have some ideas on building a multi-threaded random-access csv reader for a new package I'm working on. Perhaps you and @djvanderlaan would like to exchange some thoughts with me on the subject, given your experience with your own packages?
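
For context, partial reading already exists via the from/to arguments of read.fst; a minimal sketch of what an apply-like method (issue #18) could look like on top of it, where fst_apply and its n_rows parameter are illustrative, not package API:

```r
library(fst)

# Hypothetical apply-like helper built on read.fst's from/to arguments
# (a sketch for issue #18; the caller supplies the total row count):
fst_apply <- function(path, fun, n_rows, chunk_size = 1e5) {
  starts <- seq(1, n_rows, by = chunk_size)
  lapply(starts, function(s) {
    chunk <- read.fst(path, from = s, to = min(s + chunk_size - 1, n_rows))
    fun(chunk)
  })
}

# Usage: process a 10-million-row file 100k rows at a time, e.g.
# sizes <- fst_apply("combined.fst", nrow, n_rows = 1e7)
```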
