
Feature request: Append (rows) = T #9

Open · cawthm opened this issue Jan 24, 2017 · 3 comments

Comments


cawthm commented Jan 24, 2017

An append = TRUE option for adding rows to an existing file is standard with utils::write.csv and readr::write_csv.

Great FAST package here, thank you!
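
For reference, a minimal sketch of the requested behaviour: the readr calls below are real, while the fst append argument is the hypothetical feature being requested here, not existing API:

```r
library(readr)

# Appending rows is a one-argument affair for csv writers:
write_csv(head(iris, 3), "data.csv")                 # header + first rows
write_csv(tail(iris, 3), "data.csv", append = TRUE)  # rows only, no header

# The requested equivalent for fst (hypothetical, not implemented):
# fst::write.fst(tail(iris, 3), "data.fst", append = TRUE)
```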

MarcusKlik (Collaborator) commented Jan 24, 2017

Thanks a lot! The csv format is row-oriented on disk, while the fst format is a column-oriented binary format, so it's somewhat easier to append data to a csv file than to a format like fst.
I could implement an append method, but that would mean storing multiple blocks of column-based data one after another, which carries a performance hit if append is called many times (say, every 10 rows). I can see three options:

  • Allow appends unconditionally; the user should understand that each append carries a performance hit.
  • Let the user specify a maximum number of appends when the file is first written; in that case the performance hit stays small (the default would be 0 appends).
  • The user writes the data to separate fst files, and read.fst accepts a vector of filenames which are then read as a single table (a sketch of this workflow follows after this list).

Which option would best suit your use case?
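
A minimal sketch of how the third option can be approximated today, assuming the current write.fst/read.fst API; the single-call read.fst(files) variant is the proposed feature, not existing behaviour:

```r
library(fst)

# Write each chunk of data to its own fst file:
chunks <- split(iris, rep(1:3, length.out = nrow(iris)))
files  <- sprintf("chunk_%d.fst", seq_along(chunks))
Map(write.fst, chunks, files)

# Read them back and combine into a single table; the proposed
# read.fst(files) with a vector of filenames would do this in one call:
combined <- do.call(rbind, lapply(files, read.fst))
```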

@fstpackage added this to the "Multiple chunks in binary format" milestone on Feb 1, 2017
edwindj commented Feb 3, 2017

Thanks for your wonderful package!
I have the same request: appending to an fst file, to support the following workflow:

  • Process large csv/text files that do not fit in RAM (I'm the author of the R package chunked) and store the result in an fst file.
  • Because fst allows for partial reading, this would be a nice option.

What about a fourth option:

  • Write to multiple fst files, and add a function that merges multiple fst files with identical metadata into a single fst file (see the sketch after this list).
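
A minimal sketch of that workflow under stated assumptions: the chunk loop uses base R, the file names and chunk size are illustrative, and merge_fst is the hypothetical merge function proposed here, not part of the package:

```r
library(fst)

# Read a large csv in pieces, process each piece, and write each
# result to its own fst file:
con <- file("big.csv", open = "r")
header <- readLines(con, n = 1)
i <- 0
repeat {
  lines <- readLines(con, n = 1e5)  # next 100k data rows
  if (length(lines) == 0) break
  chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
  # ... process chunk here ...
  i <- i + 1
  write.fst(chunk, sprintf("part_%03d.fst", i))
}
close(con)

# The proposed merge function (hypothetical, not part of fst):
# merge_fst(list.files(pattern = "^part_.*\\.fst$"), "combined.fst")
```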

MarcusKlik (Collaborator) commented

Thanks for your feature request @edwindj. I can definitely see the added value of append functionality for your use case. Perhaps a streaming connection object (issue #15) would also suit your scenario, or, alternatively, an apply-like method (issue #18). The latter would require multiple csv files, or a single csv file with random access (like the functionality provided in your package and in the LaF package from @djvanderlaan). Your multiple-fst-file option is also very interesting (see issue #14); for your scenario with a very large csv file, it means that you can process and write data with multiple threads. I have some ideas on building a multi-threaded random-access csv reader for a new package I'm working on. Perhaps you and @djvanderlaan would like to exchange some thoughts with me on the subject, given your experience with your own packages?
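
For context, partial reading already exists via the from/to arguments of read.fst; a minimal sketch of what an apply-like method (issue #18) could look like on top of it, where fst_apply and its n_rows parameter are illustrative, not package API:

```r
library(fst)

# Hypothetical apply-like helper built on read.fst's from/to arguments
# (a sketch for issue #18; the caller supplies the total row count):
fst_apply <- function(path, fun, n_rows, chunk_size = 1e5) {
  starts <- seq(1, n_rows, by = chunk_size)
  lapply(starts, function(s) {
    chunk <- read.fst(path, from = s, to = min(s + chunk_size - 1, n_rows))
    fun(chunk)
  })
}

# Usage: process a 10-million-row file 100k rows at a time, e.g.
# sizes <- fst_apply("combined.fst", nrow, n_rows = 1e7)
```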
