Append column to fst file #39

MarcusKlik · 2017-03-15T15:14:13Z

Methods fst.rbind and fst.cbind will be added to the package in the next version.


MarcusKlik · 2017-03-15T15:17:31Z

This requires a complete redesign of the file format that fst uses. Version v0.7.2 will still be supported but a warning will be given to rewrite the data with a newer version of the package. The new format will allow writing chunks of data recursively, so fst.rbind and fst.cbind can be used in arbitrary order and more than once.

MarcusKlik · 2017-03-15T15:20:09Z

MarcusKlik · 2017-03-15T15:27:23Z

This image represents a data set that has been serialized to the fst format:

A 3-column data set with u rows was serialized
A 3-column data set with v rows was appended with fst.rbind
A 3-column data set with w rows was appended with fst.rbind
A 1-column data set with (u + v + w) rows was appended with fst.cbind
A 4-column data set with x rows was appended with fst.rbind
A 4-column data set with y rows was appended with fst.rbind
A 2-column data set with (u + v + w + x + y) rows was appended with fst.cbind
A 1-column data set with (u + v + w + x + y) rows was appended with fst.cbind
A 7-column data set with z rows was appended with fst.rbind

After all these operations, the fst file will appear to the user as a single table with (u + v + w + x + y + z) rows and 7 columns. All operations will still work on this file (the user will be oblivious to the previous operations).
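The bookkeeping behind that sequence can be checked with a small sketch. It tracks table dimensions only and says nothing about the fst file format; new_table, sim_rbind and sim_cbind are hypothetical helper names, and the values for u to z are made up for illustration. Note that the final row count includes the z rows from the last fst.rbind.

```r
# Hypothetical bookkeeping helpers: they track dimensions only,
# they do not read or write fst files.
new_table <- function(nrow, ncol) list(nrow = nrow, ncol = ncol)

sim_rbind <- function(tbl, nrow, ncol) {
  stopifnot(ncol == tbl$ncol)  # appended rows must have all current columns
  tbl$nrow <- tbl$nrow + nrow
  tbl
}

sim_cbind <- function(tbl, nrow, ncol) {
  stopifnot(nrow == tbl$nrow)  # appended columns must span all current rows
  tbl$ncol <- tbl$ncol + ncol
  tbl
}

# Assumed example sizes
u <- 10; v <- 20; w <- 30; x <- 40; y <- 50; z <- 60

tbl <- new_table(u, 3)                       # 3 columns, u rows serialized
tbl <- sim_rbind(tbl, v, 3)
tbl <- sim_rbind(tbl, w, 3)
tbl <- sim_cbind(tbl, u + v + w, 1)          # now 4 columns
tbl <- sim_rbind(tbl, x, 4)
tbl <- sim_rbind(tbl, y, 4)
tbl <- sim_cbind(tbl, u + v + w + x + y, 2)  # now 6 columns
tbl <- sim_cbind(tbl, u + v + w + x + y, 1)  # now 7 columns
tbl <- sim_rbind(tbl, z, 7)

print(tbl)
```

The stopifnot checks encode the two constraints implied by the list: a row append must match the current column count, and a column append must match the current row count.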

MarcusKlik · 2017-03-15T15:30:58Z

Each append operation will create a redirection in the format, requiring extra seek operations. With > 1e5 seeks per second on most SSDs, performance is not expected to suffer for a modest number of operations. The file will be larger when many appends have been performed, due to the storage of indexing pointers. For appending small chunks this effect can be significant.
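As a rough back-of-the-envelope estimate, assuming exactly one extra seek per append when the file is read (an assumption for illustration, not a property of the planned format) and the 1e5 seeks per second quoted above:

```r
# Rough estimate only: assumes one extra seek per append operation.
seeks_per_second <- 1e5                  # figure quoted above for SSDs
n_appends <- c(10, 1000, 100000)         # assumed example workloads

extra_seconds <- n_appends / seeks_per_second

print(data.frame(n_appends, extra_seconds))
```

Under these assumptions the added latency only becomes noticeable once the number of appends approaches the seek rate itself.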

phillc73 · 2017-04-09T19:49:54Z

This might not make sense, but after reading this enhancement request, and then also your thoughts about, for example, fst.sum in #44, is there any advantage/desire to perform what I can only describe as "in fst" calculations?

What I mean is, using fst.sum to add two values from different columns and then also create a new column using fst.cbind, all within the same operation?

Maybe it's opening a can of worms with regard to other "in fst" calculations such as subtraction, division, median etc.

I have some legacy code which looks something like this:

# Filter for wins only
trainer_sr <- dplyr::filter(fbHistoricResultsSQL, POS == 1)

# Make all NAs = 0
trainer_sr[is.na(trainer_sr)] <- 0

# Calculate Strike Rate
trainer_sr$sr <- (trainer_sr$wins / trainer_sr$runs) * 100

The slow part really is restoring fbHistoricResultsSQL from a MySQL database. However, if the data were already in an fst file and calculations could be conducted "in fst" so to speak, it could make things much quicker.

MarcusKlik · 2017-04-09T21:37:32Z

Hi @phillc73, with a fst.cbind method, you could add your new column to a fst file like:

# Data required for calculating strike rate
strike_data <- read.fst("trainer_sr", c("wins", "runs"))

# Add column 'sr' to fst file
fst.cbind("trainer_sr", data.frame(sr = 100 * strike_data$wins / strike_data$runs))

So you would add a single column data.frame to an existing fst file. That would require memory for reading columns wins and runs and also for in-memory calculation of column sr. But for large data sets, this might still require too much memory and you would like to perform the operation from within the fst C++ code, right? The operation could look like:

# Create reference object to fst file
fst_table <- fst("trainer_sr")

# Add a column on-the-fly
fst_table[, sr := 100 * wins / runs]

# Or using dplyr interface
mutate(fst_table, sr = 100 * wins / runs)

For operations with user-defined methods, that would still require a full read of columns wins and runs, because I guess there is no way of knowing whether a user-defined function is an aggregate function or just a vector operation like '/'. I can see two ways around this:

fst defines methods like the fst.sum that you mentioned and these methods can be used in statements to create new columns 'from within fst'. These methods operate on chunks of data and would only require memory for a few chunks at a time.
The user can select a key column as the grouping parameter and calculations are performed per group in sequential order. That would also require much less memory, as only a small number of groups is actually in memory at any time. The data needs to be sorted though for this to work.
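The second option could look roughly like the sketch below. It uses an in-memory data.frame as a stand-in for the file, and the column name trainer is an assumption for illustration. The point is that with sorted data each group is one contiguous row range, so only that range has to be in memory at a time.

```r
# In-memory stand-in for a file that is sorted on the key column 'trainer'
dat <- data.frame(
  trainer = c("a", "a", "b", "b", "b", "c"),
  wins    = c(1, 0, 1, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Because the key is sorted, run lengths give contiguous group boundaries
key_runs <- rle(dat$trainer)
ends     <- cumsum(key_runs$lengths)
starts   <- ends - key_runs$lengths + 1

wins_per_trainer <- numeric(length(ends))
names(wins_per_trainer) <- key_runs$values

for (i in seq_along(ends)) {
  # Stand-in for reading only rows starts[i]..ends[i] from disk
  group_wins <- dat$wins[starts[i]:ends[i]]
  wins_per_trainer[i] <- sum(group_wins)
}

print(wins_per_trainer)
```

In this sketch the key column itself is still scanned once to find the boundaries; only the value columns are touched one group at a time.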

A third option would be to allow the user to set a parameter chunkwise = TRUE (or similar) that tells fst that the operation can be performed on a chunk of arbitrary size (like your division). Setting that parameter would reduce the memory footprint significantly as fst can run the operation on many smaller chunks.
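To illustrate why such a flag would have to come from the user, here is a minimal sketch in base R. apply_chunkwise is a hypothetical helper, and in-memory vectors stand in for columns on disk. An elementwise operation such as the strike rate gives the same result whether it runs on the whole column or on chunks, whereas an operation that hides an aggregate does not.

```r
# Hypothetical helper: evaluates fn on consecutive row ranges.
# In a real on-disk setting each chunk result would be written out
# instead of being collected in memory.
apply_chunkwise <- function(n_rows, chunk_size, fn) {
  out <- numeric(n_rows)
  for (from in seq(1, n_rows, by = chunk_size)) {
    to <- min(from + chunk_size - 1, n_rows)
    out[from:to] <- fn(from, to)
  }
  out
}

# In-memory stand-ins for the columns in the file
wins <- c(2, 0, 5, 1, 3)
runs <- c(10, 4, 20, 8, 6)

# Elementwise operation: safe to run chunk by chunk
sr_fn      <- function(from, to) 100 * wins[from:to] / runs[from:to]
sr_chunked <- apply_chunkwise(length(wins), 2, sr_fn)
sr_full    <- 100 * wins / runs

# Operation containing an aggregate: chunking changes the answer
share_fn      <- function(from, to) wins[from:to] / sum(wins[from:to])
share_chunked <- apply_chunkwise(length(wins), 2, share_fn)
share_full    <- wins / sum(wins)

print(isTRUE(all.equal(sr_chunked, sr_full)))
print(isTRUE(all.equal(share_chunked, share_full)))
```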

I think you are right in saying that it would be opening a can of worms if fst would pack all kinds of special functions like fst.sum, fst.mean, etc., as that would always be too limited for many users' needs! The 'grouping' option in combination with the 'chunkwise = TRUE' parameter would probably be a better place to start. But when we get to the advanced operations milestone (#48), it would be nice to have the user define custom 'map-reduce' like methods that can be used on a fst file.
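A user-defined 'map-reduce' method of that kind might be sketched as below. map_reduce_chunks and read_runs are hypothetical names, and an in-memory vector stands in for a column on disk. The map step turns each chunk into a small partial result and the reduce step combines the partials, so a mean can be computed without holding the full column.

```r
# Hypothetical helper: map each chunk to a partial result, then combine.
map_reduce_chunks <- function(read_chunk, n_rows, chunk_size, map_fn, reduce_fn) {
  starts <- seq(1, n_rows, by = chunk_size)
  partials <- lapply(starts, function(from) {
    to <- min(from + chunk_size - 1, n_rows)
    map_fn(read_chunk(from, to))
  })
  Reduce(reduce_fn, partials)
}

# In-memory stand-in for a column stored on disk
runs <- c(4, 8, 15, 16, 23, 42)
read_runs <- function(from, to) runs[from:to]

# Mean expressed as map (sum and count per chunk) plus reduce (add partials)
partial <- map_reduce_chunks(
  read_runs, length(runs), chunk_size = 4,
  map_fn    = function(x) c(sum = sum(x), n = length(x)),
  reduce_fn = function(a, b) a + b
)
chunked_mean <- partial[["sum"]] / partial[["n"]]

print(chunked_mean)
```

Carrying both the sum and the count through the reduce step is what makes the mean decomposable; averaging the per-chunk means directly would be wrong when chunks differ in size.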

Thanks @phillc73 for some interesting ideas!

MarcusKlik added the feature request label Mar 15, 2017

MarcusKlik self-assigned this Mar 15, 2017

MarcusKlik added this to the Multiple chunks in binary format milestone Mar 15, 2017

MarcusKlik modified the milestones: Fst package v0.9.0, Fst package v0.8.0 Jul 13, 2017

MarcusKlik added format informat labels Sep 22, 2017

MarcusKlik mentioned this issue Oct 10, 2017

Append directly on/to disk #91

Open

phillc73 mentioned this issue Nov 24, 2017

Consider a dplyr interface to an on disk fst? #108

Open

MarcusKlik modified the milestones: fst v0.8.4, fst v0.8.6 Dec 22, 2017

MarcusKlik modified the milestones: row and column binds, Candidate May 10, 2018

akrun1 mentioned this issue Aug 14, 2020

rbind/cbind two fsttable object or a fsttable with a data.frame or data.table fstpackage/fsttable#43

Open
