Append column to fst file #39

Open
MarcusKlik opened this issue Mar 15, 2017 · 6 comments

Comments

@MarcusKlik
Collaborator

MarcusKlik commented Mar 15, 2017

Methods fst.rbind and fst.cbind will be added to the package in the next version.

@MarcusKlik
Collaborator Author

MarcusKlik commented Mar 15, 2017

This requires a complete redesign of the file format that fst uses. The v0.7.2 format will still be supported, but a warning will be given to rewrite the data with a newer version of the package. The new format will allow writing chunks of data recursively, so fst.rbind and fst.cbind can be used in arbitrary order and more than once.

@MarcusKlik
Collaborator Author

[Image: fstformat]

@MarcusKlik
Collaborator Author

MarcusKlik commented Mar 15, 2017

This image represents a data set that has been serialized to the fst format:

  1. A 3-column data set with u rows was serialized
  2. A 3-column data set with v rows was appended with fst.rbind
  3. A 3-column data set with w rows was appended with fst.rbind
  4. A 1-column data set with (u + v + w) rows was appended with fst.cbind
  5. A 4-column data set with x rows was appended with fst.rbind
  6. A 4-column data set with y rows was appended with fst.rbind
  7. A 2-column data set with (u + v + w + x + y) rows was appended with fst.cbind
  8. A 1-column data set with (u + v + w + x + y) rows was appended with fst.cbind
  9. A 7-column data set with z rows was appended with fst.rbind

After all these operations, the fst file will appear to the user as a single table with (u + v + w + x + y + z) rows and 7 columns. All operations will still work on this file (the user will be oblivious to the previous operations).
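As a sanity check on the list above, the sequence of operations can be replayed in plain R to track the dimensions of the resulting table. This is only a sketch: the `ops` log, the `table_dims` helper and the chunk sizes are made up for illustration and say nothing about how the fst format stores chunks. Each fst.rbind chunk has to match the current column count and each fst.cbind chunk has to match the current row count.

```r
# Hypothetical chunk sizes, chosen only for illustration
u <- 10; v <- 20; w <- 30; x <- 40; y <- 50; z <- 60

# Log of the nine operations listed above
ops <- list(
  list(type = "write", rows = u, cols = 3),
  list(type = "rbind", rows = v, cols = 3),
  list(type = "rbind", rows = w, cols = 3),
  list(type = "cbind", rows = u + v + w, cols = 1),
  list(type = "rbind", rows = x, cols = 4),
  list(type = "rbind", rows = y, cols = 4),
  list(type = "cbind", rows = u + v + w + x + y, cols = 2),
  list(type = "cbind", rows = u + v + w + x + y, cols = 1),
  list(type = "rbind", rows = z, cols = 7)
)

# Replay the log, checking that every chunk fits the table built so far
table_dims <- function(ops) {
  dims <- c(rows = 0, cols = 0)
  for (op in ops) {
    if (op$type == "write") {
      dims <- c(rows = op$rows, cols = op$cols)
    } else if (op$type == "rbind") {
      stopifnot(op$cols == dims[["cols"]])  # row chunks need matching columns
      dims[["rows"]] <- dims[["rows"]] + op$rows
    } else if (op$type == "cbind") {
      stopifnot(op$rows == dims[["rows"]])  # column chunks need matching rows
      dims[["cols"]] <- dims[["cols"]] + op$cols
    }
  }
  dims
}

dims <- table_dims(ops)
```

Every row chunk (the initial write plus each fst.rbind) adds to the row count, and every fst.cbind adds to the column count.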

@MarcusKlik
Collaborator Author

MarcusKlik commented Mar 15, 2017

Each append operation will create a redirection in the format, requiring extra seek operations. With more than 1e5 seeks per second on most SSDs, performance is not expected to suffer for a modest number of append operations. The file will be larger when many appends have been performed, because indexing pointers have to be stored. For appending small chunks this effect can be significant.
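A rough back-of-the-envelope estimate of the seek overhead, using the 1e5 seeks per second figure. The assumption that every append costs one extra seek per column read is mine, made only to get an order of magnitude; the real number depends on the final format.

```r
# Seek rate quoted above for most SSDs
seeks_per_second <- 1e5

# Assumption (not from the format design): one extra seek per append per column
extra_seek_time <- function(n_appends, n_columns = 1) {
  n_appends * n_columns / seeks_per_second
}

few_appends  <- extra_seek_time(9, 7)    # the nine-operation example, 7 columns
many_appends <- extra_seek_time(1e4, 7)  # ten thousand small appends
```

Under this assumption the nine-operation example costs well under a millisecond, while ten thousand small appends would add roughly 0.7 seconds to a full read.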

@MarcusKlik MarcusKlik added this to the Multiple chunks in binary format milestone Mar 15, 2017
@phillc73

phillc73 commented Apr 9, 2017

This might not make sense, but after reading this enhancement request, and then also your thoughts about, for example, fst.sum in #44, is there any advantage/desire to perform what I can only describe as "in fst" calculations?

What I mean is, using fst.sum to add two values from different columns and then also create a new column using fst.cbind, all within the same operation?

Maybe it's opening a can of worms with regard to other "in fst" calculations such as subtraction, division, median, etc.

I have some legacy code which looks something like this:

# Filter for wins only
trainer_sr <- dplyr::filter(fbHistoricResultsSQL, POS == 1)

# Make all NAs = 0
trainer_sr[is.na(trainer_sr)] <- 0

# Calculate Strike Rate
trainer_sr$sr <- (trainer_sr$wins / trainer_sr$runs) * 100

The slow part really is retrieving fbHistoricResultsSQL from a MySQL database. However, if the data were already in an fst file and calculations could be conducted "in fst", so to speak, it could make things much quicker.

@MarcusKlik
Collaborator Author

Hi @phillc73, with a fst.cbind method, you could add your new column to a fst file like:

# Data required for calculating strike rate
strike_data <- read.fst("trainer_sr", c("wins", "runs"))

# Add column 'sr' to fst file
fst.cbind("trainer_sr", data.frame(sr = 100 * strike_data$wins / strike_data$runs))

So you would add a single column data.frame to an existing fst file. That would require memory for reading columns wins and runs and also for in-memory calculation of column sr. But for large data sets, this might still require too much memory and you would like to perform the operation from within the fst C++ code, right? The operation could look like:

# Create reference object to fst file
fst_table <- fst("trainer_sr")

# Add a column on-the-fly
fst_table[, sr := 100 * wins / runs]

# Or using dplyr interface
mutate(fst_table, sr = 100 * wins / runs)

For operations with user-defined methods, that would still require a full read of columns wins and runs, because I guess there is no way of knowing whether a user-defined function is an aggregate function or just a vector operation like '/'. I can see two ways around this:

  • fst defines methods like the fst.sum that you mentioned and these methods can be used in statements to create new columns 'from within fst'. These methods operate on chunks of data and would only require memory for a few chunks at a time.
  • The user can select a key column as the grouping parameter and calculations are performed per group in sequential order. That would also require much less memory as only a small number of groups are actually in-memory. The data needs to be sorted though for this to work.
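The grouping idea can be sketched in plain R on an in-memory data.frame. Because the data is sorted on the key, each group is one contiguous row range, so only that range has to be loaded at a time. The column names and values below are made up, and in this sketch the key column itself is still scanned in full.

```r
# Hypothetical data, already sorted on the key column 'trainer'
d <- data.frame(
  trainer = c("a", "a", "b", "c", "c", "c"),
  wins    = c(1, 0, 1, 1, 1, 0)
)

# Sorted keys give contiguous groups: find the row range of each group
key_runs <- rle(as.character(d$trainer))
ends     <- cumsum(key_runs$lengths)
starts   <- ends - key_runs$lengths + 1

# Process one group at a time; only rows starts[i]:ends[i] are touched
wins_per_trainer <- vapply(
  seq_along(starts),
  function(i) sum(d$wins[starts[i]:ends[i]]),
  numeric(1)
)
names(wins_per_trainer) <- key_runs$values
```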

A third option would be to allow the user to set a parameter chunkwise = TRUE (or similar) that tells fst that the operation can be performed on a chunk of arbitrary size (like your division). Setting that parameter would reduce the memory footprint significantly as fst can run the operation on many smaller chunks.
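A minimal sketch of what a chunkwise evaluation could do, written in plain R. The `chunkwise_apply` helper and the `read_chunk` argument are hypothetical: `read_chunk` stands in for a partial read of rows from:to of the required columns, and here it simply subsets an in-memory data.frame with made-up values.

```r
# Apply an elementwise function chunk by chunk
chunkwise_apply <- function(read_chunk, n_rows, chunk_size, fun) {
  stopifnot(n_rows >= 1, chunk_size >= 1)
  starts <- seq(1, n_rows, by = chunk_size)
  pieces <- lapply(starts, function(from) {
    to <- min(from + chunk_size - 1, n_rows)
    fun(read_chunk(from, to))  # only this chunk needs to be in memory
  })
  # A real implementation would write each piece to the file
  # instead of collecting all of them in memory
  unlist(pieces)
}

# Stand-in for the fst file
trainer <- data.frame(wins = c(1, 3, 0, 5, 2), runs = c(4, 6, 5, 10, 8))
read_chunk <- function(from, to) trainer[from:to, , drop = FALSE]

sr <- chunkwise_apply(
  read_chunk, nrow(trainer), chunk_size = 2,
  fun = function(d) 100 * d$wins / d$runs
)
```

This only gives the right answer for operations that work on each row independently, like the division here. An aggregate such as a sum would return one value per chunk, which is why the user would have to opt in explicitly.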

I think you are right in saying that it would be opening a can of worms if fst packed all kinds of special functions like fst.sum, fst.mean, etc., as that would always be too limited for many users' needs! The 'grouping' option in combination with the 'chunkwise = TRUE' parameter would probably be a better place to start. But when we get to the advanced operations milestone (#48), it would be nice to have the user define custom 'map-reduce' like methods that can be used on a fst file.
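To make the 'map-reduce' idea concrete, here is a small sketch in plain R. The `chunked_map_reduce` helper is hypothetical and works on an in-memory vector with made-up values; the point is only that the user supplies a `map` function that summarises one chunk and a `reduce` function that merges two summaries, so an aggregate like the mean never needs the whole column at once.

```r
# 'map' summarises one chunk, 'reduce' merges two summaries
chunked_map_reduce <- function(x, chunk_size, map, reduce) {
  stopifnot(length(x) >= 1, chunk_size >= 1)
  starts <- seq(1, length(x), by = chunk_size)
  partials <- lapply(starts, function(from) {
    to <- min(from + chunk_size - 1, length(x))
    map(x[from:to])
  })
  Reduce(reduce, partials)
}

# Mean of a column: carry (sum, n) per chunk and add the partial results
runs <- c(4, 6, 5, 10, 8)
acc <- chunked_map_reduce(
  runs, chunk_size = 2,
  map    = function(chunk) c(sum = sum(chunk), n = length(chunk)),
  reduce = function(a, b) a + b
)
chunked_mean <- acc[["sum"]] / acc[["n"]]
```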

Thanks @phillc73 for some interesting ideas!
