Append column to fst file #39
Comments
This requires a complete redesign of the file format that
This image represents a data set that has been serialized to the
After all these operations, the
Each append operation will create a redirection in the format, requiring extra seek operations. With more than 1e5 seeks per second on most SSDs, performance is not expected to suffer for a modest number of appends. The file will grow when many appends have been performed, because of the indexing pointers that must be stored. When appending small chunks, this effect can be significant.
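To get a feel for the seek cost mentioned above, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every append adds one extra seek for each column that is read, and that the disk sustains 1e5 seeks per second. The function `extra_seek_time` is hypothetical and not part of fst.

```r
# Rough estimate of the extra read latency caused by append redirections.
# Assumption (not taken from the fst source): one extra seek per append,
# per column read.
extra_seek_time <- function(n_appends, n_columns, seeks_per_second = 1e5) {
  extra_seeks <- n_appends * n_columns
  extra_seeks / seeks_per_second  # result in seconds
}

# 100 appends, reading 10 columns: 1000 extra seeks
extra_seek_time(100, 10)  # 0.01 seconds
```

Under these assumptions the overhead stays in the millisecond range unless the number of appends becomes very large.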
This might not make sense, but after reading this enhancement request, and then also your thoughts about, for example,

What I mean is, using

Maybe it's opening a can of worms with regards to other "in fst" calculations such as subtraction, division, median etc.

I have some legacy code which looks something like this:

# Filter for wins only
trainer_sr <- dplyr::filter(fbHistoricResultsSQL, POS == 1)
# Make all NAs = 0
trainer_sr[is.na(trainer_sr)] <- 0
# Calculate Strike Rate
trainer_sr$sr <- (trainer_sr$wins / trainer_sr$runs) * 100

The slow part really is restoring
Hi @phillc73, with a

# Data required for calculating strike rate
strike_data <- read.fst("trainer_sr", c("wins", "runs"))
# Add column 'sr' to fst file
fst.cbind("trainer_sr", data.frame(sr = 100 * strike_data$wins / strike_data$runs))

So you would add a single column

# Create reference object to fst file
fst_table <- fst("trainer_sr")
# Add a column on-the-fly
fst_table[, sr := 100 * wins / runs]
# Or using dplyr interface
mutate(fst_table, sr = 100 * wins / runs)

For operations with user-defined methods, that would still require a full read of columns wins and runs, because I guess there is no way of knowing whether a user-defined function is an aggregate function or just a vector operation like '/'. I can see two ways around this:
A third option would be to allow the user to set a parameter

I think you are right in saying that it would be opening a can of worms if

Thanks @phillc73 for some interesting ideas!
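The distinction drawn above between a vector operation like '/' and an aggregate function can be illustrated in base R. The sketch below is not how fst evaluates expressions; `apply_in_chunks`, `strike_rate` and `win_share` are hypothetical helpers, and the toy vectors stand in for columns read from a file in blocks. An element-wise function gives the same result whether it sees the whole column or one chunk at a time, while an aggregate does not.

```r
# Apply `f` to consecutive row chunks of the input vectors and
# concatenate the per-chunk results (hypothetical helper, not an fst API).
apply_in_chunks <- function(f, ..., chunk_size = 2) {
  cols <- list(...)
  n <- length(cols[[1]])
  starts <- seq(1, n, by = chunk_size)
  pieces <- lapply(starts, function(s) {
    idx <- s:min(s + chunk_size - 1, n)
    do.call(f, lapply(cols, function(col) col[idx]))
  })
  unlist(pieces)
}

# Toy data standing in for the 'wins' and 'runs' columns
wins <- c(1, 3, 0, 5, 2)
runs <- c(4, 6, 5, 10, 8)

# Element-wise: each output row depends only on the same input row
strike_rate <- function(wins, runs) 100 * wins / runs
identical(apply_in_chunks(strike_rate, wins, runs), strike_rate(wins, runs))  # TRUE

# Aggregate inside: sum(runs) depends on every row, so chunking changes the result
win_share <- function(wins, runs) 100 * wins / sum(runs)
identical(apply_in_chunks(win_share, wins, runs), win_share(wins, runs))  # FALSE
```

Because the two functions cannot be told apart from their signatures, a chunk-wise evaluator would need either a full read of the columns or some hint from the user about which kind of function it has been given.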
Methods fst.rbind and fst.cbind will be added to the package in the next version.