add append argument to write.fst #153
Comments
Hi @mikejiang, thanks for your question! For a matrix that is already stored using the method given as an example in #154, random-access retrieval could be added like this (full code given for completeness):

```r
m <- matrix(sample(1:100, 1000000, replace = TRUE), nrow = 1000)

# a matrix is just a vector with a 'dim' attribute
attributes(m)
#> $dim
#> [1] 1000 1000

# equivalent method to support writing full matrices
write_fst_matrix <- function(m, file_name) {
  # store and remove the dim attribute
  dim <- attr(m, "dim")
  attr(m, "dim") <- NULL
  fst::write_fst(data.frame(Data = m), file_name)
  saveRDS(dim, paste0(file_name, ".dim"))
}

# equivalent method to support reading (sub-)matrices:
# columns x1..x2, rows y1..y2
read_fst_matrix <- function(file_name, x1, x2, y1, y2) {
  # get the dimensions from the companion file
  dim <- readRDS(paste0(file_name, ".dim"))
  nr_of_rows <- dim[1]

  res <- list()
  for (x in x1:x2) {
    column_data <- fst::read_fst(file_name,
                                 from = (x - 1) * nr_of_rows + y1,
                                 to = (x - 1) * nr_of_rows + y2)[[1]]
    res[[1 + length(res)]] <- column_data
  }

  m <- unlist(res)
  attr(m, "dim") <- c(1 + y2 - y1, 1 + x2 - x1)
  m
}

# write the matrix efficiently
write_fst_matrix(m, "1.fst")

# read a 1000 x 10 sub-matrix efficiently
sub <- read_fst_matrix("1.fst", 1, 10, 1, 1000)
#> Loading required namespace: data.table
```

This would give you a way of reading matrices with random access and without much memory overhead.
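The random access in `read_fst_matrix()` above relies on R's column-major layout: element (row `y`, column `x`) of an `nr`-row matrix sits at linear position `(x - 1) * nr + y` in the stored vector. A minimal illustration of that mapping, using a hypothetical `linear_index()` helper (not part of fst):

```r
# linear position of element (y, x) of an nr-row matrix stored
# column-major as a single vector, as write_fst_matrix() does above;
# the helper name is illustrative only
linear_index <- function(y, x, nr) (x - 1) * nr + y

# row 2, column 3 of a 1000-row matrix:
linear_index(2, 3, 1000)
#> [1] 2002
```

This is exactly the `from`/`to` arithmetic used when reading one column's row range from the single-column fst file.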
Actually, subsetting the h5 by columns and appending the chunks to the same fst file is what I was asking for, e.g.:

```r
dim(h5array)  # on-disk h5, too big to be loaded into R
#> [1] 1000000 27998

sub <- as.matrix(h5array[, 1:1000])
write.fst(sub, fstfile)

sub <- as.matrix(h5array[, 1001:2000])
write.fst(sub, fstfile, append = TRUE)
```

... or some kind of equivalent chunk-wise writing mechanism.
Ah, I see! Yes, you will need an `append` argument for that. As a workaround until this feature is available, could you store the chunks in separate files?
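The separate-files workaround could be sketched as below. This is only an illustration, not part of the fst API: `chunk_ranges()`, `write_fst_chunks()`, `read_chunk`, and `file_stem` are hypothetical names, and the source is assumed to be anything that can hand back a block of columns as a matrix (e.g. a slice of an on-disk h5 array):

```r
# Split 1:n_cols into contiguous ranges of at most chunk_size columns.
chunk_ranges <- function(n_cols, chunk_size) {
  starts <- seq(1, n_cols, by = chunk_size)
  lapply(starts, function(s) s:min(s + chunk_size - 1, n_cols))
}

# Write each column chunk to its own fst file. `read_chunk` is any
# function returning the requested columns as a matrix, e.g.
# function(cols) as.matrix(h5array[, cols]) for an on-disk h5 source.
write_fst_chunks <- function(read_chunk, n_cols, chunk_size, file_stem) {
  ranges <- chunk_ranges(n_cols, chunk_size)
  for (i in seq_along(ranges)) {
    chunk <- read_chunk(ranges[[i]])
    fst::write_fst(as.data.frame(chunk),
                   sprintf("%s_%04d.fst", file_stem, i))
  }
  invisible(ranges)
}
```

A sub-matrix request then only needs to open the chunk files whose column ranges overlap it; once an `append` argument exists, the same loop could target a single file instead.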
Duplicate of #91?
We have a 1M x 30k matrix and would like to use fst to load sub-matrices on demand. Since the data is too big to fit into memory in the first place (it is stored as an h5 file), how do I write it to the fst format without loading the entire matrix into a data.frame?