
add append argument to write.fst #153

Open
mikejiang opened this issue May 21, 2018 · 4 comments

Comments

@mikejiang

We have a 1M × 30k matrix and would like to use fst to load submatrices on demand. Since the data is too big to fit into memory in the first place (it is stored as an h5 file), how do I write it to the fst format without loading the entire matrix into a data.frame?

@MarcusKlik
Collaborator

Hi @mikejiang, thanks for your question! For a matrix that is already stored using the method given as an example in #154, random-access retrieval can be added (full code given for completeness):

```r
m <- matrix(sample(1:100, 1000000, replace = TRUE), nrow = 1000)

# a matrix is just a vector with a 'dim' attribute
attributes(m)
#> $dim
#> [1] 1000 1000

# equivalent method to support writing full matrices
write_fst_matrix <- function(m, file_name) {

  # store and remove the dim attribute
  dim <- attr(m, "dim")
  attr(m, "dim") <- NULL

  fst::write_fst(data.frame(Data = m), file_name)
  saveRDS(dim, paste0(file_name, ".dim"))
}

# equivalent method to support reading (sub-) matrices
read_fst_matrix <- function(file_name, x1, x2, y1, y2) {

  # get dimensions from file
  dim <- readRDS(paste0(file_name, ".dim"))

  nr_of_rows <- dim[1]
  res <- list()

  # read rows y1:y2 of each requested column as a separate row range
  for (x in x1:x2) {
    column_data <- fst::read_fst(file_name,
      from = (x - 1) * nr_of_rows + y1,
      to = (x - 1) * nr_of_rows + y2)[[1]]

    res[[1 + length(res)]] <- column_data
  }

  m <- unlist(res)
  attr(m, "dim") <- c(1 + y2 - y1, 1 + x2 - x1)

  m
}

# write matrix efficiently
write_fst_matrix(m, "1.fst")

# read a sub-matrix efficiently (columns 1-10, rows 1-1000)
m <- read_fst_matrix("1.fst", 1, 10, 1, 1000)
#> Loading required namespace: data.table
```

This gives you random-access reads on matrices without much memory overhead. But to really get maximum speed and avoid extra memory copies, random matrix access would have to be added to the fstlib library itself, so that it can process matrices in blocks as discussed in #154.

Is it possible to read your h5 file in small chunks into an R matrix?

thanks
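For context, reading an HDF5 file in column chunks could be sketched roughly like this, assuming the Bioconductor rhdf5 package is in use (the function name, file name, and dataset name below are placeholders, not anything from this thread):

```r
# sketch: process an on-disk HDF5 matrix one column chunk at a time,
# so only chunk_size columns are ever held in memory
library(rhdf5)

process_h5_in_chunks <- function(h5_file, dataset, n_cols, chunk_size, fun) {
  for (start in seq(1, n_cols, by = chunk_size)) {
    cols <- start:min(start + chunk_size - 1, n_cols)

    # index = list(NULL, cols): all rows, only the selected columns
    chunk <- h5read(h5_file, dataset, index = list(NULL, cols))
    fun(chunk, cols)
  }
}
```

Each chunk then fits in memory and can be converted with `as.matrix()` before further processing.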

@mikejiang
Author

mikejiang commented May 21, 2018

Actually, what I was asking for is the ability to subset the h5 by columns and write the chunks to the same fst file, e.g.:

```r
dim(h5array)  # on-disk h5, too big to be loaded into R
#> [1] 1000000   27998

sub <- as.matrix(h5array[, 1:1000])
write.fst(sub, fstfile)
sub <- as.matrix(h5array[, 1001:2000])
write.fst(sub, fstfile, append = TRUE)
...
```

Or some kind of cbind() method for merging multiple fst tables.

@MarcusKlik
Collaborator

Ah, I see! Yes, you will need cbind() or rbind() functionality for that, which is not implemented yet. The format is already prepared to hold multiple chunks of data, so no format change is needed to support appending, but currently only single chunks are implemented.

As a workaround until this feature is available, could you store the chunks in separate files and write an R wrapper to load subsets of the data? Then you would keep the speed and compression benefits of fst and could still take random subsets.
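Until append/cbind support lands, that workaround might look roughly like this (a sketch only; the chunk-file naming scheme and helper names are illustrative, not part of the fst API):

```r
# sketch: write each column chunk to its own fst file, then map a
# requested column back to the file that holds it
library(fst)

write_fst_chunk <- function(m, base_name, chunk_id) {
  # as.data.frame() names the matrix columns V1, V2, ...
  fst::write_fst(as.data.frame(m), sprintf("%s_%04d.fst", base_name, chunk_id))
}

read_fst_column <- function(base_name, col, chunk_size) {
  chunk_id <- 1 + (col - 1) %/% chunk_size      # which file holds this column
  col_in_chunk <- 1 + (col - 1) %% chunk_size   # position within that file

  fst::read_fst(sprintf("%s_%04d.fst", base_name, chunk_id),
                columns = paste0("V", col_in_chunk))[[1]]
}
```

Because `read_fst()` can select columns by name, only the requested column of the relevant chunk file is ever read into memory.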

@xiaodaigh
Contributor

Duplicate of #91?
