
add append argument to write.fst #153

Open
mikejiang opened this issue May 21, 2018 · 4 comments

Comments

@mikejiang

We have a 1M × 30k matrix and would like to use fst to load submatrices on demand. Since the data is too big to fit into memory in the first place (it is stored as an h5 file), how do I write it to the fst format without loading the entire matrix into a data.frame?

@MarcusKlik
Collaborator

Hi @mikejiang, thanks for your question! For a matrix that is already stored using the method given as an example in #154, random-access retrieval can be added (full code given for completeness):

```r
m <- matrix(sample(1:100, 1000000, replace = TRUE), nrow = 1000)

# a matrix is just a vector with a 'dim' attribute
attributes(m)
#> $dim
#> [1] 1000 1000

# equivalent method to support writing full matrices
write_fst_matrix <- function(m, file_name) {

  # store and remove the dim attribute
  dim <- attr(m, "dim")
  attr(m, "dim") <- NULL

  fst::write_fst(data.frame(Data = m), file_name)
  saveRDS(dim, paste0(file_name, ".dim"))
}

# equivalent method to support reading (sub-) matrices
read_fst_matrix <- function(file_name, x1, x2, y1, y2) {

  # get dimensions from file
  dim <- readRDS(paste0(file_name, ".dim"))

  nr_of_rows <- dim[1]
  res <- list()

  # read rows y1:y2 of each requested column as a separate row range
  for (x in x1:x2) {
    column_data <- fst::read_fst(file_name,
      from = (x - 1) * nr_of_rows + y1,
      to = (x - 1) * nr_of_rows + y2)[[1]]

    res[[1 + length(res)]] <- column_data
  }

  m <- unlist(res)
  attr(m, "dim") <- c(1 + y2 - y1, 1 + x2 - x1)

  m
}

# write matrix efficiently
write_fst_matrix(m, "1.fst")

# read a sub-matrix efficiently (columns 1-10, rows 1-1000)
m <- read_fst_matrix("1.fst", 1, 10, 1, 1000)
#> Loading required namespace: data.table
```

This gives you random-access reads on matrices without much memory overhead. But to really get maximum speed and avoid extra memory copies, random matrix access would have to be added to the fstlib library itself, so that it can process matrices in blocks as discussed in #154.

Is it possible to read your h5 file in small chunks into an R matrix?

thanks
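For context, reading an HDF5 file in column chunks could be sketched roughly like this, assuming the Bioconductor rhdf5 package is in use (the function name, file name, and dataset name below are placeholders, not anything from this thread):

```r
# sketch: process an on-disk HDF5 matrix one column chunk at a time,
# so only chunk_size columns are ever held in memory
library(rhdf5)

process_h5_in_chunks <- function(h5_file, dataset, n_cols, chunk_size, fun) {
  for (start in seq(1, n_cols, by = chunk_size)) {
    cols <- start:min(start + chunk_size - 1, n_cols)

    # index = list(NULL, cols): all rows, only the selected columns
    chunk <- h5read(h5_file, dataset, index = list(NULL, cols))
    fun(chunk, cols)
  }
}
```

Each chunk then fits in memory and can be converted with `as.matrix()` before further processing.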

@mikejiang
Author

mikejiang commented May 21, 2018

Actually, what I was asking for is the ability to subset the h5 by columns and write the chunks to the same fst file, e.g.:

```r
dim(h5array)  # on-disk h5, too big to be loaded into R
#> [1] 1000000   27998

sub <- as.matrix(h5array[, 1:1000])
write.fst(sub, fstfile)
sub <- as.matrix(h5array[, 1001:2000])
write.fst(sub, fstfile, append = TRUE)
...
```

Or some kind of cbind() method for merging multiple fst tables.

@MarcusKlik
Collaborator

Ah, I see! Yes, you will need cbind() or rbind() functionality for that, which is not implemented yet. The format is already prepared to hold multiple chunks of data, so no format change is needed to support appending, but currently only single chunks are implemented.

As a workaround until this feature is available, could you store the chunks in separate files and write an R wrapper to load subsets of the data? Then you would keep the speed and compression benefits of fst and could still take random subsets.
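Until append/cbind support lands, that workaround might look roughly like this (a sketch only; the chunk-file naming scheme and helper names are illustrative, not part of the fst API):

```r
# sketch: write each column chunk to its own fst file, then map a
# requested column back to the file that holds it
library(fst)

write_fst_chunk <- function(m, base_name, chunk_id) {
  # as.data.frame() names the matrix columns V1, V2, ...
  fst::write_fst(as.data.frame(m), sprintf("%s_%04d.fst", base_name, chunk_id))
}

read_fst_column <- function(base_name, col, chunk_size) {
  chunk_id <- 1 + (col - 1) %/% chunk_size      # which file holds this column
  col_in_chunk <- 1 + (col - 1) %% chunk_size   # position within that file

  fst::read_fst(sprintf("%s_%04d.fst", base_name, chunk_id),
                columns = paste0("V", col_in_chunk))[[1]]
}
```

Because `read_fst()` can select columns by name, only the requested column of the relevant chunk file is ever read into memory.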

@xiaodaigh
Contributor

Duplicate of #91?
