Subset filearray into filearray proxy #5

dipterix · 2023-04-06T13:52:36Z

The original issue: dipterix/lazyarray#3

Is it possible that subsetting a lazyarray again yields a lazyarray?

I am a bit puzzled whether I use your package correctly, e.g.

# `arr` from readme.md
inds <- arr > 0.5 # error
inds <- arr[] > 0.5

During this call, arr[] fully populates the memory, i.e. the whole lazy-aspect is gone?

Original reply:

Hi @chrisdane, development of this package has been paused in favor of https://github.com/dipterix/filearray , a very similar package that offers better performance and more functions. This package (lazyarray) is still on CRAN because some of my old projects still depend on it, but the migration will be complete soon. I'm sorry for the inconvenience.

Back to your question. It's not straightforward to subset lazyarray/filearray in that way for now because I'm dealing with arrays of 10GB+. Your proposed operations might need to create a new array on disk, which could very easily fill up the hard disk if not handled carefully.

It's true that once you call [, the data will be loaded into memory, hence the "lazy" aspect goes away.

What I might be able to do, however, is set up some lazily evaluated proxies. The proxies do not evaluate the arrays immediately. Instead, they are only evaluated when you subset them:

# No evaluation, inds is just a proxy array
inds <- arr > 0.5

# evaluates `arr>0.5` on the fly
inds[,,1]

# or 
arr[inds]

Does that resolve your problems?
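
A minimal base R sketch of the proxy idea above, assuming a plain in-memory array as a stand-in for a file-backed one. The names lazy_proxy and `[.lazy_proxy` are hypothetical and are not part of filearray; the real implementation may look quite different.

```r
# Hypothetical constructor: store the source array and the pending operation
# without evaluating anything. Not part of filearray.
lazy_proxy <- function(source, op) {
  structure(list(source = source, op = op), class = "lazy_proxy")
}

# Subsetting the proxy reads only the requested slice of the source,
# then applies the recorded operation to that slice.
`[.lazy_proxy` <- function(x, ...) {
  fields <- unclass(x)
  fields$op(fields$source[...])
}

arr <- array(seq_len(24) / 24, dim = c(2, 3, 4))  # stand-in for a large array
inds <- lazy_proxy(arr, function(v) v > 0.5)      # nothing is evaluated yet
inds[, , 1]                                       # evaluates `arr[, , 1] > 0.5` only
```

The point of the design is that the cost of `arr > 0.5` is paid per requested slice rather than for the whole array up front.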


chrisdane · 2023-04-06T18:25:14Z

Yes I think so =)
Generally I am looking for data manipulation/arithmetic/subsetting as lazy as possible, i.e. applying as many methods as possible without using the memory. Ideally, this would include subsetting, but maybe this is technically nonsense ^_^:

methods(class="FileArray")
 [1] $          $<-        [          [<-        [[         apply     
 [7] as.array   coerce     dim        dimnames   dimnames<- fwhich    
[13] initialize length     mapreduce  max        min        range     
[19] show       subset     sum        typeof

Btw, would it be possible to lazy-load a netcdf file as FileArray? This seems possible with the stars package (called "proxy" there):

library(stars)
proxy <- stars::read_stars(system.file("nc/reduced.nc", package="stars"), proxy=TRUE) # lazy-load nc file
message("proxy obj needs ", format(utils::object.size(proxy), units="auto"))
#proxy obj needs 12.3 Kb
stars <- stars::st_as_stars(proxy) # convert to accessible data = use memory
message("stars obj needs ", format(utils::object.size(stars), units="auto"))
#stars obj needs 519.3 Kb
methods(class="stars_proxy")
 [1] Math            Ops             [               [<-            
 [5] [[<-            adrop           aggregate       aperm          
 [9] as.data.frame   c               coerce          dim            
[13] droplevels      filter          hist            initialize     
[17] is.na           merge           plot            predict        
[21] print           show            slotsFromS3     split          
[25] st_apply        st_as_sf        st_as_stars     st_crop        
[29] st_dimensions<- st_downsample   st_mosaic       st_redimension 
[33] st_sample       st_set_bbox     write_stars

Thanks a lot for your great work!

dipterix · 2023-04-06T20:34:30Z

> Generally I am looking for data manipulation/arithmetic/subsetting as lazy as possible, i.e. applying as many methods as possible without using the memory. Ideally, this would include subsetting

That sounds like a good idea. There will be some limitations on the types of methods available. Point-wise operations such as +, -, *, /, >, and < should be the easiest. Indexing (arr[arr>0.5]) could be a little tricky but is doable. Other methods (such as tensor decomposition or matrix multiplication) will not be implemented at this time.
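
To illustrate why indexing like arr[arr>0.5] is doable without holding the whole array in memory, here is a rough sketch that evaluates it one last-margin slice at a time. It assumes a 3-dimensional array and uses a plain in-memory array as a stand-in for a file-backed one; subset_by_slices is a hypothetical helper, not a filearray function.

```r
# Hypothetical helper: compute x[x > threshold] for a 3-D array by visiting
# one slice along the last margin at a time, so only one slice is needed in
# memory at any moment (plus the selected values).
subset_by_slices <- function(x, threshold) {
  n_slices <- dim(x)[3]
  pieces <- lapply(seq_len(n_slices), function(k) {
    slice <- x[, , k]            # load a single slice
    slice[slice > threshold]     # keep only the matching values
  })
  unlist(pieces)                 # slices are visited in storage order
}

arr <- array(seq_len(24) / 24, dim = c(2, 3, 4))
subset_by_slices(arr, 0.5)
```

Because R arrays are stored in column-major order, concatenating the per-slice results in slice order gives the same ordering as the in-memory expression arr[arr > 0.5].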

> but maybe this is technically nonsense ^_^:

No you are good. Glad you brought up this feature request.

> Btw, would it be possible to lazy-load a netcdf file as FileArray?

Not natively. I think you can convert the arrays though. I'm not very familiar with the low-level implementation of netCDF... From what I have read, it seems netCDF or hdf5 are often chunked for random access.

FileArray has its own format. The file IO of filearray is written from scratch to make sure sequential IO is as fast as possible on NVMe SSDs (2-4GB/s on Mac M1/M2, or 1GB/s on an average Windows machine).

The performance comes at a cost. For example, random access is relatively slow. filearray does not use a universal file format that can be read by other programs. The data array is only expandable along the last margin... If you are OK with these disadvantages, or have ways to work around them, filearray should be a great tool for out-of-memory analyses (my project often needs to analyze 200x200x300x300+ data arrays within seconds).
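
A small illustration of the sequential-versus-random access trade-off, assuming a column-major layout like R's in-memory arrays (whether filearray's on-disk layout matches this exactly is an assumption here). Each cell of idx holds its own linear storage position, so the positions touched by a slice can be inspected directly.

```r
dims <- c(2, 3, 4)
idx <- array(seq_len(prod(dims)), dim = dims)  # cell value = linear position

last_margin  <- as.vector(idx[, , 2])  # one slice along the last margin
first_margin <- as.vector(idx[1, , ])  # one slice along the first margin

# A last-margin slice is one contiguous run of positions (sequential read)
all(diff(last_margin) == 1)

# A first-margin slice jumps through storage (strided, random-like access)
all(diff(first_margin) == 1)
```

Under this layout, appending along the last margin only adds data at the end of storage, which fits with the array being expandable only in that direction.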

dipterix · 2023-04-13T20:48:09Z

Hi @chrisdane, I have added this experimental feature to the branch https://github.com/dipterix/filearray/tree/lazyeval

Would you mind helping me check this branch to see if there are any methods you would like supported? Also, please let me know if you find any bugs :)

You can install and compile this dev branch via

remotes::install_github("dipterix/filearray@lazyeval")

On Windows, Rtools is needed to compile. On macOS, please run xcode-select --install in a terminal to install the build tools.

Here's a sanity test:

> x <- as_filearray(1:24, dimension = c(4,6))

> y <- (2^(x - 1) + log(x)) > 10000 | x <= 2

> print(y)
Reference class object of class "FileArrayProxy"
Mode: readwrite 
UUID: 0005-640eaaf8-c6e7-4f55-aa6e-2956a872155c (depth=5)
Dimension: 4x6 
Partition count: 6 
Partition size: 1 
Data type: logical 
Internal type: integer 
Location: $TEMPDIR/tmpfilearray11ef51b065fe9.farr 

> x[y]
 [1]  1  2 15 16 17 18 19 20 21 22 23 24

> # Sanity check
> x[][(2^(x[] - 1) + log(x[])) > 10000 | x[] <= 2]
 [1]  1  2 15 16 17 18 19 20 21 22 23 24

dipterix · 2023-06-23T12:38:00Z

Added as of 0.1.6

dipterix added the enhancement New feature or request label Apr 6, 2023

dipterix mentioned this issue Apr 6, 2023

subset lazyarray into lazyarray? dipterix/lazyarray#3

Open

dipterix closed this as completed Jun 23, 2023
