Subset filearray into filearray proxy #5

Closed
dipterix opened this issue Apr 6, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@dipterix
Owner

dipterix commented Apr 6, 2023

The original issue: dipterix/lazyarray#3

Is it possible that subsetting a lazyarray again yields a lazyarray?

I am a bit puzzled whether I use your package correctly, e.g.

# `arr` from readme.md
inds <- arr > 0.5 # error
inds <- arr[] > 0.5

During this call, arr[] fully populates the memory, i.e. the whole lazy-aspect is gone?

Original reply:

Hi @chrisdane, development of this package has been paused in favor of https://github.com/dipterix/filearray , a very similar package that offers better performance and more functions. This package (lazyarray) is still on CRAN because some of my old projects still depend on it, but the migration will be complete soon. I'm sorry for the inconvenience.

Back to your question. It's not straightforward to subset lazyarray/filearray in that way for now because I'm dealing with arrays with sizes of 10GB+. Your proposed operations might need to create a new array on disk. This could very easily fill up the hard disks if not carefully treated.

It's true that once you call [, the data will be loaded into memory, hence the "lazy" aspect goes away.

What I could do, however, is set up some lazy-evaluated proxies. The proxies do not evaluate the arrays immediately. Instead, they only evaluate when you subset the arrays:

# No evaluation, inds is just a proxy array
inds <- arr > 0.5

# evaluates `arr>0.5` on the fly
inds[,,1]

# or 
arr[inds]

Does that resolve your problems?
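
The proxy idea in the reply above can be illustrated with a toy sketch in base R. This is not filearray's implementation: `make_lazy_proxy` and `read_slice` are hypothetical names, and a plain in-memory array stands in for a disk-backed one. The point is only that creating the proxy does no work, and the comparison runs on whichever slice is requested later.

```r
# Toy sketch in base R; NOT filearray's implementation.
# `make_lazy_proxy` and `read_slice` are hypothetical names.
make_lazy_proxy <- function(read_slice, op) {
  force(read_slice)
  force(op)
  # Nothing is read or computed here; we only remember what to do later.
  function(k) op(read_slice(k))
}

# Stand-in for a disk-backed array: a 2x3x4 array behind a reader function.
arr <- array(seq_len(24) / 24, dim = c(2, 3, 4))
read_slice <- function(k) arr[, , k]

# No evaluation yet: `inds` only stores the pending comparison.
inds <- make_lazy_proxy(read_slice, function(x) x > 0.5)

# Evaluates `> 0.5` on one slice at a time, on demand.
first_slice <- inds(1)
last_slice  <- inds(4)
```

In a real out-of-memory setting, `read_slice` would read one slice from disk, so only that slice would occupy memory at any time.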

@chrisdane

Yes I think so =)
Generally I am looking for data manipulation/arithmetic/subsetting as lazy as possible, i.e. applying as many methods as possible without using the memory. Ideally, this would include subsetting, but maybe this is technically nonsense ^_^:

methods(class="FileArray")
 [1] $          $<-        [          [<-        [[         apply     
 [7] as.array   coerce     dim        dimnames   dimnames<- fwhich    
[13] initialize length     mapreduce  max        min        range     
[19] show       subset     sum        typeof

Btw, would it be possible to lazy-load a netcdf file as FileArray? This seems possible with the stars package (called "proxy" there):

library(stars)
proxy <- stars::read_stars(system.file("nc/reduced.nc", package="stars"), proxy=TRUE) # lazy-load nc file
message("proxy obj needs ", format(utils::object.size(proxy), units="auto"))
#proxy obj needs 12.3 Kb
stars <- stars::st_as_stars(proxy) # convert to accessible data = use memory
message("stars obj needs ", format(utils::object.size(stars), units="auto"))
#stars obj needs 519.3 Kb
methods(class="stars_proxy")
 [1] Math            Ops             [               [<-            
 [5] [[<-            adrop           aggregate       aperm          
 [9] as.data.frame   c               coerce          dim            
[13] droplevels      filter          hist            initialize     
[17] is.na           merge           plot            predict        
[21] print           show            slotsFromS3     split          
[25] st_apply        st_as_sf        st_as_stars     st_crop        
[29] st_dimensions<- st_downsample   st_mosaic       st_redimension 
[33] st_sample       st_set_bbox     write_stars

Thanks a lot for your great work!

@dipterix
Owner Author

dipterix commented Apr 6, 2023

Generally I am looking for data manipulation/arithmetic/subsetting as lazy as possible, i.e. applying as many methods as possible without using the memory. Ideally, this would include subsetting

That sounds like a good idea. There will be some limitations on the types of methods available. Point-wise methods such as +, -, *, /, >, and < should be the easiest. Indexing (arr[arr>0.5]) could be a little tricky but is doable. Other methods (such as tensor decomposition or matrix multiplication) will not be implemented at this time.
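
As a rough illustration of why indexing such as arr[arr>0.5] is doable without loading everything, here is a minimal base-R sketch. It is an assumption about one possible approach, not how filearray does it, and `chunked_subset` is a hypothetical helper. It walks the array one last-margin slice at a time, applies the mask to that slice, and concatenates the hits. Because R stores arrays in column-major order, the concatenated result matches the full in-memory subset.

```r
# Hypothetical helper: evaluate arr[arr > threshold] one slice at a time.
# A plain 3-D array stands in for a disk-backed array here.
chunked_subset <- function(arr, threshold) {
  n_last <- dim(arr)[3]
  pieces <- lapply(seq_len(n_last), function(k) {
    chunk <- arr[, , k]          # only one slice is held at a time
    chunk[chunk > threshold]     # point-wise mask applied to that slice
  })
  unlist(pieces)
}

arr <- array(c(0.9, 0.1, 0.7, 0.3, 0.2, 0.8, 0.6, 0.4), dim = c(2, 2, 2))
result <- chunked_subset(arr, 0.5)

# Same values, same order as the fully in-memory version.
identical(result, arr[arr > 0.5])
```

The result here is small, but in general its size depends on the data, which is part of what makes indexing trickier than point-wise operations.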

but maybe this is technically nonsense ^_^:

No, you are good. Glad you brought up this feature request.

Btw, would it be possible to lazy-load a netcdf file as FileArray?

Not natively. I think you can convert the arrays though. I'm not very familiar with the low-level implementation of netCDF... From what I have read, it seems netCDF and HDF5 files are often chunked for random access.

FileArray has its own format. The file I/O of filearray is written from scratch to make sequential I/O as fast as possible on NVMe SSDs (2-4 GB/s on Mac M1/M2, or about 1 GB/s on an average Windows machine).

The performance comes with costs. For example, random access is relatively slow, filearray does not use a universal file format that other programs can read, and the data array is only expandable along the last margin. If you are OK with these disadvantages, or have ways to work around them, filearray should be a great tool for out-of-memory analyses (my project often needs to analyze 200x200x300x300+ data arrays within seconds).
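
One way to see why growing along the last margin is the cheap case is the following base-R sketch. It assumes a column-major layout like R's in-memory arrays; the thread does not describe filearray's actual on-disk format, and `linear_index` is a hypothetical helper. Under that assumption, adding slices along the last margin leaves the offset of every existing element unchanged, while growing the first margin shifts them.

```r
# Column-major linear offset of element (i, j, k) in an array with
# dimensions `dims`. `linear_index` is a hypothetical helper.
linear_index <- function(i, j, k, dims) {
  i + (j - 1) * dims[1] + (k - 1) * dims[1] * dims[2]
}

before      <- linear_index(2, 3, 1, dims = c(4, 6, 2))
grown_last  <- linear_index(2, 3, 1, dims = c(4, 6, 5))  # more slices along the last margin
grown_first <- linear_index(2, 3, 1, dims = c(5, 6, 2))  # more rows along the first margin

before == grown_last    # offset unchanged: new data can simply be appended
before == grown_first   # offset changed: existing data would have to move
```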

@dipterix
Owner Author

Hi @chrisdane, I have added this experimental feature to the branch https://github.com/dipterix/filearray/tree/lazyeval

Would you mind helping me check this branch to see if there are other methods you would like supported? Also please let me know if you find any bugs :)

You can install and compile this dev branch via

remotes::install_github("dipterix/filearray@lazyeval")

If you are on Windows, Rtools is needed to compile. On macOS, please run xcode-select --install in the terminal to install the build tools.

Here's a sanity test:

> x <- as_filearray(1:24, dimension = c(4,6))

> y <- (2^(x - 1) + log(x)) > 10000 | x <= 2

> print(y)
Reference class object of class "FileArrayProxy"
Mode: readwrite 
UUID: 0005-640eaaf8-c6e7-4f55-aa6e-2956a872155c (depth=5)
Dimension: 4x6 
Partition count: 6 
Partition size: 1 
Data type: logical 
Internal type: integer 
Location: $TEMPDIR/tmpfilearray11ef51b065fe9.farr 

> x[y]
 [1]  1  2 15 16 17 18 19 20 21 22 23 24

> # Sanity check
> x[][(2^(x[] - 1) + log(x[])) > 10000 | x[] <= 2]
 [1]  1  2 15 16 17 18 19 20 21 22 23 24

@dipterix
Owner Author

Added as of 0.1.6
