What about using a seed with "dual layout" i.e. row- and col- oriented? #19

Closed
hpages opened this issue May 15, 2018 · 2 comments

Comments

@hpages

hpages commented May 15, 2018

I'm opening an issue to facilitate the discussion about this. The discussion started with Mike's following email (May 12, 2018):

Herve,

I am in the process of implementing a DelayedArray backend and have some design questions for you. The vignette says that, in theory, it should be possible to implement a DelayedArray backend for any file format that has the capability to store array data with fast random access, which I understand as the minimum requirement for the extract_array method (used for random indexing [i,j]) and the dims, dimnames slots.
However, I am not quite sure what the purpose of the seed class is, given that it contains the filepath of the actual data store. I thought DelayedArray should be agnostic about the physical storage information of the backend.
If it is mandated, then my seed object will contain at least two file paths, which represent different data layouts (both row- and column-oriented storage) of the same matrix. Is that going to play well with DelayedArray?
Also, I am envisioning that our extended array type will be a generic disk-based class, which allows an arbitrary file format (h5, tiledb, etc.) to be used in this type of hybrid data layout framework.
Hopefully you can give me some clarifications on these. Thanks a lot!
Mike

@hpages

hpages commented May 15, 2018

Hey @mikejiang, @gfinak, @raphg,

Yes, the seed class is mandated. This is how you actually implement the backend: by implementing a seed class. When you implement an extract_array method for your seed objects, you're providing random access to the array data stored in your backend. So code in DelayedArray can use extract_array() to extract array data from a seed object without knowing anything about the backend.

About implementing a backend with dual layouts: For this to play well with DelayedArray, at least 2 things are needed:

  1. Your extract_array method will need to take advantage of the dual layout e.g. by using the row-oriented layout when the supplied index is of the form list(i, NULL) and by using the column-oriented layout when index is of the form list(NULL, j).

  2. When doing block-processing, DelayedArray will need to choose a block geometry that leads to optimal calls to extract_array(). For example, if DelayedArray object x has a seed with dual layout, rowSums() should ideally use blocks made of full rows and colSums() should use blocks made of full columns. That's if x doesn't carry a delayed transposition on it. If it does, then it's the other way around. Note that improving the block-processing strategy used by DelayedArray is still a work in progress. My priority at the moment is to have the block-processing strategy play well with the physical chunk geometry of the seed. Seeds will have a way to tell DelayedArray about the chunk geometry via a chunkdim method or something like that. If a seed provides no chunkdim method, a default block-processing strategy should be used. It would actually make sense that this default strategy does the above i.e. use blocks made of full rows or cols when calling row/col summarization functions like rowSums()/colSums(). Then it would play well with your dual layout backend. I'm putting this on the TODO list.
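The index-based choice described in point 1 could be sketched roughly as follows. This is only an illustration in base R: `choose_layout` is a hypothetical helper, not part of DelayedArray, and the tie-breaking rules for the cases not covered above (both or neither of `i` and `j` supplied) are assumptions made for the example.

```r
# Hypothetical helper (not a DelayedArray function): decide which layout of a
# dual-layout seed should serve a 2D index of the form list(i, j).
choose_layout <- function(index) {
    stopifnot(is.list(index), length(index) == 2L)
    i <- index[[1L]]
    j <- index[[2L]]
    if (!is.null(i) && is.null(j))
        return("row")   # list(i, NULL): full rows requested
    if (is.null(i) && !is.null(j))
        return("col")   # list(NULL, j): full columns requested
    if (is.null(i) && is.null(j))
        return("row")   # whole array requested: arbitrary choice (assumption)
    # Both i and j supplied. Assumed heuristic: prefer the layout along
    # which fewer distinct rows/columns have to be visited.
    if (length(i) <= length(j)) "row" else "col"
}

choose_layout(list(1:3, NULL))   # "row"
choose_layout(list(NULL, 2L))    # "col"
```

An extract_array method for a dual-layout seed could call such a helper and then delegate to whichever sub-seed it names. A delayed transposition would not need special handling here, since it changes the index the seed receives, not the seed itself.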

If you're going to implement a seed class for HDF5 dual layout, you should probably avoid starting from scratch. It's going to be easier to define the new class on top of the HDF5ArraySeed class e.g. with something like this:

    setClass("DualHDF5ArraySeed",
        slots=c(row_oriented="HDF5ArraySeed",
                col_oriented="HDF5ArraySeed"))

It feels to me that the approach would be the same if you were going to implement a seed class for tiledb dual layout (except that AFAIK there is no TileDbSeed class yet so you would need to start by implementing that). Implementing a dual layout seed might actually be done in a more generic way e.g. with something like:

    setClass("DualSeed",
        slots=c(row_oriented="ANY",
                col_oriented="ANY"))

with a validity method that checks that the seeds stored in the row_oriented and col_oriented slots are "compatible". What "compatible" means exactly (and how strict it needs to be) still needs to be decided. For example, there is no reason a priori why the 2 sub-seeds would need to use the same backend.
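As a rough sketch of such a validity method, here is one possible (and deliberately loose) reading of "compatible": the two sub-seeds must report the same dimensions. That definition is an assumption made for illustration, since the exact meaning is still open. Plain in-memory matrices stand in for real seeds so that the example runs with base R only.

```r
library(methods)

setClass("DualSeed",
    slots=c(row_oriented="ANY",
            col_oriented="ANY"))

# Assumed notion of compatibility: both sub-seeds have a dim() and the two
# dim() vectors are identical. Stricter checks (e.g. on dimnames or type)
# are left out on purpose.
setValidity("DualSeed", function(object) {
    d1 <- dim(object@row_oriented)
    d2 <- dim(object@col_oriented)
    if (is.null(d1) || is.null(d2))
        return("both sub-seeds must have dimensions")
    if (!identical(d1, d2))
        return("'row_oriented' and 'col_oriented' must have the same dimensions")
    TRUE
})

# Ordinary matrices used as stand-ins for real seeds.
m <- matrix(1:6, nrow=2)
ok <- new("DualSeed", row_oriented=m, col_oriented=m)

# Mismatched dimensions are rejected at construction time.
bad <- tryCatch(new("DualSeed", row_oriented=m, col_oriented=t(m)),
                error=function(e) conditionMessage(e))
```

Because the check only relies on dim(), nothing in it requires the two sub-seeds to come from the same backend.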

@hpages

hpages commented May 20, 2020

@mikejiang Is it ok to close this?
