Streaming volume support #12
-
Sounds reasonable. I'll share some thoughts in case it helps; disregard otherwise :). Personally, I value simplicity over extensibility. Loose chunk files over HTTP are good enough for me. Naming the file with the chunk coordinates makes sense: it is a natural ID for the chunk, and you know where to make the request without indirection. If there are multiple resolution grids, then that's at a different path with a different schema for how you query. The chunk files would have been better with metadata inside them too, but you need to know where to find them anyway.

Documentation is important for formats. In particular, the thing that I think needs to be put down in writing here is the coordinate systems and indexing of the data. Numbers without units and coordinates without reference frames can be a problem. I have written down what I've used here. It would be nice to at least agree on the main coordinate system of the scan. I think I've followed the conventions the data had; please let me know if I didn't. Is there a document somewhere that describes volpkg? Or source code? I haven't looked much at VC at all, honestly, just at the data on the server.
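To make the coordinate-as-ID idea above concrete, here is a minimal sketch of mapping a voxel coordinate to the chunk file that contains it. The chunk size, host, and filename pattern are illustrative placeholders, not the server's actual layout:

```python
# Sketch: map a voxel coordinate to the URL of the chunk file containing it.
# CHUNK, HOST, and PATTERN are illustrative assumptions, not the real layout.
CHUNK = 500  # chunk edge length in voxels (assumed)
HOST = "https://example.org/chunks/"  # placeholder host
PATTERN = "chunk_{z:03d}_{y:03d}_{x:03d}.dat"  # placeholder naming scheme

def chunk_url(zv: int, yv: int, xv: int) -> str:
    """Return the URL of the chunk containing voxel (zv, yv, xv)."""
    cz, cy, cx = zv // CHUNK, yv // CHUNK, xv // CHUNK
    return HOST + PATTERN.format(z=cz, y=cy, x=cx)
```

The appeal of the scheme is exactly what the comment says: the chunk coordinates are both the ID and the address, so no index lookup is needed before making the request.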
-
Yeah, agreed with both. Right now we're serving over HTTP and that seems to work pretty well. We can standardize on @spelufo's chunking format for now, because that's all we have, and it seems pretty good :) Another consideration would be different variants of the files. E.g. @JamesDarby345 has been working on blacking out the non-scroll parts of the tif files, which makes them considerably smaller. We're keeping them at the same resolution as the original files so that the coordinate systems match, but it would be nice if meta.json could point to such different variants and treat them as logically the same as the original volume data.
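One way a meta.json could point at such variants, purely as an illustration (the `variants` field and its shape are made up, not an agreed schema):

```json
{
  "type": "vol",
  "uuid": "20170511164226",
  "voxelsize": 40.0,
  "variants": [
    {"name": "original", "host": "https://example.org/tif/"},
    {"name": "masked", "host": "https://example.org/tif-masked/"}
  ]
}
```

Because all variants share the original resolution, clients could swap `host` values without any coordinate translation.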
-
I think the simplest approach is to continue to just host tiff files, but have an additional set that are:
Then you just serve them up from a different folder (assuming you have hosting space). It's not clear that streaming data is going to be a huge win; it's just a small usability win over downloading the chunks covering the region you're interested in.
-
If you do want some sort of packaging of the tiffs to make it easier for users to grab the regions they're interested in, I'd strongly suggest standardizing on zarr. It will handle a lot of the work around making sure you get the right chunk for the data you need.
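For comparison, the chunk bookkeeping that zarr automates can be sketched by hand in a few lines. This assumes a cubic chunk size (500 here is a placeholder) and half-open voxel bounds:

```python
from itertools import product

def chunks_for_region(lo, hi, chunk=500):
    """List the (z, y, x) chunk indices covering the half-open voxel
    box from lo to hi. lo and hi are (z, y, x) triples; chunk is the
    assumed cubic chunk edge length."""
    ranges = [range(l // chunk, (h - 1) // chunk + 1) for l, h in zip(lo, hi)]
    return list(product(*ranges))
```

A library like zarr also handles compression, dtype, and partial reads for you, which is the argument for standardizing on it rather than maintaining this logic in every client.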
-
I think what I'm trying to enable here is easier access to all of these versions that are slowly coming online, but in the tools and without the user having to do so much file management. I should be able to see a list of volumes in the tool, regardless of whether I've downloaded any of that volume's data. If I try to access some slice that I haven't downloaded yet, the tool should be able to tell me that and give me the option to download what I need.

@spelufo There isn't a formal document on the volpkg, but you're right that there should be. The source code is all in the VC core library. It's essentially a project directory which subdivides its organization into segmentation, render, and volume directories. Currently, a volume's metadata looks like this:

```json
{
  "height": 21,
  "max": 56151.0,
  "min": 0.0,
  "name": "Lorem Ipsum Parallel (40um)",
  "slices": 183,
  "type": "vol",
  "uuid": "20170511164226",
  "voxelsize": 40.0,
  "width": 300
}
```

At a minimum, I think we need to determine how to modify this metadata for indexing into the various new volume versions, particularly the chunks. So an updated TIFF stack might look like this:

```json
{
  "type": "vol",
  "name": "Some sample",
  "shape": [183, 21, 300],
  "voxelsize": 40.0,
  "format": "zstack",
  "pattern": "file_{Z:03d}.tif",
  "host": "https://some.web/endpoint/tif/"
}
```

(`host` would be excluded if using the local directory?)

And the chunks version might be modified from that like so:

```json
{
  "format": "chunks",
  "chunk_shape": [500, 500, 500],
  "pattern": "file_{Z:03d}_{Y:02d}_{X:03d}.dat",
  "host": "https://some.web/endpoint/chunks/"
}
```

For the streaming/downloading improvements, my proposal would be to host these JSON files alongside the data on the server (or anywhere, really) and enable clients to import them into local volpkgs (similar to how you currently add slices to a volpkg). From then on, the client code would be responsible for pulling data as needed. You could have a Volpkg Manager that lets you list and download a subset of the whole dataset into the volpkg, an online client that tries to load and cache on the fly, or something in between.

As far as @janpaul123's suggestion for having alternative volumes, I think this is a great idea. In terms of the volpkg, I think you'd probably want these to be stored independently but linked in the metadata.
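The "load and cache on the fly" client described above could be sketched like this. The transport is injected as a `fetch(z, y, x) -> bytes` callable so HTTP, SFTP, or local disk all fit; class and method names are hypothetical:

```python
import os

class ChunkCache:
    """Pull chunks on demand via a user-supplied fetch(z, y, x) -> bytes,
    caching them under cache_dir so each chunk is fetched only once.
    A sketch of the on-the-fly client idea, not an existing VC API."""

    def __init__(self, fetch, cache_dir):
        self.fetch = fetch
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, z, y, x):
        return os.path.join(self.cache_dir, f"{z:03d}_{y:03d}_{x:03d}.dat")

    def get(self, z, y, x):
        path = self._path(z, y, x)
        if not os.path.exists(path):          # cache miss: download once
            data = self.fetch(z, y, x)
            with open(path, "wb") as f:
                f.write(data)
            return data
        with open(path, "rb") as f:           # cache hit: read local copy
            return f.read()
```

A "Volpkg Manager" that pre-downloads a subset is then just a loop calling `get` over a chosen region, while an online viewer calls `get` lazily as slices come into view.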
-
Right, about 60% of the cells are not relevant data, something like 4540/7888 for scroll 1. I don't know how the server is configured, but even having it 404 for those could be worth considering. Too restrictive, perhaps. Anyway, the mask is a no-brainer.
-
May I suggest encoding the names of the coordinates in the filename right then and there? I did
-
I put @spelufo's grids at paths like http://dl2.*****.org/full-scrolls/Scroll1.volpkg/volume_grids/20230205180739/. It's only on the dl2 server for now, because dl doesn't have enough space. We might deprecate dl completely.
-
Here's a mask for scroll_1_54: https://github.com/spelufo/vesuvius-build/blob/main/masks/scroll_1_54_mask.csv
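A sketch of how a client might consume such a mask file to skip empty cells. I'm assuming each CSV row is a comma-separated list of integer chunk coordinates; the actual column layout of scroll_1_54_mask.csv may differ:

```python
import csv
import io

def load_mask(text):
    """Parse mask CSV text into a set of integer coordinate tuples.

    Assumes each row is a comma-separated integer chunk coordinate;
    the real file's layout may differ."""
    cells = set()
    for row in csv.reader(io.StringIO(text)):
        if row and row[0].strip():                 # skip blank lines
            cells.add(tuple(int(v) for v in row))
    return cells

def should_fetch(cell, mask):
    """Only fetch chunks the mask marks as containing scroll data."""
    return cell in mask
```

With roughly 60% of cells empty, filtering requests through the mask before hitting the server saves most of the bandwidth for free.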
-
There has been some active work in the Scroll Prize Discord recently aimed at making chunked versions of volumes available for streaming to end users. For this library's purposes, it would be great to extend the current volpkg format to support hosted volumes, chunked or otherwise. I'm starting this discussion so the various stakeholders can talk about how this can be implemented in practice.

To get things started, I largely see this as a bookkeeping effort. I see it working like this: each volume has a meta.json, and inside is a URI to a hosted volume. As I see it, this allows immediate access to remote volumes (slices or chunks served over HTTP, SFTP, however we currently do it), with the possibility of expanding the scheme later by defining new protocols in the metadata payload and by updating clients. But what does everyone else think?
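To make the URI-in-meta.json idea concrete, here is a sketch of what a client could do with such a payload. The field names (`format`, `host`, `pattern`) follow the draft schema discussed in this thread and are assumptions, not a final spec:

```python
import json

def slice_url(meta_text, z):
    """Given a draft zstack meta.json payload, build the URL for slice z.

    Field names follow the schema sketched in this thread and are
    assumptions, not an agreed format."""
    meta = json.loads(meta_text)
    if meta.get("format") != "zstack":
        raise ValueError("not a slice-based volume")
    return meta["host"] + meta["pattern"].format(Z=z)
```

A client that understands only this one `format` value still works today, and new protocols can be added later by defining new `format` values and updating clients, as proposed above.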