Streaming volume support #12
-
Sounds reasonable. I'll share some thoughts in case it helps; disregard otherwise :). Personally, I value simplicity over extensibility. Loose chunk files over HTTP are good enough for me. Naming the file with the chunk coordinates makes sense: it is a natural ID for the chunk, and you know where to make the request without indirection. If there are multiple resolution grids, then that's at a different path with a different schema for how you query. The chunk files would have been better with metadata inside them too, but you need to know where to find them anyway.

Documentation is important for formats. In particular, the thing that I think needs to be put down in writing here is the coordinate systems and indexing of the data. Numbers without units and coordinates without reference frames can be a problem. I have written down what I've used here. It would be nice to at least agree on the main coordinate system of the scan. I think I've followed the conventions the data had; please let me know if I didn't. Is there a document somewhere that describes volpkg? Or source code? I haven't looked much at VC at all, honestly, just at the data on the server.
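To make the coordinate-as-ID idea above concrete, here is a minimal sketch of mapping a voxel coordinate to the chunk file that contains it. The chunk size, host, and filename pattern are illustrative placeholders, not the server's actual layout:

```python
# Sketch: map a voxel coordinate to the URL of the chunk file containing it.
# CHUNK, HOST, and PATTERN are illustrative assumptions, not the real layout.
CHUNK = 500  # chunk edge length in voxels (assumed)
HOST = "https://example.org/chunks/"  # placeholder host
PATTERN = "chunk_{z:03d}_{y:03d}_{x:03d}.dat"  # placeholder naming scheme

def chunk_url(zv: int, yv: int, xv: int) -> str:
    """Return the URL of the chunk containing voxel (zv, yv, xv)."""
    cz, cy, cx = zv // CHUNK, yv // CHUNK, xv // CHUNK
    return HOST + PATTERN.format(z=cz, y=cy, x=cx)
```

The appeal of the scheme is exactly what the comment says: the chunk coordinates are both the ID and the address, so no index lookup is needed before making the request.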
-
Yeah, agreed with both. Right now we're serving over HTTP and that seems to work pretty well. We can standardize on @spelufo's chunking format for now, because that's all we have, and it seems pretty good :) Another consideration would be different variants of the files. E.g. @JamesDarby345 has been working on blacking out the non-scroll parts of the tif files, which makes them considerably smaller. We're keeping them at the same resolution as the original files so that the coordinate systems match, but it would be nice if meta.json could point to such different variants and treat them as logically the same as the original volume data.
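One way a meta.json could point at such variants, purely as an illustration (the `variants` field and its shape are made up, not an agreed schema):

```json
{
  "type": "vol",
  "uuid": "20170511164226",
  "voxelsize": 40.0,
  "variants": [
    {"name": "original", "host": "https://example.org/tif/"},
    {"name": "masked", "host": "https://example.org/tif-masked/"}
  ]
}
```

Because all variants share the original resolution, clients could swap `host` values without any coordinate translation.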
-
I think the simplest approach is to continue to just host tiff files, but have an additional set that are:
Then you just serve them up from a different folder (assuming you have hosting space). It's not clear that streaming data is going to be a huge win; it's just a small usability win over downloading the chunks covering the region you're interested in.
-
If you do want some sort of packaging of the tiffs to make it easier for users to grab the regions they're interested in, I'd strongly suggest standardizing on zarr. It will handle a lot of the work around making sure you get the right chunk for the data you need.
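For comparison, the chunk bookkeeping that zarr automates can be sketched by hand in a few lines. This assumes a cubic chunk size (500 here is a placeholder) and half-open voxel bounds:

```python
from itertools import product

def chunks_for_region(lo, hi, chunk=500):
    """List the (z, y, x) chunk indices covering the half-open voxel
    box from lo to hi. lo and hi are (z, y, x) triples; chunk is the
    assumed cubic chunk edge length."""
    ranges = [range(l // chunk, (h - 1) // chunk + 1) for l, h in zip(lo, hi)]
    return list(product(*ranges))
```

A library like zarr also handles compression, dtype, and partial reads for you, which is the argument for standardizing on it rather than maintaining this logic in every client.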
-
I think what I'm trying to enable here is easier access to all of these versions that are slowly coming online, but in the tools and without the user having to do so much file management. I should be able to see a list of volumes in the tool, regardless of whether I've downloaded any of that volume's data. If I try to access some slice that I haven't downloaded yet, the tool should be able to tell me that and give me the option to download what I need.

@spelufo There isn't a formal document on the volpkg, but you're right that there should be. The source code is all in the VC core library. It's essentially a project directory which subdivides its organization into segmentation, render, and volume directories. Currently, a volume's metadata looks like this:

```json
{
  "height": 21,
  "max": 56151.0,
  "min": 0.0,
  "name": "Lorem Ipsum Parallel (40um)",
  "slices": 183,
  "type": "vol",
  "uuid": "20170511164226",
  "voxelsize": 40.0,
  "width": 300
}
```

At a minimum, I think we need to determine how to modify this metadata for indexing into the various new volume versions, particularly the chunks. So an updated TIFF stack might look like this:

```json
{
  "type": "vol",
  "name": "Some sample",
  "shape": [183, 21, 300],
  "voxelsize": 40.0,
  "format": "zstack",
  "pattern": "file_{Z:03d}.tif",
  "host": "https://some.web/endpoint/tif/"
}
```

(`host` would be excluded if using the local directory?)

And the chunks version might be modified from that like so:

```json
{
  "format": "chunks",
  "chunk_shape": [500, 500, 500],
  "pattern": "file_{Z:03d}_{Y:02d}_{X:03d}.dat",
  "host": "https://some.web/endpoint/chunks/"
}
```

For the streaming/downloading improvements, my proposal would be to host these JSON files alongside the data on the server (or anywhere, really) and enable clients to import them into local volpkgs (similar to how you currently add slices to a volpkg). From then on, the client code would be responsible for pulling data as needed. You could have a Volpkg Manager that lets you list and download a subset of the whole dataset into the volpkg, an online client that tries to load and cache on the fly, or something in between.

As far as @janpaul123's suggestion for having alternative volumes, I think this is a great idea. In terms of the volpkg, I think you'd probably want these to be stored independently but linked in the metadata.
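The "load and cache on the fly" client described above could be sketched like this. The transport is injected as a `fetch(z, y, x) -> bytes` callable so HTTP, SFTP, or local disk all fit; class and method names are hypothetical:

```python
import os

class ChunkCache:
    """Pull chunks on demand via a user-supplied fetch(z, y, x) -> bytes,
    caching them under cache_dir so each chunk is fetched only once.
    A sketch of the on-the-fly client idea, not an existing VC API."""

    def __init__(self, fetch, cache_dir):
        self.fetch = fetch
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, z, y, x):
        return os.path.join(self.cache_dir, f"{z:03d}_{y:03d}_{x:03d}.dat")

    def get(self, z, y, x):
        path = self._path(z, y, x)
        if not os.path.exists(path):          # cache miss: download once
            data = self.fetch(z, y, x)
            with open(path, "wb") as f:
                f.write(data)
            return data
        with open(path, "rb") as f:           # cache hit: read local copy
            return f.read()
```

A "Volpkg Manager" that pre-downloads a subset is then just a loop calling `get` over a chosen region, while an online viewer calls `get` lazily as slices come into view.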
-
Right, about 60% of the cells are not relevant data, something like 4540/7888 for scroll 1. I don't know how the server is configured, but even having it 404 for those could be worth considering. Too restrictive, perhaps. Anyway, the mask is a no-brainer.
-
May I suggest encoding the names of the coordinates in the filename right then and there? I did
-
I put @spelufo's grids at paths like http://dl2.*****.org/full-scrolls/Scroll1.volpkg/volume_grids/20230205180739/. It's only on the dl2 server for now, because dl doesn't have enough space. We might deprecate dl completely.
-
Here's a mask for scroll_1_54: https://github.com/spelufo/vesuvius-build/blob/main/masks/scroll_1_54_mask.csv
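A sketch of how a client might consume such a mask file to skip empty cells. I'm assuming each CSV row is a comma-separated list of integer chunk coordinates; the actual column layout of scroll_1_54_mask.csv may differ:

```python
import csv
import io

def load_mask(text):
    """Parse mask CSV text into a set of integer coordinate tuples.

    Assumes each row is a comma-separated integer chunk coordinate;
    the real file's layout may differ."""
    cells = set()
    for row in csv.reader(io.StringIO(text)):
        if row and row[0].strip():                 # skip blank lines
            cells.add(tuple(int(v) for v in row))
    return cells

def should_fetch(cell, mask):
    """Only fetch chunks the mask marks as containing scroll data."""
    return cell in mask
```

With roughly 60% of cells empty, filtering requests through the mask before hitting the server saves most of the bandwidth for free.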
-
There has been some active work in the Scroll Prize Discord recently aimed at making chunked versions of volumes available for streaming to end users. For this library's purposes, it would be great to extend the current volpkg format to support hosted volumes, chunked or otherwise. I'm starting this discussion so the various stakeholders can talk about how this can be implemented in practice.

To get things started, I largely see this as a bookkeeping effort. I see it working like this: each volume has a meta.json, and inside is a URI to a hosted volume. As I see it, this allows immediate access to remote volumes (slices or chunks served over HTTP, SFTP, however we currently do it), with the possibility of expanding the scheme later by defining new protocols in the metadata payload and by updating clients. But what does everyone else think?
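To make the URI-in-meta.json idea concrete, here is a sketch of what a client could do with such a payload. The field names (`format`, `host`, `pattern`) follow the draft schema discussed in this thread and are assumptions, not a final spec:

```python
import json

def slice_url(meta_text, z):
    """Given a draft zstack meta.json payload, build the URL for slice z.

    Field names follow the schema sketched in this thread and are
    assumptions, not an agreed format."""
    meta = json.loads(meta_text)
    if meta.get("format") != "zstack":
        raise ValueError("not a slice-based volume")
    return meta["host"] + meta["pattern"].format(Z=z)
```

A client that understands only this one `format` value still works today, and new protocols can be added later by defining new `format` values and updating clients, as proposed above.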