
identify optimal chunk size for computational imaging, DL, and visualization workflows #33

Open
mattersoflight opened this issue Jan 31, 2023 · 8 comments

@mattersoflight
Collaborator

mattersoflight commented Jan 31, 2023

When data is stored in locations with significant I/O latency (an on-premise storage server or the cloud), a large number of small files creates excessive overhead during nightly backups, transfers across nodes, and transfers over the internet. We need to decide on a default chunk size for the data that will be read and written by DL workflows such as virtual staining, and we need to test how chunk size affects I/O speed on HPC.

@Christianfoley can you write a test for I/O performance (e.g., reading a batch of 2.5D stacks and writing the batch to a zarr store) using the iohub reader and writer, and run the tests on HPC?

@ziw-liu can then evaluate whether the chunk size that is optimal for the DL workflow also works well for recOrder/waveOrder.

Choices to evaluate (assuming a camera that acquires 2048x2048 images; see the sketch below the list):

  • Chunk by image (chunk size = 1x1x2048x2048).
  • Chunk by channels (chunk size = Cx1x2048x2048).
  • Chunk by stack (chunk size = 1xZx2048x2048).
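
For reference, a minimal sketch of how these three options could be compared with plain zarr-python 2.x (the array shape, store paths, and single write pass are illustrative; this is not iohub code):

```python
import numpy as np
import zarr  # zarr-python 2.x

C, Z, Y, X = 2, 32, 2048, 2048
data = np.random.randint(0, 2**16, size=(C, Z, Y, X), dtype=np.uint16)

chunk_options = {
    "by_image": (1, 1, Y, X),
    "by_channels": (C, 1, Y, X),
    "by_stack": (1, Z, Y, X),
}

for name, chunks in chunk_options.items():
    arr = zarr.open(
        f"bench_{name}.zarr", mode="w", shape=data.shape, chunks=chunks, dtype=data.dtype
    )
    arr[:] = data  # time this write (and a subsequent read) for each chunking scheme
```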

@JoOkuma how sensitive is the performance of dexp processing steps to chunk size?

@ziw-liu ziw-liu added performance Speed and memory usage of the code NGFF OME-NGFF (OME-Zarr format) labels Feb 1, 2023
@JoOkuma
Member

JoOkuma commented Feb 1, 2023

@mattersoflight, we've encountered this problem in the past, where we had a huge performance drop when using a lot of (1, 1, 512, 512) tiles, especially when writing. So we defined our default to use the largest number of z-slices per chunk that zarr supports given the data type and the y, x size.
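
A minimal sketch of that heuristic (the ~2 GiB cap corresponds to the Blosc buffer limit in zarr-python 2.x; the function name and default are illustrative):

```python
import numpy as np

def max_z_per_chunk(dtype, y, x, chunk_byte_limit=2**31 - 1):
    """Largest number of z-slices per chunk that keeps the uncompressed
    chunk under a byte limit (e.g. the ~2 GiB Blosc buffer limit)."""
    bytes_per_slice = np.dtype(dtype).itemsize * y * x
    return max(1, chunk_byte_limit // bytes_per_slice)

# e.g. for uint16 2048x2048 slices this gives 255 z-slices per chunk
print(max_z_per_chunk("uint16", 2048, 2048))
```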

However, I heard this limitation is not due to having a lot of files but to having them all in a single directory. So having subfolders should improve that; I never tested this, though, but it would explain why the ome-zarr spec uses a NestedDirectoryStorage (references can be found online).

I would benchmark the performance of DirectoryStorage vs. NestedDirectoryStorage.
In our use case, a (1, 1, max Y, max X) chunk would be ideal.
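
A rough benchmark along those lines could look like this sketch with zarr-python 2.x (data shape, store paths, and the single timed pass are illustrative; a real benchmark should repeat and average):

```python
import time
import numpy as np
import zarr

data = np.random.randint(0, 2**16, size=(32, 2048, 2048), dtype=np.uint16)

for store_cls in (zarr.DirectoryStore, zarr.NestedDirectoryStore):
    store = store_cls(f"bench_{store_cls.__name__}.zarr")
    arr = zarr.zeros(
        shape=data.shape, chunks=(1, 2048, 2048), dtype=data.dtype, store=store, overwrite=True
    )
    start = time.perf_counter()
    arr[:] = data
    print(store_cls.__name__, f"{time.perf_counter() - start:.2f} s")
```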

@JoOkuma
Member

JoOkuma commented Feb 1, 2023

Regarding DL, IO is much slower than the forward/backward pass in our models, so we implemented a cache where we store a few 3D volumes in memory and sample patches from them; after a few iterations, we clear the cache and sample a new set of 3D volumes. Since the patches are random, we would rather sample from the memory cache than from zarr directly, because we have no guarantee that the patches will match the chunks. If the chunks are small, each patch ends up touching multiple chunks (slow); if the chunks are large, each patch comes from a single chunk, but reading it is also slow because the chunk is large.
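
A schematic version of that caching pattern (not dexp's actual implementation; class and parameter names are made up for illustration):

```python
import numpy as np
import zarr

class VolumeCache:
    """Keep a few whole 3D volumes in RAM and sample random patches from them,
    refreshing the cache every `refresh_every` draws."""

    def __init__(self, zarr_path, cache_size=4, patch_shape=(32, 256, 256), refresh_every=1000):
        self.array = zarr.open(zarr_path, mode="r")  # e.g. an (N, Z, Y, X) array
        self.cache_size = cache_size
        self.patch_shape = patch_shape
        self.refresh_every = refresh_every
        self.draws = 0
        self._refresh()

    def _refresh(self):
        # Read a few whole volumes once, so patch sampling never hits zarr chunks.
        idx = np.random.choice(self.array.shape[0], self.cache_size, replace=False)
        self.volumes = [np.asarray(self.array[i]) for i in idx]

    def sample(self):
        if self.draws and self.draws % self.refresh_every == 0:
            self._refresh()
        self.draws += 1
        vol = self.volumes[np.random.randint(len(self.volumes))]
        starts = [np.random.randint(s - p + 1) for s, p in zip(vol.shape, self.patch_shape)]
        slices = tuple(slice(s, s + p) for s, p in zip(starts, self.patch_shape))
        return vol[slices]
```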

Recently, I went back and started sampling patches from the dataset, saving them in an auxiliary directory, and using a standard directory dataset to process them.
It's much faster, simpler, and, more importantly, I'm sure I'm always using the same training set when benchmarking multiple models, so it's a fair comparison.

All of this is regarding training. For inference, I think the fastest approach is just to use multiple SLURM jobs, and you don't need to worry that much about IO.

@mattersoflight
Collaborator Author

mattersoflight commented Feb 1, 2023

I went back and started sampling patches from the dataset, saving them in an auxiliary directory, and using a standard directory dataset to process them. It's much faster, simpler, and, more importantly, I'm sure I'm always using the same training set when benchmarking multiple models, so it's a fair comparison.

Good point! Considering this, setting the chunk size to the image size turns out to be a good default with data stored in a nested hierarchy.

@ziw-liu Is re-chunking a given zarr store (that may be chunked by z) easy and efficient?

@ziw-liu
Collaborator

ziw-liu commented Feb 1, 2023

Is re-chunking a given zarr store (that may be chunked by z) easy and efficient?

The implementation can be as simple as a few lines, and there are also existing tools for (computational) scalability.

Performance, however, cannot be guaranteed. Zarr arrays are compressed, so there is no way around reading every byte of data into RAM and writing it back to disk. I think the limiting factor will likely be either network bandwidth or storage IOPS.

We can take a specific example to test if this will work for us.
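
For example, a few-line re-chunking pass could be sketched with dask (paths, the component name, and the target chunking are illustrative; dedicated tools such as rechunker exist for larger jobs):

```python
import dask.array as da

# Read an array that may be chunked by z-slice, re-chunk to whole ZYX stacks,
# and write it back to a new store.
src = da.from_zarr("input.zarr", component="0")  # e.g. a TCZYX array
dst = src.rechunk((1, 1, -1, -1, -1))            # -1 = full extent along that axis
da.to_zarr(dst, "rechunked.zarr", component="0", overwrite=True)
```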

@ziw-liu
Collaborator

ziw-liu commented Feb 1, 2023

For context, the current implementation in #31 allows for any chunk size to be specified and will default to the ZYX stack shape if None is provided:

https://github.com/czbiohub/iohub/blob/278467209e9da20929ade781de5bbe9c19e3940a/iohub/ngff.py#L408-L444

So downstream applications can conveniently modify chunking behavior regardless of the 'default' we impose. And I think it is also better to be explicit about chunking when coding for IO-sensitive tasks.
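
For illustration, overriding the default might look like the following sketch, using present-day iohub names as an assumption (the exact API in #31 may differ; check the linked module):

```python
import numpy as np
from iohub.ngff import open_ome_zarr  # names assumed; verify against the current iohub API

data = np.zeros((1, 2, 32, 2048, 2048), dtype=np.uint16)  # TCZYX

# Write with an explicit per-image chunk shape instead of the ZYX-stack default.
with open_ome_zarr(
    "example.zarr", layout="fov", mode="w-", channel_names=["Phase", "Nuclei"]
) as fov:
    fov.create_image("0", data, chunks=(1, 1, 1, 2048, 2048))
```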

@mattersoflight
Collaborator Author

The question of chunk size came up again at Donuts & Development, and we agreed that we should measure it. Chunk-size vs. read/write-speed benchmarks will guide both users of the library and developers. For example, we can test the speed gain with tensorstore (#26).

I'd bump this up in priority, right after we have the feature-complete 0.1.0 release.

@ziw-liu and @AhmetCanSolak what benchmarks make sense for our HPC environment? Are there some existing benchmarks that should be refactored into iohub?

@mattersoflight mattersoflight added this to the 0.2.0 milestone Feb 15, 2023
@ziw-liu
Collaborator

ziw-liu commented Feb 15, 2023

what benchmarks make sense for our HPC environment?

Our HPC environment is relatively stable, so I think we should design the tests based on typical workloads, parameterizing over dataset types and analysis pipelines. A good starting point may be gathering representative datasets and converting them. A point @AhmetCanSolak brought up is that we also need to decide on benchmarking infrastructure (e.g., asv) so that the benchmarks are automated and reproducible.
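
If we go with asv, a benchmark can be as small as this sketch (the class name, dataset shape, and store path are illustrative placeholders, not existing iohub benchmarks):

```python
# benchmarks/benchmarks.py -- minimal asv-style benchmark sketch
import numpy as np
import zarr

class WriteChunkedStack:
    params = [(1, 1, 2048, 2048), (1, 32, 2048, 2048)]
    param_names = ["chunks"]

    def setup(self, chunks):
        self.data = np.random.randint(0, 2**16, size=(2, 32, 2048, 2048), dtype=np.uint16)
        self.store = zarr.NestedDirectoryStore("asv_bench.zarr")

    def time_write(self, chunks):
        arr = zarr.zeros(
            shape=self.data.shape, chunks=chunks, dtype=self.data.dtype,
            store=self.store, overwrite=True,
        )
        arr[:] = self.data
```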

@mattersoflight mattersoflight changed the title decide on the largest chunk size compatible with DL workflows identify optimal chunk size for computational imaging, DL, and visualization workflows Feb 16, 2023
@AhmetCanSolak
Contributor

I believe default chunk_size selection is not a dealbreaker for many use cases. It seems fair to expect API consumers to set the chunk size per their needs. Different pipelines can be benchmarked against varying chunk sizes, as @ziw-liu mentioned, but I don't think those benchmarks need to live in the iohub repository. What @JoOkuma suggests (benchmarking the performance of DirectoryStorage vs. NestedDirectoryStorage) is interesting and could be part of our benchmark suite. Also, I created an issue on benchmarking; please give your input there (#57) regarding what you would like to see in our benchmark suite.
