
identify optimal chunk size for computational imaging, DL, and visualization workflows #33

Open
mattersoflight opened this issue Jan 31, 2023 · 8 comments

@mattersoflight
Collaborator

mattersoflight commented Jan 31, 2023

When data is stored in locations with significant I/O latency (an on-premise storage server or the cloud), a large number of small files creates excessive overhead during nightly backups, transfers across nodes, and transfers over the internet. We need to decide on a default chunk size for the data that will be read and written by DL workflows such as virtual staining, and we need to test how chunk size affects I/O speed on HPC.

@Christianfoley can you write a test for I/O performance (e.g., reading a batch of 2.5D stacks and writing the batch to a zarr store) using the iohub reader and writer, and run the tests on HPC?

@ziw-liu can then evaluate whether the chunk size that is optimal for the DL workflow also works well for recOrder/waveOrder.

Choices to evaluate (assuming a camera that acquires 2048x2048 images; see the sketch below the list):

  • Chunk by image (chunk size = 1x1x2048x2048).
  • Chunk by channels (chunk size = Cx1x2048x2048).
  • Chunk by stack (chunk size = 1xZx2048x2048).
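
For reference, a minimal sketch of how these three options could be compared with plain zarr-python 2.x (the array shape, store paths, and single write pass are illustrative; this is not iohub code):

```python
import numpy as np
import zarr  # zarr-python 2.x

C, Z, Y, X = 2, 32, 2048, 2048
data = np.random.randint(0, 2**16, size=(C, Z, Y, X), dtype=np.uint16)

chunk_options = {
    "by_image": (1, 1, Y, X),
    "by_channels": (C, 1, Y, X),
    "by_stack": (1, Z, Y, X),
}

for name, chunks in chunk_options.items():
    arr = zarr.open(
        f"bench_{name}.zarr", mode="w", shape=data.shape, chunks=chunks, dtype=data.dtype
    )
    arr[:] = data  # time this write (and a subsequent read) for each chunking scheme
```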

@JoOkuma how sensitive is the performance of dexp processing steps to chunk size?

@ziw-liu ziw-liu added performance Speed and memory usage of the code NGFF OME-NGFF (OME-Zarr format) labels Feb 1, 2023
@JoOkuma
Member

JoOkuma commented Feb 1, 2023

@mattersoflight, we've encountered this problem in the past, where we had a huge performance drop when using a lot of (1, 1, 512, 512) tiles, especially when writing. So we defined our default to use the largest number of z-slices per chunk that zarr supports given the data type and the y, x size.
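
A minimal sketch of that heuristic (the ~2 GiB cap corresponds to the Blosc buffer limit in zarr-python 2.x; the function name and default are illustrative):

```python
import numpy as np

def max_z_per_chunk(dtype, y, x, chunk_byte_limit=2**31 - 1):
    """Largest number of z-slices per chunk that keeps the uncompressed
    chunk under a byte limit (e.g. the ~2 GiB Blosc buffer limit)."""
    bytes_per_slice = np.dtype(dtype).itemsize * y * x
    return max(1, chunk_byte_limit // bytes_per_slice)

# e.g. for uint16 2048x2048 slices this gives 255 z-slices per chunk
print(max_z_per_chunk("uint16", 2048, 2048))
```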

However, I heard this limitation is not due to having a lot of files but to having them all in a single directory. So having subfolders should improve that; I never tested this, though, but it would explain why the ome-zarr spec uses a NestedDirectoryStorage (references can be found online).

I would benchmark the performance of DirectoryStorage vs. NestedDirectoryStorage.
In our use case, a (1, 1, max Y, max X) chunk would be ideal.
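
A rough benchmark along those lines could look like this sketch with zarr-python 2.x (data shape, store paths, and the single timed pass are illustrative; a real benchmark should repeat and average):

```python
import time
import numpy as np
import zarr

data = np.random.randint(0, 2**16, size=(32, 2048, 2048), dtype=np.uint16)

for store_cls in (zarr.DirectoryStore, zarr.NestedDirectoryStore):
    store = store_cls(f"bench_{store_cls.__name__}.zarr")
    arr = zarr.zeros(
        shape=data.shape, chunks=(1, 2048, 2048), dtype=data.dtype, store=store, overwrite=True
    )
    start = time.perf_counter()
    arr[:] = data
    print(store_cls.__name__, f"{time.perf_counter() - start:.2f} s")
```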

@JoOkuma
Member

JoOkuma commented Feb 1, 2023

Regarding DL, IO is much slower than the forward/backward pass in our models, so we implemented a cache where we store a few 3D volumes in memory and sample patches from them; after a few iterations, we clear the cache and sample a new set of 3D volumes. Since the patches are random, we would rather sample from the memory cache than from zarr directly, because we have no guarantee that the patches will match the chunks. If the chunks are small, each patch ends up touching multiple chunks (slow); if the chunks are large, each patch comes from a single chunk, but reading it is also slow because the chunk is large.
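
A schematic version of that caching pattern (not dexp's actual implementation; class and parameter names are made up for illustration):

```python
import numpy as np
import zarr

class VolumeCache:
    """Keep a few whole 3D volumes in RAM and sample random patches from them,
    refreshing the cache every `refresh_every` draws."""

    def __init__(self, zarr_path, cache_size=4, patch_shape=(32, 256, 256), refresh_every=1000):
        self.array = zarr.open(zarr_path, mode="r")  # e.g. an (N, Z, Y, X) array
        self.cache_size = cache_size
        self.patch_shape = patch_shape
        self.refresh_every = refresh_every
        self.draws = 0
        self._refresh()

    def _refresh(self):
        # Read a few whole volumes once, so patch sampling never hits zarr chunks.
        idx = np.random.choice(self.array.shape[0], self.cache_size, replace=False)
        self.volumes = [np.asarray(self.array[i]) for i in idx]

    def sample(self):
        if self.draws and self.draws % self.refresh_every == 0:
            self._refresh()
        self.draws += 1
        vol = self.volumes[np.random.randint(len(self.volumes))]
        starts = [np.random.randint(s - p + 1) for s, p in zip(vol.shape, self.patch_shape)]
        slices = tuple(slice(s, s + p) for s, p in zip(starts, self.patch_shape))
        return vol[slices]
```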

Recently, I went back and started sampling patches from the dataset, saving them in an auxiliary directory, and using a standard directory dataset to process them.
It's much faster, simpler, and, more importantly, I'm sure I'm always using the same training set when benchmarking multiple models, so it's a fair comparison.

All of this is regarding training. For inference, I think the fastest approach is just to use multiple SLURM jobs, and you don't need to worry that much about IO.

@mattersoflight
Collaborator Author

mattersoflight commented Feb 1, 2023

I went back and started sampling patches from the dataset, saving them in an auxiliary directory, and using a standard directory dataset to process them. It's much faster, simpler, and, more importantly, I'm sure I'm always using the same training set when benchmarking multiple models, so it's a fair comparison.

Good point! Considering this, setting the chunk size to the image size turns out to be a good default with data stored in a nested hierarchy.

@ziw-liu Is re-chunking a given zarr store (that may be chunked by z) easy and efficient?

@ziw-liu
Collaborator

ziw-liu commented Feb 1, 2023

Is re-chunking a given zarr store (that may be chunked by z) easy and efficient?

The implementation can be as simple as a few lines, and there are also existing tools for (computational) scalability.

Performance, however, cannot be guaranteed. Zarr arrays are compressed, so there is no way around reading every byte of data into RAM and writing it back to disk. I think the limiting factor will likely be either network bandwidth or storage IOPS.

We can take a specific example to test if this will work for us.
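
For example, a few-line re-chunking pass could be sketched with dask (paths, the component name, and the target chunking are illustrative; dedicated tools such as rechunker exist for larger jobs):

```python
import dask.array as da

# Read an array that may be chunked by z-slice, re-chunk to whole ZYX stacks,
# and write it back to a new store.
src = da.from_zarr("input.zarr", component="0")  # e.g. a TCZYX array
dst = src.rechunk((1, 1, -1, -1, -1))            # -1 = full extent along that axis
da.to_zarr(dst, "rechunked.zarr", component="0", overwrite=True)
```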

@ziw-liu
Collaborator

ziw-liu commented Feb 1, 2023

For context, the current implementation in #31 allows for any chunk size to be specified and will default to the ZYX stack shape if None is provided:

https://github.com/czbiohub/iohub/blob/278467209e9da20929ade781de5bbe9c19e3940a/iohub/ngff.py#L408-L444

So downstream applications can conveniently modify chunking behavior regardless of the 'default' we impose. And I think it is also better to be explicit about chunking when coding for IO-sensitive tasks.
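
For illustration, overriding the default might look like the following sketch, using present-day iohub names as an assumption (the exact API in #31 may differ; check the linked module):

```python
import numpy as np
from iohub.ngff import open_ome_zarr  # names assumed; verify against the current iohub API

data = np.zeros((1, 2, 32, 2048, 2048), dtype=np.uint16)  # TCZYX

# Write with an explicit per-image chunk shape instead of the ZYX-stack default.
with open_ome_zarr(
    "example.zarr", layout="fov", mode="w-", channel_names=["Phase", "Nuclei"]
) as fov:
    fov.create_image("0", data, chunks=(1, 1, 1, 2048, 2048))
```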

@mattersoflight
Collaborator Author

The question of chunk size came up again at Donuts & Development, and we agreed that we should measure it. Chunk-size vs. read/write-speed benchmarks will guide both users of the library and developers. For example, we can test the speed gain with tensorstore (#26).

I'd bump this up in priority, right after we have the feature-complete 0.1.0 release.

@ziw-liu and @AhmetCanSolak what benchmarks make sense for our HPC environment? Are there some existing benchmarks that should be refactored into iohub?

@mattersoflight mattersoflight added this to the 0.2.0 milestone Feb 15, 2023
@ziw-liu
Collaborator

ziw-liu commented Feb 15, 2023

what benchmarks make sense for our HPC environment?

Our HPC environment is relatively stable, so I think we should design the tests based on typical workloads, parameterizing over dataset types and analysis pipelines. A good starting point may be gathering representative datasets and converting them. A point @AhmetCanSolak brought up is that we also need to decide on benchmarking infrastructure (e.g., asv) so that the benchmarks are automated and reproducible.
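
If we go with asv, a benchmark can be as small as this sketch (the class name, dataset shape, and store path are illustrative placeholders, not existing iohub benchmarks):

```python
# benchmarks/benchmarks.py -- minimal asv-style benchmark sketch
import numpy as np
import zarr

class WriteChunkedStack:
    params = [(1, 1, 2048, 2048), (1, 32, 2048, 2048)]
    param_names = ["chunks"]

    def setup(self, chunks):
        self.data = np.random.randint(0, 2**16, size=(2, 32, 2048, 2048), dtype=np.uint16)
        self.store = zarr.NestedDirectoryStore("asv_bench.zarr")

    def time_write(self, chunks):
        arr = zarr.zeros(
            shape=self.data.shape, chunks=chunks, dtype=self.data.dtype,
            store=self.store, overwrite=True,
        )
        arr[:] = self.data
```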

@mattersoflight mattersoflight changed the title decide on the largest chunk size compatible with DL workflows identify optimal chunk size for computational imaging, DL, and visualization workflows Feb 16, 2023
@AhmetCanSolak
Contributor

I believe default chunk_size selection is not a dealbreaker for many use cases. It seems fair to expect API consumers to set the chunk size per their needs. Different pipelines can be benchmarked against varying chunk sizes, as @ziw-liu mentioned, but I don't think those benchmarks need to live in the iohub repository. What @JoOkuma suggests (benchmarking the performance of DirectoryStorage vs. NestedDirectoryStorage) is interesting and could be part of our benchmark suite. Also, I created an issue on benchmarking; please give your input there (#57) regarding what you would like to see in our benchmark suite.
