
Performance comparison between tifffile and iohub's custom OME-TIFF implementation #66

Open
ziw-liu opened this issue Feb 22, 2023 · 24 comments
Labels: μManager (Micro-Manager files and metadata), performance (Speed and memory usage of the code)

Comments

ziw-liu (Collaborator) commented Feb 22, 2023

A custom OME-TIFF reader (iohub.multipagetiff.MicromanagerOmeTiffReader) was implemented because historically tifffile and AICSImageIO were slow when reading large OME-TIFF series generated by Micro-Manager acquisitions.

While debugging #65 I found that this implementation does not guarantee data integrity during reading. Before investing more time in fixing it, I think it is worth revisiting whether maintaining a custom OME-TIFF reader is still justified, given that the more widely adopted solutions have evolved since waveorder.io was designed.

Here is a simple read speed benchmark of tifffile and iohub's custom reader:

[Figure: benchmark plot of read times, tifffile vs. the iohub custom reader]

The test was done on a 123 GB dataset with PTCZYX=(8, 9, 3, 81, 2048, 2048) dimensions. Voxels from 2 non-sequential positions were read into RAM in each iteration (N=5).

Test script:

Environment: Python 3.10.8, Linux 4.18 (x86_64, AMD EPYC 7H12@2.6GHz)

# %%
import os
from timeit import timeit
import zarr
import pandas as pd

# readers tested
from tifffile import TiffSequence  # 2023.2.3
from iohub.multipagetiff import MicromanagerOmeTiffReader  # 0.1.dev368+g3d62e6f


# %%
# 123 GB total
DATASET = (
    "/hpc/projects/comp_micro/rawdata/hummingbird/Soorya/"
    "2022_06_27_A549cellMembraneStained/"
    "A549_CellMaskDye_Well1_deltz0.25_63X_30s_2framemin/"
    "A549_CellMaskdye_Well1_30s_2framemin_1"
)

POSITIONS = (2, 0)


# %%
def read_tifffile():
    sequence = TiffSequence(os.scandir(DATASET))
    data = zarr.open(sequence.aszarr(), mode="r")
    for p in POSITIONS:
        _ = data[p]
    sequence.close()


# %%
def read_custom():
    reader = MicromanagerOmeTiffReader(DATASET)
    for p in POSITIONS:
        _ = reader.get_array(p)


# %%
def repeat(n=5):
    tf_times = []
    wo_times = []
    for _ in range(n):
        tf_times.append(
            timeit(
                "read_tifffile()", number=1, setup="from __main__ import read_tifffile"
            )
        )
        wo_times.append(
            timeit("read_custom()", number=1, setup="from __main__ import read_custom")
        )
    return pd.DataFrame({"tifffile": tf_times, "waveorder": wo_times})


# %%
timings = repeat()

At least in this test, the latest tifffile consistently outperforms the iohub implementation. While a comprehensive benchmark will take more time (#57), I think that as long as a widely used library is not significantly slower, the reduced maintenance overhead and broader user testing make a strong case for reconsidering whether to keep maintaining custom code in iohub.

ziw-liu added the μManager (Micro-Manager files and metadata) and performance (Speed and memory usage of the code) labels on Feb 22, 2023
ziw-liu (Collaborator, Author) commented Feb 23, 2023

A note on interpreting the results (why there are no error bars), from the timeit docs:

Note It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.

In our case, filesystem caching might lower that bound if the files are small, but the general idea holds.
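
As a concrete illustration of that advice (a minimal sketch; the timing numbers are made up, and the columns match the DataFrame returned by repeat() above), the summary of interest is the per-reader minimum rather than mean ± std:

import pandas as pd

# example timing vectors; in practice this is the DataFrame returned by repeat()
timings = pd.DataFrame({"tifffile": [52.1, 48.3, 47.9], "waveorder": [30.2, 28.5, 28.4]})

# the per-reader minimum is the meaningful lower bound; then eyeball the full vector
print(timings.min())
print(timings)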

ziw-liu (Collaborator, Author) commented Feb 23, 2023

Repeated the test with a much larger (1036 GB) dataset, and this time iohub's custom reader is faster by a large margin:

[Figure: read-time benchmark on the 1036 GB dataset, tifffile vs. the iohub custom reader]

Test script:

Environment: Python 3.10.8, Linux 4.18 (x86_64, AMD EPYC 7H12@2.6GHz)

# %%
import os
from timeit import timeit
from tqdm import tqdm
import zarr
import pandas as pd

# readers tested
from tifffile import TiffSequence  # 2023.2.3
from iohub.multipagetiff import MicromanagerOmeTiffReader  # 0.1.dev368+g3d62e6f


# %%
# 1036 GB total
DATASET = (
    "/hpc/projects/comp_micro/rawdata/hummingbird/Janie/"
    "2022_03_15_orgs_nuc_mem_63x_04NA/all_21_3"
)
POSITIONS = (50, 100, 150, 200, 250)


# %%
def read_tifffile():
    sequence = TiffSequence(os.scandir(DATASET))
    data = zarr.open(sequence.aszarr(), mode="r")
    for p in POSITIONS:
        _ = data[p]
    sequence.close()


# %%
def read_custom():
    reader = MicromanagerOmeTiffReader(DATASET)
    for p in POSITIONS:
        _ = reader.get_array(p)


# %%
def repeat(n=5):
    tf_times = []
    wo_times = []
    for _ in tqdm(range(n)):
        tf_times.append(
            timeit(
                "read_tifffile()", number=1, setup="from __main__ import read_tifffile"
            )
        )
        wo_times.append(
            timeit("read_custom()", number=1, setup="from __main__ import read_custom")
        )
    return pd.DataFrame({"tifffile": tf_times, "waveorder": wo_times})


# %%
def main():
    timings = repeat()
    print(timings)
    timings.to_csv("large_tiff_time.csv")


# %%
if __name__ == "__main__":
    main()

ziw-liu (Collaborator, Author) commented Feb 23, 2023

Given these tests, the crossover point between tifffile and iohub performance appears to be in the vicinity of 500 GB of dataset size. Since our users frequently work with ~TB OME-TIFF datasets, it makes sense to keep investing in a custom solution.

One concern, however, is that 7a6675a also makes the custom reader a lot slower. Edit: fixed.

ziw-liu changed the title from "iohub's custom OME-TIFF implementation can be slower than tiffile" to "iohub's custom OME-TIFF implementation can be slower than tiffile on smaller datasets" on Feb 24, 2023
cgohlke commented Feb 25, 2023

Allow me to chime in. 350 vs 50 seconds to read 25 files seems excessive. What is the number of files, the size of the files, and the number of pages in each file? If this is a single, multi-file MicroManager dataset, the use of TiffSequence is counterproductive. Tifffile uses the OME-XML metadata in the first file, which describes the whole multi-file dataset and triggers indexing of all TIFF pages in all files of the dataset...

ziw-liu (Collaborator, Author) commented Feb 27, 2023

@cgohlke Thanks for pointing out how we can improve the benchmarking! We are very much looking forward to switching to a more widely used and tested implementation with comparable performance.

What is the number of files, size of files, and number of pages in each file? If this is a single, multi-file MicroManager dataset, the use of TiffSequence is counterproductive.

There are 260 .ome.tif files, each 4 GB (except those written at the end of the acquisition). It is a single MM acquisition of shape PTCZYX=(336, 1, 4, 97, 2048, 2048) and type uint16.
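
For scale, the data volume per benchmark iteration follows directly from that shape (a quick back-of-the-envelope sketch):

import numpy as np

# PTCZYX = (336, 1, 4, 97, 2048, 2048), uint16
bytes_per_position = 1 * 4 * 97 * 2048 * 2048 * np.dtype("uint16").itemsize
print(bytes_per_position / 1e9)      # ~3.3 GB per position
print(5 * bytes_per_position / 1e9)  # ~16.3 GB read per iteration (5 positions)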

Can you point us to the recommended entry point for such a dataset with tifffile? I tried using tifs = imread(FIRST_FILE, aszarr=True) to get the Zarr store, but it only seems to take even longer.

cgohlke commented Feb 27, 2023

Can you point us to the recommended entry point for such a dataset with tifffile? I tried using tifs = imread(FIRST_FILE, aszarr=True) to get the Zarr store but it only seems to take even longer time.

Unfortunately it is currently not possible to get a single Zarr store of a multi-position MicroManager dataset. Tifffile uses the OME-XML metadata, which describes positions as distinct OME series. imread(FIRST_FILE, ...) returns the zarr store for the first position only. Instead, use the lower level interface, e.g.:

import zarr
from tifffile import TiffFile

# FIRST_FILE and POSITIONS as in the benchmark script above
with TiffFile(FIRST_FILE) as tif:
    series = tif.series  # this parses OME-XML and indexes all pages in all files
    for position in POSITIONS:
        # read series as numpy array ...
        im = series[position].asarray()
        # ... or via zarr store
        store = series[position].aszarr()
        z = zarr.open(store, mode='r')
        im = z[:]
        store.close()

A major bottleneck when only accessing a few positions is that tifffile needs to index all TIFF pages/IFDs in the dataset when creating the series. That requires many seeks and small reads across many files. I have a patch that uses the MicroManager IndexMap, which speeds up indexing about 10x for a smaller 8 GB dataset.

It would require changes in tifffile to return a single series for the whole MicroManager datastore. Either special casing the OME-XML parser, or adding a dedicated MicroManager parser similar to NDTiff. A special parser would be preferable, but (1) the format seems poorly documented compared to OME-TIFF, (2) there are variations between MicroManager versions, (3) there is no public set of test files, and (4) MicroManager frequently produces corrupted files (at least on my system).

ziw-liu (Collaborator, Author) commented Feb 27, 2023

Tifffile uses the OME-XML metadata, which describes positions as distinct OME series. imread(FIRST_FILE, ...) returns the zarr store for the first position only. Instead, use the lower level interface, e.g.:

Thanks for showing this! This way is much faster in most runs. Changing the tifffile reading function to:

from tifffile import TiffFile  # replaces the TiffSequence import in the script above

# FIRST_FILE: file name of the first .ome.tif in DATASET
def read_tifffile():
    with TiffFile(os.path.join(DATASET, FIRST_FILE)) as tif:
        series = tif.series
        for p in POSITIONS:
            _ = series[p].asarray()

while keeping the rest of the script the same gives:

[Figure: read-time results with the revised tifffile function]

The 1000+ second outlier happened in the first run:

      tifffile  iohub
0  1456.934305  28.823575
1    47.703576  28.290186
2    47.394221  27.825434
3    47.895452  27.775385
4    46.929777  27.824112

I'll have to run it a couple more times to see if that's a consistent pattern or just a network glitch.

ziw-liu (Collaborator, Author) commented Feb 27, 2023

I'll have to run it a couple more times to see if that's a consistent pattern or just a network glitch.

Tested again with iohub running first in each loop:

    tifffile  iohub
0  47.075930  27.999947
1  47.212810  27.868444
2  47.222090  27.899368
3  47.039293  27.784547
4  47.265484  27.865451

cgohlke commented Feb 28, 2023

Interesting. I guess that "glitch" is due to caching and abysmal performance of parsing the TIFF IFD structures on a network drive.

I have just released tifffile 2023.2.27, which uses the Micro-Manager indexmap instead of parsing the IFD chain. That should improve timing, especially when files are on a network drive.

Another thought: are the image data returned by iohub and tifffile the same, i.e. (4, 97, 2048, 2048) uint16 arrays?

mattersoflight (Collaborator) commented

A special parser would be preferable, but (1) the format seems poorly documented compared to OME-TIFF, (2) there are variations between MicroManager versions, (3) there is no public set of test files, and (4) MicroManager frequently produces corrupted files (at least on my system).

The special parser in iohub was developed primarily with Micro-Manager gamma datasets. We will need to ask @nicost how the data formats have evolved. We can provide example datasets and collaborate on the parser for MM2 gamma maintained in the tifffile package.

cc: @edyoshikun, @talonchandler.

ziw-liu (Collaborator, Author) commented Feb 28, 2023

Another thought: are the image data returned by iohub and tifffile the same, i.e. (4, 97, 2048, 2048) uint16 arrays?

Yes, they are the same.
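
(For reference, a sketch of how the equivalence could be checked, reusing DATASET, FIRST_FILE, and POSITIONS from the benchmark script; the squeeze guards against singleton axes that one reader may keep and the other drop:)

import os
import numpy as np
from tifffile import TiffFile
from iohub.multipagetiff import MicromanagerOmeTiffReader

reader = MicromanagerOmeTiffReader(DATASET)
with TiffFile(os.path.join(DATASET, FIRST_FILE)) as tif:
    for p in POSITIONS:
        tf_array = tif.series[p].asarray()
        io_array = reader.get_array(p)
        assert np.array_equal(np.squeeze(tf_array), np.squeeze(io_array))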

I have just released tifffile 2023.2.27, which uses the Micro-Manager indexmap instead of parsing the IFD chain. That should improve timing, especially when files are on a network drive.

Great work! The timing did improve a lot. I just tested the 1 TB dataset on other compute nodes attached to the same storage server, and the initial delay appears to be a consistent effect of network storage caching.

Tifffile running first:

     tifffile  iohub
0  133.257984  31.177287
1   17.239525  29.169270
2   17.044940  29.175739
3   17.088925  29.221248
4   17.046381  29.282445

iohub running first (2 different nodes):

    tifffile  iohub
0  15.708963  62.636006
1  13.058120  20.977946
2  12.750226  20.705275
3  12.794518  20.277593
4  12.784348  20.557426

    tifffile   iohub
0  29.763436  188.030651
1  18.895259   32.523990
2  18.771151   31.426238
3  18.842250   31.213109
4  19.288997   31.242613

In later iterations tifffile is now consistently faster, and its performance on 'fresh' nodes is now usable. A remaining question is the difference in first-run performance between the two.

ziw-liu changed the title from "iohub's custom OME-TIFF implementation can be slower than tiffile on smaller datasets" to "Performance comparison between tifffile and iohub's custom OME-TIFF implementation" on Feb 28, 2023
cgohlke commented Mar 1, 2023

Great. Thanks for re-running the benchmark!

Are there any warnings from tifffile in the log output?

A remaining question is the difference in first-run performance between the two.

I think that is the overhead of reading all the Micro-Manager metadata from all files initially. The metadata are distributed across the files: the indexmap is at the beginning, while the OME-XML (in the first file only), the display settings, and comments are towards the end. It might be worth only reading the indexmap and OME-XML required for parsing the series to minimize cache misses.

ziw-liu (Collaborator, Author) commented Mar 1, 2023

Are there any warnings from tifffile in the log output?

There were warnings both before and after the update. iohub has the same issue since it uses tifffile to read the headers. I think this has always been the case, and iohub even has code intended to suppress tifffile warnings (marked as not working in a comment):

https://github.com/czbiohub/iohub/blob/a54aec97d14a56b69425476a05c9662b5b512fa6/iohub/multipagetiff.py#L35-L37

<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 2049807406
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 440544581
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 2049807406
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 440544581
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 1426148353
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 1476488961
<tifffile.TiffTag 270 @253745> coercing invalid ASCII to bytes
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 3955965643
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 4257259969
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 3590444033
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 3358267179
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 11993262
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 9306267
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 35258912
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 30671302
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 1291929857
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 1610705409
<tifffile.TiffTag 270 @253745> coercing invalid ASCII to bytes
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 3453931290
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 3578255049
<tifffile.TiffTag 270 @253745> coercing invalid ASCII to bytes
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 37683756
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 267128726
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 3489797890
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 4194435329

cgohlke commented Mar 1, 2023

Thanks. Those are the same kind of warnings I get on files produced by MicroManager 2.0.0. I'll double-check my code against the file format spec. The "coercing invalid ASCII to bytes" warning comes from a second ImageDescription tag, which usually contains ImageJ metadata but is clearly corrupted for these files.

ziw-liu (Collaborator, Author) commented Mar 1, 2023

The "coercing invalid ASCII to byte" warning comes from a second ImageDescription tag, which usually contains ImageJ metadata, but is clearly corrupted for these files.

Good to know. Calling TiffFile() on the first file gives <tifffile.TiffTag 270 @253745> coercing invalid ASCII to bytes, and the .imagej_metadata attribute of the returned object is empty.

cgohlke commented Mar 1, 2023

Turns out that some files written by Micro-Manager are > 4 GB, while the offsets in classic TIFF and the Micro-Manager header are 32-bit (< 4 GB). Hence the offsets to ImageJ metadata, comments, and display settings stored at the end of MicroManager files are frequently wrong (32-bit overflow). OME-XML could be affected too, but I don't have such a sample.
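
To illustrate the overflow with made-up numbers: an offset past 4 GB stored in a 32-bit field is truncated modulo 2**32, so the true position can only be recovered by adding the lost high bits back (as the snippet further down this thread does):

true_offset = 4_339_000_000          # e.g. metadata near the end of a 4.3 GB file
stored_offset = true_offset % 2**32  # what a 32-bit offset field actually holds
print(stored_offset)                 # 44032704 -> points at the wrong place
print(stored_offset + 2**32)         # adding 2**32 back recovers the true offset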

cgohlke commented Mar 1, 2023

Moreover, the display settings, which should be UTF-8 JSON strings, are always (?) truncated and invalid JSON.

cgohlke commented Mar 1, 2023

ImageJ metadata are also wrong for multi-file or combined multi-position datasets...

Tifffile 2023.2.28 contains potential speed improvements and fixes for reading some corrupted metadata.

ziw-liu (Collaborator, Author) commented Mar 1, 2023

Turns out that some files written by Micro-Manager are > 4GB while the offsets in classic TIFF and the Micro-Manager header are 32-bit (< 4 GB). Hence the offsets to ImageJ metadata, comments, and display settings stored at the end of MicroManager files are frequently wrong (32-bit overflow). OME-XML could be affected too, but I don't have such a sample.

This is probably because MM determines the maximum number of images to write into a TIFF file based only on the size of the pixel data:

https://github.com/micro-manager/micro-manager/blob/56726a5fe8298420ea100850dfd59b8c40445267/mmstudio/src/main/java/org/micromanager/data/internal/multipagetiff/MultipageTiffWriter.java#L380-L381

And since the MM metadata can be large (> 100 MB; it dumps the state of the entire system as a JSON string for every frame, micro-manager/micro-manager#1563), this can push files past the limit at times. For example, the first file (and many others) in the 1 TB dataset I'm using in the benchmark is 4.3 GB (4,339,252,149 bytes).

Even if they fix it in the future we will still have to be 'compatible' with existing broken data.

Edit: it does seem to check whether there's space to write the OME metadata. I couldn't find a corresponding check for the image plane metadata (the larger one), though.

https://github.com/micro-manager/micro-manager/blob/56726a5fe8298420ea100850dfd59b8c40445267/mmstudio/src/main/java/org/micromanager/data/internal/multipagetiff/FileSet.java#L149-L165

ziw-liu (Collaborator, Author) commented Mar 1, 2023

Tifffile 2023.2.28 contains potential speed improvements and fixes for reading some corrupted metadata.

Just tried it out. Same 1 TB benchmark on the same node:

# 2023.2.27
    tifffile  iohub
0  17.227365  30.499268
1  17.311548  29.198703
2  16.966164  29.202692
3  16.967853  29.073899
4  16.956701  29.024661
# 2023.2.28
    tifffile  iohub
0  15.639089  48.219170
1  15.678544  34.795313
2  15.628425  34.761851
3  15.679976  34.806345
4  15.602959  34.863524

Now the header warnings are silenced. Only the tag 270 encoding warning persists. I tried loading the tag with PIL but that gives me a corrupted UTF-8 string (no luck with chardet):

'\x02°\x02Ñ\x02\x8a\x02?\x02¢\x02\x7f\x02¡\x02n\x02\x88\x02µ\x02Ç\x02\x90\x02r\x02~\x02®\x02{\x02¼\x02¡\x02\\\x02ç\x02m\x02¥\x02\x88\x02\x93\x02«\x02Ï\x02u\x02{\x02\x9a\x02`\x02t\x02~\x02i\x02\x87\x02\x98\x02\x81\x02\x82\x02a\x02¢\x02y\x02u\x02\x8a\x02U\x02\x87\x02e\x02\x9f\x02'

cgohlke commented Mar 1, 2023

Those timings make sense. The tifffile series interface is a little faster because it now only reads the indexmaps from the beginning of the files and the OME-XML from the end of the first file. Iohub is a little slower because the read_micromanager_metadata function now actually reads and (tries to) decode the Micro-Manager metadata from the end of the files instead of failing early.

I did not special-case the TIFF parser for the wrong second ImageDescription tag. If you really need to recover the (wrong) ImageJ metadata from files > 4 GB, try:

import tifffile

with tifffile.TiffFile(FILENAME) as tif:
    tag = tif.pages[0].tags.get('ImageDescription', index=1)
    tif.filehandle.seek(tag.valueoffset + 2**32)
    data = tif.filehandle.read(tag.count)
    imagej_description = tifffile.stripnull(data).decode('cp1252')
    print(imagej_description)

cgohlke commented Mar 16, 2023

Tifffile v2023.3.15 includes a new parser for MMStack series and reverts the MicroManager specific optimizations for OME series. Positions are now returned as a dimension in the series rather than as separate series. The dimension order is parsed from the MM metadata and might differ from OME. I only have a limited number of test files, many of which are corrupted in one way or another. Hope it works for you.
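
In other words, the whole acquisition should come back as a single series with a position axis. A sketch of what consuming that might look like (not verified against the new release; FIRST_FILE and POSITIONS as in the benchmark script, and the position axis is assumed to come first, so check series.axes):

import zarr
from tifffile import TiffFile

with TiffFile(FIRST_FILE) as tif:
    series = tif.series[0]            # one series for the whole MMStack dataset
    print(series.axes, series.shape)  # positions now appear as one of the dimensions
    store = series.aszarr()
    data = zarr.open(store, mode="r")
    for p in POSITIONS:
        _ = data[p]                   # index along the (assumed leading) position axis
    store.close()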

ziw-liu (Collaborator, Author) commented Mar 17, 2023

Tifffile v2023.3.15 includes a new parser for MMStack series and reverts the MicroManager specific optimizations for OME series. Positions are now returned as a dimension in the series rather than as separate series. The dimension order is parsed from the MM metadata and might differ from OME. I only have a limited number of test files, many of which are corrupted in one way or another. Hope it works for you.

@cgohlke Thanks for letting us know! I have yet to change the related code here, so I have to pin the tifffile version on the main branch for now to avoid this breaking change.

nicost commented Apr 11, 2023

Just found this thread. It would be great to open issues on the micro-manager repository (https://github.com/micro-manager/micro-manager). I see at least two different issues here (file size sometimes > 4 GB, and "the display settings, which should be UTF-8 JSON strings, are always (?) truncated and invalid JSON"), but there may be more.
