
Performance comparison between tifffile and iohub's custom OME-TIFF implementation #66

Open
ziw-liu opened this issue Feb 22, 2023 · 24 comments
Labels: μManager (Micro-Manager files and metadata), performance (Speed and memory usage of the code)

Comments

ziw-liu (Collaborator) commented Feb 22, 2023

A custom OME-TIFF reader (iohub.multipagetiff.MicromanagerOmeTiffReader) was implemented because historically tifffile and AICSImageIO were slow when reading large OME-TIFF series generated by Micro-Manager acquisitions.

While debugging #65 I found that this implementation does not guarantee data integrity during reading. Before investing more time in fixing it, I think it is worth revisiting whether maintaining a custom OME-TIFF reader is still justified, given that the more widely adopted solutions have evolved since waveorder.io was designed.

Here is a simple read speed benchmark of tifffile and iohub's custom reader:

[Figure: benchmark plot of read times, tifffile vs. the iohub custom reader]

The test was done on a 123 GB dataset with PTCZYX=(8, 9, 3, 81, 2048, 2048) dimensions. Voxels from 2 non-sequential positions were read into RAM in each iteration (N=5).

Test script:

Environment: Python 3.10.8, Linux 4.18 (x86_64, AMD EPYC 7H12@2.6GHz)

# %%
import os
from timeit import timeit
import zarr
import pandas as pd

# readers tested
from tifffile import TiffSequence  # 2023.2.3
from iohub.multipagetiff import MicromanagerOmeTiffReader  # 0.1.dev368+g3d62e6f


# %%
# 123 GB total
DATASET = (
    "/hpc/projects/comp_micro/rawdata/hummingbird/Soorya/"
    "2022_06_27_A549cellMembraneStained/"
    "A549_CellMaskDye_Well1_deltz0.25_63X_30s_2framemin/"
    "A549_CellMaskdye_Well1_30s_2framemin_1"
)

POSITIONS = (2, 0)


# %%
def read_tifffile():
    sequence = TiffSequence(os.scandir(DATASET))
    data = zarr.open(sequence.aszarr(), mode="r")
    for p in POSITIONS:
        _ = data[p]
    sequence.close()


# %%
def read_custom():
    reader = MicromanagerOmeTiffReader(DATASET)
    for p in POSITIONS:
        _ = reader.get_array(p)


# %%
def repeat(n=5):
    tf_times = []
    wo_times = []
    for _ in range(n):
        tf_times.append(
            timeit(
                "read_tifffile()", number=1, setup="from __main__ import read_tifffile"
            )
        )
        wo_times.append(
            timeit("read_custom()", number=1, setup="from __main__ import read_custom")
        )
    return pd.DataFrame({"tifffile": tf_times, "waveorder": wo_times})


# %%
timings = repeat()

At least in this test, the latest tifffile consistently outperforms the iohub implementation. While a comprehensive benchmark will take more time (#57), I think that as long as a widely used library is not significantly slower, the reduced maintenance overhead and broader user testing make a strong case for reconsidering whether to keep maintaining custom code in iohub.

ziw-liu added the μManager (Micro-Manager files and metadata) and performance (Speed and memory usage of the code) labels on Feb 22, 2023
ziw-liu (Collaborator, Author) commented Feb 23, 2023

A note on interpreting the results (why there are no error bars), from the timeit docs:

Note It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.

In our case, filesystem caching might lower that bound if the files are small, but the general idea holds.
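
As a concrete illustration of that advice (a minimal sketch; the timing numbers are made up, and the columns match the DataFrame returned by repeat() above), the summary of interest is the per-reader minimum rather than mean ± std:

import pandas as pd

# example timing vectors; in practice this is the DataFrame returned by repeat()
timings = pd.DataFrame({"tifffile": [52.1, 48.3, 47.9], "waveorder": [30.2, 28.5, 28.4]})

# the per-reader minimum is the meaningful lower bound; then eyeball the full vector
print(timings.min())
print(timings)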

ziw-liu (Collaborator, Author) commented Feb 23, 2023

Repeated the test with a much larger (1036 GB) dataset, and this time iohub's custom reader is faster by a large margin:

[Figure: read-time benchmark on the 1036 GB dataset, tifffile vs. the iohub custom reader]

Test script:

Environment: Python 3.10.8, Linux 4.18 (x86_64, AMD EPYC 7H12@2.6GHz)

# %%
import os
from timeit import timeit
from tqdm import tqdm
import zarr
import pandas as pd

# readers tested
from tifffile import TiffSequence  # 2023.2.3
from iohub.multipagetiff import MicromanagerOmeTiffReader  # 0.1.dev368+g3d62e6f


# %%
# 1036 GB total
DATASET = (
    "/hpc/projects/comp_micro/rawdata/hummingbird/Janie/"
    "2022_03_15_orgs_nuc_mem_63x_04NA/all_21_3"
)
POSITIONS = (50, 100, 150, 200, 250)


# %%
def read_tifffile():
    sequence = TiffSequence(os.scandir(DATASET))
    data = zarr.open(sequence.aszarr(), mode="r")
    for p in POSITIONS:
        _ = data[p]
    sequence.close()


# %%
def read_custom():
    reader = MicromanagerOmeTiffReader(DATASET)
    for p in POSITIONS:
        _ = reader.get_array(p)


# %%
def repeat(n=5):
    tf_times = []
    wo_times = []
    for _ in tqdm(range(n)):
        tf_times.append(
            timeit(
                "read_tifffile()", number=1, setup="from __main__ import read_tifffile"
            )
        )
        wo_times.append(
            timeit("read_custom()", number=1, setup="from __main__ import read_custom")
        )
    return pd.DataFrame({"tifffile": tf_times, "waveorder": wo_times})


# %%
def main():
    timings = repeat()
    print(timings)
    timings.to_csv("large_tiff_time.csv")


# %%
if __name__ == "__main__":
    main()

ziw-liu (Collaborator, Author) commented Feb 23, 2023

Given these tests, the crossover point between tifffile and iohub performance appears to be in the vicinity of 500 GB of dataset size. Since our users frequently work with ~TB OME-TIFF datasets, it makes sense to keep investing in a custom solution.

One concern, however, is that 7a6675a also makes the custom reader a lot slower. Edit: fixed.

ziw-liu changed the title from "iohub's custom OME-TIFF implementation can be slower than tiffile" to "iohub's custom OME-TIFF implementation can be slower than tiffile on smaller datasets" on Feb 24, 2023
cgohlke commented Feb 25, 2023

Allow me to chime in. 350 vs 50 seconds to read 25 files seems excessive. What is the number of files, the size of the files, and the number of pages in each file? If this is a single, multi-file MicroManager dataset, the use of TiffSequence is counterproductive. Tifffile uses the OME-XML metadata in the first file, which describes the whole multi-file dataset and triggers indexing of all TIFF pages in all files of the dataset...

ziw-liu (Collaborator, Author) commented Feb 27, 2023

@cgohlke Thanks for pointing out how we can improve the benchmarking! We are very much looking forward to switching to a more widely used and tested implementation with comparable performance.

What is the number of files, size of files, and number of pages in each file? If this is a single, multi-file MicroManager dataset, the use of TiffSequence is counterproductive.

There are 260 .ome.tif files, each 4 GB (except those written at the end of the acquisition). It is a single MM acquisition of shape PTCZYX=(336, 1, 4, 97, 2048, 2048) and type uint16.
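
For scale, the data volume per benchmark iteration follows directly from that shape (a quick back-of-the-envelope sketch):

import numpy as np

# PTCZYX = (336, 1, 4, 97, 2048, 2048), uint16
bytes_per_position = 1 * 4 * 97 * 2048 * 2048 * np.dtype("uint16").itemsize
print(bytes_per_position / 1e9)      # ~3.3 GB per position
print(5 * bytes_per_position / 1e9)  # ~16.3 GB read per iteration (5 positions)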

Can you point us to the recommended entry point for such a dataset with tifffile? I tried using tifs = imread(FIRST_FILE, aszarr=True) to get the Zarr store, but it only seems to take even longer.

cgohlke commented Feb 27, 2023

Can you point us to the recommended entry point for such a dataset with tifffile? I tried using tifs = imread(FIRST_FILE, aszarr=True) to get the Zarr store but it only seems to take even longer time.

Unfortunately it is currently not possible to get a single Zarr store of a multi-position MicroManager dataset. Tifffile uses the OME-XML metadata, which describes positions as distinct OME series. imread(FIRST_FILE, ...) returns the zarr store for the first position only. Instead, use the lower level interface, e.g.:

import zarr
from tifffile import TiffFile

# FIRST_FILE and POSITIONS as in the benchmark script above
with TiffFile(FIRST_FILE) as tif:
    series = tif.series  # this parses OME-XML and indexes all pages in all files
    for position in POSITIONS:
        # read series as numpy array ...
        im = series[position].asarray()
        # ... or via zarr store
        store = series[position].aszarr()
        z = zarr.open(store, mode='r')
        im = z[:]
        store.close()

A major bottleneck when only accessing a few positions is that tifffile needs to index all TIFF pages/IFDs in the dataset when creating the series. That requires many seeks and small reads across many files. I have a patch that uses the MicroManager IndexMap, which speeds up indexing about 10x for a smaller 8 GB dataset.

It would require changes in tifffile to return a single series for the whole MicroManager datastore. Either special casing the OME-XML parser, or adding a dedicated MicroManager parser similar to NDTiff. A special parser would be preferable, but (1) the format seems poorly documented compared to OME-TIFF, (2) there are variations between MicroManager versions, (3) there is no public set of test files, and (4) MicroManager frequently produces corrupted files (at least on my system).

ziw-liu (Collaborator, Author) commented Feb 27, 2023

Tifffile uses the OME-XML metadata, which describes positions as distinct OME series. imread(FIRST_FILE, ...) returns the zarr store for the first position only. Instead, use the lower level interface, e.g.:

Thanks for showing this! This way is much faster in most runs. Changing the tifffile reading function to:

from tifffile import TiffFile  # replaces the TiffSequence import in the script above

# FIRST_FILE: file name of the first .ome.tif in DATASET
def read_tifffile():
    with TiffFile(os.path.join(DATASET, FIRST_FILE)) as tif:
        series = tif.series
        for p in POSITIONS:
            _ = series[p].asarray()

while keeping the rest of the script the same gives:

[Figure: read-time results with the revised tifffile function]

The 1000+ second outlier happened in the first run:

      tifffile  iohub
0  1456.934305  28.823575
1    47.703576  28.290186
2    47.394221  27.825434
3    47.895452  27.775385
4    46.929777  27.824112

I'll have to run it a couple more times to see if that's a consistent pattern or just a network glitch.

ziw-liu (Collaborator, Author) commented Feb 27, 2023

I'll have to run it a couple more times to see if that's a consistent pattern or just a network glitch.

Tested again with iohub running first in each loop:

    tifffile  iohub
0  47.075930  27.999947
1  47.212810  27.868444
2  47.222090  27.899368
3  47.039293  27.784547
4  47.265484  27.865451

cgohlke commented Feb 28, 2023

Interesting. I guess that "glitch" is due to caching and abysmal performance of parsing the TIFF IFD structures on a network drive.

I have just released tifffile 2023.2.27, which uses the Micro-Manager indexmap instead of parsing the IFD chain. That should improve timing, especially when files are on a network drive.

Another thought: are the image data returned by iohub and tifffile the same, i.e. (4, 97, 2048, 2048) uint16 arrays?

mattersoflight (Collaborator) commented

A special parser would be preferable, but (1) the format seems poorly documented compared to OME-TIFF, (2) there are variations between MicroManager versions, (3) there is no public set of test files, and (4) MicroManager frequently produces corrupted files (at least on my system).

The special parser in iohub was developed primarily with Micro-Manager gamma datasets. We will need to ask @nicost how the data formats have evolved. We can provide example datasets and collaborate on the parser for MM2 gamma maintained in the tifffile package.

cc: @edyoshikun, @talonchandler.

ziw-liu (Collaborator, Author) commented Feb 28, 2023

Another thought: are the image data returned by iohub and tifffile the same, i.e. (4, 97, 2048, 2048) uint16 arrays?

Yes, they are the same.
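
(For reference, a sketch of how the equivalence could be checked, reusing DATASET, FIRST_FILE, and POSITIONS from the benchmark script; the squeeze guards against singleton axes that one reader may keep and the other drop:)

import os
import numpy as np
from tifffile import TiffFile
from iohub.multipagetiff import MicromanagerOmeTiffReader

reader = MicromanagerOmeTiffReader(DATASET)
with TiffFile(os.path.join(DATASET, FIRST_FILE)) as tif:
    for p in POSITIONS:
        tf_array = tif.series[p].asarray()
        io_array = reader.get_array(p)
        assert np.array_equal(np.squeeze(tf_array), np.squeeze(io_array))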

I have just released tifffile 2023.2.27, which uses the Micro-Manager indexmap instead of parsing the IFD chain. That should improve timing, especially when files are on a network drive.

Great work! The timing did improve a lot. I just tested the 1 TB dataset on other compute nodes attached to the same storage server, and the initial delay appears to be a consistent effect of network storage caching.

Tifffile running first:

     tifffile  iohub
0  133.257984  31.177287
1   17.239525  29.169270
2   17.044940  29.175739
3   17.088925  29.221248
4   17.046381  29.282445

iohub running first (2 different nodes):

    tifffile  iohub
0  15.708963  62.636006
1  13.058120  20.977946
2  12.750226  20.705275
3  12.794518  20.277593
4  12.784348  20.557426

    tifffile   iohub
0  29.763436  188.030651
1  18.895259   32.523990
2  18.771151   31.426238
3  18.842250   31.213109
4  19.288997   31.242613

In later iterations tifffile is now consistently faster, and its performance on 'fresh' nodes is now usable. A remaining question is the difference in first-run performance between the two.

ziw-liu changed the title from "iohub's custom OME-TIFF implementation can be slower than tiffile on smaller datasets" to "Performance comparison between tifffile and iohub's custom OME-TIFF implementation" on Feb 28, 2023
cgohlke commented Mar 1, 2023

Great. Thanks for re-running the benchmark!

Are there any warnings from tifffile in the log output?

A remaining question is the difference in first-run performance between the two.

I think that is the overhead of reading all the Micro-Manager metadata from all files initially. The metadata are distributed across the files: the indexmap is at the beginning, while the OME-XML (in the first file only), the display settings, and comments are towards the end. It might be worth only reading the indexmap and OME-XML required for parsing the series to minimize cache misses.

ziw-liu (Collaborator, Author) commented Mar 1, 2023

Are there any warnings from tifffile in the log output?

There were warnings both before and after the update. iohub has the same issue since it uses tifffile to read the headers. I think this has always been the case, and iohub even has code intended to suppress tifffile warnings (marked as not working in a comment):

https://github.com/czbiohub/iohub/blob/a54aec97d14a56b69425476a05c9662b5b512fa6/iohub/multipagetiff.py#L35-L37

<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 2049807406
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 440544581
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 2049807406
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 440544581
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 1426148353
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 1476488961
<tifffile.TiffTag 270 @253745> coercing invalid ASCII to bytes
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 3955965643
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 4257259969
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 3590444033
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 3358267179
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 11993262
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 9306267
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 35258912
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 30671302
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 1291929857
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 1610705409
<tifffile.TiffTag 270 @253745> coercing invalid ASCII to bytes
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 3453931290
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 3578255049
<tifffile.TiffTag 270 @253745> coercing invalid ASCII to bytes
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 37683756
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 267128726
<tifffile.read_micromanager_metadata> failed to read display settings: invalid header 3489797890
<tifffile.read_micromanager_metadata> failed to read comments: invalid header 4194435329

cgohlke commented Mar 1, 2023

Thanks. Those are the same kind of warnings I get on files produced by MicroManager 2.0.0. I'll double-check my code against the file format spec. The "coercing invalid ASCII to bytes" warning comes from a second ImageDescription tag, which usually contains ImageJ metadata but is clearly corrupted for these files.

ziw-liu (Collaborator, Author) commented Mar 1, 2023

The "coercing invalid ASCII to byte" warning comes from a second ImageDescription tag, which usually contains ImageJ metadata, but is clearly corrupted for these files.

Good to know. Calling TiffFile() on the first file gives <tifffile.TiffTag 270 @253745> coercing invalid ASCII to bytes, and the .imagej_metadata attribute of the returned object is empty.

cgohlke commented Mar 1, 2023

Turns out that some files written by Micro-Manager are > 4 GB, while the offsets in classic TIFF and the Micro-Manager header are 32-bit (< 4 GB). Hence the offsets to ImageJ metadata, comments, and display settings stored at the end of MicroManager files are frequently wrong (32-bit overflow). OME-XML could be affected too, but I don't have such a sample.
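
To illustrate the overflow with made-up numbers: an offset past 4 GB stored in a 32-bit field is truncated modulo 2**32, so the true position can only be recovered by adding the lost high bits back (as the snippet further down this thread does):

true_offset = 4_339_000_000          # e.g. metadata near the end of a 4.3 GB file
stored_offset = true_offset % 2**32  # what a 32-bit offset field actually holds
print(stored_offset)                 # 44032704 -> points at the wrong place
print(stored_offset + 2**32)         # adding 2**32 back recovers the true offset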

cgohlke commented Mar 1, 2023

Moreover, the display settings, which should be UTF-8 JSON strings, are always (?) truncated and invalid JSON.

cgohlke commented Mar 1, 2023

ImageJ metadata are also wrong for multi-file or combined multi-position datasets...

Tifffile 2023.2.28 contains potential speed improvements and fixes for reading some corrupted metadata.

ziw-liu (Collaborator, Author) commented Mar 1, 2023

Turns out that some files written by Micro-Manager are > 4GB while the offsets in classic TIFF and the Micro-Manager header are 32-bit (< 4 GB). Hence the offsets to ImageJ metadata, comments, and display settings stored at the end of MicroManager files are frequently wrong (32-bit overflow). OME-XML could be affected too, but I don't have such a sample.

This is probably because MM determines the maximum number of images to write into a TIFF file based only on the size of the pixel data:

https://github.com/micro-manager/micro-manager/blob/56726a5fe8298420ea100850dfd59b8c40445267/mmstudio/src/main/java/org/micromanager/data/internal/multipagetiff/MultipageTiffWriter.java#L380-L381

And since the MM metadata can be large (> 100 MB; it dumps the state of the entire system as a JSON string for every frame, micro-manager/micro-manager#1563), this can push files past the limit at times. For example, the first file (and many others) in the 1 TB dataset I'm using in the benchmark is 4.3 GB (4,339,252,149 bytes).

Even if they fix it in the future we will still have to be 'compatible' with existing broken data.

Edit: it does seem to check whether there's space to write the OME metadata. I couldn't find a corresponding check for the image plane metadata (the larger one), though.

https://github.com/micro-manager/micro-manager/blob/56726a5fe8298420ea100850dfd59b8c40445267/mmstudio/src/main/java/org/micromanager/data/internal/multipagetiff/FileSet.java#L149-L165

ziw-liu (Collaborator, Author) commented Mar 1, 2023

Tifffile 2023.2.28 contains potential speed improvements and fixes for reading some corrupted metadata.

Just tried it out. Same 1 TB benchmark on the same node:

# 2023.2.27
    tifffile  iohub
0  17.227365  30.499268
1  17.311548  29.198703
2  16.966164  29.202692
3  16.967853  29.073899
4  16.956701  29.024661
# 2023.2.28
    tifffile  iohub
0  15.639089  48.219170
1  15.678544  34.795313
2  15.628425  34.761851
3  15.679976  34.806345
4  15.602959  34.863524

Now the header warnings are silenced. Only the tag 270 encoding warning persists. I tried loading the tag with PIL but that gives me a corrupted UTF-8 string (no luck with chardet):

'\x02°\x02Ñ\x02\x8a\x02?\x02¢\x02\x7f\x02¡\x02n\x02\x88\x02µ\x02Ç\x02\x90\x02r\x02~\x02®\x02{\x02¼\x02¡\x02\\\x02ç\x02m\x02¥\x02\x88\x02\x93\x02«\x02Ï\x02u\x02{\x02\x9a\x02`\x02t\x02~\x02i\x02\x87\x02\x98\x02\x81\x02\x82\x02a\x02¢\x02y\x02u\x02\x8a\x02U\x02\x87\x02e\x02\x9f\x02'

cgohlke commented Mar 1, 2023

Those timings make sense. The tifffile series interface is a little faster because it now only reads the indexmaps from the beginning of the files and the OME-XML from the end of the first file. Iohub is a little slower because the read_micromanager_metadata function now actually reads and (tries to) decode the Micro-Manager metadata from the end of the files instead of failing early.

I did not special-case the TIFF parser for the wrong second ImageDescription tag. If you really need to recover the (wrong) ImageJ metadata from files > 4 GB, try:

import tifffile

with tifffile.TiffFile(FILENAME) as tif:
    tag = tif.pages[0].tags.get('ImageDescription', index=1)
    tif.filehandle.seek(tag.valueoffset + 2**32)
    data = tif.filehandle.read(tag.count)
    imagej_description = tifffile.stripnull(data).decode('cp1252')
    print(imagej_description)

cgohlke commented Mar 16, 2023

Tifffile v2023.3.15 includes a new parser for MMStack series and reverts the MicroManager specific optimizations for OME series. Positions are now returned as a dimension in the series rather than as separate series. The dimension order is parsed from the MM metadata and might differ from OME. I only have a limited number of test files, many of which are corrupted in one way or another. Hope it works for you.
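
In other words, the whole acquisition should come back as a single series with a position axis. A sketch of what consuming that might look like (not verified against the new release; FIRST_FILE and POSITIONS as in the benchmark script, and the position axis is assumed to come first, so check series.axes):

import zarr
from tifffile import TiffFile

with TiffFile(FIRST_FILE) as tif:
    series = tif.series[0]            # one series for the whole MMStack dataset
    print(series.axes, series.shape)  # positions now appear as one of the dimensions
    store = series.aszarr()
    data = zarr.open(store, mode="r")
    for p in POSITIONS:
        _ = data[p]                   # index along the (assumed leading) position axis
    store.close()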

ziw-liu (Collaborator, Author) commented Mar 17, 2023

Tifffile v2023.3.15 includes a new parser for MMStack series and reverts the MicroManager specific optimizations for OME series. Positions are now returned as a dimension in the series rather than as separate series. The dimension order is parsed from the MM metadata and might differ from OME. I only have a limited number of test files, many of which are corrupted in one way or another. Hope it works for you.

@cgohlke Thanks for letting us know! I have yet to change the related code here, so I have to pin the tifffile version on the main branch for now to avoid this breaking change.

nicost commented Apr 11, 2023

Just found this thread. It would be great to open issues on the micro-manager repository (https://github.com/micro-manager/micro-manager). I see at least two different issues here (file size sometimes > 4 GB, and "the display settings, which should be UTF-8 JSON strings, are always (?) truncated and invalid JSON"), but there may be more.
