Improve speed when opening tiff files over the network #5

Open
ecobost opened this issue Apr 6, 2019 · 2 comments

Comments

@ecobost
Collaborator

ecobost commented Apr 6, 2019

After opening a file, if a user accesses the num_frames of a scan, tifffile iterates over every page to find its offset (see step 2 in the "Details of data loading" section of the readme). This turns out to be very slow over the network, almost 200x slower than when the file is local (28.6 s vs. 0.15 s in the profiles below):

In [13]: f2 = tifffile.TiffFile('/mnt/scratch06/Two-Photon/taliah/2019-04-03_12-41-44/21067_10_00003_00001.tif')   # over the network

In [14]: cProfile.run('n2 = len(f2.pages)')
         240111 function calls (240109 primitive calls) in 28.641 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   28.641   28.641 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 tifffile.py:2035(filehandle)
        1    0.287    0.287   28.641   28.641 tifffile.py:3375(_seek)
        1    0.000    0.000   28.641   28.641 tifffile.py:3567(__len__)
    40000    0.053    0.000   28.080    0.001 tifffile.py:5570(read)
    40001    0.065    0.000    0.209    0.000 tifffile.py:5662(seek)
    19999    0.010    0.000    0.010    0.000 tifffile.py:5704(size)
        1    0.000    0.000    0.000    0.000 tifffile.py:5708(closed)
    40000    0.049    0.000    0.049    0.000 {built-in method _struct.unpack}
        1    0.000    0.000   28.641   28.641 {built-in method builtins.exec}
      101    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      3/1    0.000    0.000   28.641   28.641 {built-in method builtins.len}
    19999    0.006    0.000    0.006    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    40000   28.027    0.001   28.027    0.001 {method 'read' of '_io.BufferedReader' objects}
    40001    0.144    0.000    0.144    0.000 {method 'seek' of '_io.BufferedReader' objects}

In [18]: f3 = tifffile.TiffFile('/data/pipeline/21067_10_00003_00001.tif')   # local

In [19]: cProfile.run('n2 = len(f3.pages)')
         240111 function calls (240109 primitive calls) in 0.154 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.154    0.154 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 tifffile.py:2035(filehandle)
        1    0.046    0.046    0.154    0.154 tifffile.py:3375(_seek)
        1    0.000    0.000    0.154    0.154 tifffile.py:3567(__len__)
    40000    0.011    0.000    0.062    0.000 tifffile.py:5570(read)
    40001    0.014    0.000    0.036    0.000 tifffile.py:5662(seek)
    19999    0.003    0.000    0.003    0.000 tifffile.py:5704(size)
        1    0.000    0.000    0.000    0.000 tifffile.py:5708(closed)
    40000    0.006    0.000    0.006    0.000 {built-in method _struct.unpack}
        1    0.000    0.000    0.154    0.154 {built-in method builtins.exec}
      101    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      3/1    0.000    0.000    0.154    0.154 {built-in method builtins.len}
    19999    0.002    0.000    0.002    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    40000    0.051    0.000    0.051    0.000 {method 'read' of '_io.BufferedReader' objects}
    40001    0.023    0.000    0.023    0.000 {method 'seek' of '_io.BufferedReader' objects}

The chain of operations goes scan.num_frames -> len(TiffFile.pages) -> TiffFile.TiffPages.seek(-1). Starting at the first page, which has already been read, seek(-1) moves page by page, reading each page's offset and saving it in an index. Per page, it performs two seeks and two reads on the tiff file handle (an io.BufferedReader object); these reads take almost all of the time (28.0 of the 28.6 seconds).
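
As a rough illustration of that per-page walk, here is a simplified sketch; it is not tifffile's actual code. It assumes a little-endian BigTIFF layout (8-byte tag count, 20-byte tag entries, 8-byte next-page offset). The names `walk_ifd_offsets` and `make_chain` are hypothetical, and the "file" is a synthetic in-memory chain rather than a real scan.

```python
import io
import struct

TAGNOSIZE = 8   # bytes holding the number of tags in a page (BigTIFF)
TAGSIZE = 20    # bytes per tag entry (BigTIFF)
OFFSETSIZE = 8  # bytes holding the offset of the next page (BigTIFF)


def walk_ifd_offsets(fh, first_offset):
    """Follow the page chain and return the offset of every page."""
    offsets = [first_offset]
    offset = first_offset
    while True:
        fh.seek(offset)                                   # seek 1: start of this page
        (tagno,) = struct.unpack('<Q', fh.read(TAGNOSIZE))     # read 1: number of tags
        fh.seek(offset + TAGNOSIZE + tagno * TAGSIZE)     # seek 2: skip the tag entries
        (offset,) = struct.unpack('<Q', fh.read(OFFSETSIZE))   # read 2: next page offset
        if offset == 0:                                   # 0 marks the last page
            break
        offsets.append(offset)
    return offsets


def make_chain(num_pages, tags_per_page=2, start=16):
    """Build a synthetic chain of pages (tag contents are dummy zero bytes)."""
    page_size = TAGNOSIZE + tags_per_page * TAGSIZE + OFFSETSIZE
    buf = bytearray(start)  # placeholder for the file header
    for i in range(num_pages):
        next_offset = start + (i + 1) * page_size if i < num_pages - 1 else 0
        buf += struct.pack('<Q', tags_per_page)
        buf += bytes(tags_per_page * TAGSIZE)
        buf += struct.pack('<Q', next_offset)
    return io.BytesIO(bytes(buf))


offsets = walk_ifd_offsets(make_chain(3), first_offset=16)
print(offsets)  # [16, 72, 128]
```

Every iteration costs two small reads, so a file with about 20,000 pages issues about 40,000 of them, which matches the read counts in the profiles above.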

However, each read fetches only 8 bytes (fh.read(tagnosize) reads the number of tags and fh.read(offsetsize) reads the actual offset), which doesn't amount to enough data to be a bottleneck: even assuming each 8-byte read is sent as a 96-byte TCP packet, 40,000 reads come to only around 4 MB, which would not take 28 seconds. My guess is that it is the sheer number of packets that is causing the problem.
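
The back-of-the-envelope numbers can be checked directly from the profiles above. This is only arithmetic on the reported figures; the 96-byte packet size is the assumption stated in the text, not a measured value.

```python
# Figures taken from the two cProfile runs above.
num_reads = 40000            # calls to BufferedReader.read
network_read_s = 28.027      # total time in read(), over the network
network_total_s = 28.641     # total time for len(f2.pages)
local_total_s = 0.154        # total time for len(f3.pages)

packet_bytes = 96            # assumed size of one packet carrying an 8-byte read

total_mb = num_reads * packet_bytes / 1e6
per_read_ms = network_read_s / num_reads * 1e3
slowdown = network_total_s / local_total_s

print(f"data on the wire: {total_mb:.2f} MB")      # 3.84 MB
print(f"time per read:    {per_read_ms:.2f} ms")   # 0.70 ms
print(f"slowdown:         {slowdown:.0f}x")        # 186x
```

At roughly 0.7 ms per read, the cost looks like a fixed per-request delay rather than a bandwidth limit, which is consistent with the guess that the number of requests is what matters.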

In any case, because all pages in a ScanImage tiff file are the same size on file, the offset from page to page will be exactly the same, so we only need to compute one offset overall (or maybe one per file, to be safe and avoid read errors if two files come from different scans). This will require changing the seek function in tifffile.TiffPages to compute the offset only once and fill in the rest of the page offsets with it.
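
A minimal sketch of the idea, under strong assumptions: every page occupies exactly the same number of bytes, and pages are laid out back to back from the first page to the end of the file with nothing else in between. The function name `predict_page_offsets` is hypothetical and this is not how tifffile is implemented; it only shows the arithmetic that would replace the page-by-page walk.

```python
def predict_page_offsets(first_offset, second_offset, file_size):
    """Predict all page offsets from the first two, assuming a constant stride.

    Assumes pages of identical size stored contiguously up to the end of the
    file. If that does not hold, the predicted offsets will be wrong.
    """
    stride = second_offset - first_offset
    if stride <= 0:
        raise ValueError("second page must come after the first page")
    num_pages = (file_size - first_offset) // stride
    return [first_offset + i * stride for i in range(num_pages)]


# Toy numbers: 16-byte header followed by three 100-byte pages.
print(predict_page_offsets(16, 116, 316))  # [16, 116, 216]
```

Only the first two offsets need to be read from the file, so the cost no longer grows with the number of pages. Reading the real offset of one late page and comparing it against the prediction would be a cheap way to detect files where the fixed-stride assumption fails.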

ecobost changed the title from "Improve speed when opening tiff files" to "Improve speed when opening tiff files over the network" on May 25, 2019
@cgohlke

cgohlke commented Sep 29, 2020

> because all of ScanImage's tiff files' pages are the same size on file, the offset from page to page will be exactly the same so we only need to compute one offset overall

FWIW, this is not true for ScanImage > 2015 BigTIFF files, where the ImageDescription tag value varies. See also cgohlke/tifffile#29.

@ecobost
Collaborator Author

ecobost commented Oct 1, 2020

Hi @cgohlke
Tags changing size will be annoying; I would have to check when that is the case. I remember checking offsets for some test cases and they were the same, but maybe it changes for some configs. At least it will be patently obvious if the offsets are wrong (all kinds of stuff should break).
Thanks for letting us know 👍

PS: Not sure why that would have been a problem in the referenced issue. I thought tifffile explicitly reads the offsets page by page (even if is_scanimage is True); that's what this issue was supposed to be about.
