Save conversion time and disk space by not copying images into HDF5 file #15

Open · din14970 opened this issue Nov 20, 2020 · 20 comments
Labels: enhancement (New feature or request)

@din14970 (Owner)

The workflow read TVIPS -> process data -> save to hdf5 -> reconstruct scan -> save to blo or hspy is somewhat problematic because the intermediate hdf5 file can be huge. Additionally, once the data processing is done, that is what you are stuck with for conversion to blo or hspy; if you want additional processing you have to create a new hdf5 file. Parameter studies on the image preprocessing are prohibitive.

Ideally the data copying to hdf5 in the intermediate step is skipped so that the workflow becomes read TVIPS -> process data -> save to blo or hspy. The problem is getting a handle on which image frames to use and how to reconstruct the scan. To do this, the first step of the process should be just a "prepare dataset" step that:

  • extracts rotator indices if relevant (EM Conos)
  • extracts VBF intensities
  • extracts x,y positions of the direct beam (if selected)

These items can then be stored in an hdf5 file that the user should not care about - they could even just be kept in memory. Necessary for these steps are a preview of the diffraction pattern, the VBF aperture, and maybe the direct beam center. No other pre-processing should be done on the data at this point. I would guess this also cuts down on the time needed to loop over the data.
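As a rough illustration of the VBF-intensity extraction (a sketch only - the function name, circular aperture and parameters below are mine, not part of the current code):

import numpy as np

# Sketch: sum the counts inside a circular VBF aperture centred on an
# (assumed) beam position for a single diffraction frame.
def vbf_intensity(frame, center, radius):
    yy, xx = np.ogrid[:frame.shape[0], :frame.shape[1]]
    mask = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    return frame[mask].sum()

# e.g. a fake 256x256 frame with the aperture on the pattern centre
frame = np.random.poisson(5, size=(256, 256))
print(vbf_intensity(frame, center=(128, 128), radius=10))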

In the second step, this information could be used to reconstruct the scan and VBF. The user should then have the option to modify the data with filters/binning/cropping etc. The data should be extracted directly from the TVIPS files, processed, and then written to the correct file. The beam center can be used to shift the diffraction patterns so that they are centered.

There should still be an option to export a subset of the data to tiff and maybe to HDF5, but it shouldn't be a requirement.

@din14970 added the enhancement label on Nov 20, 2020
@harripj (Collaborator) commented Feb 16, 2021

I wonder whether it would be possible to load the tvips file into hyperspy directly, without converting to .hspy for example. Given that the data is already huge, we would want to load it lazily.

@din14970 (Owner, Author)

That would be great and I could probably write such a plug-in, but the main issues in my view are:

  • data spread over multiple input files. Dask (the layer underneath hyperspy's lazy loading) works well with multiple hdf5 files, but I have not yet tried this with other kinds of files.
  • lack of metadata. In many files there is no information about the scan coordinates. So data could only be loaded into hspy as a <frames|kx, ky> dataset. Reshaping this thing into the 4D dataset that it should be seems quite a hassle that most people don't want to go through every time.

The real solution would be if TVIPS could store their data in a decent format including relevant metadata.

@harripj (Collaborator) commented Feb 17, 2021

Yes, the user would need to set the scan dimensions, which is a shame. I don't know whether we could reshape the hyperspy signal to 2D if we loaded it as 1D.

I had a brief go at creating a LazySignal2D from a list of lists of np.memmap arrays, but plotting the data with hyperspy appeared to load it all into memory without deleting it (too much for my computer). I'm not familiar enough with either dask or hyperspy to take it much further, but we would need the data to be read, processed, and then only loaded again when needed (I suppose this may be possible).

@din14970 (Owner, Author)

With quite some delay, I've started work on this here: hyperspy/hyperspy#2781. I'm indeed using np.memmap over the files (the entire file is a memmapped array) and combining these arrays with dask.array.concatenate. With a smart rechunking I can index into this array to retrieve images, and then it's a breeze to do the data reshaping. Snake-scan and hysteresis are a bit of an issue since dask arrays don't support in-place reassignment, but I think I implemented an OK workaround. I think interfacing directly with the tvips files will be much cleaner this way.
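Schematically, the approach looks something like this (a simplified sketch: file names, dtype, frame shape and offsets are placeholders, and the real reader has to parse them from the TVIPS headers and handle the per-file/per-frame records):

import numpy as np
import dask.array as da

files = ['rec_000.tvips', 'rec_001.tvips']   # placeholder file names
frame_shape = (512, 512)                     # placeholder frame size
dtype = np.uint16                            # placeholder dtype
offset = 0                                   # placeholder header offset in bytes

# memmap each file as a flat stack of frames and wrap it in a chunked dask array
pieces = []
for fname in files:
    mm = np.memmap(fname, dtype=dtype, mode='r', offset=offset)
    n_frames = mm.size // (frame_shape[0] * frame_shape[1])
    frames = mm[: n_frames * frame_shape[0] * frame_shape[1]].reshape((n_frames,) + frame_shape)
    pieces.append(da.from_array(frames, chunks=(32,) + frame_shape))

stack = da.concatenate(pieces, axis=0)       # <frames | kx, ky>

# once the scan shape is known, reshape into the 4D <y, x | kx, ky> dataset
scan_shape = (100, 100)                      # supplied by the user
data_4d = stack[: scan_shape[0] * scan_shape[1]].reshape(scan_shape + frame_shape)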

@ericpre commented Jun 26, 2021

snake-scan and hysteresis are a bit of an issue since dask arrays don't support in-place reassignment but I think I implemented an ok workaround.

Just in case this is useful: dask 2021.4.1 added support for item assignment in dask/dask#7393, and this is documented at https://docs.dask.org/en/latest/array-assignment.html.
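For reference, a minimal example of that assignment support (requires dask >= 2021.4.1):

import dask.array as da

# item assignment on a dask array, supported since dask 2021.4.1
a = da.arange(12, chunks=4)
a[::2] = 0          # assign into a strided slice without computing the array
print(a.compute())  # [ 0  1  0  3  0  5  0  7  0  9  0 11]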

@harripj (Collaborator) commented Sep 6, 2021

Just a quick question on Dask, as I'm not that familiar with it. Once a memmap array is loaded into memory, is it retained there, or is there a way to read the array, process and store the result, and then dump it from memory (as I guess would be wanted here with such large data)? My experience with memmap arrays is that they eat away at my system memory even when looping through each frame individually; however, I am hoping there is a way around this that I am currently unfamiliar with.

As an alternative, although I expect it is not compatible with HyperSpy, I have found that a generator processing each frame works well on an SSD, with the benefit of not using much more than an individual frame's worth of system memory. The drawback is that the entire dataset is not accessible at the same time for indexing (as would be the case for an array).
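The generator approach is roughly the following (a sketch with a made-up frame shape/dtype and no TVIPS header handling, just to illustrate the constant-memory idea):

import numpy as np

def iter_frames(path, frame_shape=(512, 512), dtype=np.uint16, offset=0):
    # hypothetical helper: yield one frame at a time so that only a single
    # frame's worth of data is held in memory at any moment
    frame_bytes = int(np.prod(frame_shape)) * np.dtype(dtype).itemsize
    with open(path, 'rb') as f:
        f.seek(offset)
        while True:
            buf = f.read(frame_bytes)
            if len(buf) < frame_bytes:
                break
            yield np.frombuffer(buf, dtype=dtype).reshape(frame_shape)

# e.g. compute a per-frame total intensity without loading the whole file
# totals = [frame.sum() for frame in iter_frames('dataset.tvips')]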

@din14970 (Owner, Author) commented Sep 6, 2021

Phoo, I actually have no idea how it works under the hood. My expectation for memmap is that it should give me a handle on a file as if it were an array, without loading the entire thing into memory, as per the documentation: https://numpy.org/doc/stable/reference/generated/numpy.memmap.html. So it seems strange to me that it would fill up your memory; it should basically just be a fancy file pointer. Since memmap behaves like an array, it seems possible to create another layer on top of it - the dask array - which is what I try to use in hyperspy/hyperspy#2781. Dask divides the array into manageable chunks that are loaded into memory only when required. First one builds a task graph by calling functions on this array, and dask.compute should then take care of executing these in a sensible way. I don't know what happens when you loop or how you are constructing your loop, but in general you want to avoid explicit Python loops over arrays. When I was playing with this to create the TVIPS file reader I didn't have any issues with memory on my MacBook, but perhaps I should take a closer look at profiling. Do you have some problematic code snippets?

@harripj (Collaborator) commented Sep 6, 2021

In fact, looking at it a bit closer now, I think what I was experiencing was a reference-counting issue, and perhaps Dask handles this well. For example, if you load a big(-gish) array from disk, it will stay in memory once computed until it has no more references, at which point it is garbage collected (the memory usage observed is from a fresh kernel after having saved the array to disk in step 1):

1:

import numpy as np

# save array of numbers
n = int(3e8)
nums = np.arange(n).reshape(15000, 20000)

# will save array to disk, NB ~1.2 GB
np.save('nums_memmap_test.npy', nums)

2 (memory usage +0 GB):

# load array as memmap
a = np.memmap('nums_memmap_test.npy')

3 (memory usage +1.2 GB):

a.max()
# array is now stored in memory and persists until all references removed

4 (memory usage -1.2 GB):

del a

@din14970 (Owner, Author) commented Sep 6, 2021

I can't seem to reproduce your example :/

I tried:

import numpy as np

# save array of numbers
n = int(3e8)
nums = np.arange(n).reshape(15000, 20000)

# creating the file
fp = np.memmap("nums_memmap_test.dat", dtype='int64', mode='w+', shape=(15000,20000))
# dumping the data to the buffer
fp[:] = nums[:]
# saving the buffer to disk
fp.flush()

# after the following line we are back to original memory
del nums

As far as I understand, one shouldn't use the .npy format for this, since it's not a raw format.

Then memmapping the file:

# no additional memory used
a = np.memmap('nums_memmap_test.dat', dtype="int64", mode="r", shape=(15000, 20000))

# no additional memory used
memmap_max = a.max()

I basically just looked at what happens to system memory but didn't do extensive profiling. The last line SHOULD load the entire array into memory for a short bit, so there should be a peak in memory consumption, but the data should afterwards be immediately garbage collected.

Are you on Windows? Maybe it's an OS thing; I seem to vaguely recall that garbage collection on Windows works a bit differently than on Linux/macOS. If so, could you try running import gc; gc.collect() after the a.max() line?

Anyway, to avoid any memory peak, I thought of using dask in this way:

import dask.array as da

a_lazy = da.array(a)
lazy_max = a_lazy.max()
lazy_max.compute()

It would be interesting to see if this decreases the memory peak.

@harripj (Collaborator) commented Sep 7, 2021

Yes, I tried this on Windows - it could well be OS-dependent. For me, both the .dat and the dask implementations show the same behaviour. In the dask case, the memory persists even if I run del a_lazy; del lazy_max and is only freed when I run del a. gc.collect() doesn't appear to free the memory either. Perhaps there is a better way to do this on Windows, but I'm glad to hear it works on OSX!

@din14970 (Owner, Author) commented Sep 7, 2021

I didn't try on OSX - my test was on Linux - but I suppose they should behave similarly. I found this thread, which does mention OS-specific differences: https://stackoverflow.com/questions/45132940/numpy-memmap-memory-usage-want-to-iterate-once. In principle dask should automatically do the loop they show in that thread; if not, I would guess it's a dask bug on Windows?
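The workaround in that thread essentially amounts to reading the memmap in blocks and dropping each block after use, something like:

import numpy as np

# block-wise iteration: only one block of rows is resident at a time,
# so memory use stays bounded regardless of OS
a = np.memmap('nums_memmap_test.dat', dtype='int64', mode='r', shape=(15000, 20000))
block = 1000
running_max = None
for start in range(0, a.shape[0], block):
    chunk = np.array(a[start:start + block])  # copy the block into RAM
    m = chunk.max()
    running_max = m if running_max is None else max(running_max, m)
    del chunk  # release the block before reading the next one
print(running_max)

This is also more or less what dask does per chunk when you call compute().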

@din14970 (Owner, Author) commented Sep 7, 2021

It's good that you discovered this issue, because my TVIPS reader implementation in HyperSpy was going to rely entirely on this memmap + dask trick, and I use it in reverse in the BLO writer: https://github.com/hyperspy/hyperspy/blob/8c74de87855a7e6cf53acfa85ca6b1e77367d619/hyperspy/io_plugins/blockfile.py#L423
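The "in reverse" part is essentially streaming a dask array into a pre-allocated writable memmap, chunk by chunk - schematically (not the actual blockfile.py code):

import numpy as np
import dask.array as da

# schematic only: write a lazy array to disk through a writable memmap,
# one chunk at a time, without ever holding the full array in memory
data = da.zeros((100, 100, 64, 64), dtype='uint8', chunks=(10, 100, 64, 64))
target = np.memmap('output.raw', dtype=data.dtype, mode='w+', shape=data.shape)
da.store(data, target)
target.flush()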

@harripj (Collaborator) commented Sep 7, 2021

I tried to implement something like this (memmap) for the TVIPS files in the past too, but it blew up in my face memory-wise as we have discussed here, so I gave it up. I assumed it was because I was implementing memmap incorrectly, as I don't have huge experience with it, but it could well be that I was just using Windows... ¯\_(ツ)_/¯ I didn't try on my Mac, so I settled on a generator version, which runs quickly in fairness and keeps memory to an absolute minimum, but I agree a dask implementation would be best.

I could try to run your dask TVIPS implementation on my Windows PC to see if it blows up too, if that would help?

@din14970 (Owner, Author) commented Sep 7, 2021

Yes, actually, it would be great if you could test out the TVIPS branch; you might want to merge locally with their main branch RELEASE_next_patch. I would be interested to see what happens when you:

  • slice into it and for example make a plot of a single image
  • try to export the file to another file format e.g. BLO with the save method

I seem to recall it worked OK on macOS when I made it, but I don't have any Windows systems to test it on. I will try again on Linux. If it behaves differently on Windows then it must be a dask issue.

@harripj (Collaborator) commented Sep 7, 2021

So I installed the PR as suggested above. If I load a tvips file that I have (the file size is ~7.3 GB) and plot it:

data = hsio.load(r'dataset.tvips', lazy=True, scan_shape=(100, 100), offset=1)
data.transpose(signal_axes=(0, 1)).plot()

My system memory peaks and drops, but drops to a level 5.1 GB higher than before plotting. Subsequent plotting calls peak the memory and return to this elevated level. In fact, if I change the scan shape to (50, 100), then the peak is smaller (as expected) and the elevated memory level is 2.6 GB higher than before, i.e. half as much memory, as expected.

If I try to load close to the whole file (scan_shape=(20, 719)), then the full 7 GB remains in memory. The peak appears to be twice as large as the file itself, which I don't understand. I suppose you don't see this on your Linux system?


@ericpre commented Sep 9, 2021

I compared some of the code mentioned above on Windows and Linux and I can reproduce the issue on Windows - Linux is fine. I briefly tried to use a different approach to read the array and pass it to dask, but I was always getting the memory issue (whole array loaded into memory).

data = hsio.load(r'dataset.tvips', lazy=True, scan_shape=(100, 100), offset=1)
data.transpose(signal_axes=(0, 1)).plot()

@harripj, are you aware of the dask regression in memory usage with the threaded scheduler? The versions affected are from 2021.4.0 to 2021.8.1, if I remember correctly.
To narrow down the last issue, can you try with data.plot(navigator='slider') first and then add more complexity with the transpose and the calculation of the navigator, which you can optimise for your needs?
The transpose will need to go through the whole dataset, and I wouldn't be surprised if the memory usage peak was due to the recent memory usage regression in dask, which is now fixed. I don't have a tvips dataset, so I can't try to reproduce.

@harripj (Collaborator) commented Sep 9, 2021

@ericpre thanks for following up with this! No, I wasn't aware, but I can confirm that I was testing this using dask 2021.7.0 and Python 3.8.10.

The reason for the transpose was that hyperspy threw the error: ValueError: Plotting is not supported for this view. Try e.g. 's.transpose(signal_axes=1).plot()' for plotting as a 1D signal, or 's.transpose(signal_axes=(1,2)).plot()' for plotting as a 2D signal. when calling plot, which I then followed. The same occurs when calling data.plot(navigator='slider'). If you know a way around this or how to fix this then I would be happy to try.

I updated dask to 2021.9.0 and can confirm that the peak is now much smaller, but the data remains in memory after plotting.

@ericpre commented Sep 9, 2021

The reason for the transpose was that hyperspy threw the error: ValueError: Plotting is not supported for this view. Try e.g. 's.transpose(signal_axes=1).plot()' for plotting as a 1D signal, or 's.transpose(signal_axes=(1,2)).plot()' for plotting as a 2D signal. when calling plot, which I then followed. The same occurs when calling data.plot(navigator='slider'). If you know a way around this or how to fix this then I would be happy to try.

This is because the navigation dimensions are not set - I guess you have a BaseSignal with a signal dimension of more than 2. s.transpose(signal_axes=(1, 2)) should return a Signal2D without rearranging the data ordering for a non-lazy signal. For a lazy signal, it would rechunk the data, which is sometimes good and sometimes not...
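Concretely, with dummy data (not the tvips reader itself):

import numpy as np
import hyperspy.api as hs

# a 4D stack loaded without navigation metadata ends up as a BaseSignal with
# all axes in the signal space, which plot() cannot handle directly
s = hs.signals.BaseSignal(np.zeros((20, 30, 64, 64)))
# transpose picks two axes as the signal axes and moves the rest to the
# navigation space, returning a Signal2D that can be plotted
s2 = s.transpose(signal_axes=(1, 2))
s2.plot()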

I updated dask to 2021.9.0 and can confirm that the peak is now much smaller, but the data remains in memory after plotting.

There are several things going on here - several issues are already discussed in this thread and it has diverged significantly from the original one! It would be good to disentangle the different issues:

@ericpre commented Sep 9, 2021

  • lack of metadata. In many files there is no information about the scan coordinates. So data could only be loaded into hspy as a <frames|kx, ky> dataset. Reshaping this thing into the 4D dataset that it should be seems quite a hassle that most people don't want to go through every time.

With a smart rechunking I can index into this array to retrieve images, and then it's a breeze to do the data reshaping. Snake-scan and hysteresis are a bit of an issue since dask arrays don't support in-place reassignment, but I think I implemented an OK workaround. I think interfacing directly with the tvips files will be much cleaner this way.

There is a very similar issue with the mib file (and some other readers too); the current implementation works but is slow. It would be great to figure out a generic approach to reshape snake or flyback scanning which can then be used for the different readers!
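As a starting point, the snake-scan part of such a generic approach could be as simple as reversing every other scan row after the reshape (a sketch; a flyback correction would additionally drop the flyback frames):

import numpy as np

# undo a snake (boustrophedon) scan by reversing every second row of the
# reshaped (ny, nx, kx, ky) stack; with dask >= 2021.4.1 a similar assignment
# should work on dask arrays too
def unsnake(data_4d):
    out = data_4d.copy()
    out[1::2] = out[1::2, ::-1]
    return out

frames = np.arange(2 * 3 * 2 * 2).reshape(2, 3, 2, 2)  # tiny (ny=2, nx=3) stand-in
fixed = unsnake(frames)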

@harripj (Collaborator) commented Oct 15, 2021

Separate to the hyperspy plugin that @din14970 is working on, I have been working on an implementation that avoids the HDF5 intermediate and converts straight to .blo. It is based on the current code and uses generators to keep memory usage down. For this a TVIPS class has been created, which can be indexed directly, e.g. TVIPS(fname)[20:, 30:45]. I have made a simple GUI to do some simple processing (clipping and binning) before converting to .blo, and the VBF can be viewed directly from the tvips file as opposed to from the hdf.

Behind the scenes a .tvipsinfo file is created on first load in the same directory and with the same base name as the original tvips file. This is just a small hdf file which contains the VBF intensities once computed, and stores the scan information. The TVIPS class tries to find this file and information on instantiation so the scan parameters can be reloaded automatically. The interface can still be loaded with the tvipsconverter command.

You can try it out if interested here: https://github.com/harripj/TVIPSconverter.git on the no-hdf-intermediate branch. Bear in mind that some of the features such as snakescan and hysteresis have not been implemented. At the moment I don't think it's right to make a PR for this as it's a large change and not all of the features are implemented, but I will be happy for any comments and for people to use it if they find it works for them. For a quick conversion between tvips and .blo, this has worked well for me.
