Save conversion time and disk space by not copying images into HDF5 file #15
I wonder whether it would be possible to load the tvips file into hyperspy directly, without converting to hdf5?
That would be great, and I could probably write such a plug-in, but the main issues in my view are:
The real solution would be if TVIPS could store their data in a decent format including relevant metadata.
Yes, the user would need to set the scan dimensions, which is a shame. I don't know whether we could reshape the hyperspy signal to 2D if we loaded it as 1D. I had a brief go at creating a
With quite some delay, but I've started work on this here: hyperspy/hyperspy#2781. I'm indeed using
Just in case this is useful: dask 2021.4.1 added support for item assignment in dask/dask#7393, and this is documented at https://docs.dask.org/en/latest/array-assignment.html.
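For anyone not familiar with what that looks like in practice, here is a minimal sketch of dask item assignment (the shapes and chunking are purely illustrative):

```python
import numpy as np
import dask.array as da

# a lazy array; nothing is allocated until compute/store is called
x = da.zeros((4, 1000, 1000), chunks=(1, 1000, 1000))

# item assignment (dask >= 2021.4.1): assign into slices lazily
x[0] = 1.0
x[1, :10, :10] = np.arange(10)

result = x.sum().compute()
```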
Just a quick question on Dask, as I'm not that familiar with it. Once a

As an alternative, although I expect it is not compatible with Hyperspy, I have found that a generator processing each frame works well on SSD, with the benefit of not using much more than an individual frame of system memory. The drawback is that the entire dataset is not accessible at the same time for indexing (as it would be for an array).
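For context, a minimal sketch of the generator idea described above, assuming a raw file of fixed-size frames (the frame shape, dtype and the lack of per-frame headers are placeholder assumptions, not TVIPSconverter's actual format handling):

```python
import numpy as np

def iter_frames(path, frame_shape=(512, 512), dtype=np.uint16):
    """Yield one frame at a time so only a single frame sits in memory."""
    n_pixels = int(np.prod(frame_shape))
    with open(path, "rb") as f:
        while True:
            frame = np.fromfile(f, dtype=dtype, count=n_pixels)
            if frame.size < n_pixels:  # end of file
                break
            yield frame.reshape(frame_shape)

# example: per-frame processing without holding the full dataset in memory
# sums = [frame.sum() for frame in iter_frames("dataset.tvips")]
```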
Phoo, I actually have no idea how it works under the hood. My expectation for memmap is that it should give me a handle on a file as if it were an array, without loading the entire thing into memory, as per the documentation.
In fact, looking at it a bit closer now, I think that what I was experiencing was a reference counting issue and perhaps Dask handles this well. An example: if you load a big(-gish) array from disk it will stay in memory once computed, until it has no more references, at which point it is garbage collected (memory usage observed is from a fresh kernel after having saved the array to disk in 1):

1:

```python
import numpy as np

# save array of numbers
n = int(3e8)
nums = np.arange(n).reshape(15000, 20000)
# will save array to disk, NB ~1.2 GB
np.save('nums_memmap_test.npy', nums)
```

2 (memory usage +0 GB):

```python
# load array as memmap
a = np.memmap('nums_memmap_test.npy')
```

3 (memory usage +1.2 GB):

```python
a.max()
# array is now stored in memory and persists until all references removed
```

4 (memory usage -1.2 GB):

```python
del a
```
I can't seem to reproduce your example :/ I tried:

```python
import numpy as np

# save array of numbers
n = int(3e8)
nums = np.arange(n).reshape(15000, 20000)
# creating the file
fp = np.memmap("nums_memmap_test.dat", dtype='int64', mode='w+', shape=(15000, 20000))
# dumping the data to the buffer
fp[:] = nums[:]
# saving the buffer to disk
fp.flush()
# after the following line we are back to original memory
del nums
```

As far as I understand, one shouldn't use the .npy format for this since it's not a raw format. Then memmapping the file:

```python
# no additional memory used
a = np.memmap('nums_memmap_test.dat', dtype="int64", mode="r", shape=(15000, 20000))
# no additional memory used
memmap_max = a.max()
```

I basically just looked at what happens to system memory but didn't do extensive profiling. The last line SHOULD load the entire array into memory for a short bit, so there should be a peak in memory consumption, but the data should afterwards be immediately garbage collected. Are you on Windows? Maybe it's an OS thing; I seem to vaguely recall that gc on Windows works a bit differently than on Linux/OSX. If so, could you try after the

Anyway, to avoid any memory peak, I thought of using dask in this way:

```python
import dask.array as da

a_lazy = da.array(a)
lazy_max = a_lazy.max()
lazy_max.compute()
```

It would be interesting to see if this decreases the memory peak.
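One variation that might also be worth trying (my assumption, not something from the branch): wrap the memmap with da.from_array and explicit chunks, so each task only ever touches one slab of the file instead of the whole array:

```python
import numpy as np
import dask.array as da

a = np.memmap('nums_memmap_test.dat', dtype="int64", mode="r", shape=(15000, 20000))

# chunk along the first axis so each task reads ~1000 rows at a time
a_lazy = da.from_array(a, chunks=(1000, 20000))
lazy_max = a_lazy.max()
lazy_max.compute()
```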
Yes, I tried this on Windows - it could well be OS-dependent. For me both the
I didn't try on OSX, my test was on Linux, but I suppose they should behave similarly. I found this thread, which does mention OS-specific differences: https://stackoverflow.com/questions/45132940/numpy-memmap-memory-usage-want-to-iterate-once. In principle dask should automatically do the loop they show in that thread; if not, I would guess it's a dask bug on Windows?
It's good that you discovered this issue, because my TVIPS reader implementation in Hyperspy was going to rely entirely on this memmap + dask trick, and I use it in reverse in the BLO writer: https://github.com/hyperspy/hyperspy/blob/8c74de87855a7e6cf53acfa85ca6b1e77367d619/hyperspy/io_plugins/blockfile.py#L423
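For reference, the "in reverse" idea boils down to something like the following sketch (simplified, not the actual blockfile writer; shapes and dtype are made up): memmap the output file and let dask stream chunks into it with da.store, so the full array never sits in memory at once.

```python
import numpy as np
import dask.array as da

# stand-in for a lazy signal that would not fit in memory
data = da.zeros((100, 100, 128, 128), dtype="uint16", chunks=(10, 100, 128, 128))

# memmap an output file of the right size and stream the chunks into it
out = np.memmap("output.raw", dtype=data.dtype, mode="w+", shape=data.shape)
da.store(data, out)
out.flush()
```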
I tried to implement something like this (memmap) for the TVIPS files in the past too, but it blew up in my face memory-wise as we have discussed here, so I gave it up. I assumed it was that I was implementing memmap incorrectly, as I don't have huge experience with it, but it could well be that I was just using Windows... ¯\_(ツ)_/¯ I didn't try on my Mac, so I settled on a generator version, which in fairness runs quickly and keeps memory to an absolute minimum, but I agree a dask implementation would be best. I could try to run your dask TVIPS implementation on my Windows PC to see if it blows up too, if that would help?
Yes, actually, if you could test out the TVIPS branch that would be great; you might want to merge it locally with their main branch.
I seem to recall it worked OK on macOS when I made it, but I don't have any Windows systems to test it on. Will try again on Linux. If it behaves differently on Windows then it must be a dask issue.
So I installed the PR as suggested above. If I load a tvips file that I have (the file size is ~7.3 GB) and plot it:

```python
data = hsio.load(r'dataset.tvips', lazy=True, scan_shape=(100, 100), offset=1)
data.transpose(signal_axes=(0, 1)).plot()
```

my system memory peaks and drops, but drops to a level 5.1 GB higher than before plotting. Subsequent plotting calls peak the memory and return to this elevated level. In fact if I change the scan shape to

If I try to load close to the whole file
I compared some of the code mentioned above on Windows and Linux and I can reproduce the issue on Windows - Linux is fine. I briefly tried to use a different approach to read the array and pass it to dask, but I was always getting the memory issue (the whole array loaded into memory).
@harripj, are you aware of a dask regression in memory usage with the threaded scheduler? The versions affected are from 2021.4.0 to 2021.8.1, if I remember correctly.
@ericpre thanks for following up with this! No, I wasn't aware, but I can confirm that I was testing this using dask 2021.7.0 and Python 3.8.10. The reason for the transpose was that hyperspy threw the error:

I updated dask to 2021.9.0 and can confirm that the peak is now much smaller, but the data remains in memory after plotting.
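If it helps to narrow this down further, one way to rule the threaded-scheduler regression in or out (just a debugging sketch, not part of the PR; the array here is a stand-in for the real data) is to check the installed version and re-run the same computation with the single-threaded scheduler for comparison:

```python
import dask
import dask.array as da

print(dask.__version__)  # the regression reportedly affects roughly 2021.4.0 to 2021.8.1

x = da.random.random((15000, 20000), chunks=(1000, 20000))

# compare peak memory against the default threaded scheduler
with dask.config.set(scheduler="single-threaded"):
    print(x.max().compute())
```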
This is because the navigation dimensions are not set - I guess you have a
There are several things going on here, and there are already several issues discussed in this thread; it has diverged significantly from the original one! It would be good to disentangle the different issues:
There is a very similar issue with the
Separate from the hyperspy plugin that @din14970 is working on, I have been working on an implementation that avoids the HDF5 intermediate and converts straight to .blo. It is based on the current code and uses generators to keep memory usage down. For this a TVIPS class has been created, and this can be indexed directly, e.g.

Behind the scenes a

You can try it out if interested here: https://github.com/harripj/TVIPSconverter.git on the
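To give an idea of what indexing frames directly could look like (a rough sketch with made-up frame shape, dtype and header size, not the actual TVIPSconverter class):

```python
import numpy as np

class FrameFile:
    """Index individual frames of a raw file without loading the whole dataset."""

    def __init__(self, path, frame_shape=(512, 512), dtype=np.uint16, header_bytes=0):
        self.path = path
        self.frame_shape = frame_shape
        self.dtype = np.dtype(dtype)
        self.header_bytes = header_bytes
        self.frame_bytes = int(np.prod(frame_shape)) * self.dtype.itemsize

    def __getitem__(self, index):
        # seek to the requested frame and read only that frame from disk
        offset = self.header_bytes + index * self.frame_bytes
        with open(self.path, "rb") as f:
            f.seek(offset)
            frame = np.fromfile(f, dtype=self.dtype,
                                count=int(np.prod(self.frame_shape)))
        return frame.reshape(self.frame_shape)

# frames = FrameFile("dataset.tvips")
# frames[10]  # reads only the 11th frame from disk
```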
The workflow
read TVIPS -> process data -> save to hdf5 -> reconstruct scan -> save to blo or hspy
is somewhat problematic because the intermediate hdf5 file can be huge. Additionally, once the data processing is done, that is what you are stuck with for conversion to blo or hspy; if you want additional processing you have to create a new hdf5 file. Parameter studies on the image preprocessing are prohibitive. Ideally the data copying to hdf5 in the intermediate step is skipped, so that the workflow becomes

read TVIPS -> process data -> save to blo or hspy

The problem is getting a handle on which image frames to use and how to reconstruct the scan. To do this, the first step of the process should be just a "prepare dataset" step that:

These items can then be stored in an hdf5 file that the user should not care about - they could even just be stored in memory. Necessary for these steps is a preview of the diffraction pattern, the VBF aperture, and maybe the direct beam center. No other pre-processing should be done on the data at this point. I would guess this also cuts down on the time necessary to loop over the data.
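As an illustration, the "prepare dataset" pass could be a single loop over the frames that collects only lightweight per-frame information (a sketch only; the VBF aperture position and size are placeholders, and `frames` could be any iterable of frame arrays):

```python
import numpy as np

def prepare_dataset(frames, vbf_center=(256, 256), vbf_radius=10):
    """One pass over the frames, keeping only lightweight per-frame information."""
    cy, cx = vbf_center
    vbf_intensities = []
    for frame in frames:
        # integrate a small window around the direct beam for the VBF preview
        window = frame[cy - vbf_radius:cy + vbf_radius,
                       cx - vbf_radius:cx + vbf_radius]
        vbf_intensities.append(window.sum())
    # this small array (plus scan metadata) is all that needs to be stored,
    # e.g. in a tiny hdf5 file, or simply kept in memory
    return np.asarray(vbf_intensities)
```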
In the second step, this information could be used to reconstruct the scan and VBF. The user should then have the option to modify the data with filters/binning/cropping etc. The data should be extracted directly from the TVIPS files, processed, and then written to the correct file. The beam center can be used to shift the diffraction patterns so that they are centered.
There should still be an option to export a subset of the data to tiff and maybe to HDF5, but it shouldn't be a requirement.