
Adding in Hyperspy/Pyxem Support #27

Open
CSSFrancis opened this issue Jun 15, 2021 · 18 comments

Comments

@CSSFrancis

Some of the developers over at pyxem were talking about adding support for dynamical simulations/4D-STEM (pyxem/diffsims#170). abTEM came up as a good alternative that we could reference and work alongside rather than trying to support dynamical simulations ourselves.

I was wondering if you would be interested in adding support for hyperspy/pyxem to the package?

There are a couple of ways that this could be accomplished (non-exclusively):

  • Adding a to_hyperspy method to abTEM's measurement objects for in-memory conversion.
  • Adding a reader to hyperspy/pyxem that parses abTEM's output files.

I think this would be an easy way to increase the visualization/analysis tools available for simulations without too much added overhead. I would be willing to create a small pull request to make these changes if you are interested in adding this support.

@din14970 Based on the conversation we had earlier, you might have some additional thoughts.

@jacobjma
Member

Yes, we would be happy to add support for hyperspy.

I think adding a to_hyperspy method is the better option; however, hyperspy should only be imported inside that method so that it remains an optional dependency. Adding additional options for export would also be welcome.

Please do make a pull request.
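
A minimal sketch of that lazy-import pattern, assuming an illustrative Measurement class rather than abTEM's actual API:

class Measurement:
    """Illustrative stand-in for an abTEM measurement object."""

    def __init__(self, array):
        self.array = array

    def to_hyperspy(self):
        # Import inside the method so hyperspy stays an optional dependency.
        try:
            import hyperspy.api as hs
        except ImportError:
            raise ImportError("to_hyperspy requires hyperspy to be installed")
        return hs.signals.Signal2D(self.array)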

@jacobjma
Member

One gotcha is that calibrations in abTEM are allowed to be None (I should maybe change that). This might help:

calibrations = measurement.calibrations
array = measurement.array

hyperspy_calibrations = []
for calibration, n in zip(calibrations, array.shape):
    if calibration is None:
        # No calibration for this axis; fall back to the size only.
        hyperspy_calibrations.append({'size': n})
    else:
        hyperspy_calibrations.append({'size': n,
                                      'scale': calibration.sampling,
                                      'units': calibration.units,
                                      'name': calibration.name,
                                      'offset': calibration.offset})
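
Assuming hyperspy is imported as hs, the resulting axis dictionaries can then be passed straight to the signal constructor, e.g. hs.signals.Signal2D(array, axes=hyperspy_calibrations).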

@din14970

Anything that can be considered a "measurement" would translate nicely to Hyperspy: projected potentials, simulated STEM images, simulated 4D-STEM data, simulated DPs. In this case you could also consider shaving off the parts of the code that are not the core business of image simulation and leaving them to the analysis packages: making line profiles, integrating within a disk, ... just an idea.

I would also be in favor of having an in-memory to_hspy, but maybe it would also be good to be able to get the hyperspy format out directly for simulated 4D-STEM datasets. Another alternative is adding a reader to hyperspy to parse the abTEM output file. If there is a specific reason to save the data out in the way it is currently done, then this may be the best approach.

With regards to None for calibrations, I would say one can always default to pixels as the unit?

@jacobjma
Member

I will think about that. Although some of the functions do have minor features relevant to simulations, e.g. the Fourier-space disc integration exactly matches the output of the AnnularDetector, which might not be the case in the analysis package.

abTEM is currently undergoing a fairly big transformation to make all the arrays dask arrays. It might be possible for hyperspy to grab and extend the task graph from abTEM to write results in supported (parallel) formats during a simulation. It might require some work on the hyperspy side and I am not sure how hyperspy handles parallel write.
Delayed simulated data might also be useful for large 4D-STEM datasets, where writing to disk, in certain cases, is the most expensive step of a simulation.

The calibration is typically None when the measurement consists of a batch of different measurements, e.g. diffraction patterns for three random positions. In this case, I am not sure even pixels make sense as a unit. I should still change it though.

@din14970

It might be possible for hyperspy to grab and extend the task graph from abTEM to write results in supported (parallel) formats during a simulation. It might require some work on the hyperspy side and I am not sure how hyperspy handles parallel write.

This is actually great. If your output can be cast to a dask array, Hyperspy will happily work with it as the data of a "Lazy" signal. One just needs a light metadata + axis-information wrapper around it and that's it. All hyperspy signals have a save method, although I've never looked under the hood at how this works, so I'm not sure whether it is the most efficient/parallelized. I do know that Hyperspy saves the data in hdf5 with lossless compression, and this step can take quite a bit of time. However, in a rather sparse 4D-STEM dataset it can save a lot of disk space.
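
For illustration, wrapping a dask array as a lazy HyperSpy signal only takes a couple of lines; the array below is a stand-in for a delayed simulation result:

import dask.array as da
import hyperspy.api as hs

# Stand-in for a delayed simulation: 16x16 probe positions,
# each producing a 128x128 diffraction pattern.
data = da.zeros((16, 16, 128, 128), chunks=(4, 4, 128, 128))

# Wrap the dask array in a signal and mark it lazy; nothing is computed yet.
signal = hs.signals.Signal2D(data).as_lazy()
signal.save('simulation.hspy')  # evaluation is triggered by the save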

The calibration is typically None when the measurement consists of a batch of different measurements, e.g. diffraction patterns for three random positions.

Aha, so basically one could think of it like a non-linear axis, with just e.g. (x, y) scan coordinates given. Maybe the following PR could be relevant: hyperspy/hyperspy#2399

@CSSFrancis
Author

It might be possible for hyperspy to grab and extend the task graph from abTEM to write results in supported (parallel) formats during a simulation. It might require some work on the hyperspy side and I am not sure how hyperspy handles parallel write.

This is actually great. If your output can be cast to a dask array, Hyperspy will happily work with it as the data of a "Lazy" signal. One just needs a light metadata + axis-information wrapper around it and that's it. All hyperspy signals have a save method, although I've never looked under the hood at how this works, so I'm not sure whether it is the most efficient/parallelized. I do know that Hyperspy saves the data in hdf5 with lossless compression, and this step can take quite a bit of time. However, in a rather sparse 4D-STEM dataset it can save a lot of disk space.

Any of this would slot into hyperspy quite easily! For writing data to disk, hyperspy currently uses the store function and h5py. Dask has a dedicated to_hdf5 function as well, which might have marginally better performance. Depending on the scope of the task graph, however, your real bottleneck might be the write speed of your hard drive, which will only really write using one thread unless you have some special hardware.
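
For reference, a minimal sketch of writing a dask array with dask's to_hdf5; the array and paths are placeholders:

import dask.array as da

# Placeholder for a simulated 4D-STEM dataset.
result = da.zeros((64, 64, 128, 128), chunks=(8, 8, 128, 128))

# dask schedules the chunk computations, but writes to a single HDF5 file
# are still effectively serialized by the storage layer.
da.to_hdf5('output.h5', '/simulation', result)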

From my experience, I tend not to use the compression feature when saving my data (it adds a fair bit of overhead in saving and loading for large datasets, >10 GB). There is probably a way to speed that up, but that is near the bottom of my list of things to do. The one thing I've been meaning to fix in hyperspy is how it saves and loads chunks of data to optimize performance, which is probably the lowest-hanging fruit for speeding up read/write.

My experience with lazy operations is that they are easy to implement but sometimes difficult to implement well. I do think that lazy 4D-STEM simulations would be incredibly useful, so if you need any help I would be willing to offer some.

I will think about that. Although, some of the functions do have some minor features relevant to simulations, e.g. the Fourier space disc integration exactly matches the output of the AnnularDetector, which might not be the case for the analysis package.

In reference to allowing hyperspy/pyxem to handle some of the analysis, there are benefits to both sides. I wouldn't worry too much about specific implementations, as pyxem can always be fairly flexible to accommodate. I'll start with adding the ability to export to hyperspy.

@jacobjma
Member

I merged the pull request. I will keep the issue around, as we should revisit it in the new version with dask. Given a lazy abTEM measurement object I hope it will be possible to do:

measurement.to_hyperspy().save('my_data.hspy')

This should then trigger the simulation to start and save the result directly in the hyperspy format.

Aha, so basically one could think of it like a non-linear axis, with just e.g. (x, y) scan coordinates given. Maybe the following PR could be relevant: hyperspy/hyperspy#2399

Thanks, that is very useful.

My experience with lazy operations is that they are easy to implement but sometimes difficult to implement well. I do think that lazy 4D-STEM simulations would be incredibly useful, so if you need any help I would be willing to offer some.

Indeed, my initial naive implementation of dask was 10 times slower than the thread parallelized version (on my 8-core test system). I have been working hard on it, so now the dask version, using processes or threads, should be faster, even on small systems.

I will certainly take that help. There is already an abtem-dask branch, which is functional, except for PRISM. You can check it out, but I think it makes sense to wait until a few more major changes are in. I should have time to finish those this week, and will alert you then.

After some testing, I am thinking of switching to zarr as the main format. It seems to integrate better with dask. Do you have any experience?

I agree that lazy 4D-STEM simulations are useful; in particular, when the diffraction patterns are calculated from a precalculated PRISM scattering matrix, the reduction can be done much faster than is required for real-time visualization. In this context, the scattering matrix can be thought of as a lossless compression of the data.

@CSSFrancis
Author

I merged the pull request. I will keep the issue around, as we should revisit it in the new version with dask. Given a lazy abTEM measurement object I hope it will be possible to do:

measurement.to_hyperspy().save('my_data.hspy')

This should then trigger the simulation to start and save the result directly in the hyperspy format.

Yep! I don't see any reason that wouldn't work!

My experience with lazy operations is that they are easy to implement but sometimes difficult to implement well. I do think that lazy 4D-STEM simulations would be incredibly useful, so if you need any help I would be willing to offer some.

Indeed, my initial naive implementation of dask was 10 times slower than the thread parallelized version (on my 8-core test system). I have been working hard on it, so now the dask version, using processes or threads, should be faster, even on small systems.

I will certainly take that help. There is already an abtem-dask branch, which is functional, except for PRISM. You can check it out, but I think it makes sense to wait until a few more major changes are in. I should have time to finish those this week, and will alert you then.

I'll take a look at it if you ping me! Thanks!

After some testing, I am thinking of switching to zarr as the main format. It seems to integrate better with dask. Do you have any experience?

I haven't ever really used it. I'm more familiar with xarray, but it seems like it might work well for compressed array creation.

I agree that lazy 4D-STEM simulations are useful; in particular, when the diffraction patterns are calculated from a precalculated PRISM scattering matrix, the reduction can be done much faster than is required for real-time visualization. In this context, the scattering matrix can be thought of as a lossless compression of the data.

That is a cool idea! Curious to see how it works out.

@din14970

This is pretty exciting, finally a great simulation code that will integrate well and is built on the same software stack as the most popular analysis code! I've been using abTEM for high-throughput calculations on a small GPU cluster - I simulated 700 HAADF STEM images using full multi-slice of different 5x5x20 nm crystals and it only took a week and a half, so thanks a lot for this effort!

Indeed, my initial naive implementation of dask was 10 times slower than the thread parallelized version (on my 8-core test system). I have been working hard on it, so now the dask version, using processes or threads, should be faster, even on small systems.

I'm not going to claim I'm a super expert at dask but I'd also be happy to have a look at it. Have you looked into dask-cuda? It might allow you to scale to multiple GPUs on a single node, which in my opinion is currently the only selling point of PRISM.

After some testing, I am thinking of switching to zarr as the main format. It seems to integrate better with dask. Do you have any experience?

Is that the case? I think in the end the final file format doesn't matter so much, but it would be convenient if metadata were saved alongside it, including the simulation "microscope" conditions. I've really only used hdf5, but I can also see why some people don't like it.

@jacobjma
Member

Still working on the dask implementation. In the meantime, you may have some input on a new issue (#30) I created, as it was somewhat prompted by the present issue.

I haven't ever really used it. I'm more familiar with xarray, but it seems like it might work well for compressed array creation.

I have used xarray; I believe zarr is quite different. xarray is more similar to hyperspy, and zarr is quite similar to hdf5.

I simulated 700 HAADF STEM images using full multi-slice of different 5x5x20 nm crystals and it only took a week and a half, so thanks a lot for this effort!

Really cool, very happy to hear it being used.

I'm not going to claim I'm a super expert at dask but I'd also be happy to have a look at it.

Yes, please 👍. I will ping you.

Have you looked into dask-cuda?

I have, although I have only used it with a single GPU for now. I think parts of the code in the dask branch should scale to multiple GPUs already.

Is that the case? I think in the end the final file format doesn't matter so much, but it would be convenient if metadata were saved alongside it, including the simulation "microscope" conditions. I've really only used hdf5, but I can also see why some people don't like it.

Yeah, I will try to make metadata a feature. It is slightly more difficult due to the modular nature of abTEM; there is not necessarily one simulation, so we have to be smart about propagating metadata through the objects.

@CSSFrancis
Author

@jacobjma

I have used xarray; I believe zarr is quite different. xarray is more similar to hyperspy, and zarr is quite similar to hdf5.

You prompted me to look into the issue a little bit more. I've recently been running into some problems saving dask datasets into hdf5. Hyperspy is rather slow when loading large or compressed datasets, and it is largely a problem with the hdf5 format being mostly restricted to single-threaded reads/writes. HDF5 doesn't seem to play nicely with dask.distributed, which I think is a large issue as people move towards larger and larger datasets and parallel operations become more important.

There is apparently support for using MPI with parallel HDF5, but I think that zarr has a much better framework for those kinds of operations. I'm probably not telling you anything new... just catching up.

I will open an issue in hyperspy about potentially loading/saving using zarr rather than hdf5. Overall it should be fairly easy to add support for zarr loading and saving, as the file format is very similar to hdf5. From there, I think for larger datasets the zarr format should be preferable to hdf5. I don't know if a complete switch from hdf5 to zarr will ever occur, but something different from the current implementation is clearly needed.
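
As a point of comparison, writing a dask array to zarr is a one-liner; the array and path below are placeholders:

import dask.array as da

# Placeholder for a simulated 4D-STEM dataset.
result = da.zeros((64, 64, 128, 128), chunks=(8, 8, 128, 128))

# Each chunk becomes a separate object in the store, so workers
# can compute and write chunks in parallel instead of contending
# for a single file.
result.to_zarr('simulation.zarr')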

@CSSFrancis
Author

Just so that things stay up to date: I am in the process of adding the ability to save hyperspy signals in the zarr format.

hyperspy/hyperspy#2798

Right now this seems to work excellently with dask-distributed, and it offers another route for saving from abtem: you can simulate a lazy dask array and then save it through hyperspy. I need to load up the lazy branch to test how this might work, but I don't really see any problems in the future.
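
For context: once that pull request landed (as the .zspy format, in HyperSpy 1.7 and later), zarr-backed saving is selected by the file extension alone, e.g. signal.save('simulation.zspy').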

@TomaSusi
Member

Shall we close this, @jacobjma ?

@CSSFrancis
Author

You can probably close this issue for now, but I've been meaning to create some examples converting abtem --> hyperspy.

In particular, there are a couple of cool features, like lazy plotting/caching in hyperspy, which mean that you should be able to create a lazy simulation and then plot it lazily using hyperspy. That process would probably go something like this:

  • Create a 2d navigator image by simulating a BrightField image.
  • Create a lazy 4-D STEM image
  • Create a lazy hyperspy signal and plot it

Then you can visualize your simulation and check certain parts of the image, all without significant computation occurring.

These kinds of end-to-end lazy operations are super cool, but it would be good to have good examples of how to do some of these things.

@jacobjma
Member

This should work in the main branch; the code for converting abTEM to HyperSpy now exists here. I just had to fix a bug; the testing of the to_hyperspy method is still very rudimentary.

You can create an abTEM focal series ensemble like below; the ensemble axes will become navigation axes when it is converted to Hyperspy.

probe = abtem.Probe(
    sampling=0.05,
    extent=20,
    energy=80e3,
    semiangle_cutoff=20,
    defocus=[0, 100],
)

hs_signal = probe.build().to_hyperspy()

will produce this lazy signal:

[screenshot: lazy HyperSpy signal]

You can also do a GridScan like this to produce a focal series of 4D-STEM data:

probe = abtem.Probe(energy=80e3, semiangle_cutoff=20, defocus=[0, 100], sampling=0.05)

atoms = ase.build.bulk("Au", cubic=True) * (2, 2, 1)

detector = abtem.PixelatedDetector()

measurement = probe.scan(atoms, detectors=detector)

measurement.to_hyperspy()

which will give you this signal:

[screenshot: lazy 4D-STEM signal]

Any ArrayObject in abTEM can be converted to a Hyperspy signal. Currently, objects with 3D base axes in abTEM, such as PotentialArray and SMatrixArray, are converted to Signal2D with the first axis made into a navigation axis. There are still some issues with nicely converting a NonLinearAxis to a HyperSpy axis.

@CSSFrancis
Author

@jacobjma yeah, I would imagine that converting a NonLinearAxis would cause some problems. There is support in hyperspy, so it should be fairly easy to add.
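
If useful, a sketch of what that mapping could look like on the hyperspy side using non-uniform axes (available in HyperSpy 1.7+); the coordinate values below are made up:

import numpy as np
import hyperspy.api as hs

data = np.zeros((3, 128, 128))

# The first (navigation) axis is non-uniform: explicit coordinate values
# instead of an offset/scale pair, e.g. three random probe positions.
axes = [{'axis': np.array([0.0, 1.3, 2.9]), 'name': 'position'},
        {'size': 128, 'scale': 0.05, 'units': 'Å', 'name': 'x'},
        {'size': 128, 'scale': 0.05, 'units': 'Å', 'name': 'y'}]

signal = hs.signals.Signal2D(data, axes=axes)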


@jacobjma This is fantastic! I have to admit, when I originally added the to_hyperspy function I was assuming that I would be doing more simulations. That being said, the lazy plotting in hyperspy is really cool.

One thing I was wondering: is there a way to make a fast HRTEM simulation using a plane wave that matches the real-space sampling of a 4D-STEM simulation? I've been trying with the new version of abtem but can't seem to get it to work.

Just to give you an idea of what I am doing: in hyperspy you can set the navigator for a signal, which means that when you plot the signal it will only compute the current chunk.

For example:

import ase.build
import abtem

probe = abtem.Probe(energy=80e3, semiangle_cutoff=20, defocus=100, sampling=0.01)  # just one defocus

atoms = ase.build.bulk("Au", cubic=True) * (2, 2, 1)
atoms.center(vacuum=2)
detector = abtem.PixelatedDetector()
# Setting max_batch smaller is better for responsive plotting
# (not so much for task generation).
measurement = probe.scan(atoms, detectors=detector, max_batch=8)
hs_signal = measurement.to_hyperspy()

# hs_signal.compute_navigator()  # computes the entire signal (not very helpful)

# bf_image: a precomputed 2D signal, e.g. a simulated bright-field image;
# this can be anything as long as it is fast to compute.
hs_signal.navigator = bf_image

hs_signal.axes_manager.indices = (10, 10)  # jump to the middle of the plot
hs_signal.plot(vmax="99th")  # this is fast(er) now, as the navigator is already set

Not sure how useful this actually is; that probably depends on how fast your computer is. You still have to compute the current chunk before displaying it, which can be slightly laggy, and any time you move from one chunk to the next things are slow, as it has to compute the next chunk. I assume if you use a GPU you are probably going to be a little bit happier about this, and maybe if you use a PRISM algorithm it might be even better.

Anyway, I should take another look at the to_hyperspy algorithm and see if I can't make it work a little bit better.

@jacobjma
Member

Are you coming to M&M? We could discuss this in person.

@CSSFrancis
Author

Are you coming to M&M? We could discuss this in person.

I just realized that I never responded to this :) Yes, I'll try to find you after your talk today!
