# 4-D STEM Analysis Using Pyxem 


## Data Inspection - Preprocessing - Unsupervised ML - Lazy Processing - Orientation Analysis


### Carter Francis | University of Wisconsin Madison | May 24th 2023

Pyxem Introduction:
-------------------

`Pyxem` was started in 2016 by Duncan Johnstone. Since then it has been continually developed alongside hyperspy. Below is a very simplified dependency tree for `pyxem`. We inherit quite a bit of functionality from upstream packages and subscribe strongly to the ideal that if we can upstream code to make it available to a wider audience, we should!

We are always looking for more people to join our team [here](https://github.com/pyxem/pyxem)!  

<img style="left" src="DependancyTree.svg">

### Our Focus/ Goals:

1. Provide scalable analysis for pixelated (mostly 4-D STEM) diffraction
    - Pyxem (and hyperspy) scales from single core --> multi-core single machine --> multi-core distributed computing!
    - Fast I/O allows streaming and processing of TB-sized datasets in under a ___minute___!
    - Hyperspy is ___Fast___! I mean like really fast. If you don't believe me, try running with the dask distributed backend.
2. Provide End to End workflows without limiting functionality.
    - A focus on documentation and example notebooks keeps the internal `pyxem` code simple and easy to maintain and grow.
3. Testing Testing Testing!
    - Pyxem is focused on test-driven development, which limits the number of bugs and helps us understand why bugs arise when they do.
    - While not perfect, this helps us know that updates won't cause functionality to fail.
4. Learning and Teaching!
    - Drop by and say hi on github.  Make an [issue](https://github.com/pyxem/pyxem/issues) for a feature you would like, add some code you find helpful. 
    - Even if you are just trying something out or need help we are always happy to help!

Data Introduction
-----------------
This dataset is a set of MgO nanocrystals on a lacy carbon sample. It is a very small 4-D STEM dataset that I could load into RAM. But I personally still like to run everything lazily (and using the distributed backend) for a couple of reasons:

1. I'm Lazy (why shouldn't my data be):
    - I like things to load immediately and don't like waiting around
    - Lazy makes things like running in parallel EASY
2. Lazy means better parallelization (and it's Fast!):
    - Lazy data is already set up to run in parallel so you get better control
    - I love the dask-dashboard (and you should too)
3. One workflow, Any Size of Data
    - You can very easily take the same code and move it to a cluster or an HPC system :)
    - Lazy workflows mean faster iteration, faster discovery which means more experiments. 
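
To make the idea of lazy, chunked processing concrete, here is a minimal sketch in plain NumPy and the standard library. It is not how hyperspy or dask are implemented; the chunk size, the helper names (`iter_chunks`, `chunk_sum`) and the random stand-in data are all assumptions for illustration. Each chunk of scan positions is reduced independently, so the work can be handed to a pool of workers and the pieces stitched together afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Stand-in for a 4-D STEM dataset: (scan_y, scan_x, det_y, det_x).
rng = np.random.default_rng(0)
data = rng.random((10, 12, 8, 8), dtype=np.float32)

CHUNK = 5  # hypothetical chunk edge length along each scan axis


def iter_chunks(shape, chunk):
    """Yield (row_slice, col_slice) pairs that tile the two scan axes."""
    for r in range(0, shape[0], chunk):
        for c in range(0, shape[1], chunk):
            yield slice(r, r + chunk), slice(c, c + chunk)


def chunk_sum(slices):
    """Reduce one chunk: sum every diffraction pattern to a single number."""
    rows, cols = slices
    return slices, data[rows, cols].sum(axis=(-2, -1))


navigator = np.empty(data.shape[:2], dtype=np.float32)
with ThreadPoolExecutor(max_workers=4) as pool:
    for (rows, cols), block in pool.map(chunk_sum, iter_chunks(data.shape, CHUNK)):
        navigator[rows, cols] = block

# The chunked result matches the all-at-once ("eager") reduction.
eager = data.sum(axis=(-2, -1))
```

Because no chunk depends on any other, the same loop works whether the chunks come from RAM, from disk, or from workers on another machine.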


## Contents

1. <a href='#loa'> Loading & Inspection</a>
2. <a href='#cal'> Alignment & Calibration</a>
3. <a href='#vdf'> Virtual Diffraction Imaging</a>
4. <a href='#ml'> Machine Learning SPED Data</a>
5. <a href='#vec'> Peak Finding and Segmentation</a>

## 0. Import pyxem and other required libraries

In [1]:
# I like to print out the current versions of hyperspy and pyxem that I am using, just in case I come back
# to a notebook many years later and things have changed slightly
import hyperspy
print(hyperspy.__version__)
import pyxem
print(pyxem.__version__)

1.7.5
0.15.1


In [2]:
# Changing the matplotlib backend will give you interactive plots
%matplotlib qt5
#%matplotlib inline
#%matplotlib widget # for plotting when running remotely on a cluster etc.
import hyperspy.api as hs
import pyxem as pxm
import numpy as np

In [3]:
# Starting up a distributed Cluster locally 
# You don't have to do this but it helps to visualize what is happening
from dask.distributed import Client
client = Client()  # set up local cluster on your laptop
client


Client

| Connection method | Cluster type | Dashboard |
|---|---|---|
| Cluster object | distributed.LocalCluster | http://127.0.0.1:8787/status |

Cluster

| Dashboard | Workers | Total threads | Total memory | Status | Using processes |
|---|---|---|---|---|---|
| http://127.0.0.1:8787/status | 4 | 12 | 15.43 GiB | running | True |

Scheduler

| Comm | Workers | Dashboard | Total threads | Started | Total memory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:39491 | 4 | http://127.0.0.1:8787/status | 12 | Just now | 15.43 GiB |

Workers

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:44797 | http://127.0.0.1:42085/status | tcp://127.0.0.1:34341 | 3 | 3.86 GiB | /tmp/dask-scratch-space/worker-d8gqd_g7 |
| tcp://127.0.0.1:35847 | http://127.0.0.1:37693/status | tcp://127.0.0.1:42453 | 3 | 3.86 GiB | /tmp/dask-scratch-space/worker-s5fes254 |
| tcp://127.0.0.1:32793 | http://127.0.0.1:44171/status | tcp://127.0.0.1:43749 | 3 | 3.86 GiB | /tmp/dask-scratch-space/worker-etfp8hfc |
| tcp://127.0.0.1:39097 | http://127.0.0.1:43665/status | tcp://127.0.0.1:39001 | 3 | 3.86 GiB | /tmp/dask-scratch-space/worker-ox67aq5z |


<a id='loa'></a>

##  1. Loading and Inspection

Load the SPED data acquired from the nano-crystals using hyperspy.

`Note: Because pyxem extends hyperspy this happens automatically!`

In [4]:
import hyperspy.api as hs
dp = hs.load("data/mgo_nanoparticles.zspy", lazy=True)

In [5]:
# let's just look at the data
dp
# display(dp) also works

Title:
SignalType: electron_diffraction

| | Array | Chunk |
|---|---|---|
| Bytes | 0.96 GiB | 49.44 MiB |
| Shape | (109, 114\|144, 144) | (25,25\|144,144) |
| Count | 26 Tasks | 25 Chunks |
| Type | float32 | numpy.ndarray |

| Navigation Axes | Signal Axes |
|---|---|
| 109  114 | 144  144 |
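
The sizes in this summary follow directly from the shapes and the data type. As a quick sanity check, the arithmetic can be sketched with the standard library alone; the variable names are illustrative, and the numbers are the ones shown in the output above.

```python
import math

nav_shape = (109, 114)   # scan positions
sig_shape = (144, 144)   # detector pixels
nav_chunk = (25, 25)     # scan positions per chunk
bytes_per_value = 4      # float32

total_bytes = math.prod(nav_shape) * math.prod(sig_shape) * bytes_per_value
chunk_bytes = math.prod(nav_chunk) * math.prod(sig_shape) * bytes_per_value

# Chunks at the edge of the scan are smaller, so round up along each axis.
n_chunks = math.prod(math.ceil(n / c) for n, c in zip(nav_shape, nav_chunk))

print(f"array:  {total_bytes / 2**30:.2f} GiB")
print(f"chunk:  {chunk_bytes / 2**20:.2f} MiB")
print(f"chunks: {n_chunks}")
```

This reproduces the 0.96 GiB array, the 49.44 MiB full-size chunk, and the 25 chunks reported for the dataset.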


In [6]:
# Let's change the title here so it shows up when we load the dataset
dp.metadata.General.title = "MgO Nano-Crystals"

In [7]:
# Then we can display the data structure again.
dp

Title: MgO Nano-Crystals
SignalType: electron_diffraction

| | Array | Chunk |
|---|---|---|
| Bytes | 0.96 GiB | 49.44 MiB |
| Shape | (109, 114\|144, 144) | (25,25\|144,144) |
| Count | 26 Tasks | 25 Chunks |
| Type | float32 | numpy.ndarray |

| Navigation Axes | Signal Axes |
|---|---|
| 109  114 | 144  144 |


Inspect the dp object

In [8]:
dp.navigator

In [9]:
# currently the navigator isn't set so in order to plot this we have to "create" one by
# summing the entire dataset. Look at the distributed task-stream to see all of the chunks are
# loaded and then a Summed navigator is created.  This isn't very efficient (or lazy!)
dp.plot()
# if you rerun this cell the navigator is saved (Yay) and it takes much less time to plot the data!

In [10]:
dp.navigator

<ElectronDiffraction2D, title: MgO Nano-Crystals, dimensions: (|109, 114)>

In [11]:
dp.metadata

Inspect the data type of the object

In [12]:
dp.data.dtype

dtype('float32')

Inspect the metadata associated with the object 'dp'

In [13]:
dp.metadata

Set important experimental parameters using the built in function

In [14]:
dp.set_experimental_parameters(beam_energy=300.0,
                               camera_length=21.0,
                               scan_rotation=277.0,
                               convergence_angle=0.7,
                               exposure_time=10.0)

See how this changed the metadata

In [15]:
dp.metadata

Plot the data to inspect it

<a id='cal'></a>

##  2. Alignment & Calibration

Let's center the direct beam for the dataset

In [16]:
# get the direct beam position using `get_direct_beam_position`
shifts = dp.get_direct_beam_position(method='blur',sigma=3,
                                     half_square_width=10)

In [17]:
# compute the shifts
shifts.compute()



In [18]:
# plot the original shifts
hs.plot.plot_images(shifts.T, label=["x-shift", "y-shift"])



[<Axes: title={'center': 'x-shift'}>, <Axes: title={'center': 'y-shift'}>]

In [19]:
# make the shifts into a linear plane
shifts.make_linear_plane()

In [20]:
# plot the shifts again!
hs.plot.plot_images(shifts.T, label=["x-shift", "y-shift"])



[<Axes: title={'center': 'x-shift'}>, <Axes: title={'center': 'y-shift'}>]
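
Replacing the measured shifts with a fitted plane assumes that the beam drifts linearly across the scan (descan) and that everything else is measurement noise. The sketch below illustrates that idea with a plain least-squares fit in NumPy. It is an illustration under those assumptions, not the code behind `make_linear_plane`; the synthetic shifts, slopes and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
ny, nx = 20, 30
yy, xx = np.mgrid[:ny, :nx]

# Synthetic x-shifts: a tilted plane (descan) plus measurement noise.
true_coeffs = np.array([0.02, -0.01, 1.5])  # slope in x, slope in y, offset
shifts_x = true_coeffs[0] * xx + true_coeffs[1] * yy + true_coeffs[2]
shifts_x = shifts_x + rng.normal(scale=0.02, size=shifts_x.shape)

# Least-squares fit of shift = a*x + b*y + c over all scan positions.
design = np.column_stack([xx.ravel(), yy.ravel(), np.ones(xx.size)])
fit_coeffs, *_ = np.linalg.lstsq(design, shifts_x.ravel(), rcond=None)

# The smooth plane that would replace the noisy per-position shifts.
plane = (design @ fit_coeffs).reshape(ny, nx)
```

The same fit would be repeated for the y-shifts. Using the plane instead of the raw values keeps a single badly measured pattern from shifting its neighbours out of alignment.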

Align the dataset based on the direct beam position

In [21]:
dp.center_direct_beam(shifts=shifts)

In [22]:
dp.plot(cmap='inferno')

#### Setting the Calibration
Set the calibration. This is usually known for a given detector from a standard sample,
or you can get it from the dataset itself if it contains features of known spacing.

In [23]:
scale = 0.03246
scale_real = 3.03
dp.set_diffraction_calibration(scale)
dp.set_scan_calibration(scale_real)
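
A calibration is just a linear map from pixel index to physical coordinate. The sketch below shows that map for the diffraction axes, assuming the direct beam sits on pixel 72 of the 144-pixel detector after centring and that `scale` is in reciprocal-space units per pixel (the units are not stated in this cell, so treat them as an assumption).

```python
import numpy as np

scale = 0.03246          # reciprocal-space units per detector pixel
n_pixels = 144           # detector width in pixels
centre = n_pixels // 2   # assumed direct-beam pixel after centring

# Calibrated axis: pixel index -> scattering-vector coordinate.
k_axis = (np.arange(n_pixels) - centre) * scale

# A radius given in calibrated units (e.g. 0.07) expressed in pixels.
bf_radius_px = 0.07 / scale

print(f"axis runs from {k_axis[0]:.3f} to {k_axis[-1]:.3f}")
print(f"a radius of 0.07 covers about {bf_radius_px:.1f} pixels")
```

Seeing radii in pixels is a handy check that a region of interest defined in calibrated units is neither smaller than a pixel nor larger than the detector.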

Plot the calibrated data

In [24]:
dp.plot(cmap='inferno')

<a id='vdf'></a>

##  3. Virtual Diffraction Imaging & Selecting Regions

### 3.1 Interactive VDF Imaging

Plot an interactive virtual image integrating intensity within a circular subset of pixels in the diffraction pattern

In [25]:
# create the rois
bf_roi = hs.roi.CircleROI(cx=0.,cy=0, r_inner=0.0, r=0.07)
df_roi = hs.roi.CircleROI(cx=0.,cy=0, r_inner=0.4, r=1)

In [26]:
mean_dp = dp.mean()
mean_dp.compute()

In [27]:
# let's just add a custom VDF to the image to get both VDF and VBF images
mean_dp.plot()
bf_roi.add_widget(mean_dp)

<hyperspy.drawing._widgets.circle.CircleWidget at 0x7f2f1aba0940>

In [28]:
# inspect the bf roi. I like to save these values above for reproducibility.
bf_roi

CircleROI(cx=0, cy=0, r=0.07, r_inner=0)

In [29]:
mean_dp.plot()
df_roi.add_widget(mean_dp)

<hyperspy.drawing._widgets.circle.CircleWidget at 0x7f2f1aba0e20>

In [30]:
# inspect the df roi. I like to save these values above for reproducibility.
df_roi

CircleROI(cx=0, cy=0, r=1, r_inner=0.4)

In [31]:
# get the integrated intensity for the bf and df
bf_image = dp.get_integrated_intensity(bf_roi)
df_image = dp.get_integrated_intensity(df_roi)

Get the virtual diffraction image associated with the last integration window used interactively

In [32]:
# compute the two images
bf_image.compute()
df_image.compute()

Let's plot the two virtual images

In [33]:
hs.plot.plot_images([bf_image, df_image], label=["Virtual Bright Field", "Virtual Dark Field"])

[<Axes: title={'center': 'Virtual Bright Field'}, xlabel='x axis (nm)', ylabel='y axis (nm)'>,
 <Axes: title={'center': 'Virtual Dark Field'}, xlabel='x axis (nm)', ylabel='y axis (nm)'>]
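
Under the hood, a virtual image is a masked sum: pick a set of detector pixels, then add up their intensity at every scan position. The following NumPy sketch shows this for a disc (bright field) and an annulus (dark field). It is a conceptual illustration rather than pyxem's implementation, and the data, calibration and radii are made-up values.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.random((4, 5, 32, 32))   # (scan_y, scan_x, det_y, det_x)
scale = 0.1                         # hypothetical calibration per pixel

# Calibrated distance of every detector pixel from the pattern centre.
ky, kx = (np.indices(data.shape[-2:]) - 16) * scale
k_radius = np.hypot(kx, ky)

bf_mask = k_radius <= 0.3                         # disc around the direct beam
df_mask = (k_radius >= 0.8) & (k_radius <= 1.4)   # annulus further out

# Sum the masked detector pixels at every scan position.
vbf = (data * bf_mask).sum(axis=(-2, -1))
vdf = (data * df_mask).sum(axis=(-2, -1))
```

Each result has one value per scan position, which is why the virtual images have the navigation shape of the original dataset.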

Inspect the metadata

In [34]:
bf_image.metadata

Save the virtual dark-field image as a 32-bit TIFF

In [35]:
df_image.save('df_image.tif')

Overwrite '/home/cssfrancis/workshop/df_image.tif' (y/n)?
 n


### 3.2 Azimuthal Integration

Pyxem uses the [`pyFAI`](https://github.com/silx-kit/pyFAI/tree/v2023.1) library to handle azimuthal integration, including the effects of the Ewald sphere. Because of this, you should set the calibration and the beam energy before integrating.

For speed, an AzimuthalIntegrator object is precomputed, which reduces redundant calculations.
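
The beam energy matters because it fixes the electron wavelength, and the wavelength fixes the radius of the Ewald sphere. As a worked example, the relativistic de Broglie wavelength can be computed from physical constants with the standard library; the function name is illustrative and not part of pyxem.

```python
import math

H = 6.62607015e-34       # Planck constant, J s
M_E = 9.1093837015e-31   # electron rest mass, kg
Q_E = 1.602176634e-19    # elementary charge, C
C = 299792458.0          # speed of light, m/s


def electron_wavelength_angstrom(beam_energy_kev):
    """Relativistic de Broglie wavelength for a given beam energy in keV."""
    energy_j = beam_energy_kev * 1e3 * Q_E
    momentum = math.sqrt(2 * M_E * energy_j * (1 + energy_j / (2 * M_E * C**2)))
    return H / momentum * 1e10


wavelength = electron_wavelength_angstrom(300)
ewald_radius = 1 / wavelength   # in inverse angstroms

print(f"wavelength:   {wavelength:.5f} A")
print(f"Ewald radius: {ewald_radius:.1f} A^-1")
```

At 300 keV the sphere radius is roughly 50 inverse angstroms, much larger than the scattering vectors recorded here, so the curvature correction is small but not zero.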

In [36]:
# set the unit and beam_energy
dp.unit = "k_A^-1"
dp.beam_energy=300

In [37]:
# set the Azimuthal Integrator
dp.set_ai()

In [38]:
# get the 1D Azimuthal Integration
azm1d = dp.get_azimuthal_integral1d(npt=10)



In [39]:
# compute the dataset
azm1d.compute()

In [40]:
# plot the transpose to get the virtual image as a function of radius
azm1d.T.plot()
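
Conceptually, a 1-D azimuthal integral bins every detector pixel by its distance from the centre and averages the intensity in each bin. The sketch below does this with NumPy histograms on a synthetic pattern. It assumes a flat detector and ignores the Ewald-sphere geometry that pyFAI accounts for, so it shows the idea only and is no replacement for `get_azimuthal_integral1d`.

```python
import numpy as np

size, npt = 64, 10
yy, xx = np.indices((size, size)) - size // 2
radius = np.hypot(xx, yy)

# Synthetic pattern whose intensity falls off smoothly with radius.
pattern = np.exp(-radius / 10.0)

# Assign every pixel to one of `npt` radial bins, then average per bin.
edges = np.linspace(0, radius.max(), npt + 1)
total, _ = np.histogram(radius, bins=edges, weights=pattern)
count, _ = np.histogram(radius, bins=edges)
profile = total / count
```

With `npt=10`, every diffraction pattern collapses to ten numbers, so transposing the result gives ten virtual images, one per radial bin.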

### 3.3 Select a region in the scan

Plot the data with an adjustable marker indicating where to crop the scan region

In [41]:
reg = hs.roi.RectangularROI(left=50.,
                            top=100.,
                            right=100.,
                            bottom=300.)
dp.plot(cmap='inferno')
reg.add_widget(dp)

<hyperspy.drawing._widgets.rectangles.RectangleWidget at 0x7f2f1afa5cc0>

Crop the dataset based on the region defined above

In [42]:
dpc = reg(dp)

Calculate the mean diffraction pattern from the selected region

In [43]:
dpcm = dpc.mean(axis = dpc.axes_manager.navigation_axes)

Plot the mean diffraction pattern from the selected region

In [44]:
dpcm.plot(cmap='inferno')

<a id='ml'></a>

##  4. Unsupervised learning

Perform singular value decomposition (SVD) of the data

Obtain a "Scree plot" by plotting the fraction of variance described by each principal component

In [45]:
# SVD won't converge with zeros
dpc.data = dpc.data+0.01

In [46]:
# Lazy decompositions only work with the processes and threaded schedulers (not the distributed scheduler)
dpc.compute() # compute this to load it into memory

In [47]:
# Perform a SVD Decomposition
dpc.decomposition(algorithm='SVD',
                  normalize_poissonian_noise=True,
                  centre=None)

Decomposition info:
  normalize_poissonian_noise=True
  algorithm=SVD
  output_dimension=None
  centre=None


In [48]:
dpc.plot_explained_variance_ratio()

<Axes: title={'center': 'MgO Nano-Crystals\nPCA Scree Plot'}, xlabel='Principal component index', ylabel='Proportion of variance'>
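
A scree plot comes straight from the singular values: the squared singular values, normalised to sum to one, give the fraction of variance carried by each component. Here is a minimal NumPy sketch on synthetic data, where each row stands in for one flattened diffraction pattern. The rank-3 structure and the noise level are assumptions chosen to make the "elbow" obvious.

```python
import numpy as np

rng = np.random.default_rng(3)
n_patterns, n_pixels, rank = 200, 64, 3

# Low-rank synthetic data: each pattern mixes 3 basis patterns, plus weak noise.
loadings = rng.random((n_patterns, rank))
factors = rng.random((rank, n_pixels))
matrix = loadings @ factors + 0.01 * rng.normal(size=(n_patterns, n_pixels))

# Singular values -> fraction of variance carried by each component.
singular_values = np.linalg.svd(matrix, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

print(np.round(explained_variance_ratio[:5], 4))
```

The first three ratios carry almost all of the variance and the rest sit at the noise floor; that drop is the feature to look for when choosing `output_dimension`.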

Perform non-negative matrix factorisation (NMF)

In [49]:
dpc.decomposition(normalize_poissonian_noise=True,
                  algorithm='NMF',
                  output_dimension=5,
                  max_iter=600)

Decomposition info:
  normalize_poissonian_noise=True
  algorithm=NMF
  output_dimension=5
  centre=None
scikit-learn estimator:
NMF(max_iter=600, n_components=5)
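
NMF factorises the data into loadings and factors that are all non-negative, which often makes the components easier to read as physical diffraction patterns. The sketch below uses the classic multiplicative-update rules in NumPy. This is a different and simpler solver than the scikit-learn estimator reported above, and the synthetic data and iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_patterns, n_pixels, n_components = 60, 40, 5

# Non-negative synthetic data built from 5 non-negative components.
matrix = rng.random((n_patterns, n_components)) @ rng.random((n_components, n_pixels))

# Random non-negative starting point for loadings (W) and factors (H).
W = rng.random((n_patterns, n_components))
H = rng.random((n_components, n_pixels))
eps = 1e-12  # guards against division by zero

initial_error = np.linalg.norm(matrix - W @ H)
for _ in range(200):
    # Multiplicative updates keep every entry of W and H non-negative.
    H *= (W.T @ matrix) / (W.T @ W @ H + eps)
    W *= (matrix @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(matrix - W @ H)

print(f"reconstruction error: {initial_error:.2f} -> {final_error:.2f}")
```

Because every update multiplies by a non-negative ratio, the factors can never go negative, which is the property that separates NMF from SVD.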




In [50]:
dpc.plot_decomposition_results()


<a id='vec'></a>

##  5. Peak Finding

Perform peak finding on all diffraction patterns in data

In [51]:
# This will immediately compute and return a BaseSignal
peaks = dp.find_peaks(method='difference_of_gaussian',
                       min_sigma=1.,
                       max_sigma=6.,
                       sigma_ratio=1.6,
                       threshold=0.04,
                       overlap=0.99,
                       interactive=False)
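
The difference-of-Gaussians idea is to blur the pattern at two nearby scales and subtract: blob-like features of about the right size survive, while smooth background cancels. The sketch below is a simplified single-scale version in NumPy, whereas a full implementation searches a range of sigmas between `min_sigma` and `max_sigma`. The helper names, the synthetic pattern and the threshold are assumptions for illustration.

```python
import numpy as np


def gaussian_blur(image, sigma):
    """Separable Gaussian blur using 1-D convolutions along each axis."""
    radius = int(4 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()

    def blur_line(line):
        return np.convolve(line, kernel, mode="same")

    blurred = np.apply_along_axis(blur_line, 0, image)
    return np.apply_along_axis(blur_line, 1, blurred)


def local_maxima(response, threshold):
    """Return (row, col) of pixels above threshold that beat all 8 neighbours."""
    n_rows, n_cols = response.shape
    padded = np.pad(response, 1, mode="constant", constant_values=-np.inf)
    is_max = response > threshold
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            neighbour = padded[1 + dr:1 + dr + n_rows, 1 + dc:1 + dc + n_cols]
            is_max &= response >= neighbour
    return [tuple(int(v) for v in rc) for rc in np.argwhere(is_max)]


# Synthetic pattern with two Gaussian spots at known (row, col) positions.
rows, cols = np.indices((64, 64))
pattern = np.zeros((64, 64))
for r0, c0 in [(20, 30), (45, 12)]:
    pattern += np.exp(-((rows - r0) ** 2 + (cols - c0) ** 2) / (2 * 2.0**2))

sigma, sigma_ratio = 2.0, 1.6
dog = gaussian_blur(pattern, sigma) - gaussian_blur(pattern, sigma * sigma_ratio)
peaks = local_maxima(dog, threshold=0.1)
```

Here `sigma_ratio` plays the same role as in the cell above: it sets how far apart the two blurring scales are.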

Check the peaks object type

In [52]:
from pyxem.signals import DiffractionVectors

In [53]:
dp

Title: MgO Nano-Crystals
SignalType: electron_diffraction

| | Array | Chunk |
|---|---|---|
| Bytes | 0.96 GiB | 49.44 MiB |
| Shape | (109, 114\|144, 144) | (25,25\|144,144) |
| Count | 102 Tasks | 25 Chunks |
| Type | float32 | numpy.ndarray |

| Navigation Axes | Signal Axes |
|---|---|
| 109  114 | 144  144 |


In [54]:
# Convert the peaks found to a Diffraction Vectors Object
dv = DiffractionVectors.from_peaks(peaks, center=(72, 72), calibration=dp.axes_manager.signal_axes[0].scale)



Look at what's in the peaks object

In [55]:
# Plot the number of peaks found at each point
dv.get_diffracting_pixels_map().plot()



In [56]:
# Cluster the vectors
distance_threshold = 0.1
min_samples = 7
unique_peaks = dv.get_unique_vectors(method='DBSCAN',
                                     distance_threshold=distance_threshold,
                                     min_samples=min_samples)
print(np.shape(unique_peaks.data)[0], ' unique vectors were found.')

68  unique vectors were found.


In [57]:
# remove the zero beam
unique_peaks = unique_peaks.filter_magnitude(min_magnitude=.4,
                                   max_magnitude=np.inf)
print(np.shape(unique_peaks)[0], ' unique vectors.')


67  unique vectors.
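
Removing the zero beam amounts to a filter on vector length: the direct beam has a magnitude close to zero, so any sensible lower bound drops it. A minimal NumPy sketch of that filter follows. The vectors are made-up values, and whether the real `filter_magnitude` treats its bounds as inclusive is not something this example settles.

```python
import numpy as np

# Hypothetical unique vectors as (kx, ky) pairs in calibrated units.
vectors = np.array([
    [0.00, 0.01],    # direct beam, essentially zero length
    [0.47, 0.00],
    [-0.33, 0.34],
    [0.05, -0.10],   # too close to the direct beam
    [0.67, 0.66],
])

min_magnitude, max_magnitude = 0.4, np.inf
magnitudes = np.linalg.norm(vectors, axis=1)
keep = (magnitudes >= min_magnitude) & (magnitudes <= max_magnitude)
filtered = vectors[keep]

print(len(filtered), "vectors kept out of", len(vectors))
```

In the run above exactly one vector was dropped (68 to 67), which is consistent with only the direct beam falling below the 0.4 cut-off.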


In [58]:
# plot the transpose and the unique vectors
dpt = dp.T
dpt.plot()
for x,y in zip(unique_peaks.data[:,1],unique_peaks.data[:,0]):
    markers = hs.markers.point(x= x,y=y, color="r")
    dpt.add_marker(markers, plot_on_signal=False)

In [59]:
# create Virtual Images from the unique vectors
from pyxem.generators import VirtualDarkFieldGenerator
radius=0.1

vdfgen = VirtualDarkFieldGenerator(dp, unique_peaks)
VDFs = vdfgen.get_virtual_dark_field_images(radius=radius)
VDFs

Title: Stack of Integrated intensity
SignalType: virtual_dark_field

| | Array | Chunk |
|---|---|---|
| Bytes | 3.18 MiB | 48.54 kiB |
| Shape | (67\|109, 114) | (1\|109,114) |
| Count | 4968 Tasks | 67 Chunks |
| Type | float32 | numpy.ndarray |

| Navigation Axes | Signal Axes |
|---|---|
| 67  1 | 109  114 |


In [60]:
# compute the virtual dark-field images from the peaks
VDFs.compute()

In [61]:
VDFs.plot()

In [62]:
min_distance = 10.5
min_size = 40
max_size = 1000
max_number_of_grains = 3000
marker_radius = 2
exclude_border = 2
threshold= 0.65

In [63]:
# test for the right watershed parameters
from pyxem.utils.segment_utils import separate_watershed
i = 21
sep_i = separate_watershed(
    VDFs.inav[i].data, min_distance=min_distance, min_size=min_size,
    max_size=max_size, max_number_of_grains=max_number_of_grains,
    exclude_border=exclude_border, marker_radius=marker_radius,
    threshold=threshold, plot_on=True)

In [64]:
segs = VDFs.get_vdf_segments(min_distance=min_distance,
                             min_size=min_size,
                             max_size = max_size,
                             max_number_of_grains = max_number_of_grains,
                             exclude_border=exclude_border,
                             marker_radius=marker_radius,
                             threshold=threshold)




In [65]:
segs.segments.plot(cmap='magma_r')

Calculate normalised cross-correlations between all VDF image segments to identify those that are related to the same crystal.

In [66]:
ncc_vdf = segs.get_ncc_matrix()
ncc_vdf.plot(scalebar=False, cmap='RdBu')
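
One common definition of the normalised cross-correlation between two images is the zero-mean, unit-norm dot product, which is 1 for images that differ only in brightness and near 0 for unrelated ones. The sketch below builds a small correlation matrix this way in NumPy. It illustrates the quantity being plotted and is not necessarily the exact formula inside `get_ncc_matrix`; the segments are synthetic.

```python
import numpy as np


def ncc(a, b):
    """Zero-mean normalised cross-correlation of two equally sized images."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a**2).sum() * (b**2).sum()))


rng = np.random.default_rng(5)
grain = rng.random((16, 16))
segments = np.stack([
    grain,                                      # a segment
    2.0 * grain + 0.05 * rng.random((16, 16)),  # same grain, brighter and noisy
    rng.random((16, 16)),                       # an unrelated segment
])

n = len(segments)
ncc_matrix = np.array([[ncc(segments[i], segments[j]) for j in range(n)]
                       for i in range(n)])

print(np.round(ncc_matrix, 2))
```

Segments that come from different diffraction spots of the same crystal light up the same region of the scan, so they show up as strongly correlated off-diagonal entries.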



If the correlation value between segments exceeds corr_threshold, those segments are summed. The summed segment is discarded if the number of segments that went into it is below vector_threshold, since this number corresponds to the number of detected diffraction peaks associated with a single crystal. The vector_threshold criterion avoids including segment images that result from noise or incorrect segmentation.
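
The thresholding logic can be sketched as a simple greedy grouping over a correlation matrix. This is a deliberately simplified stand-in, not pyxem's `correlate_vdf_segments`: it ignores `segment_threshold`, the function name `group_segments` is hypothetical, and the correlation values are made up.

```python
import numpy as np


def group_segments(ncc_matrix, corr_threshold, vector_threshold):
    """Greedily group segments whose correlation with a seed exceeds the threshold."""
    remaining = set(range(len(ncc_matrix)))
    groups = []
    while remaining:
        seed = min(remaining)
        members = sorted(
            j for j in remaining
            if j == seed or ncc_matrix[seed, j] > corr_threshold
        )
        remaining.difference_update(members)
        # Keep the group only if enough segments (diffraction peaks) support it.
        if len(members) >= vector_threshold:
            groups.append(members)
    return groups


ncc_matrix = np.full((6, 6), 0.05)
ncc_matrix[:4, :4] = 0.8   # four segments from one crystal
ncc_matrix[4:, 4:] = 0.9   # two segments that only match each other
np.fill_diagonal(ncc_matrix, 1.0)

groups = group_segments(ncc_matrix, corr_threshold=0.3, vector_threshold=4)
print(groups)
```

The pair of segments that only correlate with each other is dropped because two supporting peaks fall short of the `vector_threshold` of 4, while the group of four survives.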

In [67]:
corr_threshold=0.3
vector_threshold=4
segment_threshold=2

In [68]:
corrsegs = segs.correlate_vdf_segments(
    corr_threshold=corr_threshold, vector_threshold=vector_threshold,
    segment_threshold=segment_threshold)
print(np.shape(corrsegs.segments)[0],' correlated segments were found.')



7  correlated segments were found.


In [69]:
hs.plot.plot_images(corrsegs.segments, cmap='magma_r', axes_decor='off',
                    per_row=int(np.shape(corrsegs.segments)[0]/2),
                    suptitle='', scalebar=False, scalebar_color='white',
                    colorbar=False,
                    padding={'top': 0.95, 'bottom': 0.05,
                             'left': 0.05, 'right':0.78})

[<Axes: title={'center': ' (0,)'}, xlabel='x axis (nm)', ylabel='y axis (nm)'>,
 <Axes: title={'center': ' (1,)'}, xlabel='x axis (nm)', ylabel='y axis (nm)'>,
 <Axes: title={'center': ' (2,)'}, xlabel='x axis (nm)', ylabel='y axis (nm)'>,
 <Axes: title={'center': ' (3,)'}, xlabel='x axis (nm)', ylabel='y axis (nm)'>,
 <Axes: title={'center': ' (4,)'}, xlabel='x axis (nm)', ylabel='y axis (nm)'>,
 <Axes: title={'center': ' (5,)'}, xlabel='x axis (nm)', ylabel='y axis (nm)'>,
 <Axes: title={'center': ' (6,)'}, xlabel='x axis (nm)', ylabel='y axis (nm)'>]