<p style="font-size:32px; font-weight: bolder; text-align: center"> Structural representations for materials and molecules </p>
<p style="text-align: center"><i> authored by: <a href="mailto:michele.ceriotti@gmail.com"> Michele Ceriotti </a></i></p>

This notebook provides an overview of the ideas underlying the construction of (symmetric) descriptors of atomic structures, based on an atom-centered expansion of the neighbor density. 

We are going to work with two datasets of atomic structures (a snapshot of a simulation of undercooled iron, and a collection of isomers and crystalline polymorphs of azaphenacene) that allow us to introduce the concept of descriptors and to show how to apply some simple regression and dimensionality reduction techniques to gather insights into the nature of the problem and into structure-property relations. 

_References:_
- [Chem. Rev. 121, 9759 (2021)](https://pubs.acs.org/doi/10.1021/acs.chemrev.1c00021)
- [Phys. Rev. B 87, 184115 (2013)](http://link.aps.org/doi/10.1103/PhysRevB.87.184115)
- [J. Chem. Phys. 156, 204115 (2022)](https://aip.scitation.org/doi/10.1063/5.0087042)


In [None]:
%matplotlib widget
# scwidgets import
import matplotlib as mpl
import matplotlib.pyplot as plt
import chemiscope

import ipywidgets
from ipywidgets import FloatSlider, IntSlider, Checkbox, Dropdown, HBox, Layout, HTML

from markdown import markdown as mdwn

import scwidgets
from scwidgets.check import (
    Check,
    CheckRegistry,
    assert_numpy_allclose,
    assert_numpy_floating_sub_dtype,
    assert_shape,
    assert_type,
)
from scwidgets.code import ParameterPanel, CodeInput
from scwidgets.cue import CueObject, CueFigure
from scwidgets.exercise import CodeExercise, TextExercise, ExerciseRegistry

In [None]:
import numpy as np
import ase, ase.io
import itertools

import rascaline
from metatensor import mean_over_samples, Labels, slice_block

from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV

In [None]:
# set CSS style for code-hide
scwidgets.get_css_style()

In [None]:
exercise_registry = ExerciseRegistry(filename_prefix="module_02")
exercise_registry

In [None]:
check_registry = CheckRegistry()
check_registry

In [None]:
module_summary = TextExercise(
    exercise_description="""You can use this box to make general considerations, 
    or keep track of your doubts and questions about this notebook.""",
    exercise_registry=exercise_registry,
    exercise_title="Module comments",
    exercise_key="00"
)
display(module_summary)

<a id="data-driven"> </a>

# Descriptors of atomic environments in supercooled iron

As a first example we consider a structure which is cut out of a simulation of freezing iron ([Shibuta et al., Acta Mater. (2016)](https://www.sciencedirect.com/science/article/abs/pii/S1359645415301397)).
The snapshot contains a few solid nuclei embedded in a supercooled liquid.

We will use this structure to define atom-centered descriptors, and perform principal component analysis to color atoms based on whether they are in liquid or solid regions. 
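
As a rough preview of that last step, the sketch below runs a principal component analysis on a synthetic feature matrix with one row per atom. The two Gaussian clusters are an assumption made purely for illustration (they stand in for liquid-like and solid-like environments and are not taken from the iron snapshot), and the projection is done with a plain NumPy SVD of the centered data, which gives the same components as `sklearn.decomposition.PCA` up to a sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-atom descriptors (an assumption, not real data):
# 200 "liquid-like" and 100 "solid-like" environments with 8 features each.
liquid = rng.normal(0.0, 1.0, size=(200, 8))
solid = rng.normal(3.0, 0.3, size=(100, 8))
features = np.vstack([liquid, solid])

# PCA: center the features, then take the leading right singular vectors.
centered = features - features.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
projection = centered @ vt[:2].T

# Fraction of the total variance captured by each component.
explained = s**2 / np.sum(s**2)

# The first principal component can serve as a per-atom color value.
atom_color = projection[:, 0]
print(projection.shape, explained[:2])
```

With well-separated clusters the first component alone distinguishes the two groups, which is the behavior one hopes to see when coloring atoms of the snapshot.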

Let's start by taking a look at the structure. Note that, to make the notebook fast enough, this is carved out of a larger structure, and so it is not periodic in the $x,y$ directions. 

In [None]:
frame_iron = ase.io.read("data/iron-snapshot.xyz", 0)

# requires running in a jupyter notebook, and takes a while to load - it's > 100k atoms.
cs = chemiscope.show(frames=[frame_iron], mode="structure", 
                     settings={"structure": [ {"bonds": False, "unitCell": True, 
                             } ] },)
display(cs)

## Atom-centered environments

A first important consideration is that we are looking at an individual configuration, and that we want to identify atomic structures _within_ this configuration - distinguishing liquid regions, crystalline nuclei, and ideally the interfacial regions.

<center><img src="figures/environments.png" width="500"/></center>

One way to do this is to look at atomic _environments_, i.e. spherical atom-centered regions that we can describe in terms of the collection of interatomic distance vectors around each atom. You can look at the environments for the frame in the viewer below.

In [None]:
# requires running in a jupyter notebook, and takes a while to load - it's > 100k atoms.
sel_env_idx = np.array([29030, 55650, 99980, 97370, 19570, 125940])
sel_env_idx.sort()
cs = chemiscope.show(frames=[frame_iron], mode="structure", 
                     settings={"structure": [ 
                         {"bonds": False, "unitCell": False,
                          "keepOrientation": True,
                     'environments': {'activated': False, 'center': False}}] 
                              },  
                     environments=[[0,s,5.0] for s in sel_env_idx ] ,                     
                    )

In [None]:
def update_co(code_exercise):
    cutoff = code_exercise.parameters["cutoff"]
    showenv = code_exercise.parameters["showenv"]
    cs.settings={"structure": [{"environments": {'activated': showenv, 
                                                 'center':showenv,
                                                 "cutoff":cutoff}}]}
cs_wp = ParameterPanel(
    showenv=Checkbox(value=False, description="show environments"),
    cutoff=FloatSlider(value=5.,min=2,max=8,step=0.25, description=r"cutoff / Å"),    
)
cue_cs = CueObject()
with cue_cs:
    display(cs)
    
cs_demo = CodeExercise(
            parameters=cs_wp,
            cue_outputs = cue_cs,
            update_func = update_co,
            update_mode="release")
display(cs_demo)
cs_demo.run_update()

In [None]:
ex01_txt = TextExercise(
    exercise_description="""
It is always a good idea to take a good look at the data you are working with. 
Just play around with the viewer, look at the structure. 
What kind of features can you note by just observing the arrangement of the atoms?
Now switch on the environment view and use the atom slider to highlight a few select ones.
Can you easily recognize individual environments as liquid-like or solid-like?
    """,
    exercise_registry=exercise_registry,
    exercise_key="01",
    exercise_title="Exercise 01: What am I looking at?"
)
display(ex01_txt)

## Representations, a primer

Having decided to focus on the atomic _environments_ of a structure $A$, which we will indicate as $A_i$, we need an appropriate way to encode information on the positions and types of the _neighbors_ within the environment, $\{(a_j, \mathbf{r}_{ji})\}$.

<center><img src="figures/requirements.png" width="400"/></center>

In practice, we want to map each environment to a vector of descriptors, or features, $A_i\rightarrow\boldsymbol\xi(A_i)$. It is desirable to use a mapping that fulfills a number of basic mathematical requirements: 

1. **Locality** (that is already satisfied by the use of atom-centered environments with a finite cutoff)
2. **Completeness** (two environments that are inequivalent should have different feature vectors)
3. **Smoothness** (the mapping between Cartesian coordinates and features should be differentiable, and "regular")
4. **Symmetry** (the mapping should be independent of rigid translations, rotations and permutation of atom indices)

It is clear that $\{\mathbf{r}_{ij}\}$ fulfills (1) and (2), but it is not smooth (the number of vectors changes when atoms enter or leave the cutoff) and is only invariant to translations. Using interatomic _distances_ $r_{ij}=|\mathbf{r}_{ij}|$ makes the features invariant to rotations, but they still depend on the ordering of the atoms. 
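
These statements are easy to check numerically. The sketch below uses random neighbor vectors (an assumption for illustration, not taken from the iron snapshot) and an arbitrary rotation about the $z$ axis: the distances are unchanged by the rotation, while relabeling the neighbors changes the list unless it is made order-independent, here simply by sorting.

```python
import numpy as np

rng = np.random.default_rng(42)

# Six hypothetical neighbor displacement vectors around a central atom.
r_ij = rng.normal(size=(6, 3))
dist = np.linalg.norm(r_ij, axis=1)

# Rotate every vector by 40 degrees around the z axis.
theta = np.deg2rad(40.0)
rot = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
r_rot = r_ij @ rot.T
dist_rot = np.linalg.norm(r_rot, axis=1)

# Relabel the neighbors by reversing their order.
perm = np.arange(len(dist))[::-1]
dist_perm = dist[perm]

vectors_unchanged = np.allclose(r_ij, r_rot)      # rotation changes the vectors
distances_unchanged = np.allclose(dist, dist_rot)  # but not the distances
order_independent = np.allclose(dist, dist_perm)   # the raw list depends on order
sorted_equal = np.allclose(np.sort(dist), np.sort(dist_perm))
print(vectors_unchanged, distances_unchanged, order_independent, sorted_equal)
```

Sorting is only one way to remove the dependence on atom ordering; summing a function of the distances over all neighbors achieves the same goal and keeps the size of the feature vector fixed.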

Let's now try to build an invariant descriptor: a _histogram_ of the distances, discretized on a real-space grid. We use a kernel-density estimation, and include a _cutoff function_ to smoothly send contributions to zero as atoms approach the cutoff distance:

$$
\xi_k(A_i) = \sum_{j\in A_i} g\left(\frac{r_k - r_{ij}}{\Delta_r}\right) f_\mathrm{cut}(r_{ij})
$$

where $g(x)=\exp(-x^2)$ is an unnormalized Gaussian, $r_k$ are $n_\mathrm{grid}$ equally spaced grid points between $0$ and $r_\mathrm{cut}$ (endpoints included), $\Delta_r=r_\mathrm{cut}/n_\mathrm{grid}$ sets the width of the smearing, and 
$f_\mathrm{cut}(r_{ij})=\frac{1}{2}\left[1+\cos (\pi r_{ij}/r_\mathrm{cut})\right]$.
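
To see what the cutoff function does, the sketch below evaluates a cosine cutoff at a few distances. It assumes the normalized form $\frac{1}{2}[1+\cos(\pi r/r_\mathrm{cut})]$, which goes from 1 at $r=0$ to 0 at $r=r_\mathrm{cut}$; the helper name `cosine_cutoff` is hypothetical and not part of any library used in this notebook.

```python
import numpy as np

def cosine_cutoff(r, rcut):
    """Cosine cutoff 0.5 * (1 + cos(pi * r / rcut)), meant for 0 <= r <= rcut."""
    r = np.asarray(r, dtype=float)
    return 0.5 * (1.0 + np.cos(np.pi * r / rcut))

rcut = 5.0
r = np.array([0.0, 2.5, 4.9, 5.0])

# Weight given to a neighbor at each of these distances.
weights = cosine_cutoff(r, rcut)
print(weights)
```

A neighbor just inside the cutoff receives a weight close to zero, so its contribution to the features switches off continuously as it leaves the environment instead of disappearing abruptly.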

In the following exercise you will be asked to implement this radial distribution fingerprint, and the exercise will compute and display it for the six environments visualized in the viewer for exercise 1. 

In [None]:
ex02_wci = CodeInput(
        function_name="radial_fp", 
        function_parameters="rij_list, rcut, ngrid",
        docstring="""
        compute a radial distribution fingerprint using a kernel density estimation in real-space
        
        
        :param rij_list: a list of interatomic distances for an environment
        :param rcut: cutoff distance
        :param ngrid: number of grid points and size of the feature vector
        
        :returns: a vector with the radial distribution features computed for the given environment
""",
        function_body="""

import numpy as np
rgrid = np.linspace(0, rcut, ngrid)
feats = np.zeros(shape=rgrid.shape)

### ADD THE CALCULATION OF THE FEATURES HERE ###

return feats
"""
        )


In [None]:
# makes neighbor list for the six selected environments (ASE is too slow to be usable for this box)
max_cutoff = 8
px = frame_iron.positions
az = frame_iron.cell[2,2]
nl_idx = []
nl_dx = []
nl_dr = []
for isel in sel_env_idx:
    dx = px - px[isel]
    dx[:,2] /= az  # pbc along z
    dx[:,2] -= np.round(dx[:,2])
    dx[:,2] *= az
    dr = np.sqrt((dx**2).sum(axis=1))
    iw = np.where(dr<max_cutoff)[0]
    nl_idx.append(iw)
    nl_dx.append(dx[iw])
    nl_dr.append(dr[iw])
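
The three lines that rescale, round and rescale `dx[:,2]` in the cell above apply the minimum-image convention along $z$, the only periodic direction of this snapshot. A toy version, with a hypothetical cell length of 10 Å and made-up separations, shows the effect:

```python
import numpy as np

az = 10.0                              # hypothetical cell length along z, in Å
dz = np.array([9.0, -6.0, 3.0, -2.0])  # raw z separations between atom pairs

# Express the separations in units of the cell length, subtract the nearest
# integer number of cells, and convert back to Å.
frac = dz / az
frac -= np.round(frac)
dz_wrapped = frac * az
print(dz_wrapped)
```

Separations larger than half the cell length are replaced by the shorter distance to the nearest periodic image, while smaller ones are left unchanged.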

In [None]:
ex02_img = mpl.image.imread('figures/selected-env.jpg')
def update_02(code_exercise):
    rcut, ngrid = code_exercise.parameters.values()
    ax, aximg = code_exercise.cue_outputs[0].figure.get_axes()
    aximg.imshow(ex02_img)
    aximg.axis('off') 
    rgrid = np.linspace(0, rcut, ngrid)
    for dr, l in zip(nl_dr, ["A", "B", "C", "D", "E", "F"]):
        ygrid = ex02_wci.get_function_object()(dr, rcut, ngrid)    
        ax.plot(rgrid, ygrid,label=l)
    # ax.text(-4,8,f'$\ell = ${l:.3f}')
    ax.set_xlabel(r'$r$ / Å')
    ax.set_ylabel(r'$\xi$')
    ax.legend()

ex02_pb =  ParameterPanel(
    rcut = FloatSlider(value=5,min=3,max=8,step=0.1,description=r'$r_{cut}$ / Å'),
    ngrid = IntSlider(value=10,min=5,max=20,description=r'$n_{grid}$') )

In [None]:
ex02_figure, ex02_ax = plt.subplots(1, 2, figsize=(8,5), tight_layout=True)
ex02_output = CueFigure(ex02_figure)
ex02_ax[1].imshow(ex02_img)
ex02_ax[1].axis('off') 

ex02_code_demo = CodeExercise(
            code= ex02_wci,
            parameters= ex02_pb,
            check_registry=check_registry,
            cue_outputs = [ex02_output],
            update_func = update_02,
    exercise_key="02",
    exercise_registry=exercise_registry,
    exercise_title="Exercise 02: Radial distribution fingerprints",
    exercise_description="""
Implement a function that computes a radial distribution fingerprint given the list
of distances for an environment, a cutoff and the number of grid points. 
You should implement the exact functional form given above, if you want checks to pass, 
but of course you're also encouraged to try something different!
"""
)

ex02_ref_input = [{"rij_list": np.array([1,3,4]), "rcut": 5, "ngrid": 10},
                 {"rij_list": np.array([5,7,8]), "rcut": 6, "ngrid": 4}
                 ]
ex02_ref_output = [(np.array([0.0331333 , 0.82091159, 1.72185318, 0.30631179, 0.06605532,
       0.56761871, 0.47532342, 0.21108182, 0.08683   , 0.00349805])/2,),
                  (np.array([2.00234259e-06, 2.45588889e-03, 8.87637079e-02, 2.56310425e-01])/2,)
                  ]

check_registry.add_check(ex02_code_demo,
    asserts=[
        assert_type,
        assert_shape,
        assert_numpy_allclose,
    ],
    inputs_parameters=ex02_ref_input,
    outputs_references=ex02_ref_output
)
                         

display(ex02_code_demo)

In [None]:
ex02b_txt = TextExercise(
    exercise_description="""
Experiment with different grid resolutions, cutoff radius, etc. 
Can you recognize clear-cut differences between liquid-like and solid-like environments?
    """,
    exercise_registry=exercise_registry,
    exercise_key="02b",
    exercise_title="Exercise 02b: Resolving power of radial fingerprints."
)
display(ex02b_txt)

## Atom-centered symmetry functions

This set of radial features can be seen as a special case of so-called _atom-centered symmetry functions_ (ACSFs), one of the first types of representations used e.g. by [Behler and Parrinello](http://doi.org/10.1103/PhysRevLett.98.146401). 


<center><img src="figures/radial-acsf.png" width="500"/><br/>
<i> Representative examples of radial symmetry functions.</i><br/><br/>
</center>


ACSFs are designed as bespoke functions $\phi_k$ of the internal coordinates of the environment, accumulated over neighbors to achieve invariance to atom index permutations.
They can be generalized to also include functions of distances and angles (3-body symmetry functions) and can be tuned to focus on the structural features that are most discriminating, or most straightforwardly related to the structure-property relations one is trying to learn.
Radial (two-body) symmetry functions take the form

$$
\xi_k(A_i) = \sum_{j\in A_i} \phi_k(r_{ij}) f_\mathrm{cut}(r_{ij})
$$

where $\phi_k$ typically has a parametric form, or enumerates a set of orthogonal basis functions. 

In [None]:
ex03_wci = CodeInput(
        function_name="radial_acsf", 
        function_parameters="rij_list, rcut, delta, rs",
        docstring="""
        compute a radial Gaussian atom-centered symmetry function (ACSF) for an environment
        
        
        :param rij_list: a list of interatomic distances for an environment
        :param rcut: cutoff distance
        :param delta: the smearing of the Gaussian ACSF
        :param rs: the center of the Gaussian ACSF
        
        :returns: a float containing the value of the ACSF for the environment
""",
        function_body="""

import numpy as np

acsf = 0.0
### ADD THE CALCULATION OF THE ACSF VALUE HERE ###

return acsf
"""
        )

In [None]:
ex03_img = mpl.image.imread('figures/selected-env.jpg')
def update_03(code_exercise):
    rcut, delta, rs = code_exercise.parameters.values()
    ax, aximg = code_exercise.cue_outputs[0].figure.get_axes()
    aximg.imshow(ex03_img)
    aximg.axis('off') 
    rgrid = np.linspace(0, rcut, 100)
    ygrid = np.zeros_like(rgrid)
    for ir, r in enumerate(rgrid):
        ygrid[ir] = ex03_wci.get_function_object()([r], rcut, delta, rs)    
    ax.plot(rgrid, ygrid, 'r-')
    # ax.text(-4,8,f'$\ell = ${l:.3f}')
    ax.set_xlabel(r'$r$ / Å')
    ax.set_ylabel(r'$\phi_k(r)$')

    labels = []
    for dr, l in zip(nl_dr, ["A", "B", "C", "D", "E", "F"]):
        acf = ex03_wci.get_function_object()(dr, rcut, delta, rs) 
        labels.append(f"{l}: {acf:9.4f}")
    aximg.legend(handles=[mpl.patches.Patch(color="w", )]*6, labels=labels,
                 handlelength=0.1, loc='lower left')

ex03_pb =  ParameterPanel(
    rcut = FloatSlider(value=5,min=3,max=8,step=0.1,description=r'$r_{cut}$ / Å'),
    delta = FloatSlider(value=0.5,min=0.1,max=2,step=0.1,description=r'$\Delta$ / Å'),
    rs = FloatSlider(value=5,min=3,max=8,step=0.1,description=r'$r_s$ / Å'),
    )

In [None]:
ex03_figure, ex03_ax = plt.subplots(1, 2, figsize=(8,4), tight_layout=True)
ex03_output = CueFigure(ex03_figure)
ex03_ax[1].imshow(ex03_img)
ex03_ax[1].axis('off') 

ex03_code_demo = CodeExercise(
            code= ex03_wci,
            parameters= ex03_pb,
            check_registry=check_registry,
            cue_outputs = [ex03_output],
            update_func = update_03,
    update_mode="manual",
    exercise_key="03",
    exercise_registry=exercise_registry,
    exercise_title="Exercise 03: Radial ACSF",
    exercise_description=mdwn("""
Implement a function that computes a Behler-Parrinello atom-centered symmetry function
of the form

$$
\phi_k(r) = \exp\[-(r-r_s)^2/\delta^2\]  f_c(r)
$$

using a cosine cutoff function.
""")
)


display(ex03_code_demo)

In [None]:
ex03b_txt = TextExercise(
    exercise_description="""
Observe how the shape of the symmetry function, and its value for the 
various environments, change with its parameters. Try to find values that maximise the difference
between solid-like and liquid-like environments. 
    """,
    exercise_registry=exercise_registry,
    exercise_key="03b",
    exercise_title="Exercise 03b: ACSF."
)
display(ex03b_txt)

## Discretized density expansion

## Three-body correlations: SOAP features

# Automatic identification of environments

In [None]:
selection = np.where((frame_iron.positions[:,0]>max_cutoff+1) & (frame_iron.positions[:,0]<199-max_cutoff) &
                     (frame_iron.positions[:,1]>max_cutoff+1) & (frame_iron.positions[:,1]<199-max_cutoff) & 
                     (frame_iron.positions[:,2]>17) & (frame_iron.positions[:,2]<27)
                    )[0] 

In [None]:
nl_code = rascaline.NeighborList(cutoff=6.0, full_neighbor_list=True)

In [None]:
%%time 
nl_selected = nl_code.compute(frame_iron)
#                    selected_samples=Labels(names=["first_atom"], values=selection[:,np.newaxis]))

In [None]:
%%time
sb = slice_block(nl_selected.block(0),axis="samples", 
            labels=Labels(names=["first_atom"], values=selection[:,np.newaxis]))

In [None]:
all_samples = nl_selected.block(0).samples.view("first_atom")
labels=Labels(names=["first_atom"], values=selection[:,np.newaxis])

In [None]:
from tqdm.notebook import tqdm

In [None]:
np.asarray(labels["first_atom"])

In [None]:
%%time
labs = np.asarray(labels.view(["first_atom"]))
samp = np.asarray(all_samples)
mask = np.zeros(len(samp), dtype=bool)

# Two-pointer technique
i, j = 0, 0
while i < len(samp) and j < len(labs):
    if samp[i] == labs[j]:
        mask[i] = True
        i += 1
    elif samp[i] < labs[j]:
        i += 1
    else:
        j += 1

In [None]:
%%time
mask=np.isin(all_samples, labels.view(["first_atom"]) )

In [None]:
%%time 
samples_mask = [all_samples.entry(i) in labels for i in tqdm(range(len(all_samples)))]

In [None]:
fun = ex03_wci.get_f

In [None]:
nl_selected

In [None]:
frame_iron.cell

# Structure-property maps for molecular materials

In [None]:
aza_frames = ase.io.read("data/azaphenacene.xyz", ":")

aza_props = {
 prop: np.array([f.info[prop] for f in aza_frames])
 for prop in ["energy", "mobility", "nHB"]}

In [None]:
cs = chemiscope.show(frames = aza_frames,
                properties = chemiscope.extract_properties(aza_frames),
              settings =  {'map': {'x': {'property': 'mobility','scale': 'log'},
                                   'y': {'property': 'energy','scale': 'linear'},
                                   'symbol': 'nHB_class',
                                   'palette': 'inferno',
                                    'color': {'property': 'molecule'},
                                  },
                             'structure': [{'unitCell': True,
                                            'supercell': {'0': 2, '1': 2, '2': 2},
                                           }]
                          },
               mode="default")
display(cs)

# Geometric representations and symmetries

An overview of the ideas of symmetry-compliant descriptors

* Build distance-histogram descriptors "by hand". Internal coordinates and symmetries
* Visualize atomic environments with chemiscope
* Generalize this: density expansion for permutation invariance
* PCA map
* Construction of invariants - explain addition theorem
* Multiple species and azaphenacene.
* PCovR maps

This example takes a structure which is cut out of a simulation of freezing iron ([Shibuta et al., Acta Mater. (2016)](https://www.sciencedirect.com/science/article/abs/pii/S1359645415301397)).
The snapshot contains a few solid nuclei embedded in a supercooled liquid.

In [None]:
nl_code = rascaline.NeighborList(cutoff=6.0, full_neighbor_list=True)

In [None]:
# use the calculator defined above (the original referenced an undefined name `nl`)
nl_val = nl_code.compute(
    frame_iron,
    selected_samples=Labels(names=["first_atom"], values=sel_env_idx[:,np.newaxis]),
)