SALTED: Symmetry-Adapted Learning of Three-dimensional Electron Densities

This repository contains an implementation of symmetry-adapted Gaussian Process Regression for performing equivariant predictions of the electron density of both molecular and condensed-phase systems, as decomposed on an atom-centered spherical harmonics basis.

Documentation

A quick-start guide is provided below; full documentation is also available.

References

  1. Andrea Grisafi, Alberto Fabrizio, David M. Wilkins, Benjamin A. R. Meyer, Clemence Corminboeuf, Michele Ceriotti, "Transferable Machine-Learning Model of the Electron Density", ACS Central Science 5, 57 (2019) [https://pubs.acs.org/doi/10.1021/acscentsci.8b00551]
  2. Alberto Fabrizio, Andrea Grisafi, Benjamin A. R. Meyer, Michele Ceriotti, Clemence Corminboeuf, "Electron density learning of non-covalent systems", Chemical Science 10, 9424 (2019) [https://pubs.rsc.org/en/content/articlelanding/2019/sc/c9sc02696g]
  3. Alan M. Lewis, Andrea Grisafi, Michele Ceriotti, Mariana Rossi, "Learning electron densities in the condensed-phase", Journal of Chemical Theory and Computation 17, 7203 (2021) [https://pubs.acs.org/doi/10.1021/acs.jctc.1c00576]
  4. Andrea Grisafi, Alan M. Lewis, Mariana Rossi, Michele Ceriotti, "Electronic-Structure Properties from Atom-Centered Predictions of the Electron Density", Journal of Chemical Theory and Computation 19, 4451 (2023) [https://pubs.acs.org/doi/10.1021/acs.jctc.2c00850]

Installation

In the SALTED directory, simply run make, followed by pip install .

Dependencies

--> rascaline: installing rascaline requires a Rust compiler. To install one, run:

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh && source "$HOME/.cargo/env"

rascaline can then be installed with:

    pip install git+https://github.com/Luthaf/rascaline.git

--> mpi4py: mpi4py is required for MPI parallelisation; SALTED can nonetheless be run without it. MPI parallelisation also requires a parallel h5py installation. Provided HDF5 has been compiled with MPI support, this can be installed by running:

    HDF5_MPI="ON" CC=mpicc pip install --no-cache-dir --no-binary=h5py h5py
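
As a quick sanity check, the sketch below reports whether the optional MPI stack is importable and whether the installed h5py was built with MPI support. It is not part of SALTED: the helper name `check_parallel_stack` is made up for illustration. It relies on `h5py.get_config().mpi`, the flag h5py exposes for parallel builds.

```python
import importlib


def check_parallel_stack():
    """Report which optional MPI-related packages are usable.

    Hypothetical helper for illustration, not part of SALTED.
    """
    status = {"mpi4py": False, "h5py": False, "h5py_mpi": False}

    try:
        # Importing the top-level package does not initialise MPI.
        importlib.import_module("mpi4py")
        status["mpi4py"] = True
    except (ImportError, OSError):
        pass

    try:
        h5py = importlib.import_module("h5py")
        status["h5py"] = True
        # h5py exposes its build configuration; `.mpi` is True for parallel builds.
        status["h5py_mpi"] = bool(h5py.get_config().mpi)
    except (ImportError, OSError):
        pass

    return status


if __name__ == "__main__":
    for name, ok in check_parallel_stack().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If `h5py_mpi` is reported as missing while `h5py` is present, the installed h5py is a serial build and would need to be reinstalled as described above before enabling MPI parallelisation.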

Input file

SALTED input is provided in an inp.yaml file, which is structured in the following sections:

  • salted (required): define root storage directory and workflow label
  • system (required): define system parameters
  • qm (required): define information about quantum-mechanical reference
  • descriptor (required): define parameters of symmetry-adapted descriptors
  • gpr (required): define Gaussian Process Regression parameters
  • prediction (optional): manage predictions on unseen datasets
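
To make the section layout concrete, here is a minimal sketch that checks an already-parsed input (for example, the dictionary a YAML loader would return for inp.yaml) against the section list above. The helper name `check_sections` and all placeholder values are assumptions made for illustration; SALTED's own input parser may behave differently.

```python
REQUIRED_SECTIONS = ("salted", "system", "qm", "descriptor", "gpr")
OPTIONAL_SECTIONS = ("prediction",)


def check_sections(inp):
    """Return (missing, unknown) section names for a parsed input dict."""
    missing = [name for name in REQUIRED_SECTIONS if name not in inp]
    allowed = set(REQUIRED_SECTIONS) | set(OPTIONAL_SECTIONS)
    unknown = sorted(name for name in inp if name not in allowed)
    return missing, unknown


# Placeholder input: the keys under each section are taken from this README,
# the values are made up for illustration.
example = {
    "salted": {"saltedpath": "./", "saltedname": "test"},
    "system": {"filename": "structures.xyz", "parallel": False},
    "descriptor": {},
    "gpr": {"z": 2, "trainsel": "random"},
}

missing, unknown = check_sections(example)
print("missing:", missing)  # the "qm" section was left out on purpose
print("unknown:", unknown)
```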

Input Dataset

Input structures are required in extXYZ format; the corresponding filename must be specified in inp.system.filename. The electron density training data consist of the expansion coefficients of the scalar field on atom-centered basis functions, each the product of a radial function and a spherical harmonic. These coefficients are computed using density-fitting (DF) approximations, also known as resolution of the identity, which are commonly applied in electronic-structure codes. We assume orthonormalized real spherical harmonics defined with the Condon-Shortley phase convention; no restriction is imposed on the nature of the radial functions. Because the basis functions are not orthogonal, the 2-center electronic integral matrices associated with the given density-fitting approximation are also required as input. The electronic-structure codes currently interfaced with SALTED are:

  • FHI-aims
  • CP2K
  • PySCF

We refer to the code-specific examples for how to produce the required quantum-mechanical data.
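
The role of the 2-center matrices mentioned above can be illustrated with a toy calculation. For a field expanded on a non-orthogonal basis with coefficients c, integrals that are quadratic in the field take the form c^T S c, where S is the relevant 2-center matrix, rather than the plain sum of squared coefficients. The sketch below assumes an overlap-type matrix and made-up numbers; the matrix actually required depends on the density-fitting scheme of the chosen code.

```python
def quadratic_form(c, S):
    """Return c^T S c for a coefficient list c and a square matrix S."""
    n = len(c)
    return sum(c[i] * S[i][j] * c[j] for i in range(n) for j in range(n))


# Two made-up basis functions with unit self-overlap and mutual overlap 0.5.
S = [[1.0, 0.5],
     [0.5, 1.0]]
c = [1.0, 2.0]

with_metric = quadratic_form(c, S)      # accounts for the basis overlap
without_metric = sum(x * x for x in c)  # valid only for an orthonormal basis

print(with_metric, without_metric)  # 7.0 5.0
```

The two numbers differ because the off-diagonal elements of S contribute cross terms, which is why the coefficients alone are not sufficient as training input.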

Usage

The root directory used for storing SALTED data is specified in inp.salted.saltedpath. A SALTED workflow can be labelled by setting a string in the inp.salted.saltedname variable, chosen consistently with the input parameters used; this label in turn defines the names of the output folders that are automatically generated during program execution. SALTED functions can be run either by importing the corresponding modules in Python or directly from the command line. MPI parallelization can be activated by setting inp.system.parallel to True and is used, wherever applicable, to parallelize the calculation of SALTED functions over the training data. In what follows, we report an example of a general command-line workflow:

  1. Initialize structural features defined from 3-body symmetry-adapted descriptors, $P^L$, as computed following PRL 120, 036002 (2018):

    python3 -m salted.initialize

    An optional sparsify subsection can be added to the inp.descriptor input section in order to reduce the feature space size down to ncut sparse features selected using a "farthest point sampling" (FPS) algorithm. To facilitate this procedure, it is possible to perform the FPS selection over a subset of nsamples configurations, selected at random from the entire training dataset.

  2. Find a sparse set of inp.gpr.Menv atomic environments in order to recast the SALTED problem in a low-dimensional space. The degree of non-linearity of the model must be defined at this stage by setting the variable inp.gpr.z to a positive integer; z=1 corresponds to a linear model.

    python3 -m salted.sparse_selection

  3. Compute sparse vectors of descriptors $P^L_M$ for each atomic type and angular momentum:

    python3 -m salted.sparse_descriptor (MPI parallelizable)

  4. Compute sparse equivariant kernels $k^L_{MM}$ and find projector matrices over the Reproducing Kernel Hilbert Space (RKHS):

    python3 -m salted.rkhs_projector

  5. Compute equivariant kernels $k^L_{NM}$ over the entire dataset and project them on the RKHS to obtain the final SALTED input vectors:

    python3 -m salted.rkhs_vector (MPI parallelizable)

  6. Build the Hessian matrix of the quadratic RKHS problem over a maximum of inp.gpr.Ntrain training structures selected from the entire dataset; these can be selected either at random (inp.gpr.trainsel: random) or sequentially (inp.gpr.trainsel: sequential). The remaining structures are automatically retained for validation. The variable inp.gpr.trainfrac defines the fraction of the total training data to be used; it can be varied between 0 and 1 to build learning curves while keeping the validation set fixed.

    python3 -m salted.hessian_matrix (MPI parallelizable)

  7. Solve the regression problem with a given regularization parameter inp.gpr.regul.

    python3 -m salted.solve_regression

    NB: when the dimensionality exceeds $10^5$, it is recommended to perform a direct minimization of the SALTED loss function in place of an explicit matrix inversion (steps 6 and 7). This can be run as follows:

    python3 -m salted.minimize_loss (MPI parallelizable)

  8. Validate predictions over the structures that have not been retained for training by computing the root mean square error, consistent with the definition of the SALTED loss function.

    python3 -m salted.validation (MPI parallelizable)

  9. Once the SALTED model has been trained and validated, SALTED predictions for a new, unseen dataset can be handled according to the inp.prediction section. To do so, an inp.prediction.filename must be specified in XYZ format, while an inp.prediction.predname string can be defined to label the prediction directories. Equivariant predictions can then be run as follows:

    python3 -m salted.prediction (MPI parallelizable)
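
The training steps above can be chained from a small driver script. The sketch below uses only the module names listed in this README. The function names, the `direct_minimization` switch and the optional `launcher` prefix (for example, an MPI launcher command) are assumptions made for illustration; how MPI jobs are actually launched depends on your installation.

```python
import subprocess
import sys

# Steps 1-5, in the order given above.
COMMON_STEPS = [
    "salted.initialize",
    "salted.sparse_selection",
    "salted.sparse_descriptor",
    "salted.rkhs_projector",
    "salted.rkhs_vector",
]
# Steps 6-7: explicit Hessian build followed by the regression solve.
INVERSION_STEPS = ["salted.hessian_matrix", "salted.solve_regression"]
# Alternative to steps 6-7 for large problems.
MINIMIZATION_STEPS = ["salted.minimize_loss"]
# Step 8.
FINAL_STEPS = ["salted.validation"]

# Modules marked "(MPI parallelizable)" in the workflow above.
MPI_CAPABLE = {
    "salted.sparse_descriptor",
    "salted.rkhs_vector",
    "salted.hessian_matrix",
    "salted.minimize_loss",
    "salted.validation",
}


def build_commands(direct_minimization=False, launcher=()):
    """Return one argv list per workflow step.

    `launcher` is a hypothetical prefix that is prepended only to the
    steps marked as MPI parallelizable.
    """
    solver = MINIMIZATION_STEPS if direct_minimization else INVERSION_STEPS
    commands = []
    for module in COMMON_STEPS + solver + FINAL_STEPS:
        prefix = list(launcher) if module in MPI_CAPABLE else []
        commands.append(prefix + [sys.executable, "-m", module])
    return commands


def run_workflow(direct_minimization=False, launcher=(), dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in build_commands(direct_minimization, launcher):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_workflow(dry_run=True)  # print the commands without executing them
```

Running the script as is only prints the commands. Passing `dry_run=False` executes them from the current directory, which is assumed to contain inp.yaml.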

Contact

andrea.grisafi@ens.psl.eu

alan.m.lewis@york.ac.uk

Contributors

Andrea Grisafi, Alan Lewis, Zekun Lou, Mariana Rossi
