Evgeni Ulanov1, Ghulam A. Qadir1, Kai Riedmiller1, Pascal Friederich23, Frauke Gräter14
This repository is a collection of python scripts used in the paper Predicting hydrogen atom transfer energy barriers using Gaussian process regression
Predicting reaction barriers for arbitrary configurations based on only a limited set of density functional theory (DFT) calculations would render the design of catalysts or the simulation of reactions within complex materials highly efficient. We here propose Gaussian process regression (GPR) as a method of choice if DFT calculations are limited to hundreds or thousands of barrier calculations. For the case of hydrogen atom transfer in proteins, an important reaction in chemistry and biology, we obtain a mean absolute error of 3.23 kcal/mol for the range of barriers in the data set using SOAP descriptors and similar values using the marginalized graph kernel. Thus, the two GPR models can robustly estimate reaction barriers within the large chemical and conformational space of proteins. Their predictive power is comparable to a graph neural network-based model, and GPR even outcompetes the latter in the low data regime. We propose GPR as a valuable tool for an approximate but data-efficient model of chemical reactivity in a complex and highly variable environment.
Top to bottom:
(A) Hydrogen Atom Transfer (HAT) reaction modelled with hydrogen atom moving start to end in equidistant
steps.
(B) Environments used to capture the local environments using SOAP vectors.
(C) Illustration of a molecular HAT reaction captured as a graph.
Note
This guide assumes a cluster type environment with SLURM & MPI installed and might not work on a local computer.
- Clone the repository.
- Download the structures from heidata of Heidelberg University
and place them, as well as the metadata.csv into data/pdb as described in pdb_to_atoms.py.
Afterward the structures need to be unzipped:
$ unzip dataset_traj.zip && mv dataset_2208_traj traj $ unzip dataset_synth.zip && mv dataset_2208_synth synth
- Create local.env in the project root file with variables:
- PROJECTROOT (absolute path to the root of the project)
- CONDABIN (absolute path to the conda executable, e.g. CONDABIN="/.../miniconda3/bin/conda")
- OPENMPIMODULE - The OpenMPI module to be loaded in the HPC environment, e.g. "OpenMPI/3.1.4-GCC-8.3.0"
- Install the conda environments main_gpr_env, mgk_gpr_env and painn_env by executing install.sh inside the install folder (make executable first with chmod).
- Extract the ASE atom structures from the pdb files by running pdb_to_atoms.py (Environment: main_gpr_env).
- Run atoms_to_soap.py (Environment: main_gpr_env) with
config = dict(i_position=x, [...])
for x=0, 5 and 10 once each and note the corresponding SD-*.npy name. - Insert the correct SOAP distance files from the last step into MAIN_RUN.py
& SOAP_second_stage.py by editing the
config=(soaps={"s_0.0": x, "s_5.0": y, "s_10.0": z, [...])
dictionary. - Run submit_SOAP_GPR.sh with
sbatch
which does all the SOAP GPR calculations. - Train the PaiNN models using submit_data_efficiency.sh
with
sbatch
- Run analyse_GPR_and_PaiNN_results.py (Environment: main_gpr_env) to collect the results of PaiNN and SOAP.
- Train PaiNN models for the two stage learning with submit_two_stage_learning.sh
- Collect the PaiNN two stage learning predictions with submit_collect_painn_predictions.sh
- Train SOAP model on difference of PaiNN predictions with submit_SOAP_second_stage.sh
- Create the MGK covariance matrix with submit_calculate_K.sh
- Find optimal parameters of MGK with optimize_kernel_parameters.py (Environment: mgk_gpr_env)
- Figure of prediction comparison of SOAP GPR, MGK GPR and PaiNN: traj_predictions.py (Environment: main_gpr_env)
- Figure of the data efficiency: data_efficiency_SOAP_PaiNN.py (Environment: main_gpr_env)
- Figure of coverage and negative interval score: plot_interval_score.py (Environment: main_gpr_env)
- Figure of two stage learning MAE of SOAP GPR with PaiNN: plot_two_stage_predictions.py (Environment: main_gpr_env)
- CPU: Intel(R) Xeon(R) Gold 6230
- GPU: NVIDIA GeForce RTX 2080 SUPER
- Conda version: 22.11.1
- OpenMPI: 3.1.4
- The individual packages used, including version numbers, can be found in the folders of install.
The structures and energies used in this project can be downloaded from heiDATA of Heidelberg University.
Footnotes
-
Heidelberg Institute for Theoretical Studies, Heidelberg, Germany ↩ ↩2 ↩3 ↩4
-
Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany ↩
-
Institute of Nanotechnology, Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany ↩
-
Interdisciplinary Center for Scientific Computing, Heidelberg, Germany ↩