# Binary compound formation energy prediction example

This notebook demonstrates how to create a probabilistic model for predicting
formation energies of binary compounds with a quantified uncertainty. Before
running this notebook, ensure that you have a valid Materials Project API key
from <https://www.materialsproject.org/dashboard>. Next, either put this
key in a `.config` file, or change `MAPI_KEY` to the key.

<div class="alert alert-block alert-warning">
Be careful not to include API keys in published versions of this notebook!
</div>


In [1]:
import shutil
from pathlib import Path

import pandas as pd
from megnet.models import MEGNetModel
from pymatgen.ext.matproj import MPRester
from tensorflow.keras.callbacks import TensorBoard
from unlockgnn import MEGNetProbModel


In [2]:
THIS_DIR = Path(".").parent
CONFIG_FILE = THIS_DIR / ".config"

MAPI_KEY = None
MODEL_SAVE_DIR: Path = THIS_DIR / "binary_e_form_model"
LOG_DIR = THIS_DIR / "logs"
BATCH_SIZE: int = 128
NUM_INDUCING_POINTS: int = 1500
OVERWRITE: bool = True

if OVERWRITE:
    for directory in [MODEL_SAVE_DIR, LOG_DIR]:
        if directory.exists():
            shutil.rmtree(directory)

try:
    mp_key = CONFIG_FILE.read_text()
except FileNotFoundError:
    if MAPI_KEY is None:
        raise ValueError("Enter Materials Project API key either in a `.config` file or in the notebook itself.")
    mp_key = MAPI_KEY


# Data gathering

Here we download binary compounds that lie on the convex hull from the Materials
Project, then split them into training and validation subsets.


In [3]:
query = {
    "criteria": {"nelements": 2, "e_above_hull": 0},
    "properties": ["structure", "formation_energy_per_atom"],
}

with MPRester(mp_key) as mpr:
    full_df = pd.DataFrame(mpr.query(**query))


  0%|          | 0/5272 [00:00<?, ?it/s]

In [4]:
full_df.head()

Unnamed: 0,structure,formation_energy_per_atom
0,"[[ 1.982598 -4.08421341 3.2051745 ] La, [1....",-0.737439
1,"[[0. 0. 0.] Fe, [1.880473 1.880473 1.880473] H]",-0.068482
2,"[[1.572998 0. 0. ] Ta, [0. ...",-0.773151
3,"[[0. 0. 7.42288687] Hf, [0. ...",-0.177707
4,"[[ 1.823716 -3.94193291 3.47897025] Tm, [1....",-0.905038


In [5]:
TRAINING_RATIO: float = 0.8

num_training = int(TRAINING_RATIO * len(full_df.index))
train_df = full_df[:num_training]
val_df = full_df[num_training:]

print(f"{num_training} training samples, {len(val_df.index)} validation samples.")


4217 training samples, 1055 validation samples.


# Model creation

Now we load the `MEGNet` 2019 formation energies model, then convert this to a
probabilistic model. We begin by first training this `MEGNetModel` on our data to
achieve a slightly more precise fit.


In [6]:
meg_model = MEGNetModel.from_mvl_models("Eform_MP_2019")


INFO:megnet.utils.models:Package-level mvl_models not included, trying temperary mvl_models downloads..
INFO:megnet.utils.models:Model found in local mvl_models path


Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.


Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.


In [7]:
tb_callback_1 = TensorBoard(log_dir=LOG_DIR / "megnet", write_graph=False)

train_structs = train_df["structure"]
val_structs = val_df["structure"]

train_targets = train_df["formation_energy_per_atom"]
val_targets = val_df["formation_energy_per_atom"]

In [8]:
# meg_model.train(
#     train_structs,
#     train_targets,
#     val_structs,
#     val_targets,
#     epochs=10,
#     batch_size=BATCH_SIZE,
#     save_checkpoint=False,
#     callbacks=[tb_callback_1]
# )


In [9]:
%load_ext tensorboard
%tensorboard --logdir logs

In [10]:
KL_WEIGHT = BATCH_SIZE / num_training

prob_model = MEGNetProbModel(
    num_inducing_points=NUM_INDUCING_POINTS, save_path=MODEL_SAVE_DIR, meg_model=meg_model, kl_weight=KL_WEIGHT
)


INFO:tensorflow:Assets written to: binary_e_form_model/megnet/assets


INFO:tensorflow:Assets written to: binary_e_form_model/megnet/assets


INFO:tensorflow:Assets written to: binary_e_form_model/gnn/assets


INFO:tensorflow:Assets written to: binary_e_form_model/gnn/assets


Instructions for updating:
`jitter` is deprecated; please use `marginal_fn` directly.


Instructions for updating:
`jitter` is deprecated; please use `marginal_fn` directly.


# Train the uncertainty quantifier

Now we train the model. By default, the `MEGNet` (GNN) layers of the model are
frozen after initialization. Therefore, when we call `prob_model.train()`, the
only layers that are optimized are the `VariationalGaussianProcess` (VGP) and the
`BatchNormalization` layer (`Norm`) that feeds into it.

After this initial training, we will then fine tune the model by freezing the
`Norm` and VGP layers and training just the GNN layers. Then, finally, we
unfreeze _all_ the layers and train the full model simulateously.


In [11]:
tb_callback_2 = TensorBoard(log_dir=LOG_DIR / "vgp_training", write_graph=False)
tb_callback_3 = TensorBoard(log_dir=LOG_DIR / "gnn_training", write_graph=False)
tb_callback_4 = TensorBoard(log_dir=LOG_DIR / "fine_tuning", write_graph=False)


In [12]:
prob_model.train(
    train_structs,
    train_targets,
    epochs=50,
    val_structs=val_structs,
    val_targets=val_targets,
    callbacks=[tb_callback_2],
)


Epoch 1/50




33/33 - 46s - loss: 11720238.0000 - mae: 0.6338 - val_loss: 2359155.0000 - val_mae: 0.6285
Epoch 2/50
33/33 - 39s - loss: 2124926.7500 - mae: 0.6077 - val_loss: 1893577.3750 - val_mae: 0.6012
Epoch 3/50
33/33 - 38s - loss: 1931274.2500 - mae: 0.5884 - val_loss: 1808416.8750 - val_mae: 0.5905
Epoch 4/50


In [None]:
prob_model.set_frozen("VGP", recompile=False)
# Don't recompile until we've got all the freezing/thawing sorted!
prob_model.set_frozen("GNN", freeze=False, recompile=True)


In [None]:
prob_model.train(
    train_structs,
    train_targets,
    epochs=100,
    val_structs=val_structs,
    val_targets=val_targets,
    callbacks=[tb_callback_3],
)


Epoch 1/100
33/33 - 22s - loss: 80863.4297 - mae: 0.6572 - val_loss: 44132.7539 - val_mae: 0.6679
Epoch 2/100
33/33 - 10s - loss: 22800.8184 - mae: 0.6724 - val_loss: 12553.1104 - val_mae: 0.7181
Epoch 3/100
33/33 - 10s - loss: 8190.4604 - mae: 0.6770 - val_loss: 8790.3799 - val_mae: 0.6918
Epoch 4/100
33/33 - 10s - loss: 5164.6772 - mae: 0.6776 - val_loss: 7416.4678 - val_mae: 0.6995
Epoch 5/100
33/33 - 10s - loss: 4210.4551 - mae: 0.6775 - val_loss: 6145.9316 - val_mae: 0.7040
Epoch 6/100
33/33 - 10s - loss: 3112.9285 - mae: 0.6776 - val_loss: 5781.2090 - val_mae: 0.7083
Epoch 7/100
33/33 - 10s - loss: 2659.3921 - mae: 0.6780 - val_loss: 5394.1660 - val_mae: 0.7019
Epoch 8/100
33/33 - 10s - loss: 2354.1545 - mae: 0.6790 - val_loss: 5417.5601 - val_mae: 0.7034
Epoch 9/100
33/33 - 10s - loss: 2110.5874 - mae: 0.6775 - val_loss: 5164.2505 - val_mae: 0.7054
Epoch 10/100
33/33 - 10s - loss: 1863.1519 - mae: 0.6793 - val_loss: 4982.6191 - val_mae: 0.7038
Epoch 11/100
33/33 - 10s - loss: 17

KeyboardInterrupt: 

In [None]:
prob_model.set_frozen(["GNN", "Norm", "VGP"], freeze=False)

In [None]:
prob_model.train(
    train_structs,
    train_targets,
    epochs=50,
    val_structs=val_structs,
    val_targets=val_targets,
    callbacks=[tb_callback_4],
)

In [None]:
prob_model.save()

# Model evaluation

Finally, we'll evaluate model metrics and make some sample predictions! Note that the predictions give predicted values and standard deviations. The standard deviations can then be converted to an uncertainty;
in this example, we'll take the uncertainty as twice the standard deviation, which will give us the 95% confidence interval (see <https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule>).


In [None]:
prob_model.evaluate(val_structs, val_targets)




{'loss': 6162.57421875, 'mae': 0.7020843029022217}

In [None]:
example_structs = val_structs[:10].tolist()
example_targets = val_targets[:10].tolist()

predicted, stddevs = prob_model.predict(example_structs)
uncerts = 2 * stddevs


In [None]:
pd.DataFrame(
    {
        "Composition": [struct.composition.reduced_formula for struct in example_structs],
        "Formation energy per atom / eV": example_targets,
        "Predicted / eV": [
            f"{pred:.2f} ± {uncert:.2f}" for pred, uncert in zip(predicted, stddevs)
        ],
    }
)


Unnamed: 0,Composition,Formation energy per atom / eV,Predicted / eV
0,Zr2Cu,-0.132384,-0.08 ± 0.02
1,NbRh,-0.401313,-0.49 ± 0.02
2,Cu3Ge,-0.005707,-0.04 ± 0.02
3,Pr3In,-0.273232,-0.18 ± 0.02
4,InS,-0.742895,-0.80 ± 0.02
5,TmPb3,-0.215892,-0.18 ± 0.02
6,InNi,-0.174754,-0.19 ± 0.02
7,GdGe,-0.857117,-0.82 ± 0.02
8,GdTl,-0.380423,-0.42 ± 0.02
9,HoTl3,-0.215986,-0.20 ± 0.02
