# Binary compound formation energy prediction example

This notebook demonstrates how to create a probabilistic model for predicting
formation energies of binary compounds with a quantified uncertainty. Before
running this notebook, ensure that you have a valid Materials Project API key
from <https://www.materialsproject.org/dashboard>. Next, either put this
key in a `.config` file, or change `MAPI_KEY` to the key.

<div class="alert alert-block alert-warning">
Be careful not to include API keys in published versions of this notebook!
</div>
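
One way to keep the key out of the notebook entirely is to read it from an environment variable first and fall back to the `.config` file. The sketch below is illustrative only: the `load_api_key` helper and the `MP_API_KEY` variable name are assumptions, not part of this notebook or of `pymatgen`.

```python
import os
from pathlib import Path
from typing import Optional


def load_api_key(config_file: Path, env_var: str = "MP_API_KEY") -> Optional[str]:
    """Return an API key from the environment or a config file, else None.

    Both this helper and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var)
    if key and key.strip():
        return key.strip()
    if config_file.exists():
        # Strip the trailing newline that most editors append to the file.
        return config_file.read_text().strip() or None
    return None
```

With this helper, `mp_key = load_api_key(Path(".config"))` would return `None` when no key is found, so the notebook could still raise its own error in that case.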


In [1]:
import shutil
from pathlib import Path

import numpy as np
import pandas as pd
from megnet.models import MEGNetModel
from pymatgen.ext.matproj import MPRester
from tensorflow.keras.callbacks import TensorBoard
from unlockgnn import MEGNetProbModel
from unlockgnn.initializers import SampleInitializer


In [2]:
THIS_DIR = Path(".").parent
CONFIG_FILE = THIS_DIR / ".config"

MAPI_KEY = None
MODEL_SAVE_DIR: Path = THIS_DIR / "binary_e_form_model"
DATA_SAVE_DIR: Path = THIS_DIR / "binary_data.pkl"
LOG_DIR = THIS_DIR / "logs"
BATCH_SIZE: int = 128
NUM_INDUCING_POINTS: int = 3000
OVERWRITE: bool = True

if OVERWRITE:
    for directory in [MODEL_SAVE_DIR, LOG_DIR]:
        if directory.exists():
            shutil.rmtree(directory)

try:
    mp_key = CONFIG_FILE.read_text().strip()  # drop any trailing newline
except FileNotFoundError:
    if MAPI_KEY is None:
        raise ValueError("Enter Materials Project API key either in a `.config` file or in the notebook itself.")
    mp_key = MAPI_KEY


# Data gathering

Here we download binary compounds that lie on the convex hull from the Materials
Project, then split them into training and validation subsets.


In [3]:
query = {
    "criteria": {"nelements": 2, "e_above_hull": 0},
    "properties": ["structure", "formation_energy_per_atom"],
}

if DATA_SAVE_DIR.exists():
    full_df = pd.read_pickle(DATA_SAVE_DIR)
else:
    with MPRester(mp_key) as mpr:
        full_df = pd.DataFrame(mpr.query(**query))
    full_df.to_pickle(DATA_SAVE_DIR)


In [4]:
full_df.head()

| | structure | formation_energy_per_atom |
|---|---|---|
| 0 | [[ 1.982598 -4.08421341 3.2051745 ] La, [1.... | -0.737439 |
| 1 | [[0. 0. 0.] Fe, [1.880473 1.880473 1.880473] H] | -0.068482 |
| 2 | [[1.572998 0. 0. ] Ta, [0. ... | -0.773151 |
| 3 | [[0. 0. 7.42288687] Hf, [0. ... | -0.177707 |
| 4 | [[ 1.823716 -3.94193291 3.47897025] Tm, [1.... | -0.905038 |


In [4]:
TRAINING_RATIO: float = 0.8

num_training = int(TRAINING_RATIO * len(full_df.index))
train_df = full_df[:num_training]
val_df = full_df[num_training:]

print(f"{num_training} training samples, {len(val_df.index)} validation samples.")


4217 training samples, 1055 validation samples.
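
The split above takes the first 80% of rows in download order. If that order correlates with chemistry, a shuffled split may be preferable. The following is a minimal sketch using only the standard library; the `shuffled_split` helper is hypothetical, and the row count of 5272 is simply the sum of the two sample counts printed above.

```python
import random
from typing import List, Tuple


def shuffled_split(
    num_rows: int, training_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return shuffled (training, validation) row positions for a dataset."""
    positions = list(range(num_rows))
    random.Random(seed).shuffle(positions)  # seeded so the split is reproducible
    num_training = int(training_ratio * num_rows)
    return positions[:num_training], positions[num_training:]


train_pos, val_pos = shuffled_split(5272)
print(len(train_pos), len(val_pos))
```

The resulting positions could then be used to select rows, for example with `full_df.iloc[train_pos]` and `full_df.iloc[val_pos]`.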


# Model creation

Now we load the pre-trained `MEGNet` 2019 formation energy model, then convert it to a
probabilistic model. We begin by training this `MEGNetModel` on our data to
achieve a slightly more precise fit.


In [6]:
meg_model = MEGNetModel.from_mvl_models("Eform_MP_2019")


INFO:megnet.utils.models:Package-level mvl_models not included, trying temperary mvl_models downloads..
INFO:megnet.utils.models:Model found in local mvl_models path


Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.


In [5]:
tb_callback_1 = TensorBoard(log_dir=LOG_DIR / "megnet", write_graph=False)

train_structs = train_df["structure"]
val_structs = val_df["structure"]

train_targets = train_df["formation_energy_per_atom"]
val_targets = val_df["formation_energy_per_atom"]

In [8]:
# Make the initializer
index_points_init = SampleInitializer(train_structs, meg_model)
# index_points_init = None

In [6]:
KL_WEIGHT = BATCH_SIZE / num_training

prob_model = MEGNetProbModel(
    num_inducing_points=NUM_INDUCING_POINTS,
    save_path=MODEL_SAVE_DIR,
    meg_model=meg_model,
    kl_weight=KL_WEIGHT,
    index_initializer=index_points_init,
)
# prob_model = MEGNetProbModel.load(MODEL_SAVE_DIR)

Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.


Instructions for updating:
`jitter` is deprecated; please use `marginal_fn` directly.


# Train the uncertainty quantifier

Now we train the model. By default, the `MEGNet` (GNN) layers of the model are
frozen after initialization. Therefore, when we call `prob_model.train()`, the
only layers that are optimized are the `VariationalGaussianProcess` (VGP) and the
`BatchNormalization` layer (`Norm`) that feeds into it.

After this initial training, we fine-tune the model by freezing the
`Norm` and VGP layers and training just the GNN layers. Finally, we
unfreeze _all_ the layers and train the full model simultaneously.
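
For reference, the `KL_WEIGHT` passed to `MEGNetProbModel` earlier is the ratio of batch size to training-set size, which appears to follow the usual convention for scaling the KL term in mini-batch variational training. A quick arithmetic check, using only values reported in this notebook:

```python
import math

BATCH_SIZE = 128     # as set in the configuration cell
NUM_TRAINING = 4217  # training samples reported by the split cell

# Weight applied to the KL term of the loss for each mini-batch.
kl_weight = BATCH_SIZE / NUM_TRAINING

# Number of mini-batches needed to cover the training set once.
steps_per_epoch = math.ceil(NUM_TRAINING / BATCH_SIZE)

print(f"kl_weight = {kl_weight:.4f}, steps per epoch = {steps_per_epoch}")
```

The step count of 33 matches the `33/33` progress counter in the training logs.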


In [10]:
tb_callback_2 = TensorBoard(log_dir=LOG_DIR / "vgp_training", write_graph=False)
tb_callback_3 = TensorBoard(log_dir=LOG_DIR / "fine_tuning", write_graph=False)


In [11]:
%load_ext tensorboard
%tensorboard --logdir logs

In [12]:
prob_model.train(
    train_structs,
    train_targets,
    epochs=50,
    val_structs=val_structs,
    val_targets=val_targets,
    callbacks=[tb_callback_2],
)


Epoch 1/50
33/33 - 21s - loss: 2091617.7500 - mae: 0.6478 - val_loss: 1869004.0000 - val_mae: 0.6430
Epoch 2/50
33/33 - 9s - loss: 1896041.3750 - mae: 0.6150 - val_loss: 1754264.1250 - val_mae: 0.6148
Epoch 3/50
33/33 - 9s - loss: 1785994.5000 - mae: 0.5935 - val_loss: 1656106.7500 - val_mae: 0.6004
Epoch 4/50
33/33 - 10s - loss: 1687178.7500 - mae: 0.5765 - val_loss: 1567845.3750 - val_mae: 0.5793
Epoch 5/50
33/33 - 10s - loss: 1598149.1250 - mae: 0.5615 - val_loss: 1487190.3750 - val_mae: 0.5670
Epoch 6/50
33/33 - 10s - loss: 1516122.0000 - mae: 0.5489 - val_loss: 1414070.8750 - val_mae: 0.5561
Epoch 7/50
33/33 - 9s - loss: 1441278.8750 - mae: 0.5382 - val_loss: 1346037.5000 - val_mae: 0.5419
Epoch 8/50
33/33 - 9s - loss: 1371843.1250 - mae: 0.5286 - val_loss: 1282792.7500 - val_mae: 0.5400
Epoch 9/50
33/33 - 9s - loss: 1307924.7500 - mae: 0.5208 - val_loss: 1225147.0000 - val_mae: 0.5245
Epoch 10/50
33/33 - 9s - loss: 1249076.8750 - mae: 0.5129 - val_loss: 1170159.7500 - val_mae: 0.5224
Epoch 

In [15]:
prob_model.set_frozen(["GNN", "VGP"], freeze=False)

In [16]:
prob_model.train(
    train_structs,
    train_targets,
    epochs=50,
    val_structs=val_structs,
    val_targets=val_targets,
    callbacks=[tb_callback_3],
)

Epoch 1/50
33/33 - 22s - loss: 498863.0000 - mae: 0.5751 - val_loss: 432173.1875 - val_mae: 0.5838
Epoch 2/50
33/33 - 9s - loss: 447332.3438 - mae: 0.5635 - val_loss: 420508.9375 - val_mae: 0.5786
Epoch 3/50
33/33 - 9s - loss: 435303.9062 - mae: 0.5584 - val_loss: 409642.0625 - val_mae: 0.5736
Epoch 4/50
33/33 - 9s - loss: 424099.5938 - mae: 0.5536 - val_loss: 399530.7188 - val_mae: 0.5691
Epoch 5/50
33/33 - 9s - loss: 413618.5625 - mae: 0.5494 - val_loss: 389930.1875 - val_mae: 0.5649
Epoch 6/50
33/33 - 9s - loss: 403688.5000 - mae: 0.5456 - val_loss: 380808.0000 - val_mae: 0.5612
Epoch 7/50
33/33 - 9s - loss: 394244.6562 - mae: 0.5420 - val_loss: 372145.2500 - val_mae: 0.5579
Epoch 8/50
33/33 - 9s - loss: 385253.4062 - mae: 0.5389 - val_loss: 363835.8750 - val_mae: 0.5549
Epoch 9/50
33/33 - 9s - loss: 376645.3125 - mae: 0.5359 - val_loss: 355896.8125 - val_mae: 0.5519
Epoch 10/50
33/33 - 9s - loss: 368409.1250 - mae: 0.5331 - val_loss: 348249.1562 - val_mae: 0.5494
Epoch 11/50
33/33 

In [17]:
prob_model.save()

# Model evaluation

Finally, we'll evaluate model metrics and make some sample predictions. Note that the predictions give predicted values and standard deviations. The standard deviations can then be converted to an uncertainty:
in this example, we take the uncertainty as twice the standard deviation, which gives approximately a 95% confidence interval if the errors are normally distributed (see <https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule>).
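
To make the "twice the standard deviation" rule concrete, the sketch below computes how much probability a normal distribution places within two standard deviations, and the multiplier that would give exactly 95%. It uses only the standard library; the helper names and the example prediction are illustrative assumptions, not values from the model.

```python
from statistics import NormalDist
from typing import Tuple


def coverage_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    standard_normal = NormalDist()
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


def interval(mean: float, stddev: float, k: float = 2.0) -> Tuple[float, float]:
    """Return the (lower, upper) bounds of the mean ± k * stddev interval."""
    return mean - k * stddev, mean + k * stddev


two_sigma_coverage = coverage_within(2.0)          # roughly 0.954
exact_95_multiplier = NormalDist().inv_cdf(0.975)  # roughly 1.96

# Hypothetical prediction: mean of -0.50 eV/atom with a 0.10 eV/atom standard deviation.
lower, upper = interval(-0.50, 0.10)
print(f"{two_sigma_coverage:.4f} {exact_95_multiplier:.3f} [{lower:.2f}, {upper:.2f}]")
```

Two standard deviations therefore slightly over-cover a nominal 95% interval, which is why the factor of 2 is usually treated as a convenient approximation.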


In [18]:
prob_model.evaluate(val_structs, val_targets)




{'loss': 169790.640625, 'mae': 0.5161992311477661}

In [7]:
example_structs = val_structs[:10].tolist()
example_targets = val_targets[:10].tolist()

predicted, stddevs = prob_model.predict(example_structs)
uncerts = 2 * stddevs


prediction=array([-0.71716455, -0.71716455, -0.71716455, -0.71716455, -0.71716455,
       -0.71716455, -0.71716455, -0.71716455, -0.71716455, -0.71716455,
        0.59276481,  0.59276481,  0.59276481,  0.59276481,  0.59276481,
        0.59276481,  0.59276481,  0.59276481,  0.59276481,  0.59276481])
(10,)
(10,)
[-0.71716455 -0.71716455 -0.71716455 -0.71716455 -0.71716455 -0.71716455
 -0.71716455 -0.71716455 -0.71716455 -0.71716455]
[1.18552961 1.18552961 1.18552961 1.18552961 1.18552961 1.18552961
 1.18552961 1.18552961 1.18552961 1.18552961]


In [8]:
pd.DataFrame(
    {
        "Composition": [struct.composition.reduced_formula for struct in example_structs],
        "Formation energy per atom / eV": example_targets,
        "Predicted / eV": [
            f"{pred:.2f} ± {uncert:.2f}" for pred, uncert in zip(predicted, uncerts)
        ],
    }
)


| | Composition | Formation energy per atom / eV | Predicted / eV |
|---|---|---|---|
| 0 | Zr2Cu | -0.132384 | -0.72 ± 0.59 |
| 1 | NbRh | -0.401313 | -0.72 ± 0.59 |
| 2 | Cu3Ge | -0.005707 | -0.72 ± 0.59 |
| 3 | Pr3In | -0.273232 | -0.72 ± 0.59 |
| 4 | InS | -0.742895 | -0.72 ± 0.59 |
| 5 | TmPb3 | -0.215892 | -0.72 ± 0.59 |
| 6 | InNi | -0.174754 | -0.72 ± 0.59 |
| 7 | GdGe | -0.857117 | -0.72 ± 0.59 |
| 8 | GdTl | -0.380423 | -0.72 ± 0.59 |
| 9 | HoTl3 | -0.215986 | -0.72 ± 0.59 |
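
A simple sanity check on the quoted uncertainties is to count how many reference values fall inside the predicted interval. The sketch below does this for the ten rows of the table above, using the rounded values as displayed (every row shares the same prediction in this run). It is a rough illustration on ten samples, not a calibration study.

```python
# Reference formation energies (eV/atom) copied from the table above.
targets = {
    "Zr2Cu": -0.132384,
    "NbRh": -0.401313,
    "Cu3Ge": -0.005707,
    "Pr3In": -0.273232,
    "InS": -0.742895,
    "TmPb3": -0.215892,
    "InNi": -0.174754,
    "GdGe": -0.857117,
    "GdTl": -0.380423,
    "HoTl3": -0.215986,
}

# Displayed prediction and uncertainty, identical for every row in this run.
predicted, uncertainty = -0.72, 0.59
lower, upper = predicted - uncertainty, predicted + uncertainty

inside = [name for name, target in targets.items() if lower <= target <= upper]
coverage = len(inside) / len(targets)

print(f"{len(inside)}/{len(targets)} targets fall inside the interval")
```

Here nine of the ten targets lie inside the interval; only Cu3Ge falls outside it.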


In [None]:
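# Requires `prob_model` from the cells above; in a fresh kernel, reload it
# first with `MEGNetProbModel.load(MODEL_SAVE_DIR)`.
full_pred, full_stddev = prob_model.predict(train_structs)

# Mean absolute error of the predicted means on the training set.
resids = train_targets - full_pred
mae = np.mean(np.abs(resids))

print(mae)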

NameError: name 'prob_model' is not defined