In [1]:
import numpy as np

from autocat.adsorption import place_adsorbate
from autocat.surface import generate_surface_structures
from autocat.learning.predictors import AutoCatStructureCorrector
from autocat.learning.sequential import simulated_sequential_learning
from autocat.learning.sequential import multiple_sequential_learning_runs

In this tutorial we show how to conduct sequential learning runs for training an `AutoCatStructureCorrector`.

# Sequential Learning

To conduct a sequential learning run, we need three things: i) an `AutoCatStructureCorrector` object with our desired featurizer settings, ii) base structures to be used for training, and iii) base structures to be used for testing (optional).

In [2]:
acsc = AutoCatStructureCorrector(
    structure_featurizer="sine_matrix",
    adsorbate_featurizer="soap",
    adsorbate_featurization_kwargs={"rcut": 5.0, "nmax": 4, "lmax": 4},
    refine_structures = True,
    maximum_structure_size = None, # will default to max structure encountered
    maximum_adsorbate_size = None, # will default to max adsorbate encountered
    species_list = ["Pd", "C", "O", "Cu"] # This is important to include!
)

In [3]:
sub = generate_surface_structures(["Pd"], facets={"Pd": ["100"]})["Pd"]["fcc100"][
    "structure"
]
train_base_struct = place_adsorbate(sub, "CO")["custom"]["structure"]

In [4]:
sub2 = generate_surface_structures(["Cu"], facets={"Cu": ["100"]})["Cu"]["fcc100"][
    "structure"
]
test_base_struct = place_adsorbate(sub2, "O")["custom"]["structure"]

We're now ready to conduct a sequential learning run.

In [5]:
sl_dict = simulated_sequential_learning(
    acsc,
    [train_base_struct],
    testing_base_structures = [test_base_struct],
    initial_num_of_perturbations_per_base_structure = 4, # for generating the initial training set
    batch_num_of_perturbations_per_base_structure = 3, # how large of a pool to predict on for each loop
    batch_size_to_add = 3, # how many structures to add to training on each loop
    number_of_sl_loops = 4,
    write_to_disk = False # if we want to save the history info to disk
)

Sequential Learning Iteration #1
Sequential Learning Iteration #2
Sequential Learning Iteration #3
Sequential Learning Iteration #4
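
To make the roles of the initial set, the candidate pool, and the batch size concrete, here is a minimal toy sketch of a generic sequential learning loop. It is not `AutoCat`'s implementation: the function `toy_sequential_learning`, the 1D target function, and the nearest-neighbor "model" (whose uncertainty is simply the distance to the closest training point) are all assumptions made for illustration.

```python
import random


def toy_sequential_learning(initial_size=4, pool_size=3, batch_size=3, loops=4, seed=0):
    """Toy loop (hypothetical, not AutoCat): grow a training set by max uncertainty."""
    rng = random.Random(seed)

    def target(x):
        # Stand-in for the quantity being learned.
        return x * x

    # Initial training set, analogous to the initial perturbations.
    train_x = [rng.uniform(-1.0, 1.0) for _ in range(initial_size)]
    mae_history, max_unc_history = [], []

    def predict(x):
        # Nearest-neighbor prediction; uncertainty = distance to nearest training point.
        nearest = min(train_x, key=lambda t: abs(t - x))
        return target(nearest), abs(nearest - x)

    for _ in range(loops):
        # Fresh candidate pool to predict on in this iteration.
        pool = [rng.uniform(-1.0, 1.0) for _ in range(pool_size)]
        preds = [predict(x) for x in pool]

        # Track the mean absolute error on the pool before adding anything.
        errors = [abs(p - target(x)) for x, (p, _) in zip(pool, preds)]
        mae_history.append(sum(errors) / len(errors))

        # Pick the candidates with the largest uncertainty and add them to training.
        ranked = sorted(zip(pool, preds), key=lambda item: item[1][1], reverse=True)
        batch = ranked[:batch_size]
        max_unc_history.append([unc for _, (_, unc) in batch])
        train_x.extend(x for x, _ in batch)

    return {
        "mae_history": mae_history,
        "max_unc_history": max_unc_history,
        "training_set_size": len(train_x),
    }


toy_result = toy_sequential_learning()
print(toy_result["training_set_size"])
```

With the defaults above, each of the four loops adds three candidates to the four initial points, which mirrors how `batch_size_to_add` and `number_of_sl_loops` control the growth of the training set in the call above.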


The returned dictionary contains multiple histories as a function of sequential learning iteration, including the MAE and RMSE on the training and testing sets.

In [6]:
print(f"MAE training history: {sl_dict['mae_train_history']}")
print(f"RMSE training history: {sl_dict['rmse_train_history']}")
print('\n')
print(f"MAE test history: {sl_dict['mae_test_history']}")
print(f"RMSE test history: {sl_dict['rmse_test_history']}")

MAE training history: [0.6085485197361659, 0.6085485197361659, 0.6085485197361659, 0.6085485197361659]
RMSE training history: [0.40434634537932845, 0.40434634537932845, 0.40434634537932845, 0.40434634537932845]


MAE test history: [0.6661825694029918, 0.6661825694029918, 0.6661825694029918, 0.6661825694029918]
RMSE test history: [0.46687185319246804, 0.46687185319246804, 0.46687185319246804, 0.46687185319246804]


Moreover, the uncertainties of each batch added to the training set are also kept (note that here they will be very large, as the training set was kept small for easier demonstration purposes).

In [7]:
print(f"Full Maximum Uncertainty History: {sl_dict['max_unc_history']}")

Full Maximum Uncertainty History: [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
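
Since the history is a list with one inner list per iteration, it can be reduced to a single number per iteration with plain Python. The sketch below copies the values from the printed output above rather than reading them from `sl_dict`; the variable names are illustrative.

```python
# Values copied from the printed output above (one inner list per iteration).
max_unc_history = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

# Largest and average uncertainty among the structures added in each iteration.
per_iteration_max = [max(batch) for batch in max_unc_history]
per_iteration_mean = [sum(batch) / len(batch) for batch in max_unc_history]

print(f"Max uncertainty per iteration: {per_iteration_max}")
print(f"Mean uncertainty per iteration: {per_iteration_mean}")
```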


For additional information about the other quantities that are tracked, we refer the reader to the documentation.

# Running Multiple Sequential Learning Runs

In most cases we will want to run multiple sequential learning runs in order to average out any effects of how the initial training set is generated. Through the use of `joblib`, `AutoCat` can parallelize this process across multiple cores, where each core runs a separate sequential learning run. The example below runs 2 sequential learning runs on 2 cores (1 on each core).

In [8]:
runs_history = multiple_sequential_learning_runs(
    acsc,
    [train_base_struct],
    testing_base_structures = [test_base_struct],
    number_of_runs = 2,
    number_parallel_jobs = 2,
    initial_num_of_perturbations_per_base_structure = 3, # for generating the initial training set
    batch_num_of_perturbations_per_base_structure = 3, # how large of a pool to predict on for each loop
    batch_size_to_add = 1, # how many structures to add to training on each loop
    number_of_sl_loops = 4,
    write_to_disk = False # if we want to save the runs history to disk
)
print(f"Number of runs information stored: {len(runs_history)}")

Number of runs information stored: 2


The returned list contains the sequential learning dictionary of each separate run, thus allowing any of the quantities to be averaged across runs.
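
As a minimal sketch of such averaging, the helper below computes the per-iteration mean and spread of one history key across runs. The helper `average_history` is hypothetical (not part of `AutoCat`), and `mock_runs_history` holds made-up numbers that only stand in for the structure of `runs_history`; the key `mae_test_history` is the one printed earlier. Plain Python is used so the sketch runs without extra dependencies.

```python
from statistics import fmean, pstdev


def average_history(runs, key):
    """Hypothetical helper: per-iteration mean and population std of `key` across runs."""
    histories = [run[key] for run in runs]
    lengths = {len(history) for history in histories}
    if len(lengths) != 1:
        raise ValueError("All runs must have the same number of iterations")
    # Transpose so that each tuple holds one iteration's values from every run.
    per_iteration = list(zip(*histories))
    means = [fmean(values) for values in per_iteration]
    spreads = [pstdev(values) for values in per_iteration]
    return means, spreads


# Made-up values for illustration only; in practice pass `runs_history` instead.
mock_runs_history = [
    {"mae_test_history": [0.70, 0.60, 0.50, 0.40]},
    {"mae_test_history": [0.50, 0.40, 0.30, 0.20]},
]

mean_mae, std_mae = average_history(mock_runs_history, "mae_test_history")
print(f"Mean test MAE per iteration: {mean_mae}")
print(f"Std of test MAE per iteration: {std_mae}")
```

The same helper would apply to any other per-iteration history that is a flat list of numbers; nested histories such as `max_unc_history` would need to be reduced to one value per iteration first.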