# Final Project

This notebook is adapted from the TorchANI training example: https://aiqm.github.io/torchani/examples/nnp_training.html

## Checkpoint 1: Data preparation

1. Create a working directory: `/global/scratch/users/[USER_NAME]/[DIR_NAME]`. Replace `[USER_NAME]` with your username and choose any `[DIR_NAME]` you like.
2. Copy this Jupyter Notebook to the working directory.
3. Download the ANI dataset `ani_dataset_gdb_s01_to_s04.h5` from bCourses and upload it to the working directory.
4. Complete this notebook. You can work on it on your laptop, but it **must be run on the cluster** for the final outputs.

Hint: In a notebook cell, prefix any Python function with `?` to see its documentation, e.g. `?torch.nn.Linear`.

In [1]:
!pwd

/Users/chu/Documents/Class/MSSE_Spring2024/Chem277B/Final_Project


In [24]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
import numpy as np
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader
import torch.nn as nn
import torchani
import torchani.data

### Use GPU

In [16]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


### Set up AEV computer

#### AEV: Atomic Environment Vector (atomic features)

Ref: Chem. Sci., 2017, 8, 3192
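To make the radial part of the AEV concrete, here is a minimal sketch of the radial symmetry function from the cited paper: a Gaussian centred on a shift distance, multiplied by a cosine cutoff and summed over neighbours. The neighbour distances are hypothetical, the helper names are made up for this illustration, and any constant prefactor a library might apply is left out.

```python
import math

def cosine_cutoff(r, r_c):
    """Smooth cutoff: 1 at r = 0, falls to 0 at r = r_c, and stays 0 beyond it."""
    if r > r_c:
        return 0.0
    return 0.5 * math.cos(math.pi * r / r_c) + 0.5

def radial_terms(distances, eta, shifts, r_c):
    """One radial feature per shift: sum over neighbours of Gaussian * cutoff."""
    return [
        sum(math.exp(-eta * (r - r_s) ** 2) * cosine_cutoff(r, r_c) for r in distances)
        for r_s in shifts
    ]

# Hypothetical neighbour distances (angstrom) around one atom.
neighbour_distances = [0.96, 1.52, 2.40]
features = radial_terms(
    neighbour_distances, eta=16.0, shifts=[0.9, 1.16875, 1.4375], r_c=5.2
)
print(features)
```

Each shift probes a different distance shell, so a neighbour contributes most to the feature whose shift is closest to its distance.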

In [17]:
def init_aev_computer():
    Rcr = 5.2
    Rca = 3.5
    EtaR = torch.tensor([16], dtype=torch.float, device=device)
    ShfR = torch.tensor([
        0.900000, 1.168750, 1.437500, 1.706250, 
        1.975000, 2.243750, 2.512500, 2.781250, 
        3.050000, 3.318750, 3.587500, 3.856250, 
        4.125000, 4.393750, 4.662500, 4.931250
    ], dtype=torch.float, device=device)


    EtaA = torch.tensor([8], dtype=torch.float, device=device)
    Zeta = torch.tensor([32], dtype=torch.float, device=device)
    ShfA = torch.tensor([0.90, 1.55, 2.20, 2.85], dtype=torch.float, device=device)
    ShfZ = torch.tensor([
        0.19634954, 0.58904862, 0.9817477, 1.37444680, 
        1.76714590, 2.15984490, 2.5525440, 2.94524300
    ], dtype=torch.float, device=device)

    num_species = 4
    aev_computer = torchani.AEVComputer(
        Rcr, Rca, EtaR, ShfR, EtaA, Zeta, ShfA, ShfZ, num_species
    )
    return aev_computer

aev_computer = init_aev_computer()
aev_dim = aev_computer.aev_length
print(aev_dim)

384
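The value 384 can be checked by hand from the parameter counts above. The sketch below assumes the usual ANI layout: one radial block per neighbour species and one angular block per unordered pair of neighbour species. The function name is made up for this illustration.

```python
def aev_length(num_species, n_eta_r, n_shf_r, n_eta_a, n_zeta, n_shf_a, n_shf_z):
    """Return (radial, angular, total) feature counts for one atom."""
    # One radial block per neighbour species.
    radial = num_species * n_eta_r * n_shf_r
    # One angular block per unordered pair of neighbour species (pairs with repetition).
    species_pairs = num_species * (num_species + 1) // 2
    angular = species_pairs * n_eta_a * n_zeta * n_shf_a * n_shf_z
    return radial, angular, radial + angular

# Counts taken from init_aev_computer: 1 EtaR, 16 ShfR, 1 EtaA, 1 Zeta, 4 ShfA, 8 ShfZ.
radial, angular, total = aev_length(4, 1, 16, 1, 1, 4, 8)
print(radial, angular, total)  # 64 320 384
```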


### Prepare dataset & split

In [18]:
def load_ani_dataset(dspath):
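    # Reference atomic self energies (hartree) for H, C, N, O.
    # Note: this tensor is not used below. With EnergyShifter(None),
    # subtract_self_energies fits the self energies from the dataset instead.
    self_energies = torch.tensor([
        -0.500607632585, -37.8302333826,
        -54.5680045287, -75.0362229210
    ], dtype=torch.float, device=device)
    energy_shifter = torchani.utils.EnergyShifter(None)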
    species_order = ['H', 'C', 'N', 'O']

    dataset = torchani.data.load(dspath)
    dataset = dataset.subtract_self_energies(energy_shifter, species_order)
    dataset = dataset.species_to_indices(species_order)
    dataset = dataset.shuffle()
    return dataset

dataset = load_ani_dataset("ani_gdb_s01_to_s04.h5")
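Subtracting self energies removes the large per-atom baseline from each molecular energy, so the network only has to learn the much smaller remainder. The sketch below illustrates the arithmetic with rounded, illustrative reference energies and a hypothetical total energy. None of these numbers come from the dataset.

```python
# Illustrative, rounded atomic reference energies in hartree (assumed values).
reference_energy = {"H": -0.5, "C": -37.8, "N": -54.6, "O": -75.0}

def subtract_reference(total_energy, symbols):
    """Remove the sum of per-atom reference energies from a molecular energy."""
    baseline = sum(reference_energy[s] for s in symbols)
    return total_energy - baseline

methane = ["C", "H", "H", "H", "H"]
hypothetical_total = -40.5  # made-up molecular energy, for illustration only
shifted = subtract_reference(hypothetical_total, methane)
print(round(shifted, 6))
```

The shifted value is of order 1 hartree or less, while the raw total is of order tens of hartree, which is the reason for the shift.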

In [22]:
# Use dataset.split method to do split
train_data, val_data, test_data = dataset.split(0.8, 0.1, None)

# Show amount of training data vs total data
print("Training data size:", len(train_data))
print("Validation data size:", len(val_data))
print("Test data size:", len(test_data))
print("Total data size:", len(dataset))
# assert(len(dataset) == len(val_data) + len(test_data) + len(train_data))
# Sum of the three split sizes, for comparison with the total above
print(len(train_data) + len(val_data) + len(test_data))

Training data size: 691918
Validation data size: 86489
Test data size: 86489
Total data size: 864898
864896
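The training and validation sizes above agree with truncating each fraction of the total: `int(0.8 * 864898)` is 691918 and `int(0.1 * 864898)` is 86489. The following sketch shows a fractional split with that rounding on a plain list. It illustrates the idea only and is not TorchANI's implementation. `fractional_split` is a hypothetical helper.

```python
def fractional_split(items, *fractions):
    """Split a list by fractions of its length; None takes whatever is left."""
    total = len(items)
    parts, start = [], 0
    for fraction in fractions:
        if fraction is None:
            stop = total
        else:
            # Truncate to an integer number of items.
            stop = start + int(fraction * total)
        parts.append(items[start:stop])
        start = stop
    return parts

train, val, test = fractional_split(list(range(10)), 0.8, 0.1, None)
print(len(train), len(val), len(test))  # 8 1 1
```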


### Batching

In [34]:
batch_size = 8192
# use dataset.collate(...).cache() method to do batching

train_data_loader = train_data.collate(batch_size).cache()
val_data_loader = val_data.collate(batch_size).cache()
test_data_loader = test_data.collate(batch_size).cache()

<torchani.data.TransformableIterable at 0x1460355a0>

In [44]:
# Show that batching is working correctly
train_data_loader_list = list(train_data_loader)
train_data_loader_list

[defaultdict(list,
             {'species': tensor([[ 3,  1,  1,  ..., -1, -1, -1],
                      [ 1,  1,  3,  ..., -1, -1, -1],
                      [ 1,  2,  1,  ..., -1, -1, -1],
                      ...,
                      [ 1,  1,  3,  ..., -1, -1, -1],
                      [ 1,  1,  1,  ..., -1, -1, -1],
                      [ 1,  1,  1,  ...,  0, -1, -1]]),
              'coordinates': tensor([[[-1.5067e+00, -4.6423e-01, -1.2594e-01],
                       [-5.6311e-01,  6.1330e-01,  3.9458e-02],
                       [ 8.0604e-01,  6.3581e-02,  6.9856e-02],
                       ...,
                       [ 0.0000e+00,  0.0000e+00,  0.0000e+00],
                       [ 0.0000e+00,  0.0000e+00,  0.0000e+00],
                       [ 0.0000e+00,  0.0000e+00,  0.0000e+00]],
              
                      [[-7.4559e-03, -1.0627e+00,  1.3833e-01],
                       [ 1.0146e+00,  6.7303e-02, -2.2808e-01],
                       [ 8.2268e-03,  1.0838e+

In [62]:
print(len(train_data_loader_list))
# Ceiling division: also correct when len(train_data) is an exact multiple of batch_size
assert len(train_data_loader_list) == -(-len(train_data) // batch_size)

85


The expected number of batches was created: for a training set of 691918 samples and a batch size of 8192, ceil(691918 / 8192) = 85, which matches the observed count.
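The batch count and the size of the final, partial batch follow from simple arithmetic. This is a small sketch with a hypothetical helper name, using the sizes printed in this notebook:

```python
import math

def batch_layout(n_samples, batch_size):
    """Return (number of batches, size of the last batch)."""
    if n_samples == 0:
        return 0, 0
    n_batches = math.ceil(n_samples / batch_size)
    last_batch = n_samples - (n_batches - 1) * batch_size
    return n_batches, last_batch

print(batch_layout(691918, 8192))  # (85, 3790)
print(batch_layout(86489, 8192))   # (11, 4569)
```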

In [55]:
display(len(train_data_loader_list[0]['species']))
display(len(train_data_loader_list[0]['coordinates']))
display(len(train_data_loader_list[0]['energies']))

8192

8192

8192

Batching produces batches of 8192 molecules. Each batch is a dictionary whose `species`, `coordinates`, and `energies` entries are tensors with the same first dimension.
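Molecules in a batch have different numbers of atoms, so they must be padded to a common length before stacking. The batch printed earlier shows `-1` in `species` and zeros in `coordinates` at the padded positions. The sketch below reproduces that padding on plain lists. The toy molecules and the `pad_batch` helper are hypothetical.

```python
def pad_batch(molecules, species_pad=-1, coord_pad=(0.0, 0.0, 0.0)):
    """Pad every molecule to the largest atom count in the batch."""
    max_atoms = max(len(m["species"]) for m in molecules)
    species, coordinates = [], []
    for m in molecules:
        n_missing = max_atoms - len(m["species"])
        species.append(list(m["species"]) + [species_pad] * n_missing)
        coordinates.append(list(m["coordinates"]) + [coord_pad] * n_missing)
    return {"species": species, "coordinates": coordinates}

# Toy molecules with made-up coordinates; species indices follow ['H', 'C', 'N', 'O'].
toy = [
    {"species": [3, 0, 0], "coordinates": [(0.0, 0.0, 0.1), (0.8, 0.0, -0.5), (-0.8, 0.0, -0.5)]},
    {"species": [0, 0], "coordinates": [(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)]},
]
batch = pad_batch(toy)
print(batch["species"])  # [[3, 0, 0], [0, 0, -1]]
```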

In [74]:
# Training Data Batches
for i, batch in enumerate(train_data_loader):
    species = batch['species']
    print(f'Batch # {i} is of size: { len(species) }')

Batch # 0 is of size: 8192
Batch # 1 is of size: 8192
Batch # 2 is of size: 8192
Batch # 3 is of size: 8192
Batch # 4 is of size: 8192
Batch # 5 is of size: 8192
Batch # 6 is of size: 8192
Batch # 7 is of size: 8192
Batch # 8 is of size: 8192
Batch # 9 is of size: 8192
Batch # 10 is of size: 8192
Batch # 11 is of size: 8192
Batch # 12 is of size: 8192
Batch # 13 is of size: 8192
Batch # 14 is of size: 8192
Batch # 15 is of size: 8192
Batch # 16 is of size: 8192
Batch # 17 is of size: 8192
Batch # 18 is of size: 8192
Batch # 19 is of size: 8192
Batch # 20 is of size: 8192
Batch # 21 is of size: 8192
Batch # 22 is of size: 8192
Batch # 23 is of size: 8192
Batch # 24 is of size: 8192
Batch # 25 is of size: 8192
Batch # 26 is of size: 8192
Batch # 27 is of size: 8192
Batch # 28 is of size: 8192
Batch # 29 is of size: 8192
Batch # 30 is of size: 8192
Batch # 31 is of size: 8192
Batch # 32 is of size: 8192
Batch # 33 is of size: 8192
Batch # 34 is of size: 8192
Batch # 35 is of size: 8192
...

In [75]:
# Test Data Batches
for i, batch in enumerate(test_data_loader):
    species = batch['species']
    print(f'Batch # {i} is of size: { len(species) }')

Batch # 0 is of size: 8192
Batch # 1 is of size: 8192
Batch # 2 is of size: 8192
Batch # 3 is of size: 8192
Batch # 4 is of size: 8192
Batch # 5 is of size: 8192
Batch # 6 is of size: 8192
Batch # 7 is of size: 8192
Batch # 8 is of size: 8192
Batch # 9 is of size: 8192
Batch # 10 is of size: 4569


In [76]:
# Val Data Batches
for i, batch in enumerate(val_data_loader):
    species = batch['species']
    print(f'Batch # {i} is of size: { len(species) }')

Batch # 0 is of size: 8192
Batch # 1 is of size: 8192
Batch # 2 is of size: 8192
Batch # 3 is of size: 8192
Batch # 4 is of size: 8192
Batch # 5 is of size: 8192
Batch # 6 is of size: 8192
Batch # 7 is of size: 8192
Batch # 8 is of size: 8192
Batch # 9 is of size: 8192
Batch # 10 is of size: 4569


Batching works as expected: each split yields the appropriate number of batches, and every batch has size 8192 except the last one, which holds the remainder.
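Instead of reading the printed sizes by eye, the same check can be automated. The sketch below assumes each batch is a mapping with a `species` entry that supports `len`. `check_batch_sizes` is a hypothetical helper, shown here on a toy stand-in for a data loader.

```python
def check_batch_sizes(batches, batch_size, key="species"):
    """Return (ok, total): ok is True if only the last batch may be smaller."""
    sizes = [len(batch[key]) for batch in batches]
    full_ok = all(size == batch_size for size in sizes[:-1])
    last_ok = bool(sizes) and 0 < sizes[-1] <= batch_size
    return full_ok and last_ok, sum(sizes)

# Toy stand-in for a data loader: two full batches of 4 and a final batch of 3.
toy_batches = [{"species": [0] * 4}, {"species": [0] * 4}, {"species": [0] * 3}]
ok, n_total = check_batch_sizes(toy_batches, batch_size=4)
print(ok, n_total)  # True 11
```

Under the same assumption, it could be called on a loader from this notebook, for example `check_batch_sizes(val_data_loader, batch_size)`, and the returned total compared with `len(val_data)`.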

### TorchANI API

In [77]:
class AtomicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(384, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )
    
    def forward(self, x):
        return self.layers(x)

net_H = AtomicNet()
net_C = AtomicNet()
net_N = AtomicNet()
net_O = AtomicNet()

# ANI model requires a network for each atom type
# use torchani.ANIModel() to compile atomic networks
ani_net = torchani.ANIModel([net_H, net_C, net_N, net_O])
model = nn.Sequential(
    aev_computer,
    ani_net
).to(device)
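Conceptually, the ANI model sends each atom's AEV through the network for that atom's species and sums the resulting atomic energies into one molecular energy, skipping padded atoms. The sketch below shows that dispatch on plain Python values. The toy "networks" are simple functions and the AEVs are made up, so this illustrates the idea rather than TorchANI's implementation.

```python
def total_energy(species, aevs, atomic_nets, pad_index=-1):
    """Sum per-atom energies, choosing the network by species index."""
    energy = 0.0
    for s, aev in zip(species, aevs):
        if s == pad_index:  # padded slot: contributes nothing
            continue
        energy += atomic_nets[s](aev)
    return energy

# Toy stand-ins for net_H and net_C (hypothetical functions, not trained models).
toy_nets = [
    lambda aev: 1.0 * sum(aev),  # species index 0
    lambda aev: 2.0 * sum(aev),  # species index 1
]
species = [1, 0, 0, -1]                 # last entry is padding
aevs = [[1.0], [0.5], [0.5], [9.0]]     # made-up one-dimensional "AEVs"
print(total_energy(species, aevs, toy_nets))  # 3.0
```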

In [78]:
train_data_batch = next(iter(train_data_loader))

loss_func = nn.MSELoss()
species = train_data_batch['species'].to(device)
coords = train_data_batch['coordinates'].to(device)
true_energies = train_data_batch['energies'].to(device).float()
_, pred_energies = model((species, coords))
# nn.MSELoss takes (input, target): predictions first, then reference values
loss = loss_func(pred_energies, true_energies)
print(loss)

tensor(0.0954, grad_fn=<MseLossBackward0>)
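For reference, the mean squared error is the average of the squared differences between predictions and targets. It is symmetric in its two arguments, so swapping them does not change the value. A minimal sketch with made-up numbers:

```python
def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

pred = [0.1, -0.2, 0.4]    # hypothetical predicted energies
target = [0.0, 0.0, 0.0]   # hypothetical reference energies
print(round(mse(pred, target), 6))
```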
