# A3MDnet tutorial


## Building a small sugar benchmark database

We ran a quick search in Zinc15 to acquire molecules that contain sugars. These molecules can be too large for QM calculations, so we keep only small ones: the stated target is fewer than 40 atoms, and the filter in the code below discards anything with more than 30 heavy atoms. Besides, the methodology is still under development for second-row atoms, so we will keep only those molecules containing C, H, N and O atoms.

We will read a list of SMILES, embed the molecules as 3D structures, optimize them with the Merck Molecular Force Field (MMFF), and then optimize them again using TorchANI potentials. The resulting structures will be used to generate electron densities.

In [1]:
import rdkit.Chem as Chem
import rdkit.Chem.AllChem as AllChem

[23:35:57] Enabling RDKit 2019.09.3 jupyter extensions


In [2]:
def has_not_allowed_atoms(mol, allowed_elements):
    # Collect the element symbol of every atom in the molecule
    symbols = [i.GetSymbol() for i in mol.GetAtoms()]
    # True as soon as one atom is not in the allowed list
    return any(i not in allowed_elements for i in symbols)
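The filter should report a molecule as soon as one of its atom symbols falls outside the allowed list. The sketch below checks that behaviour on plain lists of symbols, so it runs without RDKit; the helper name `has_disallowed_symbols` and the two example symbol lists are hypothetical and only serve as an illustration.

```python
def has_disallowed_symbols(symbols, allowed_elements):
    """Return True if any symbol in `symbols` is not in `allowed_elements`."""
    allowed = set(allowed_elements)
    return any(s not in allowed for s in symbols)


allowed = ['C', 'H', 'N', 'O']
sugar_like = ['C', 'C', 'O', 'O', 'H', 'H']  # hypothetical: only allowed elements
thio_sugar = ['C', 'C', 'O', 'S', 'H']       # hypothetical: contains sulfur

print(has_disallowed_symbols(sugar_like, allowed))  # False
print(has_disallowed_symbols(thio_sugar, allowed))  # True
```

Note that comparing the allowed list against itself, rather than against the molecule's symbols, would always return False and let every molecule through.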

In [3]:
# Reading SMILES 
with open('substances.smi') as f:
    smi = [Chem.MolFromSmiles(i) for i in f.readlines()]

In [4]:
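# Embedding and optimizing the molecules
embed_mols = []
for i in smi:
    # MolFromSmiles returns None for lines it cannot parse
    if i is None: continue
    if i.GetNumHeavyAtoms() > 30: continue
    if has_not_allowed_atoms(i, ['C', 'H', 'N', 'O']): continue
    # Add explicit hydrogens before generating 3D coordinates
    u = Chem.AddHs(i)
    AllChem.EmbedMolecule(u)
    try:
        # Raises ValueError when the embedding failed and there is no conformer
        AllChem.MMFFOptimizeMolecule(u)
    except ValueError:
        continue
    embed_mols.append(u)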

In [5]:
# Saving the molecules in Mol format (RDKit can read Mol2 files but cannot write them)
import os
os.makedirs('./mol', exist_ok=True)
for i, mol in enumerate(embed_mols):
    Chem.MolToMolFile(mol, './mol/sugbench_{:06d}.mol'.format(i))

Now we will convert the Mol files to Mol2 format. While we could keep working in Mol format, Mol2 has specific columns for charges and for segments, so it is more convenient for our methodology. We will use Open Babel (`obabel`) for the conversion.

In [None]:
%%bash
mkdir -p ./mol2
cd ./mol
ls *.mol | sed 's/\.mol$//' | xargs -I % -P 8 obabel -i mol %.mol -o mol2 -O ../mol2/%.mol2

Now we will optimize the molecules using TorchANI potentials. We could code these instructions ourselves, as in the optimization examples we have already seen, but we will use a script from a3md-utils for this task instead.

In [8]:
%%bash
cd ./mol2/
source ~/miniconda3/etc/profile.d/conda.sh
conda activate a3md
export A3MDSCRIPTS=""  # Include path to a3md/scripts 
ls *.mol2 | sed 's/\.mol2$//' | xargs -I % python3 ${A3MDSCRIPTS}/conformational.py optimize --model=ani1ccx --output %.opt.mol2 %.mol2

Process is interrupted.


Now we can generate the ORCA input files and submit the calculations to a QM cluster. We will again use one of the scripts of the a3md-utils library to avoid the burdensome task of writing ORCA input files by hand.

In [None]:
%%bash
cd ./mol2/
source ~/miniconda3/etc/profile.d/conda.sh
conda activate a3md
export A3MDSCRIPTS="" # Include path to a3md/scripts 
ls *.opt.mol2 > .input
python3 ${A3MDSCRIPTS}/utils.py many-prepare-qm --method=wB97X --basis="6-31+G*" --program=orca .input
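# NOTE: the original expression "s/PAL4 AIM/g" is not a valid sed command.
# Assumed intent: append the AIM keyword after PAL4 in every ORCA input file.
ls *.orca | xargs sed -i "s/PAL4/PAL4 AIM/g"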
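The contents of the files written by `many-prepare-qm` are not shown here. For orientation, the following is a minimal sketch of ORCA's simple-input layout (a `!` keyword line followed by an `* xyz charge multiplicity` coordinate block). The function `orca_input` and the water geometry are hypothetical, and the keyword list merely mirrors the method, basis set and `PAL4`/`AIM` strings that appear in the cell above; it is not taken from the script itself.

```python
def orca_input(symbols, coordinates, keywords, charge=0, multiplicity=1):
    """Build the text of a simple ORCA input file from symbols and Cartesian coordinates."""
    lines = [
        '! ' + ' '.join(keywords),                         # keyword line
        '* xyz {:d} {:d}'.format(charge, multiplicity),    # start of coordinate block
    ]
    for s, (x, y, z) in zip(symbols, coordinates):
        lines.append('{:<2s} {:14.8f} {:14.8f} {:14.8f}'.format(s, x, y, z))
    lines.append('*')                                      # end of coordinate block
    return '\n'.join(lines) + '\n'


# Hypothetical example: an approximate water geometry in Angstrom
water_symbols = ['O', 'H', 'H']
water_coords = [
    (0.0, 0.0, 0.1173),
    (0.0, 0.7572, -0.4692),
    (0.0, -0.7572, -0.4692),
]
text = orca_input(water_symbols, water_coords, ['wB97X', '6-31+G*', 'PAL4', 'AIM'])
print(text)
```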

QM calculations can be run elsewhere (e.g., on a cluster). For those who cannot run the calculations, we have included the wavefunction outputs in the `wfn` folder.

**Proposed exercise**: Replace the Gaussian/ORCA QM code with PySCF, using tricks from the previous tutorial.

## Compiling the database

We will use another script to calculate the density matrix of each molecule and to compile all the wavefunction information into a single file.

In [9]:
%%bash
cd wfn
tar xfz sugbench.wfx.tar.gz

In [None]:
%%bash
source ~/miniconda3/etc/profile.d/conda.sh
conda activate a3md
export A3MDSCRIPTS=""  # Include path to a3md/scripts
cd wfn/
ls *.opt.orca.wfx > .input
python3 ${A3MDSCRIPTS}/utils.py many-compile-wfn --input_type=wfx .input sugbench.wfn.h5

The resulting H5 file contains all the information about the wave-function stored as a dictionary. We can access its contents in an interactive way:

In [10]:
import h5py

In [11]:
# The context manager closes the file even if reading fails
with h5py.File('./wfn/sugbench.wfn.h5', 'r') as f:
    print(f['sugbench_000000.opt.orca']['atomic_symbols'][:10])

[b'C' b'C' b'O' b'O' b'C' b'C' b'O' b'C' b'O' b'C']
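As the output shows, the atomic symbols come back as byte strings. A minimal sketch of decoding them and tallying the composition, using the ten symbols printed above as literal input so that it runs without the H5 file:

```python
from collections import Counter

# The first ten symbols of sugbench_000000, copied from the output above
raw = [b'C', b'C', b'O', b'O', b'C', b'C', b'O', b'C', b'O', b'C']

# Decode bytes into regular Python strings
symbols = [s.decode('ascii') for s in raw]

# Count how many times each element appears
composition = Counter(symbols)
print(symbols)
print(dict(composition))
```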


## Acquiring data

Models are usually trained on large amounts of data. The original A3MDnet models were trained using thousands of molecules from different sources. This can be difficult to carry out on a personal computer, so we will perform a toy training with only a fraction of that dataset.

We will use the GDB7 dataset, which contains organic molecules with up to seven heavy atoms.


In [None]:
%%bash
mkdir -p ./training_data

wget https://zenodo.org/record/4542915/files/gdb7.wfn.h5
wget https://zenodo.org/record/4542915/files/gdb7.index
wget https://zenodo.org/record/4542915/files/gdb7.json

mv gdb7.wfn.h5 ./training_data/
mv gdb7.index ./training_data/
mv gdb7.json ./training_data/

We can truncate the dataset to a random subset of 1000 entries to make training faster.

In [48]:
%%bash
shuf -n 1000 training_data/gdb7.index > training_data/gdb7.shuf.index
sed -i "s/gdb7.index/gdb7.shuf.index/g" training_data/gdb7.json
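The shell pipeline draws a different random subset on every run. If a reproducible subset is preferred, the same idea can be sketched in Python with a fixed seed. The helper `subsample_lines` and the entry names are hypothetical; with the real data, the list would be read from `training_data/gdb7.index`.

```python
import random


def subsample_lines(lines, n, seed=0):
    """Return up to `n` distinct entries drawn at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(lines, min(n, len(lines)))


# Hypothetical index entries standing in for the lines of gdb7.index
entries = ['mol_{:05d}'.format(i) for i in range(5000)]
subset = subsample_lines(entries, 1000, seed=42)
print(len(subset), subset[:3])
```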

## Building an ML model

The A3MDnet architecture is based on different modules: embeddings, message-passing, normalization, aggregation, and density. We will build a custom predictor employing a3mdnet layers.

In [1]:
import torch
from torch import nn
from a3mdnet.graph import NodeConvolve, NodePool, EdgePool, TopKEdges, MolecularEmbedding, MessagePassing
from a3mdnet.density_models import GenAMD, HarmonicGenAMD
from a3mdnet.data import AMDParameters
from a3mdnet.modules import TranslateAtomicSymbols
from a3mdnet.models.ddnn import A3MDnet

In [2]:
class TDNN3(nn.Module):
    def __init__(
            self
    ):
        super(TDNN3, self).__init__()
        self.emb = MolecularEmbedding(n_species=6, embedding_dim=128)
        self.conv = NodeConvolve(
            distances=[1.88, 3.76, 5.64, 7.52], widths=[1.88, 1.88, 1.88, 1.88], 
            net=nn.Sequential(nn.Linear(640, 256), nn.Tanh(), nn.Linear(256, 128)), 
            update_ratio=0.1)
        self.pool = NodePool(net=nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 8)))
        self.edge = TopKEdges(k=4, rc=8.0, net=nn.Sequential(nn.Linear(256, 128), nn.Tanh(), nn.Linear(128, 9)))
        self.n = 3
        self.decay = 0.5

    def forward(self, x):
        x = self.emb(x)
        for i in range(self.n):
            x = self.conv(x, decay=(self.decay**i))
        c_iso = self.pool(x)[1]
        c_aniso = self.edge(x)[2]
        return c_iso, c_aniso.reshape(c_aniso.shape[0], c_aniso.shape[1], -1, c_aniso.shape[4])
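Two numbers in `TDNN3` are worth checking by hand. The sketch below lists the `decay` factor passed to each of the three convolution passes, and verifies that the layer widths are mutually consistent. The interpretation of 640 as "node features plus one aggregate per distance shell" and of 256 as "features of a pair of nodes" is an assumption made for illustration; it is not confirmed by the code shown here.

```python
# Decay factor handed to NodeConvolve on each pass of the loop in forward()
n_iterations = 3
decay = 0.5
decay_factors = [decay ** i for i in range(n_iterations)]
print(decay_factors)  # [1.0, 0.5, 0.25]

# Layer widths taken from the class definition
embedding_dim = 128
n_shells = 4            # len(distances) in NodeConvolve
conv_input = 640        # first Linear layer of the convolution net
edge_input = 256        # first Linear layer of the TopKEdges net

# Assumed reading: 640 = 128 * (4 shells + the node itself), 256 = 2 nodes * 128
print(conv_input == embedding_dim * (n_shells + 1))
print(edge_input == 2 * embedding_dim)
```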
    

In [3]:
tdnn3 = TDNN3()
table = {1: 0, 6: 1, 7: 2, 8: 3, 16: 4, -1: -1}
prodensity_params = AMDParameters.from_file('params/a3md_promolecule.json')
isodensity_params = AMDParameters.from_file('params/a3md_isotropic_basis.json')
anidensity_params = AMDParameters.from_file('params/a3md_anisotropic_basis.json')
prodensity = GenAMD(prodensity_params, table=table)
isodensity = GenAMD(isodensity_params, table=table)
anidensity = HarmonicGenAMD(anidensity_params, k=4, max_angular_moment=3, table=table)

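The A3MDnet model wraps the custom predictor together with the three density models (promolecule, isotropic and anisotropic), so that the predicted coefficients are turned into an electron density.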

In [4]:
model = A3MDnet(tdnn3, prodensity, isodensity, anidensity, table)
model

A3MDnet(
  (predictor): TDNN3(
    (emb): MolecularEmbedding(
      (map): Embedding(7, 128)
    )
    (conv): NodeConvolve(
      (net): Sequential(
        (0): Linear(in_features=640, out_features=256, bias=True)
        (1): Tanh()
        (2): Linear(in_features=256, out_features=128, bias=True)
      )
    )
    (pool): NodePool(
      (net): Sequential(
        (0): Linear(in_features=128, out_features=64, bias=True)
        (1): Tanh()
        (2): Linear(in_features=64, out_features=8, bias=True)
      )
    )
    (edge): TopKEdges(
      (net): Sequential(
        (0): Linear(in_features=256, out_features=128, bias=True)
        (1): Tanh()
        (2): Linear(in_features=128, out_features=9, bias=True)
      )
    )
  )
  (density): GenAMD()
  (proto): GenAMD()
  (deformation): HarmonicGenAMD()
  (translate): TranslateAtomicSymbols()
)

### Training

We will train the neural network by mini-batch optimization, using the Adam optimizer for weight updates, and we will reduce the learning rate when the performance on the test set stops improving.

In [5]:
from a3mdnet.data import H5MonomerDataset
from a3mdnet.sampling import IntegrationGrid
from a3mdnet.density_models import WaveFunctionDensity
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau
import math
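# Use the first GPU when available, otherwise fall back to the CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')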

In [6]:
sampler = IntegrationGrid(grid='minimal', radial_resolution=5).to(device)
wfn = WaveFunctionDensity().to(device)
model = model.to(device)

In [7]:
learning_rate = 1e-3
weight_decay = 1e-5
initial_epoch = 0
final_epoch = 100
batch_size = 4

opt = Adam(params=model.parameters(), lr=learning_rate, weight_decay=weight_decay)
schd = ReduceLROnPlateau(opt, mode='min', factor=0.5)
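To make the scheduler's behaviour concrete, here is a simplified pure-Python sketch of plateau-based learning rate reduction. The class `PlateauSchedule` is hypothetical. It follows what we understand to be the default logic of `ReduceLROnPlateau` in `'min'` mode (patience of 10 epochs, relative threshold of 1e-4) with the factor of 0.5 used above, and it leaves out options such as cooldown and minimum learning rate.

```python
class PlateauSchedule:
    """Halve the learning rate when the monitored metric stops improving."""

    def __init__(self, lr, factor=0.5, patience=10, threshold=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, metric):
        # An epoch counts as an improvement only if it beats the best value
        # by more than the relative threshold
        if metric < self.best * (1.0 - self.threshold):
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # After more than `patience` epochs without improvement, reduce the rate
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


# A metric that never improves after the first epoch
flat = PlateauSchedule(lr=1e-3)
history = [flat.step(1.0) for _ in range(13)]
print(history[10], history[11])  # rate before and after the first reduction
```

With these settings the rate is only reduced after eleven consecutive epochs without improvement, so a noisy but slowly decreasing test error can keep the learning rate at its initial value for a long time.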

In [None]:
training_data = H5MonomerDataset.from_json('./training_data/gdb7.json', device=device, float_dtype=torch.float)
training_data.split(0.8, shuffle=True)

In [9]:
validation_data = H5MonomerDataset.from_json('./wfn/sugbench.json', device=device, float_dtype=torch.float)

In [None]:
for i in range(initial_epoch, final_epoch + 1):
    # Evaluate on the held-out 'test' split before each training epoch
    test_labs = 0.0
    with torch.no_grad():
        for u in training_data.epoch(split='test', shuffle=False, batch_size=batch_size):
            u.to(device)
            # Grid samples (dv) and integration weights (w) around each molecule
            _, dv, w = sampler.sample(u.atomic_numbers, u.coordinates)
            pred, c = model.forward(dv, u.atomic_numbers, u.coordinates, u.charge)
            ref = wfn.density(dv, u.primitive_centers, u.exponents, u.symmetry, u.density_matrix)
            # Integrated absolute error, normalized by the integrated predicted density
            test_labs += (((ref - pred).abs() * w).sum(1) / (pred * w).sum(1)).sum().item()

    # Average over the molecules of the test split and update the scheduler
    test_labs = test_labs / len(training_data.ids['test'])
    schd.step(test_labs)
    lr = opt.param_groups[0]['lr']
    print('{:6d} {:18.6e} {:12.6e}'.format(i, test_labs, lr))

    for u in training_data.epoch(split='train', shuffle=True, batch_size=batch_size):
        u.to(device)
        _, dv, w = sampler.sample(u.atomic_numbers, u.coordinates)
        pred, c = model.forward(dv, u.atomic_numbers, u.coordinates, u.charge)
        ref = wfn.density(dv, u.primitive_centers, u.exponents, u.symmetry, u.density_matrix)
        # Training loss: integrated squared error between reference and prediction
        l2 = ((ref - pred).pow(2) * w).sum()
        opt.zero_grad()
        l2.backward()
        opt.step()

    # Save a checkpoint every 10 epochs
    if i % 10 == 0:
        torch.save(model, 'tdnn3_{:06d}.pt'.format(i))



     0       2.819129e-01 1.000000e-03
     1       7.983937e-02 1.000000e-03
     2       3.823092e-02 1.000000e-03
     3       3.828569e-02 1.000000e-03
     4       3.788176e-02 1.000000e-03
     5       4.068657e-02 1.000000e-03
     6       3.603610e-02 1.000000e-03
     7       3.985449e-02 1.000000e-03
     8       4.010543e-02 1.000000e-03
     9       3.130591e-02 1.000000e-03
    10       3.769179e-02 1.000000e-03
    11       3.433502e-02 1.000000e-03
    12       3.117320e-02 1.000000e-03
    13       3.294291e-02 1.000000e-03
    14       3.360168e-02 1.000000e-03
    15       3.093831e-02 1.000000e-03
    16       3.130447e-02 1.000000e-03
    17       3.147235e-02 1.000000e-03
    18       3.239320e-02 1.000000e-03
    19       3.025984e-02 1.000000e-03
    20       2.843221e-02 1.000000e-03
    21       3.135502e-02 1.000000e-03
    22       3.118245e-02 1.000000e-03
    23       2.785484e-02 1.000000e-03
    24       3.001516e-02 1.000000e-03
    25       3.152177e-02
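The second column of the output is an integrated absolute error of the density, divided by the integrated density and averaged over the test molecules. The sketch below reproduces that formula on plain Python lists. The function `relative_abs_error` and the toy numbers are hypothetical; the `normalize_by` switch reflects that the test loop divides by the integrated predicted density while the validation cell divides by the integrated reference density.

```python
def relative_abs_error(ref, pred, weights, normalize_by='pred'):
    """Weighted sum of |ref - pred| divided by the weighted sum of a density."""
    numerator = sum(w * abs(r - p) for r, p, w in zip(ref, pred, weights))
    source = pred if normalize_by == 'pred' else ref
    denominator = sum(w * v for v, w in zip(source, weights))
    return numerator / denominator


# Toy densities on three grid points with equal integration weights
ref = [1.0, 2.0, 3.0]
pred = [1.0, 2.5, 3.5]
weights = [0.5, 0.5, 0.5]

print(relative_abs_error(ref, pred, weights, normalize_by='pred'))
print(relative_abs_error(ref, pred, weights, normalize_by='ref'))
```

When the prediction integrates to nearly the same number of electrons as the reference, the two normalizations give nearly identical values.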

In [17]:
abse = []
with torch.no_grad():
    for u in validation_data.epoch(split='remaining', shuffle=False, batch_size=1):
        u.to(device)
        _, dv, w = sampler.sample(u.atomic_numbers, u.coordinates)
        pred, c = model.forward(dv, u.atomic_numbers, u.coordinates, u.charge)
        ref = wfn.density(dv, u.primitive_centers, u.exponents, u.symmetry, u.density_matrix)
        # Integrated absolute error relative to the integrated reference density
        # (stored in its own variable so the batch `u` is not overwritten)
        err = (w * torch.abs(ref - pred) / (ref * w).sum()).sum()
        abse.append(err.item())
    

## Visualizing electron density on sugars

We can plot the electron densities using the dx volumetric format, which is similar in purpose to CUBE files.
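For readers unfamiliar with the format, the following is a minimal sketch of a scalar field written in the OpenDX layout commonly produced by APBS-style tools: a regular grid defined by its counts, origin and step vectors, followed by the values, three per line, with the last index varying fastest. The function `format_dx` is hypothetical, and whether `DxGrid` emits exactly this layout is an assumption.

```python
def format_dx(values, counts, origin, delta):
    """Format a flat list of grid values as OpenDX text for a regular orthogonal grid."""
    nx, ny, nz = counts
    n = nx * ny * nz
    if len(values) != n:
        raise ValueError('expected {} values, got {}'.format(n, len(values)))
    lines = [
        'object 1 class gridpositions counts {} {} {}'.format(nx, ny, nz),
        'origin {:.6e} {:.6e} {:.6e}'.format(*origin),
        'delta {:.6e} 0.000000e+00 0.000000e+00'.format(delta[0]),
        'delta 0.000000e+00 {:.6e} 0.000000e+00'.format(delta[1]),
        'delta 0.000000e+00 0.000000e+00 {:.6e}'.format(delta[2]),
        'object 2 class gridconnections counts {} {} {}'.format(nx, ny, nz),
        'object 3 class array type double rank 0 items {} data follows'.format(n),
    ]
    # Data section: three values per line
    for i in range(0, n, 3):
        lines.append(' '.join('{:.6e}'.format(v) for v in values[i:i + 3]))
    lines += [
        'attribute "dep" string "positions"',
        'object "density" class field',
        'component "positions" value 1',
        'component "connections" value 2',
        'component "data" value 3',
    ]
    return '\n'.join(lines) + '\n'


# A 2 x 2 x 2 grid with a spacing of 0.5 and made-up density values
dx_text = format_dx([0.1 * k for k in range(8)], (2, 2, 2), (0.0, 0.0, 0.0), (0.5, 0.5, 0.5))
print(dx_text)
```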

In [10]:
from a3mdnet.utils import DxGrid, to_xyz_file

In [11]:
%%bash 
mkdir -p dx

In [13]:
dxg = DxGrid(device=device, dtype=torch.float, resolution=0.5, spacing=2.0)

# Load the checkpoint saved at epoch 50 of the training loop
model = torch.load('tdnn3_000050.pt')
with torch.no_grad():
    for i, u in enumerate(validation_data.epoch(split='remaining', shuffle=False, batch_size=1)):
        u.to(device)
        # Build a regular grid around the molecule
        g, dv, cell_info = dxg.generate_grid(u.coordinates.to(device))
        pred, c = model.forward(dv, u.atomic_numbers, u.coordinates, u.charge)
        ref = wfn.density(dv, u.primitive_centers, u.exponents, u.symmetry, u.density_matrix)
        # Clamp the densities so the large values near the nuclei do not dominate the plot
        pred = pred.detach().to(torch.device('cpu')).clamp(max=1.0)
        ref = ref.detach().to(torch.device('cpu')).clamp(max=1.0)
        p_pred = dxg.dx(pred, **cell_info)
        p_pred.write('./dx/sugbench_{:06d}.pred.dx'.format(i))
        p_ref = dxg.dx(ref, **cell_info)
        p_ref.write('./dx/sugbench_{:06d}.ref.dx'.format(i))

        # Write the geometry next to the density files
        for xyz in to_xyz_file(u.atomic_numbers, u.coordinates):
            with open('./dx/sugbench_{:06d}.xyz'.format(i), 'w') as f:
                f.write(xyz)
        # Only the first molecule is exported
        break

![](movie_a3md.gif)

And here we are! The next tutorial will cover another neural-network-based methodology, DeepDFT, which represents the state of the art in valence electron density prediction. Enjoy!

In [None]:
with torch.no_grad():
    for i, u in enumerate(validation_data.epoch(split='remaining', shuffle=False, batch_size=1)):
        for xyz in to_xyz_file(u.atomic_numbers, u.coordinates):
            with open('./xyz_rotated/sugbench_{:06d}.xyz'.format(i), 'w') as f:
                f.write(xyz)