# Case 3: Predicting Electron Density with pre-trained DeepDFT

## Introduction

DeepDFT employs a message-passing scheme to compute the electron density in real 3D space. For that, it requires the construction of a graph that joins atomic nodes and probe nodes. In this small case study, we will employ DeepDFT to compute electron density cube files for different sugar molecules.

We will use the *xyz* files created in the last block of the previous tutorial, because their coordinates match those from the wfx files.


## Libraries and dependencies

In [1]:
import torch
import math
from deepdft import utils
from deepdft import densitymodel
from deepdft import dataset
import os
import argparse
import json
import ase
import ase.io
import numpy as np
from ase.units import Bohr
import pandas as pd
import matplotlib.pyplot as plt
import pathlib
import glob



In [2]:
class LazyMeshGrid():
    def __init__(self, cell, grid_step, origin=None):
        self.cell = cell
        self.scaled_grid_vectors = [np.arange(0, l, grid_step)/l for l in self.cell.lengths()]
        self.shape = np.array([len(g) for g in self.scaled_grid_vectors] + [3])
        if origin is None:
            self.origin = np.zeros(3)
        else:
            self.origin = origin

        self.origin = np.expand_dims(self.origin, 0)

    def __getitem__(self, indices):
        indices = np.array(indices)
        indices_shape = indices.shape
        if not (len(indices_shape) == 2 and indices_shape[0] == 3):
            raise NotImplementedError("Indexing must be a 3xN array-like object")
        gridA = self.scaled_grid_vectors[0][indices[0]]
        gridB = self.scaled_grid_vectors[1][indices[1]]
        gridC = self.scaled_grid_vectors[2][indices[2]]

        grid_pos = np.stack([gridA, gridB, gridC], 1)
        grid_pos = np.dot(grid_pos, self.cell)
        grid_pos += self.origin

        return grid_pos
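
To make the index-to-coordinate mapping of `LazyMeshGrid` more concrete, the sketch below reproduces the same arithmetic with plain NumPy. It is only an illustration: the orthorhombic cell, the grid spacing, and the indices are made-up values, and a (3, 3) array stands in for the ASE cell object.

```python
import numpy as np

# Hypothetical orthorhombic cell (angstrom) and grid spacing, for illustration only.
cell = np.diag([2.0, 3.0, 4.0])
grid_step = 0.5
lengths = np.linalg.norm(cell, axis=1)

# Fractional coordinates of the grid planes along each cell vector.
scaled = [np.arange(0, l, grid_step) / l for l in lengths]
shape = tuple(len(g) for g in scaled)

# A 3xN array of integer grid indices, the layout expected by __getitem__.
indices = np.array([[0, 1, 3],
                    [0, 2, 5],
                    [0, 4, 7]])

# Fractional -> Cartesian: stack the per-axis fractions and multiply by the cell.
frac = np.stack([scaled[0][indices[0]],
                 scaled[1][indices[1]],
                 scaled[2][indices[2]]], axis=1)
grid_pos = frac @ cell

print(shape)
print(grid_pos)
```

Each column of `indices` selects one grid point, and each row of `grid_pos` is its Cartesian position. Storing only the three 1D axes keeps memory use low, because the full 3D mesh is never materialized.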

In [3]:
def load_model(model_dir, device):
    """
    load_model
    ==========
    Parameters
    ----------
    model_dir: str
        Where the model is located
    device: torch.device or str
        Where to store the model (cpu, cuda, etc)
    Returns
    -------
    densitymodel.DensityModel, float
    
    Examples
    --------
    >>> load_model(deepdft_folder / 'qm9_pretrained_model', torch.device('cuda:0'))
    """
    with open(os.path.join(model_dir, "arguments.json"), "r") as f:
        runner_args = argparse.Namespace(**json.load(f))
    model = densitymodel.DensityModel(runner_args.num_interactions, runner_args.node_size, runner_args.cutoff)
    device = torch.device(device)
    model.to(device)
    # map_location places the checkpoint tensors directly on the requested device
    state_dict = torch.load(os.path.join(model_dir, "best_model.pth"), map_location=device)
    model.load_state_dict(state_dict["model"])
    return model, runner_args.cutoff


def load_molecule(atomspath, vacuum, grid_step):
    """
    load_molecule
    =============
    
    Parameters
    ----------
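    atomspath: str
        Path to the structure file. Tested on xyz and mol formats
    vacuum: float
        Padding (in angstrom) added around the molecule on each side
    grid_step: float
        Grid spacing (in angstrom) between neighbouring probe points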
    """
    atoms = ase.io.read(atomspath)
    
    diff = atoms.get_positions().max(0) - atoms.get_positions().min(0)
    
    atoms.center(vacuum=vacuum)

    
    
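    # cellpar() returns [a, b, c, alpha, beta, gamma]; it replaces the deprecated
    # Atoms.get_cell_lengths_and_angles()
    a, b, c, ang_bc, ang_ac, ang_ab = atoms.cell.cellpar()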
    a, b, c = ceil_float(a, grid_step), ceil_float(b, grid_step), ceil_float(c, grid_step)
    atoms.set_cell([a, b, c, ang_bc, ang_ac, ang_ab])

    origin = np.zeros(3)

    grid_pos = LazyMeshGrid(atoms.get_cell(), grid_step)

    metadata = {"filename": atomspath}
    return {
        # "density": density,
        "atoms": atoms,
        "origin": origin,
        "grid_position": grid_pos,
        "metadata": metadata, # Meta information
        "diff": diff
    }



def ceil_float(x, step_size):
    # Round x up to the nearest multiple of step_size
    return math.ceil(x/step_size) * step_size
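
As a quick illustration of why the cell lengths are rounded up, the sketch below applies the same rounding to a few made-up cell lengths. The helper name `ceil_to_step` and the input values are hypothetical; the point is that the padded length becomes an integer multiple of the grid spacing, so the grid fills the cell exactly.

```python
import math

def ceil_to_step(x, step_size):
    """Round x up to the nearest multiple of step_size (same idea as ceil_float)."""
    return math.ceil(x / step_size) * step_size

grid_step = 0.25
# Hypothetical cell lengths (angstrom), e.g. after atoms.center(vacuum=...).
for length in (7.1, 8.0, 9.63):
    padded = ceil_to_step(length, grid_step)
    n_points = round(padded / grid_step)
    print(f"{length:5.2f} -> {padded:5.2f}  ({n_points} grid points)")
```

A length that is already a multiple of the spacing is left unchanged, while any other length grows by less than one grid step.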

## Exercises

### Ex1: Save cube files with electron densities

Below we have adapted the DeepDFT script for the evaluation of cube files.
The first step is loading the model. We will use their model pretrained on QM9 (GDB9), a dataset containing 134k organic molecules with up to 9 heavy atoms.


In [4]:
path_to_model = pathlib.Path(dataset.__file__).parent / "qm9_pretrained_model"

In [5]:
device = torch.device('cuda:0')
model, cutoff = load_model(path_to_model, device)
grid_step = 0.25 # If required, we can make tighter grids by reducing this value;
#                  for visualization purposes, 0.25 is fine
padding = 1.0 # If required, we can increase it to visualize the density in far regions.
#               Yet, if it becomes too large, the density calculation will simply be clamped
bohr2ang = 0.529177 # DeepDFT works in electrons/angstrom^3. We will convert it to atomic units (electrons/bohr^3)
bohr2angp3 = bohr2ang ** 3
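
The conversion factor can be checked with a short worked example. One cubic bohr is 0.529177³ cubic angstrom, so a density of 1 electron/angstrom³ corresponds to roughly 0.148 electrons/bohr³. The function name below is a hypothetical helper, not part of DeepDFT.

```python
BOHR_IN_ANGSTROM = 0.529177  # same value as bohr2ang above

def density_to_atomic_units(rho_e_per_A3):
    """Convert a density from electrons/angstrom^3 to electrons/bohr^3."""
    # One cubic bohr is BOHR_IN_ANGSTROM**3 cubic angstrom, so it holds that
    # fraction of the electrons contained in one cubic angstrom.
    return rho_e_per_A3 * BOHR_IN_ANGSTROM ** 3

print(density_to_atomic_units(1.0))
```

This is why the predicted densities are multiplied, not divided, by the cubed factor before being written to the cube file.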

The protocol employs the Atomic Simulation Environment (ASE). This library accepts .mol and .xyz files. Our optimized molecules are in .mol2 format, so we will convert them to xyz using Open Babel.

In [None]:
%%bash

mkdir -p xyz
cd ./mol2/ 
ls *.mol2 | grep opt | sed 's/\.mol2$//' | xargs -I % -P 10 obabel -i mol2 %.mol2 -o xyz -O ../xyz/%.xyz

Next, we load all the molecules into a list of (name, molecule) tuples.

In [6]:
sugbench = []
for i in glob.glob('./xyz_rotated/*.xyz'): 
    mol = load_molecule('{:s}'.format(i), vacuum=padding, grid_step=grid_step)
    sugbench.append(
        (
            i.split('/')[-1], mol
            
        )
    )



DeepDFT works as a message-passing neural network. This means that any computation requires building graphs between atoms (nodes) and probes (probe nodes). DeepDFT uses distance as the adjacency criterion (nodes closer than the cutoff are connected), and the distance between nodes as edge data.


*collate_fn* generates this graph from the atomic coordinates, so that the MPNN can process the molecular information.

The calculation of the electron density is performed in two steps:

1. Calculation of the molecular representation, through message passing between atomic nodes. Probe nodes are excluded.
2. Calculation of the electron density at the probe coordinates, through message passing from atomic nodes to probe nodes.
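
The two graphs involved in these steps can be sketched with a brute-force distance search. This is not DeepDFT's own graph construction, only a minimal NumPy illustration on a made-up geometry; the function name `edges_within_cutoff` and all coordinates are hypothetical.

```python
import numpy as np

def edges_within_cutoff(senders, receivers, cutoff, exclude_self=False):
    """Return (sender, receiver) index pairs and lengths for pairs closer than cutoff."""
    # Pairwise distances between every receiver (rows) and sender (columns).
    dist = np.linalg.norm(receivers[:, None, :] - senders[None, :, :], axis=-1)
    mask = dist < cutoff
    if exclude_self:
        # Only meaningful when senders and receivers are the same set of points.
        np.fill_diagonal(mask, False)
    recv_idx, send_idx = np.nonzero(mask)
    return np.stack([send_idx, recv_idx], axis=1), dist[recv_idx, send_idx]

# Toy geometry (angstrom): three collinear "atoms" and two probe points.
atoms = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
probes = np.array([[0.5, 0.0, 0.0], [3.0, 1.0, 0.0]])
cutoff = 1.5

# Step 1: atom-atom graph, used to build the molecular representation.
atom_edges, atom_dist = edges_within_cutoff(atoms, atoms, cutoff, exclude_self=True)
# Step 2: atom-to-probe graph; messages flow only from atoms to probes.
probe_edges, probe_dist = edges_within_cutoff(atoms, probes, cutoff)

print(atom_edges, atom_dist)
print(probe_edges, probe_dist)
```

Because probes only receive messages, the atomic representation from step 1 can be computed once and reused for any number of probe batches.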

In [7]:
collate_fn = dataset.CollateFuncAtoms(
    cutoff=cutoff,
    pin_memory=True,
    disable_pbc=True,
)


In [8]:
!mkdir -p cube
with torch.no_grad(): # Disabling gradient calculation is important to avoid memory depletion
    for i, density_dict in sugbench: # iterating over the sugar dataset
        print("processing {:s}".format(i), end=" ")
        r = - density_dict["diff"] / 2
        cubewriter = utils.CubeWriter(
            "./cube/{:s}.deepdft.cube".format(i),
            density_dict["atoms"],
            density_dict["grid_position"].shape[0:3],
            density_dict["origin"],
            "predicted by DeepDFT model",
        )
        # Part 1: Message Passing between atomic nodes
        graph_dict = collate_fn([density_dict])
        device_batch = {
            k: v.to(device=device, non_blocking=True) for k, v in graph_dict.items()
        }
        atom_representation = model.atom_model(device_batch)
        
        # Part 2: Message Passing from atomic nodes to probing nodes
        # given that there are many probing nodes, calculations become iterative
        density_iter = dataset.DensityGridIterator(density_dict, True, 1000, cutoff)
        for probe_graph_dict in density_iter:

            probe_dict = dataset.collate_list_of_dicts([probe_graph_dict])
            probe_dict = {
                k: v.to(device=device, non_blocking=True) for k, v in probe_dict.items()
            }
            device_batch["probe_edges"] = probe_dict["probe_edges"]
            device_batch["probe_edges_features"] = probe_dict["probe_edges_features"]
            device_batch["num_probe_edges"] = probe_dict["num_probe_edges"]
            device_batch["num_probes"] = probe_dict["num_probes"]

            cubewriter.write(bohr2angp3 * model.probe_model(device_batch, atom_representation).cpu().detach().numpy().flatten())
        ase.io.write('./cube/{:s}.xyz'.format(i), density_dict['atoms'])
        print("-- DONE")
        break # we break here, to avoid calculating the whole dataset

processing sugbench_000016.xyz -- DONE
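
A simple sanity check for a density stored on a regular grid is to integrate it: the sum of the grid values times the voxel volume should be close to the number of electrons described by the density. The sketch below shows the idea on a synthetic Gaussian density instead of a real cube file; the box size, grid, and electron count are made-up values.

```python
import numpy as np

# Hypothetical cubic box (bohr) sampled on a regular grid, mimicking a cube file.
box, n = 12.0, 60
step = box / n
axis = np.arange(n) * step - box / 2
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")

# Synthetic density: a normalized Gaussian scaled to hold 8 electrons.
n_electrons, sigma = 8.0, 0.8
rho = n_electrons * np.exp(-(x**2 + y**2 + z**2) / (2 * sigma**2))
rho /= (2 * np.pi * sigma**2) ** 1.5

# Riemann sum: every grid value represents one voxel of volume step**3.
integral = rho.sum() * step**3
print(f"integrated electrons: {integral:.3f}")
```

For a predicted cube file, the same sum would use the grid values and voxel volume read from the file, in consistent units.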


The results can be visualized using different programs. In this case, we will use ChimeraX.

<img src="movie1.gif" alt="">

### Ex2: Calculating electron density at custom coordinates



In [None]:
def process_arbitrary_coordinates(atoms, r, cutoff):
    probe_edges, probe_edge_features = dataset.probes_to_graph(atoms, r, cutoff)
    probe_edges = np.concatenate(probe_edges, axis=0)
    probe_edge_features = np.concatenate(probe_edge_features, axis=0)[:, None]
    num_probe_edges = probe_edges.shape[0]
    num_probes = r.shape[0]
    probe_dict = dict(
        probe_edges=torch.tensor(probe_edges, dtype=torch.long), 
        probe_edges_features=torch.tensor(probe_edge_features, dtype=torch.float), 
        num_probe_edges=torch.tensor(num_probe_edges), num_probes=torch.tensor(num_probes)
    )
    return probe_dict

In [None]:
density_dict = sugbench[0][1]

# 100 random probe coordinates (angstrom) inside a 3 A cube that starts at the molecular centroid
r = density_dict['atoms'].get_positions().mean(0).reshape(1, 3) + np.random.rand(100, 3) * 3.0

with torch.no_grad(): # Disabling gradient calculation is important to avoid memory depletion
    
    # Part 1: Message Passing between atomic nodes
    graph_dict = collate_fn([density_dict])
    device_batch = {
        k: v.to(device=device, non_blocking=True) for k, v in graph_dict.items()
    }
    atom_representation = model.atom_model(device_batch)

    # Part 2: Message Passing from atomic nodes to probing nodes
    # with only 100 probes, a single batch is enough and no iteration is needed
    probe_graph_dict = process_arbitrary_coordinates(density_dict['atoms'], r, cutoff)
    probe_dict = dataset.collate_list_of_dicts([probe_graph_dict])
    probe_dict = {
        k: v.to(device=device, non_blocking=True) for k, v in probe_dict.items()
    }
    device_batch["probe_edges"] = probe_dict["probe_edges"]
    device_batch["probe_edges_features"] = probe_dict["probe_edges_features"]
    device_batch["num_probe_edges"] = probe_dict["num_probe_edges"]
    device_batch["num_probes"] = probe_dict["num_probes"]
    
    p = bohr2angp3 * model.probe_model(device_batch, atom_representation).cpu().detach().numpy().flatten()
    print("-- DONE")
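
Random points are only one option for the probe coordinates. As a hedged sketch, the snippet below builds evenly spaced probes along the segment joining two points, which is a common way to obtain a density profile along a bond. The helper `points_on_segment` and the two positions are hypothetical.

```python
import numpy as np

def points_on_segment(start, end, n_points):
    """Return an (n_points, 3) array of evenly spaced points from start to end."""
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    return (1.0 - t) * np.asarray(start) + t * np.asarray(end)

# Hypothetical positions (angstrom) of two bonded atoms.
atom_a = np.array([0.0, 0.0, 0.0])
atom_b = np.array([0.0, 0.0, 1.4])

r_line = points_on_segment(atom_a, atom_b, 15)
print(r_line.shape)
```

An array like `r_line` has the same (N, 3) layout as `r`. The coordinates must be expressed in the same frame as `density_dict['atoms']`, that is, after the centering performed in `load_molecule`.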

## Analysis of the AIM charge of a dimer

In the last block of these tutorials, we will go over the calculation of charge transfer using the DeepDFT model. For that, we will study ionic hydrogen bonds from the [Non-Covalent Interaction Atlas](http://www.nciatlas.org/IHB100.html). In this case we will study the interaction between a charged imidazole (imidazolium) and the carbonyl oxygen of acetone, an interaction that could be found in proteins.

![](illustration_03.png)


For this, we will:

1. Compute the electron density for these dimers.
2. Save the CUBE files.
3. Use the bader program to perform an AIM analysis of the resulting electron densities.
4. Parse the resulting bader outputs.
5. Display the charge gain and loss along the different conformations.

**IMPORTANT NOTE**: DeepDFT only generates the valence electron density. Thus, the values that we are going to obtain are meaningful in relative terms, not in absolute terms, because we do not know how much charge was included in the cores.


In [None]:
device = torch.device('cuda:0')
model, cutoff = load_model(path_to_model, device)
grid_step = 0.05 # Again, these parameters can be modified to get more detailed CUBE files.
padding = 2.0 
bohr2ang = 0.529177
bohr2angp3 = bohr2ang ** 3
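
Reducing the grid spacing has a cubic cost, which is worth estimating before launching the calculation. The sketch below counts grid points and probe batches for a made-up box; the helper `grid_cost` is hypothetical, and the batch size of 1000 assumes that the value passed to `DensityGridIterator` in this tutorial is the number of probes per batch.

```python
import math

def grid_cost(cell_lengths, grid_step, batch_size=1000):
    """Return (number of grid points, number of probe batches) for a box."""
    # Round first to absorb floating-point noise in the division.
    n_per_axis = [math.ceil(round(length / grid_step, 9)) for length in cell_lengths]
    n_points = math.prod(n_per_axis)
    return n_points, math.ceil(n_points / batch_size)

# Hypothetical 10 x 8 x 6 angstrom box; real cells depend on the molecule and padding.
box = (10.0, 8.0, 6.0)
for step in (0.25, 0.05):
    n_points, n_batches = grid_cost(box, step)
    print(f"grid_step={step}: {n_points} points, {n_batches} batches")
```

Going from a spacing of 0.25 to 0.05 multiplies the number of points by 5³ = 125, so run time and cube file size grow accordingly.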

In [None]:
dimer = []
for i in glob.glob('./dimers/*.xyz'): 
    mol = load_molecule('{:s}'.format(i), vacuum=padding, grid_step=grid_step)
    dimer.append(
        (
            i.split('/')[-1].replace('.xyz', ''), mol
            
        )
    )

In [None]:
collate_fn = dataset.CollateFuncAtoms(
    cutoff=cutoff,
    pin_memory=True,
    disable_pbc=True,
)

In [None]:
with torch.no_grad(): 
    for i, density_dict in dimer:
        print("processing {:s}".format(i), end=" ")
        r = - density_dict["diff"] / 2
        cubewriter = utils.CubeWriter(
            "./dimers/{:s}.cube".format(i),
            density_dict["atoms"],
            density_dict["grid_position"].shape[0:3],
            density_dict["origin"],
            "predicted by DeepDFT model",
        )
        graph_dict = collate_fn([density_dict])
        device_batch = {
            k: v.to(device=device, non_blocking=True) for k, v in graph_dict.items()
        }
        atom_representation = model.atom_model(device_batch)

        density_iter = dataset.DensityGridIterator(density_dict, True, 1000, cutoff)
        for probe_graph_dict in density_iter:

            probe_dict = dataset.collate_list_of_dicts([probe_graph_dict])
            probe_dict = {
                k: v.to(device=device, non_blocking=True) for k, v in probe_dict.items()
            }
            device_batch["probe_edges"] = probe_dict["probe_edges"]
            device_batch["probe_edges_features"] = probe_dict["probe_edges_features"]
            device_batch["num_probe_edges"] = probe_dict["num_probe_edges"]
            device_batch["num_probes"] = probe_dict["num_probes"]

            cubewriter.write(bohr2angp3 * model.probe_model(device_batch, atom_representation).cpu().detach().numpy().flatten())

        print("-- DONE")


Now we can analyze the resulting cube files using the bader program provided by the [Henkelman group](http://theory.cm.utexas.edu/henkelman/code/bader/). This program decomposes the density into AIM volumes, which give us the change in atomic charges as the oxygen of acetone approaches one of the nitrogens.

In [None]:
%%bash

wget http://theory.cm.utexas.edu/henkelman/code/bader/download/bader_lnx_64.tar.gz
tar xfvz bader_lnx_64.tar.gz
cd dimers
for i in *.cube; do
    ../bader "$i"
    # name the output after the cube file, e.g. name.cube -> name.charge
    mv ACF.dat "$(basename "$i" .cube).charge"
done
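
The `ACF.dat` file written by bader is a plain-text table with one row per atom, followed by a short summary. The sketch below parses such a table with the standard library only. The sample text is made up for illustration (three atoms with invented charges) and assumes the usual column layout of index, X, Y, Z, CHARGE, MIN DIST, ATOMIC VOL.

```python
# Made-up ACF.dat content; the numbers are not real results.
ACF_EXAMPLE = """\
    #         X           Y           Z       CHARGE      MIN DIST   ATOMIC VOL
 --------------------------------------------------------------------------------
    1    0.000000    0.000000    0.221000    7.254100     1.102300    95.310000
    2    0.000000    1.430000   -0.884000    0.372900     0.214500    30.120000
    3    0.000000   -1.430000   -0.884000    0.373000     0.214500    30.120000
 --------------------------------------------------------------------------------
    VACUUM CHARGE:               0.0000
    VACUUM VOLUME:               0.0000
    NUMBER OF ELECTRONS:         8.0000
"""

def parse_acf(text):
    """Return {atom index: Bader charge} from the text of an ACF.dat file."""
    charges = {}
    for line in text.splitlines():
        fields = line.split()
        # Atom rows have seven columns and start with an integer atom index.
        if len(fields) == 7 and fields[0].isdigit():
            charges[int(fields[0])] = float(fields[4])
    return charges

charges = parse_acf(ACF_EXAMPLE)
print(charges)
print(sum(charges.values()))
```

The CHARGE column holds the number of electrons assigned to each atomic volume, so the values should add up to the total reported in the summary.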

Now we read the resulting files, and we merge them with atom identifiers that we obtain from the xyz file.

In [None]:
mol = pd.read_csv(
    './dimers/imidazolium--acetone_080.xyz',
    # sep=r'\s+' replaces delim_whitespace, which is deprecated in recent pandas
    sep=r'\s+', skiprows=2, header=None, names=['element', 'x', 'y', 'z']
)
mol.index = mol.index +1
mol['id'] = mol.index
mol['id'] = mol.id.astype(str)
mol['id'] = mol['element'] + mol['id']
mol

In [None]:
aim_analysis = []
for i, name in enumerate(sorted(glob.glob('./dimers/*.charge'))):
    print(i, name)
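    # nrows=20: one row per atom of the dimer. The file has seven columns, so with
    # six names pandas uses the first column (atom number) as the index
    tmp = pd.read_csv(
        name, nrows=20, skiprows=[0, 1], sep=r'\s+', header=None,
        names=['x', 'y', 'z', 'charge', 'dist', 'vol']
    )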
    a = mol.join(tmp[['charge']])
    a['scale'] = i
    aim_analysis.append(a)
aim_analysis = pd.concat(aim_analysis)



In [None]:
# Charge of each atom relative to its minimum over all conformations
min_charge = aim_analysis.groupby('id')['charge'].transform('min')
aim_analysis['d_charge'] = aim_analysis['charge'] - min_charge

In [None]:
fig, ax = plt.subplots(1)
ax.set_xlabel('scale')
ax.set_ylabel(r'$\delta$ charge')
aim_analysis.query('id == "O18"').plot('scale', 'd_charge', kind='line', ax=ax, label='O')
aim_analysis.query('id == "N3"').plot('scale', 'd_charge', kind='line', ax=ax, label='N1')
aim_analysis.query('id == "N6"').plot('scale', 'd_charge', kind='line', ax=ax, label='N2')
plt.show()

As we can see, the oxygen borrows electrons from the nitrogen when the distance is short, and returns those electrons as the molecules move further apart. Finer cube grids might provide a smoother curve, yet the qualitative behaviour is correct.

This is an example of how new Machine-Learning tools can be used to understand chemistry better. We are pretty confident that newer methods will improve both the accuracy and the scaling.