# Analysis of the QM7 dataset


*Taken from:* https://www.kaggle.com/code/mjmurphy28/predicting-atomization-energy-qm7/notebook

*Original website:* http://quantum-machine.org/datasets/

**Attributes:**

- X: (7165 x 23 x 23), Coulomb matrices, low-level molecular descriptor (Rupp et al., 2012)
- T: (7165), atomization energies (unit: kcal/mol)
- P: (5 x 1433), cross-validation splits as used in [Montavon et al. NIPS, 2012]
- Z: (7165 x 23), atomic charges
- R: (7165 x 23 x 3), Cartesian coordinates (unit: Bohr) of each atom in the molecules


That is, the dataset contains 7165 molecules, each with up to 23 atoms; molecules with fewer atoms are zero-padded.
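The cross-validation splits in `P` can be used roughly as follows. This is a minimal sketch on synthetic stand-in arrays, assuming each row of `P` holds the molecule indices of one fold (the shape suggests this, since 5 x 1433 = 7165). The helper name `train_test_indices` is hypothetical and not part of the dataset or the original notebook.

```python
import numpy as np

# Synthetic stand-in for qm7['P']: 5 folds of 4 indices over 20 samples
rng = np.random.default_rng(0)
P = rng.permutation(20).reshape(5, 4)


def train_test_indices(P, fold):
    """Use row `fold` of P as test indices and all other rows as train indices."""
    test_idx = P[fold]
    train_idx = np.delete(P, fold, axis=0).ravel()
    return train_idx, test_idx


train_idx, test_idx = train_test_indices(P, fold=0)
print(len(train_idx), len(test_idx))
```

With the real data, the same indices would select rows of `X`, `Z`, `R` and entries of `T[0]`.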

The dataset description states that 'The Coulomb matrix has built-in invariance to translation and rotation of the molecule'. This follows from how it is calculated:

$$C_{ii} = \frac{1}{2} Z_i^{2.4} \text{ and } C_{ij} = \frac{Z_i Z_j}{|R_i - R_j|} \text{ for } i \neq j$$

where $Z_i$ is the nuclear charge of atom $i$ and $R_i$ is its position. Since $C$ depends on the positions only through the pairwise distances $|R_i - R_j|$, translating or rotating the molecule (i.e. the atoms' positions) does not change the value of $C$.
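The invariance can be checked numerically. The sketch below implements the formula above with NumPy and applies a random rigid motion to a made-up tetrahedral geometry; the coordinates are illustrative and not taken from QM7, and `coulomb_matrix` is a hypothetical helper name.

```python
import numpy as np


def coulomb_matrix(Z, R):
    """Coulomb matrix for nuclear charges Z (n,) and positions R (n, 3)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    # Pairwise distances |R_i - R_j|
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    # Ones on the diagonal avoid division by zero; the diagonal is overwritten below
    np.fill_diagonal(dist, 1.0)
    C = np.outer(Z, Z) / dist
    np.fill_diagonal(C, 0.5 * Z ** 2.4)
    return C


# Illustrative CH4-like geometry (made-up coordinates)
Z = np.array([6.0, 1.0, 1.0, 1.0, 1.0])
R = 1.2 * np.array(
    [[0, 0, 0], [1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float
)

# Random orthogonal matrix (a rotation, possibly combined with a reflection) plus a shift
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_moved = R @ Q.T + rng.normal(size=3)

C = coulomb_matrix(Z, R)
C_moved = coulomb_matrix(Z, R_moved)
print(np.allclose(C, C_moved))
```

The diagonal entries depend only on the charges, which is why carbon gives $0.5 \cdot 6^{2.4} \approx 36.86$ and hydrogen gives $0.5$.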


In [202]:
import pandas as pd
import scipy.io
import numpy as np
from scipy.spatial.distance import pdist, squareform
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import time
import torch
import os

rand_state = 42

np.random.seed(rand_state)

In [9]:
qm7 = scipy.io.loadmat('../data/qm7.mat')
qm7.keys()

dict_keys(['__header__', '__version__', '__globals__', 'X', 'R', 'Z', 'T', 'P'])

In [46]:
# coulomb matrix of first molecule

qm7['X'][0][:5,:5]

array([[36.858105  ,  2.9076326 ,  2.907612  ,  2.9075644 ,  2.9053485 ],
       [ 2.9076326 ,  0.5       ,  0.29672   ,  0.29671896,  0.2966784 ],
       [ 2.907612  ,  0.29672   ,  0.5       ,  0.29671845,  0.29667813],
       [ 2.9075644 ,  0.29671896,  0.29671845,  0.5       ,  0.29667678],
       [ 2.9053485 ,  0.2966784 ,  0.29667813,  0.29667678,  0.5       ]],
      dtype=float32)

In [48]:
# atomic (nuclear) charges (see periodic table)
#    - one carbon (Z=6) and four hydrogens (Z=1), i.e. CH4 (methane)
# note: the atom ordering varies per molecule
qm7['Z'][0][:5]

array([6., 1., 1., 1., 1.], dtype=float32)

### Formatting the dataset for Datasets.py

In [121]:
print('total atoms with charges: ', (qm7['Z']>0).sum())
print('total atoms with coordinates: ', ((qm7['R']**2).sum(axis=2)>0).sum())
print('the same thus former can be used from self.num_nodes')

total atoms with charges:  110650
total atoms with coordinates:  110650
the same thus former can be used from self.num_nodes
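The two masks above agree on QM7, but they are not equivalent in general: an atom placed exactly at the origin has zero coordinates and would be missed by the coordinate mask. A minimal sketch on synthetic padded arrays (not QM7 data) illustrates why the charge mask is the safer choice for counting atoms per molecule:

```python
import numpy as np

# Synthetic stand-ins: 2 molecules padded to 4 atom slots
Z = np.array([[6, 1, 1, 0], [8, 1, 0, 0]], dtype=np.float32)
R = np.zeros((2, 4, 3), dtype=np.float32)
R[0, :3] = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # first atom at the origin
R[1, :2] = [[0.5, 0.5, 0.5], [1.5, 0.5, 0.5]]

# Count atoms per molecule with each mask
atoms_by_charge = (Z > 0).sum(axis=1)
atoms_by_coordinates = ((R ** 2).sum(axis=2) > 0).sum(axis=1)

print(atoms_by_charge.tolist())
print(atoms_by_coordinates.tolist())
```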


In [187]:
# it needs the following attributes

# GRAPH/MOLECULE RELATED:

# Number of graphs in the dataset, i.e. molecules
num_graphs = len(qm7['T'][0]) # T is atomization energies (target)

# Graph list: one index per molecule
graph_list = torch.tensor(range(num_graphs))

# the energy of each molecule
molecule_energy = torch.tensor(qm7['T'][0])

# NODE/ATOM RELATED:

# Nodes are atoms; each atom gets its own global index
# counting charges > 0 skips the zero padding (no atom has a zero or negative charge, see above)
num_nodes = int((qm7['Z']>0).sum())

node_list = torch.tensor(range(num_nodes))

# Node graph index, molecule number each atom belongs to
node_graph_index = []

# Node coordinates
node_coordinates = []

# Node atomic charge
node_charge = [0]*num_nodes # currently empty


# EDGE RELATED:

# Edge list: each molecule is treated as a fully connected graph
edge_list = []

# the coulomb value for each edge
edge_coulomb = []


# keeping note of atom indices globally (i.e. for all graphs and w.r.t. num_nodes)
global_idx = 0
# looping each molecule
for molecule in graph_list:
    
    # number of real (non-padding) atoms; assumes padding slots come last
    n_atoms = int((qm7['Z'][molecule] > 0).sum())
    # global index of each node in the current graph
    nodes_idx_graph = [local_idx + global_idx for local_idx in range(n_atoms)]
    
    # looping each atom/node in current molecule
    for node_idx in range(n_atoms):
        node_graph_index.append(int(molecule)) # saving which molecule this atom belongs to
        node_coordinates.append(qm7['R'][molecule][node_idx]) # saving the node's/atom's coordinates
        node_charge[global_idx] = qm7['Z'][molecule][node_idx] # saving the node's/atom's charge
        
        # looping all neighbouring nodes/atoms in graph/molecule (based on global node index)
        # creating edge list, note: fully connected
        for idx, neighbouring_node in enumerate(nodes_idx_graph):
            # skip the current atom (no edges from a node to itself)
            if neighbouring_node != global_idx:
                # creating the edge list
                edge_list.append([global_idx, neighbouring_node])
                # coulomb value per edge, note: symmetric
                edge_coulomb.append(qm7['X'][molecule][node_idx, idx])
        
        global_idx += 1

assert num_nodes == global_idx, 'inconsistencies noticed'

#self.node_graph_index = torch.tensor(node_graph_index)

#self.node_coordinates = torch.tensor(node_coordinates)

#self.node_charge = torch.tensor(node_charge)

#self.edge_list = torch.tensor(edge_list)

#self.edge_coulomb = torch.tensor(edge_coulomb)
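The per-molecule edge construction in the loop above can also be written without the inner Python loops. The following is a sketch under the same assumption as the loop (real atoms occupy the first slots of the padded arrays); `molecule_edges` is a hypothetical helper, and the small matrix is a made-up stand-in for one padded Coulomb matrix.

```python
import numpy as np


def molecule_edges(C, n_atoms, offset=0):
    """Directed edges (no self-loops) of one fully connected molecule.

    C is the padded Coulomb matrix, n_atoms the number of real atoms and
    offset the global index of the molecule's first atom.
    """
    # All (row, col) pairs with row != col, in row-major order
    src, dst = np.nonzero(~np.eye(n_atoms, dtype=bool))
    edges = np.stack([src + offset, dst + offset], axis=1)
    weights = C[src, dst]
    return edges, weights


# Made-up padded Coulomb matrix: 3 real atoms, 1 padding slot
C = np.array([[36.9, 2.9, 2.8, 0.0],
              [2.9, 0.5, 0.3, 0.0],
              [2.8, 0.3, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
edges, weights = molecule_edges(C, n_atoms=3, offset=10)
print(edges.tolist())
print(weights.tolist())
```

A molecule with $n$ atoms contributes $n(n-1)$ directed edges, which gives a quick sanity check on the length of `edge_list`.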

In [194]:
import Datasets

In [193]:
Datasets.QM7

AttributeError: module 'Datasets' has no attribute 'QM7'

Into the train.py script

In [207]:
path = os.getcwd()
path

'/Users/arond.jacobsen/Documents/GitHub/uq-gnn/content/notebooks'

In [201]:
path = pwd

NameError: name 'pwd' is not defined

In [None]:
import sys

sys.path.insert(1, '/path/to/application/app/folder')
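A slightly more defensive variant of the path insertion is sketched below. The helper `add_to_sys_path` and the directory layout (module one level above the notebook folder) are assumptions for illustration, not the project's actual structure.

```python
import os
import sys


def add_to_sys_path(directory):
    """Add `directory` to sys.path once so modules inside it can be imported."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(1, directory)
    return directory


# Hypothetical layout: the module lives one level above the current working directory
module_dir = add_to_sys_path(os.path.join(os.getcwd(), os.pardir))
print(module_dir in sys.path)
```

Resolving the path with `os.path.abspath` and checking for membership first avoids duplicate entries when the cell is re-run.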