# Notes on `get_data.py`

## File Preparation

The first step is to download the QM9 database. Because the download is quite slow from China, I downloaded the database archive manually beforehand and commented out the download step in the code.

(From China, a VPN is recommended for downloading the database file.)

> The following code starts at Line 1 of `get_data.py`.

In [1]:
#! /usr/local/share/ajzapps/anaconda3/bin/python

import os
from rdkit import Chem
import glob
import json
import numpy as np

if not os.path.exists('data'):
    os.mkdir('data')
    print('made directory ./data/')

download_path = os.path.join('data', 'dsgdb9nsd.xyz.tar.bz2')
#if not os.path.exists(download_path):
#    print('downloading data to %s ...' % download_path)
#    source = 'https://ndownloader.figshare.com/files/3195389'
#    os.system('wget -O %s %s' % (download_path, source))
#    print('finished downloading')

unzip_path = os.path.join('data', 'qm9_raw')
if not os.path.exists(unzip_path):
    print('extracting data to %s ...' % unzip_path)
    os.mkdir(unzip_path)
    os.system('tar xvjf %s -C %s' % (download_path, unzip_path))
    print('finished extracting')

extracting data to data/qm9_raw ...
finished extracting


In the cell above,
* Line 4: [RDKit](http://www.rdkit.org/) is a cheminformatics package.
* Lines 9-11: Create the directory `data` if it does not exist.
* Line 13: Specify the path of the compressed database archive.
  * Before running the program, I created a directory named `data` in the current path and copied the file [`dsgdb9nsd.xyz.tar.bz2`](https://ndownloader.figshare.com/files/3195389) (~82 MB) into it. However, I had not created `data/qm9_raw` yet.
* Lines 20-25: Extract all files into the directory `data/qm9_raw`.
  * This takes a few minutes (probably no more than 3).
  * After extraction, the file names in `data/qm9_raw` are `dsgdb9nsd_??????.xyz`, where `??????` runs from `000001` to `133885`.
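
The extraction step above shells out to `tar`, which assumes a Unix-like system. As a rough alternative, the same step can be sketched with Python's standard `tarfile` module. The helper name `extract_archive` is hypothetical and is not part of `get_data.py`.

```python
import os
import tarfile


def extract_archive(archive_path, target_dir):
    """Extract a .tar.bz2 archive into target_dir (hypothetical helper)."""
    os.makedirs(target_dir, exist_ok=True)
    # 'r:bz2' opens a bzip2-compressed tar archive for reading.
    with tarfile.open(archive_path, 'r:bz2') as archive:
        archive.extractall(target_dir)
    # Return the extracted entry names so the caller can check the result.
    return sorted(os.listdir(target_dir))
```

In the notebook this would be called as `extract_archive(download_path, unzip_path)`. As with `tar`, it should only be used on archives from a trusted source.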

## Data Reading

The function `preprocess()` is effectively the main program, so instead of treating it as a whole, we split it into the following parts.

### Training / validation split

The training and validation sets are predefined. The indices of the validation set are stored in the JSON file `valid_idx.json`. The validation set contains 13082 molecules.
* However, I believe that no test set is defined in this implementation. Moreover, Faber et al. (JCTC 2017) exclude some molecules that fail certain checks (such as the SMILES string test), but this implementation probably does not exclude them.

> The following code starts at Line 38 of `get_data.py`.

In [2]:
print('loading train/validation split')
with open('valid_idx.json', 'r') as f:
    valid_idx = json.load(f)['valid_idxs']
valid_files = [os.path.join(unzip_path, 'dsgdb9nsd_%s.xyz' % i) for i in valid_idx]

loading train/validation split


### Parse .xyz file

This part of the code extracts the information we need from each file: the SMILES string and a single property, the dipole moment (in Debye).

> The following code starts at Line 28 of `get_data.py`.

In [5]:
index_of_mu = 4

def read_xyz(file_path):
    with open(file_path, 'r') as f:
        lines = f.readlines()
        smiles = lines[-2].split('\t')[0]
        properties = lines[1].split('\t')
        mu = float(properties[index_of_mu])
    return {'smiles': smiles, 'mu': mu}

An example of the function above:

In [7]:
read_xyz("data/qm9_raw/dsgdb9nsd_000588.xyz")

{'mu': 0.1174, 'smiles': 'CCCC1CC1'}
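
To make the indexing in `read_xyz` easier to follow, here is a small sketch that runs the same logic on a synthetic file. The file only mimics the layout the parser relies on (property line second, SMILES line second to last); every number in it is a placeholder, not real QM9 data.

```python
import os
import tempfile

INDEX_OF_MU = 4  # same position as index_of_mu in the cell above


def read_xyz(file_path):
    """Same logic as the notebook's read_xyz, repeated to keep the sketch self-contained."""
    with open(file_path, 'r') as f:
        lines = f.readlines()
    smiles = lines[-2].split('\t')[0]
    properties = lines[1].split('\t')
    return {'smiles': smiles, 'mu': float(properties[INDEX_OF_MU])}


# Synthetic file: all values are placeholders chosen for illustration.
fake_lines = [
    '1',                                      # number of atoms
    'gdb 0\t1.0\t2.0\t3.0\t0.5\t9.9',         # tag, then properties; field 4 is mu
    'C\t0.0\t0.0\t0.0\t0.0',                  # one line per atom
    '100.0',                                  # vibrational frequencies
    'C\tC',                                   # SMILES strings (second-to-last line)
    'InChI=1S/CH4/h1H4\tInChI=1S/CH4/h1H4',   # InChI strings (last line)
]

with tempfile.TemporaryDirectory() as tmp_dir:
    fake_path = os.path.join(tmp_dir, 'fake.xyz')
    with open(fake_path, 'w') as f:
        f.write('\n'.join(fake_lines) + '\n')
    parsed = read_xyz(fake_path)

print(parsed)  # {'smiles': 'C', 'mu': 0.5}
```

Because the first tab-separated field of the property line holds both the tag and the molecule index, the dipole moment ends up at position 4.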

### Data reading and storage

The following code reads the data and stores it in memory. It may take several minutes to execute.

> The following code starts at Line 43 of `get_data.py`.

In [10]:
print('reading data...')
raw_data = {'train': [], 'valid': []}
all_files = glob.glob(os.path.join(unzip_path, '*.xyz'))
for file_idx, file_path in enumerate(all_files):
    if file_idx % 100 == 0:
        print('%.1f %%    \r' % (file_idx / float(len(all_files)) * 100), end="")
    if file_path not in valid_files:
        raw_data['train'].append(read_xyz(file_path))
    else:
        raw_data['valid'].append(read_xyz(file_path))
all_mu = [mol['mu'] for mol in raw_data['train']]
mean_mu = np.mean(all_mu)
std_mu = np.std(all_mu)

reading data...
99.9 %    

In the cell above,
* Line 3: `glob.glob` collects every file path matching `./data/qm9_raw/*.xyz` into the list `all_files`.
* Lines 5-6: Print the current progress as a percentage. `\r` is essential here: it returns the cursor to the start of the line, so each new output overwrites the previous one.
* Lines 7-10: Save the SMILES string and the dipole moment to the data list `raw_data`. The split is crude: molecules are only divided into training and validation sets; no test set is held out and no molecules are excluded.
* Lines 11-13: Calculate the mean and standard deviation of the dipole moments. The calculation only covers the training set.
  * It should be noted that Faber et al. (JCTC 2017) calculate the mean and standard deviation over all valid molecules (training, validation and test sets together). In other words, they simply assume that the mean and standard deviation of the dipole moments are the same for the training, validation and test sets. This implementation makes no such assumption, since its statistics come from the training set alone.
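
One reason the cell above is slow is that `file_path not in valid_files` scans a list on every iteration. A minimal sketch of the same split with a `set` lookup follows; `split_files` is a hypothetical helper, and it assumes the paths in both lists are built the same way, so that equal files give equal strings.

```python
def split_files(all_files, valid_files):
    """Partition paths into train/valid lists (hypothetical helper)."""
    valid_set = set(valid_files)  # average O(1) membership test
    split = {'train': [], 'valid': []}
    for file_path in all_files:
        section = 'valid' if file_path in valid_set else 'train'
        split[section].append(file_path)
    return split


example = split_files(['a.xyz', 'b.xyz', 'c.xyz'], ['b.xyz'])
print(example)  # {'train': ['a.xyz', 'c.xyz'], 'valid': ['b.xyz']}
```

The result of the split is the same as with the list; only the lookup cost changes.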

### Small utilities 

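This small utility takes a dipole value (Debye) and returns the normalized dipole, i.e. the value shifted by the training-set mean and scaled by the training-set standard deviation (a Gaussian distribution is assumed).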

> The following code starts at Line 57 of `get_data.py`.

In [20]:
def normalize_mu(mu):
    return (mu - mean_mu) / std_mu
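
As a quick check of what this normalization does, the sketch below applies it to a few placeholder values (not QM9 data) using only the standard library. `statistics.pstdev` is the population standard deviation, which matches the default of `np.std`. The inverse function `denormalize_mu` is hypothetical; it is not in `get_data.py`, but something like it is needed to turn a model prediction back into Debye.

```python
import statistics

train_mu = [0.0, 1.0, 2.0, 3.0]  # placeholder values, not QM9 data
mean_mu = statistics.mean(train_mu)
std_mu = statistics.pstdev(train_mu)  # population std, like np.std's default


def normalize_mu(mu):
    return (mu - mean_mu) / std_mu


def denormalize_mu(z):
    """Hypothetical inverse: map a normalized value back to Debye."""
    return z * std_mu + mean_mu


normalized = [normalize_mu(mu) for mu in train_mu]
# The normalized training values have mean 0 and standard deviation 1
# (up to floating-point rounding).
print(statistics.mean(normalized), statistics.pstdev(normalized))
```

Validation values normalized with the same `mean_mu` and `std_mu` will generally not have exactly zero mean or unit deviation, because the statistics come from the training set only.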

This small utility generates one-hot encodings. In this implementation, `onehot` is only used to encode the element of each atom, but the function itself is general.

> The following code starts at Line 60 of `get_data.py`.

In [21]:
def onehot(idx, length):
    z = [0 for _ in range(length)]
    z[idx] = 1
    return z
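
To illustrate that the function is not tied to atoms, here is a sketch that applies it to bond types instead. This is purely illustrative: `get_data.py` itself stores the bond type as a plain integer, not as a one-hot vector.

```python
def onehot(idx, length):
    """Return a list of zeros of the given length with a 1 at position idx."""
    z = [0] * length
    z[idx] = 1
    return z


bond_types = ['SINGLE', 'DOUBLE', 'TRIPLE', 'AROMATIC']
encoded = onehot(bond_types.index('DOUBLE'), len(bond_types))
print(encoded)  # [0, 1, 0, 0]
```

An index outside the valid range raises `IndexError`, and `list.index` raises `ValueError` for an unknown label, so unexpected categories fail loudly instead of being encoded silently.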

## SMILES String

The following code generates simple node and edge information from a SMILES string.

> The following code starts at Line 65 of `get_data.py`.

In [23]:
bond_dict = {'SINGLE': 1, 'DOUBLE': 2, 'TRIPLE': 3, "AROMATIC": 4}
def to_graph(smiles):
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)
    edges = []
    nodes = []
    for bond in mol.GetBonds():
        edges.append((bond.GetBeginAtomIdx(), bond_dict[str(bond.GetBondType())], bond.GetEndAtomIdx()))
    for atom in mol.GetAtoms():
        nodes.append(onehot(["H", "C", "N", "O", "F"].index(atom.GetSymbol()), 5))
    return nodes, edges

We will use an example to illustrate how this works. The molecule is formamide (HCONH2).

### SMILES string explanation

The bond connectivity can be drawn as follows (left: without numbering; right: with numbering):
```
        H                  H5
        |                  |
H - N - C = O    H3 - N0 - C1 = O2
    |                 |
    H                 H4
```
The SMILES string for this molecule is `NC=O`, which is simple enough to read by hand. `N` and `C` are written next to each other with no bond symbol, so the bond order between `N0` and `C1` is one. `C` and `O` are joined by `=`, so the bond order between `C1` and `O2` is two. The atom indices follow the order in which the atoms appear in the string. Hydrogens are implicit in the string; `Chem.AddHs` adds them, and in the output below they are numbered after the heavy atoms.

However, many SMILES strings are hard to interpret by hand. [RDKit](http://www.rdkit.org/) handles the parsing for us.

### Output node and edge information

We now take a quick look at the node and edge information. This information is later dumped to .json files and serves as the input features of the MPNN learning program.

In [30]:
to_graph("NC=O")

([[0, 0, 1, 0, 0],
  [0, 1, 0, 0, 0],
  [0, 0, 0, 1, 0],
  [1, 0, 0, 0, 0],
  [1, 0, 0, 0, 0],
  [1, 0, 0, 0, 0]],
 [(0, 1, 1), (1, 2, 2), (0, 1, 3), (0, 1, 4), (1, 1, 5)])

The first element (nodes) of the tuple holds the one-hot code of the chemical element of each atom in the molecule.

For example, the one-hot code of atom O(2) is:

In [33]:
onehot(["H", "C", "N", "O", "F"].index("O"), 5)

[0, 0, 0, 1, 0]

The second element (edges) of the tuple lists each bond as a triple `(begin atom index, bond order, end atom index)`.

In this molecule, all the bond information can be listed in the following table:

 Idx | Begin Atom | Bond Order | End Atom | Edge Feature
-----|------------|------------|----------|---------
0 | N(0) | 1 | C(1) | `(0, 1, 1)`
1 | C(1) | 2 | O(2) | `(1, 2, 2)`
2 | N(0) | 1 | H(3) | `(0, 1, 3)`
3 | N(0) | 1 | H(4) | `(0, 1, 4)`
4 | C(1) | 1 | H(5) | `(1, 1, 5)`
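
The edge triples in the table can be turned into a symmetric bond-order matrix, which makes it easy to check them against the structure drawn earlier. The following is a minimal sketch; `edges_to_adjacency` is a hypothetical helper that is not part of `get_data.py`.

```python
def edges_to_adjacency(edges, num_nodes):
    """Build a symmetric bond-order matrix from (begin, order, end) triples."""
    adjacency = [[0] * num_nodes for _ in range(num_nodes)]
    for begin, order, end in edges:
        # Bonds are undirected, so fill both directions.
        adjacency[begin][end] = order
        adjacency[end][begin] = order
    return adjacency


formamide_edges = [(0, 1, 1), (1, 2, 2), (0, 1, 3), (0, 1, 4), (1, 1, 5)]
adjacency = edges_to_adjacency(formamide_edges, 6)

# Row sums give the total bond order per atom: N, C, O, H, H, H.
valences = [sum(row) for row in adjacency]
print(valences)  # [3, 4, 2, 1, 1, 1]
```

The row sums match the usual valences of N, C, O and H. This check only works for molecules without aromatic bonds, because `bond_dict` labels aromatic bonds with `4`, which is a type label rather than a true bond order.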

## Data parsing and dumping

The following code converts the SMILES strings to feature vectors and normalizes the target values (dipole moments). The results are dumped to the .json files `molecules_train.json` and `molecules_valid.json`.

> The following code starts at Line 77 of `get_data.py`.

In [38]:
print('parsing smiles as graphs...')
processed_data = {'train': [], 'valid': []}
for section in ['train', 'valid']:
    for i,(smiles, mu) in enumerate([(mol['smiles'], mol['mu']) for mol in raw_data[section]]):
        if i % 100 == 0:
            print('%s: %.1f %%      \r' % (section, 100*i/float(len(raw_data[section]))), end="")
        nodes, edges = to_graph(smiles)
        processed_data[section].append({
            'targets': [[normalize_mu(mu)]],
            'graph': edges,
            'node_features': nodes
        })
    print('%s: 100 %%      ' % (section))
    with open('molecules_%s.json' % section, 'w') as f:
        json.dump(processed_data[section], f)

parsing smiles as graphs...
train: 100 %        
valid: 100 %       


The following example code shows what one molecule record in the file looks like.

In [42]:
for section in ['train']:
    for i,(smiles, mu) in enumerate([(mol['smiles'], mol['mu']) for mol in raw_data[section][0:1]]):
        if i % 100 == 0:
            print('%s: %.1f %%      \r' % (section, 100*i/float(len(raw_data[section]))), end="")
        nodes, edges = to_graph(smiles)
        temp_json = {
            'targets': [[normalize_mu(mu)]],
            'graph': edges,
            'node_features': nodes
        }
print(temp_json)

train: 0.0 %      {'targets': [[-0.41439441617167755]], 'graph': [(0, 1, 1), (1, 2, 2), (1, 1, 3), (3, 1, 4), (3, 2, 5), (5, 1, 6), (6, 1, 7), (7, 1, 8), (8, 1, 6), (0, 1, 9), (0, 1, 10), (0, 1, 11), (4, 1, 12), (4, 1, 13), (6, 1, 14), (7, 1, 15), (7, 1, 16), (8, 1, 17), (8, 1, 18)], 'node_features': [[0, 1, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 1, 0, 0], [0, 1, 0, 0, 0], [0, 1, 0, 0, 0], [0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]}
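
One detail to keep in mind when reading the dumped files back: JSON has no tuple type, so the edge tuples come back as lists. The sketch below shows the round trip on a small placeholder record (the heavy-atom skeleton of `NC=O` with a made-up target value), using an in-memory string instead of a real file.

```python
import json

record = {
    'targets': [[0.0]],  # placeholder target, not a real normalized dipole
    'graph': [(0, 1, 1), (1, 2, 2)],
    'node_features': [[0, 0, 1, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 1, 0]],
}

# Serialize and parse again, as json.dump / json.load would do with a file.
restored = json.loads(json.dumps([record]))[0]
print(restored['graph'])  # [[0, 1, 1], [1, 2, 2]]

# Convert back to tuples if the downstream code expects them.
edges = [tuple(edge) for edge in restored['graph']]
```

Whether the downstream MPNN code needs tuples or accepts lists is not covered by these notes, so the last conversion step is optional.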
