# Load in the Data, Make Pandas-able
The raw data for this project is an ASE Database that holds the results of the water cluster calculations.
We need to convert these ASE `Atoms` objects, which list the atomic coordinates and energy, into a graph representation.
This notebook contains the code to compute the graph structure for each entry in the database and save the results as TensorFlow record files.

In [1]:
from graphsage.importing import create_graph, make_entry, make_tfrecord, make_nfp_network
from multiprocessing import Pool
from ase.db import connect
from tqdm import tqdm
from random import random
import tensorflow as tf
import pandas as pd
import numpy as np
import zipfile
import json
import gzip
import os

Configuration settings

In [2]:
val_fraction = 0.1  # Fraction of entries set aside for validation
data_path = os.path.join('data', 'input', 'tutorial.db')

## Connect to the ASE Database
This database stores the output of our simulations in a form that is easy to use with atomic structure analysis codes.

In [3]:
ase_db = connect(data_path)
print(f'Connected to database with {len(ase_db)} records')

Connected to database with 32767 records


## Convert ASE Objects to Networkx Graphs
The code here is adapted from the [Exalearn:Design GitHub page](https://github.com/exalearn/design/blob/16cfe21d85528c6004514d2985428566453b24a1/graph_descriptors/graph_builder.py).
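The graph-building code itself lives in the linked repository and is not reproduced in this notebook. As a rough illustration of the idea only, the sketch below classifies O-H contacts by distance. The cutoff values, the `build_bonds` helper, and the hand-placed water dimer are all hypothetical and are not taken from `graph_builder.py` or from the database.

```python
import math

# Hypothetical distance cutoffs in Angstrom, chosen only for illustration
COVALENT_CUTOFF = 1.2
HBOND_CUTOFF = 2.5


def build_bonds(symbols, positions):
    """Classify O-H contacts as covalent (0) or hydrogen (1) bonds by distance."""
    atom = [0 if s == 'O' else 1 for s in symbols]  # 0 -> Oxygen, 1 -> Hydrogen
    edges = []
    for i in range(len(atom)):
        for j in range(len(atom)):
            if atom[i] == atom[j]:
                continue  # Only O-H pairs are considered (this also skips i == j)
            dist = math.dist(positions[i], positions[j])
            if dist < COVALENT_CUTOFF:
                edges.append((i, j, 0))
            elif dist < HBOND_CUTOFF:
                edges.append((i, j, 1))
    edges.sort()  # Ascending by first atom index, then by second
    return {
        'n_atom': len(atom),
        'n_bond': len(edges),
        'atom': atom,
        'bond': [bond_type for _, _, bond_type in edges],
        'connectivity': [[i, j] for i, j, _ in edges],
    }


# Two water molecules placed by hand (not taken from the database)
symbols = ['O', 'H', 'H', 'O', 'H', 'H']
positions = [
    (0.00, 0.00, 0.0), (0.96, 0.00, 0.0), (-0.24, 0.93, 0.0),
    (2.80, 0.00, 0.0), (3.10, 0.90, 0.0), (3.10, -0.90, 0.0),
]
dimer = build_bonds(symbols, positions)
print(dimer['connectivity'])
print(dimer['bond'])
```

Each bond is stored in both directions, which matches the paired rows such as `[0, 1]` and `[1, 0]` in the connectivity output shown later in this notebook.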

Loop over the whole database

In [4]:
def pull_from_database(ase_db, chunk_size=128, total=None):
    """Iterate over a large database
    
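    Queries only a small chunk at a time to prevent loading the
    whole database into memory.

    Args:
        ase_db: Connection to the ASE database
        chunk_size (int): Number of consecutive row IDs requested per query
        total (int): Upper bound on the chunk start IDs.
            Defaults to the number of rows in the database
    Yields:
        (ase.Atoms) Structure for each row, with chunks visited in random order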
    """
    # Figure out how many iterations we need to make
    if total is None:
        total = ase_db.count()
    
    # Generate the dataset
    starts = np.arange(0, total, chunk_size, dtype=np.int32)
    
    # Randomize the starts to help the diversity
    np.random.shuffle(starts)
    
    # Iterate through the whole database
    for start in starts:
        for a in ase_db.select(
            selection=[('id','>=', str(start)), ('id', '<', str(start+chunk_size))]):
            yield a.toatoms()
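The chunking logic can be checked without a database. The sketch below is a standalone illustration, assuming row IDs are consecutive and start at 1; `id_chunks` and `covered_ids` are hypothetical helpers, not part of the notebook. Unlike the generator above, it starts the ranges at the first ID and clips the last range, so every ID is covered even when the row count is an exact multiple of the chunk size.

```python
import random


def id_chunks(total, chunk_size=128, first_id=1, seed=None):
    """Return shuffled half-open (start, stop) ID ranges covering `total` rows."""
    stop_id = first_id + total  # One past the last valid ID
    starts = list(range(first_id, stop_id, chunk_size))
    random.Random(seed).shuffle(starts)  # Visit chunks in random order
    return [(start, min(start + chunk_size, stop_id)) for start in starts]


def covered_ids(chunks):
    """List every ID touched by the ranges, sorted for easy comparison."""
    return sorted(i for start, stop in chunks for i in range(start, stop))


chunks = id_chunks(total=256, chunk_size=128, seed=0)
print(chunks)
print(len(covered_ids(chunks)))
```

Each `(start, stop)` pair maps onto the `id >= start` and `id < stop` conditions used in the `select` call.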

## Save the Data as NFP-ready TensorFlow record objects
Each cluster is stored as a record with the following fields, shown here as JSON:

```json
{
  "entry": "entry number as an integer",
  "energy": "energy as a float",
  "n_waters": "number of water molecules as an integer",
  "n_atom": "number of atoms as an integer",
  "n_bond": "number of bonds as an integer",
  "atom": "List of atom types (0 -> Oxygen, 1 -> Hydrogen)",
  "bond": "List of bond types (0 -> covalent, 1 -> Hydrogen)",
  "connectivity": "List of connections between atoms, as a list of pairs of ints. Sorted ascending by column 0, then 1"
}
```
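A quick consistency check on a record can catch mistakes before anything is written to disk. The sketch below is one possible check, assuming the key names printed by `make_nfp_network` (for example `n_bond`) and three atoms per water molecule. `check_record` is a hypothetical helper, and the single-water record is made up for illustration with a placeholder energy.

```python
def check_record(record):
    """Return a list of problems found in a record; an empty list means it passed."""
    problems = []
    if len(record['atom']) != record['n_atom']:
        problems.append('atom list length does not match n_atom')
    if len(record['bond']) != record['n_bond']:
        problems.append('bond list length does not match n_bond')
    if len(record['connectivity']) != record['n_bond']:
        problems.append('connectivity length does not match n_bond')
    if record['n_atom'] != 3 * record['n_waters']:
        problems.append('n_atom is not three times n_waters')
    if not set(record['atom']) <= {0, 1}:
        problems.append('unknown atom type')
    if not set(record['bond']) <= {0, 1}:
        problems.append('unknown bond type')
    pairs = [tuple(pair) for pair in record['connectivity']]
    if pairs != sorted(pairs):
        problems.append('connectivity is not sorted by column 0, then 1')
    return problems


# Made-up record for one water molecule; the energy is a placeholder value
single_water = {
    'energy': 0.0,
    'n_waters': 1,
    'n_atom': 3,
    'n_bond': 4,
    'atom': [0, 1, 1],
    'bond': [0, 0, 0, 0],
    'connectivity': [[0, 1], [0, 2], [1, 0], [2, 0]],
}
print(check_record(single_water))
```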

Pull a single cluster and make its network graph

In [5]:
atoms = next(pull_from_database(ase_db, total=2))

In [6]:
make_nfp_network(atoms)

{'energy': -86.9336243,
 'n_waters': 10,
 'n_atom': 30,
 'n_bond': 68,
 'atom': [0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
  0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
 'bond': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
  1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
  0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
  1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'connectivity': array([[ 0,  1],
        [ 0,  2],
        [ 0,  5],
        [ 1,  0],
        [ 1, 27],
        [ 2,  0],
        [ 2,  9],
        [ 3,  4],
        [ 3,  5],
        [ 3, 25],
        [ 4,  3],
        [ 4, 15],
        [ 5,  0],
        [ 5,  3],
        [ 6,  7],
        [ 6,  8],
        [ 6, 22],
        [ 7,  6],
        [ 7, 15],
        [ 8,  6],
        [ 9,  2],
        [ 9, 10],
        [

Generate many records in parallel

In [7]:
os.makedirs(os.path.join('data', 'output'), exist_ok=True)

In [8]:
with tf.io.TFRecordWriter(os.path.join('data', 'output', 'water_clusters.proto')) as writer:
    with tf.io.TFRecordWriter(os.path.join('data', 'output', 'water_clusters_validation.proto')) as writer_val:
        with Pool(max(os.cpu_count() - 1, 1)) as p:  # Keep one CPU for reading the database and writing to output, but always use at least one worker
            for entry in tqdm(
                p.imap(make_tfrecord, pull_from_database(ase_db), chunksize=64),
                total=len(ase_db)
            ):
                if random() < val_fraction:
                    writer_val.write(entry)
                else:
                    writer.write(entry)

100%|██████████| 32767/32767 [02:07<00:00, 256.86it/s]
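The split above draws from the unseeded global `random()`, so the train and validation sets change on every run. If a repeatable split is wanted, one option is a dedicated seeded generator. The sketch below is an illustration only: `assign_splits` and the seed value are hypothetical and not part of the notebook.

```python
from random import Random


def assign_splits(n_entries, val_fraction=0.1, seed=1):
    """Label each entry 'train' or 'validation' using a seeded generator."""
    rng = Random(seed)  # Private generator, unaffected by other random calls
    return [
        'validation' if rng.random() < val_fraction else 'train'
        for _ in range(n_entries)
    ]


splits = assign_splits(1000, val_fraction=0.1, seed=1)
print(splits.count('validation'), splits.count('train'))
```

Because `pull_from_database` shuffles the chunk order with `np.random.shuffle`, NumPy's generator would also need a fixed seed for the same clusters to land in the same split on every run.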
