# Notes on `chem_tensorflow_dense.py`

This program was downloaded on January 25, 2018 from the GitHub repository [gated-graph-neural-network-samples](https://github.com/Microsoft/gated-graph-neural-network-samples) maintained by Microsoft. The commit hash is [049a8a2](https://github.com/Microsoft/gated-graph-neural-network-samples/commit/049a8a2c51e74c1bd75f4873fbe1c9beff7250a2).

The job was tested at 2018-01-25-12-21-43.

This Jupyter notebook is not executable.

The code has been slightly modified. At lines 48-49, the key `'task_sample_ratios'` was missing from the dictionary and has been added.

## `docopt` Option Interface

We first explain how the main program works. At line 194 of `chem_tensorflow_dense.py`, the main program reads:

In [None]:
def main():
    args = docopt(__doc__)
    try:
        model = DenseGGNNChemModel(args)
        model.train()
    except:
        typ, value, tb = sys.exc_info()
        traceback.print_exc()
        pdb.post_mortem(tb)


if __name__ == "__main__":
    main()

Line 2 of the code above parses the command-line inputs. The accepted options are defined in the module docstring, which starts at line 2 of `chem_tensorflow_dense.py`:

In [None]:
"""
Usage:
    chem_tensorflow_dense.py [options]

Options:
    -h --help                Show this screen.
    --config-file FILE       Hyperparameter configuration file path (in JSON format)
    --config CONFIG          Hyperparameter configuration dictionary (in JSON format)
    --log_dir NAME           log dir name
    --data_dir NAME          data dir name
    --restore FILE           File to restore weights from.
    --freeze-graph-model     Freeze weights of graph model components.
"""

For more information on how the program interprets the command line, see the GitHub page of [docopt](https://github.com/docopt/docopt). What matters here is that the options are stored in a dictionary `args`, whose keys include `--config-file` and `--config`.

## Abnormal Termination

Lines 199-202 of `chem_tensorflow_dense.py` define how the program behaves when training fails. The following cells illustrate how this works.

We first trigger an error deliberately.

In [42]:
import sys, traceback
try:
    lst.index("a")
except:
    typ, value, tb = sys.exc_info()

The same error is raised by running the statement directly:

In [38]:
lst.index("a")

NameError: name 'lst' is not defined

`typ` and `value` are easy to inspect:

In [39]:
typ

NameError

In [40]:
value

NameError("name 'lst' is not defined")

Traceback information is stored in `tb`. To reproduce the traceback output, we can use the following code, which prints the information to standard error by default:

In [44]:
traceback.print_exception(typ, value, tb)

Traceback (most recent call last):
  File "<ipython-input-42-b564f2ba8832>", line 3, in <module>
    lst.index("a")
NameError: name 'lst' is not defined


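`traceback.print_exc()` is shorthand for the call above. It reads the exception that is currently being handled, so it has to be called inside the `except` block; in a new Jupyter notebook cell there is no active exception left to print.

`pdb.post_mortem(tb)` starts the debugger at the point where the exception was raised. We do not discuss it in depth here.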
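The error-handling pattern of `main` can be tried in isolation. The following is a minimal sketch, not the repository's code: the function name `risky` is hypothetical, and `traceback.format_exc()` is used instead of `traceback.print_exc()` only so that the text can be kept in a variable.

```python
import sys
import traceback


def risky():
    # Hypothetical failing function: `lst` is deliberately left undefined.
    return lst.index("a")  # noqa: F821


captured = None
try:
    risky()
except Exception:
    # Inside the except block an exception is being handled, so both
    # sys.exc_info() and traceback.format_exc() can see it.
    typ, value, tb = sys.exc_info()
    captured = traceback.format_exc()

print(typ.__name__)  # name of the exception class
print(value)         # the exception message
print(captured)      # the text traceback.print_exc() would have printed
```

Because `typ`, `value` and `captured` are ordinary variables, they stay available after the `except` block ends, which is why the notebook cells above can still inspect them.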

## Class Inheritance

We now turn to the classes `DenseGGNNChemModel` and `ChemModel`, which are defined in `chem_tensorflow_dense.py` and `chem_tensorflow.py` respectively.

First, the `DenseGGNNChemModel` class inherits all the methods of `ChemModel`. The inheritance is coded as:

> Following code starting from Line 38 in `chem_tensorflow_dense.py`.

In [None]:
class DenseGGNNChemModel(ChemModel):
    def __init__(self, args):
        super().__init__(args)

This code means that when a `DenseGGNNChemModel` object is created, `super().__init__(args)` runs the initializer of the parent class `ChemModel`, passing along `args` (obtained from the `docopt` package).

Not only the methods but also the default parameters are inherited. The parameters in `chem_tensorflow_dense.py` are:

> Following code starting from Line 43 in `chem_tensorflow_dense.py`.

In [None]:
    def default_params(cls):
        params = dict(super().default_params())
        params.update({
                        'batch_size': 256,
                        'graph_state_dropout_keep_prob': 1.,
                        # Ajz34
                        'task_sample_ratios': {},
                      })
        return params

These entries are applied on top of the defaults defined by the following code in `chem_tensorflow.py`:

> Following code starting from Line 18 in `chem_tensorflow.py`.

In [None]:
    def default_params(cls):
        return {
            'num_epochs': 3000,
            'patience': 25,
            'learning_rate': 0.0001,
            'clamp_gradient_norm': 1.0,
            'out_layer_dropout_keep_prob': 1.0,

            'hidden_size': 100,
            'num_timesteps': 4,
            'use_graph': True,

            'tie_fwd_bkwd': True,
            'task_ids': [0],

            'random_seed': 0,
        }

Both methods are decorated with `@classmethod`. For a discussion of the difference between `classmethod` and `staticmethod`, see this [stackoverflow link](https://stackoverflow.com/questions/136097/what-is-the-difference-between-staticmethod-and-classmethod-in-python).
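The way a subclass extends the default parameters of its parent can be sketched with two small stand-in classes. The class names `BaseModel` and `DenseModel` and the overridden `'patience'` value are hypothetical; the other values are copied from the listings above.

```python
class BaseModel:
    @classmethod
    def default_params(cls):
        # Defaults of the parent class.
        return {'num_epochs': 3000, 'patience': 25, 'random_seed': 0}

    def __init__(self, args):
        self.args = args
        # Resolved on the actual class of `self`, so a subclass's
        # default_params is used when a subclass is instantiated.
        self.params = self.default_params()


class DenseModel(BaseModel):
    @classmethod
    def default_params(cls):
        # Copy the parent's defaults, then add or override entries.
        params = dict(super().default_params())
        params.update({'batch_size': 256, 'patience': 10})
        return params

    def __init__(self, args):
        super().__init__(args)


model = DenseModel({})
print(model.params)
print(BaseModel.default_params())
```

The `dict(...)` copy keeps the subclass from modifying the dictionary returned by the parent, so the parent's defaults stay unchanged.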

## Class `ChemModel` Initialization

In this subsection, we look at what the initialization of `ChemModel` actually does. The code discussed here is in the method `ChemModel.__init__` in the file `chem_tensorflow.py`.

### Collect argument things

> Following code starting from Line 39 in `chem_tensorflow.py`. 8 space indentation is ignored.

In [None]:
# Collect argument things:
data_dir = ''
if '--data_dir' in args and args['--data_dir'] is not None:
    data_dir = args['--data_dir']
self.data_dir = data_dir

self.run_id = "_".join([time.strftime("%Y-%m-%d-%H-%M-%S"), str(os.getpid())])
log_dir = args.get('--log_dir') or '.'
self.log_file = os.path.join(log_dir, "%s_log.json" % self.run_id)
self.best_model_file = os.path.join(log_dir, "%s_model_best.pickle" % self.run_id)

For the code above:
* Line 2-5: If the command-line arguments include `--data_dir`, the data files are read from that directory; otherwise `data_dir` is the empty string and the files are read from the current directory. An example is given in the cells below.
* Line 7: Defines the job's id. All the outputs then carry this id. In my run, the job's id is `2018-01-25-12-21-43_9839`. How the id is generated is also shown in the cells below.
* Line 8: Defines the log directory. If `--log_dir` is not given on the command line, `args.get('--log_dir')` returns `None`; consequently, the log files are written to the current directory `.`.
* Line 9-10: Defines the paths of the log file and of the file holding the best model parameters found during training.

In [80]:
import os
print(os.path.join("", "file_name"))
print(os.path.join("data_dir", "file_name"))

file_name
data_dir/file_name


In [64]:
import time, os
print(time.strftime("%Y-%m-%d-%H-%M-%S"))
print(str(os.getpid()))

2018-01-25-16-17-44
10091


In [85]:
None or '.'

'.'

### Collect parameters

> Following code starting from Line 50 in `chem_tensorflow.py`. 8 space indentation is ignored.

In [None]:
# Collect parameters:
params = self.default_params()
config_file = args.get('--config-file')
if config_file is not None:
    with open(config_file, 'r') as f:
        params.update(json.load(f))
config = args.get('--config')
if config is not None:
    params.update(json.loads(config))
self.params = params
with open(os.path.join(log_dir, "%s_params.json" % self.run_id), "w") as f:
    json.dump(params, f)
print("Run %s starting with following parameters:\n%s" % (self.run_id, json.dumps(self.params)))

For the code above:
* Line 4-9: Note that `json.load` and `json.loads` are two different functions, though they work quite similarly. `json.load` takes a file object (it calls the object's `read()` method), while `json.loads` takes a string. The values from `--config-file` are applied first and those from `--config` second, so the latter take precedence. More information at [Python's documentation on json](https://docs.python.org/2/library/json.html).
* Line 10-12: Stores the final parameters in `self.params` and writes them to a file named `<run_id>_params.json` in the log directory.
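The layering of parameters described above can be sketched as follows. The file name `config.json`, the temporary directory, and all the override values are made up for illustration; only the order of the updates follows the listing.

```python
import json
import os
import tempfile

# A few of the defaults from chem_tensorflow.py.
defaults = {'num_epochs': 3000, 'hidden_size': 100, 'random_seed': 0}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file that would be passed as --config-file.
    config_path = os.path.join(tmp, "config.json")
    with open(config_path, "w") as f:
        json.dump({'hidden_size': 200, 'num_epochs': 10}, f)

    params = dict(defaults)
    with open(config_path, "r") as f:
        params.update(json.load(f))       # json.load reads from a file object

# Hypothetical string that would be passed as --config.
config_string = '{"num_epochs": 5}'
params.update(json.loads(config_string))  # json.loads parses a string

print(params)
```

Since the string is applied last, `num_epochs` ends up with the value from the string, `hidden_size` with the value from the file, and `random_seed` keeps its default.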

### Load data

> Following code starting from Line 64 in `chem_tensorflow.py`. 8 space indentation is ignored.

In [None]:
# Load data:
self.max_num_vertices = 0
self.num_edge_types = 0
self.annotation_size = 0
self.train_data = self.load_data("molecules_train.json", is_training_data=True)
self.valid_data = self.load_data("molecules_valid.json", is_training_data=False)

For the code above:
* Line 4-5: The data files `molecules_train.json` and `molecules_valid.json` are generated by the previous program `get_data.py`.
<a id="ref_load_data"></a>
* The class method [`load_data`](#load_data) will be discussed later. It also updates `self.max_num_vertices`, `self.num_edge_types` and `self.annotation_size`.

### Build the actual model

> Following code starting from Line 71 in `chem_tensorflow.py`. 8 space indentation is ignored.

In [None]:
# Build the actual model
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
self.graph = tf.Graph()
self.sess = tf.Session(graph=self.graph, config=config)
with self.graph.as_default():
    random.seed(params['random_seed'])
    np.random.seed(params['random_seed'])
    tf.set_random_seed(params['random_seed'])
    self.placeholders = {}
    self.weights = {}
    self.ops = {}
    self.make_model()
    self.make_train_step()

    # Restore/initialize variables:
    restore_file = args.get('--restore')
    if restore_file is not None:
        self.restore_model(restore_file)
    else:
        self.initialize_model()

For the code above:
* Line 3: To understand `tf.ConfigProto().gpu_options.allow_growth = True`, we can inspect the TensorFlow source file [`config.proto`](https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/core/protobuf/config.proto) and search for `allow_growth`. This setting makes training start with a small amount of GPU memory instead of pre-allocating the whole memory space; the allocation may then grow as training proceeds. Since GPU memory is allocated dynamically, [memory leaking is possible](http://blog.csdn.net/u012436149/article/details/53837651).
* Line 5: This line is similar to `self.sess = tf.Session(config=config)`, except that the session is bound to the explicitly created `self.graph` instead of the default graph. Since `self.graph` is updated by the later steps, it is stored as an attribute here.
* Line 7-9: Set the random seeds. The default seed is 0, defined at line 33 of `chem_tensorflow.py`.
* Line 13-21: The methods called here will be explained later.
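Why the seeds are fixed can be illustrated with Python's own `random` module alone. This is only a sketch: the helper `draw` is hypothetical, and the original code seeds NumPy and TensorFlow in the same place as well.

```python
import random


def draw(seed, n=3):
    # Re-seed the generator, then draw n pseudo-random numbers.
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(0)
second = draw(0)
other = draw(1)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```

With a fixed seed, two runs draw the same pseudo-random numbers, which makes shuffling and initialization repeatable.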

At this point, the initialization of the `ChemModel` class is finished. However, many of the methods called during initialization have not been explained yet. The following sections describe how they are defined.

## Functions in `ChemModel`

<a id="load_data"></a>
### Function `load_data`

Function `load_data` was mentioned [before](#ref_load_data). 

> Following code starting from Line 93 in `chem_tensorflow.py`. 4 space indentation is ignored.

In [None]:
def load_data(self, file_name, is_training_data: bool):
    full_path = os.path.join(self.data_dir, file_name)

    print("Loading data from %s" % full_path)
    with open(full_path, 'r') as f:
        data = json.load(f)

    restrict = self.args.get("--restrict_data")
    if restrict is not None and restrict > 0:
        data = data[:restrict]

    # Get some common data out:
    num_fwd_edge_types = 0
    for g in data:
        self.max_num_vertices = max(self.max_num_vertices, max([v for e in g['graph'] for v in [e[0], e[2]]]))
        num_fwd_edge_types = max(num_fwd_edge_types, max([e[1] for e in g['graph']]))
    self.num_edge_types = max(self.num_edge_types, num_fwd_edge_types * (1 if self.params['tie_fwd_bkwd'] else 2))
    self.annotation_size = max(self.annotation_size, len(data[0]["node_features"][0]))

    return self.process_raw_graphs(data, is_training_data)

For the code above:
* Line 15-18: These lines scan the data for the maximum graph (molecule) sizes.
  * Line 15-16: Find the largest vertex index (which tracks the number of atoms if hydrogen is included, or of heavy atoms otherwise) and the largest edge type. Frankly speaking, this may be inefficient, because we have to go through the whole data set again even though all the data was already read when running `get_data.py`. The search could be implemented in `get_data.py` to save data I/O time.
  * `e in g['graph']`: `e` looks like `[0, 2, 1]`. `e[0]` and `e[2]` are atom indices, while `e[1]` is the bond type.
  * Line 17: Since we only consider undirected graphs here, the number of edge types is 4. For directed graphs it would be 8.
  * Line 18: The length of the per-atom feature vector, i.e. the number of atom types. 5 here.
* Line 20: The function that actually runs is not the one at line 119 of `chem_tensorflow.py` but the one at line 112 of `chem_tensorflow_dense.py`. This is a method override, which is not apparent in the vim editor but is apparent in an IDE.
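The statistics gathered in `load_data` can be reproduced on a toy data set. The two graphs below are made up for illustration and only mimic the layout of the JSON files (each edge is `[source, bond_type, target]`); the computations follow the listing above.

```python
# Made-up data in the same layout as the JSON files.
data = [
    {"graph": [[0, 1, 1], [1, 2, 2]],
     "node_features": [[0, 1, 0], [1, 0, 0], [1, 0, 0]]},
    {"graph": [[0, 1, 1], [1, 1, 2], [2, 3, 3]],
     "node_features": [[0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]},
]
tie_fwd_bkwd = True  # same default as in chem_tensorflow.py

max_num_vertices = 0
num_fwd_edge_types = 0
for g in data:
    # Largest vertex index over both endpoints of every edge.
    max_num_vertices = max(max_num_vertices,
                           max(v for e in g["graph"] for v in (e[0], e[2])))
    # Largest bond type seen so far.
    num_fwd_edge_types = max(num_fwd_edge_types,
                             max(e[1] for e in g["graph"]))

# Edge types are doubled only when forward and backward edges are not tied.
num_edge_types = num_fwd_edge_types * (1 if tie_fwd_bkwd else 2)
# Length of one node feature vector, read from the first graph only.
annotation_size = len(data[0]["node_features"][0])

print(max_num_vertices, num_edge_types, annotation_size)
```

Note that `max_num_vertices` here holds the largest vertex index (3), while the larger graph has 4 vertices, because indices start at 0.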

<a id="process_raw_graphs"></a>
### Function `process_raw_graphs`

> Following code starting from Line 112 in `chem_tensorflow_dense.py`. 4 space indentation is ignored.

In [None]:
def process_raw_graphs(self, raw_data: Sequence[Any], is_training_data: bool) -> Any:
    bucket_sizes = np.array(list(range(4, 28, 2)) + [29])
    bucketed = defaultdict(list)
    x_dim = len(raw_data[0]["node_features"][0])
    for d in raw_data:
        chosen_bucket_idx = np.argmax(bucket_sizes > max([v for e in d['graph']
                                                            for v in [e[0], e[2]]]))
        chosen_bucket_size = bucket_sizes[chosen_bucket_idx]
        bucketed[chosen_bucket_idx].append({
            'adj_mat': graph_to_adj_mat(d['graph'], chosen_bucket_size, self.num_edge_types, self.params['tie_fwd_bkwd']),
            'init': d["node_features"] + [[0 for _ in range(x_dim)] for __ in
                                          range(chosen_bucket_size - len(d["node_features"]))],
            'labels': [d["targets"][task_id][0] for task_id in self.params['task_ids']],
        })

    if is_training_data:
        for (bucket_idx, bucket) in bucketed.items():
            np.random.shuffle(bucket)
            for task_id in self.params['task_ids']:
                task_sample_ratio = self.params['task_sample_ratios'].get(str(task_id))
                if task_sample_ratio is not None:
                    ex_to_sample = int(len(bucket) * task_sample_ratio)
                    for ex_id in range(ex_to_sample, len(bucket)):
                        bucket[ex_id]['labels'][task_id] = None

    bucket_at_step = [[bucket_idx for _ in range(len(bucket_data) // self.params['batch_size'])]
                      for bucket_idx, bucket_data in bucketed.items()]
    bucket_at_step = [x for y in bucket_at_step for x in y]

    return (bucketed, bucket_sizes, bucket_at_step)

The following is sample output of `process_raw_graphs` when the .json file contains only the first two molecules of `molecules_train.json`.

```
(defaultdict(<class 'list'>, {0: [{'adj_mat': array([[[0., 0., 1., 1.],
        [0., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.]],

       [[0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]]), 'labels': [-0.3917742606773419], 'init': [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]}], 1: [{'adj_mat': array([[[0., 1., 1., 1., 1., 0.],
        [1., 0., 0., 0., 0., 1.],
        [1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.]]]), 'labels': [-0.7729827193116501], 'init': [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]}]}), array([ 4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 29]), [])
```
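The bucketing logic behind this output can be sketched in plain Python. The helpers `choose_bucket` and `pad_features` are hypothetical names, and `next(...)` stands in for the `np.argmax(bucket_sizes > ...)` call of the original; the two agree only when some bucket is large enough, that is, for vertex indices below 29.

```python
bucket_sizes = list(range(4, 28, 2)) + [29]
batch_size = 256  # default from chem_tensorflow_dense.py


def choose_bucket(max_vertex_index):
    # Index of the first bucket whose size exceeds the largest vertex index.
    return next(i for i, size in enumerate(bucket_sizes)
                if size > max_vertex_index)


def pad_features(node_features, bucket_size):
    # Append all-zero feature rows until the graph fills its bucket.
    x_dim = len(node_features[0])
    padding = [[0] * x_dim for _ in range(bucket_size - len(node_features))]
    return node_features + padding


idx_small = choose_bucket(3)  # molecule with vertex indices 0..3
idx_large = choose_bucket(5)  # molecule with vertex indices 0..5
padded = pad_features([[0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]],
                      bucket_sizes[idx_small])


def steps(bucket_lengths):
    # One entry per full batch that a bucket can supply.
    return [idx for idx, n in bucket_lengths.items()
            for _ in range(n // batch_size)]


few = steps({idx_small: 1, idx_large: 1})       # as in the sample output
many = steps({idx_small: 600, idx_large: 300})  # made-up bucket lengths

print(idx_small, idx_large, padded, few, many)
```

This also suggests why the last element of the sample output is an empty list: each bucket holds a single molecule, and `1 // 256` is 0, so no bucket supplies a full batch.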