# Graph Convolutions For Tox21
In this notebook, we will explore the use of KerasModel to create graph convolutional models with DeepChem. In particular, we will build a graph convolutional network on the Tox21 dataset.

Let's start with some basic imports.

In [1]:
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import numpy as np 
import tensorflow as tf
import deepchem as dc
from deepchem.models.graph_models import GraphConvModel

Now, let's use MoleculeNet to load the Tox21 dataset. We need to process the data in a way that graph convolutional networks can use. To do so, we set the featurizer option to 'GraphConv'. The MoleculeNet call returns a training set, a validation set, and a test set for us to use. The call also returns `transformers`, a list of data transformations that were applied to preprocess the dataset. (Most deep networks are quite finicky and require a set of data transformations to ensure that training proceeds stably.)

In [2]:
# Load Tox21 dataset
tox21_tasks, tox21_datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = tox21_datasets

Loading dataset from disk.
Loading dataset from disk.
Loading dataset from disk.


Let's now train a graph convolutional network on this dataset. For convenience, DeepChem provides the class `GraphConvModel`, which wraps a standard graph convolutional architecture under the hood. Let's instantiate an object of this class and train it on our dataset.

In [3]:
n_tasks = len(tox21_tasks)
model = GraphConvModel(n_tasks, batch_size=50, mode='classification')
# Set nb_epoch=10 for better results.
model.fit(train_dataset, nb_epoch=1)

0.8578314024668473

Let's try to evaluate the performance of the model we've trained. For this, we need to define a metric, a measure of model performance. `dc.metrics` already holds a collection of metrics. For this dataset, it is standard to use the ROC-AUC score, the area under the receiver operating characteristic curve (which traces the tradeoff between the true positive rate and the false positive rate as the classification threshold varies). Luckily, the ROC-AUC score is already available in DeepChem.
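
To make the metric concrete, here is a minimal pure-Python sketch of ROC-AUC computed from its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The helper `pairwise_roc_auc` and the toy labels and scores are assumptions made for illustration; the helper is not part of DeepChem.

```python
def pairwise_roc_auc(y_true, y_score):
    """ROC-AUC as the chance that a random positive outranks a random negative."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data: two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_roc_auc(labels, scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the sketch prints 0.75. A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.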

To measure the performance of the model under this metric, we can use the convenience function `model.evaluate()`.

In [4]:
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)

print("Evaluating model")
train_scores = model.evaluate(train_dataset, [metric], transformers)
print("Training ROC-AUC Score: %f" % train_scores["mean-roc_auc_score"])
valid_scores = model.evaluate(valid_dataset, [metric], transformers)
print("Validation ROC-AUC Score: %f" % valid_scores["mean-roc_auc_score"])

Evaluating model
computed_metrics: [0.7846645298316683, 0.8255705081888247, 0.8389671176952215, 0.8060793455430841, 0.6711073801951157, 0.7740920978757578, 0.7495887814740334, 0.7160807746060438, 0.767546874021283, 0.7088871155213989, 0.7977418645237055, 0.7459137439167645]
Training ROC-AUC Score: 0.765520
computed_metrics: [0.6741700885568693, 0.7308201058201058, 0.833249897601602, 0.7730045754956787, 0.6146818181818182, 0.6578554008339234, 0.6388888888888888, 0.7393550811432866, 0.7231457499411349, 0.5883899676375405, 0.7528563558408583, 0.6820844099913868]
Validation ROC-AUC Score: 0.700709


What's going on under the hood? Could we build `GraphConvModel` ourselves? Of course! The first step is to define the inputs to our model. Conceptually, graph convolutions just require the structure of the molecule in question and a vector of features for every atom that describes the local chemical environment. In practice, however, due to TensorFlow's limitations as a general programming environment, we also need to supply some preprocessed auxiliary information.

`atom_features` holds a feature vector of length 75 for each atom. The other inputs are required to support minibatching in TensorFlow. `degree_slice` is an indexing convenience that makes it easy to locate atoms from all molecules with a given degree. `membership` determines the membership of atoms in molecules (atom `i` belongs to molecule `membership[i]`). `deg_adjs` is a list that contains adjacency lists grouped by atom degree. For more details, check out the [code](https://github.com/deepchem/deepchem/blob/master/deepchem/feat/mol_graphs.py).
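
As a rough illustration of these bookkeeping arrays, the following pure-Python sketch builds a `membership` list and a table with one `(start, size)` row per degree for a toy batch of two molecules. The helper names, the toy degrees, and the exact layout (atoms sorted by degree, one row per degree) are assumptions made for this example. It is a simplified picture of the idea, not the actual `ConvMol` implementation.

```python
def build_membership(atoms_per_molecule):
    """Return a list where entry i is the index of the molecule owning atom i."""
    membership = []
    for mol_index, n_atoms in enumerate(atoms_per_molecule):
        membership.extend([mol_index] * n_atoms)
    return membership


def build_degree_slices(sorted_degrees, max_degree):
    """For atoms already sorted by degree, return one (start, size) row per degree."""
    slices = []
    start = 0
    for degree in range(max_degree + 1):
        size = sum(1 for d in sorted_degrees if d == degree)
        slices.append((start, size))
        start += size
    return slices


# Toy batch: molecule 0 has 3 atoms, molecule 1 has 2 atoms
# (shown before any reordering of atoms by degree).
membership = build_membership([3, 2])
print(membership)

# Hypothetical degrees of the same five atoms, sorted in ascending order.
degree_slices = build_degree_slices([1, 1, 1, 1, 2], max_degree=3)
print(degree_slices)
```

With such a table, all atoms of degree 1 can be read as one contiguous block (rows 0 through 3 in this toy case), which is what makes the degree-wise lookup cheap.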

To define feature inputs with Keras, we use the `Input` layer. Conceptually, a model is a mathematical graph composed of layer objects. `Input` layers have to be the root nodes of the graph since they constitute inputs.

In [5]:
import tensorflow as tf
from tensorflow.keras.layers import Input

atom_features = Input(shape=(75,))
degree_slice = Input(shape=(2,), dtype=tf.int32)
membership = Input(shape=tuple(), dtype=tf.int32)

deg_adjs = []
for i in range(0, 10 + 1):
    deg_adj = Input(shape=(i+1,), dtype=tf.int32)
    deg_adjs.append(deg_adj)

Let's now implement the body of the graph convolutional network. DeepChem has a number of layers that encode various graph operations, namely the `GraphConv`, `GraphPool`, and `GraphGather` layers. We will also apply standard neural network layers such as `Dense` and `BatchNormalization`.

The layers we're adding effect a "feature transformation" that will create one vector for each molecule.

In [6]:
from tensorflow.keras.layers import Dense, BatchNormalization, Reshape, Softmax
from deepchem.models.layers import GraphConv, GraphPool, GraphGather

batch_size = 50

gc1 = GraphConv(64, activation_fn=tf.nn.relu)([atom_features, degree_slice, membership] + deg_adjs)
batch_norm1 = BatchNormalization()(gc1)
gp1 = GraphPool()([batch_norm1, degree_slice, membership] + deg_adjs)
gc2 = GraphConv(64, activation_fn=tf.nn.relu)([gp1, degree_slice, membership] + deg_adjs)
batch_norm2 = BatchNormalization()(gc2)
gp2 = GraphPool()([batch_norm2, degree_slice, membership] + deg_adjs)
dense = Dense(128, activation=tf.nn.relu)(gp2)
batch_norm3 = BatchNormalization()(dense)
readout = GraphGather(batch_size=batch_size, activation_fn=tf.nn.tanh)([batch_norm3, degree_slice, membership] + deg_adjs)
logits = Reshape((n_tasks, 2))(Dense(n_tasks*2)(readout))
softmax = Softmax()(logits)
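
The readout step, which turns per-atom rows into one vector per molecule, can be pictured as a segment reduction over `membership`. The NumPy sketch below assumes the readout combines a per-molecule sum and a per-molecule maximum and concatenates them. The function `toy_graph_gather` is hypothetical, has no learned parameters, and skips the activation; it illustrates the idea only and is not DeepChem's `GraphGather` code.

```python
import numpy as np


def toy_graph_gather(atom_features, membership, n_molecules):
    """Reduce per-atom rows to one row per molecule via sum and max, then concatenate."""
    n_features = atom_features.shape[1]
    summed = np.zeros((n_molecules, n_features))
    maxed = np.full((n_molecules, n_features), -np.inf)
    for atom_row, mol_index in zip(atom_features, membership):
        summed[mol_index] += atom_row
        maxed[mol_index] = np.maximum(maxed[mol_index], atom_row)
    return np.concatenate([summed, maxed], axis=1)


# Five atoms with two features each; atoms 0-2 belong to molecule 0, atoms 3-4 to molecule 1.
atom_features = np.array([[1.0, 0.0],
                          [2.0, 1.0],
                          [3.0, 5.0],
                          [4.0, 4.0],
                          [0.5, 6.0]])
membership = [0, 0, 0, 1, 1]

molecule_vectors = toy_graph_gather(atom_features, membership, n_molecules=2)
print(molecule_vectors.shape)
print(molecule_vectors)
```

Whatever the number of atoms in each molecule, the result has exactly one fixed-length row per molecule, which is what the final `Dense` layer needs.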

Let's now create the `KerasModel`. To do that, we specify the inputs and outputs of the model. We also have to define a loss for the model, which tells the network the objective to minimize during training.


In [7]:
inputs = [atom_features, degree_slice, membership] + deg_adjs
outputs = [softmax]
keras_model = tf.keras.Model(inputs=inputs, outputs=outputs)
loss = dc.models.losses.CategoricalCrossEntropy()
model = dc.models.KerasModel(keras_model, loss=loss)
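
To see how the `Reshape` and `Softmax` outputs line up with a categorical cross-entropy objective, the following NumPy sketch traces the shapes on toy numbers. The values of `n_tasks` and `batch` and the all-positive labels are assumptions for illustration. This is a plain NumPy walk-through, not how `dc.models.losses.CategoricalCrossEntropy` is implemented.

```python
import numpy as np

n_tasks = 3   # toy value; the Tox21 run above has 12 tasks
batch = 2     # toy number of molecules

# Flat output of a Dense(n_tasks * 2) layer: one row per molecule.
flat_logits = np.arange(batch * n_tasks * 2, dtype=float).reshape(batch, n_tasks * 2)

# Reshape so the last axis holds the two class logits for each task.
logits = flat_logits.reshape(batch, n_tasks, 2)

# Softmax over the last axis (subtract the max for numerical stability).
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# One-hot labels with the same shape; here every label is the positive class.
labels = np.zeros_like(probs)
labels[..., 1] = 1.0

# Cross-entropy for each (molecule, task) pair.
cross_entropy = -(labels * np.log(probs)).sum(axis=-1)

print(probs.shape)          # (2, 3, 2)
print(cross_entropy.shape)  # (2, 3)
```

Each (molecule, task) pair gets its own two-class distribution, so the loss is computed per pair and can then be weighted and averaged.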

Now that we've successfully defined our graph convolutional model, we need to train it. We can call `fit()`, but we need to make sure that each minibatch of data populates all the `Input` objects that we've created. For this, we create a Python generator that, for each batch of data, yields the lists of inputs, labels, and weights (all NumPy arrays) to use for that step of training.

In [8]:
from deepchem.metrics import to_one_hot
from deepchem.feat.mol_graphs import ConvMol

def data_generator(dataset, epochs=1, predict=False, pad_batches=True):
  for epoch in range(epochs):
    for ind, (X_b, y_b, w_b, ids_b) in enumerate(
        dataset.iterbatches(
            batch_size, pad_batches=pad_batches, deterministic=True)):
      multiConvMol = ConvMol.agglomerate_mols(X_b)
      inputs = [multiConvMol.get_atom_features(), multiConvMol.deg_slice, np.array(multiConvMol.membership)]
      for i in range(1, len(multiConvMol.get_deg_adjacency_lists())):
        inputs.append(multiConvMol.get_deg_adjacency_lists()[i])
      labels = [to_one_hot(y_b.flatten(), 2).reshape(-1, n_tasks, 2)]
      weights = [w_b]
      yield (inputs, labels, weights)
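
The label reshaping in the generator is easier to follow on a small array. The sketch below uses a hypothetical `one_hot` helper as a stand-in for `to_one_hot` (an assumption, written here so the example runs without DeepChem) and a toy batch of two molecules with three tasks.

```python
import numpy as np


def one_hot(labels, n_classes):
    """Hypothetical stand-in for to_one_hot: map integer labels to one-hot rows."""
    encoded = np.zeros((len(labels), n_classes))
    encoded[np.arange(len(labels)), labels.astype(int)] = 1.0
    return encoded


n_tasks = 3  # toy value
# Labels for a batch of two molecules, one column per task.
y_b = np.array([[0, 1, 0],
                [1, 1, 0]])

# Flatten to one label per (molecule, task), encode, then restore the task axis.
labels = one_hot(y_b.flatten(), 2).reshape(-1, n_tasks, 2)
print(labels.shape)
print(labels[1])
```

The result has shape `(batch, n_tasks, 2)`, matching the shape of the model's softmax output, so each task's label can be compared with its two-class prediction.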

Now, we can train the model using `KerasModel.fit_generator(generator)` which will use the generator we've defined to train the model.

In [9]:
# Epochs set to 1 to render tutorials online.
# Set epochs=10 for better results.
model.fit_generator(data_generator(train_dataset, epochs=1))

0.888308482674452

Now that we have trained our graph convolutional model, let's evaluate its performance. We again have to use the generator we defined to produce predictions for evaluation.

In [10]:
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)

def reshape_y_pred(y_true, y_pred):
    """
    GraphConv always pads batches, so we need to remove the predictions
    for the padding samples.  Also, it outputs two values for each task
    (probabilities of positive and negative), but we only want the positive
    probability.
    """
    n_samples = len(y_true)
    return y_pred[:n_samples, :, 1]
    

print("Evaluating model")
train_predictions = model.predict_on_generator(data_generator(train_dataset, predict=True))
train_predictions = reshape_y_pred(train_dataset.y, train_predictions)
train_scores = metric.compute_metric(train_dataset.y, train_predictions, train_dataset.w)
print("Training ROC-AUC Score: %f" % train_scores)

valid_predictions = model.predict_on_generator(data_generator(valid_dataset, predict=True))
valid_predictions = reshape_y_pred(valid_dataset.y, valid_predictions)
valid_scores = metric.compute_metric(valid_dataset.y, valid_predictions, valid_dataset.w)
print("Valid ROC-AUC Score: %f" % valid_scores)

Evaluating model
computed_metrics: [0.7500439569332503]
Training ROC-AUC Score: 0.750044
computed_metrics: [0.6853516799391368]
Valid ROC-AUC Score: 0.685352
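
The trimming done by `reshape_y_pred` can be checked on shapes alone. The sketch below uses toy sizes and random numbers in place of real model output, and assumes that padding is only added to the last batch, so the padded rows sit at the end of the prediction array.

```python
import math
import numpy as np

batch_size = 50
n_samples = 120   # toy dataset size, not a multiple of batch_size
n_tasks = 3       # toy value

# With padded batches, the number of prediction rows is rounded up to full batches.
n_batches = math.ceil(n_samples / batch_size)
padded_rows = n_batches * batch_size

# Stand-in for model output: (rows, tasks, classes) class probabilities.
rng = np.random.default_rng(0)
y_pred = rng.random((padded_rows, n_tasks, 2))

# Keep only the real samples and the positive-class column, as reshape_y_pred does.
trimmed = y_pred[:n_samples, :, 1]
print(y_pred.shape, trimmed.shape)
```

Here 120 samples are padded to 150 rows, and slicing brings the predictions back to shape `(120, 3)`, which matches the shape of the dataset's label array for the metric computation.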


Success! The model we've constructed behaves nearly identically to `GraphConvModel`. If you're looking to build your own custom models, you can follow the example we've provided here to do so. We hope to see exciting constructions from your end soon!