#  Introduction to DMPNN

In this tutorial, we will learn how to use the DMPNN model by working through an example on the Tox21 dataset. This DMPNN implementation is based on the paper [Analyzing Learned Molecular Representations for Property Prediction](https://arxiv.org/pdf/1904.01561.pdf) and on [Chemprop](https://github.com/chemprop/chemprop).




# What is DMPNN?


The Directed Message Passing Neural Network (D-MPNN) model is a graph convolution network (GCN) built upon the existing Message Passing Neural Network (MPNN) architecture. The primary difference between the D-MPNN and regular MPNNs lies in the nature of the messages passed through the molecule during the message passing phase. While the general MPNN framework assumes messages are centered on atoms, the D-MPNN centers messages on bonds instead.

Specifically, the D-MPNN maintains two representations for the message centered on the bond between atoms 𝑣 and 𝑤: one from atom 𝑣 to atom 𝑤 and one from atom 𝑤 to atom 𝑣, hence the word "Directed". Consequently, rather than aggregating information from neighboring atoms, the D-MPNN aggregates information from neighboring bonds. Each bond's message is updated based on all incoming bond messages.

<div>
<img src="https://miro.medium.com/max/4800/1*IZFd3FKXyhAjsMuUtcz5DQ.webp" width="700"/>
</div>


The motivation behind this design is to prevent totters, that is, to avoid messages being passed along any path of the form $v_1 v_2 \cdots v_n$ where $v_i = v_{i+2}$ for some $i$. Such excursions are likely to introduce noise into the graph representation. Because messages are centered on bonds and the two directions of each bond are kept distinct, the D-MPNN has greater control over the flow of information across the molecule and can therefore build more informative molecular representations.
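
The bond-centered update described above can be sketched in a few lines of plain Python. This is only an illustration, not the DeepChem or Chemprop implementation: the toy molecule and the helper names `directed_edges` and `incoming_edges` are assumptions made for this example. For every directed bond v→w, it lists the directed bonds u→v whose messages would feed into it, skipping the reverse bond w→v so that a message cannot immediately bounce back along the bond it just crossed.

```python
# Toy molecule: a 4-atom chain 0-1-2-3, given as undirected bonds (hypothetical example).
bonds = [(0, 1), (1, 2), (2, 3)]


def directed_edges(bonds):
    """Return both directions of every undirected bond."""
    edges = []
    for v, w in bonds:
        edges.append((v, w))
        edges.append((w, v))
    return edges


def incoming_edges(edge, edges):
    """Directed bonds u->v that feed the message on v->w, excluding the reverse bond w->v."""
    v, w = edge
    return [(u, x) for (u, x) in edges if x == v and u != w]


edges = directed_edges(bonds)
for edge in edges:
    print(edge, "<-", incoming_edges(edge, edges))
```

In this chain, the message on bond 1→2 only receives the message from 0→1; the reverse bond 2→1 is excluded, which is the step that rules out the back-and-forth paths called totters.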





## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Introduction_to_Graph_Convolutions.ipynb)



In [None]:
!pip install --pre deepchem



### Installing the required dependencies

The DMPNN model requires the torch, torch-geometric, torch-sparse and torch-scatter libraries to be installed. (Installing torch-sparse and torch-scatter takes some time.)

In [None]:
!pip install deepchem[torch]
!pip install torch-geometric
!pip install torch-sparse
!pip install torch-scatter -f https://data.pyg.org/whl/torch-1.13.0+${CUDA}.html  # noqa


In [None]:
import deepchem as dc
import numpy as np
from deepchem.models import DMPNNModel
from rdkit.Chem import Descriptors

### DMPNN Featurizer

The DMPNNFeaturizer class is the featurizer for the Directed Message Passing Neural Network (D-MPNN) implementation.

The default node representation is constructed by concatenating the following values, and its feature length is 133.
1. Atomic num - A one-hot vector of this atom's atomic number, covering the first 100 elements.
2. Degree - A one-hot vector of the degree (0-5) of this atom.
3. Formal charge - Integer electronic charge, -1, -2, 1, 2, 0.
4. Chirality - A one-hot vector of the chirality tag (0-3) of this atom.
5. Number of Hydrogens - A one-hot vector of the number of hydrogens (0-4) connected to this atom.
6. Hybridization - A one-hot vector of "SP", "SP2", "SP3", "SP3D", "SP3D2".
7. Aromatic - A one-hot vector of whether the atom belongs to an aromatic ring.
8. Mass - Atomic mass * 0.01

The default edge representation is constructed by concatenating the following values, and its feature length is 14.

1. Bond type - A one-hot vector of the bond type, "single", "double", "triple", or "aromatic".
2. Same ring - A one-hot vector of whether the atoms in the pair are in the same ring.
3. Conjugated - A one-hot vector of whether this bond is conjugated or not.
4. Stereo - A one-hot vector of the stereo configuration (0-5) of a bond.
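
The feature lengths 133 and 14 can be reproduced with a little arithmetic. The sketch below assumes the Chemprop-style encoding: every multi-valued one-hot group reserves one extra slot for unknown values, the aromatic flag and the scaled mass each take a single slot, and the bond vector starts with one slot that marks a missing bond. These are assumptions about the encoding, stated here for illustration rather than read from the DeepChem source.

```python
UNKNOWN = 1  # assumed extra "unknown" slot per one-hot group

# Number of explicit choices in each atom one-hot group, as listed above.
atom_one_hot_choices = {
    "atomic_num": 100,
    "degree": 6,         # 0-5
    "formal_charge": 5,  # -1, -2, 1, 2, 0
    "chirality": 4,      # 0-3
    "num_hydrogens": 5,  # 0-4
    "hybridization": 5,  # SP, SP2, SP3, SP3D, SP3D2
}
atom_scalars = 2  # assumed: one aromatic flag + one scaled-mass value

atom_fdim = sum(n + UNKNOWN for n in atom_one_hot_choices.values()) + atom_scalars

bond_fdim = (
    1                # assumed "no bond" indicator
    + 4              # bond type: single, double, triple, aromatic
    + 1              # conjugated flag
    + 1              # same-ring flag
    + (6 + UNKNOWN)  # stereo configuration 0-5
)

print(atom_fdim, bond_fdim)  # 133 14
```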

    

    

  


In [None]:
# Example of Featurized data

smiles = ["C1=CC=CN=C1"]
featurizer = dc.feat.DMPNNFeaturizer()
out = featurizer.featurize(smiles)
print("Type of featurized data:-", type(out[0]))
print("[num_nodes, num_node_features]:-", out[0].node_features.shape)
print("[num_edges, num_edge_features]:-", out[0].edge_features.shape)


Type of featurized data:- <class 'deepchem.feat.graph_data.GraphData'>
[num_nodes, num_node_features]:- (6, 133)
[num_edges, num_edge_features]:- (12, 14)


### Parameters of the DMPNNFeaturizer class
1. features_generators - A list of global feature generators to use during featurization. Each entry must be one of the strings listed below. Its default value is None.<br>
Available feature generators are:
* "morgan" - Uses the CircularFingerprint class with feature size 2048.
* "morgan_count" - Uses the CircularFingerprint class with feature size 2048.
* "rdkit_desc" - Uses the RDKitDescriptors class without normalization, with feature size 200.
* "rdkit_desc_normalized" - Uses the RDKitDescriptors class with normalization, with feature size 200.

To learn more, refer to the [feature generators](https://github.com/deepchem/deepchem/blob/2388d66798dbbff732520824ad36143e0e5f1b1b/deepchem/feat/molecule_featurizers/dmpnn_featurizer.py#L47-L60) in the source code.
2. is_adding_hs - A bool that controls whether hydrogen atoms are added. Its default value is False.
3. use_original_atom_ranks - Whether to use the original atom mapping or the canonical atom mapping. Its default value is False.



        

#### More on RDKitDescriptors

At the time of writing, the RDKit library provides 208 descriptors, but the normalizing CDF parameters are not available for the eight BCUT2D descriptors (BCUT2D_MWHI, BCUT2D_MWLOW, BCUT2D_CHGHI, BCUT2D_CHGLO, BCUT2D_LOGPHI, BCUT2D_LOGPLOW, BCUT2D_MRHI, BCUT2D_MRLOW).
Therefore, the feature size is 200 for both "rdkit_desc_normalized" and "rdkit_desc".

In [None]:
total_descriptors = len(Descriptors.descList)
# Subtract the 8 BCUT2D descriptors, which DMPNN does not use.
rdkit_feature_size = total_descriptors - 8
print("Feature size of rdkit_desc/rdkit_desc_normalized =", rdkit_feature_size)


Feature size of rdkit_desc/rdkit_desc_normalized = 200


## Training DMPNN Model
Let's use the MoleculeNet suite to load the Tox21 dataset. To featurize the data in a way that graph convolutional networks can use, we pass a `DMPNNFeaturizer` instance as the featurizer option. The MoleculeNet call returns a training set, a validation set, and a test set for us to use. It also returns tasks, a list of the task names, and transformers, a list of data transformations that were applied to preprocess the dataset. (Most deep networks are quite finicky and require a set of data transformations to ensure that training proceeds stably.)

In [None]:
# Load Tox21 dataset
tox21_tasks, tox21_datasets, transformers = dc.molnet.load_tox21(
    featurizer=dc.feat.DMPNNFeaturizer(
        features_generators=["rdkit_desc_normalized"]),
    splitter='scaffold')


In [None]:
train_dataset, valid_dataset, test_dataset = tox21_datasets
print('dataset is featurized')

print(len(train_dataset), len(valid_dataset), len(test_dataset))
print(train_dataset.X[:5])

dataset is featurized
6264 783 784
[GraphData(node_features=[11, 133], edge_index=[2, 20], edge_features=[20, 14], global_features=[200])
 GraphData(node_features=[20, 133], edge_index=[2, 38], edge_features=[38, 14], global_features=[200])
 GraphData(node_features=[10, 133], edge_index=[2, 18], edge_features=[18, 14], global_features=[200])
 GraphData(node_features=[21, 133], edge_index=[2, 36], edge_features=[36, 14], global_features=[200])
 GraphData(node_features=[10, 133], edge_index=[2, 18], edge_features=[18, 14], global_features=[200])]


The DMPNN model has two phases: a message-passing phase and a read-out phase.

  - The goal of the message-passing phase is to generate hidden states for all the atoms in the molecule using encoders.
  - In the read-out phase, these features are passed into a feed-forward neural network to produce the task-specific prediction.

### Parameters

General Parameters
1.   mode - Specifies the model type, 'classification' or 'regression'. The default is 'regression'.
2.   n_classes - The number of classes to predict (used only in classification mode). The default value is 3.
3.   n_tasks - The number of tasks. The default value is 1.
4.   batch_size - The number of datapoints in a batch. The default value is 1.
5.   global_features_size - Size of the global features vector (based on the global featurizers used during featurization).
6.   use_default_fdim - If `True`, self.atom_fdim and self.bond_fdim are initialized using values from the GraphConvConstants class.
      If `False`, self.atom_fdim and self.bond_fdim are initialized from the values provided.
7.   atom_fdim - Specifies the dimension of the atom feature vector.
8.   bond_fdim - Specifies the dimension of the bond feature vector.

Encoder parameters
1.   enc_hidden - Specifies the size of the hidden layer in the encoder layer.
2.   depth - The number of message passing steps.
3.   bias - If `True`, dense layers will use bias vectors.
4.   enc_activation - Activation function to be used in the encoder layer. The available activation functions are 'relu' for ReLU, 'leakyrelu' for LeakyReLU, 'prelu' for PReLU, 'tanh' for TanH, 'selu' for SELU, and 'elu' for ELU.
5.   enc_dropout_p - Dropout probability for the encoder layer.
6.   aggregation - Aggregation type to be used in the encoder layer.
      Can choose between 'mean', 'sum', and 'norm'.
7.   aggregation_norm - Value required if the `aggregation` type is 'norm'.

Feed Forward Parameters
1.   ffn_hidden - Size of the hidden layer in the feed-forward network layer.
2.   ffn_activation - Activation function to be used in the feed-forward network layer. Can choose between 'relu' for ReLU, 'leakyrelu' for LeakyReLU, 'prelu' for PReLU, 'tanh' for TanH, 'selu' for SELU, and 'elu' for ELU.
3.   ffn_layers - Number of layers in the feed-forward network layer.
4.   ffn_dropout_p - Dropout probability for the feed-forward network layer.
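
To make the encoder's `aggregation` options listed above more concrete, here is a small sketch of how per-atom hidden states could be pooled into a single molecule vector. It assumes the Chemprop convention, in which 'norm' divides the summed hidden states by `aggregation_norm`. The `aggregate` function, its default value, and the sample numbers are hypothetical and only meant for illustration.

```python
def aggregate(atom_hidden, aggregation="mean", aggregation_norm=100):
    """Pool a list of per-atom hidden vectors into one molecule vector.

    The default aggregation_norm is an illustrative value, not a documented one.
    """
    n_atoms = len(atom_hidden)
    # Column-wise sum over atoms.
    summed = [sum(column) for column in zip(*atom_hidden)]
    if aggregation == "sum":
        return summed
    if aggregation == "mean":
        return [value / n_atoms for value in summed]
    if aggregation == "norm":
        return [value / aggregation_norm for value in summed]
    raise ValueError(f"unknown aggregation: {aggregation}")


# Three atoms, each with a 2-dimensional hidden state (toy numbers).
atom_hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(aggregate(atom_hidden, "sum"))   # [9.0, 12.0]
print(aggregate(atom_hidden, "mean"))  # [3.0, 4.0]
print(aggregate(atom_hidden, "norm", aggregation_norm=10))
```

Under this assumption, 'mean' makes the molecule vector independent of molecule size, while 'sum' and 'norm' let larger molecules produce larger vectors, with 'norm' rescaling them by a fixed constant.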



#### NOTE: Do not forget to provide a value for "global_features_size" if you have used one or more feature generators as discussed earlier.
If you have used more than one feature generator, then "global_features_size" is the sum of the respective feature sizes.
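
As a quick sketch of that rule, the feature sizes quoted earlier in this tutorial can be summed for whichever generators were passed to the featurizer. The `FEATURE_SIZES` table and the `global_features_size` helper are hypothetical conveniences written for this example; they are not part of DeepChem.

```python
# Feature sizes as listed in the featurizer section of this tutorial.
FEATURE_SIZES = {
    "morgan": 2048,
    "morgan_count": 2048,
    "rdkit_desc": 200,
    "rdkit_desc_normalized": 200,
}


def global_features_size(features_generators):
    """Sum the sizes of all generators used during featurization."""
    return sum(FEATURE_SIZES[name] for name in features_generators)


print(global_features_size(["rdkit_desc_normalized"]))            # 200
print(global_features_size(["morgan", "rdkit_desc_normalized"]))  # 2248
```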

In [None]:
# Initialise the model
model = DMPNNModel(n_tasks=len(tox21_tasks),
                   n_classes=2,
                   mode='classification',
                   batch_size=50,
                   global_features_size=200)

# Model training
print("Training model")
model.fit(train_dataset, nb_epoch=30)


Training model


0.15509110689163208

Let's evaluate the performance of the model we've trained. For this, we need to define a metric, a measure of model performance. `dc.metrics` already holds a collection of metrics. For this dataset, it is standard to use the ROC-AUC score, the area under the receiver operating characteristic curve (which measures the tradeoff between the true positive rate and the false positive rate). Luckily, the ROC-AUC score is already available in DeepChem.

To measure the performance of the model under this metric, we can use the convenience function `model.evaluate()`.
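
For intuition about what the metric computes: the ROC-AUC for one task equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pair counting in plain Python and then averages over tasks, in the spirit of the `np.mean` passed to `dc.metrics.Metric` in this tutorial. The `roc_auc` function and the toy scores are made up for illustration; this is not how `dc.metrics.roc_auc_score` is implemented.

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Two toy tasks with four molecules each (hypothetical labels and scores).
task_a = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
task_b = roc_auc([0, 1, 1, 0], [0.2, 0.9, 0.7, 0.3])
mean_auc = (task_a + task_b) / 2
print(task_a, task_b, mean_auc)  # 0.75 1.0 0.875
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is a useful frame for reading the train and validation scores reported by `model.evaluate()`.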

In [None]:
# Model evaluation
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)

print("Evaluating model")
train_scores = model.evaluate(train_dataset, [metric], transformers)
valid_scores = model.evaluate(valid_dataset, [metric], transformers)

print("Train scores: ", train_scores)
print("Validation scores: ", valid_scores)

Evaluating model
Train scores:  {'mean-roc_auc_score': 0.9831147857676159}
Validation scores:  {'mean-roc_auc_score': 0.7535253916851191}


# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!