# Introduction to GROVER

In this tutorial, we will go over what GROVER is and how to get it up and running.

GROVER (Graph Representation frOm self-superVised mEssage passing tRansformer) is a framework proposed by Tencent AI Lab. It uses self-supervised tasks at the node, edge, and graph levels to learn rich structural and semantic information about molecules from large unlabelled molecular datasets. GROVER integrates message passing networks into a Transformer-style architecture to produce more expressive molecular encodings.

Reference Paper: [Rong, Yu, et al. "Grover: Self-supervised message passing transformer on large-scale molecular data." Advances in Neural Information Processing Systems (2020).](https://drug.ai.tencent.com/publications/GROVER.pdf)

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Introduction_to_GROVER.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine.

## Import and set up required modules
We will first clone the repository onto the preferred platform, then install it as a library. We will also install DeepChem and descriptastorus.

NOTE: The [original GROVER repository](https://github.com/tencent-ailab/grover) does not contain a `setup.py` file, so we currently use a fork that does.
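
The clone step changes into `drive/MyDrive`, which only exists once Google Drive is mounted in the Colab runtime. A minimal sketch of that step, assuming you are running in Colab and want the repository stored on Drive (the mount point `/content/drive` is taken from the path printed by the clone cell; outside Colab the snippet does nothing):

```python
# Mount Google Drive so that `drive/MyDrive` exists before `%cd drive/MyDrive`.
# Assumption: this runs inside Colab; elsewhere the import fails and we skip.
try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    drive.mount("/content/drive")
else:
    print("Not running in Colab; clone the repository into any local folder instead.")
```

If you prefer not to use Drive, cloning into the default Colab working directory should work as well, as long as the later `%cd` commands are adjusted to match.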

In [1]:
# Clone the forked repository.
%cd drive/MyDrive
!git clone https://github.com/atreyamaj/grover.git

/content/drive/MyDrive
fatal: destination path 'grover' already exists and is not an empty directory.


In [2]:
# Navigate to the working folder.
%cd grover

/content/drive/MyDrive/grover


In [3]:
# Install the forked repository.
!pip install -e ./

Obtaining file:///content/drive/MyDrive/grover
Installing collected packages: grover
  Running setup.py develop for grover
Successfully installed grover-1.0.0


In [4]:
# Install deepchem and descriptastorus.
!pip install deepchem
!pip install git+https://github.com/bp-kelley/descriptastorus

Collecting deepchem
  Downloading deepchem-2.6.1-py3-none-any.whl (608 kB)

## Extracting semantic motif labels
The semantic motif labels are extracted by `scripts/save_features.py` with the feature generator `fgtasklabel`.

In [5]:
!python scripts/save_features.py --data_path exampledata/pretrain/tryout.csv  \
                                --save_path exampledata/pretrain/tryout.npz   \
                                --features_generator fgtasklabel \
                                --restart

100% 5970/5970 [00:09<00:00, 620.91it/s]
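
To check what the script wrote, it can help to list the arrays stored in the resulting `.npz` file. The sketch below is a generic NumPy helper, not part of GROVER. The array name `features` and the shape of the stand-in file are assumptions made for the demo; the real file may use different names.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive."""
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }


# Stand-in file so the snippet runs anywhere. The key "features" and the
# 4 x 85 shape are hypothetical, chosen only for illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez_compressed(demo_path, features=np.zeros((4, 85), dtype=np.float32))

summary = summarize_npz(demo_path)
print(summary)
```

In the notebook, calling `summarize_npz("exampledata/pretrain/tryout.npz")` would show the actual contents.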


## Extracting atom/bond contextual properties (vocabulary)
The atom and bond contextual properties (the vocabulary) are extracted by `scripts/build_vocab.py`.

In [6]:
!python scripts/build_vocab.py --data_path exampledata/pretrain/tryout.csv  \
                             --vocab_save_folder exampledata/pretrain  \
                             --dataset_name tryout

Building atom vocab from file: exampledata/pretrain/tryout.csv
50000it [00:04, 10946.14it/s]
atom vocab size 324
Building bond vocab from file: exampledata/pretrain/tryout.csv
50000it [00:16, 3094.21it/s]
bond vocab size 353
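
To give a feel for what a vocabulary entry represents, here is a toy sketch of an atom-level contextual property: the centre atom's element followed by sorted counts of its (neighbour element, bond type) pairs. The molecule is hand-written, only direct neighbours are considered, and the string format is an assumption modelled on the example in the GROVER paper. It is not the code used by `build_vocab.py`.

```python
from collections import Counter


def atom_context_label(center, bonds, symbols):
    """Build a label such as 'C_N-DOUBLE1_O-SINGLE1' for one atom.

    bonds:   list of (atom_i, atom_j, bond_type) tuples
    symbols: dict mapping atom index to element symbol
    """
    counts = Counter()
    for i, j, bond_type in bonds:
        if i == center:
            counts[f"{symbols[j]}-{bond_type}"] += 1
        elif j == center:
            counts[f"{symbols[i]}-{bond_type}"] += 1
    # Sort the neighbour terms so the same context always yields the same label.
    parts = [f"{key}{n}" for key, n in sorted(counts.items())]
    return "_".join([symbols[center]] + parts)


# Hypothetical fragment: carbon 0 double-bonded to nitrogen 1, single-bonded to oxygen 2.
symbols = {0: "C", 1: "N", 2: "O"}
bonds = [(0, 1, "DOUBLE"), (0, 2, "SINGLE")]

label = atom_context_label(0, bonds, symbols)
print(label)  # C_N-DOUBLE1_O-SINGLE1

# A vocabulary is then the set of distinct labels seen across all atoms.
vocab = {atom_context_label(a, bonds, symbols) for a in symbols}
print(len(vocab))  # 3
```

Under this reading, the "atom vocab size 324" reported in the log would be the number of distinct labels found in the dataset.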


## Splitting the data
To accelerate data loading and reduce memory cost in the multi-GPU pretraining scenario, the unlabelled molecular data need to be split into several parts using `scripts/split_data.py`.

In [7]:
!python scripts/split_data.py --data_path exampledata/pretrain/tryout.csv  \
                             --features_path exampledata/pretrain/tryout.npz  \
                             --sample_per_file 100  \
                             --output_path exampledata/pretrain/tryout

Number of files: 60
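
The file count follows directly from the dataset size and `--sample_per_file`: 5,970 samples at 100 per file gives 60 files, the last of which holds the remaining 70 samples. A small sketch of that arithmetic, assuming the data are cut into contiguous chunks (the actual ordering used by `split_data.py` is not shown here):

```python
import math


def count_split_files(num_samples, sample_per_file):
    """Number of files needed when each file holds at most `sample_per_file` rows."""
    return math.ceil(num_samples / sample_per_file)


def chunk_bounds(num_samples, sample_per_file):
    """(start, end) row ranges for each chunk, end exclusive."""
    return [
        (start, min(start + sample_per_file, num_samples))
        for start in range(0, num_samples, sample_per_file)
    ]


n_files = count_split_files(5970, 100)
bounds = chunk_bounds(5970, 100)
print(n_files)     # 60
print(bounds[-1])  # (5900, 5970)
```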


## Running Pretraining on a Single GPU

In [8]:
!python main.py pretrain \
               --data_path exampledata/pretrain/tryout \
               --save_dir model/tryout \
               --atom_vocab_path exampledata/pretrain/tryout_atom_vocab.pkl \
               --bond_vocab_path exampledata/pretrain/tryout_bond_vocab.pkl \
               --batch_size 32 \
               --dropout 0.1 \
               --depth 5 \
               --num_attn_head 1 \
               --hidden_size 100 \
               --epochs 3 \
               --init_lr 0.0002 \
               --max_lr 0.0004 \
               --final_lr 0.0001 \
               --weight_decay 0.0000001 \
               --activation PReLU \
               --backbone gtrans \
               --embedding_output_type both

Namespace(activation='PReLU', atom_vocab_path='exampledata/pretrain/tryout_atom_vocab.pkl', backbone='gtrans', batch_size=32, bias=False, bond_drop_rate=0, bond_vocab_path='exampledata/pretrain/tryout_bond_vocab.pkl', cuda=True, data_path='exampledata/pretrain/tryout', dense=False, depth=5, dist_coff=0.1, dropout=0.1, embedding_output_type='both', enable_multi_gpu=False, epochs=3, fg_label_path=None, final_lr=0.0001, fine_tune_coff=1, hidden_size=100, init_lr=0.0002, max_lr=0.0004, no_cache=True, num_attn_head=1, num_mt_block=1, parser_name='pretrain', save_dir='model/tryout', save_interval=9999999999, undirected=False, warmup_epochs=2.0, weight_decay=1e-07)
Loading data
Loading data:
Number of files: 60
Number of samples: 5970
Samples/file: 100
Splitting data with seed 0.
Total size = 5,970 | train size = 5,400 | val size = 570
atom vocab size: 324, bond vocab size: 353, Number of FG tasks: 85
Pre-loaded test data: 6
Restore checkpoint, current ep
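
The command sets three learning rates (`--init_lr`, `--max_lr`, `--final_lr`), and the logged arguments include `warmup_epochs=2.0`. One common way such settings are combined is a linear warm-up from the initial to the maximum rate, followed by an exponential decay to the final rate. The sketch below assumes that scheme and 168 optimizer steps per epoch (5,400 training samples divided by a batch size of 32, rounded down). It is an illustration, not GROVER's scheduler code.

```python
def warmup_decay_lr(step, steps_per_epoch, total_epochs, warmup_epochs,
                    init_lr, max_lr, final_lr):
    """Learning rate at optimizer step `step` (0-based) under the assumed schedule."""
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    total_steps = int(total_epochs * steps_per_epoch)
    if warmup_steps > 0 and step <= warmup_steps:
        # Linear ramp from init_lr up to max_lr.
        return init_lr + (max_lr - init_lr) * step / warmup_steps
    if step >= total_steps:
        return final_lr
    # Exponential decay from max_lr down to final_lr.
    gamma = (final_lr / max_lr) ** (1.0 / (total_steps - warmup_steps))
    return max_lr * gamma ** (step - warmup_steps)


# Values from the pretraining command; steps_per_epoch is an assumption.
config = dict(steps_per_epoch=5400 // 32, total_epochs=3, warmup_epochs=2.0,
              init_lr=0.0002, max_lr=0.0004, final_lr=0.0001)

lr_start = warmup_decay_lr(0, **config)
lr_peak = warmup_decay_lr(2 * config["steps_per_epoch"], **config)
lr_end = warmup_decay_lr(3 * config["steps_per_epoch"], **config)
print(lr_start, lr_peak, lr_end)
```

With only three epochs and two of them spent warming up, most of this short run would take place below the maximum rate, which is worth keeping in mind when scaling the example up.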

# Training and Finetuning

## Extracting Molecular Features

Given a labelled molecular dataset, we can extract additional molecular features to use when finetuning the existing pretrained model. The feature matrix is stored as an `.npz` file.

In [9]:
!python scripts/save_features.py --data_path exampledata/finetune/bbbp.csv \
                                --save_path exampledata/finetune/bbbp.npz \
                                --features_generator rdkit_2d_normalized \
                                --restart 

 64% 1308/2039 [00:52<00:29, 24.50it/s]
100% 2039/2039 [01:19<00:00, 25.67it/s]


## Finetuning with existing data
Given the labelled dataset and the molecular features, we can use the `finetune` subcommand of `main.py` to finetune the pretrained model.

In [10]:
!python main.py finetune --data_path exampledata/finetune/bbbp.csv \
                        --features_path exampledata/finetune/bbbp.npz \
                        --save_dir model/finetune/bbbp/ \
                        --checkpoint_path model/tryout/model.ep3 \
                        --dataset_type classification \
                        --split_type scaffold_balanced \
                        --ensemble_size 1 \
                        --num_folds 3 \
                        --no_features_scaling \
                        --ffn_hidden_size 200 \
                        --batch_size 32 \
                        --epochs 10 \
                        --init_lr 0.00015

Fold 0
Loading data
Number of tasks = 1
Splitting data with seed 0
100% 2039/2039 [00:00<00:00, 3681.51it/s]
Total scaffolds = 1,025 | train scaffolds = 764 | val scaffolds = 123 | test scaffolds = 138
Label averages per scaffold, in decreasing order of scaffold frequency,capped at 10 scaffolds and 20 labels: [(array([0.72992701]), array([137])), (array([1.]), array([1])), (array([0.]), array([1])), (array([1.]), array([1])), (array([1.]), array([1])), (array([0.]), array([1])), (array([1.]), array([1])), (array([1.]), array([2])), (array([0.]), array([2])), (array([1.]), array([1]))]
Class sizes
p_np 0: 23.49%, 1: 76.51%
Total size = 2,039 | train size = 1,631 | val size = 203 | test size = 205
Loading model 0 from model/tryout/model.ep3
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_
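
The log reports the split in terms of scaffolds because `--split_type scaffold_balanced` keeps all molecules that share a scaffold in the same subset. A simplified sketch of that idea follows. It assumes scaffold keys have already been computed (the letters below are placeholders), places large scaffold groups first, and shuffles the small ones. The details of GROVER's own implementation may differ.

```python
import random
from collections import defaultdict


def scaffold_balanced_split(scaffolds, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split molecule indices so that no scaffold spans two subsets."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    n = len(scaffolds)
    train_cap, val_cap, test_cap = (fraction * n for fraction in fractions)

    def is_big(group):
        # Groups larger than half the val or test budget are placed first.
        return len(group) > val_cap / 2 or len(group) > test_cap / 2

    big = sorted((g for g in groups.values() if is_big(g)), key=len, reverse=True)
    small = [g for g in groups.values() if not is_big(g)]
    random.Random(seed).shuffle(small)

    train, val, test = [], [], []
    for group in big + small:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(val) + len(group) <= val_cap:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test


# Placeholder scaffold keys for 20 hypothetical molecules.
toy_scaffolds = ["A"] * 8 + ["B"] * 4 + ["C"] * 2 + list("DEFGHI")
train_idx, val_idx, test_idx = scaffold_balanced_split(toy_scaffolds)
print(len(train_idx), len(val_idx), len(test_idx))  # 16 2 2
```

Because whole groups are assigned at once, the realised subset sizes only approximate the requested fractions, which is consistent with the 1,631 / 203 / 205 split reported in the log.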

# Predicting output

## Extracting molecular features

If the finetuned model takes molecular features as input, we need to generate those features for the target molecules as well.

In [11]:
!python scripts/save_features.py --data_path exampledata/finetune/bbbp.csv \
                                --save_path exampledata/finetune/bbbp.npz \
                                --features_generator rdkit_2d_normalized \
                                --restart 

100% 2039/2039 [01:20<00:00, 25.38it/s]


## Predicting output with the finetuned model

In [12]:
!python main.py predict --data_path exampledata/finetune/bbbp.csv \
               --features_path exampledata/finetune/bbbp.npz \
               --checkpoint_dir ./model \
               --no_features_scaling \
               --output data_pre.csv

Loading training args
Loading data
Validating SMILES
Test size = 2,039
Predicting...
  0% 0/3 [00:00<?, ?it/s]Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_k.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_k.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_v.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_v.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.layernorm.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.layernorm.bias".
Loading pretrained parameter "grover.encoders.edge_blocks.0.W_i.weight".
Loading pretrained parameter "grover.encoders.
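
The `0/3` progress bar suggests that three checkpoints are loaded from `./model`, which would match the three folds requested during finetuning. A common way to combine several models is to average their predictions per molecule and task. The sketch below shows that averaging with made-up probabilities; whether GROVER combines the checkpoints in exactly this way is an assumption.

```python
import numpy as np


def average_predictions(per_model_preds):
    """Average predictions of shape (n_molecules, n_tasks) across models."""
    stacked = np.stack([np.asarray(p, dtype=float) for p in per_model_preds])
    return stacked.mean(axis=0)


# Hypothetical outputs of three checkpoints for two molecules and one task.
fold_preds = [
    [[0.9], [0.2]],
    [[0.8], [0.4]],
    [[0.7], [0.3]],
]
ensemble = average_predictions(fold_preds)
print(ensemble)
```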

## Output

The output will be saved in a file called `data_pre.csv`.
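
A quick way to look at the predictions is to read the header and the first few rows of the file. The helper below uses only the standard library and makes no assumption about the column names. It only runs the preview if `data_pre.csv` exists in the current directory.

```python
import csv
import os
from itertools import islice


def preview_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) of a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = list(islice(reader, n_rows))
    return header, rows


if os.path.exists("data_pre.csv"):
    header, rows = preview_csv("data_pre.csv")
    print(header)
    for row in rows:
        print(row)
```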

# Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!
