## Graph Transformers for Blood-Brain-Barrier Penetration Prediction
**Ayush Noori**

First, I load the relevant libraries.

In [1]:
# import base libraries
import numpy as np
import pandas as pd
import os # read directories
import matplotlib.pyplot as plt # inline plots
%matplotlib inline

# import TDC
from tdc.single_pred import ADME

Next, I load the blood-brain-barrier dataset (Martins et al.) from the [Therapeutics Data Commons (TDC)](https://tdcommons.ai/single_pred_tasks/adme/#bbb-blood-brain-barrier-martins-et-al). Rather than a random split, I use the more challenging ["scaffold split"](https://tdcommons.ai/functions/data_split/), which partitions the molecules by scaffold so that the training, validation, and test sets contain structurally different compounds. The data is split into 70% training, 10% validation, and 20% test.

Note that the scaffold split requires that the [`rdkit`](https://www.rdkit.org/docs/GettingStartedInPython.html) package is installed.
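To make the idea of a scaffold split concrete, here is a minimal sketch of the grouping logic in plain Python. It is not TDC's implementation: I assume that a scaffold string has already been computed for every molecule (for example with RDKit), and the function `scaffold_split` and the toy labels are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac=(0.7, 0.1, 0.2)):
    """Assign whole scaffold groups to train/valid/test, largest groups first."""
    # Group molecule indices by their scaffold string
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Visit groups from largest to smallest so common scaffolds land in training
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(scaffolds)
    train_cap = round(frac[0] * n)  # maximum number of training molecules
    valid_cap = round(frac[1] * n)  # maximum number of validation molecules

    train, valid, test = [], [], []
    for group in ordered:
        # A group is never divided, so no scaffold appears in two sets
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return {"train": train, "valid": valid, "test": test}

# Hypothetical scaffold labels for ten molecules (not taken from BBB_Martins)
toy_scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "D", "D"]
toy_split = scaffold_split(toy_scaffolds)
print(toy_split)
```

Because each scaffold group is assigned as a unit, the model is evaluated on scaffolds it never saw during training, which is what makes this split harder than a random one.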

In [2]:
# import data
data = ADME(name = 'BBB_Martins')
split = data.get_split(method = 'scaffold', seed = 42, frac = [0.7, 0.1, 0.2])
data.print_stats()

Found local copy...
Loading...
Done!
100%|██████████| 2030/2030 [00:01<00:00, 1614.80it/s]
--- Dataset Statistics ---
1975 unique drugs.
--------------------------


In [30]:
split['train']

| | Drug_ID | Drug | Y |
|---|---|---|---|
| 0 | Terbutylchlorambucil | CC(C)(C)OC(=O)CCCc1ccc(N(CCCl)CCCl)cc1 | 1 |
| 1 | brosotamide | Cc1cc(Br)cc(C(N)=O)c1O | 1 |
| 2 | butacetin | CC(=O)Nc1ccc(OC(C)(C)C)cc1 | 1 |
| 3 | Salicyluricacid | O=C(O)CNC(=O)c1ccccc1O | 1 |
| 4 | sumacetamol | CSCC[C@H](NC(C)=O)C(=O)Oc1ccc(NC(C)=O)cc1 | 1 |
| ... | ... | ... | ... |
| 1416 | antipyrine | Cc1cc(=O)n(-c2ccccc2)n1C | 1 |
| 1417 | Aminopyrine | Cc1c(N(C)C)c(=O)n(-c2ccccc2)n1C | 1 |
| 1418 | cyprazepam | ON1CC(=NCC2CC2)N=c2ccc(Cl)cc2=C1c1ccccc1 | 1 |
| 1419 | nomifensine | CN1Cc2c(N)cccc2C(c2ccccc2)C1 | 1 |


In [27]:
from tdc.chem_utils import MolConvert
converter = MolConvert(src = 'SMILES', dst = 'Graph2D')
# converter(['Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)\C(=O)OC',
#        'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])
converter([split['train']['Drug'][0]])

[({0: 'C',
   1: 'C',
   2: 'C',
   3: 'C',
   4: 'O',
   5: 'C',
   6: 'O',
   7: 'C',
   8: 'C',
   9: 'C',
   10: 'C',
   11: 'C',
   12: 'C',
   13: 'C',
   14: 'N',
   15: 'C',
   16: 'C',
   17: 'Cl',
   18: 'C',
   19: 'C',
   20: 'Cl',
   21: 'C',
   22: 'C'},
  array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0],
         [1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0],
         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0],
         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0],
         [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0],
         [0, 0, 0, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0],
         [0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0],
         [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0],
         [0
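For each molecule, the converter output above is a tuple of an atom dictionary (node index to element symbol) and a symmetric adjacency matrix. The nonzero entries appear to encode bond order: the 2 at positions (5, 6) and (6, 5) matches the C=O double bond of the ester. As a minimal sketch of how this layout maps to an edge list, I use a small hand-written molecule (acetamide, `CC(N)=O`) in the same format. The function `adjacency_to_edges` is a hypothetical helper, not part of TDC, and nested lists stand in for the NumPy array.

```python
def adjacency_to_edges(atoms, adjacency):
    """Return (i, j, bond_order) for each bond, listing each undirected bond once."""
    edges = []
    n = len(atoms)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, since the matrix is symmetric
            order = adjacency[i][j]
            if order != 0:
                edges.append((i, j, order))
    return edges

# Acetamide, CC(N)=O, written by hand in the same (atoms, adjacency) layout
toy_atoms = {0: "C", 1: "C", 2: "N", 3: "O"}
toy_adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 2],
    [0, 1, 0, 0],
    [0, 2, 0, 0],
]

toy_edges = adjacency_to_edges(toy_atoms, toy_adjacency)
for i, j, order in toy_edges:
    print(f"{toy_atoms[i]}{i} - {toy_atoms[j]}{j} (bond order {order})")
```

An edge list of this kind is the usual starting point for building a graph object in a library such as DGL, with the element symbols becoming node features and the bond orders becoming edge features.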

See, for example, the DGL QM9 dataset source (https://docs.dgl.ai/en/latest/_modules/dgl/data/qm9.html) and the Graphormer dataset documentation (https://graphormer.readthedocs.io/en/latest/Datasets.html#id5).

In [5]:
import torch
from dgl.data import QM9

Using backend: pytorch


In [6]:
dataset = QM9(label_keys=["mu"])

Downloading C:\Users\unity\.dgl/qm9_eV.npz from https://data.dgl.ai/dataset/qm9_eV.npz...


In [20]:
dataset[0][0]

Graph(num_nodes=5, num_edges=20,
      ndata_schemes={'R': Scheme(shape=(3,), dtype=torch.float32), 'Z': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={})
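The graph above has 5 nodes and 20 edges, with 3D coordinates stored in `R` and atomic numbers in `Z`. These counts are what I would expect if every atom is connected to every other atom within a distance cutoff, with each pair stored as two directed edges; to my understanding, DGL's QM9 loader exposes a `cutoff` argument that defaults to 5.0 Å. The sketch below reproduces the counts in plain Python, assuming the molecule is methane with an idealized tetrahedral geometry. The coordinates and the helper `cutoff_edges` are illustrative assumptions, not values or functions from QM9 or DGL.

```python
import math
from itertools import permutations

def cutoff_edges(coords, cutoff=5.0):
    """Directed edges (i, j), i != j, for every pair of atoms closer than `cutoff`."""
    return [
        (i, j)
        for i, j in permutations(range(len(coords)), 2)
        if math.dist(coords[i], coords[j]) < cutoff
    ]

# Idealized tetrahedral methane (C-H = 1.09 angstrom); illustrative, not QM9 values
d = 1.09 / math.sqrt(3)
methane = [
    (0.0, 0.0, 0.0),  # C
    (d, d, d),        # H
    (d, -d, -d),      # H
    (-d, d, -d),      # H
    (-d, -d, d),      # H
]

edges = cutoff_edges(methane)
print(len(methane), len(edges))
```

With five atoms all well within 5 Å of each other, every ordered pair is connected, giving 5 × 4 = 20 directed edges.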