# Predicting Acute Toxicity LD50

Here we show a worked example of applying Graphein to process a [molecule dataset](https://tdcommons.ai/single_pred_tasks/tox/) from the [Therapeutics Data Commons (TDC)](https://tdcommons.ai/).

**Dataset Description**: Acute toxicity LD50 (the dose that is lethal to 50% of a test population) is a conservative measure of the dose that can lead to lethal adverse effects. For a given drug, the higher the dose, the more likely it is to be lethal. This dataset is kindly provided by the authors of [1].

**Task Description**: Regression. Given a drug SMILES string, predict its acute toxicity.

**Dataset Statistics**: 7,385 drugs.

**Dataset Split**: Random split or scaffold split.

[1] Zhu, Hao, et al. "Quantitative structure-activity relationship modeling of rat acute toxicity by oral exposure." Chemical Research in Toxicology 22.12 (2009): 1913-1921.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/a-r-j/graphein/blob/master/notebooks/molecule_model_tutorial_tox.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/a-r-j/graphein/blob/master/notebooks/molecule_model_tutorial_tox.ipynb)

In [42]:
# Install Graphein if necessary
# !pip install graphein[extras]

# Install TDC if necessary
# !pip install PyTDC

# NB: you may need to install DL libraries such as pytorch, pytorch-lightning and torch-geometric.
# These are left to the user to install, as they depend on your particular setup (e.g. CUDA version).

In [None]:
# NBVAL_SKIP
from tdc.single_pred import Tox

In [43]:
# NBVAL_SKIP
# Load data
data = Tox(name = 'LD50_Zhu')
split = data.get_split()
split["train"].head()

Found local copy...
Loading...
Done!


|   | Drug_ID | Drug | Y |
|---|---------|------|---|
| 0 | Methane, tribromo- | BrC(Br)Br | 2.343 |
| 1 | Bromoethene (9CI) | C=CBr | 2.33 |
| 2 | 1,1'-Biphenyl, hexabromo- | Brc1ccc(-c2ccc(Br)c(Br)c2Br)c(Br)c1Br | 1.465 |
| 3 | Isothiocyanic acid, p-bromophenyl ester | S=C=Nc1ccc(Br)cc1 | 2.729 |
| 4 | Benzene, bromo- | Brc1ccccc1 | 1.765 |


## Creating Molecular Graphs with Graphein

In [44]:
# NBVAL_SKIP
import torch
import graphein.molecule as gm
import graphein.ml as ml

In [45]:
# NBVAL_SKIP
config = gm.MoleculeGraphConfig()

# Iterate over dataframes containing each split
train_graphs = [gm.construct_graph(smiles=smiles, config=config) for smiles in split["train"]["Drug"]]
valid_graphs = [gm.construct_graph(smiles=smiles, config=config) for smiles in split["valid"]["Drug"]]
test_graphs = [gm.construct_graph(smiles=smiles, config=config) for smiles in split["test"]["Drug"]]

# Assign labels to graphs
train_graphs = ml.add_labels_to_graph(train_graphs, labels=split["train"]["Y"].apply(torch.tensor), name="graph_label")
valid_graphs = ml.add_labels_to_graph(valid_graphs, labels=split["valid"]["Y"].apply(torch.tensor), name="graph_label")
test_graphs = ml.add_labels_to_graph(test_graphs, labels=split["test"]["Y"].apply(torch.tensor), name="graph_label")
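
Each node of these graphs carries an `atom_type_one_hot` feature, which the model below uses as its only input. As a rough illustration of what such a feature looks like, here is a plain-Python sketch. The 11-element vocabulary and the `one_hot` helper are assumptions made for this example and are not read from Graphein's source; only the width of 11 is taken from the batch printed later in this notebook.

```python
# Illustrative element vocabulary (an assumption, not taken from Graphein).
ELEMENTS = ["C", "H", "O", "N", "F", "P", "S", "Cl", "Br", "I", "B"]


def one_hot(symbol, vocabulary=ELEMENTS):
    """Return a list with a single 1 at the position of `symbol`."""
    if symbol not in vocabulary:
        raise ValueError(f"unknown element: {symbol}")
    return [int(symbol == element) for element in vocabulary]


# Heavy atoms of bromoform, BrC(Br)Br, in SMILES order.
atoms = ["Br", "C", "Br", "Br"]
features = [one_hot(a) for a in atoms]

print(len(features), len(features[0]))  # 4 11
print(features[1])  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Stacking one such row per atom gives a `[num_nodes, 11]` matrix, which is why the first convolution layer of the model takes 11 input channels.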


In [46]:
# NBVAL_SKIP
from graphein.ml import GraphFormatConvertor
from torch_geometric.loader import DataLoader

# Define a conversion object
convertor = GraphFormatConvertor(
    src_format="nx",
    dst_format="pyg",
    columns=["edge_index", "atom_type_one_hot", "graph_label"]
    )

# Convert Graphs from NX to PyG
train_graphs = [convertor(g) for g in train_graphs]
valid_graphs = [convertor(g) for g in valid_graphs]
test_graphs = [convertor(g) for g in test_graphs]

# Create Dataloaders
train_loader = DataLoader(train_graphs, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_graphs, batch_size=32, shuffle=False)
test_loader = DataLoader(test_graphs, shuffle=False)
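
The number of batches each loader yields follows from the split sizes and batch sizes. The sketch below assumes a 70/10/20 split of the 7,385 molecules (5,170 / 738 / 1,477); these sizes are inferred for illustration and are not printed by the notebook. It also relies on a `DataLoader` created without `batch_size` using batches of one.

```python
import math

# Assumed split sizes for the 7,385 molecules (roughly 70% / 10% / 20%).
split_sizes = {"train": 5170, "valid": 738, "test": 1477}

# Batch sizes passed to each DataLoader; the test loader uses the default of 1.
batch_sizes = {"train": 32, "valid": 32, "test": 1}

# The last batch of a loader may be smaller, so round up.
n_batches = {
    name: math.ceil(split_sizes[name] / batch_sizes[name])
    for name in split_sizes
}

print(n_batches)  # {'train': 162, 'valid': 24, 'test': 1477}
```

Under these assumptions the training and validation loaders together give 186 batches, which would be consistent with the 186 steps per epoch in the training log further down if the progress bar counts validation batches as well. Passing `batch_size=32` to the test loader would cut its 1,477 batches to 47.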

In [49]:
# NBVAL_SKIP
# Inspect a batch
for i in train_loader:
    print(i)
    break

DataBatch(edge_index=[2, 546], node_id=[32], atom_type_one_hot=[529, 11], graph_label=[32], num_nodes=529, batch=[529], ptr=[33])
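
In this output, 32 molecules have been merged into one large disconnected graph with 529 nodes. The `batch` vector records which molecule each node belongs to, and `ptr` holds the cumulative node offsets, hence its 33 entries for 32 graphs. A minimal plain-Python sketch of how these two vectors relate to the per-graph node counts (the counts below are made up for illustration):

```python
from itertools import accumulate

# Hypothetical node counts for three small molecules in one mini-batch.
nodes_per_graph = [4, 2, 5]

# batch[i] is the index of the graph that node i belongs to.
batch = [g for g, n in enumerate(nodes_per_graph) for _ in range(n)]

# ptr[g] is the offset of graph g's first node; one extra entry marks the end.
ptr = [0, *accumulate(nodes_per_graph)]

print(batch)  # [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
print(ptr)    # [0, 4, 6, 11]
```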


## Define Model

In [50]:
# NBVAL_SKIP
from torch_geometric.nn import GCNConv, global_add_pool
from torch.nn.functional import mse_loss
from torch.nn import functional as F
import torch.nn as nn
import pytorch_lightning as pl

In [51]:
# NBVAL_SKIP
config_default = dict(
    n_hid = 8,
    n_out = 8,
    batch_size = 4,
    dropout = 0.5,
    lr = 0.001,
    num_heads = 32,
    num_att_dim = 64,
    model_name = 'GCN'
)

class Struct:
    def __init__(self, **entries):
        self.__dict__.update(entries)

config = Struct(**config_default)

# A `global` statement has no effect at module level, so a plain assignment is enough
model_name = config.model_name
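
The `Struct` class only exposes dictionary keys as attributes. As a sketch of an alternative (not what this notebook uses), the standard library's `types.SimpleNamespace` gives the same attribute access without a custom class. The reduced set of keys below is chosen for brevity.

```python
from types import SimpleNamespace

# A subset of the hyperparameters above, kept short for illustration.
config_default = dict(n_hid=8, n_out=8, dropout=0.5, lr=0.001, model_name="GCN")

# Keys become attributes, exactly as with the hand-written Struct class.
config = SimpleNamespace(**config_default)

print(config.n_hid, config.lr, config.model_name)  # 8 0.001 GCN
```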

In [52]:
# NBVAL_SKIP
class GraphNets(pl.LightningModule):
    def __init__(self):
        super().__init__()

        self.layer1 = GCNConv(in_channels=11, out_channels=config.n_hid)
        self.layer2 = GCNConv(in_channels=config.n_hid, out_channels=config.n_out)
        self.decoder = nn.Linear(config.n_out, 1)

    def forward(self, g):
        x = g.atom_type_one_hot.float()
        x = F.dropout(x, p=config.dropout, training=self.training)
        x = F.elu(self.layer1(x, g.edge_index))
        x = F.dropout(x, p=config.dropout, training=self.training)
        x = self.layer2(x, g.edge_index)
        x = global_add_pool(x, batch=g.batch)
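        x = self.decoder(x)
        # Drop the trailing dimension so predictions have shape [batch], matching graph_label;
        # otherwise mse_loss would broadcast [batch, 1] against [batch]
        return x.squeeze(-1)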

    def training_step(self, batch, batch_idx):
        x = batch
        y = x.graph_label
        y_hat = self(x)
        loss = mse_loss(y_hat, y)

        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x = batch
        y = x.graph_label
        y_hat = self(x)
        loss = mse_loss(y_hat, y)
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        x = batch
        y = x.graph_label
        y_hat = self(x)
        loss = mse_loss(y_hat, y)

        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=config.lr)
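
The step that turns per-atom embeddings into one vector per molecule is `global_add_pool`, which sums node features within each graph of the batch. The following plain-Python stand-in illustrates the idea; `add_pool` is a hypothetical helper written for this example, not the PyTorch Geometric implementation.

```python
def add_pool(node_features, batch):
    """Sum node feature vectors per graph (a plain-Python stand-in for sum pooling)."""
    n_graphs = max(batch) + 1
    n_feat = len(node_features[0])
    pooled = [[0.0] * n_feat for _ in range(n_graphs)]
    for feats, g in zip(node_features, batch):
        for j, value in enumerate(feats):
            pooled[g][j] += value
    return pooled


# Three nodes with two features each; the first two belong to graph 0.
node_features = [[1.0, 0.0], [0.5, 2.0], [3.0, 1.0]]
batch = [0, 0, 1]

pooled = add_pool(node_features, batch)
print(pooled)  # [[1.5, 2.0], [3.0, 1.0]]
```

Because the pooled value is a sum, larger molecules tend to produce larger pooled vectors; mean pooling would be a possible alternative if that is undesirable.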

## Train!

In [53]:
# NBVAL_SKIP
model = GraphNets()

# accelerator="auto" selects a GPU when one is available and otherwise falls back to CPU
trainer = pl.Trainer(max_epochs=50, accelerator="auto", devices=1)
trainer.fit(model, train_loader, valid_loader)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name    | Type    | Params
------------------------------------
0 | layer1  | GCNConv | 96    
1 | layer2  | GCNConv | 72    
2 | decoder | Linear  | 9     
------------------------------------
177       Trainable params
0         Non-trainable params
177       Total params
0.001     Total estimated model params size (MB)
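
The parameter counts in this summary can be checked by hand. The sketch below assumes each `GCNConv` layer holds one weight matrix plus a bias vector, like a dense layer; `dense_params` is a helper defined only for this example.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


layer1 = dense_params(11, 8)   # 11 one-hot atom features -> n_hid = 8
layer2 = dense_params(8, 8)    # n_hid = 8 -> n_out = 8
decoder = dense_params(8, 1)   # n_out = 8 -> 1 regression output

print(layer1, layer2, decoder, layer1 + layer2 + decoder)  # 96 72 9 177
```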


Epoch 49: 100%|██████████| 186/186 [00:01<00:00, 179.13it/s, loss=0.932, v_num=2]


## Test

In [58]:
# NBVAL_SKIP
trainer.test(model, dataloaders=[test_loader])

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]


Testing DataLoader 0: 100%|██████████| 1477/1477 [00:03<00:00, 443.90it/s]


[{'test_loss': 0.8866580128669739}]
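
The reported `test_loss` is a mean squared error, so it is expressed in squared label units. Taking the square root gives a root mean squared error in the same units as the label `Y`, which is often easier to interpret:

```python
import math

# Test MSE copied from the output above.
test_mse = 0.8866580128669739

# RMSE is in the same units as the regression target.
test_rmse = math.sqrt(test_mse)

print(f"{test_rmse:.3f}")  # 0.942
```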