<a href="https://colab.research.google.com/github/beekiran00/DeepMirror-Task-ML-Engineer---Bhanu-Velpula/blob/main/Transformer_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading and Importing Libraries

In [1]:
!pip install rdkit
!pip install pyg-lib torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.13.0+cpu.html
!pip install PyTDC
!pip install deepchem
!pip install dgl
!pip install dgllife
# since we are using a pretrained model, clone its repository
!git clone https://github.com/seyonechithrananda/bert-loves-chemistry.git
%cd /content/bert-loves-chemistry
!pwd
!pip install transformers
!pip install simpletransformers
!pip install wandb
!wget https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/vocab.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rdkit
  Downloading rdkit-2022.9.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.3 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2022.9.3
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://data.pyg.org/whl/torch-1.13.0+cpu.html
Collecting pyg-lib
  Downloading https://data.pyg.org/whl/torch-1.13.0%2Bcpu/pyg_lib-0.1.0%2Bpt113cpu-cp38-cp38-linux_x86_64.whl (539 kB)
Collecting torch-scatter
  Downloading https://data.pyg.org/whl/torch-1.13.0%2Bcpu/torch_scatter-2.1.0%2Bpt113cpu-cp38-cp38-linux_x86_64.whl (491 kB)
Collecting torch-sparse
  Downloading https://data.pyg.org/whl/to

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyTDC
  Downloading PyTDC-0.3.8.tar.gz (100 kB)
Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.9.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.3 MB)
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Building wheels for collected packages: PyTDC
  Building wheel for PyTDC (setup.py) ... done
  Created wheel for PyTDC: filename=PyTDC-0.3.8-py3-none-any.whl size=136139 sha256=b965a2a5fa28f55c34b81aeb2a6fefe3a6d8cc4e7d0709d338ad100fadd60463
  Stored in directory: /root/.cache/pip/wheels/4a/4f/da/b494dde823e311675b3ce778ca8c22e734ef8f779dbf311805
Successfully built PyTDC
Installing collected packages: rdkit-pypi, fuzzywuzzy, PyTDC
Successfully installed PyTDC-0.3.8 fuzzywuzzy-0.18.0 rdk

In [2]:
# IMPORTING LIBRARIES

# Import basic libraries
import os
import pandas as pd
import numpy as np
from typing import List
import logging

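# import sklearn; the metrics submodule is imported explicitly because it is used for evaluation
import sklearn
import sklearn.metrics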

# RDKit
from rdkit import Chem
from rdkit.Chem.rdmolops import GetAdjacencyMatrix


# import molnet loaders from deepchem
from deepchem.molnet import load_bbbp, load_clearance, load_clintox, load_delaney, load_hiv, load_qm7, load_tox21


# import MolNet dataloader from bert-loves-chemistry fork
from chemberta.utils.molnet_dataloader import load_molnet_dataset, write_molnet_dataset_for_chemprop

# import classification model from simple transformers
from simpletransformers.classification import ClassificationModel
from simpletransformers.classification import ClassificationArgs




# Datasets

In [3]:
from tdc.single_pred import ADME
data = ADME(name = 'HIA_Hou')
split = data.get_split()

Downloading...
100%|██████████| 40.1k/40.1k [00:00<00:00, 5.59MiB/s]
Loading...
Done!


The split is returned as a dictionary with three keys, `train`, `valid` and `test`, each mapping to the corresponding subset of the data.

In [4]:
#split

In [5]:
print("Dataset type: ", type(data))
print(split.keys())
#print(split.values())


Dataset type:  <class 'tdc.single_pred.adme.ADME'>
dict_keys(['train', 'valid', 'test'])


In [6]:
# Convert each entry of the split dictionary to a dataframe (train, valid and test)
train_df = pd.DataFrame.from_dict(split['train'])
valid_df = pd.DataFrame.from_dict(split['valid'])
test_df = pd.DataFrame.from_dict(split['test'])

# Transformer Model Class

Transformers are mainly used for NLP tasks. Because SMILES strings are plain text, the same architecture can be applied to property prediction on small molecules. For this task we use ChemBERTa, a transformer model pre-trained on PubChem-10M.
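
To make the idea of treating a molecule as text concrete, here is a minimal sketch that splits a SMILES string into tokens with a regular expression. It is only an illustration: the ChemBERTa checkpoint ships its own learned BPE tokenizer (printed later in this notebook), and the pattern, the `tokenize_smiles` helper and the aspirin example are assumptions introduced here.

```python
import re

# Simplified atom-level SMILES pattern (an assumption for illustration,
# not the tokenizer that ChemBERTa actually uses).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|[0-9]|%[0-9]{2}|[()=#+\-\\/.:@~*$]")


def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into bracket atoms, atoms, ring digits and bond symbols."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Fail loudly if any character was not covered by the pattern.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognised characters in SMILES: {smiles!r}")
    return tokens


aspirin = "CC(=O)Oc1ccccc1C(=O)O"  # example molecule, not taken from the dataset
print(tokenize_smiles(aspirin))
```

Each token would then be mapped to an integer id before being fed to the model; a learned BPE vocabulary does the same job but can merge frequent character runs into a single token.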

In [7]:
# Keep only the SMILES strings ('Drug') and the labels ('Y') in each dataframe

train_bert_data = train_df[["Drug", "Y"]] # prepare train dataset
test_bert_data = test_df[["Drug", "Y"]] # prepare test dataset
valid_bert_data = valid_df[["Drug", "Y"]] # prepare valid dataset


In [8]:
# Set up a logger to record any issues and to notify us of problems with the arguments we've set for the model.
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

Using simpletransformers, load the pre-trained model from the Hugging Face model hub. We set the number of epochs to 10 in the arguments and set `auto_weights` to `True` for automatic weight balancing.
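
As a rough illustration of what weight balancing means, the sketch below computes inverse-frequency class weights by hand; a list like this could be passed through the optional `weight` argument of `ClassificationModel`. The formula `n_samples / (n_classes * class_count)` is the common "balanced" heuristic and is an assumption here, not necessarily what `auto_weights` does internally. The helper `balanced_class_weights` is hypothetical, and the example label counts (19 negatives, 97 positives) are taken from the test-set confusion matrix reported at the end of this notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class, ordered by class id: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[c]) for c in sorted(counts)]


# Hypothetical label vector with the same imbalance as the test split.
labels = [0] * 19 + [1] * 97
weights = balanced_class_weights(labels)
print([round(w, 3) for w in weights])
```

The minority class receives the larger weight, so errors on it contribute more to the loss.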

In [9]:
# load pre-trained model from Hugging Face (class weights can also be set via the optional `weight` argument)
model_args = {'evaluate_each_epoch': True, 'evaluate_during_training_verbose': True, 'no_save': True,
              'num_train_epochs': 10, 'auto_weights': True}
model = ClassificationModel('roberta', 'seyonec/PubChem10M_SMILES_BPE_396_250', use_cuda=False, args=model_args)

Some weights of the model checkpoint at seyonec/PubChem10M_SMILES_BPE_396_250 were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.bias', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at seyonec/PubChem10M_SMILES_BPE

In [10]:
print(model.tokenizer)

PreTrainedTokenizerFast(name_or_path='seyonec/PubChem10M_SMILES_BPE_396_250', vocab_size=7924, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'sep_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'cls_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True)})


In [11]:
# Check that the train, evaluation and test dataframes are set up properly. Each should have only two columns: the SMILES string and its corresponding label.
print("Train Dataset: {}".format(train_bert_data.shape))
print("Eval Dataset: {}".format(valid_bert_data.shape))
print("TEST Dataset: {}".format(test_bert_data.shape))

Train Dataset: (404, 2)
Eval Dataset: (58, 2)
TEST Dataset: (116, 2)
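
As a quick sanity check on these shapes, the row counts can be turned into split fractions. This small sketch only uses the three numbers printed above; the variable names are mine.

```python
# Row counts printed above for the train, validation and test dataframes.
split_sizes = {"train": 404, "valid": 58, "test": 116}

total = sum(split_sizes.values())
fractions = {name: n / total for name, n in split_sizes.items()}

print(f"total molecules: {total}")
for name, frac in fractions.items():
    print(f"{name}: {frac:.3f}")
```

The fractions come out close to 0.70, 0.10 and 0.20, so the data appears to follow a roughly 70/10/20 split.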


In [12]:
#train the model over 10 epochs
model.train_model(train_bert_data, eval_df=valid_bert_data) 

(510, 0.06790382585052558)
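
The returned tuple appears to hold the number of optimisation steps and the average training loss. The step count can be cross-checked with a little arithmetic, assuming the simpletransformers default training batch size of 8 (an assumption, since the notebook does not set it) and no gradient accumulation.

```python
import math

n_train_rows = 404  # from the train dataframe shape
num_epochs = 10     # 'num_train_epochs' in the model arguments
batch_size = 8      # assumed library default, not set in the notebook

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(n_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

Under these assumptions the total matches the first element of the tuple above.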

In [13]:
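# Accuracy of the trained model; the extra metric is stored under the key 'acc'
result, model_outputs, wrong_predictions = model.eval_model(test_bert_data, acc=sklearn.metrics.accuracy_score)

# Second evaluation with roc_auc_score, again under the key 'acc'.
# Note: this overwrites `result`, and the metric is computed from the predicted class labels,
# so the 'acc' value printed below is the ROC-AUC of the hard predictions, not the accuracy.
result, model_outputs, wrong_predictions = model.eval_model(test_bert_data, acc=sklearn.metrics.roc_auc_score)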

In [14]:
print(result) # display the metrics

{'mcc': 0.9040150203311128, 'tp': 96, 'tn': 17, 'fp': 2, 'fn': 1, 'auroc': 0.9875203472599023, 'auprc': 0.9974635942714869, 'acc': 0.9422137818773738, 'eval_loss': 0.16801945546661348}
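
The confusion-matrix counts in this dictionary are enough to recompute the headline metrics by hand, which is a useful check on what each key means. The sketch below uses only the four counts printed above; the variable names are mine. Plain accuracy comes out at about 0.974, whereas the value stored under `'acc'` (about 0.942) equals the mean of sensitivity and specificity, which is what `roc_auc_score` returns when it is given hard class predictions.

```python
import math

tp, tn, fp, fn = 96, 17, 2, 1  # counts from the printed result

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
balanced_accuracy = (sensitivity + specificity) / 2
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"accuracy          = {accuracy:.4f}")
print(f"sensitivity       = {sensitivity:.4f}")
print(f"specificity       = {specificity:.4f}")
print(f"balanced accuracy = {balanced_accuracy:.4f}")
print(f"MCC               = {mcc:.4f}")
```

The recomputed MCC agrees with the `'mcc'` entry, and the lower specificity shows that most of the remaining errors are false positives on the small negative class.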
