<a href="https://colab.research.google.com/github/aishus1/Drug-Efficiency-Prediction/blob/main/training_chemberta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# First, install deepchem
!pip install --pre deepchem

In [None]:
import deepchem
deepchem.__version__

In [None]:
from rdkit import Chem

We want to install NVIDIA's Apex tool, which is used by the training pipeline built on `simple-transformers` and Weights and Biases. Apex lets us use 16-bit training, mixed precision, and distributed training without any changes to our code. GPUs have traditionally been optimized for 32-bit (single precision) math rather than 16-bit (half precision) or 64-bit (double precision), so deep learning models are usually trained in 32-bit. Switching to 16-bit halves the memory footprint and, in theory, reduces computation, at the expense of a smaller number range and lower precision. However, pure 16-bit training causes problems of its own: imprecise weight updates, gradient underflow, and gradient overflow. **Mixed precision training, with Apex, alleviates these problems**.
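
To make the 16-bit problems above concrete, here is a minimal sketch that uses only the Python standard library (not Apex) to round values to half precision. The helper `to_half` and the `loss_scale` value are illustrative assumptions, not part of Apex's API; the sketch only shows why small updates and small gradients vanish in half precision and how scaling before the cast can preserve them.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Imprecise weight update: near 1.0 the spacing between half-precision
# values is 2**-10 (about 0.001), so a smaller update is rounded away.
weight, update = 1.0, 1e-4
updated = to_half(weight + update)
print("weight after update in half precision:", updated)

# Gradient underflow: a very small gradient rounds to zero in half precision.
grad = 1e-8
print("gradient stored in half precision:", to_half(grad))

# Loss scaling, the idea behind mixed precision: scale the value before
# casting to half precision, then unscale it in higher precision.
loss_scale = 1024.0  # illustrative value, not a library default
recovered = to_half(grad * loss_scale) / loss_scale
print("gradient recovered with loss scaling:", recovered)
```

Running the cell shows the update and the raw gradient being lost, while the scaled gradient survives the round trip with a small relative error.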

We will also install `simple-transformers`, a library built on top of HuggingFace's `transformers` package, which we will use for fine-tuning ChemBERTa.

In [None]:
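# Each `!` command runs in its own subshell, so `!cd` does not persist
# between lines; clone to and install from an absolute path instead.
!git clone https://github.com/NVIDIA/apex /content/apex
!pip install -v --no-cache-dir /content/apex
!pip install transformers
!pip install simpletransformers
!pip install wandb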

In [None]:
import sys
!test -d bertviz_repo && echo "FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo"
# !rm -r bertviz_repo # Uncomment if you need a clean pull from repo
!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if 'bertviz_repo' not in sys.path:
  sys.path += ['bertviz_repo']
!pip install regex

We're going to clone an auxiliary repository, `bert-loves-chemistry`, which lets us use the MolNet dataloader for ChemBERTa. The dataloader automatically generates scaffold splits on any MoleculeNet dataset.
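
As a rough illustration of what a scaffold split does, the sketch below groups molecules by a scaffold key and assigns whole groups to either the training or the validation set, so no scaffold appears in both. It is not the MolNet dataloader's implementation: the function `scaffold_split` and the scaffold strings are hypothetical, and in practice the keys would be computed from the SMILES (for example as Bemis-Murcko scaffolds with RDKit).

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8):
    """Split indices so that every scaffold group lands in exactly one set."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties are broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    cutoff = frac_train * len(scaffolds)
    train, valid = [], []
    for group in ordered:
        # A group goes to train only if it fits entirely under the cutoff.
        if len(train) + len(group) <= cutoff:
            train.extend(group)
        else:
            valid.extend(group)
    return train, valid

# Hypothetical scaffold labels for ten molecules
scaffolds = ["benzene", "benzene", "benzene", "pyridine", "pyridine",
             "indole", "furan", "benzene", "pyridine", "indole"]
train_idx, valid_idx = scaffold_split(scaffolds)
print("train:", train_idx)
print("valid:", valid_idx)
```

Because whole groups are moved together, the validation set contains only scaffolds the model never saw during training, which is a harder and more realistic test than a random split.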

In [None]:
!git clone https://github.com/seyonechithrananda/bert-loves-chemistry.git

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Load models

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline, RobertaModel, RobertaTokenizer
from bertviz import head_view

model = AutoModelForMaskedLM.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

In [None]:
from transformers import RobertaModel, RobertaTokenizer
from bertviz import head_view

model_version = 'seyonec/ChemBERTa-zinc-base-v1'
model = RobertaModel.from_pretrained(model_version, output_attentions=True)
tokenizer = RobertaTokenizer.from_pretrained(model_version)

### Define the Paths

In [None]:
TRAIN_PATH = '/content/drive/MyDrive/Code/train_smiles.csv'

In [None]:
import pandas as pd

# load the data
train = pd.read_csv(TRAIN_PATH)

In [None]:
# separate features and labels

X_train = train['canonical_smiles']
y_train = train['pIC50']
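
The `pIC50` label is the negative base-10 logarithm of the IC50 expressed in mol/L, so higher values mean a more potent compound. The training CSV here already contains `pIC50`; the sketch below only illustrates the conversion for the case where raw IC50 values in nanomolar are available. The function name `pic50_from_ic50_nm` is hypothetical.

```python
import math

def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)

for ic50 in (1.0, 1000.0):
    print(ic50, "nM ->", pic50_from_ic50_nm(ic50))
```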

### Define the Model

In [None]:
from torch.utils.data import DataLoader, TensorDataset
import torch
from torch.nn import functional as F
from transformers import AutoTokenizer, RobertaModel
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np
from tqdm.notebook import tqdm

# Define the regression model: a ChemBERTa (RoBERTa) encoder plus a small regression head
class RobertaForRegression(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.roberta = model
        # The head's input size must match the encoder's hidden size
        # (768 here, the same as 'roberta-base').
        # When adding more linear layers, each layer's input dimension must
        # equal the output dimension of the layer before it.
        # Note: there is no activation between these layers, so together
        # they are equivalent to a single linear map from 768 to 1.
        self.regressor = torch.nn.Linear(model.config.hidden_size, 64)
        self.regressor_layer_one = torch.nn.Linear(64, 32)
        self.regressor_layer_two = torch.nn.Linear(32, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        pooler_output = outputs.pooler_output
        # Pass the pooled output through every layer defined in the constructor, in order
        regression_output = self.regressor(pooler_output)
        regression_layer_one_output = self.regressor_layer_one(regression_output)
        regression_layer_two_output = self.regressor_layer_two(regression_layer_one_output)
        return regression_layer_two_output

# Setup tokenizer
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
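
The regression head above stacks linear layers without any activation in between. The pure-Python sketch below, using small made-up weight matrices and omitting biases for brevity, shows that such a stack computes the same function as one linear layer, and that inserting a nonlinearity such as ReLU changes that. The helper names and numbers are illustrative only.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def relu(v):
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, 2.0, 0.0],
      [0.0, -1.0, 1.0]]   # first layer: 3 inputs -> 2 outputs
W2 = [[3.0, -1.0]]        # second layer: 2 inputs -> 1 output
x = [0.5, -2.0, 4.0]

stacked = matvec(W2, matvec(W1, x))           # two linear layers in sequence
collapsed = matvec(matmul(W2, W1), x)         # one equivalent linear layer
with_relu = matvec(W2, relu(matvec(W1, x)))   # nonlinearity in between

print("stacked:  ", stacked)
print("collapsed:", collapsed)
print("with ReLU:", with_relu)
```

If extra capacity in the head is the goal, placing an activation such as `torch.nn.ReLU()` between the linear layers is one common option; whether it helps on this dataset would need to be checked experimentally.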

### Training

In [None]:
# Define loss function and learning parameters
loss_fn = torch.nn.MSELoss()
learning_rate = 0.00005
num_epochs = 50

# KFold Cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
mse_scores = []
r2_score_set = []

for train_index, valid_index in kf.split(X_train):
    X_train_fold, X_valid_fold = X_train.iloc[train_index], X_train.iloc[valid_index]
    y_train_fold, y_valid_fold = y_train.iloc[train_index], y_train.iloc[valid_index]

    # Create DataLoader for training fold (labels cast to float32 to match the model's output dtype)
    input_encodings_train = tokenizer(list(X_train_fold), return_tensors="pt", padding=True, truncation=True, max_length=512)
    train_dataset = TensorDataset(input_encodings_train["input_ids"], input_encodings_train["attention_mask"], torch.tensor(y_train_fold.to_numpy(), dtype=torch.float32).unsqueeze(-1))
    train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)

    # Create DataLoader for validation fold (no shuffling needed for evaluation)
    input_encodings_valid = tokenizer(list(X_valid_fold), return_tensors="pt", padding=True, truncation=True, max_length=512)
    valid_dataset = TensorDataset(input_encodings_valid["input_ids"], input_encodings_valid["attention_mask"], torch.tensor(y_valid_fold.to_numpy(), dtype=torch.float32).unsqueeze(-1))
    valid_dataloader = DataLoader(valid_dataset, batch_size=64, shuffle=False)

    # Initialize a fresh model and optimizer for this fold
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = RobertaForRegression(RobertaModel.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # Training loop
    for epoch in tqdm(range(num_epochs)):
        model.train()
        for batch in train_dataloader:
            input_ids_batch, attention_masks_batch, labels_batch = (t.to(device) for t in batch)
            outputs = model(input_ids_batch, attention_mask=attention_masks_batch)
            loss = loss_fn(outputs, labels_batch)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Validation loop
    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch in valid_dataloader:
            input_ids_batch, attention_masks_batch, labels_batch = (t.to(device) for t in batch)
            outputs = model(input_ids_batch, attention_mask=attention_masks_batch)
            # Flatten (batch, 1) tensors to 1-D before collecting
            all_preds.extend(outputs.squeeze(-1).cpu().numpy())
            all_labels.extend(labels_batch.squeeze(-1).cpu().numpy())

    mse_scores.append(mean_squared_error(all_labels, all_preds))
    r2_score_set.append(r2_score(all_labels, all_preds))

print("MSE scores from 5-fold cross validation:", mse_scores)
print("Mean R2 score across folds: {}".format(np.mean(r2_score_set)))
print("Root Mean Square Error (sqrt of mean fold MSE): ", np.sqrt(np.mean(mse_scores)))
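
The last cell reports the square root of the mean fold MSE, which is not the same number as the average of the per-fold RMSE values. The sketch below, in pure Python with made-up labels and predictions for two folds, computes both aggregates along with the mean R2 so the difference is visible. The helper names and data are illustrative only.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up (labels, predictions) pairs standing in for two validation folds
folds = [
    ([5.0, 6.0, 7.0], [5.5, 6.0, 6.5]),
    ([4.0, 8.0], [5.0, 7.0]),
]

fold_mse = [mse(y, p) for y, p in folds]
fold_r2 = [r2(y, p) for y, p in folds]

rmse_of_mean_mse = math.sqrt(sum(fold_mse) / len(fold_mse))
mean_of_fold_rmse = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)
mean_r2 = sum(fold_r2) / len(fold_r2)

print("sqrt(mean fold MSE):", rmse_of_mean_mse)
print("mean of fold RMSEs: ", mean_of_fold_rmse)
print("mean R2:            ", mean_r2)
```

The square root of the mean MSE is never smaller than the mean of the per-fold RMSEs, so it is worth stating which one is being reported when comparing against other results.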