## **Introduction to ADCNet: Predicting ADC Activity with DeepChem**

Advancements in molecular biology have revolutionized drug discovery, focusing on more selective clinical candidates. Traditional small-molecule inhibitors face limitations like off-target toxicity and drug resistance. Monoclonal antibodies (mAbs) improved targeting accuracy but still struggle with heterogeneous cancer cell populations. To address these issues, Antibody-Drug Conjugates (ADCs) have emerged as a promising cancer therapy. By combining the specificity of mAbs with potent cytotoxic drugs linked through chemical linkers, ADCs enable targeted delivery to cancer cells, reducing harm to healthy tissues. However, developing effective ADCs is complex, requiring careful selection of antibodies, payloads, and linkers, which all impact safety and efficacy.

Therefore, ADCNet [[1]](#1) has been developed as a comprehensive deep learning framework that addresses the above issues and accurately predicts the activity of antibody-drug conjugates. It utilizes ESM-2 for key features from antibody and antigen sequences and FG-BERT for processing the SMILES strings of linkers and payloads. The framework also incorporates the Drug-Antibody Ratio (DAR), enhancing the rational design of safer and more effective ADC candidates. By learning and integrating patterns from these various molecular components, ADCNet aids in the rational design of safer and more effective ADC candidates. 

In this tutorial, we will explore how to predict the therapeutic activity of Antibody-Drug Conjugates (ADCs) using ADCNet, a unified deep learning framework implemented in DeepChem. Before, proceeding with this guide, it is recommended to build a foundational understanding of ADCs. Refer to the "Introduction to Antibody-Drug Conjugates" [[2]](#2) notebook available in the DeepChem tutorials.

# **Colab**
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Introduction_to_ADCNet.ipynb)

## **Evolution of ADCs**

As the field continues to advance, the design of ADCs has been progressively refined over several generations. First-generation ADCs, such as *gemtuzumab ozogamicin*, were built using humanized antibodies, a single type of cytotoxic drug, and acid-sensitive linkers. These components were originally derived from murine antibodies and conventional chemotherapy agents. The conjugation process was typically random, targeting lysine or cysteine residues, which resulted in mixtures with uneven Drug-Antibody Ratios (DAR). This lack of uniformity made it difficult to establish a consistent therapeutic index and often led to off-target toxicity and narrow treatment windows. As a result, first-generation ADCs were prone to unmanageable side effects and limited clinical effectiveness.<br>

In contrast, second-generation ADCs, such as *brentuximab vedotin* and *trastuzumab emtansine*, introduced more powerful cytotoxic agents like tubulin inhibitors, along with more stable linkers. These improvements significantly enhanced both the treatment’s efficacy and molecular stability.<br>

Moreover, the third generation ADCs, represented by drugs like *polatuzumab vedotin* and *enfortumab vedotin*, introduced site-specific conjugation techniques and hydrophilic linkers. These innovations allowed for precise control over DAR, thereby enhancing both safety and therapeutic efficacy.

Let's take a closer look at the evolution of Antibody-Drug Conjugates (ADCs) with some of the first, second, and third-generation ADCs.

<img src = "assets/generations_adc.jpg" alt="image" height="700" width="800"> <br> **Fig.1** Schematic representations of first-, second-, and third-generation Antibody-Drug Conjugates (ADCs). [[3]](#3)

As we analyze the evolution of ADC design through the generations, we can observe significant advancements in various aspects. The figure above demonstrates the progression of ADC design through various generations. It highlights key differences, including linker cleavability (cleavable versus non-cleavable), the format of the monoclonal antibody (such as the IgG1 subtype), and the number of cytotoxic warheads attached to each antibody. Each generation shows advancements in specificity, stability, and therapeutic efficacy, marked by improvements in site-specific conjugation and payload delivery mechanisms.

Now that we have explored the fundamentals and evolution of ADCs, let's delve into the ADCNet architecture.

## **Overview of the model architecture**

We follow a three-step execution process. First, we process different types of input data. Second, we generate embeddings from these inputs using pretrained models. Finally, we concatenate the embeddings and feed them into a Multilayer Perceptron (MLP) to make predictions.

### Inputs

Let's examine the three different types of inputs we use:

(I) **Protein Sequences:**
- **Antibody Heavy Chain:** The protein sequence of the antibody's heavy chain.
- **Antibody Light Chain:** The protein sequence of the antibody's light chain.
- **Antigen:** The protein sequence of the target antigen.

(II) **Small Molecules (SMILES representations):**
- **Linker:** A SMILES string representing the chemical structure of the linker.
- **Payload:** A SMILES string representing the chemical structure of the cytotoxic payload.

(III) **Numerical Value:**
- **Drug–Antibody Ratio (DAR):** A value indicating the average number of payload molecules attached to each antibody.

Each input is processed individually to extract its unique features.

### Generating Embeddings

ADCNet uses pre-trained language models to transform the inputs into embeddings:

- Protein sequences (antibody heavy chain, antibody light chain, and antigen sequences) are processed using ESM-2 (Evolutionary Scale Modeling) [[4]](#4), a Transformer-based protein language model. ESM-2 converts these sequences into dense embeddings that encode their structural and functional properties.
- SMILES representations of the linker and payload are processed using ChemBERTA, which generates embeddings that capture the chemical properties of these small molecules.

(**Note**: While the original ADCNet paper utilized FGBERT, we are employing ChemBERTA here due to its availability and effectiveness within the DeepChem framework.)

### Prediction

After generating the embeddings:

- The embeddings from the three protein sequences (heavy chain, light chain, and antigen), the two small molecules (linker and payload), and the processed DAR value are combined into a single feature vector. 
- This combined feature vector is then input into a Multilayer Perceptron (MLP), which consists of two fully connected layers with non-linear activation functions. The MLP processes these concatenated features to predict the therapeutic activity of the antibody-drug conjugate.

Below is the architecture diagram of ADCNet, illustrating the complete workflow from input sequences and molecular structures through embedding layers and model components to the final prediction output.

<img src="assets/ADCNet_2.png" alt="image2" height = "800" width="800"> <br> **Fig.2** Diagram illustrating the network architecture of ADCNet model. [[4]](#4)

### Versatility of ADCNet

According to the original [ADCNet paper](https://arxiv.org/pdf/2401.09176), the architecture utilizes the plM ESM-2 to process antibody and antigen sequences. While ESM-2 is highly effective, confining the framework to a single model limits its adaptability to other protein representation techniques that could provide complementary strengths. Researchers can explore alternative models, such as ProtBERT, T5-Protein, or newer protein language models that are tailored to specific tasks. This flexibility enables experimentation with models that can better capture the structural and functional nuances of antibodies and antigens, depending on the context of the antibody-drug conjugate (ADC) design. For instance, some models might excel in predicting binding affinity, while others could be more skilled at managing sequence diversity.

Similarly, the original ADCNet architecture utilizes FG-BERT for encoding linker and payload SMILES strings. Although FG-BERT is effective for small molecule representation, relying solely on it limits the framework's ability to incorporate advancements in chemical modeling. To enhance performance, researchers are encouraged to adopt other small molecule models, such as ChemBERTa, MolBERT, or various graph-based and transformer-based chemical encoders.

In our implementation, we use ESM-2-8M, but researchers are encouraged to experiment with other ESM variants, such as ESM-2 650M and ESM-2-3B, which may offer improved accuracy. Additionally, we have replaced FG-BERT with ChemBERTa in this implementation to leverage its strengths in small molecule representation.

### Setup

Before we continue, let's install DeepChem in our environment and set up the other required dependencies.

In [9]:
# install the necessary libraries
!pip install deepchem numpy torch scikit-learn transformers tqdm



In [34]:
import deepchem as dc
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
from transformers import EsmTokenizer, EsmModel
from transformers import AutoTokenizer, AutoModel

### Data Collection and Pre-Processing

We will be using ADCdb, which was originally utilized by ADCNet and can be accessed at [[5]](#5). This database contains information on 6,572 antibody-drug conjugates (ADCs), including 359 that are approved by the FDA or are currently in clinical trials, 501 in preclinical testing, 819 with in vivo testing data, 1,868 with cell line or target testing data, and 3,025 without such testing.

For our purposes, we will use the preprocessed data for convenience, which is available at [[6]](#6). Let’s examine the dataset. The original ADCdb is a comprehensive collection of data on ADCs, and we will be working with the preprocessed version provided by ADCNet. This version can be found in the assets folder or in the ADCdb GitHub repository.

In [35]:
# load file
file_path = "assets/adcdb.csv"
df = pd.read_csv(file_path)

print(f'We have data of {len(df)} ADCs.')

We have data of 435 ADCs.


In [36]:
df.columns.to_list()

['index',
 'ADC ID',
 'ADC Name',
 'Antibody Name',
 'Antibody Heavy Chain Sequence',
 'Antibody Light Chain Sequence',
 'Antigen Sequence',
 'Payload Isosmiles',
 'Linker Isosmiles',
 'DAR',
 'label（10nm）',
 'label（100nm）',
 'label（1nm）',
 'label（1000nm）',
 'DAR_val']

We can see the dataset contains columns representing ADC names, antibody sequences, antigen sequences, SMILES strings for linker/payload, and labels at multiple concentrations. Now lets have a preview of the dataset we will be using:

In [37]:
df.head()

Unnamed: 0,index,ADC ID,ADC Name,Antibody Name,Antibody Heavy Chain Sequence,Antibody Light Chain Sequence,Antigen Sequence,Payload Isosmiles,Linker Isosmiles,DAR,label（10nm）,label（100nm）,label（1nm）,label（1000nm）,DAR_val
0,0,DRG0ABJAM,Trastuzumab-BCN-HydraSpace-Val-Cit-PABC-Gly-Ca...,Trastuzumab,EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKL...,MELAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDML...,CCN(C(=O)CN)C1COC(OC2C(OC3C#C/C=C\C#CC4(O)CC(=...,CC(C)C(NC(=O)OCCN(CCOC(=O)NC(C(=O)NC(CCCNC(N)=...,1.86,0,0,0,0,1.86
1,1,DRG0ZBATX,Anti-KIT NEG087?SSNPP-DM3,Anti-KIT mAb NEG087,EVQLVESGGGLVQPGGSLRLSCAASGFTFSDYYMAWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKL...,MRGARGAWDFLCVLLLLLRVQTGSSQPSVSPGEPSPPSIHPGKSDL...,C[C@@H]1[C@@H]2C[C@]([C@@H](/C=C/C=C(/CC3=CC(=...,CC(S)CCC(N)=O,3.0-4.0,0,0,0,0,3.5
2,2,DRG0XJKXB,Trastuzumab-C239I-SG3400,Engineered trastuzumab,EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKL...,MELAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDML...,C=C1CC2C=Nc3cc(OCCCOc4cc5c(cc4OC)C(=O)N4CC(=C)...,C[C@@H](C(=O)NC1=CC=C(C=C1)CO)NC(=O)[C@H](C(C)...,1.71,1,1,1,1,1.71
3,3,DRG0ZOYQV,Datopotamab deruxtecan,Datopotamab,QVQLVQSGAEVKKPGASVKVSCKASGYTFTTAGMQWVRQAPGQGLE...,DIQMTQSPSSLSASVGDRVTITCKASQDVSTAVAWYQQKPGKAPKL...,MARGPGLAPPPLRLPLLLLVLAAVTGHTAAQDNCTCPTNKMTVCSP...,CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=C5[C@H](CCC6=C...,C1=CC=C(C=C1)C[C@@H](C(=O)NCC(=O)O)NC(=O)CNC(=...,4,1,1,1,1,4.0
4,4,DRG0COMTY,Telisotuzumab vedotin,Telisotuzumab,QVQLVQSGAEVKKPGASVKVSCKASGYIFTAYTMHWVRQAPGQGLE...,DIVMTQSPDSLAVSLGERATINCKSSESVDSYANSFLHWYQQKPGQ...,MKAPAVLAPGILVLLFTLVQRSNGECKEALAKSEMNVNMKYQLPNF...,CC[C@H](C)[C@@H]([C@@H](CC(=O)N1CCC[C@H]1[C@@H...,CC(C)[C@@H](C(=O)N[C@@H](CCCNC(=O)N)C(=O)NC1=C...,3.1,1,1,1,1,3.1


### Preprocessing Numeric Features

We can see that the dataset includes an important feature: the Drug-Antibody Ratio (DAR), which plays a significant role in determining the efficacy and safety of ADCs, as it represents the average number of drug molecules attached to each antibody. Since DAR is a continuous numerical feature, we will scale it before inputting it into our model. Standardizing DAR to have zero mean and unit variance ensures that it is on a comparable scale with other features, which helps neural networks train more efficiently and converge faster.

In [38]:
# Extract and prepare DAR values
dar_series = pd.to_numeric(df['DAR'], errors='coerce')
dar_values = dar_series.dropna().values.reshape(-1, 1).astype(np.float64)

Now that we've verified the DAR values, we'll proceed to normalize them using DeepChem’s NormalizationTransformer.

In [39]:
from deepchem.trans import NormalizationTransformer

n_samples = len(dar_values)
dar_dataset = dc.data.NumpyDataset(X=dar_values)
normalizer = dc.trans.NormalizationTransformer(transform_X=True, dataset=dar_dataset)

# Transform the dataset (this normalizes the X values to zero mean and unit variance)
normalized_dataset = normalizer.transform(dar_dataset)

# Extract the scaled DAR values
dar_scaled = normalized_dataset.X

In [40]:
print(f"Original DAR mean: {dar_values.mean():.4f}, std: {dar_values.std():.4f}")
print(f"Scaled DAR mean: {dar_scaled.mean():.4f}, std: {dar_scaled.std():.4f}")

Original DAR mean: 3.9809, std: 1.8771
Scaled DAR mean: 0.0000, std: 1.0000


Now that our data is preprocessed and the Drug–Antibody Ratio (DAR) values are standardized, we are ready to generate embeddings using pre-trained models for each input type.

### Generating Embeddings with Pretrained Models

Set up the computation device (GPU if available, otherwise CPU) and import necessary transformer modules.

In [41]:
from transformers import EsmTokenizer, EsmModel, AutoTokenizer, AutoModel
import torch

# Choose device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Before we move further, it's necessary to know about ESM-2. ESM-2 (Evolutionary Scale Modeling) is a protein language model using a transformer-based architecture to process protein sequences. It has been trained on large datasets of protein sequences to learn the relationships between amino acids and the structural and functional properties of proteins.

ESM-2 has demonstrated strong performance across various protein-related prediction tasks, making it a reliable choice for encoding protein sequences in deep learning workflows. To explore ESM-2 and other protein language models developed by Meta’s FAIR (Fundamental AI Research) team, visit the official GitHub repository [here](https://github.com/facebookresearch/esm).

Here we use the smallest ESM-2 model (esm2_t6_8M_UR50D, 6 layers, 8M parameters) for protein sequence embeddings. Larger ESM-2 models are available in the [Hugging Face Model Hub](https://huggingface.co/facebook/esm2_t6_8M_UR50D) for improved accuracy at the cost of increased computational resources.

In [42]:
# Load ESM-2 for protein sequences
esm_model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer_esm = AutoTokenizer.from_pretrained(esm_model_name)
esm_model = AutoModel.from_pretrained(esm_model_name).to(device)

Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


For small molecules like payloads and linkers, we will use ChemBERTa to generate embeddings from the SMILES strings of payloads and linkers. ChemBERTa is a Transformer-based model pre-trained on chemical SMILES, enabling it to capture the structural and chemical properties of small molecules for downstream tasks.

In [43]:
# Load ChemBERTa (for payload & linker)
chemberta_model_name = "seyonec/ChemBERTa-zinc-base-v1"
chemberta_tokenizer = AutoTokenizer.from_pretrained(chemberta_model_name)
chemberta_model = AutoModel.from_pretrained(chemberta_model_name).to(device)

We have now initialized two Transformer-based models: ESM-2 for protein sequences and ChemBERTa for SMILES.

Next, we will generate embeddings for the antibody heavy chain, light chain, and antigen protein sequences in our dataset using ESM-2. These embeddings capture the structural and functional properties of each protein sequence, enabling the model to learn meaningful biological representations for downstream prediction tasks.

In [46]:
# Extract sequences and SMILES from the dataframe

heavy_chains = df['Antibody Heavy Chain Sequence'].astype(str).tolist()
light_chains = df['Antibody Light Chain Sequence'].astype(str).tolist()
antigens = df['Antigen Sequence'].astype(str).tolist()
linkers = df['Linker Isosmiles'].tolist()
payloads = df['Payload Isosmiles'].tolist()

Let's first generate embeddings from protein sequences using ESM-2

In [47]:
MAX_SEQ_LENGTH = 1000

# Function to get embeddings from protein sequences
def get_embeddings(sequences):
    embeddings = []

    for seq in tqdm(sequences, desc="Generating embeddings"):
        inputs = tokenizer_esm(
            seq,
            return_tensors='pt',
            truncation=True,
            padding='max_length',
            max_length=MAX_SEQ_LENGTH,
            is_split_into_words=False
        )

        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = esm_model(**inputs)
        
        # Extract CLS token
        cls_emb = outputs.last_hidden_state[:, 0, :].squeeze().cpu()
        embeddings.append(cls_emb)

    return embeddings

Now we can use the above function to generate embeddings for our protein sequences (heavy chains, light chains, and antigens) by passing the corresponding sequence lists as input.

In [49]:
# Generate embeddings from protein sequences
print("Generating embeddings for heavy chains, light chains, and antigens...")

heavy_embeddings = get_embeddings(heavy_chains)
light_embeddings = get_embeddings(light_chains)
antigen_embeddings = get_embeddings(antigens)

Generating embeddings for heavy chains, light chains, and antigens...


Generating embeddings: 100%|██████████| 435/435 [01:26<00:00,  5.01it/s]
Generating embeddings: 100%|██████████| 435/435 [01:28<00:00,  4.91it/s]
Generating embeddings: 100%|██████████| 435/435 [01:28<00:00,  4.94it/s]


Users can uncomment the code below to save and load embeddings, saving time on restart.

In [None]:
# Save and load tensors separately

# torch.save(heavy_embeddings, 'heavy_embeddings.pt')
# torch.save(light_embeddings, 'light_embeddings.pt')
# torch.save(antigen_embeddings, 'antigen_embeddings.pt')
# heavy_embeddings = torch.load('heavy_embeddings.pt')
# light_embeddings = torch.load('light_embeddings.pt')
# antigen_embeddings = torch.load('antigen_embeddings.pt')

  heavy_embeddings = torch.load('heavy_embeddings.pt')
  light_embeddings = torch.load('light_embeddings.pt')
  antigen_embeddings = torch.load('antigen_embeddings.pt')


Now let's generate embeddings for the payload and linker SMILES strings using ChemBERTa.

In [54]:
# Function to get embedding for a single SMILES string
def get_smiles_embedding(smiles: str):
    inputs = chemberta_tokenizer(smiles, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = chemberta_model(**inputs)
        # Use the CLS token representation (first token)
        embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, hidden_size)
        return embedding.squeeze().cpu().numpy()

The above function now can be used to generate a ChemBERTa embedding for a single SMILES string, which can be used for both linker and payload molecules. <br>Lets see taking an example:

In [55]:
example = payloads[0]
print(example)

CCN(C(=O)CN)C1COC(OC2C(OC3C#C/C=C\C#CC4(O)CC(=O)C(NC(=O)OC)=C3/C4=C\CSSC(C)(C)CC(=O)NCCOCCOC)OC(C)C(NOC3CC(O)C(SC(=O)c4c(C)c(I)c(OC5OC(C)C(O)C(OC)C5O)c(OC)c4OC)C(C)O3)C2O)CC1OC


In [56]:
example_embedding = get_smiles_embedding(example)
print(f"Embedding shape for example payload: {example_embedding.shape}")

Embedding shape for example payload: (768,)


Now that we understand how to generate embeddings from SMILES strings, let's create embeddings for all linker and payload molecules in the dataset.

In [57]:
# Function to generate embeddings for a list of SMILES

def generate_embeddings(smiles_list):
    """Generate embeddings for a list of SMILES strings."""
    embeddings = []
    for smi in tqdm(smiles_list):
        try:
            emb = get_smiles_embedding(smi)
            embeddings.append(emb)
        except Exception as e:
            print(f"Failed for {smi}: {e}")
            embeddings.append(None)
    return embeddings

As we have defined the function above, we can now pass our Payload SMILES string to generate embeddings. This process transforms each payload molecule into a numerical vector representation that captures its chemical properties.

In [58]:
payload_embeddings = generate_embeddings(payloads) # embeddings from payload smiles

100%|██████████| 435/435 [00:10<00:00, 43.30it/s]


Similarly, we can get embeddings from linker smiles:

In [59]:
linker_embeddings = generate_embeddings(linkers) # embeddings from linker smiles

100%|██████████| 435/435 [00:08<00:00, 54.24it/s]


Now that we’ve generated embeddings from all protein sequences, as well as the payload and linker SMILES, we can concatenate them, along with the standardized DAR value to form a complete feature vector for each ADC.

In [62]:
import torch

adc_embeddings = []

for i in range(len(heavy_embeddings)):
    heavy = heavy_embeddings[i]
    light = light_embeddings[i]
    antigen = antigen_embeddings[i]
    payload = payload_embeddings[i]
    linker = linker_embeddings[i]
    dar = dar_scaled[i]  # Use 2D array directly

    # Convert to tensors and ensure 1D
    def to_tensor_1d(emb):
        if not isinstance(emb, torch.Tensor):
            emb = torch.tensor(emb, dtype=torch.float32)
        return emb.flatten()

    heavy = to_tensor_1d(heavy)
    light = to_tensor_1d(light)
    antigen = to_tensor_1d(antigen)
    payload = to_tensor_1d(payload)
    linker = to_tensor_1d(linker)
    dar = to_tensor_1d(dar)

    # Concatenate all embeddings
    full_emb = torch.cat([heavy, light, antigen, payload, linker, dar])
    adc_embeddings.append(full_emb)

print(f"Created {len(adc_embeddings)} ADC embeddings")
print(f"Each embedding shape: {adc_embeddings[0].shape}")

IndexError: index 294 is out of bounds for axis 0 with size 294

Now, lets check the embedding shape of each input we have generated. 

In [24]:
print("Payload shape:", payload_embeddings[0].shape)
print("Linker shape:", linker_embeddings[0].shape)
print("Heavy shape:", heavy_embeddings[0].shape)
print("Light shape:", light_embeddings[0].shape)
print("Antigen shape:", antigen_embeddings[0].shape)
print("DAR shape:", dar_scaled[0].shape)

Payload shape: (768,)
Linker shape: (768,)
Heavy shape: torch.Size([320])
Light shape: torch.Size([320])
Antigen shape: torch.Size([320])
DAR shape: (1,)


Lets see the shape of a single concatenated ADC embedding i.e., feature vector for one ADC.

In [25]:
full_emb.shape

torch.Size([2497])

Now, we can check the shape of the full batch tensor (i.e., all ADCs stacked), where the first dimension is the number of ADCs and the second is the embedding size.

In [26]:
adc_batch_tensor = torch.stack(adc_embeddings)
adc_batch_tensor.shape

torch.Size([435, 2497])

#### Defining MLP (Multi-Layer Perceptron)

Before we move into model training, it’s important to understand the architecture of the MLP (Multi-Layer Perceptron). MLP is a type of feedforward neural network consisting of an input layer, one or more hidden layers, and an output layer. Each neuron in a layer is fully connected to every neuron in the next, allowing the model to learn complex, non-linear patterns from the data.  Other inputs include Dropout and Activation function. Dropout randomly disables a fraction of neurons during each training iteration, which forces the network to not rely too heavily on any one neuron and helps in learning more robust features, and activation function is used to specify the activation function used in the hidden layers of the model. <br>

In our setup, the input layer has a dimension of 2497, which corresponds to the size of the combined embeddings. This is followed by 2 hidden layers that help the model extract deeper hierarchical features, and finally, an output layer that produces predictions. Other inputs includes a dropout rate of 0.2 and Relu activation function.

In [27]:
import torch
import torch.nn as nn
from deepchem.models.torch_models.layers import MultilayerPerceptron

# Define the model
ADCNet = MultilayerPerceptron(
    d_input= 2497,
    d_output=1,
    d_hidden=(1024, 256),
    dropout=0.2,
    activation_fn='relu'
)

# Forward pass
op = ADCNet(adc_batch_tensor)
print(op.shape)  # [435, 1]

No normalization for SPS. Feature removed!
No normalization for AvgIpc. Feature removed!
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/opt/miniconda3/envs/adcnet/lib/python3.8/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'lightning'
Skipped loading some Jax models, missing a dependency. No module named 'jax'


torch.Size([435, 1])


We will be using the 100 nm label from the dataset as our target variable. Although the dataset also includes labels at 10 nm and 1000 nm, we choose the 100 nm label.

In [28]:
# Now extract the label
label_col = "label（100nm）"
labels = df[label_col].values  # 0 or 1

# Convert to PyTorch tensor
y = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)

#### creating deepchem's numpy dataset

In [29]:
from deepchem.data.datasets import NumpyDataset
import numpy as np

X = adc_batch_tensor.numpy()  # Convert to numpy array
y = labels
y = y.astype(np.float32)  # Ensure labels are float32

dataset = NumpyDataset(X= X, y= y, ids= np.arange(len(y)))

print(f"Dataset created with {len(dataset)} samples.")
print(f"Features shape: {dataset.X.shape}, Labels shape: {dataset.y.shape}")
print(f"First sample features: {dataset.X[0]}, First sample label: {dataset.y[0]}")

Dataset created with 435 samples.
Features shape: (435, 2497), Labels shape: (435,)
First sample features: [-0.20455705  0.8342899   0.15026727 ... -0.15449478  0.31129766
 -1.2800006 ], First sample label: 0.0


Let's print the type of our dataset now, to verify it has been converted to a numpy dataaset or not.

In [None]:
print(type(dataset))
print(dataset)

Lets split our dataset into train test split using deepchem's splitters. Here we use random splitters for the purpose. you can also check out other splitters in deepchem splitters, according to task.

In [31]:
import deepchem as dc

# Creating a RandomSplitter object
splitter = dc.splits.RandomSplitter()

# Splitting dataset into train, validation, and test datasets
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(
    dataset, 
    frac_train=0.8,    # 80% for training
    frac_valid=0.1,   # 10% for validation
    frac_test=0.1     # 10% for testing
)

print(f"Train: {len(train_dataset)}")
print(f"Valid: {len(valid_dataset)}")
print(f"Test: {len(test_dataset)}")

Train: 348
Valid: 43
Test: 44


Let's define the model, loss function, and optimizer that we'll be using for training. Wrapping our torch model.

In [33]:
from deepchem.models.torch_models.torch_model import TorchModel
from deepchem.models.torch_models.layers import MultilayerPerceptron
from deepchem.models.losses import L2Loss

class ADCNetModel(TorchModel):
    def __init__(self,input_dim=2497, output_dim=1, hidden_dims=(1024, 256), dropout=0.2, activation_fn='relu', **kwargs):
        
        model = MultilayerPerceptron(
            d_input=input_dim,
            d_output=output_dim,
            d_hidden=hidden_dims,
            dropout=dropout,
            activation_fn=activation_fn
        )
        super(ADCNetModel, self).__init__(model = model, loss = L2Loss(), **kwargs)

In [34]:
model = ADCNetModel()

Training begins here.we are defining our loss and etc..

In [36]:
ADCNetModel = torch.nn.Sequential(
torch.nn.Linear(2497, 1000),
torch.nn.Tanh(),torch.nn.Linear(1000, 1))
model = dc.models.TorchModel(ADCNetModel, loss=dc.models.losses.L2Loss())
loss = model.fit(dataset, nb_epoch=100)

Lets print our loss here.

In [37]:
print(f"Training loss: {loss}")

Training loss: 0.07892356395721435


We can utilise deepchem's evaluation metrics for getting insights into the training efficacy. Users can also take a look at other evaluation metrics also in deepchem.

In [38]:
import deepchem as dc
import numpy as np

classification_metrics = [
    dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean),  
    dc.metrics.Metric(dc.metrics.accuracy_score, np.mean),  
]

scores = model.evaluate(test_dataset, classification_metrics)
for metric_name, score in scores.items():
    print(f"{metric_name}: {score}")

mean-roc_auc_score: 0.9821428571428572
mean-accuracy_score: 0.8181818181818182


### Training

We'll train the model on the training set and evaluate its performance on the validation set across multiple epochs.

As shown above, we can see the training and validation loss over epochs, notice that we have a high training loss at early stage of training the model. But the training loss and validation loss are decreasing over time, which implies the model is generalizing well to unseen data.

Let's check out the first 5 predictions from the MLP model on the validation set.

So, now lets check, our predicted values with our actual labels, to get insights how well the model is performing, and gain insights into the prediction quality.

In [39]:
import deepchem as dc
import numpy as np

# Get predictions from the model
predictions = model.predict(test_dataset)
# or for validation set
val_predictions = model.predict(valid_dataset)

# Get ground truth labels
ground_truth = test_dataset.y
# or for validation set
val_ground_truth = valid_dataset.y

# Print them
print("Predictions:", predictions.squeeze().tolist())
print("Ground Truth:", ground_truth.squeeze().tolist())

Predictions: [0.664665162563324, 0.7925335764884949, 1.0469176769256592, 1.1105679273605347, 0.45603233575820923, 0.21755799651145935, 0.39673522114753723, 1.1274542808532715, 1.0640629529953003, 0.225880429148674, 0.1677362620830536, 1.1377029418945312, 0.9007377028465271, 0.7280964255332947, 1.062859058380127, 0.1573842167854309, 1.0353881120681763, 0.7274372577667236, 1.245859980583191, 1.0791592597961426, 1.045544147491455, 1.0925335884094238, 0.9688809514045715, 0.9965769052505493, 0.7093379497528076, 0.9369171857833862, 1.0064976215362549, 0.5581703782081604, 0.44567859172821045, 1.0820839405059814, 0.1869097650051117, 0.08389458060264587, 1.0668302774429321, 0.8209755420684814, 0.8475282788276672, 0.7259398698806763, 0.538149356842041, 1.1035650968551636, 1.0860759019851685, 0.22586780786514282, 1.0404599905014038, 1.0582209825515747, 0.8781791925430298, 1.0122710466384888]
Ground Truth: [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1

Finally, let's plot the loss curve to get better insights on our training and validation loss over epochs. So, visualizing the training and validation loss curves is necessary and helps us understand how well the model is learning over time. A steadily decreasing training loss indicates that the model is fitting the data, while the validation loss provides insight into how well the model generalizes to unseen data. If the validation loss starts increasing while the training loss continues to decrease, it may indicate overfitting. By examining these curves, we can diagnose issues such as underfitting, overfitting, or the need for further hyperparameter tuning.

It's time to check the accuracy of our model. So, accuracy is measured as the proportion of correct predictions (where the predicted label matches the true label) out of the total number of samples in the validation set.

In [40]:
import numpy as np
import deepchem as dc

# Get predictions and ground truth
val_outputs = model.predict(valid_dataset)  # Raw predictions
y_val = valid_dataset.y  # Ground truth

# Convert to binary predictions (equivalent to torch.round)
predicted_labels = np.round(val_outputs.squeeze()).astype(int)  # Round to 0 or 1
true_labels = y_val.squeeze().astype(int)

# Calculate accuracy
correct = (predicted_labels == true_labels).sum()
total = len(true_labels)

accuracy = correct / total
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 76.74%


## References <a name="references"></a>

<a name="1"></a> [1] Chen, L., Li, B., Chen, Y., Lin, M., Zhang, S., Li, C., Pang, Y., & Wang, L. (2024). ADCNet: A unified framework for predicting the activity of antibody‑drug conjugates. https://arxiv.org/pdf/2401.09176

<a name="2"></a> [2] DeepChem Team. (n.d.). Introduction to Antibody-Drug Conjugates.

<a name="3"></a> [3] Beck, A., Goetsch, L., Dumontet, C. et al. Strategies and challenges for the next generation of antibody–drug conjugates. Nat Rev Drug Discov 16, 315–337 (2017). https://doi.org/10.1038/nrd.2016.268

<a name="4"></a> [3] Facebook AI Research. (2020). ESM: Evolutionary Scale Modeling [GitHub repository]. https://github.com/facebookresearch/esm

<a name="5"></a> [4] ADCNet githubidrugLab.(2024). ADCNet: a unified framework for predicting the activity of antibody‑drug conjugates. GitHub repository: https://github.com/idrugLab/ADCNet

<a name="6"></a> [5] Shen, L. T., Sun, X. N., Chen, Z., Guo, Y., Shen, Z. Y., Song, Y., Xin, W. X., Ding, H. Y., Ma, X. Y., Xu, W. B., Zhou, W. Y., Che, J. X., Tan, L. L., Chen, L. S., Chen, S. Q., Dong, X. W., Fang, L., & Zhu, F. (2024).
ADCdb: the database of antibody‑drug conjugates. Nucleic Acids Research, 52(D1), D1097–D1109. PMID 37831118.
Website: https://adcdb.idrblab.net/

<a name="7"></a> [6] ADCNet githubidrugLab.(2024). ADCNet: a unified framework for predicting the activity of antibody‑drug conjugates. GitHub repository: https://github.com/idrugLab/ADCNet

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Discord
The DeepChem [Discord](https://discord.gg/5d5bEVSt) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

# Citing this tutorial
If you found this tutorial useful please consider citing it using the provided BibTeX.

```
@manual{Molecular Machine Learning,
 title={Introduction to ADCNet: Predicting ADC Activity with DeepChem},
 organization={DeepChem},
 author={Patra, Sonali Lipsa, and Singh, Rakshit Kr. and Bisoi, Ankita and Ramsundar, Bharath}
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/ADCNet.ipynb}},
 year={2025},
}
```