## **Introduction to ADCNet: Predicting ADC Activity with DeepChem**

Advancements in molecular biology have significantly transformed the landscape of drug discovery, shifting the focus towards identifying more selective and promising clinical candidates. While traditional small-molecule inhibitors have achieved notable success, they are often limited by off-target toxicity, narrow therapeutic windows, and the emergence of drug resistance. The development of monoclonal antibodies (mAbs) offered a partial solution by exploiting specific antigen expression to enhance targeting accuracy. However, their effectiveness remains limited, particularly in treating heterogeneous populations of cancer cells.[[1]](#1) <br>

To overcome these challenges, Antibody-Drug Conjugates (ADCs) have emerged as a promising therapeutic strategy. ADCs are a class of targeted biopharmaceuticals designed to treat cancer. They combine the specificity of monoclonal antibodies with the potent cytotoxicity of small-molecule drugs, joined through specialized chemical linkers. This enables selective delivery of toxic agents to cancer cells while minimizing harm to healthy tissue, improving outcomes and reducing side effects. However, designing effective ADCs remains complex: the choice of antibody, payload, and linker each affects the safety, efficacy, and overall therapeutic success of an ADC.

To navigate this complexity, ADCNet has been developed as a comprehensive deep learning framework that accurately predicts the activity of antibody-drug conjugates. It combines two advanced representation models: ESM-2, which identifies important features from the protein sequences of antibodies and antigens, and FG-BERT, which processes the SMILES strings of linkers and payloads. Moreover, the framework includes the Drug-Antibody Ratio (DAR), an essential numerical parameter that influences the therapeutic index of ADCs. By learning and integrating patterns from these various molecular components, ADCNet aids in the rational design of safer and more effective ADC candidates.

In this tutorial, we will explore how to predict the therapeutic activity of Antibody-Drug Conjugates (ADCs) using ADCNet, a unified deep learning framework implemented in DeepChem. Before proceeding with this guide, it is recommended to build a foundational understanding of ADCs. Refer to the "Introduction to Antibody-Drug Conjugates" [[2]](#2) notebook available in the DeepChem tutorials.

## **Colab**
This tutorial and the rest in this sequence can be run in Google Colab. To open this notebook in Colab, use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Introduction_to_ADCNet.ipynb)

## **Evolution of ADCs**

As the field continues to advance, the design of ADCs has been progressively refined over several generations. First-generation ADCs, such as *gemtuzumab ozogamicin*, were built using humanized antibodies, a single type of cytotoxic drug, and acid-sensitive linkers. These components were originally derived from murine antibodies and conventional chemotherapy agents. The conjugation process was typically random, targeting lysine or cysteine residues, which resulted in mixtures with uneven Drug-Antibody Ratios (DAR). This lack of uniformity made it difficult to establish a consistent therapeutic index and often led to off-target toxicity and narrow treatment windows. As a result, first-generation ADCs were prone to unmanageable side effects and limited clinical effectiveness.<br>

In contrast, second-generation ADCs, such as *brentuximab vedotin* and *trastuzumab emtansine*, introduced more powerful cytotoxic agents like tubulin inhibitors, along with more stable linkers. These improvements significantly enhanced both the treatment’s efficacy and molecular stability.<br>

Finally, third-generation ADCs, represented by drugs like *polatuzumab vedotin* and *enfortumab vedotin*, introduced site-specific conjugation techniques and hydrophilic linkers. These innovations allowed for precise control over DAR, thereby enhancing both safety and therapeutic efficacy.

Now that we’ve explored the fundamentals and evolution of ADCs, let’s dive into the ADCNet architecture.

## **Overview of the model architecture**

We have a three-step execution process: first, we process different types of input data; second, we generate embeddings from these inputs using pretrained models; and lastly, we concatenate the embeddings and feed them into a Multilayer Perceptron (MLP) to predict the outcome.

### Inputs

Let's explore the three different forms of inputs we use:

(I) **Protein Sequences:**
- **Antibody Heavy Chain:** The protein sequence of the antibody's heavy chain.
- **Antibody Light Chain:** The protein sequence of the antibody's light chain.
- **Antigen:** The protein sequence of the target antigen.

(II) **Small Molecules (SMILES representations):**
- **Linker:** A SMILES string representing the chemical structure of the linker.
- **Payload:** A SMILES string representing the chemical structure of the cytotoxic payload.

(III) **Numerical Value:**
- **Drug–Antibody Ratio (DAR):** A value indicating the average number of payload molecules attached to each antibody.

Each input is processed separately to extract its unique features.
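
To make the six inputs concrete, one ADC could be held in a small record like the one below. This is only a sketch: the `ADCRecord` class and its field names are hypothetical (they are not part of DeepChem or ADCNet), and the sequences and SMILES strings are shortened placeholders rather than complete entries from the dataset.

```python
from dataclasses import dataclass, fields


@dataclass
class ADCRecord:
    """Hypothetical container for the inputs of one ADC (illustration only)."""
    heavy_chain: str     # antibody heavy-chain protein sequence
    light_chain: str     # antibody light-chain protein sequence
    antigen: str         # target antigen protein sequence
    linker_smiles: str   # SMILES string of the linker
    payload_smiles: str  # SMILES string of the cytotoxic payload
    dar: float           # drug-antibody ratio

    def protein_inputs(self):
        """Sequences that go to the protein language model."""
        return [self.heavy_chain, self.light_chain, self.antigen]

    def smiles_inputs(self):
        """Strings that go to the small-molecule encoder."""
        return [self.linker_smiles, self.payload_smiles]


# Placeholder values, not a real ADC
example = ADCRecord(
    heavy_chain="EVQLVESGGGLVQ",
    light_chain="DIQMTQSPSSLSA",
    antigen="MELAALCRWGLLL",
    linker_smiles="CC(S)CCC(N)=O",
    payload_smiles="CCO",
    dar=3.5,
)
print(len(example.protein_inputs()), len(example.smiles_inputs()), example.dar)
```

Grouping the fields this way mirrors the three input types: three protein sequences, two SMILES strings, and one number.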

### Generating Embeddings

ADCNet uses pre-trained language models to transform the inputs into embeddings:

- Protein sequences (antibody heavy chain, antibody light chain, and antigen sequences) are processed using ESM-2 (Evolutionary Scale Modeling) [[3]](#3), a Transformer-based protein language model. ESM-2 converts these sequences into dense embeddings that encode their structural and functional properties.
- SMILES representations of the linker and payload are processed using ChemBERTa, which generates embeddings that capture the chemical properties of these small molecules.

(**Note**: While the original ADCNet paper utilized FG-BERT, we employ ChemBERTa here due to its availability and effectiveness within the DeepChem framework.)

### Prediction

After generating the embeddings:

- The embeddings from the three protein sequences (heavy chain, light chain, and antigen), the two small molecules (linker and payload), and the processed DAR value are concatenated into a single feature vector.
- This combined feature vector is then fed into a Multilayer Perceptron (MLP) comprising two fully connected layers with non-linear activation functions. The MLP analyzes the concatenated features to predict the antibody-drug conjugate's therapeutic activity.

Below is the architecture diagram of ADCNet, illustrating the complete workflow from input sequences and molecular structures through embedding layers and model components to the final prediction output.

<img src="assets/ADCNet_2.png" alt="image2" height = "800" width="800"> <br> **Fig.1** Diagram illustrating the network architecture of ADCNet model. [[4]](#4)

### Versatility of ADCNet

According to the original [ADCNet paper](https://arxiv.org/pdf/2401.09176), the architecture utilizes the protein language model ESM-2 to process antibody and antigen sequences. While ESM-2 is highly effective, restricting the framework to a single model limits its adaptability to other protein representation techniques that could offer complementary strengths. Researchers can explore alternative models, such as ProtBERT, T5-Protein, or newer protein language models that are tailored to specific tasks. This flexibility allows for experimentation with models that better capture the structural and functional nuances of antibodies and antigens, depending on the context of the ADC design. For example, some models may excel at predicting binding affinity, while others may be more adept at handling sequence diversity.

Similarly, the original ADCNet architecture utilizes FG-BERT for encoding linker and payload SMILES strings. Although FG-BERT is effective for small molecule representation, relying solely on it restricts the framework's ability to incorporate advancements in chemical modeling. To enhance performance, researchers can adopt other small molecule models, such as ChemBERTa, MolBERT, or various graph-based or transformer-based chemical encoders.

In our implementation, we use the 8M-parameter ESM-2 model, but researchers are encouraged to experiment with larger ESM-2 variants, such as the 650M- and 3B-parameter models, which may offer improved accuracy. Additionally, we replace FG-BERT with ChemBERTa in this implementation to take advantage of its strengths in small-molecule representation.
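
Swapping encoders changes the width of the concatenated feature vector, so the MLP input size has to change with it. The sketch below works out that width for several ESM-2 checkpoints. The helper `combined_feature_dim` is a hypothetical name introduced here, and the ChemBERTa width of 768 is taken from the embedding shape this tutorial prints for `seyonec/ChemBERTa-zinc-base-v1`; other small-molecule encoders would need their own value.

```python
# Hidden sizes (embedding width per sequence) of public ESM-2 checkpoints
ESM2_HIDDEN_SIZE = {
    "esm2_t6_8M_UR50D": 320,
    "esm2_t12_35M_UR50D": 480,
    "esm2_t30_150M_UR50D": 640,
    "esm2_t33_650M_UR50D": 1280,
    "esm2_t36_3B_UR50D": 2560,
}

CHEMBERTA_DIM = 768  # width reported for ChemBERTa embeddings in this tutorial


def combined_feature_dim(protein_dim, smiles_dim,
                         n_proteins=3, n_molecules=2, n_numeric=1):
    """Width of the vector: 3 proteins + 2 molecules + 1 DAR value."""
    return n_proteins * protein_dim + n_molecules * smiles_dim + n_numeric


for name, dim in ESM2_HIDDEN_SIZE.items():
    print(f"{name}: MLP input width = {combined_feature_dim(dim, CHEMBERTA_DIM)}")
```

With the 8M checkpoint this gives 3 × 320 + 2 × 768 + 1 = 2497, the input dimension used for the MLP in this tutorial.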

### Setup

Before we proceed, let's install DeepChem into our environment and set up other required dependencies.

In [2]:
# install the necessary libraries
!pip install deepchem numpy torch scikit-learn transformers tqdm



In [3]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
from transformers import EsmTokenizer, EsmModel
from transformers import AutoTokenizer, AutoModel
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split



### Data Collection and Pre-Processing

We will be using ADCdb, which was originally utilized by ADCNet and can be accessed at [[5]](#5). This database contains information on 6,572 antibody-drug conjugates (ADCs), including 359 that are approved by the FDA or are currently in clinical trials, 501 in preclinical testing, 819 with in vivo testing data, 1,868 with cell line or target testing data, and 3,025 without such testing.

For convenience, we will work with the preprocessed version of the data provided by ADCNet [[6]](#6) rather than the full ADCdb. This version can be found in the assets folder or in the ADCNet GitHub repository. Let’s take a look at the dataset.

In [4]:
# load file
file_path = "assets/adcdb.csv"
df = pd.read_csv(file_path)

print(f'We have data of {len(df)} ADCs.')

We have data of 435 ADCs.


In [5]:
df.columns.to_list()

['index',
 'ADC ID',
 'ADC Name',
 'Antibody Name',
 'Antibody Heavy Chain Sequence',
 'Antibody Light Chain Sequence',
 'Antigen Sequence',
 'Payload Isosmiles',
 'Linker Isosmiles',
 'DAR',
 'label（10nm）',
 'label（100nm）',
 'label（1nm）',
 'label（1000nm）',
 'DAR_val']

We can see the dataset contains columns for ADC names, antibody and antigen sequences, SMILES strings for the linker and payload, the DAR, and labels at multiple concentrations. Now let's preview the dataset we will be using:

In [6]:
df.head()

Unnamed: 0,index,ADC ID,ADC Name,Antibody Name,Antibody Heavy Chain Sequence,Antibody Light Chain Sequence,Antigen Sequence,Payload Isosmiles,Linker Isosmiles,DAR,label（10nm）,label（100nm）,label（1nm）,label（1000nm）,DAR_val
0,0,DRG0ABJAM,Trastuzumab-BCN-HydraSpace-Val-Cit-PABC-Gly-Ca...,Trastuzumab,EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKL...,MELAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDML...,CCN(C(=O)CN)C1COC(OC2C(OC3C#C/C=C\C#CC4(O)CC(=...,CC(C)C(NC(=O)OCCN(CCOC(=O)NC(C(=O)NC(CCCNC(N)=...,1.86,0,0,0,0,1.86
1,1,DRG0ZBATX,Anti-KIT NEG087?SSNPP-DM3,Anti-KIT mAb NEG087,EVQLVESGGGLVQPGGSLRLSCAASGFTFSDYYMAWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKL...,MRGARGAWDFLCVLLLLLRVQTGSSQPSVSPGEPSPPSIHPGKSDL...,C[C@@H]1[C@@H]2C[C@]([C@@H](/C=C/C=C(/CC3=CC(=...,CC(S)CCC(N)=O,3.0-4.0,0,0,0,0,3.5
2,2,DRG0XJKXB,Trastuzumab-C239I-SG3400,Engineered trastuzumab,EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKL...,MELAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDML...,C=C1CC2C=Nc3cc(OCCCOc4cc5c(cc4OC)C(=O)N4CC(=C)...,C[C@@H](C(=O)NC1=CC=C(C=C1)CO)NC(=O)[C@H](C(C)...,1.71,1,1,1,1,1.71
3,3,DRG0ZOYQV,Datopotamab deruxtecan,Datopotamab,QVQLVQSGAEVKKPGASVKVSCKASGYTFTTAGMQWVRQAPGQGLE...,DIQMTQSPSSLSASVGDRVTITCKASQDVSTAVAWYQQKPGKAPKL...,MARGPGLAPPPLRLPLLLLVLAAVTGHTAAQDNCTCPTNKMTVCSP...,CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=C5[C@H](CCC6=C...,C1=CC=C(C=C1)C[C@@H](C(=O)NCC(=O)O)NC(=O)CNC(=...,4,1,1,1,1,4.0
4,4,DRG0COMTY,Telisotuzumab vedotin,Telisotuzumab,QVQLVQSGAEVKKPGASVKVSCKASGYIFTAYTMHWVRQAPGQGLE...,DIVMTQSPDSLAVSLGERATINCKSSESVDSYANSFLHWYQQKPGQ...,MKAPAVLAPGILVLLFTLVQRSNGECKEALAKSEMNVNMKYQLPNF...,CC[C@H](C)[C@@H]([C@@H](CC(=O)N1CCC[C@H]1[C@@H...,CC(C)[C@@H](C(=O)N[C@@H](CCCNC(=O)N)C(=O)NC1=C...,3.1,1,1,1,1,3.1


### Preprocessing Numeric Features

We can see that the dataset includes an important feature: the Drug-Antibody Ratio (DAR), which plays a significant role in determining the efficacy and safety of ADCs, as it represents the average number of drug molecules attached to each antibody. Since DAR is a continuous numerical feature, we will scale it before inputting it into our model. Standardizing DAR to have zero mean and unit variance ensures that it is on a comparable scale with other features, which helps neural networks train more efficiently and converge faster.

In [7]:
from sklearn.preprocessing import StandardScaler

# Extract and scale the Drug-Antibody Ratio (DAR)

dar_values = df['DAR_val'].values.reshape(-1, 1)
scaler = StandardScaler()
dar_scaled = scaler.fit_transform(dar_values)

Let's preview the first five scaled DAR values.

In [8]:
dar_scaled[:5]

array([[-1.28000061],
       [-0.2348211 ],
       [-1.3755963 ],
       [ 0.08383119],
       [-0.48974293]])
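
To show what the scaler does, the sketch below standardizes the five DAR values from the preview rows by hand. Two assumptions are worth flagging: the `parse_dar` helper is hypothetical, and its midpoint rule for ranges is inferred only from the preview row where a `DAR` of `3.0-4.0` appears next to a `DAR_val` of 3.5. The z-scores it prints also differ from the output above, because here the mean and standard deviation come from five values instead of the full dataset.

```python
import statistics


def parse_dar(raw):
    """Turn a DAR entry into a float; a range like '3.0-4.0' becomes its midpoint.

    The midpoint rule is an assumption inferred from the dataset preview.
    """
    text = str(raw).strip()
    if "-" in text:
        low, high = (float(part) for part in text.split("-", 1))
        return (low + high) / 2
    return float(text)


def standardize(values):
    """z = (x - mean) / std, using the population std (ddof=0) like StandardScaler."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize constant values")
    return [(v - mean) / std for v in values]


raw_dar = ["1.86", "3.0-4.0", "1.71", "4", "3.1"]  # DAR column of the preview rows
dar_vals = [parse_dar(r) for r in raw_dar]
z_scores = standardize(dar_vals)

print(dar_vals)
print([round(z, 3) for z in z_scores])
```

After standardization the values have zero mean and unit variance, which is the property the paragraph above relies on.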

Now that our data is preprocessed and the Drug–Antibody Ratio (DAR) values are standardized, we are ready to generate embeddings using pre-trained models for each input type.

### Generating Embeddings with Pretrained Models

Set up the computation device (GPU if available, otherwise CPU) and import necessary transformer modules.

In [9]:
from transformers import EsmTokenizer, EsmModel, AutoTokenizer, AutoModel
import torch

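# Choose device: use a GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')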

Before we move further, it's necessary to know about ESM-2. ESM-2 (Evolutionary Scale Modeling) is a protein language model using a transformer-based architecture to process protein sequences. It has been trained on large datasets of protein sequences to learn the relationships between amino acids and the structural and functional properties of proteins.

ESM-2 has demonstrated strong performance across various protein-related prediction tasks, making it a reliable choice for encoding protein sequences in deep learning workflows. To explore ESM-2 and other protein language models developed by Meta’s FAIR (Fundamental AI Research) team, visit the official GitHub repository [here](https://github.com/facebookresearch/esm).

Here we use the smallest ESM-2 model (esm2_t6_8M_UR50D, 6 layers, 8M parameters) for protein sequence embeddings. Larger ESM-2 models are available in the [Hugging Face Model Hub](https://huggingface.co/facebook/esm2_t6_8M_UR50D) for improved accuracy at the cost of increased computational resources.

In [10]:
# Load ESM-2 for protein sequences
esm_model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer_esm = AutoTokenizer.from_pretrained(esm_model_name)
esm_model = AutoModel.from_pretrained(esm_model_name).to(device)

Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


For the small molecules (payloads and linkers), we will use ChemBERTa to generate embeddings from their SMILES strings. ChemBERTa is a Transformer-based model pre-trained on chemical SMILES, enabling it to capture the structural and chemical properties of small molecules for downstream tasks.

In [11]:
# Load ChemBERTa (for payload & linker)
chemberta_model_name = "seyonec/ChemBERTa-zinc-base-v1"
chemberta_tokenizer = AutoTokenizer.from_pretrained(chemberta_model_name)
chemberta_model = AutoModel.from_pretrained(chemberta_model_name).to(device)

Now we have initialized two Transformer-based models: ESM-2 for protein sequences, and ChemBERTa for SMILES. Loading them onto the computation device allows fast embedding extraction.

We will now generate embeddings for the antibody heavy chain, light chain, and antigen protein sequences in our dataset using ESM-2. These embeddings capture the structural and functional properties of each protein sequence, enabling the model to learn meaningful biological representations for downstream prediction tasks.

In [12]:
# Extract sequences and SMILES from the dataframe

heavy_chains = df['Antibody Heavy Chain Sequence'].astype(str).tolist()
light_chains = df['Antibody Light Chain Sequence'].astype(str).tolist()
antigens = df['Antigen Sequence'].astype(str).tolist()
linkers = df['Linker Isosmiles'].tolist()
payloads = df['Payload Isosmiles'].tolist()

Let's first generate embeddings from the protein sequences using ESM-2.

In [13]:
MAX_SEQ_LENGTH = 1000

# Function to get embeddings from protein sequences
def get_embeddings(sequences):
    embeddings = []

    for seq in tqdm(sequences, desc="Generating embeddings"):
        inputs = tokenizer_esm(
            seq,
            return_tensors='pt',
            truncation=True,
            padding='max_length',
            max_length=MAX_SEQ_LENGTH,
            is_split_into_words=False
        )

        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = esm_model(**inputs)
        
        # Extract CLS token
        cls_emb = outputs.last_hidden_state[:, 0, :].squeeze().cpu()
        embeddings.append(cls_emb)

    return embeddings
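
Because the tokenizer call uses `truncation=True` with `max_length=MAX_SEQ_LENGTH`, any sequence longer than the limit is silently cut off. The sketch below counts how many sequences would be affected. It assumes one token per residue plus two special tokens (start and end) per sequence, which is how the ESM tokenizer is understood to behave; `count_truncated` is a hypothetical helper and the sequences are toy strings, not dataset entries.

```python
def count_truncated(sequences, max_length, n_special_tokens=2):
    """Count sequences that would lose residues under tokenizer truncation.

    Assumes one token per residue and `n_special_tokens` added per sequence.
    """
    max_residues = max_length - n_special_tokens
    return sum(1 for seq in sequences if len(seq) > max_residues)


# Toy sequences of 120, 998, and 1255 residues
toy_sequences = ["A" * 120, "A" * 998, "A" * 1255]
print(count_truncated(toy_sequences, max_length=1000))  # 1
```

Running the same check on `antigens` before embedding would show whether long antigen sequences are being shortened by the 1000-token limit.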

Now we can use the above function to generate embeddings for our protein sequences (heavy chains, light chains, and antigens) by passing the corresponding sequence lists as input.

In [14]:
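# Generate embeddings from protein sequences.
# This step can be slow, so the calls are commented out here and the cached
# tensors are loaded from disk in a later cell. Uncomment them on a first run.
# print("Generating embeddings for heavy chains, light chains, and antigens...")

# heavy_embeddings = get_embeddings(heavy_chains)
# light_embeddings = get_embeddings(light_chains)
# antigen_embeddings = get_embeddings(antigens)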

In [15]:
# import torch

# # Save each tensor
# torch.save(heavy_embeddings, 'heavy_embeddings.pt')
# torch.save(light_embeddings, 'light_embeddings.pt')
# torch.save(antigen_embeddings, 'antigen_embeddings.pt')

In [16]:
# Load each tensor separately
heavy_embeddings = torch.load('heavy_embeddings.pt')
light_embeddings = torch.load('light_embeddings.pt')
antigen_embeddings = torch.load('antigen_embeddings.pt')



Now let's generate embeddings for the payload and linker SMILES strings using ChemBERTa.

In [17]:
# Function to get embedding for a single SMILES string
def get_smiles_embedding(smiles: str):
    inputs = chemberta_tokenizer(smiles, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = chemberta_model(**inputs)
        # Use the CLS token representation (first token)
        embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, hidden_size)
        return embedding.squeeze().cpu().numpy()

The function above generates a ChemBERTa embedding for a single SMILES string and can be used for both linker and payload molecules. <br>Let's try it on an example:

In [18]:
example = payloads[0]
print(example)

CCN(C(=O)CN)C1COC(OC2C(OC3C#C/C=C\C#CC4(O)CC(=O)C(NC(=O)OC)=C3/C4=C\CSSC(C)(C)CC(=O)NCCOCCOC)OC(C)C(NOC3CC(O)C(SC(=O)c4c(C)c(I)c(OC5OC(C)C(O)C(OC)C5O)c(OC)c4OC)C(C)O3)C2O)CC1OC


In [19]:
example_embedding = get_smiles_embedding(example)
print(f"Embedding shape for example payload: {example_embedding.shape}")

Embedding shape for example payload: (768,)


Now that we understand how to generate embeddings from SMILES strings, let's create embeddings for all linker and payload molecules in the dataset.

In [20]:
# Function to generate embeddings for a list of SMILES

def generate_embeddings(smiles_list):
    """Generate embeddings for a list of SMILES strings."""
    embeddings = []
    for smi in tqdm(smiles_list):
        try:
            emb = get_smiles_embedding(smi)
            embeddings.append(emb)
        except Exception as e:
            print(f"Failed for {smi}: {e}")
            embeddings.append(None)
    return embeddings
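
Note that `generate_embeddings` stores `None` for any SMILES string that fails, and a `None` entry would break the concatenation step later on. A possible guard, sketched here with a hypothetical helper and toy lists, is to keep only the rows where every embedding list has a real value.

```python
def indices_without_failures(*embedding_lists):
    """Return row indices for which every list holds an embedding (no None)."""
    n_rows = len(embedding_lists[0])
    if any(len(lst) != n_rows for lst in embedding_lists):
        raise ValueError("all embedding lists must have the same length")
    return [
        i for i in range(n_rows)
        if all(lst[i] is not None for lst in embedding_lists)
    ]


# Toy data: the second payload failed to embed
toy_payload = [[0.1, 0.2], None, [0.3, 0.4]]
toy_linker = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]]

keep = indices_without_failures(toy_payload, toy_linker)
print(keep)  # [0, 2]
```

The same index list would then have to be applied to the protein embeddings, the scaled DAR values, and the labels so that all inputs stay aligned.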

With the function defined, we can now pass our list of payload SMILES strings to generate embeddings. This process transforms each payload molecule into a numerical vector representation that captures its chemical properties.

In [21]:
payload_embeddings = generate_embeddings(payloads) # embeddings from payload smiles

100%|██████████| 435/435 [00:09<00:00, 44.13it/s]


Similarly, we can get embeddings from the linker SMILES:

In [22]:
linker_embeddings = generate_embeddings(linkers) # embeddings from linker smiles

100%|██████████| 435/435 [00:06<00:00, 65.60it/s]


Now that we’ve generated embeddings for all protein sequences as well as the payload and linker SMILES, we can concatenate them with the standardized DAR value to form a complete feature vector for each ADC.

In [23]:
import torch


def to_1d_tensor(emb):
    """Convert an embedding (tensor or NumPy array) to a 1-D float32 tensor."""
    if not isinstance(emb, torch.Tensor):
        emb = torch.tensor(emb, dtype=torch.float32)
    return emb.to(torch.float32).reshape(-1)


adc_embeddings = []

num_adcs = len(heavy_embeddings)

for i in range(num_adcs):
    # Collect the six inputs for this ADC in a fixed order
    parts = [
        heavy_embeddings[i],
        light_embeddings[i],
        antigen_embeddings[i],
        payload_embeddings[i],
        linker_embeddings[i],
        dar_scaled[i],
    ]

    # Concatenate all embeddings
    full_emb = torch.cat([to_1d_tensor(p) for p in parts])  # shape: [combined_dim]

    adc_embeddings.append(full_emb)

Now, let's check the embedding shape of each input we have generated.

In [24]:
print("Payload shape:", payload_embeddings[0].shape)
print("Linker shape:", linker_embeddings[0].shape)
print("Heavy shape:", heavy_embeddings[0].shape)
print("Light shape:", light_embeddings[0].shape)
print("Antigen shape:", antigen_embeddings[0].shape)
print("DAR shape:", dar_scaled[0].shape)

Payload shape: (768,)
Linker shape: (768,)
Heavy shape: torch.Size([320])
Light shape: torch.Size([320])
Antigen shape: torch.Size([320])
DAR shape: (1,)


Let's see the shape of a single concatenated ADC embedding, i.e., the feature vector for one ADC.

In [25]:
full_emb.shape

torch.Size([2497])

Now, we can check the shape of the full batch tensor (i.e., all ADCs stacked), where the first dimension is the number of ADCs and the second is the embedding size.

In [26]:
adc_batch_tensor = torch.stack(adc_embeddings)
adc_batch_tensor.shape

torch.Size([435, 2497])

#### Defining MLP (Multi-Layer Perceptron)

Before we move into model training, it’s important to understand the architecture of the MLP (Multi-Layer Perceptron). An MLP is a type of feedforward neural network consisting of an input layer, one or more hidden layers, and an output layer. Each neuron in a layer is fully connected to every neuron in the next, allowing the model to learn complex, non-linear patterns from the data. Two other settings matter: dropout and the activation function. Dropout randomly disables a fraction of neurons during each training iteration, which keeps the network from relying too heavily on any one neuron and helps it learn more robust features. The activation function is the non-linearity applied in the hidden layers of the model. <br>

In our setup, the input layer has a dimension of 2497, which corresponds to the size of the combined embeddings. This is followed by two hidden layers that help the model extract deeper hierarchical features, and finally, an output layer that produces predictions. The other settings are a dropout rate of 0.2 and the ReLU activation function.

In [27]:
import torch
import torch.nn as nn
from deepchem.models.torch_models.layers import MultilayerPerceptron

# Define the model
ADCNet = MultilayerPerceptron(
    d_input= 2497,
    d_output=1,
    d_hidden=(1024, 256),
    dropout=0.2,
    activation_fn='relu'
)

# Forward pass
op = ADCNet(adc_batch_tensor)
print(op.shape)  # [435, 1]



torch.Size([435, 1])
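
It helps to know how large this network is relative to the dataset. The sketch below counts the trainable parameters of a stack of fully connected layers with the sizes used above. It assumes plain linear layers with biases and no extra parameters from batch normalization or skip connections; `mlp_parameter_count` is a hypothetical helper, not a DeepChem function.

```python
def mlp_parameter_count(d_input, d_hidden, d_output):
    """Weights plus biases of consecutive fully connected layers."""
    sizes = [d_input, *d_hidden, d_output]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


n_params = mlp_parameter_count(2497, (1024, 256), 1)
print(n_params)                   # 2820609
print(round(n_params / 435))      # parameters per ADC in the dataset
```

Under these assumptions the model has roughly 2.8 million parameters for 435 examples, which is one reason dropout is used and why the validation loss is worth watching for overfitting.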


We will use the `label（100nm）` column from the dataset as our target variable. The dataset also includes labels at 1 nm, 10 nm, and 1000 nm, but this tutorial uses only the 100 nm label.

In [28]:
# Now extract the label
label_col = "label（100nm）"
labels = df[label_col].values  # 0 or 1

# Convert to PyTorch tensor
y = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)

In [29]:
from deepchem.data.datasets import NumpyDataset
import numpy as np

X = adc_batch_tensor.numpy()  # Convert to numpy array
y = labels

dataset = NumpyDataset(X= X, y= y, ids= np.arange(len(y)))

print(f"Dataset created with {len(dataset)} samples.")
print(f"Features shape: {dataset.X.shape}, Labels shape: {dataset.y.shape}")
print(f"First sample features: {dataset.X[0]}, First sample label: {dataset.y[0]}")

Dataset created with 435 samples.
Features shape: (435, 2497), Labels shape: (435,)
First sample features: [-0.20455705  0.8342899   0.15026727 ... -0.15449478  0.31129766
 -1.2800006 ], First sample label: 0


In [30]:
print(type(dataset))

<class 'deepchem.data.datasets.NumpyDataset'>


In [31]:
import deepchem as dc

# Creating a RandomSplitter object
splitter = dc.splits.RandomSplitter()

# Splitting dataset into train and test datasets
train_dataset, test_dataset = splitter.train_test_split(dataset)
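
Conceptually, a random split shuffles the sample indices and cuts them at a fixed fraction. The sketch below reproduces that idea in plain Python. It assumes an 80/20 split, which is believed to be the default of `train_test_split`; `random_split_indices` is a hypothetical helper, and its shuffling will not match the exact indices DeepChem selects.

```python
import random


def random_split_indices(n_samples, frac_train=0.8, seed=0):
    """Shuffle indices with a fixed seed and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(frac_train * n_samples)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = random_split_indices(435)
print(len(train_idx), len(test_idx))  # 348 87
```

Fixing the seed makes the split repeatable, which matters on a dataset this small because a different held-out set can shift the reported metrics noticeably.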

Let's define the model, loss function, and optimizer that we'll be using for training.

In [32]:
# Model, Loss, Optimizer
from deepchem.models.optimizers import Adam

model_name = ADCNet
criterion = nn.MSELoss()
optimizer = Adam(learning_rate=1e-4)

In [33]:
from deepchem.models.torch_models.torch_model import TorchModel
from deepchem.models.torch_models.layers import MultilayerPerceptron
from deepchem.models.losses import L2Loss

class ADCNetModel(TorchModel):
    def __init__(self,input_dim=2497, output_dim=1, hidden_dims=(1024, 256), dropout=0.2, activation_fn='relu', **kwargs):
        
        model = MultilayerPerceptron(
            d_input=input_dim,
            d_output=output_dim,
            d_hidden=hidden_dims,
            dropout=dropout,
            activation_fn=activation_fn
        )
        super(ADCNetModel, self).__init__(model = model, loss = L2Loss(), **kwargs)

In [None]:
model = ADCNetModel()


In [None]:
model.fit(train_dataset, nb_epoch=1)

### Training

We'll train the model on the training set and evaluate its performance on the validation set across multiple epochs.

In [None]:
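# Lists to store the training and validation loss values after each epoch
train_losses = []
val_losses = []

# Train the plain MLP defined earlier. The held-out split produced by
# train_test_split serves as the validation set in this loop.
model_mlp = ADCNet
X_train = torch.tensor(train_dataset.X, dtype=torch.float32)
y_train = torch.tensor(train_dataset.y, dtype=torch.float32).reshape(-1, 1)
X_val = torch.tensor(test_dataset.X, dtype=torch.float32)
y_val = torch.tensor(test_dataset.y, dtype=torch.float32).reshape(-1, 1)

# A raw PyTorch loop needs a torch optimizer bound to the model parameters
optimizer = optim.Adam(model_mlp.parameters(), lr=1e-4)

# Training Loop
epochs = 100
for epoch in range(epochs):
    model_mlp.train()
    optimizer.zero_grad()

    outputs = model_mlp(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

    # Evaluate on the validation set without tracking gradients
    model_mlp.eval()
    with torch.no_grad():
        val_outputs = model_mlp(X_val)
        val_loss = criterion(val_outputs, y_val)

    train_losses.append(loss.item())
    val_losses.append(val_loss.item())

    if epoch % 10 == 0:
        print(f"Epoch {epoch} | Train Loss: {loss.item():.4f} | Val Loss: {val_loss.item():.4f}")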

The loop reports the training and validation loss every 10 epochs. A high training loss is expected at the early stage of training. If the training loss and validation loss both decrease over time, the model is learning patterns that carry over to unseen data rather than only memorizing the training set.

Let's check out the first 5 predictions from the MLP model on the validation set.

In [None]:
model_mlp.eval()
with torch.no_grad():
    preds = model_mlp(X_val)
    print(preds[:5])

Now let's compare the predicted values with the actual labels to see how well the model is performing.

In [None]:
print("Predictions:", val_outputs.squeeze().tolist())
print("Ground Truth:", y_val.squeeze().tolist())

Finally, let's plot the training and validation loss curves to see how the model learns over time. A steadily decreasing training loss indicates that the model is fitting the data, while the validation loss provides insight into how well the model generalizes to unseen data. If the validation loss starts increasing while the training loss continues to decrease, it may indicate overfitting. By examining these curves, we can diagnose issues such as underfitting, overfitting, or the need for further hyperparameter tuning.

In [None]:
# Plotting loss curve
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("MSE Loss")
plt.title("Training and Validation Loss over Epochs")
plt.legend()
plt.grid(True)
plt.show()
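
The same diagnosis can be done numerically. The sketch below finds the epoch with the lowest validation loss and the epoch at which a simple patience rule would stop training. Both helpers are hypothetical and the loss values are invented for illustration; they are not results from this tutorial.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def early_stop_epoch(val_losses, patience=3):
    """First epoch where validation loss has not improved for `patience` epochs.

    Returns None if that never happens.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Invented validation losses that fall and then rise again
toy_val_losses = [0.30, 0.24, 0.21, 0.22, 0.23, 0.25, 0.27]
print(best_epoch(toy_val_losses))        # 2
print(early_stop_epoch(toy_val_losses))  # 5
```

Applied to the `val_losses` list collected during training, a rising tail after the best epoch would be the overfitting pattern described above.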

It's time to check the accuracy of our model. Accuracy is the proportion of correct predictions (where the predicted label matches the true label) out of the total number of samples in the validation set.

In [None]:
predicted_labels = torch.round(val_outputs.squeeze())  # round to 0 or 1
true_labels = y_val.squeeze()


correct = (predicted_labels == true_labels).sum().item()
total = true_labels.size(0)

accuracy = correct / total
print(f"Accuracy: {accuracy * 100:.2f}%")
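
As a small worked example of the same calculation, the sketch below thresholds continuous scores at 0.5 and compares them with binary labels. The scores and labels are invented, and `accuracy_from_scores` is a hypothetical helper. It uses an explicit `>= 0.5` cutoff, which can differ from `torch.round` for a score of exactly 0.5.

```python
def accuracy_from_scores(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == t for p, t in zip(predicted, labels))
    return correct / len(labels)


toy_scores = [0.91, 0.12, 0.55, 0.40, 0.78]  # invented model outputs
toy_labels = [1, 0, 0, 0, 1]                 # invented ground truth

print(accuracy_from_scores(toy_scores, toy_labels))  # 0.8
```

Here four of the five predictions match, giving 80% accuracy. On a small validation set each sample carries a lot of weight: with 87 held-out samples, for instance, one prediction shifts accuracy by a little over one percentage point.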

In our run, the model achieved a validation accuracy of 82.76%. The exact figure can vary with the random split and weight initialization.

## References <a name="references"></a>

<a name="1"></a> [1] Chen, L., Li, B., Chen, Y., Lin, M., Zhang, S., Li, C., Pang, Y., & Wang, L. (2024). ADCNet: A unified framework for predicting the activity of antibody‑drug conjugates. https://arxiv.org/pdf/2401.09176

<a name="2"></a> [2] DeepChem Team. (n.d.). Introduction to Antibody-Drug Conjugates. https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Introduction_to_Antibody_Drug_Conjugates.ipynb

<a name="3"></a> [3] Facebook AI Research. (2020). ESM: Evolutionary Scale Modeling [GitHub repository]. https://github.com/facebookresearch/esm

<a name="4"></a> [4] idrugLab. (2024). ADCNet: A unified framework for predicting the activity of antibody‑drug conjugates [GitHub repository]. https://github.com/idrugLab/ADCNet

<a name="5"></a> [5] Shen, L. T., Sun, X. N., Chen, Z., Guo, Y., Shen, Z. Y., Song, Y., Xin, W. X., Ding, H. Y., Ma, X. Y., Xu, W. B., Zhou, W. Y., Che, J. X., Tan, L. L., Chen, L. S., Chen, S. Q., Dong, X. W., Fang, L., & Zhu, F. (2024).
ADCdb: the database of antibody‑drug conjugates. Nucleic Acids Research, 52(D1), D1097–D1109. PMID 37831118.
Website: https://adcdb.idrblab.net/

<a name="6"></a> [6] idrugLab. (2024). ADCNet: A unified framework for predicting the activity of antibody‑drug conjugates [GitHub repository]. https://github.com/idrugLab/ADCNet

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Discord
The DeepChem [Discord](https://discord.gg/5d5bEVSt) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

# Citing this tutorial
If you found this tutorial useful please consider citing it using the provided BibTeX.

```
@manual{Molecular Machine Learning,
 title={Introduction to ADCNet: Predicting ADC Activity with DeepChem},
 organization={DeepChem},
 author={Patra, Sonali Lipsa and Singh, Rakshit Kr. and Bisoi, Ankita and Ramsundar, Bharath},
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/ADCNet.ipynb}},
 year={2025},
}
```