## **Introduction to ADCNet: Predicting ADC Activity with DeepChem**

Advancements in molecular biology have revolutionized drug discovery, focusing on more selective clinical candidates. Traditional small-molecule inhibitors face limitations like off-target toxicity and drug resistance. Monoclonal antibodies (mAbs) improved targeting accuracy but still struggle with heterogeneous cancer cell populations. To address these issues, Antibody-Drug Conjugates (ADCs) have emerged as a promising cancer therapy. By combining the specificity of mAbs with potent cytotoxic drugs linked through chemical linkers, ADCs enable targeted delivery to cancer cells, reducing harm to healthy tissues. However, developing effective ADCs is complex, requiring careful selection of antibodies, payloads, and linkers, which all impact safety and efficacy.

ADCNet [[1]](#1) was developed as a unified deep learning framework for predicting the activity of antibody-drug conjugates. It uses ESM-2 to extract features from antibody and antigen sequences and FG-BERT to process the SMILES strings of linkers and payloads. The framework also incorporates the Drug-Antibody Ratio (DAR). By learning and integrating patterns from these molecular components, ADCNet aids the rational design of safer and more effective ADC candidates.

In this tutorial, we will explore how to predict the therapeutic activity of Antibody-Drug Conjugates (ADCs) using ADCNet, a unified deep learning framework implemented in DeepChem. Before proceeding with this guide, we recommend building a foundational understanding of ADCs. Refer to the "Introduction to Antibody-Drug Conjugates" [[2]](#2) notebook available in the DeepChem tutorials.

# **Colab**
This tutorial and the rest in this sequence can be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Introduction_to_ADCNet.ipynb)

## **Evolution of ADCs**

As the field continues to advance, the design of ADCs has been progressively refined over several generations. First-generation ADCs, such as *gemtuzumab ozogamicin*, were built using humanized antibodies, a single type of cytotoxic drug, and acid-sensitive linkers. These components were originally derived from murine antibodies and conventional chemotherapy agents. The conjugation process was typically random, targeting lysine or cysteine residues, which resulted in mixtures with uneven Drug-Antibody Ratios (DAR). This lack of uniformity made it difficult to establish a consistent therapeutic index and often led to off-target toxicity and narrow treatment windows. As a result, first-generation ADCs were prone to unmanageable side effects and limited clinical effectiveness.<br>

In contrast, second-generation ADCs, such as *brentuximab vedotin* and *trastuzumab emtansine*, introduced more powerful cytotoxic agents like tubulin inhibitors, along with more stable linkers. These improvements significantly enhanced both the treatment’s efficacy and molecular stability.<br>

Third-generation ADCs, represented by drugs like *polatuzumab vedotin* and *enfortumab vedotin*, introduced site-specific conjugation techniques and hydrophilic linkers. These innovations allowed for precise control over DAR, thereby enhancing both safety and therapeutic efficacy. [[3]](#3)

Let's take a closer look at the evolution of Antibody-Drug Conjugates (ADCs) with some of the first, second, and third-generation ADCs.

<img src = "assets/generations_adc.jpg" alt="image" height="700" width="800"> <br> **Fig.1** Schematic representations of first, second, and third-generation Antibody-Drug Conjugates (ADCs). [[4]](#4)

The figure above shows how ADC design has progressed across generations. It highlights key differences, including linker cleavability (cleavable versus non-cleavable), the format of the monoclonal antibody (such as the IgG1 subtype), and the number of cytotoxic warheads attached to each antibody. Each generation improves on specificity, stability, and therapeutic efficacy, driven by advances in site-specific conjugation and payload delivery mechanisms.

Now that we have explored the fundamentals and evolution of ADCs, let's delve into the ADCNet architecture.

## **Overview of the model architecture**

We follow a three-step execution process. First, we process different types of input data. Second, we generate embeddings from these inputs using pretrained models. Finally, we concatenate the embeddings and feed them into a Multilayer Perceptron (MLP) to make predictions.

### Inputs

We examine three different types of inputs used in the model:

(I) **Protein Sequences:**
- **Antibody Heavy Chain:** The protein sequence of the antibody's heavy chain.
- **Antibody Light Chain:** The protein sequence of the antibody's light chain.
- **Antigen:** The protein sequence of the target antigen.

(II) **Small Molecules (SMILES representations):**
- **Linker:** A SMILES string representing the chemical structure of the linker.
- **Payload:** A SMILES string representing the chemical structure of the cytotoxic payload.

(III) **Numerical Value:**
- **Drug–Antibody Ratio (DAR):** A numerical value indicating the average number of payload molecules attached to each antibody.

Each input type is processed individually to extract its unique set of features.
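To make the three input groups concrete, here is a minimal sketch of how one ADC example could be held in memory. The `ADCRecord` class and its field names are hypothetical (they are not part of ADCNet or DeepChem), and the sequences are shortened toy strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ADCRecord:
    """One ADC example. Illustrative container, not an ADCNet or DeepChem type."""
    heavy_chain: str     # antibody heavy-chain amino-acid sequence
    light_chain: str     # antibody light-chain amino-acid sequence
    antigen: str         # target antigen amino-acid sequence
    linker_smiles: str   # linker structure as a SMILES string
    payload_smiles: str  # payload structure as a SMILES string
    dar: float           # average number of payload molecules per antibody

    def protein_inputs(self):
        """Sequences destined for the protein language model."""
        return [self.heavy_chain, self.light_chain, self.antigen]

    def molecule_inputs(self):
        """SMILES strings destined for the small-molecule encoder."""
        return [self.linker_smiles, self.payload_smiles]


# Toy values for illustration only (truncated sequences, placeholder payload).
example = ADCRecord(
    heavy_chain="EVQLVESGG",
    light_chain="DIQMTQSPS",
    antigen="MELAALCRW",
    linker_smiles="CC(S)CCC(N)=O",
    payload_smiles="CCO",
    dar=3.5,
)
print(len(example.protein_inputs()), len(example.molecule_inputs()), example.dar)
```

Grouping the fields this way mirrors the three processing routes described above: three sequences for the protein model, two SMILES strings for the molecule model, and one scalar.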

### Generating Embeddings

ADCNet uses pre-trained language models to transform the inputs into embeddings:

- Protein sequences (antibody heavy chain, antibody light chain, and antigen sequences) are processed using ESM-2 (Evolutionary Scale Modeling), a Transformer-based protein language model. ESM-2 converts these sequences into dense embeddings that encode their structural and functional properties.
- SMILES representations of the linker and payload are processed using ChemBERTa, which generates embeddings that capture the chemical properties of these small molecules.

(**Note**: While the original ADCNet paper used FG-BERT, we use ChemBERTa here because it is readily available and effective within the DeepChem framework.)

### Prediction

After generating the embeddings:

- The embeddings from the three protein sequences (heavy chain, light chain, and antigen), the two small molecules (linker and payload), and the processed DAR value are combined into a single feature vector. 
- This combined feature vector is then input into a Multilayer Perceptron (MLP), which consists of two fully connected layers with non-linear activation functions. The MLP processes these concatenated features to predict the therapeutic activity of the antibody-drug conjugate.

Below is the architecture diagram of ADCNet, illustrating the complete workflow from input sequences and molecular structures through embedding layers and model components to the final prediction output.

<img src="assets/ADCNet_2.png" alt="image2" height = "800" width="800"> <br> **Fig.2** Diagram illustrating the network architecture of the ADCNet model. [[1]](#1)

### Versatility of ADCNet

According to the original [ADCNet paper](https://arxiv.org/pdf/2401.09176) [[1]](#1), the architecture uses the protein language model (pLM) ESM-2 to process antibody and antigen sequences. While ESM-2 is highly effective, confining the framework to a single model limits its adaptability to other protein representation techniques that could provide complementary strengths. Researchers can explore alternative models, such as ProtBERT, ProtT5, or newer protein language models that are tailored to specific tasks. This flexibility enables experimentation with models that can better capture the structural and functional nuances of antibodies and antigens, depending on the context of the antibody-drug conjugate (ADC) design. For instance, some models might excel at predicting binding affinity, while others could be better at handling sequence diversity.

Similarly, the original ADCNet architecture utilizes FG-BERT for encoding linker and payload SMILES strings. Although FG-BERT is effective for small molecule representation, relying solely on it limits the framework's ability to incorporate advancements in chemical modeling. To enhance performance, researchers are encouraged to adopt other small molecule models, such as ChemBERTa, MolBERT, or various graph-based and transformer-based chemical encoders.

In our implementation, we use ESM-2 8M, but researchers are encouraged to experiment with larger ESM-2 variants, such as ESM-2 650M and ESM-2 3B, which may offer improved accuracy at a higher computational cost. As noted earlier, we also replace FG-BERT with ChemBERTa in this implementation to leverage its strengths in small molecule representation.
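Swapping encoders changes the width of the concatenated feature vector, and therefore the input size of the MLP. The sketch below works this out for a few ESM-2 checkpoints. The helper `mlp_input_dim` is a hypothetical name, and the hidden sizes listed are the values we believe these Hugging Face checkpoints use; confirm them with `model.config.hidden_size` before relying on them. The ChemBERTa size of 384 matches the embedding shapes printed later in this tutorial.

```python
# Hidden sizes of some public ESM-2 checkpoints (assumed values; verify via
# model.config.hidden_size after loading the checkpoint).
ESM2_HIDDEN_SIZE = {
    "facebook/esm2_t6_8M_UR50D": 320,
    "facebook/esm2_t33_650M_UR50D": 1280,
    "facebook/esm2_t36_3B_UR50D": 2560,
}
CHEMBERTA_HIDDEN_SIZE = 384  # size of the ChemBERTa embeddings in this tutorial


def mlp_input_dim(protein_dim, molecule_dim, n_proteins=3, n_molecules=2, n_scalars=1):
    """Width of the concatenated vector: 3 proteins + 2 molecules + the DAR value."""
    return n_proteins * protein_dim + n_molecules * molecule_dim + n_scalars


for name, dim in ESM2_HIDDEN_SIZE.items():
    print(name, mlp_input_dim(dim, CHEMBERTA_HIDDEN_SIZE))
```

With the 8M checkpoint this gives 3 × 320 + 2 × 384 + 1 = 1729, the input dimension used in the rest of the tutorial.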

### Setup

Before we continue, let's install DeepChem in our environment and set up the other required dependencies.

In [None]:
# install the necessary libraries
!pip install deepchem pandas numpy torch scikit-learn transformers tqdm

In [None]:
import deepchem as dc
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
from transformers import EsmTokenizer, EsmModel
from transformers import AutoModel, AutoTokenizer
from deepchem.trans import NormalizationTransformer

### Data Collection and Pre-Processing

We will be using ADCdb, which was originally utilized by ADCNet and can be accessed at [[5]](#5). This database contains information on 6,572 antibody-drug conjugates (ADCs), including 359 that are approved by the FDA or are currently in clinical trials, 501 in preclinical testing, 819 with in vivo testing data, 1,868 with cell line or target testing data, and 3,025 without such testing.

For convenience, we will work with the preprocessed version of the data provided by the ADCNet authors [[6]](#6) rather than the full ADCdb. This version can be found in the assets folder or in the ADCNet GitHub repository. Let’s examine the dataset.

In [3]:
# load file
file_path = "assets/adcdb.csv"
df = pd.read_csv(file_path)

print(f'We have data of {len(df)} ADCs.')

We have data of 435 ADCs.


In [4]:
df.columns = [col.strip().replace(' (', '(').replace(') ', ')') for col in df.columns]
print(df.columns.to_list())

['index', 'ADC ID', 'ADC Name', 'Antibody Name', 'Antibody Heavy Chain Sequence', 'Antibody Light Chain Sequence', 'Antigen Sequence', 'Payload Isosmiles', 'Linker Isosmiles', 'DAR', 'label（10nm）', 'label（100nm）', 'label（1nm）', 'label（1000nm）', 'DAR_val']


We can see the dataset contains columns representing ADC names, antibody sequences, antigen sequences, SMILES strings for linker/payload, and labels at multiple concentrations. Now let's preview the dataset we will be using:

In [5]:
df.head()

Unnamed: 0,index,ADC ID,ADC Name,Antibody Name,Antibody Heavy Chain Sequence,Antibody Light Chain Sequence,Antigen Sequence,Payload Isosmiles,Linker Isosmiles,DAR,label（10nm）,label（100nm）,label（1nm）,label（1000nm）,DAR_val
0,0,DRG0ABJAM,Trastuzumab-BCN-HydraSpace-Val-Cit-PABC-Gly-Ca...,Trastuzumab,EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKL...,MELAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDML...,CCN(C(=O)CN)C1COC(OC2C(OC3C#C/C=C\C#CC4(O)CC(=...,CC(C)C(NC(=O)OCCN(CCOC(=O)NC(C(=O)NC(CCCNC(N)=...,1.86,0,0,0,0,1.86
1,1,DRG0ZBATX,Anti-KIT NEG087?SSNPP-DM3,Anti-KIT mAb NEG087,EVQLVESGGGLVQPGGSLRLSCAASGFTFSDYYMAWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKL...,MRGARGAWDFLCVLLLLLRVQTGSSQPSVSPGEPSPPSIHPGKSDL...,C[C@@H]1[C@@H]2C[C@]([C@@H](/C=C/C=C(/CC3=CC(=...,CC(S)CCC(N)=O,3.0-4.0,0,0,0,0,3.5
2,2,DRG0XJKXB,Trastuzumab-C239I-SG3400,Engineered trastuzumab,EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKL...,MELAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDML...,C=C1CC2C=Nc3cc(OCCCOc4cc5c(cc4OC)C(=O)N4CC(=C)...,C[C@@H](C(=O)NC1=CC=C(C=C1)CO)NC(=O)[C@H](C(C)...,1.71,1,1,1,1,1.71
3,3,DRG0ZOYQV,Datopotamab deruxtecan,Datopotamab,QVQLVQSGAEVKKPGASVKVSCKASGYTFTTAGMQWVRQAPGQGLE...,DIQMTQSPSSLSASVGDRVTITCKASQDVSTAVAWYQQKPGKAPKL...,MARGPGLAPPPLRLPLLLLVLAAVTGHTAAQDNCTCPTNKMTVCSP...,CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=C5[C@H](CCC6=C...,C1=CC=C(C=C1)C[C@@H](C(=O)NCC(=O)O)NC(=O)CNC(=...,4,1,1,1,1,4.0
4,4,DRG0COMTY,Telisotuzumab vedotin,Telisotuzumab,QVQLVQSGAEVKKPGASVKVSCKASGYIFTAYTMHWVRQAPGQGLE...,DIVMTQSPDSLAVSLGERATINCKSSESVDSYANSFLHWYQQKPGQ...,MKAPAVLAPGILVLLFTLVQRSNGECKEALAKSEMNVNMKYQLPNF...,CC[C@H](C)[C@@H]([C@@H](CC(=O)N1CCC[C@H]1[C@@H...,CC(C)[C@@H](C(=O)N[C@@H](CCCNC(=O)N)C(=O)NC1=C...,3.1,1,1,1,1,3.1
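In the preview above, the `DAR` column can hold a range such as `3.0-4.0`, while `DAR_val` holds a single number (3.5 for that row). The preprocessing that produced `DAR_val` is not shown in this notebook, so the sketch below is only an assumption inferred from that row: ranges are replaced by their midpoint. The function name `parse_dar` is hypothetical.

```python
def parse_dar(raw):
    """Convert a DAR entry such as '1.86', 4 or '3.0-4.0' to a float.

    Assumption: a range 'low-high' maps to its midpoint, which matches the
    previewed row where DAR '3.0-4.0' has DAR_val 3.5. DAR is non-negative,
    so a '-' is treated as a range separator, never as a minus sign.
    """
    text = str(raw).strip()
    if "-" in text:
        low, high = (float(part) for part in text.split("-", 1))
        return (low + high) / 2.0
    return float(text)


print(parse_dar("3.0-4.0"), parse_dar("1.86"), parse_dar(4))
```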


Set up the computation device (GPU if available, otherwise CPU).

In [6]:
# Choose device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Preprocessing Numeric Features

We can see that the dataset includes an important feature: the Drug-Antibody Ratio (DAR), which plays a significant role in determining the efficacy and safety of ADCs, as it represents the average number of drug molecules attached to each antibody. Since DAR is a continuous numerical feature, we will scale it before inputting it into our model. Standardizing DAR to have zero mean and unit variance ensures that it is on a comparable scale with other features, which helps neural networks train more efficiently and converge faster.

Now we'll normalize the DAR values using DeepChem’s NormalizationTransformer.

In [7]:
dar_values = df['DAR_val']
n_samples = len(dar_values)
dar_dataset = dc.data.NumpyDataset(X=dar_values.values.reshape(-1, 1))
normalizer = dc.trans.NormalizationTransformer(transform_X=True, dataset=dar_dataset)

# Transform the dataset (this normalizes the X values to zero mean and unit variance)
normalized_dataset = normalizer.transform(dar_dataset)

# Extract the scaled DAR values
dar_scaled = normalized_dataset.X.flatten()
dar_scaled = torch.tensor(dar_scaled, device=device, dtype=torch.float32)

In [8]:
print(f"Original DAR mean: {dar_values.mean():.4f}, std: {dar_values.std():.4f}")
print(f"Scaled DAR mean: {dar_scaled.mean():.4f}, std: {dar_scaled.std():.4f}")

Original DAR mean: 3.8685, std: 1.5709
Scaled DAR mean: -0.0000, std: 1.0012
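The scaled standard deviation prints as 1.0012 rather than exactly 1. This is consistent with the values being standardized with a population standard deviation (dividing by n) and then measured with `torch.Tensor.std`, which applies Bessel's correction (dividing by n - 1) by default. The sketch below reproduces the effect on toy numbers using only the standard library; the toy values are made up.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 8.0]  # toy DAR values, for illustration only

# Standardize with the population standard deviation: z = (x - mean) / std
mean = statistics.mean(values)
pop_std = statistics.pstdev(values)  # divides by n
scaled = [(v - mean) / pop_std for v in values]

# The scaled values have a population std of exactly 1 ...
scaled_pop_std = statistics.pstdev(scaled)
# ... but their sample std (divides by n - 1) equals sqrt(n / (n - 1)).
scaled_sample_std = statistics.stdev(scaled)
print(scaled_pop_std, scaled_sample_std)

# For the 435 ADCs in this tutorial the same ratio is:
n = 435
expected_ratio = math.sqrt(n / (n - 1))
print(round(expected_ratio, 4))
```

The final value rounds to 1.0012, matching the printed output, so the small deviation from 1 is a reporting convention rather than a scaling error.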


Now that our data is preprocessed and the Drug–Antibody Ratio (DAR) values are standardized, we are ready to generate embeddings using pre-trained models for each input type.

In [9]:
# Extract sequences and SMILES from the dataframe

heavy_chains = df['Antibody Heavy Chain Sequence'].astype(str).tolist()
light_chains = df['Antibody Light Chain Sequence'].astype(str).tolist()
antigens = df['Antigen Sequence'].astype(str).tolist()
linkers = df['Linker Isosmiles'].astype(str).tolist()
payloads = df['Payload Isosmiles'].astype(str).tolist()

### Generating Embeddings with Pretrained Models

Before we move further, it's necessary to know about ESM-2. ESM-2 (Evolutionary Scale Modeling) is a protein language model using a transformer-based architecture to process protein sequences. It has been trained on large datasets of protein sequences to learn the relationships between amino acids and the structural and functional properties of proteins.

ESM-2 has demonstrated strong performance across various protein-related prediction tasks, making it a reliable choice for encoding protein sequences in deep learning workflows. To explore ESM-2 and other protein language models developed by Meta’s FAIR (Fundamental AI Research) team, visit the official GitHub repository [here](https://github.com/facebookresearch/esm).

Here we use the smallest ESM-2 model (esm2_t6_8M_UR50D, 6 layers, 8M parameters) for protein sequence embeddings. Larger ESM-2 models are available in the [Hugging Face Model Hub](https://huggingface.co/facebook/esm2_t6_8M_UR50D) for improved accuracy at the cost of increased computational resources.

In [None]:
# Load ESM-2 for protein sequences
esm_model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer_esm = AutoTokenizer.from_pretrained(esm_model_name)
esm_model = AutoModel.from_pretrained(esm_model_name).to(device)
esm_model.eval()

For small molecules like payloads and linkers, we will use ChemBERTa to generate embeddings from their SMILES strings. ChemBERTa is a transformer-style model for learning on SMILES strings. It is based on the RoBERTa architecture and can be used both for pretraining an embedding and for finetuning on downstream applications.

Here, we use DeepChem's `RobertaFeaturizer` alongside ChemBERTa from Hugging Face. `RobertaFeaturizer` is a wrapper around the RoBERTa tokenizer in Hugging Face's Transformers library that efficiently tokenizes large SMILES datasets for RoBERTa-based models.

In [None]:
# Load ChemBERTa for molecular sequences
from deepchem.feat import RobertaFeaturizer

featurizer = RobertaFeaturizer.from_pretrained("DeepChem/ChemBERTa-77M-MLM")
model = AutoModel.from_pretrained("DeepChem/ChemBERTa-77M-MLM")
model.to(device)
model.eval()

We have now initialized two Transformer-based models: ESM-2 for protein sequences and ChemBERTa for SMILES.

Next, we will generate embeddings for the antibody heavy chain, light chain, and antigen protein sequences in our dataset using ESM-2. These embeddings capture the structural and functional properties of each protein sequence, enabling the model to learn meaningful biological representations for downstream prediction tasks.

We start by defining a helper function that embeds the protein sequences one at a time with ESM-2.

In [12]:
MAX_SEQ_LENGTH = 1000

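# Embed each protein sequence with ESM-2, one sequence per forward pass
def get_embeddings(sequences):
    """Return a list with one 1D CPU tensor (the first-token embedding) per sequence."""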
    embeddings = []

    for seq in tqdm(sequences, desc="Generating embeddings"):
        inputs = tokenizer_esm(
            seq,
            return_tensors='pt',
            truncation=True,
            padding='max_length',
            max_length=MAX_SEQ_LENGTH,
            is_split_into_words=False
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = esm_model(**inputs)
        
        # Use the hidden state of the first (<cls>) token as the sequence embedding
        cls_emb = outputs.last_hidden_state[:, 0, :].squeeze().cpu()
        embeddings.append(cls_emb)

    return embeddings

Now we can use the above function to generate embeddings for our protein sequences (heavy chains, light chains, and antigens) by passing the corresponding sequence lists as input.

In [13]:
# Generate embeddings from protein sequences
print("Generating embeddings for heavy chains, light chains, and antigens: ")
heavy_embeddings = get_embeddings(heavy_chains)
light_embeddings = get_embeddings(light_chains)
antigen_embeddings = get_embeddings(antigens)

Generating embeddings for heavy chains, light chains, and antigens: 


Generating embeddings: 100%|██████████| 435/435 [01:25<00:00,  5.11it/s]
Generating embeddings: 100%|██████████| 435/435 [01:25<00:00,  5.08it/s]
Generating embeddings: 100%|██████████| 435/435 [01:30<00:00,  4.82it/s]


Users can uncomment the code below to save and load embeddings to save time on restart.

In [14]:
# Save and load tensors separately
# torch.save(heavy_embeddings, 'heavy_embeddings.pt')
# torch.save(light_embeddings, 'light_embeddings.pt')
# torch.save(antigen_embeddings, 'antigen_embeddings.pt')
# heavy_embeddings = torch.load('heavy_embeddings.pt')
# light_embeddings = torch.load('light_embeddings.pt')
# antigen_embeddings = torch.load('antigen_embeddings.pt')

In [18]:
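def get_molecular_embeddings(smiles_list, max_length=512):
    """Return a (n_molecules, hidden_size) tensor of mean-pooled ChemBERTa embeddings."""
    # Tokenize the whole list as one padded batch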
    inputs = featurizer(
        smiles_list,
        add_special_tokens=True,
        truncation=True,
        max_length=max_length,
        padding=True,
        return_tensors="pt"
    )
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)
    
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Mean pooling
        last_hidden_state = outputs.last_hidden_state
        masked_embeddings = last_hidden_state * attention_mask.unsqueeze(-1)
        embeddings = masked_embeddings.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)
    return embeddings
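The mean-pooling step in `get_molecular_embeddings` averages token embeddings while ignoring padding positions. Here is a small NumPy sketch of the same arithmetic on a toy tensor, so the result can be checked by hand. The numbers are made up, and NumPy is used instead of PyTorch only to keep the example lightweight.

```python
import numpy as np

# Toy "last hidden state": 2 sequences, 3 token positions, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
])
# Attention mask: 1 for real tokens, 0 for padding.
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

# Zero out padded positions, then divide by the number of real tokens.
masked = hidden * mask[:, :, None]
pooled = masked.sum(axis=1) / mask.sum(axis=1, keepdims=True)
print(pooled)
```

The padded position with value 100 has no influence on the first row, which is the reason for multiplying by the attention mask before averaging.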

The function above tokenizes a list of SMILES strings, runs them through ChemBERTa, and mean-pools the token embeddings over the non-padding positions. Let's use it to create embeddings for all linker and payload molecules in the dataset.
We start with the payload SMILES strings. This transforms each payload molecule into a numerical vector representation that captures its chemical properties.

In [19]:
payload_embeddings = get_molecular_embeddings(payloads)
payload_embeddings.shape

torch.Size([435, 384])

Similarly, we can get embeddings from the linker SMILES strings:

In [20]:
linker_embeddings = get_molecular_embeddings(linkers)
linker_embeddings.shape

torch.Size([435, 384])

Let's inspect the shape of each embedding individually to better understand the tensor dimensions for every input component.

In [21]:
print("Individual tensor shapes:")
print(f"Heavy: {heavy_embeddings[0].shape}")
print(f"Light: {light_embeddings[0].shape}")
print(f"Antigen: {antigen_embeddings[0].shape}")
print(f"Payload: {payload_embeddings[0].shape}")
print(f"Linker: {linker_embeddings[0].shape}")

Individual tensor shapes:
Heavy: torch.Size([320])
Light: torch.Size([320])
Antigen: torch.Size([320])
Payload: torch.Size([384])
Linker: torch.Size([384])


Now that we’ve generated embeddings from all protein sequences, as well as the payload and linker SMILES, we can concatenate them, along with the standardized DAR value to form a complete feature vector for each ADC.

In [22]:
def to_tensor_1d(emb):
    """Return `emb` as a flat float32 tensor on the CPU."""
    if not isinstance(emb, torch.Tensor):
        emb = torch.tensor(emb, dtype=torch.float32)
    # The ESM-2 embeddings are on the CPU, while the ChemBERTa embeddings and
    # DAR values may be on the GPU; torch.cat needs all inputs on one device.
    return emb.detach().cpu().float().flatten()

adc_embeddings = []

for i in range(len(heavy_embeddings)):
    # Concatenate the five embeddings and the scaled DAR value for ADC i
    full_emb = torch.cat([
        to_tensor_1d(heavy_embeddings[i]),
        to_tensor_1d(light_embeddings[i]),
        to_tensor_1d(antigen_embeddings[i]),
        to_tensor_1d(payload_embeddings[i]),
        to_tensor_1d(linker_embeddings[i]),
        to_tensor_1d(dar_scaled[i]),
    ])
    adc_embeddings.append(full_emb)

Let's look at the shape of a single concatenated ADC embedding, i.e., the feature vector for one ADC.

In [23]:
print(f"Generated {len(adc_embeddings)} ADC embeddings, each of shape {adc_embeddings[0].shape}. Final concatenated tensor shape: {full_emb.shape}")

Generated 435 ADC embeddings, each of shape torch.Size([1729]). Final concatenated tensor shape: torch.Size([1729])


Now, we can check the shape of the full batch tensor (i.e., all ADCs stacked), where the first dimension is the number of ADCs and the second is the embedding size.

In [24]:
adc_batch_tensor = torch.stack(adc_embeddings)
adc_batch_tensor.shape

torch.Size([435, 1729])

We will use the 100 nM label from the dataset as our target variable. The dataset also includes labels at 1 nM, 10 nM, and 1000 nM (the column names spell the unit as "nm"); any of them can be used instead by changing `label_col` below.

In [25]:
label_col = "label（100nm）"
labels = df[label_col].values  # 0 or 1

# Convert to PyTorch tensor
y = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)

### Constructing DeepChem Dataset

Now that we have our input features and labels prepared, let's convert them into DeepChem’s NumpyDataset format, which allows us to utilize DeepChem’s modeling and evaluation tools.

In [26]:
from deepchem.data.datasets import NumpyDataset
import numpy as np

X = adc_batch_tensor.numpy()  # Convert to numpy array
y = labels
y = y.astype(np.float32)  # Ensure labels are float32

dataset = NumpyDataset(X= X, y= y, ids= np.arange(len(y)))

print(f"Dataset created with {len(dataset)} samples.")
print(f"Features shape: {dataset.X.shape}")

Dataset created with 435 samples.
Features shape: (435, 1729)


Let’s print the dataset type to confirm it’s been correctly wrapped as a DeepChem NumpyDataset.

In [27]:
print(type(dataset))
print(dataset)

<class 'deepchem.data.datasets.NumpyDataset'>
<NumpyDataset X.shape: (435, 1729), y.shape: (435,), w.shape: (435,), ids: [0 1 2 ... 432 433 434], task_names: [0]>


Let's split our dataset into training, validation, and test sets using one of DeepChem's splitters. Here we use `RandomSplitter`. Users are encouraged to explore DeepChem's other splitters, such as ScaffoldSplitter and IndexSplitter, based on their requirements.

In [28]:
import deepchem as dc

# Creating a RandomSplitter object
splitter = dc.splits.RandomSplitter()

# Splitting dataset into train, validation, and test datasets
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(
    dataset, 
    frac_train=0.8,
    frac_valid=0.1,
    frac_test=0.1 
)

print(f"Train: {len(train_dataset)}, Valid: {len(valid_dataset)}, Test: {len(test_dataset)}")

Train: 348, Valid: 43, Test: 44
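The split sizes above follow from simple index cutoffs. The sketch below is not DeepChem's implementation; it is a standard-library illustration of a seeded random 80/10/10 split, and the function name `random_split_indices` is hypothetical. Fixing the seed makes the split reproducible across runs. To our knowledge DeepChem's split methods also accept a `seed` argument, but check the documentation of your installed version.

```python
import random


def random_split_indices(n, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle indices 0..n-1 with a fixed seed and cut them into three parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_cutoff = int(frac_train * n)
    valid_cutoff = int((frac_train + frac_valid) * n)
    return (indices[:train_cutoff],
            indices[train_cutoff:valid_cutoff],
            indices[valid_cutoff:])


train_idx, valid_idx, test_idx = random_split_indices(435)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because the cutoffs are truncated to integers, 435 samples yield 348, 43, and 44 examples, the same sizes reported above.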


#### Defining MLP (Multi-Layer Perceptron)

Before we move into model training, it’s important to understand the architecture of the MLP (Multi-Layer Perceptron). An MLP is a type of feedforward neural network consisting of an input layer, one or more hidden layers, and an output layer. Each neuron in a layer is fully connected to every neuron in the next, allowing the model to learn complex, non-linear patterns from the data. Two further hyperparameters are the dropout rate and the activation function. Dropout randomly disables a fraction of neurons during each training iteration, which prevents the network from relying too heavily on any one neuron and helps it learn more robust features. The activation function is the non-linearity applied in the hidden layers of the model. <br>

In our setup, the input layer has dimension 1729, the size of the combined feature vector. It is followed by two hidden layers (1024 and 256 units) that help the model extract deeper hierarchical features, and finally an output layer with a single unit that produces the prediction. We use a dropout rate of 0.2 and the ReLU activation function.
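To make these pieces concrete, here is a minimal NumPy sketch of a forward pass through an MLP with two hidden layers, ReLU activations, dropout, and a sigmoid output. It is not DeepChem's `MultilayerPerceptron`: the layer sizes are shrunk so it runs instantly, the weights are random, and the exact placement of dropout is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dropout(x, rate, training):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


# Toy sizes; the tutorial itself uses 1729 -> 1024 -> 256 -> 1.
d_input, d_hidden1, d_hidden2 = 16, 8, 4
W1 = rng.normal(scale=0.1, size=(d_input, d_hidden1))
b1 = np.zeros(d_hidden1)
W2 = rng.normal(scale=0.1, size=(d_hidden1, d_hidden2))
b2 = np.zeros(d_hidden2)
W3 = rng.normal(scale=0.1, size=(d_hidden2, 1))
b3 = np.zeros(1)


def forward(x, training=False, rate=0.2):
    h = dropout(relu(x @ W1 + b1), rate, training)
    h = dropout(relu(h @ W2 + b2), rate, training)
    return sigmoid(h @ W3 + b3)  # probability of the positive class


batch = rng.normal(size=(5, d_input))
probs = forward(batch)  # evaluation mode: dropout is disabled
print(probs.shape)
```

In evaluation mode the output is deterministic, while calling `forward(batch, training=True)` gives slightly different values on each call because a different random subset of units is dropped.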

### Training

Let’s define a custom Torch model using DeepChem’s TorchModel wrapper. We'll use a simple feedforward network with two hidden layers and a sigmoid output for binary classification.

In [29]:
ip_dim = adc_batch_tensor.shape[1]
print(f"Input dimension: {ip_dim}")

Input dimension: 1729


In [30]:
import torch
import torch.nn as nn
from deepchem.models.torch_models.torch_model import TorchModel
from deepchem.models.torch_models.layers import MultilayerPerceptron
from deepchem.models.losses import BinaryCrossEntropy

class SigmoidMLP(nn.Module):
    def __init__(self, d_input, d_output, d_hidden, dropout, activation_fn):
        super().__init__()
        self.mlp = MultilayerPerceptron(
            d_input=d_input,
            d_output=d_output,
            d_hidden=d_hidden,
            dropout=dropout,
            activation_fn=activation_fn
        )
        
    def forward(self, x):
        logits = self.mlp(x)
        return torch.sigmoid(logits)

class ADCNetModel(TorchModel):
    def __init__(self, input_dim= ip_dim, output_dim=1, hidden_dims=(1024, 256), dropout=0.2, activation_fn='relu', **kwargs):

        model = SigmoidMLP(
            d_input=input_dim,
            d_output=output_dim,
            d_hidden=hidden_dims,
            dropout=dropout,
            activation_fn=activation_fn
        )
        
        super(ADCNetModel, self).__init__(model=model, loss=BinaryCrossEntropy(), **kwargs)

model = ADCNetModel()
loss = model.fit(train_dataset, nb_epoch=1000)
print(f"Training loss: {loss}")

Training loss: 0.26054489135742187


### Evaluation

To assess the model's performance, we use DeepChem's built-in evaluation metrics. We'll calculate ROC AUC and Accuracy, and you can explore other metrics in the deepchem.metrics module as well.

In [31]:
classification_metrics = [
    dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean),  
    dc.metrics.Metric(dc.metrics.accuracy_score, np.mean),  
]

scores = model.evaluate(test_dataset, classification_metrics)
for metric_name, score in scores.items():
    print(f"{metric_name}: {score}")

mean-roc_auc_score: 0.7603485838779956
mean-accuracy_score: 0.7045454545454546


Now, let's compare our predicted values with the actual labels to evaluate how well the model is performing and gain insights into the quality of its predictions.

In [32]:
predictions = model.predict(test_dataset)
val_predictions = model.predict(valid_dataset)
ground_truth = test_dataset.y
val_ground_truth = valid_dataset.y

print("Predictions:", predictions.squeeze().tolist())
print("Ground Truth:", ground_truth.squeeze().tolist())

Predictions: [1.0, 0.5, 0.9999895095825195, 0.9968763589859009, 0.6089296340942383, 1.0, 0.9995952248573303, 1.0, 0.5, 0.5, 0.5, 0.9470304250717163, 0.5, 0.9944315552711487, 0.5, 0.5, 0.9991081357002258, 0.5, 0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 0.5, 0.9999953508377075, 0.5, 1.0, 0.9999958276748657, 0.9999346733093262, 0.5335736274719238, 0.987384557723999, 1.0, 0.5, 1.0, 0.5, 0.9999969005584717, 0.99298495054245, 0.5, 0.5, 0.9999997615814209, 0.5, 0.5, 1.0]
Ground Truth: [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]


In [33]:
val_outputs = model.predict(valid_dataset)
y_val = valid_dataset.y  # Ground truth

# Convert to binary predictions (equivalent to torch.round)
predicted_labels = np.round(val_outputs.squeeze()).astype(int)  # Round to 0 or 1
true_labels = y_val.squeeze().astype(int)

# Calculate accuracy
correct = (predicted_labels == true_labels).sum()
total = len(true_labels)

accuracy = correct / total
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 86.05%
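The cell above converts probabilities to labels with `np.round`, which rounds exact halves to the nearest even integer. Several of the predictions printed earlier are exactly 0.5, so they are assigned to class 0. The sketch below contrasts this with an explicit `>= 0.5` threshold on made-up scores; which convention is preferable is a modelling choice.

```python
import numpy as np

scores = np.array([0.5, 0.5000001, 0.4999999, 1.0])  # toy predicted probabilities

rounded = np.round(scores).astype(int)     # half-to-even: 0.5 -> 0
thresholded = (scores >= 0.5).astype(int)  # explicit threshold: 0.5 -> 1

print(rounded.tolist(), thresholded.tolist())
```

The two rules differ only for scores of exactly 0.5, but that case is common in this notebook, so reported accuracy can change depending on which rule is applied.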


Here, we have successfully built a complete pipeline using DeepChem in this notebook. This process involved preparing ADC sequence data, generating embeddings, training a custom Torch model, and evaluating its performance.

Congratulations, here we conclude our first step toward applying deep learning for ADC property prediction!

## References <a name="references"></a>

<a name="1"></a> [1] Chen, L., Li, B., Chen, Y., Lin, M., Zhang, S., Li, C., Pang, Y., & Wang, L. (2024). ADCNet: A unified framework for predicting the activity of antibody‑drug conjugates. arXiv:2401.09176.

<a name="2"></a> [2] DeepChem (n.d.). Introduction to Antibody-Drug Conjugates.

<a name="3"></a> [3] Biopharma PEG. “The History Of ADC Drugs Development.” BiochemPEG.

<a name="4"></a> [4] Beck, A., Goetsch, L., Dumontet, C. et al. Strategies and challenges for the next generation of antibody–drug conjugates. Nat Rev Drug Discov 16, 315–337 (2017). https://doi.org/10.1038/nrd.2016.268

<a name="5"></a> [5] Shen, L. T., Sun, X. N., Chen, Z., Guo, Y., Shen, Z. Y., Song, Y., Xin, W. X., Ding, H. Y., Ma, X. Y., Xu, W. B., Zhou, W. Y., Che, J. X., Tan, L. L., Chen, L. S., Chen, S. Q., Dong, X. W., Fang, L., & Zhu, F. (2024). ADCdb: the database of antibody‑drug conjugates. Nucleic Acids Research, 52(D1), D1097–D1109. PMID 37831118.

<a name="6"></a> [6] idrugLab (2024). ADCNet: a unified framework for predicting the activity of antibody‑drug conjugates. GitHub repository.

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Discord
The DeepChem [Discord](https://discord.gg/5d5bEVSt) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

# Citing this tutorial
If you found this tutorial useful please consider citing it using the provided BibTeX.

```
@manual{Molecular Machine Learning,
 title={Introduction to ADCNet: Predicting ADC Activity with DeepChem},
 organization={DeepChem},
 author={Patra, Sonali Lipsa and Singh, Rakshit Kr. and Bisoi, Ankita and Ramsundar, Bharath},
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/ADCNet.ipynb}},
 year={2025},
}
```