## Introduction

This notebook illustrates how to use DeepFrag as an API. For command-line use, see the examples in `./tests/run_tests.sh`.

The notebook doesn't describe how to train a new DeepFrag model from scratch. Instead, it describes how to use an already trained checkpoint for inference, and how to finetune that checkpoint.

The `./tests/run_tests.sh` script generates sample checkpoint files. Note that these checkpoints are not well trained and should not be used in production. The checkpoint files are not included in the git repo, but this notebook assumes they exist. Be sure to run `./tests/run_tests.sh` before using this notebook, or change the checkpoint paths to your own files.
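
Because the notebook assumes these files exist, a quick pre-flight check can save a confusing failure later. The sketch below is not part of DeepFrag: `missing_paths` is a hypothetical helper, and the two paths listed are simply the ones this notebook uses further down.

```python
import os


def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Files referenced later in this notebook.
required = [
    "./tests/1.train_on_moad.output/last.ckpt",
    "./tests/1.train_on_moad.output/model_mean_mean_train.pt",
]

missing = missing_paths(required)
if missing:
    print("Missing files (run ./tests/run_tests.sh first):", missing)
```

If you use your own checkpoints, replace the entries in `required` with your own paths.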


In [1]:
# Start by importing the needed libraries

import torch
import pytorch_lightning as pl
import os
from apps.deepfrag.run import DeepFrag
from argparse import Namespace

# Confirm the torch and lightning versions
print(torch.__version__)
print(pl.__version__)

INFO: Note: NumExpr detected 36 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.


collagen.GraphMol requires torch_geometric!
1.11.0
1.5.10


In [2]:
# Make a working directory if needed

os.makedirs("tmp_working_dir", exist_ok=True)

In [3]:
# Create the DeepFrag model

model = DeepFrag()

INFO: ProDy is configured: verbosity='none'


## Inference on a single protein/ligand complex

In [4]:
# Define the DeepFrag parameters. These are the same parameters available through the command line.

args = Namespace(
    mode="inference",
    receptor="./tests/data_for_inference/5VUP_prot_955.pdb",
    branch_atm_loc_xyz="12.413000, 3.755000, 59.021999",
    ligand="./tests/data_for_inference/5VUP_lig_955.sdf",
    load_checkpoint="./tests/1.train_on_moad.output/last.ckpt",
    default_root_dir="./tmp_working_dir/",
    rotations=2,
    inference_label_sets="./tests/data_for_inference/label_set.smi",
    num_inference_predictions=10,
    fragment_representation="rdk10",
    split_seed=-1,
    
    # Boilerplate
    log_every_n_steps=25,
    num_dataloader_workers=32,
    cache_pdbs_to_disk=True,
    learning_rate=0.0001,
    gpus=1,
    cpu=False,
    verbose=True,
    aggregation_rotations="mean"
)

In [5]:
# Set up the fingerprint scheme. This updates the args depending on whether you're using
# rdk10 or molbert fingerprints.

model.setup_fingerprint_scheme(args)

In [6]:
# Load the checkpoint

ckpt = model.load_checkpoint(args, validate_args=False)

Restoring from checkpoint: ./tests/1.train_on_moad.output/last.ckpt


### Using the `run_inference()` helper function

In [7]:
# DeepFrag includes a helper function that runs inference, calculates the
# average output vector over multiple rotations, and selects the top-K most
# similar fragments.

result = model.run_inference(args, ckpt)

Using the operator mean to aggregate the inferences.


DEBUG: 7165 atoms and 1 coordinate set(s) were parsed in 0.08s.
INFO: Created a temporary directory at /tmp/tmpxai744w3
INFO: Writing /tmp/tmpxai744w3/_remote_module_non_sriptable.py


Using checkpoint ./tests/1.train_on_moad.output/last.ckpt

Loading model from checkpoint ./tests/1.train_on_moad.output/last.ckpt

    Rotation #1
    Rotation #2




Label set size: 556


Most Similar Matches:   0%|          | 0/1 [00:00<?, ?it/s]

In [8]:
# The most similar fragments from your label set

result["most_similar"]

[['*CC', 0.6488900184631348],
 ['*CCN', 0.6217541098594666],
 ['*C(C)N', 0.5775389075279236],
 ['*CC#C', 0.5486630797386169],
 ['*CC=C', 0.5315735936164856],
 ['*CCC', 0.5299135446548462],
 ['*CCCN', 0.5287677049636841],
 ['*C(C)C', 0.5200974345207214],
 ['*CCO', 0.5113264918327332],
 ['*C(C)O', 0.4969644546508789]]

In [9]:
# The DeepFrag output fingerprints, per rotation

print("First-rotation fingerprint:", result["fps"]["per_rot"][0])
print("")
print("Total fingerprints:", len(result["fps"]["per_rot"]))

First-rotation fingerprint: tensor([[9.8529e-04, 1.1870e-04, 1.1531e-03,  ..., 3.9976e-05, 1.3216e-04,
         8.1263e-01]], grad_fn=<SigmoidBackward0>)

Total fingerprints: 2


In [10]:
# The average fingerprint over all rotations

print("Average fingerprint:", result["fps"]["avg"])

Average fingerprint: tensor([[5.8259e-04, 6.6963e-05, 7.0159e-04,  ..., 2.2288e-05, 7.3835e-05,
         8.2951e-01]], grad_fn=<MeanBackward1>)
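
With `aggregation_rotations="mean"`, the average fingerprint is the element-wise mean of the per-rotation fingerprints. The sketch below shows only that arithmetic in plain Python. The two short vectors are made-up values, not model output, and `mean_fingerprint` is a hypothetical name; DeepFrag itself works on torch tensors.

```python
def mean_fingerprint(per_rotation):
    """Element-wise mean of equally sized fingerprint vectors."""
    if not per_rotation:
        raise ValueError("need at least one fingerprint")
    n = len(per_rotation)
    return [sum(values) / n for values in zip(*per_rotation)]


# Two made-up 4-element fingerprints, one per rotation.
rot1 = [0.0, 0.2, 0.4, 1.0]
rot2 = [0.2, 0.2, 0.0, 0.5]

avg = mean_fingerprint([rot1, rot2])
print(avg)
```

Averaging over more rotations should make the prediction less sensitive to how the complex happens to be oriented in the voxel grid, at the cost of one forward pass per rotation.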


### Without the `run_inference()` helper function

In [11]:
# You don't have to use the helper function if you need
# more control over each step. Here is an example.

# First, load the receptor and ligand

import prody
from rdkit import Chem
from collagen.core.molecules.mol import Mol
from io import StringIO
import numpy as np

# Load the receptor
with open(args.receptor, "r") as f:
    m = prody.parsePDBStream(StringIO(f.read()), model=1)
prody_mol = m.select("all")
recep = Mol.from_prody(prody_mol)

# Load the ligand
suppl = Chem.SDMolSupplier(str(args.ligand))
rdmol = [x for x in suppl if x is not None][0]
lig = Mol.from_rdkit(rdmol, strict=False)

# Parse the branch-atom location, which is used as the center of the voxel grid.
center = np.array([float(v.strip()) for v in args.branch_atm_loc_xyz.split(",")])

DEBUG: 7165 atoms and 1 coordinate set(s) were parsed in 0.07s.
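
The `branch_atm_loc_xyz` argument is a comma-separated string, so a malformed value (for example, only two numbers) would otherwise surface later as a shape error. The sketch below parses the same string with an explicit check. `parse_xyz` is a hypothetical helper, not a DeepFrag function.

```python
def parse_xyz(text):
    """Parse a string such as 'x, y, z' into a tuple of three floats."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated values, got {len(parts)}")
    return tuple(float(p) for p in parts)


# Same value as branch_atm_loc_xyz in the args above.
center_xyz = parse_xyz("12.413000, 3.755000, 59.021999")
print(center_xyz)
```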


In [12]:
# Voxelize the receptor and ligand

from collagen.util import rand_rot

# Get a random rotation
rot = rand_rot()

# Get the parameters to use when voxelizing the protein and ligand.
voxel_params = model.init_voxel_params(args)

# Voxelize the receptor
recep_vox = recep.voxelize(
    voxel_params, cpu=True, center=center, rot=rot
)

# Voxelize the ligand
lig_vox = lig.voxelize(voxel_params, cpu=True, center=center, rot=rot)

# Stack the receptor and ligand tensors
num_features = recep_vox.shape[1] + lig_vox.shape[1]
dimen1 = lig_vox.shape[2]
dimen2 = lig_vox.shape[3]
dimen3 = lig_vox.shape[4]
vox = torch.cat([recep_vox[0], lig_vox[0]]).reshape(
    [1, num_features, dimen1, dimen2, dimen3]
)
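
The stacking step above joins the receptor and ligand grids along the channel axis. The sketch below reproduces that shape logic with NumPy and made-up sizes (2 receptor channels, 3 ligand channels, a 4x4x4 grid), which are assumptions for illustration and not DeepFrag's real dimensions. It also shows that, for a batch of one, concatenating on axis 1 gives the same array as concatenating the unbatched grids and restoring the batch axis.

```python
import numpy as np

# Made-up shapes: (batch, channels, x, y, z).
recep_vox = np.zeros((1, 2, 4, 4, 4), dtype=np.float32)
lig_vox = np.ones((1, 3, 4, 4, 4), dtype=np.float32)

# Concatenate along the channel axis, keeping the batch axis.
vox = np.concatenate([recep_vox, lig_vox], axis=1)

# Equivalent for a batch of one: join the unbatched grids on their
# first (channel) axis, then add the batch axis back.
vox_alt = np.concatenate([recep_vox[0], lig_vox[0]])[np.newaxis, ...]

print(vox.shape)
```

Receptor channels come first and ligand channels second, so the order of the two arrays matters.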

In [13]:
# Initialize the model

# Set the device (e.g., cuda)
device = model.init_device(args)

model_initialized = model.init_model(args, ckpt)

model_initialized.eval()


Loading model from checkpoint ./tests/1.train_on_moad.output/last.ckpt



DeepFragModel(
  (encoder): Sequential(
    (0): BatchNorm3d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): Conv3d(10, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1))
    (2): ReLU()
    (3): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1))
    (4): ReLU()
    (5): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1))
    (6): ReLU()
    (7): MaxPool3d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (8): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1))
    (10): ReLU()
    (11): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1))
    (12): ReLU()
    (13): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1))
    (14): ReLU()
    (15): Aggregate3x3Patches(output_size=(1, 1, 1))
    (16): Flatten(start_dim=1, end_dim=-1)
    (17): Dropout(p=0.5, inplace=False)
    (18): Linear(in_features=64, out_features=512, bias=True)
    (19):

In [14]:
# Predict the appropriate fingerprint given these voxel inputs.

fp = model_initialized(vox)
fp

tensor([[4.9241e-04, 5.0444e-05, 5.8798e-04,  ..., 1.5905e-05, 5.2405e-05,
         8.2096e-01]], grad_fn=<SigmoidBackward0>)

In [15]:
# To find the closest fragments to the output fingerprint, we must
# prepare our inference label set. There's a helper function for that:

(
    label_set_fingerprints,
    label_set_entry_infos,
) = model.create_inference_label_set(
    args,
    device,
    [path.strip() for path in args.inference_label_sets.split(",")],
)

Label set size: 556


In [16]:
# Now use another helper function to find the most similar matches.

from collagen.metrics.metrics import most_similar_matches

most_similar = most_similar_matches(
    fp,
    label_set_fingerprints,
    label_set_entry_infos,
    args.num_inference_predictions,
)

most_similar

Most Similar Matches:   0%|          | 0/1 [00:00<?, ?it/s]

[[['*CC', 0.64890056848526],
  ['*CCN', 0.6202546954154968],
  ['*C(C)N', 0.5759989619255066],
  ['*CC#C', 0.5487274527549744],
  ['*CC=C', 0.5310753583908081],
  ['*CCC', 0.5292601585388184],
  ['*CCCN', 0.5268192291259766],
  ['*C(C)C', 0.5198426246643066],
  ['*CCO', 0.5128037929534912],
  ['*C(C)O', 0.49837005138397217]]]
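
Conceptually, this last step scores every label-set fingerprint against the predicted fingerprint and keeps the best few. The sketch below illustrates that ranking in plain Python. It assumes cosine similarity as the metric, which this notebook does not confirm for `most_similar_matches`. The function names and the three-element fingerprints are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(query, label_set, k):
    """Return the k (smiles, score) pairs most similar to `query`."""
    scored = [(smi, cosine_similarity(query, fp)) for smi, fp in label_set.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up fingerprints keyed by fragment SMILES.
label_set = {
    "*CC": [1.0, 0.0, 1.0],
    "*CCN": [1.0, 1.0, 0.0],
    "*CCO": [0.0, 1.0, 0.0],
}
query = [0.9, 0.1, 0.8]

best = top_k_matches(query, label_set, k=2)
print(best)
```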

## Finetuning

When you train a DeepFrag model from scratch (using the CLI), it produces not only a `ckpt` file, but also a `pt` file. The `pt` file is useful for finetuning a DeepFrag model.

Note that finetuning is also called "warm starting."

In [21]:
# We must define new arguments for this new task.

args = Namespace(
    mode="warm_starting",
    model_for_warm_starting="./tests/1.train_on_moad.output/model_mean_mean_train.pt",    
    default_root_dir="./tmp_working_dir/",
    max_epochs=5,
    fragment_representation="rdk10",
    butina_cluster_cutoff=0.4,
    split_seed=-1,
    save_params="./tmp_working_dir/params.saved.json",
    save_splits="./tmp_working_dir/splits.saved.json",
    
    # Directory contains protein/ligands named like 1XDN_prot_123.pdb, 1XDN_lig_123.sdf
    data_dir="./tests/data_to_finetune/",
    
    # In this case, let's have no validation set.
    fraction_val=0.0,
    fraction_train=0.8,
    
    # Boilerplate
    log_every_n_steps=25,
    num_dataloader_workers=32,
    cache_pdbs_to_disk=True,
    learning_rate=0.0001,
    gpus=1,
    cpu=False,
    verbose=True,
    aggregation_rotations="mean"
)

In [22]:
model.run_warm_starting(args)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


AttributeError: 'Namespace' object has no attribute 'load_splits'
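
The error above shows that `run_warm_starting()` reads `args.load_splits`, which the hand-built `Namespace` does not define. When you build the arguments yourself instead of going through the CLI parser, one option is to fill in any missing attributes with defaults. The sketch below shows that pattern. `with_defaults` is a hypothetical helper, and `None` as the default for `load_splits` is an assumption (presumably "do not load saved splits"), so check the CLI's actual defaults before relying on it.

```python
from argparse import Namespace


def with_defaults(args, **defaults):
    """Return a copy of `args` with missing attributes filled from `defaults`."""
    merged = dict(defaults)
    merged.update(vars(args))  # values already set on args take priority
    return Namespace(**merged)


# Trimmed-down stand-in for the warm-starting args above.
args = Namespace(mode="warm_starting", max_epochs=5)

# ASSUMPTION: None is an acceptable default for load_splits.
args = with_defaults(args, load_splits=None)
print(args)
```

Other attributes may be missing as well, so the same call can take several defaults at once.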

In [None]:
# Initialize the trainer

trainer = model.init_trainer(args)
