# Tutorial Part 13: Modeling Protein-Ligand Interactions
By [Nathan C. Frey](https://ncfrey.github.io/) | [Twitter](https://twitter.com/nc_frey) and [Bharath Ramsundar](https://rbharath.github.io/) | [Twitter](https://twitter.com/rbhar90)

In this tutorial, we'll walk you through the use of machine learning and molecular docking methods to predict the binding energy of a protein-ligand complex. Recall that a ligand is a small molecule that interacts (usually non-covalently) with a protein. Molecular docking performs geometric calculations to find a “binding pose” in which the small molecule interacts with the protein in a suitable binding pocket (that is, a region on the protein which has a groove in which the small molecule can rest).

The structure of proteins can be determined experimentally with techniques like Cryo-EM or X-ray crystallography. This can be a powerful tool for structure-based drug discovery. For more info on docking, read the [AutoDock Vina paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041641/) and the [`deepchem.dock`](https://deepchem.readthedocs.io/en/latest/docking.html) documentation. There are many graphical user and command line interfaces (like AutoDock) for performing molecular docking. Here, we show how docking can be performed programmatically with DeepChem, which enables automation and easy integration with machine learning pipelines.

As you work through the tutorial, you'll trace an arc including:
1. Loading a protein-ligand complex dataset ([PDBbind](http://www.pdbbind.org.cn/)) 
2. Performing programmatic molecular docking
3. Featurizing protein-ligand complexes with interaction fingerprints
4. Fitting a random forest model and predicting binding affinities

To start the tutorial, we'll use a simple pre-processed dataset file that comes in the form of a gzipped file. Each row is a molecular system, and each column represents a different piece of information about that system. For instance, in this example, every row reflects a protein-ligand complex, and the following columns are present: a unique complex identifier; the SMILES string of the ligand; the binding affinity (Ki) of the ligand to the protein in the complex; a Python `list` of all lines in a PDB file for the protein alone; and a Python `list` of all lines in a ligand file for the ligand alone.

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/13_Modeling_Protein_Ligand_Interactions.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=False)

Mounted at /content/drive


In [None]:
%cd /content/drive/My Drive/Colab Notebooks/DeepChem

/content/drive/My Drive/Colab Notebooks/DeepChem


In [None]:
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3490  100  3490    0     0  25289      0 --:--:-- --:--:-- --:--:-- 25289


add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit, openmm, pdbfixer
added omnia to channels
added conda-forge to channels
done
conda packages installation finished!


# conda environments:
#
base                  *  /root/miniconda



In [None]:
!pip install --pre deepchem
import deepchem
deepchem.__version__

Collecting deepchem
  Downloading https://files.pythonhosted.org/packages/bb/7e/caefb672f20933ca6966b7976a88eebe55cba489656121eadcb8723c48fe/deepchem-2.4.0-py3-none-any.whl (531kB)

'2.4.0'

### Protein-ligand complex data
It is really helpful to visualize proteins and ligands when doing docking. Unfortunately, Google Colab doesn't currently support the Jupyter widgets we need to do that visualization. Install [`MDTraj`](https://github.com/mdtraj/mdtraj) and [`nglview`](https://github.com/nglviewer/nglview) on your local machine to view the protein-ligand complexes we're working with.

In [None]:
# !pip install -q mdtraj nglview
# !jupyter-nbextension enable nglview --py --sys-prefix  # for jupyter notebook
# !jupyter labextension install  nglview-js-widgets  # for jupyter lab

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
  Building wheel for mdtraj (PEP 517) ... done


In [None]:
import os
import numpy as np
import pandas as pd

import tempfile

from rdkit import Chem
from rdkit.Chem import AllChem
import deepchem as dc

from deepchem.utils import download_url, load_from_disk

To illustrate the docking procedure, here we'll use a csv that contains SMILES strings of ligands as well as PDB files for the ligand and protein targets from PDBbind. Later, we'll use the labels to train a model to predict binding affinities. We'll also show how to download and featurize PDBbind to train a model from scratch.

In [None]:
data_dir = dc.utils.get_data_dir()
dataset_file = os.path.join(data_dir, "pdbbind_core_df.csv.gz")

if not os.path.exists(dataset_file):
    print('File does not exist. Downloading file...')
    download_url("https://s3-us-west-1.amazonaws.com/deepchem.io/datasets/pdbbind_core_df.csv.gz")
    print('File downloaded...')

raw_dataset = load_from_disk(dataset_file)

Let's see what `raw_dataset` looks like:

In [None]:
raw_dataset.head(2)

Unnamed: 0,pdb_id,smiles,complex_id,protein_pdb,ligand_pdb,ligand_mol2,label
0,2d3u,CC1CCCCC1S(O)(O)NC1CC(C2CCC(CN)CC2)SC1C(O)O,2d3uCC1CCCCC1S(O)(O)NC1CC(C2CCC(CN)CC2)SC1C(O)O,"['HEADER 2D3U PROTEIN\n', 'COMPND 2D3U P...","['COMPND 2d3u ligand \n', 'AUTHOR GENERA...","['### \n', '### Created by X-TOOL on Thu Aug 2...",6.92
1,3cyx,CC(C)(C)NC(O)C1CC2CCCCC2C[NH+]1CC(O)C(CC1CCCCC...,3cyxCC(C)(C)NC(O)C1CC2CCCCC2C[NH+]1CC(O)C(CC1C...,"['HEADER 3CYX PROTEIN\n', 'COMPND 3CYX P...","['COMPND 3cyx ligand \n', 'AUTHOR GENERA...","['### \n', '### Created by X-TOOL on Thu Aug 2...",8.0
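
The `protein_pdb`, `ligand_pdb`, and `ligand_mol2` columns hold whole files as text. As a minimal sketch, assuming each cell is the string form of a Python list of lines (which is how the preview above reads), the standard library's `ast.literal_eval` can turn a cell back into a real list. The `cell` value here is a made-up stand-in, not a row from the dataset.

```python
import ast

# Made-up stand-in for one cell of the `ligand_pdb` column:
# the text of a Python list whose items are lines of a file.
cell = "['COMPND    toy ligand\\n', 'END\\n']"

# literal_eval safely parses Python literals (lists, strings, numbers)
# without executing arbitrary code.
pdb_lines = ast.literal_eval(cell)

# Joining the lines gives back the text of the original file.
pdb_text = "".join(pdb_lines)
print(pdb_text)
```

With the real dataframe, the same call could be applied to a cell such as `raw_dataset.iloc[0]["ligand_pdb"]`, provided the cell really is a valid list literal.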


### Complex visualization
We'll use the `MDTraj` library to convert the entries in our dataframe into `pdb` files and to manipulate both ligand and protein objects. If you're outside of Colab, you can expand these cells and use `MDTraj` and `nglview` to visualize proteins and ligands.

The following convenience function parses the ligand and protein representations above into `MDTraj` trajectories.

In [None]:
import mdtraj as md  # requires the optional mdtraj install from the cell above

def convert_lines_to_mdtraj(molecule_lines):
  # the dataframe stores each file as the string form of a list of lines; split it back into lines
  molecule_lines = molecule_lines.strip('[').strip(']').replace("'","").replace("\\n", "").split(", ")
  tempdir = tempfile.mkdtemp()
  molecule_file = os.path.join(tempdir, "molecule.pdb")
  with open(molecule_file, "w") as f:
    for line in molecule_lines:
      f.write("%s\n" % line)
  molecule_mdtraj = md.load(molecule_file)
  return molecule_mdtraj

Let's take a look at the first protein ligand pair in our dataset:

In [None]:
first_protein, first_ligand = raw_dataset.iloc[0]["protein_pdb"], raw_dataset.iloc[0]["ligand_pdb"]
protein_mdtraj = convert_lines_to_mdtraj(first_protein)
ligand_mdtraj = convert_lines_to_mdtraj(first_ligand)

We'll use the convenience function `nglview.show_mdtraj` in order to view our proteins and ligands. Note that this will only work if you uncommented the above cell, installed nglview, and enabled the necessary notebook extensions.

In [None]:
import nglview  # requires the optional nglview install from the cell above

v = nglview.show_mdtraj(ligand_mdtraj)
v

Now that we have an idea of what the ligand looks like, let's take a look at our protein:

In [None]:
view = nglview.show_mdtraj(protein_mdtraj)
view

Can we view the complex with both protein and ligand? Yes, but we'll need the following helper function to join the two mdtraj files for the protein and ligand.

In [None]:
def combine_mdtraj(protein, ligand):
  chain = protein.topology.add_chain()
  residue = protein.topology.add_residue("LIG", chain, resSeq=1)
  for atom in ligand.topology.atoms:
      protein.topology.add_atom(atom.name, atom.element, residue)
  protein.xyz = np.hstack([protein.xyz, ligand.xyz])
  protein.topology.create_standard_bonds()
  return protein
complex_mdtraj = combine_mdtraj(protein_mdtraj, ligand_mdtraj)

Let's now visualize our complex. We can see that the ligand slots into a groove on the outer edge of the protein.

In [None]:
v = nglview.show_mdtraj(complex_mdtraj)
v

### Fixing PDB files
Next, let's get some PDB protein files for docking. We'll use the PDB IDs from our `raw_dataset` and download the pdb files directly from the [Protein Data Bank](https://www.rcsb.org/) using [`pdbfixer`](https://github.com/openmm/pdbfixer). We'll also sanitize the structures with [RDKit](https://www.rdkit.org/). This ensures that any problems with the protein and ligand files (non-standard residues, chemical validity, etc.) are corrected. Feel free to modify these cells and `pdbids` to consider new protein-ligand complexes. We note here that PDB files are complex and human judgement is required to prepare protein structures for docking. DeepChem includes a number of [docking utilities](https://deepchem.readthedocs.io/en/latest/api_reference/utils.html#docking-utilities) to assist you with preparing protein files, but results should be inspected before docking is attempted.

In [None]:
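try:
  from openmm.app import PDBFile  # OpenMM 7.6 and later
except ImportError:
  from simtk.openmm.app import PDBFile  # older OpenMM releases
from pdbfixer import PDBFixer

from deepchem.utils.vina_utils import prepare_inputs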

In [None]:
num_complexes = 5 # increase to consider more datapoints
pdbids = raw_dataset['pdb_id'].values[:num_complexes]
ligand_smiles = raw_dataset['smiles'].values[:num_complexes]

In [None]:
%%time
for (pdbid, ligand) in zip(pdbids, ligand_smiles):
  fixer = PDBFixer(url='https://files.rcsb.org/download/%s.pdb' % (pdbid))
  # use a context manager so the file is flushed and closed before it is read back
  with open('%s.pdb' % (pdbid), 'w') as f:
    PDBFile.writeFile(fixer.topology, fixer.positions, f)

  p, m = None, None
  # fix protein, optimize ligand geometry, and sanitize molecules
  try:
    p, m = prepare_inputs('%s.pdb' % (pdbid), ligand)
  except Exception:  # skip structures that fail, but let keyboard interrupts through
    print('%s failed PDB fixing' % (pdbid))

  if p and m:  # protein and molecule are readable by RDKit
    print(pdbid, p.GetNumAtoms())
    Chem.rdmolfiles.MolToPDBFile(p, '%s.pdb' % (pdbid))
    Chem.rdmolfiles.MolToPDBFile(m, 'ligand_%s.pdb' % (pdbid))

2d3u 8688
3cyx 1415
3uo4 2187
1p1q 5888
3ag9 5319
CPU times: user 48.8 s, sys: 838 ms, total: 49.6 s
Wall time: 48.8 s
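
Since prepared structures should be inspected before docking, a quick first check is to count the record types in a PDB file. The sketch below relies only on the fixed-width PDB layout (record name in columns 1-6, residue name in columns 18-20). The helper name `summarize_pdb` and the sample lines are made up for illustration and are not part of DeepChem.

```python
from collections import Counter

def summarize_pdb(lines):
    """Count ATOM/HETATM records and tally HETATM residue names."""
    records = Counter()
    hetero_residues = Counter()
    for line in lines:
        record = line[:6].strip()          # columns 1-6: record name
        if record in ("ATOM", "HETATM"):
            records[record] += 1
            if record == "HETATM":
                # columns 18-20: residue name (e.g. HOH for water)
                hetero_residues[line[17:20].strip()] += 1
    return records, hetero_residues

# Made-up sample lines in PDB fixed-width format
sample = [
    "ATOM      1  N   PRO A   1      29.361  39.686   5.862  1.00 38.10           N",
    "ATOM      2  CA  PRO A   1      30.307  38.663   5.319  1.00 40.62           C",
    "HETATM 1500  O   HOH A 201      12.000  15.000   9.000  1.00 30.00           O",
    "END",
]
records, hetero_residues = summarize_pdb(sample)
print(dict(records), dict(hetero_residues))
```

To check one of the files written above, pass an open file handle instead of `sample`, for example `with open('3cyx.pdb') as f: summarize_pdb(f)`. Unexpected heteroatom residues or a surprisingly low atom count are signs that a structure deserves a closer look.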


### Molecular Docking

Ok, now that we've got our data and basic visualization tools up and running, let's see if we can use molecular docking to estimate the binding affinities between our protein ligand systems.

There are three steps to setting up a docking job, and you should experiment with different settings. The three things we need to specify are 1) how to identify binding pockets in the target protein; 2) how to generate poses (geometric configurations) of a ligand in a binding pocket; and 3) how to "score" a pose. Remember, our goal is to identify candidate ligands that strongly interact with a target protein, which is reflected by the score. 

DeepChem has a simple built-in method for identifying binding pockets in proteins. It is based on the [convex hull method](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4112621/pdf/1472-6807-14-18.pdf). The method works by creating a 3D polyhedron (convex hull) around a protein structure and identifying the surface atoms of the protein as the ones closest to the convex hull. Some biochemical properties are considered, so the method is not purely geometrical. It has the advantage of having a low computational cost and is good enough for our purposes.
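
To build some intuition for the geometric idea, here is a toy 2D analog: points that lie on the convex hull of a point cloud play the role of "surface" atoms, and the rest are "buried". This is only a sketch using Andrew's monotone chain algorithm on made-up coordinates. It is not how `ConvexHullPocketFinder` is implemented, which works in 3D and goes on to locate pockets.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # drop points that would make a clockwise (or straight) turn
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # the last point of each half is the first point of the other half
    return lower[:-1] + upper[:-1]

# Made-up 2D "atom" positions: four corners and two interior points
atoms_2d = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
hull = convex_hull(atoms_2d)
buried = [p for p in atoms_2d if p not in hull]
print("surface:", hull)
print("buried:", buried)
```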



In [None]:
proteins = [f for f in os.listdir('.') if not f.startswith('ligand') and f.endswith('.pdb')]
ligands = [f for f in os.listdir('.') if f.startswith('ligand') and f.endswith('.pdb')]

In [None]:
target_index = proteins.index('3cyx.pdb')  # get the smallest protein target
ligand_index = ligands.index('ligand_3cyx.pdb')

In [None]:
finder = dc.dock.binding_pocket.ConvexHullPocketFinder()
pockets = finder.find_pockets(proteins[target_index])
len(pockets)  # number of identified pockets

62

Pose generation is quite complex. Luckily, using DeepChem's pose generator will install the AutoDock Vina engine under the hood, allowing us to get up and running generating poses quickly.

In [None]:
vpg = dc.dock.pose_generation.VinaPoseGenerator()

We could specify a pose scoring function from `deepchem.dock.pose_scoring`, which includes things like repulsive and hydrophobic interactions and hydrogen bonding. Vina will take care of this, so instead we'll allow Vina to compute scores for poses. Note that you will need to use GPU acceleration on Colab for pose generation and docking. 

In [None]:
!mkdir -p vina_test

In [None]:
%%time
complexes, scores = vpg.generate_poses(molecular_complex=(proteins[target_index], ligands[ligand_index]),  # protein-ligand files for docking,
                                       out_dir='vina_test',
                                       generate_scores=True
                                      )

CPU times: user 9.15 s, sys: 440 ms, total: 9.59 s
Wall time: 3min 25s


In [None]:
scores

[-5.8, -5.5, -5.5, -5.2, -5.2, -5.2, -5.1, -5.1, -5.0]
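
Vina reports scores as estimated binding free energies in kcal/mol, so more negative values indicate stronger predicted binding. As a rough worked example, the relation ΔG = RT ln(Kd) converts a score into an approximate dissociation constant. The temperature of 298.15 K and the helper name `score_to_pkd` are assumptions for illustration, and docking scores are coarse estimates, so treat the result as an order-of-magnitude guide only.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)
T = 298.15          # assumed temperature in kelvin

def score_to_pkd(delta_g_kcal, temperature=T):
    """Convert a binding free energy in kcal/mol to -log10(Kd)."""
    # ΔG = RT ln(Kd)  =>  ln(Kd) = ΔG / (RT)
    ln_kd = delta_g_kcal / (R_KCAL * temperature)
    return -ln_kd / math.log(10)

# The pose scores printed above
vina_scores = [-5.8, -5.5, -5.5, -5.2, -5.2, -5.2, -5.1, -5.1, -5.0]
best = min(vina_scores)  # most negative score = best pose
pkd = score_to_pkd(best)
print("best score: %.1f kcal/mol, approximate pKd: %.2f" % (best, pkd))
```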

Now that we understand each piece of the process, we can put it all together using DeepChem's `Docker` class. Docker creates a generator that yields tuples of posed complexes and docking scores.

In [None]:
docker = dc.dock.docking.Docker(pose_generator=vpg)

In [None]:
%%time
posed_complex, score = next(docker.dock(molecular_complex=(proteins[target_index], ligands[ligand_index]),
                                          use_pose_generator_scores=True))

CPU times: user 9.45 s, sys: 428 ms, total: 9.88 s
Wall time: 3min 24s


### Modeling Binding Affinity

Docking is a useful, albeit coarse-grained tool for predicting protein-ligand binding affinities. However, it takes some time, especially for large-scale virtual screenings where we might be considering different protein targets and thousands of potential ligands. We might naturally ask then, can we train a machine learning model to predict docking scores? Let's try and find out!

We'll show how to download the PDBbind dataset. We can use the loader in MoleculeNet to get the 4852 protein-ligand complexes from the "refined" set or the entire "general" set in PDBbind. For simplicity, we'll stick with the roughly 190 complexes from the core-set file we loaded above to train our models.

Next, we'll need a way to transform our protein-ligand complexes into representations which can be used by learning algorithms. Ideally, we'd have neural protein-ligand complex fingerprints, but DeepChem doesn't yet have a good learned fingerprint of this sort. We do however have well-tuned manual featurizers that can help us with our challenge here.

We'll make use of two types of fingerprints in the rest of the tutorial, the `CircularFingerprint` and the `ContactCircularFingerprint`. DeepChem also has voxelizers and grid descriptors that convert a 3D volume containing an arrangement of atoms into a fingerprint. These featurizers are really useful for understanding protein-ligand complexes since they allow us to translate complexes into vectors that can be passed into a simple machine learning algorithm. First, we'll create circular fingerprints. These convert small molecules into a vector of fragments.
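
The key trick behind a fixed-length fingerprint is hashing each fragment to an index in a bit vector of a chosen size. The toy sketch below shows only that hashing and folding step. It uses substrings of a SMILES string as stand-in "fragments", which is a deliberate simplification: real circular fingerprints enumerate atom neighborhoods on the molecular graph, and `toy_fingerprint` is a made-up helper, not a DeepChem or RDKit function.

```python
import zlib

def toy_fingerprint(smiles, size=128, max_len=3):
    """Hash every substring of up to max_len characters into a bit vector."""
    bits = [0] * size
    for n in range(1, max_len + 1):
        for i in range(len(smiles) - n + 1):
            fragment = smiles[i:i + n]
            # crc32 is deterministic across runs, unlike Python's built-in hash()
            index = zlib.crc32(fragment.encode("utf-8")) % size
            bits[index] = 1  # different fragments may collide on one bit
    return bits

fp_a = toy_fingerprint("CCO")  # ethanol
fp_b = toy_fingerprint("CCN")  # ethylamine
shared = sum(a & b for a, b in zip(fp_a, fp_b))
print("bits set:", sum(fp_a), sum(fp_b), "shared:", shared)
```

A smaller `size` gives a more compact vector at the cost of more collisions, which is the same trade-off made when choosing `size=128` for a real featurizer.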


In [None]:
pdbids = raw_dataset['pdb_id'].values
ligand_smiles = raw_dataset['smiles'].values

In [None]:
%%time
for (pdbid, ligand) in zip(pdbids, ligand_smiles):
  fixer = PDBFixer(url='https://files.rcsb.org/download/%s.pdb' % (pdbid))
  # use a context manager so the file is flushed and closed before it is read back
  with open('%s.pdb' % (pdbid), 'w') as f:
    PDBFile.writeFile(fixer.topology, fixer.positions, f)

  p, m = None, None
  # skip pdb fixing for speed
  try:
    p, m = prepare_inputs('%s.pdb' % (pdbid), ligand, replace_nonstandard_residues=False,
                          remove_heterogens=False, remove_water=False,
                          add_hydrogens=False)
  except Exception:  # skip structures that fail, but let keyboard interrupts through
    print('%s failed sanitization' % (pdbid))

  if p and m:  # protein and molecule are readable by RDKit
    Chem.rdmolfiles.MolToPDBFile(p, '%s.pdb' % (pdbid))
    Chem.rdmolfiles.MolToPDBFile(m, 'ligand_%s.pdb' % (pdbid))

1hfs failed sanitization
CPU times: user 2min 55s, sys: 1.33 s, total: 2min 56s
Wall time: 4min 23s


In [None]:
proteins = [f for f in os.listdir('.') if len(f) == 8 and f.endswith('.pdb')]
ligands = [f for f in os.listdir('.') if f.startswith('ligand') and f.endswith('.pdb')]

In [None]:
# Handle failed sanitizations
failures = set([f[:-4] for f in proteins]) - set([f[7:-4] for f in ligands])
for pdbid in failures:
  proteins.remove(pdbid + '.pdb')

In [None]:
len(proteins), len(ligands)

(190, 190)

In [None]:
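small_dataset = raw_dataset[raw_dataset['pdb_id'].isin([f[:-4] for f in proteins])]
# os.listdir returns files in arbitrary order, so rebuild the lists in dataframe order to keep features and labels aligned
pdbids = small_dataset['pdb_id'].tolist()
proteins = [pdbid + '.pdb' for pdbid in pdbids]
ligands = ['ligand_' + pdbid + '.pdb' for pdbid in pdbids]
labels = small_dataset['label'].values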

In [None]:
fp_featurizer = dc.feat.CircularFingerprint(size=128)

In [None]:
features = fp_featurizer.featurize([Chem.MolFromPDBFile(l) for l in ligands])

In [None]:
dataset = dc.data.NumpyDataset(X=features, y=labels, ids=pdbids)
train_dataset, test_dataset = dc.splits.RandomSplitter().train_test_split(dataset, seed=42)
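
Passing a fixed `seed` makes the split reproducible, which matters when comparing models trained on different features. The pure-Python sketch below shows the idea of a seeded random split on made-up ids. The 80/20 ratio and the helper name `random_split` are assumptions for illustration, not a description of `RandomSplitter` internals.

```python
import random

def random_split(ids, frac_train=0.8, seed=42):
    """Shuffle ids with a fixed seed, then cut into train and test parts."""
    shuffled = list(ids)
    # a private Random instance keeps the global random state untouched
    random.Random(seed).shuffle(shuffled)
    n_train = int(frac_train * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

toy_ids = ["complex_%d" % i for i in range(10)]
train_ids, test_ids = random_split(toy_ids)
print(len(train_ids), "train /", len(test_ids), "test")
```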

The convenience loader `dc.molnet.load_pdbbind` will take care of downloading and featurizing the pdbbind dataset under the hood for us. This will take quite a bit of time and compute, so the code to do it is commented out. Uncomment it and grab a cup of coffee if you'd like to featurize all of PDBbind's refined set. Otherwise, you can continue with the small dataset we constructed above.

In [None]:
# # Uncomment to featurize all of PDBBind's "refined" set
# pdbbind_tasks, (train_dataset, valid_dataset, test_dataset), transformers = dc.molnet.load_pdbbind(
#     featurizer=fp_featurizer, set_name="refined", reload=True,
#     data_dir='pdbbind_data', save_dir='pdbbind_data')

Now, we're ready to do some learning! 

To fit a DeepChem model, first we instantiate one of the provided (or user-written) model classes. In this case, we have created a convenience class, `SklearnModel`, that wraps any ML model available in scikit-learn so that it can interoperate with DeepChem. To instantiate an `SklearnModel`, pass it the scikit-learn estimator defining the type of model you would like to fit, in this case a `RandomForestRegressor`.

In [None]:
from sklearn.ensemble import RandomForestRegressor

from deepchem.utils.evaluate import Evaluator
import pandas as pd

In [None]:
seed = 42 # Set a random seed to get stable results
sklearn_model = RandomForestRegressor(n_estimators=150, max_features='sqrt', random_state=seed)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)

In [None]:
metric = dc.metrics.Metric(dc.metrics.r2_score)

evaluator = Evaluator(model, train_dataset, [])
train_r2score = evaluator.compute_model_performance([metric])
print("RF Train set R^2 %f" % (train_r2score["r2_score"]))

evaluator = Evaluator(model, test_dataset, [])
test_r2score = evaluator.compute_model_performance([metric])
print("RF Test set R^2 %f" % (test_r2score["r2_score"]))

RF Train set R^2 0.862398
RF Test set R^2 0.184741


We're using a very small dataset, so it's no surprise that the test set performance is quite bad, and the large gap between the train and test R^2 points to overfitting. Still, the positive test R^2 suggests that even features computed from the ligand alone carry some signal about binding affinity.

In [None]:
# Compare predicted and true values
list(zip(model.predict(train_dataset), train_dataset.y))[:5]

[(6.5175555555555444, 7.4),
 (6.844933333333342, 6.85),
 (4.523999999999992, 3.4),
 (6.675666666666676, 6.72),
 (8.417411111111095, 11.06)]

In [None]:
list(zip(model.predict(test_dataset), test_dataset.y))[:5]

[(6.13726666666667, 4.21),
 (6.775999999999999, 8.7),
 (6.612271111111106, 6.39),
 (6.174840000000003, 4.94),
 (6.971266666666662, 9.21)]

### The protein-ligand complex view.

In the previous section, we featurized only the ligand. The signal we observed in R^2 reflects the ability of fingerprints and random forests to learn general features that make ligands "drug-like." This time, let's see if we can do something sensible with our protein-ligand fingerprints that make use of our structural information. To start with, we need to re-featurize the dataset but using the contact fingerprint this time.
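
The idea behind a contact-based featurizer is to restrict attention to atoms near the protein-ligand interface. As a minimal sketch of that first step, the code below lists ligand-protein atom pairs that fall within a distance cutoff. The coordinates are made up, and the 4.5 Å cutoff and the helper name `find_contacts` are assumptions for illustration rather than the featurizer's actual settings.

```python
import math

def find_contacts(ligand_xyz, protein_xyz, cutoff=4.5):
    """Return (ligand_index, protein_index) pairs closer than cutoff (angstroms)."""
    contacts = []
    for i, lig in enumerate(ligand_xyz):
        for j, prot in enumerate(protein_xyz):
            distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(lig, prot)))
            if distance <= cutoff:
                contacts.append((i, j))
    return contacts

# Made-up coordinates in angstroms
ligand_xyz = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
protein_xyz = [(3.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 4.0, 0.0)]

contacts = find_contacts(ligand_xyz, protein_xyz)
# protein atoms that touch the ligand at least once
contact_protein_atoms = sorted({j for _, j in contacts})
print(contacts, contact_protein_atoms)
```

Only the atoms that appear in such pairs would then be passed on to a fingerprinting step, so the resulting vector describes the binding interface rather than the whole protein.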

In [None]:
fp_featurizer = dc.feat.ContactCircularFingerprint(size=128)

In [None]:
features = fp_featurizer.featurize(zip(ligands, proteins))
dataset = dc.data.NumpyDataset(X=features, y=labels, ids=pdbids)
train_dataset, test_dataset = dc.splits.RandomSplitter().train_test_split(dataset, seed=42)

Let's now train a simple random forest model on this dataset.

In [None]:
seed = 42 # Set a random seed to get stable results
sklearn_model = RandomForestRegressor(n_estimators=10, max_features='sqrt', random_state=seed)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)

Let's see what our R^2 scores look like!

In [None]:
metric = dc.metrics.Metric(dc.metrics.r2_score)

evaluator = Evaluator(model, train_dataset, [])
train_r2score = evaluator.compute_model_performance([metric])
print("RF Train set R^2 %f" % (train_r2score["r2_score"]))

evaluator = Evaluator(model, test_dataset, [])
test_r2score = evaluator.compute_model_performance([metric])
print("RF Test set R^2 %f" % (test_r2score["r2_score"]))

RF Train set R^2 0.273601
RF Test set R^2 -0.225230


Ok, it looks like we have a lower R^2 than with the ligand-only features. What gives? There might be a few things going on. It's hard to interpret with such a small dataset, but it's possible that for this particular dataset the pure ligand-only features are quite predictive. Note also that this forest uses only 10 trees, versus 150 for the ligand-only model, so the two runs are not directly comparable. Nonetheless, it's probably still useful to have a protein-ligand model since it's likely to learn different features than the pure ligand-only model.
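
A negative test R^2 can look like a mistake, but it follows directly from the definition R^2 = 1 - SS_res / SS_tot: a model scores below zero whenever its squared error is larger than that of simply predicting the mean label. The from-scratch sketch below uses made-up numbers to show the three regimes; `r2_from_scratch` is a hypothetical helper written for this illustration.

```python
def r2_from_scratch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # mean-baseline error
    return 1.0 - ss_res / ss_tot

y_true = [4.0, 6.0, 8.0]  # made-up labels
r2_perfect = r2_from_scratch(y_true, [4.0, 6.0, 8.0])  # exact predictions
r2_mean = r2_from_scratch(y_true, [6.0, 6.0, 6.0])     # always predict the mean
r2_worse = r2_from_scratch(y_true, [8.0, 6.0, 4.0])    # worse than the mean
print(r2_perfect, r2_mean, r2_worse)
```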

### Further reading

So far we have used DeepChem's docking module with the AutoDock Vina backend to generate docking scores for the PDBbind dataset. We trained a simple machine learning model to directly predict binding affinities, based on featurizing the protein-ligand complexes. We might want to try more sophisticated docking protocols, like the deep learning framework [gnina](https://github.com/gnina/gnina). You can read more about using convolutional neural nets for protein-ligand scoring [here](https://pubs.acs.org/doi/10.1021/acs.jcim.6b00740). And here is a [review](https://onlinelibrary.wiley.com/doi/abs/10.1002/wcms.1429) of machine learning-based scoring functions.

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!