# Tutorial Part 13: Modeling Protein-Ligand Interactions

In this tutorial, we'll walk you through the use of machine learning methods to predict the binding energy of a protein-ligand complex. Recall that a ligand is some small molecule which interacts (usually non-covalently) with a protein. As you work through the tutorial, you'll trace an arc from loading a raw dataset to fitting a random forest model to predict binding affinities. We'll take the following steps to get there:

1. Loading a chemical dataset, consisting of a series of protein-ligand complexes.
2. Featurizing each protein-ligand complex with various featurization schemes.
3. Fitting a series of models with these featurized protein-ligand complexes.
4. Visualizing the results.

To start the tutorial, we'll use a simple pre-processed dataset file that comes in the form of a gzipped file. Each row is a molecular system, and each column represents a different piece of information about that system. For instance, in this example, every row reflects a protein-ligand complex, and the following columns are present: a unique complex identifier; the SMILES string of the ligand; the binding affinity (Ki) of the ligand to the protein in the complex; a Python `list` of all lines in a PDB file for the protein alone; and a Python `list` of all lines in a ligand file for the ligand alone.
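For a sense of what the affinity numbers mean, here is a minimal sketch. It assumes the `label` column holds -log10 of Ki expressed in molar units, which the file itself does not state. The helper names and the 298.15 K temperature are illustrative choices, not part of DeepChem.

```python
import math

# Gas constant in kcal / (mol K)
GAS_CONSTANT_KCAL = 1.987204e-3


def ki_to_pki(ki_molar):
    """Convert an inhibition constant in mol/L to its negative base-10 log."""
    return -math.log10(ki_molar)


def pki_to_delta_g(pki, temperature_kelvin=298.15):
    """Approximate binding free energy (kcal/mol) from dG = RT ln(Ki)."""
    return -GAS_CONSTANT_KCAL * temperature_kelvin * math.log(10) * pki


# Hypothetical ligand with Ki = 120 nM
pki = ki_to_pki(1.2e-7)
delta_g = pki_to_delta_g(pki)
print("pKi = %.2f, dG = %.2f kcal/mol" % (pki, delta_g))
```

Under these assumptions, each unit increase in the label corresponds to a tenfold tighter binder, which is why the tutorial speaks of binding affinity and binding energy almost interchangeably.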

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/13_Modeling_Protein_Ligand_Interactions.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

In [1]:
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e



add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
all packages is already installed


# conda environments:
#
base                  *  /root/miniconda



In [2]:
!pip install --pre deepchem
import deepchem
deepchem.__version__



'2.4.0-rc1.dev'

In [3]:
import deepchem as dc
from deepchem.utils import download_url

import os

data_dir = dc.utils.get_data_dir()
dataset_file = os.path.join(data_dir, "pdbbind_core_df.csv.gz")

if not os.path.exists(dataset_file):
    print('File does not exist. Downloading file...')
    download_url("https://s3-us-west-1.amazonaws.com/deepchem.io/datasets/pdbbind_core_df.csv.gz")
    print('File downloaded...')

raw_dataset = dc.utils.save.load_from_disk(dataset_file)

Let's see what `raw_dataset` looks like:

In [4]:
print("Type of dataset is: %s" % str(type(raw_dataset)))
print(raw_dataset[:5])
print("Shape of dataset is: %s" % str(raw_dataset.shape))

Type of dataset is: <class 'pandas.core.frame.DataFrame'>
  pdb_id  ... label
0   2d3u  ...  6.92
1   3cyx  ...  8.00
2   3uo4  ...  6.52
3   1p1q  ...  4.89
4   3ag9  ...  8.05

[5 rows x 7 columns]
Shape of dataset is: (193, 7)


Visualizing what these proteins and ligands look like will help us build intuition and understanding about these systems. Let's write a bit of code to help us view our molecules. We'll use the `nglview` library to help us do this. You can install this library by calling `pip install nglview`.

In [5]:
!pip install -q nglview mdtraj

In [6]:
import nglview
import tempfile
import os
import mdtraj as md
import numpy as np



We'll use the `mdtraj` library to help us manipulate both ligand and protein objects. The following convenience function parses the ligand and protein representations above into mdtraj objects.

In [7]:
def convert_lines_to_mdtraj(molecule_lines):
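  # Each file is stored as the string form of a Python list of lines, so strip
  # the brackets, quotes, and escaped newlines, then split back into lines.
  molecule_lines = molecule_lines.strip('[').strip(']').replace("'","").replace("\\n", "").split(", ")
  # Write the lines to a temporary PDB file that mdtraj can load.
  tempdir = tempfile.mkdtemp()
  molecule_file = os.path.join(tempdir, "molecule.pdb")
  with open(molecule_file, "w") as f:
    for line in molecule_lines:
      f.write("%s\n" % line)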
  molecule_mdtraj = md.load(molecule_file)
  return molecule_mdtraj
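
The string handling above relies on each column holding the printed form of a Python list. Under that assumption, a sketch of an alternative parse using the standard library's `ast.literal_eval` might look like this. The helper `parse_serialized_lines` and the sample record are hypothetical.

```python
import ast


def parse_serialized_lines(serialized):
    """Recover a list of file lines from the string form of a Python list."""
    lines = ast.literal_eval(serialized)
    # Drop the trailing newline that each stored line carries.
    return [line.rstrip("\n") for line in lines]


# Hypothetical two-line record in the same style as the dataset columns
serialized = "['HETATM    1  C1  LIG A   1       0.000   0.000   0.000\\n', 'END\\n']"
lines = parse_serialized_lines(serialized)
for line in lines:
    print(line)
```

Parsing the literal rather than splitting on `", "` keeps any line that happens to contain a comma or an apostrophe intact. For the records in this dataset, either approach may work equally well.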

Let's take a look at the first protein-ligand pair in our dataset:

In [8]:
first_protein, first_ligand = raw_dataset.iloc[0]["protein_pdb"], raw_dataset.iloc[0]["ligand_pdb"]
protein_mdtraj = convert_lines_to_mdtraj(first_protein)
ligand_mdtraj = convert_lines_to_mdtraj(first_ligand)

We'll use the convenience function `nglview.show_mdtraj` in order to view our proteins and ligands.

In [9]:
v = nglview.show_mdtraj(ligand_mdtraj)
v

NGLWidget()

Now that we have an idea of what the ligand looks like, let's take a look at our protein:

In [10]:
view = nglview.show_mdtraj(protein_mdtraj)
view

NGLWidget()

Can we view the complex with both protein and ligand? Yes, but we'll need the following helper function to join the two mdtraj files for the protein and ligand.

In [11]:
def combine_mdtraj(protein, ligand):
  chain = protein.topology.add_chain()
  residue = protein.topology.add_residue("LIG", chain, resSeq=1)
  for atom in ligand.topology.atoms:
      protein.topology.add_atom(atom.name, atom.element, residue)
  protein.xyz = np.hstack([protein.xyz, ligand.xyz])
  protein.topology.create_standard_bonds()
  return protein
complex_mdtraj = combine_mdtraj(protein_mdtraj, ligand_mdtraj)

Let's now visualize our complex:

In [12]:
v = nglview.show_mdtraj(complex_mdtraj)
v

NGLWidget()

We can see that the ligand slots into a groove on the outer edge of the protein. Ok, now that we've got our basic visualization tools up and running, let's see if we can use some machine learning to understand our dataset of protein-ligand systems better.

In order to do this, we'll need a way to transform our protein-ligand complexes into representations which can be used by learning algorithms. Ideally, we'd have neural protein-ligand complex fingerprints, but DeepChem doesn't yet have a good learned fingerprint of this sort. We do, however, have well-tuned manual featurizers that can help us with our challenge here.

We'll make use of two types of fingerprints in the rest of the tutorial: circular fingerprints and grid descriptors. The grid descriptors convert a 3D volume containing an arrangement of atoms into a fingerprint. This is really useful for understanding protein-ligand complexes, since it allows us to transform protein-ligand complexes into vectors that can be passed into simple machine learning algorithms. Let's see how we can create such a fingerprint in DeepChem. We'll make use of the `dc.feat.RdkitGridFeaturizer` class.

In [13]:
grid_featurizer = dc.feat.RdkitGridFeaturizer(
    voxel_width=16.0, feature_types=["ecfp", "splif", "hbond", "pi_stack", "cation_pi", "salt_bridge"], 
    ecfp_power=5, splif_power=5, parallel=True, flatten=True, sanitize=True)


Next we'll create circular fingerprints. These convert small molecules into a vector of fragments. You can create these fingerprints with the `dc.feat.CircularFingerprint` class.

In [14]:
compound_featurizer = dc.feat.CircularFingerprint(size=128)
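
To give a feel for what a vector of fragments with `size=128` means, here is a toy sketch of folding fragment identifiers into a fixed-length bit vector. It is not the algorithm that RDKit or DeepChem uses. The fragment strings, the `fold_fragments` helper, and the choice of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib


def fold_fragments(fragments, size=128):
    """Hash each fragment label to a position and set that bit."""
    bits = [0] * size
    for fragment in fragments:
        index = zlib.crc32(fragment.encode("utf-8")) % size
        bits[index] = 1
    return bits


# Hypothetical fragment labels standing in for atom neighborhoods
fragments = ["C", "CC", "CCO", "c1ccccc1"]
fingerprint = fold_fragments(fragments, size=128)
print(len(fingerprint), sum(fingerprint))
```

Because many possible fragments share only `size` positions, two different fragments can land on the same bit. A larger `size` reduces such collisions at the cost of a longer feature vector.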

The convenience loader `dc.molnet.load_pdbbind_grid` will take care of featurizing the PDBBind dataset under the hood for us. Passing `featurizer="ECFP"` selects the circular fingerprints, so this first pass uses only the ligand. We'll featurize the "refined" subset of the PDBBind dataset (which consists of only a couple thousand protein-ligand complexes) to keep this task manageable.

In [15]:
pdbbind_tasks, (train_dataset, valid_dataset, test_dataset), transformers = dc.molnet.load_pdbbind_grid(
    featurizer="ECFP", subset="refined")

Now, we're ready to do some learning! 

To fit a DeepChem model, we first instantiate one of the provided (or user-written) model classes. In this case, we have created a convenience class that wraps any ML model available in scikit-learn so that it can interoperate with DeepChem. To instantiate an `SklearnModel`, you pass it a scikit-learn estimator instance that defines the type of model you would like to fit, in this case a `RandomForestRegressor`.

In [16]:
from sklearn.ensemble import RandomForestRegressor

seed=23 # Set a random seed to get stable results
sklearn_model = RandomForestRegressor(n_estimators=10, max_features='sqrt')
sklearn_model.random_state = seed
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)

In [17]:
from deepchem.utils.evaluate import Evaluator
import pandas as pd

metric = dc.metrics.Metric(dc.metrics.r2_score)

evaluator = Evaluator(model, train_dataset, transformers)
train_r2score = evaluator.compute_model_performance([metric])
print("RF Train set R^2 %f" % (train_r2score["r2_score"]))

evaluator = Evaluator(model, valid_dataset, transformers)
valid_r2score = evaluator.compute_model_performance([metric])
print("RF Valid set R^2 %f" % (valid_r2score["r2_score"]))

n_samples is a deprecated argument which is ignored.
n_samples is a deprecated argument which is ignored.


RF Train set R^2 0.850540
RF Valid set R^2 0.372395
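
The R^2 values printed above can be read with the usual coefficient-of-determination definition in mind, assuming that is what `dc.metrics.r2_score` computes. A minimal sketch of that definition in plain Python follows. The `r_squared` helper is hypothetical, and the sample values are the five labels shown in the table earlier.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [6.92, 8.00, 6.52, 4.89, 8.05]
mean_label = sum(y_true) / len(y_true)

# A perfect predictor versus one that always predicts the mean label
perfect = r_squared(y_true, y_true)
baseline = r_squared(y_true, [mean_label] * len(y_true))
print(perfect, baseline)
```

A score near 1 means the model explains most of the variance in the labels, while a score near 0 means it does no better than always predicting the mean.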


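This is decent performance on a validation set, although the gap between the training R^2 (0.85) and the validation R^2 (0.37) shows that the forest overfits. It's interesting to note that a simple prediction from just the ligand can do reasonably well on the task of predicting the binding energy.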

In [18]:
predictions = model.predict(test_dataset)
print(predictions[:10])

[-1.23524245 -0.97359773 -0.56976069 -0.87289442 -0.98665882 -0.38179604
 -0.14367127 -1.20101768  0.00373068  0.15792326]
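
The predictions printed above are centered near zero, whereas the labels in the raw table range from roughly 5 to 8. One plausible reading is that `transformers` z-scores the labels, but this is an assumption that the output shown here does not confirm. Under that assumption, mapping predictions back to label units means inverting the scaling. The mean and standard deviation below are made-up values for illustration.

```python
def undo_zscore(values, mean, std):
    """Invert z-score normalization: original = normalized * std + mean."""
    return [v * std + mean for v in values]


# First, second, and last predictions from the output above, rounded
normalized = [-1.235, -0.974, 0.158]

# Hypothetical statistics of the training labels
label_mean, label_std = 6.4, 2.0

restored = undo_zscore(normalized, label_mean, label_std)
print(restored)
```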


# The protein-ligand complex view.

In the previous section, we featurized only the ligand. The signal we observed in R^2 reflects the ability of circular fingerprints and random forests to learn general features that make ligands "drug-like." This time, let's see if we can do something sensible with protein-ligand fingerprints that make use of our structural information. To start with, we need to re-featurize the dataset, this time using the "grid" fingerprints.

In [19]:
pdbbind_tasks, (train_dataset, valid_dataset, test_dataset), transformers = dc.molnet.load_pdbbind_grid(
    featurizer="grid", subset="refined")

Let's now train a simple random forest model on this dataset.

In [20]:
seed=23 # Set a random seed to get stable results
sklearn_model = RandomForestRegressor(n_estimators=10, max_features='sqrt')
sklearn_model.random_state = seed
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)

Let's see what our R^2 scores look like!

In [21]:
metric = dc.metrics.Metric(dc.metrics.r2_score)

evaluator = Evaluator(model, train_dataset, transformers)
train_r2score = evaluator.compute_model_performance([metric])
print("RF Train set R^2 %f" % (train_r2score["r2_score"]))

evaluator = Evaluator(model, valid_dataset, transformers)
valid_r2score = evaluator.compute_model_performance([metric])
print("RF Valid set R^2 %f" % (valid_r2score["r2_score"]))

n_samples is a deprecated argument which is ignored.
n_samples is a deprecated argument which is ignored.


RF Train set R^2 0.897545
RF Valid set R^2 0.402932


Ok, there's some predictive performance here, but the validation R^2 (0.40) is only marginally higher than that of the ligand-only model (0.37). What gives? There might be a few things going on. It's possible that for this particular dataset the ligand-only features are already quite predictive. Nonetheless, it's probably still useful to have a protein-ligand model, since it's likely to learn different features than the ligand-only model.

# Doing Some Hyperparameter Optimization

Ok, now that we've built a few models, let's do some hyperparameter optimization to see if we can get our numbers to be a little better. We'll use the `dc.hyper` module to do this for us.

In [22]:
# def rf_model_builder(model_params, model_dir):
#   sklearn_model = RandomForestRegressor(**model_params)
#   sklearn_model.random_state = seed
#   return dc.models.SklearnModel(sklearn_model, model_dir)

# params_dict = {
#     "n_estimators": [10, 50, 100],
#     "max_features": ["auto", "sqrt", "log2", None],
# }

# metric = dc.metrics.Metric(dc.metrics.r2_score)
# optimizer = dc.hyper.HyperparamOpt(rf_model_builder)
# best_rf, best_rf_hyperparams, all_rf_results = optimizer.hyperparam_search(
#     params_dict, train_dataset, valid_dataset, transformers,
#     metric=metric)
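
A grid search of this kind tries every combination of the listed values. As a rough illustration in plain Python (not the `dc.hyper` implementation), the candidate settings implied by `params_dict` can be enumerated as follows. The `expand_grid` helper is hypothetical.

```python
import itertools

# Same search space as in the cell above
params_dict = {
    "n_estimators": [10, 50, 100],
    "max_features": ["auto", "sqrt", "log2", None],
}


def expand_grid(params):
    """Yield one dict per combination of hyperparameter values."""
    names = sorted(params)
    for values in itertools.product(*(params[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(expand_grid(params_dict))
print(len(candidates))
print(candidates[0])
```

Each candidate would then be used to build a model, fit it on the training set, and score it on the validation set, keeping the setting with the best validation metric. The number of fits grows multiplicatively with every added hyperparameter.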

Ok, our best validation score is now `0.53` R^2. Let's make some predictions on the test set and see what they look like.

In [23]:
# %matplotlib inline

# import matplotlib
# import numpy as np
# import matplotlib.pyplot as plt

# rf_predicted_test = best_rf.predict(test_dataset)
# rf_true_test = test_dataset.y
# plt.scatter(rf_predicted_test, rf_true_test)
# plt.xlabel('Predicted pIC50')
# plt.ylabel('True pIC50')
# plt.title('RF predicted pIC50 vs. true pIC50')
# plt.xlim([2, 11])
# plt.ylim([2, 11])
# plt.plot([2, 11], [2, 11], color='k')
# plt.show()

This model seems reasonably predictive! It's likely that this model is still a ways from being good enough to use in a production setting, but this isn't bad for a quick and dirty tutorial.

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!