This script can be used for any purpose without limitation subject to the conditions at http://www.ccdc.cam.ac.uk/Community/Pages/Licences/v2.aspx

This permission notice and the following statement of attribution must be included in all copies or substantial portions of this script.

2023-07-20: Written by Fabio Montisci and made available by the Cambridge Crystallographic Data Centre.


# Ensemble Docking with the CSD Python API

This script exemplifies the use of the CSD Python API to perform ensemble docking simulations with GOLD in the presence of functional waters. It follows the same workflow as the tutorial at https://www.ccdc.cam.ac.uk/media/Documentation/0D5504D3-81F2-46D3-9D1D-16787D07F70B/GOLD-tutorial-EnsembleDocking.pdf, which uses the Hermes GUI to run GOLD.

In [None]:
# Import all the necessary libraries and CSD Python API modules.

import os
import requests
import pandas as pd
import warnings

from ccdc import io
from ccdc.docking import Docker
from ccdc.protein import Protein
from ccdc.molecule import Molecule
from ccdc.utilities import to_utf8, output_file
from ccdc_notebook_utilities import run_hermes

warnings.filterwarnings("ignore", category=DeprecationWarning) # To ignore pandas deprecation warning when plotting

### Download the proteins from the PDB

First, we specify the PDB codes of the proteins that we want to include in the ensemble as items of a list. We also initialise an empty list to which we will append the proteins as CSD Python API protein objects.

In [None]:
pdb_codes = ['1E2H', '1E2I', '1OF1', '4IVQ']
proteins = [] 

We now iterate through the list containing the PDB codes. For each of them, we download the file from the Protein Data Bank website and save it in our working directory. We then use the 'Protein.from_file' method to read the file and generate a protein object, which we append to the 'proteins' list.

In [None]:
# Loop through the PDB codes.
for pdb_code in pdb_codes: 
    print(f'Downloading {pdb_code}...')
    pdb_request = requests.Session().get(f'https://files.rcsb.org/download/{pdb_code}.pdb')
    pdb_request.raise_for_status()
    filename = f'{pdb_code}.pdb'
    with output_file(filename) as pdb_file:
        pdb_file.write(to_utf8(pdb_request.text))
    protein = Protein.from_file(f'{pdb_code}.pdb')
    proteins.append(protein)
print('Done.')
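
Re-running the notebook downloads every file again. The sketch below shows one possible way to skip files that are already on disk. The helper name `fetch_pdb` and its `downloader` parameter are hypothetical (they are not part of the CSD Python API), and it uses the standard-library `urllib` instead of `requests` to stay self-contained.

```python
import os
import urllib.request


def fetch_pdb(pdb_code, directory='.', downloader=None):
    """Return the path to '<pdb_code>.pdb', downloading it only if it is missing.

    'downloader' is a hypothetical hook taking (url, path). It defaults to
    urllib.request.urlretrieve and exists mainly so the helper can be tested
    without network access.
    """
    path = os.path.join(directory, f'{pdb_code}.pdb')
    if not os.path.exists(path):
        url = f'https://files.rcsb.org/download/{pdb_code}.pdb'
        (downloader or urllib.request.urlretrieve)(url, path)
    return path
```

With such a helper, the body of the loop could reduce to `proteins.append(Protein.from_file(fetch_pdb(pdb_code)))`.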

### Get the ligand from the PDB

We also specify the PDB code of the entry from which we want to extract the ligand, and its label in the PDB file.

In [None]:
ligand_pdb = '1E2K'
ligand_label = 'A:TMC500'

Similarly to what we did before, we download the file from the PDB website and use it to create the protein object 'ligand_protein'.

In [None]:
print(f'Downloading {ligand_pdb} (PDB file containing the ligand)...')
pdb_request = requests.Session().get(f'https://files.rcsb.org/download/{ligand_pdb}.pdb')
pdb_request.raise_for_status()
filename = f'{ligand_pdb}.pdb'
with output_file(filename) as pdb_file:
    pdb_file.write(to_utf8(pdb_request.text))
ligand_protein = Protein.from_file(f'{ligand_pdb}.pdb')
print(f'Done.')

We then loop through the ligands in the protein object and check whether their identifier is equal to the value we stored in 'ligand_label'. We extract the TMC ligand into a CSD Python API molecule object named 'tmc500_1e2k'.

In [None]:
print(f'Getting ligand {ligand_label} from {ligand_pdb}...')
tmc500_1e2k = [ligand for ligand in ligand_protein.ligands if ligand.identifier == ligand_label][0]
print('Done.')
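
Indexing the list comprehension with `[0]` raises a bare `IndexError` if the label is misspelt or absent. A minimal sketch of a more informative lookup follows. The helper `find_by_identifier` is hypothetical, and `SimpleNamespace` objects with invented labels stand in for the ligand objects so that the example runs without the CSD Python API.

```python
from types import SimpleNamespace


def find_by_identifier(items, identifier):
    """Return the first item whose .identifier matches, or raise a clear error."""
    match = next((item for item in items if item.identifier == identifier), None)
    if match is None:
        available = ', '.join(item.identifier for item in items)
        raise ValueError(f'{identifier!r} not found; available: {available}')
    return match


# Stand-in objects: only the label 'A:TMC500' comes from the notebook,
# 'A:XXX1' is a made-up placeholder.
fake_ligands = [SimpleNamespace(identifier='A:XXX1'),
                SimpleNamespace(identifier='A:TMC500')]
ligand = find_by_identifier(fake_ligands, 'A:TMC500')
print(ligand.identifier)
```

In the notebook, the same call would take `ligand_protein.ligands` and `ligand_label` as arguments.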

### Get the active waters

Finally, we need to get the functional waters from the 1E2I structure (we select 1E2I because it is the one that contains all three functional waters). We need to inspect the file in advance to know the labels of the water molecules we want to extract.
We can then loop through all the waters in the protein and add them to a list if they have the right identifier.

In [None]:
waters = []
print(f'Extracting active waters from 1E2I...')
for protein in proteins:
    if protein.identifier == '1E2I':
        for water in protein.waters:
            if water.identifier in ['A:HOH2023', 'A:HOH2044', 'A:HOH2123']:
                waters.append(water)
print('Done.')
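
If one of the labels is wrong, the loop silently collects fewer than three waters. The sketch below shows one way such a case could be detected. The helper `missing_identifiers` is hypothetical, and stand-in objects replace the real water objects so that the example runs on its own.

```python
from types import SimpleNamespace

wanted = {'A:HOH2023', 'A:HOH2044', 'A:HOH2123'}


def missing_identifiers(found, wanted):
    """Return the sorted identifiers in 'wanted' that no object in 'found' carries."""
    return sorted(set(wanted) - {item.identifier for item in found})


# Stand-ins for water objects; the third water is deliberately left out
# to show what the check reports.
fake_waters = [SimpleNamespace(identifier='A:HOH2023'),
               SimpleNamespace(identifier='A:HOH2044')]
print(missing_identifiers(fake_waters, wanted))
```

Applied to the notebook's 'waters' list, an empty result would confirm that all the functional waters were extracted.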

We now have everything we need for the docking: the proteins, the ligand and the functional waters. However, we also need to prepare all these items before we can proceed with the simulation.

### Prepare, superimpose and export the proteins

To prepare the proteins, we loop through the 'proteins' list and delete all chains except chain A, together with all ligands, metals and waters. We then protonate the proteins and call the 'sort_atoms_by_residue' method to tidy things up.

In [None]:
for protein in proteins:
    print(f'Preparing protein {protein.identifier}...')
    for chain in protein.chains:
        if chain.identifier != 'A':
            protein.remove_chain(chain.identifier)
    for ligand in protein.ligands:
        protein.remove_ligand(ligand.identifier)
    protein.remove_all_metals()
    protein.remove_all_waters()
    protein.add_hydrogens()
    protein.sort_atoms_by_residue()
print('Done.')

We also need to make sure that the proteins are superimposed on top of each other.

In [None]:
print(f'Superimposing all the proteins...')
for i in range(1, len(proteins)):
    Protein.ChainSuperposition().superpose(proteins[0].chains[0], proteins[i].chains[0])
print('Done.')

At this point, we want to save the prepared proteins as mol2 files. We initialise an empty list in which we will store the prepared file names and an empty molecule object to which we will add all the prepared protein objects. We will need the former to add the proteins to the GOLD configuration file and the latter to calculate a common binding site for all the proteins in the ensemble.

In [None]:
prepared_protein_files = []
merged_protein = Molecule()
for protein in proteins:
    with io.MoleculeWriter(f'{protein.identifier}_prepared.mol2') as protein_writer:
        protein_writer.write(protein)
    prepared_protein_files.append(f'{protein.identifier}_prepared.mol2')
    print(f'Saving {protein.identifier}_prepared.mol2')
    merged_protein.add_molecule(protein)
print('Done.')

### Prepare and export the ligand

We then need to prepare the ligand with the Ligand Preparation class. Note that this class works with Entry objects, not Molecule objects. Therefore, we first save the ligand as a mol2 file and then read it with the Entry Reader to prepare it. Finally, we save the prepared ligand as a new mol2 file.

In [None]:
print('Preparing the ligand...')
with io.MoleculeWriter('ligand.mol2') as lig_writer:
    lig_writer.write(tmc500_1e2k)
ligand_prep = Docker.LigandPreparation()
prepared_lig = ligand_prep.prepare(io.EntryReader('ligand.mol2')[0])
with io.MoleculeWriter(f'ligand_{ligand_pdb}_prepared.mol2') as prepared_lig_writer:
    prepared_lig_writer.write(prepared_lig.molecule)
print('Done.')

### Prepare and export the active waters

Preparing the active waters is easy. We just loop through the list, protonate them and save them as mol2 files.

In [None]:
print('Exporting the active waters...')
for i, water in enumerate(waters):
    water.add_hydrogens()
    with io.MoleculeWriter(f'water_{i + 1}.mol2') as water_writer:
        water_writer.write(water)
print('Done.')

### Set the docking settings

At this point we can start setting up the docking simulation. We instantiate the Docker and access its settings. All the changes we make to the settings will be written to the conf file.

In [None]:
docker = Docker()
settings = docker.settings

We loop through the 'prepared_protein_files' list and add the absolute path of each protein file to the settings. 

In [None]:
for protein_file in prepared_protein_files:
    settings.add_protein_file(os.path.abspath(protein_file))

We do the same for the ligand and we specify the number of docking poses we wish to obtain from the simulation.

In [None]:
settings.add_ligand_file(os.path.abspath(f'ligand_{ligand_pdb}_prepared.mol2'), 10)

We also use the prepared ligand molecule object position to calculate the binding site of the merged protein, specifying a radius in Angstroms. 

In [None]:
lig = prepared_lig.molecule
settings.binding_site = settings.BindingSiteFromLigand(merged_protein, lig, 6.0)

We then add the paths of the active water molecules, along with some settings for each of them. The 'toggle' state lets GOLD decide whether that particular water molecule is kept or displaced in a given solution. The 'trans_spin' state and movable_distance=1.0 mean that the water molecules are allowed to rotate and translate inside a 1 Angstrom box.

In [None]:
for i in range(len(waters)):
    water_path = os.path.join(os.getcwd(), f'water_{i + 1}.mol2')
    settings.add_water_file(water_path, toggle_state='toggle', spin_state='trans_spin', movable_distance=1.0)

We can now take care of the simulation parameters.
* We first specify which fitness function to use. The options are ‘goldscore’, ‘chemscore’, ‘asp’ and ‘plp’. We will use the last of these, which offers good performance and is the fastest in the majority of cases.
* We set the autoscale parameter to 75%. This controls how much searching is performed. The docker determines how much docking is reasonable to perform on a ligand based on its number of rotatable bonds and its number of hydrogen donors and acceptors. The percentage scales this amount to give faster or more thorough docking. 75% is the standard for ensemble docking.
* We switch off early termination to ensure that as many solutions as possible are explored.
* We switch on flipping corners to allow GOLD to perform a limited conformational search of cyclic systems, in which free corners of the rings in the ligand can flip above and below the plane of their neighbouring atoms.

In [None]:
settings.fitness_function = 'plp'
settings.autoscale = 75
settings.early_termination = False
settings.flip_free_corners = True

Finally, we set up the output that our simulation will produce by specifying an output folder, an output mol2 file containing all the docking poses, and the name of the GOLD configuration file.

In [None]:
out_path = os.path.join(os.getcwd(), 'results')
if not os.path.exists(out_path):
    os.makedirs(out_path)
settings.output_directory = out_path
settings.output_file = 'docking_poses.mol2'
conf_file = os.path.join(out_path, 'ensemble.conf')
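
The existence check followed by `os.makedirs` can be collapsed into a single call with `exist_ok=True`. The sketch below illustrates this. A temporary directory is used in place of the working directory, purely so that the example leaves no files behind.

```python
import os
import tempfile

# A temporary directory stands in for the notebook's working directory.
base = tempfile.mkdtemp()
out_path = os.path.join(base, 'results')

os.makedirs(out_path, exist_ok=True)  # creates the folder if it is missing
os.makedirs(out_path, exist_ok=True)  # a second call is a harmless no-op

conf_file = os.path.join(out_path, 'ensemble.conf')
```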

### Run the docking

We can now perform the actual docking simulation.

In [None]:
results = docker.dock(conf_file)
print('Docking Completed.')

Starting from a superimposed set of protein structures, GOLD evolves a separate population of ligand conformations for each protein structure that is part of the ensemble. The best ligand conformation found in any of the ensemble structures is returned.

### Visualise the results

Now that the docking is complete we call Hermes GUI to visualise the results. You can find the 'run_hermes' function in 'ccdc_notebook_utilities', available in the ccdc-opensource GitHub repository (https://github.com/ccdc-opensource/csd-python-api-scripts/tree/main/notebooks).

In [None]:
print('Opening Hermes GUI to visualise results.')
run_hermes(conf_file)

The 10 docking solutions are listed in the order in which they were docked, with their corresponding fitness scores, but they can be sorted by clicking on the PLP Fitness header. The protein that each solution corresponds to is identifiable by its ensemble index number (1-4).

### Plot best score for each protein

We can also use Python to postprocess the results. In this simple example, we parse the GOLD ligand log file to get the best score achieved for each of the proteins in the ensemble.

In [None]:
scores = []
print('Parsing the ligand logfile.')
with open(os.path.join(out_path, f'gold_ligand_{ligand_pdb}_prepared_m1.log')) as lig_logfile:
    # The four per-protein score lines are the 8th to 5th lines from the end of the log.
    for line in lig_logfile.readlines()[-8:][:4]:
        print(line, end='')
        cols = line.split()
        scores.append(cols[2])
scores = [float(i) for i in scores]
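
The slice `[-8:][:4]` is tied to an ensemble of exactly four proteins. A minimal sketch of a more general version follows. The helper `scores_from_tail` is hypothetical. Its defaults encode the same assumptions as the original slice, namely that the block of interest is followed by four further lines and that the score is the third whitespace-separated column. The sample lines are invented placeholders and do not reproduce real GOLD log output.

```python
def scores_from_tail(lines, n_rows, n_trailing=4, score_column=2):
    """Extract one float per row from a block located near the end of a log.

    Assumes the block of 'n_rows' lines is followed by exactly 'n_trailing'
    lines, and that the score is in whitespace-separated column 'score_column'.
    """
    end = len(lines) - n_trailing
    if end - n_rows < 0:
        raise ValueError('log file is shorter than expected')
    block = lines[end - n_rows:end]
    return [float(line.split()[score_column]) for line in block]


# Invented placeholder lines, NOT real GOLD output.
sample = ['header\n']
sample += [f'a b {value} c\n' for value in (1.5, 2.5, 3.5, 4.5)]
sample += ['trailer\n'] * 4
print(scores_from_tail(sample, n_rows=4))
```

In the notebook, `n_rows` could be set to `len(pdb_codes)`, provided the layout of the log file matches these assumptions.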

We then add these scores to a Pandas dataframe and generate a bar plot.

In [None]:
ensemble_df = pd.DataFrame(columns=['Protein', 'Score'])
ensemble_df['Protein'] = pdb_codes
ensemble_df['Score'] = scores
ensemble_df['Score'] = ensemble_df['Score'].astype(float)
ensemble_df.plot.bar(x='Protein', y='Score', rot=0, title='Best Pose per Protein', xlabel='Protein', ylabel='PLP Score', ylim=(min(scores)-10,max(scores)+5))
print('Plotting the score of the best pose for each protein in the ensemble.')
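
Besides plotting, the scores can be ranked to identify the best-performing protein directly. The sketch below assumes, as is usual for GOLD fitness scores, that a higher value indicates a better pose. The score values are made-up placeholders, not docking results.

```python
pdb_codes = ['1E2H', '1E2I', '1OF1', '4IVQ']
scores = [10.0, 20.0, 15.0, 5.0]  # made-up placeholder values, not docking results

# Pair each PDB code with its score and rank the pairs, highest score first.
ranking = sorted(zip(pdb_codes, scores), key=lambda pair: pair[1], reverse=True)
best_code, best_score = ranking[0]

for code, score in ranking:
    print(f'{code}: {score:.2f}')
print(f'Best protein in this example: {best_code}')
```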