```
This script can be used for any purpose without limitation subject to the
conditions at http://www.ccdc.cam.ac.uk/Community/Pages/Licences/v2.aspx

This permission notice and the following statement of attribution must be
included in all copies or substantial portions of this script.

2022-06-01: Made available by the Cambridge Crystallographic Data Centre.

```

# Run GOLD using the CSD Python API

This notebook illustrates running GOLD _via_ the CSD Python API in [interactive](https://downloads.ccdc.cam.ac.uk/documentation/API/descriptive_docs/docking.html#interactive-docking) mode, with the docking being configured entirely _via_ the API.

Note that, in `interactive` mode, we need to write out the solution files ourselves (using the standard GOLD file naming scheme) if we wish to use Hermes to view the results.

#### GOLD docs
* [User Guide](https://www.ccdc.cam.ac.uk/support-and-resources/ccdcresources/GOLD_User_Guide.pdf)
* [Conf file](https://www.ccdc.cam.ac.uk/support-and-resources/ccdcresources/GOLD_conf_file_user_guide.pdf)

#### Docking API docs
* [Descriptive](https://downloads.ccdc.cam.ac.uk/documentation/API/descriptive_docs/docking.html)
* [Module API](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/docking_api.html)

In [1]:
import sys
sys.path.append('../..')
from ccdc_notebook_utilities import create_logger, run_hermes
import os
import shutil
from pathlib import Path
import time

In [2]:
import pandas as pd

In [3]:
import ccdc
from ccdc.io import MoleculeReader, EntryReader, EntryWriter
from ccdc.docking import Docker

### Config

The directory containing the input files for these dockings; directory must exist...

In [4]:
input_dir = Path('input_files').absolute()

Protein target and a native ligand (used to define binding site); files must exist...

In [5]:
target_dir = input_dir / 'target'

protein_file = target_dir / 'protein.mol2'
ligand_file  = target_dir / 'ligand.mol2'

Molecules to dock; file must exist...

In [6]:
input_file = input_dir / 'input.sdf'
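
Since the protein, native ligand and input files must all exist before docking starts, it can help to check for them up front. The following is a minimal sketch using only the standard library; `missing_inputs` is a hypothetical helper (not part of the CSD Python API), and the temporary directory merely stands in for `input_files`.

```python
from pathlib import Path
import tempfile


def missing_inputs(paths):
    """Return the subset of paths that do not point at an existing file."""
    return [Path(p) for p in paths if not Path(p).is_file()]


# Demonstration with a throwaway directory standing in for 'input_files'
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    present = demo_dir / 'protein.mol2'
    present.write_text('')             # create an empty placeholder file
    absent = demo_dir / 'ligand.mol2'  # deliberately not created

    demo_missing = missing_inputs([present, absent])

print([p.name for p in demo_missing])
```

In this notebook the helper could be called as `missing_inputs([protein_file, ligand_file, input_file])`, raising `FileNotFoundError` if the returned list is not empty.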

Binding site radius...

In [7]:
radius = 6

Number of dockings (_i.e._ GA runs) per ligand; default is 10...

In [8]:
ndocks = 5  # Set to 5 for speed

Fitness function (options are 'goldscore', 'chemscore', 'asp' and 'plp'; GoldScore is the default)...

In [9]:
fitness_function = 'plp'

Autoscale parameter (as a percentage); default is 100%...

In [10]:
autoscale = 30  # Set to 30% for speed

Output directory (will be created)...

In [11]:
output_dir = Path('output_interactive')

Output format (_N.B._ the input file format will be used if the output format is not specified)...

In [12]:
# output_format = 'sdf'  # 'mol2'

We will set the 'write options' to `MIN_OUT` so output to disk is minimal. See [here](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/docking_api.html?highlight=write_options#ccdc.docking.Docker.Settings.write_options) for available write options, and the GOLD Configuration File User Guide, Chapter 16 for more details. 

In [13]:
write_options = ['MIN_OUT']

### Initialization

In [14]:
logger = create_logger()

Create a fresh output directory for the docking run...

In [15]:
if output_dir.exists():
    
    logger.warning(f"The output directory '{output_dir}' exists and will be overwritten.")
    
    shutil.rmtree(output_dir)
    
output_dir.mkdir()

os.chdir(output_dir)

### Configure docking

Here, the docking configuration is set up from scratch using the API. We do this by instantiating a `Docker.Settings` object and modifying it _via_ its methods and attributes...

In [16]:
settings = Docker.Settings()

Specify the protein target...

In [17]:
settings.add_protein_file(str(protein_file))

Define the binding site using the native ligand...

In [18]:
native_ligand = MoleculeReader(str(ligand_file))[0]

settings.binding_site = settings.BindingSiteFromLigand(settings.proteins[0], native_ligand, radius)

Set the fitness function...

In [19]:
settings.fitness_function = fitness_function

Set number of dockings (_N.B._ interactive mode defaults to only one solution, not 10 as in other modes)...

In [20]:
settings.set_hostname(ndocks=ndocks) 

Set the autoscale parameter...

In [21]:
settings.autoscale = autoscale

Set output format if it was explicitly specified above (_N.B._ the input file format will be used if not)...

In [22]:
if 'output_format' in locals() and output_format:
    
    settings.output_format = output_format
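
The `'output_format' in locals()` test works because the config cell may leave `output_format` undefined. One alternative, sketched below, is to always define the variable and use `None` to mean "not specified". This assumes the config cell is changed to `output_format = None`; `apply_output_format` is a hypothetical helper and `SimpleNamespace` is only a stand-in for a `Docker.Settings` object.

```python
from types import SimpleNamespace


def apply_output_format(settings, output_format=None):
    """Set settings.output_format only when a format was explicitly given."""
    if output_format:
        settings.output_format = output_format
    return settings


# Stand-in object for illustration only; not a real Docker.Settings instance
demo_settings = SimpleNamespace(output_format='unset')

apply_output_format(demo_settings)         # nothing specified: attribute untouched
before = demo_settings.output_format

apply_output_format(demo_settings, 'sdf')  # explicit format: attribute updated
after = demo_settings.output_format

print(before, after)
```

The behaviour matches the original cell; the difference is only that the "not specified" case is stated explicitly rather than inferred from a missing name.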

Set write options...

In [23]:
settings.write_options = write_options

#### Add a protein H-bond constraint

Here we add a protein H-bond constraint to the backbone NH that donates the conserved H-bond in the hinge. This means the fitness of a docked ligand will be penalised if it doesn't make an H-bond with this atom.

In [24]:
chain_label, residue_label, atom_label = 'A', 'ALA451', 'H'  # Conserved hinge H-bond donor

In [25]:
protein = settings.proteins[0]

atom = [atom for atom in protein[f'{chain_label}:{residue_label}'].atoms if atom.label == atom_label][0]

settings.add_constraint(settings.ProteinHBondConstraint([atom]))
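
The list comprehension followed by `[0]` raises a bare `IndexError` if no atom carries the requested label, for example when the residue label is mistyped or the protein has no explicit hydrogens. A small lookup helper can give a clearer message. This is a sketch only: `find_atom_by_label` is hypothetical, and the `Atom` namedtuple is a stand-in for the atoms of a residue, assumed here to expose a `label` attribute as in the cell above.

```python
from collections import namedtuple

# Hypothetical stand-in for an atom object; only the 'label' attribute is used
Atom = namedtuple('Atom', 'label')


def find_atom_by_label(atoms, label):
    """Return the first atom whose label matches, or raise a descriptive error."""
    atoms = list(atoms)
    match = next((a for a in atoms if a.label == label), None)
    if match is None:
        raise LookupError(f"No atom labelled '{label}' among {len(atoms)} atoms.")
    return match


demo_atoms = [Atom('N'), Atom('H'), Atom('CA')]
found = find_atom_by_label(demo_atoms, 'H')
print(found.label)
```

In the notebook, the call would take the form `find_atom_by_label(protein[f'{chain_label}:{residue_label}'].atoms, atom_label)`.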

### Run docking

Here we run GOLD in `interactive` mode...

Note how the list of solutions is built up during the docking run: as each molecule is docked in turn, the `session.dock` method returns a tuple of solutions for that molecule, which is appended to a list. This list of tuples is later used to write out the solution files using the standard GOLD solution file naming scheme, and is then flattened to build up a table of fitness function components.

In [26]:
# Instantiate a docker...

docker = Docker(settings=settings)

# Start an interactive session...

session = docker.dock(mode='interactive', file_name='api_gold.conf')

session.ligand_preparation = None  # We assume ligand preparation has been done

logger.info(f"GOLD interactive session PID: {session.pid}")

In [27]:
logger.info(f"Starting to dock ligands from input file '{input_file}'.")

solns_by_mol = []  # We will build up a list of tuples of solutions as we dock each mol

with EntryReader(str(input_file)) as reader:

    for n_mol, entry in enumerate(reader, 1):

        mol, name = entry.molecule, entry.identifier
        
        logger.info(f"Starting ligand '{name}'...")

        solns = session.dock(mol)  # Tuple of solutions for this mol
        
        logger.info(f"... done ({len(solns)} solutions).")
        
        solns_by_mol.append(solns)  # Append tuple to list of solutions

logger.info("Finished.")

Close the sockets, if required (the calls below are commented out and use private attributes of the session)...

In [28]:
# session._client_socket.close()

# session._socket.close()

The fitness and its components are available _via_ a flattened list of solutions...

In [29]:
solutions = [y for x in solns_by_mol for y in x]  # Flatten list of tuples
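
The nested comprehension above can also be written with `itertools.chain.from_iterable`, which some readers find easier to scan. The sketch below uses placeholder strings in place of docked solutions to show that the two forms give the same flat list, including when a molecule returned no solutions.

```python
from itertools import chain

# Placeholder strings stand in for docked solutions; the last molecule has none
demo_solns_by_mol = [('a1', 'a2'), ('b1',), ()]

flat_comprehension = [y for x in demo_solns_by_mol for y in x]
flat_chain = list(chain.from_iterable(demo_solns_by_mol))

print(flat_chain)
```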

In [30]:
scores_df = pd.DataFrame([{'identifier': x.identifier, 'fitness': x.fitness(), **x.scoring_term()} for x in solutions])

scores_df.shape

In [31]:
scores_df.head()
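
Because the solutions are kept grouped by molecule in `solns_by_mol`, the top-scoring pose for each ligand can be picked out without the flattened table. The following is a sketch under two assumptions: that a higher fitness is better, which is how GOLD fitness values are usually read, and that `best_per_molecule` and the `Pose` namedtuple are hypothetical illustrations rather than API objects.

```python
from collections import namedtuple

# Hypothetical stand-in for a docked solution with an identifier and a score
Pose = namedtuple('Pose', 'identifier score')


def best_per_molecule(solns_by_mol, key):
    """Return the highest-scoring solution for each molecule that has any."""
    return [max(solns, key=key) for solns in solns_by_mol if solns]


demo = [
    (Pose('lig1_a', 61.2), Pose('lig1_b', 70.5)),
    (Pose('lig2_a', 48.9),),
    (),  # a molecule for which no solutions were returned
]

best = best_per_molecule(demo, key=lambda pose: pose.score)
print([pose.identifier for pose in best])
```

With the real solutions, the key would be `lambda soln: soln.fitness()`, matching the call used to build `scores_df`.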

### Visualization

Now, as we have been talking to GOLD over a socket and we specified write option `MIN_OUT` above, the solutions have not been written to disk at this point. If we wish to visualise them in _e.g._ Hermes, we will need to do this ourselves.

So, write out the solution files using the standard GOLD solution file naming scheme...

In [32]:
stem, suffix = input_file.stem, input_file.suffix[1:]  # For GOLD standard solution file naming scheme

for n_mol, solns in enumerate(solns_by_mol, 1):
    
    for n_soln, soln in enumerate(solns, 1):

        file_name = f'gold_soln_{stem}_m{n_mol}_{n_soln}.{suffix}'  # GOLD standard solution file naming scheme

        with EntryWriter(file_name) as writer: 

            writer.write(soln)
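
The file names produced by the loop above can be previewed without writing anything to disk. The sketch below reproduces only the naming pattern used in that loop; `solution_file_names` is a hypothetical helper and the solution counts are made-up example values.

```python
from pathlib import Path


def solution_file_names(input_file, solution_counts):
    """Yield (molecule index, solution index, file name) using the pattern from the loop above."""
    input_file = Path(input_file)
    stem, suffix = input_file.stem, input_file.suffix[1:]
    for n_mol, count in enumerate(solution_counts, 1):
        for n_soln in range(1, count + 1):
            yield n_mol, n_soln, f'gold_soln_{stem}_m{n_mol}_{n_soln}.{suffix}'


# Example: two solutions for the first molecule, one for the second
demo_names = [name for _, _, name in solution_file_names('input_files/input.sdf', [2, 1])]
print(demo_names)
```

In the notebook, the counts would come from `[len(solns) for solns in solns_by_mol]`.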

Once the solution files have been written, the results of a GOLD run set up and run _via_ the API may be visualized in Hermes by loading the GOLD conf file written by the API...

In [33]:
run_hermes('api_gold.conf')