```
This script can be used for any purpose without limitation subject to the
conditions at http://www.ccdc.cam.ac.uk/Community/Pages/Licences/v2.aspx

This permission notice and the following statement of attribution must be
included in all copies or substantial portions of this script.

2022-06-01: Made available by the Cambridge Crystallographic Data Centre.

```

# Run GOLD using the CSD Python API

In this set of notebooks we illustrate setting up and running dockings _via_ the CSD Python API `docking` module.

There are various modes in which the API can be used to run GOLD...

* In `foreground` mode, the default, the script waits for the docking run to finish before proceeding. This is the simplest mode, but is best used with smaller jobs as there is no way for the calling script to monitor progress.

* In `background` mode, control is returned to the script while GOLD runs. This can be useful as the script can monitor progress by, for example, watching for output files to be created and/or by reading log files.

* In `interactive` mode, the script communicates with GOLD _via_ a socket. This can be very useful in various circumstances, such as where precise monitoring of progress is required.
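
The file-watching idea mentioned for `background` mode can be sketched with the standard library alone. The helper `wait_for_files` below is hypothetical (it is not part of the CSD Python API): it simply polls a directory until a given number of files matching a pattern have appeared or a timeout expires. The pattern to watch for (_e.g._ `gold_soln_*`) is an assumption about how the solution files are named and may need adjusting.

```python
import time
from pathlib import Path


def wait_for_files(directory, pattern, expected, timeout=60.0, poll_interval=1.0):
    """Poll `directory` until at least `expected` files match `pattern`.

    Returns the number of matching files seen when polling stopped; this is
    less than `expected` if the timeout expired first.
    """
    directory = Path(directory)
    deadline = time.monotonic() + timeout

    while True:
        n_found = sum(1 for _ in directory.glob(pattern))

        # Stop as soon as enough files exist, or when we run out of time
        if n_found >= expected or time.monotonic() >= deadline:
            return n_found

        time.sleep(poll_interval)
```

A calling script might use the returned count to report progress, or to decide whether the run appears to have stalled.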

Docking using the API may be configured in various ways...

* All configuration may be performed _via_ API methods. This is particularly useful when attempting to optimise parameter sets, as these may be explored programmatically. Such optimization might be appropriate when investigating a new protein target, or when some particular trade-off between speed and accuracy is required.

* A pre-prepared GOLD conf file may be uploaded to perform the configuration. This is useful when sharing a particular configuration that has been prepared using the GOLD tools in Hermes or programmatically _via_ the API.

* A GOLD conf file may be uploaded and then the configuration modified using API methods.
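
Exploring parameter sets programmatically usually starts by enumerating the combinations to try. The following is a minimal sketch using only the standard library. The helper `parameter_sets` and the particular grid values are illustrative assumptions, and each resulting dictionary would still need to be applied to a `Docker.Settings` object before docking.

```python
from itertools import product


def parameter_sets(grid):
    """Expand a {name: [values, ...]} grid into a list of {name: value} dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[name] for name in names))]


# Illustrative grid only; the values are not recommendations
grid = {
    'fitness_function': ['plp', 'chemscore'],
    'autoscale': [10, 30, 100],
}

combinations = parameter_sets(grid)  # 2 x 3 = 6 parameter sets

for params in combinations:
    print(params)
```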

Various combinations of mode and configuration method are explored in this collection of notebooks.

Note that in all the notebooks, a fresh directory is created for the docking run and made the current working directory, and the docking is then run there; this is in contrast to having GOLD create the output directory. I prefer to run GOLD this way as I find it slightly tidier than the alternative, but it is a matter of opinion.

In this notebook, the docking is configured entirely using the API and GOLD is run in the default `foreground` mode.

#### GOLD docs
* [User Guide](https://www.ccdc.cam.ac.uk/support-and-resources/ccdcresources/GOLD_User_Guide.pdf)
* [Conf file](https://www.ccdc.cam.ac.uk/support-and-resources/ccdcresources/GOLD_conf_file_user_guide.pdf)

#### Docking API docs
* [Descriptive](https://downloads.ccdc.cam.ac.uk/documentation/API/descriptive_docs/docking.html)
* [Module API](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/docking_api.html)

In [1]:
import sys
import os
import shutil
from pathlib import Path
import time
import subprocess

import warnings

sys.path.append('../..')
from ccdc_notebook_utilities import run_hermes, create_logger

In [2]:
with warnings.catch_warnings():
    warnings.filterwarnings(action='ignore', category=DeprecationWarning)  # Ignore current 'distutils Version classes are deprecated' warning
    
    import pandas as pd

In [3]:
from IPython.display import HTML

In [4]:
import ccdc
from ccdc.io import MoleculeReader, EntryReader, EntryWriter
from ccdc.docking import Docker
from ccdc.diagram import DiagramGenerator

### Initialization

In [5]:
logger = create_logger()

[23-05-17 11:21:02 INFO   ] 
Platform:                     Windows-10-10.0.19045-SP0

Python exe:                   C:\Users\cole\Anaconda3\envs\latest_csd_python_api\python.exe
Python version:               3.9.16

CSD version:                  544
CSD directory:                C:/Users/cole/CCDC/ccdc-data/csd
API version:                  3.0.15

CSDHOME:                      C:/Users/cole/CCDC/ccdc-data/csd
CCDC_LICENSING_CONFIGURATION: Not set



### Config

The directory containing the input files for these dockings; directory must exist...

In [6]:
input_dir = Path('input_files').absolute()

Protein target and a native ligand (used to define binding site); files must exist...

In [7]:
target_dir = input_dir / 'target'

protein_file = target_dir / 'protein.mol2'
ligand_file  = target_dir / 'ligand.mol2'

Molecules to dock; file must exist...

In [8]:
input_file = input_dir / 'input.sdf'  # 'input.mol2'
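
Since the input directory and files above must all exist, it can be convenient to fail early with a clear message rather than part-way through a docking run. A minimal sketch, where `require_existing` is a hypothetical helper and not part of the API:

```python
from pathlib import Path


def require_existing(*paths):
    """Raise FileNotFoundError naming every path that does not exist."""
    missing = [str(path) for path in map(Path, paths) if not path.exists()]

    if missing:
        raise FileNotFoundError("Missing input(s): " + ", ".join(missing))
```

In this notebook it could be called as `require_existing(input_dir, protein_file, ligand_file, input_file)`.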

Binding site radius...

In [9]:
radius = 6

Number of dockings (_i.e._ GA runs) per ligand; default is 10...

In [10]:
ndocks = 5  # Set to 5 for speed

Fitness function (options are 'goldscore', 'chemscore', 'asp' and 'plp'; GoldScore is selected by default)...

In [11]:
fitness_function = 'plp'

Autoscale parameter (as a percentage); default is 100%...

In [12]:
autoscale = 30  # Set to 30% for speed

Output directory (will be created)...

In [13]:
output_dir = Path('output_foreground')

Output format (_N.B._ the input file format would be used if the output format is not specified)...

In [14]:
output_format = 'sdf'  # 'mol2'

Write options (see [here](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/docking_api.html?highlight=write_options#ccdc.docking.Docker.Settings.write_options) for available options, and the GOLD Configuration File User Guide, Chapter 16 for more details).

In [15]:
write_options = ['NO_LINK_FILES', 'NO_RNK_FILES', 'NO_GOLD_LIGAND_MOL2_FILE']

Lone pairs are written by default; however, we will turn them off here as they can cause issues for some third-party programs...

In [16]:
save_lone_pairs = False

In [17]:
diagram_generator = DiagramGenerator()

diagram_generator.settings.return_type = 'SVG'
diagram_generator.settings.explicit_polar_hydrogens = False
diagram_generator.settings.shrink_symbols = False

In [18]:
# Utility function to help with display in Jupyter-Lab...

def show_dataframe(df):
    
    return HTML(df.to_html(escape=False).replace(r'\n', ''))

Utility to improve rendering of DataFrames...

In [19]:
show_df = lambda df: df.style.set_properties(**{'text-align': 'left'})

Create a fresh output directory for the docking run...

In [20]:
if output_dir.exists():
    
    logger.warning(f"The output directory '{output_dir}' exists and will be overwritten.")
    
    shutil.rmtree(output_dir)
    
output_dir.mkdir()

os.chdir(output_dir)
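
In a standalone script, as opposed to a notebook, it may be preferable to restore the original working directory once the run has finished. One way to do this is with a context manager. This is a sketch only, and `fresh_working_directory` is a hypothetical helper:

```python
import os
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def fresh_working_directory(path):
    """Create an empty directory at `path`, work inside it, then change back."""
    path = Path(path).resolve()  # Resolve before changing directory

    if path.exists():
        shutil.rmtree(path)      # Discard any previous output

    path.mkdir(parents=True)

    original = Path.cwd()
    os.chdir(path)

    try:
        yield path
    finally:
        os.chdir(original)       # Restore the caller's directory, even on error
```

The docking would then be run inside a `with fresh_working_directory(output_dir):` block.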

In [21]:
logger.info(f"In output dir: '{Path('.').absolute()}'")

### Configure docking

Here, the docking configuration is set up from scratch using the API. We do this by instantiating a `Docker.Settings` object and modifying it _via_ its methods and attributes...

In [22]:
settings = Docker.Settings()

Specify the protein target...

In [23]:
settings.add_protein_file(str(protein_file))

Define the binding site using the native ligand...

In [24]:
native_ligand = MoleculeReader(str(ligand_file))[0]

settings.binding_site = settings.BindingSiteFromLigand(settings.proteins[0], native_ligand, radius)

Specify the input file of compounds to dock...

In [25]:
settings.add_ligand_file(str(input_file), ndocks=ndocks)

Note that the `output_directory` attribute is set by default to the current directory...

In [26]:
logger.info(f"settings.output_directory: '{settings.output_directory}'")

Set other options as specified above...

In [27]:
settings.output_format = output_format

In [28]:
settings.fitness_function = fitness_function

In [29]:
settings.autoscale = autoscale

In [30]:
settings.write_options = write_options

In [31]:
settings.save_lone_pairs = save_lone_pairs

#### Add a protein H-bond constraint

Here we add a protein H-bond constraint to the backbone NH that donates the conserved H-bond in the hinge.

This means the fitness of a docked ligand will be penalised if it doesn't make an H-bond with this atom.  Note that the penalty applied can be modified _via_ the [weight](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/docking_api.html#ccdc.docking.Docker.Settings.ProteinHBondConstraint) parameter.

In [32]:
chain_label, residue_label, atom_label = 'A', 'ALA451', 'H'  # Conserved hinge H-bond donor

In [33]:
protein = settings.proteins[0]

atom = [atom for atom in protein[f'{chain_label}:{residue_label}'].atoms if atom.label == atom_label][0]

settings.add_constraint(settings.ProteinHBondConstraint([atom]))
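
Selecting the atom with `[...][0]` raises a bare `IndexError` if the label is not found, and silently takes the first match if there are several. A slightly more defensive lookup might look like the sketch below. `find_atom` is a hypothetical helper, and the stand-in objects only mimic the `.label` attribute of real API atoms.

```python
from types import SimpleNamespace


def find_atom(atoms, label):
    """Return the single atom in `atoms` whose `.label` equals `label`."""
    matches = [atom for atom in atoms if atom.label == label]

    if len(matches) != 1:
        raise LookupError(f"Expected exactly one atom labelled '{label}', found {len(matches)}.")

    return matches[0]


# Stand-in objects with a `.label` attribute, used here instead of real API atoms
residue_atoms = [SimpleNamespace(label=name) for name in ('N', 'H', 'CA', 'C', 'O')]

donor = find_atom(residue_atoms, 'H')
```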

In [34]:
constraint = settings.constraints[0]

logger.info(f"""
Atom index: {constraint.atoms[0].index}
Weight:     {constraint.weight}
Min. Score: {constraint.min_hbond_score}
""")

### Run docking

Here we run GOLD in `foreground` mode...

Note the status code is checked to see if GOLD exited successfully.

In [35]:
%%time

docker = Docker(settings=settings)

results = docker.dock(mode='foreground', file_name='api_gold.conf') 

assert results.return_code == 0, "Error! GOLD did not run successfully."
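
Note that `assert` statements are skipped when Python is run with the `-O` flag, so in a standalone script an explicit exception may be more robust. A minimal sketch, where `check_gold_exit` is a hypothetical helper and not part of the API:

```python
def check_gold_exit(return_code, conf_file='api_gold.conf'):
    """Raise RuntimeError if GOLD reported a non-zero return code."""
    if return_code != 0:
        raise RuntimeError(
            f"GOLD did not run successfully (return code {return_code}); "
            f"configuration file was '{conf_file}'."
        )
```

It would be called as `check_gold_exit(results.return_code)` after the docking has finished.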

### The Results object

Once the docking has finished, we can examine the output in various ways. For example, the [Results](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/docking_api.html#ccdc.docking.Docker.Results) object can be used to access the solutions as [DockedLigand](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/docking_api.html#ccdc.docking.Docker.Results.DockedLigand) objects, which allow access to all data about the docking.

In [36]:
logger.info(f"No. solutions: {len(results.ligands)}.")

In [37]:
soln = results.ligands[0]

logger.info(f"Name: '{soln.identifier}'; Fitness: {soln.fitness():.2f}")

The `fitness` method of the solution object provides a simple way of accessing the fitness score...

In [38]:
soln.fitness()

#### The Fitness Score

The `scoring_term` method of the solution object gives access to the fitness score and all its various components...

In [39]:
soln.scoring_term()

This can be used to build a table for further analysis (_N.B._ this isn't optimised for efficiency or elegance)...

Note that `fitness` is just a convenient duplicate of, in this case, `Gold.PLP.Fitness`; for other scoring functions the name of the column duplicated by `fitness` will obviously be different. Note also that the solutions for a ligand appear in the order they are found by GOLD: thus they are not sorted by fitness by default.

In [40]:
scores_df = (
        pd.DataFrame([
            {
                'identifier': x.identifier,                # Solution identifier
                'fitness': x.fitness(),                    # Fitness score
                **x.scoring_term(),                        # Fitness score components (column names will differ for different scoring functions)
                'index': int(x.identifier.split('|')[3]),  # For convenience, add index of ligand (i.e. its position in input file)
                'soln': x,                                 # Add a convenient reference to the solution's result object (see below for use)
                
            }
            for x in results.ligands]
        )
)

scores_df.shape
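
The ligand index is extracted above by splitting the solution identifier on `|` and taking the fourth field. A small parser makes that assumption explicit and fails clearly if the identifier has an unexpected shape. This is a sketch only: `parse_identifier` is a hypothetical helper, the assumption that the first field is the ligand name is ours, and the example identifier in the comment is invented to show the layout.

```python
from typing import NamedTuple


class SolutionId(NamedTuple):
    name: str   # Ligand name (assumed to be the first field)
    index: int  # Position of the ligand in the input file (fourth field)


def parse_identifier(identifier):
    """Split a '|'-delimited solution identifier into name and ligand index."""
    fields = identifier.split('|')

    if len(fields) < 4:
        raise ValueError(f"Unexpected identifier format: '{identifier}'")

    return SolutionId(name=fields[0], index=int(fields[3]))


# Hypothetical identifier, for illustration only
example = parse_identifier('5LMA_bad|input|sdf|2|dock3')
```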

In [41]:
scores_df.drop(columns=['soln']).head()

Sort on `fitness` to see the top scoring solutions...

In [42]:
scores_df.sort_values('fitness', ascending=False).drop(columns=['soln']).head()

Examine only solutions for first ligand...

In [43]:
with warnings.catch_warnings():
    warnings.filterwarnings(action='ignore', category=DeprecationWarning)  # Ignore current 'distutils Version classes are deprecated' warning
    
    first_ligand_df = scores_df.query("index == 1").sort_values('fitness', ascending=False).drop(columns=['soln'])

first_ligand_df.head()  # Last top-level expression, so that Jupyter displays the result

#### Other information

Further information, including _e.g._ the mobile atoms of the protein and H-bonds made, is available _via_ the solution's `attributes`...

In [44]:
print(f"""
Rotated Torsions:
{soln.attributes['Gold.Protein.RotatedTorsions']}

Rotated Atoms:
{soln.attributes['Gold.Protein.RotatedAtoms']}

Chemscore Hbonds:
{soln.attributes['Gold.Chemscore.Hbonds']}
""")

#### Examining constraints

The 'Constraint' component of the score shows whether any constraints were satisfied.

Note that a _protein_ hydrogen-bond constraint being satisfied doesn't mean the 'correct' ligand atom made the H-bond to the protein atom.

In [45]:
constraints_df = scores_df[['identifier', 'Gold.PLP.Fitness', 'Gold.PLP.Constraint']].sort_values('Gold.PLP.Fitness', ascending=False)

The top scorers all have the constraint satisfied...

In [46]:
constraints_df.head(5)

There are a couple of cases where the 'bad' ligand fails to make the conserved H-bond and so the constraint penalty is applied...

In [47]:
# constraints_df[constraints_df['soln'].str.match(r'5LMA_bad\|')]

constraints_df[constraints_df['Gold.PLP.Constraint'] < 0.0]

This implies that most solutions for the 'bad' ligand make the H-bond. Inspection in Hermes (see below) confirms that this is indeed the case: the H-bond is made, but not in a wholly realistic manner (as reflected in the generally low scores).
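
The same check can be summarised per ligand by counting how many solutions have a negative constraint term, following the filter used above. The sketch below uses plain Python with invented values. `constraint_summary` is a hypothetical helper, and the rows are made-up (name, constraint term) pairs rather than real docking output.

```python
from collections import defaultdict


def constraint_summary(rows):
    """Return {ligand_name: (n_satisfied, n_total)} from (name, constraint_term) pairs.

    A negative constraint term is treated as a penalty, i.e. constraint not satisfied.
    """
    counts = defaultdict(lambda: [0, 0])

    for name, constraint_term in rows:
        counts[name][1] += 1

        if constraint_term >= 0.0:
            counts[name][0] += 1

    return {name: tuple(pair) for name, pair in counts.items()}


# Invented values, purely to exercise the function
rows = [('lig_good', 0.0), ('lig_good', 0.0), ('lig_bad', -10.0), ('lig_bad', 0.0)]

summary = constraint_summary(rows)
```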

#### API Molecule object

The solution also contains a [Molecule](https://downloads.ccdc.cam.ac.uk/documentation/API/descriptive_docs/molecule.html) object...

In [48]:
soln.molecule.atoms[:5],soln.molecule.bonds[:2]

In [49]:
print('\n'.join(soln.molecule.to_string('mol2').split('\n')[:2]))  # Convert to MOL2

In [50]:
print('\n'.join(soln.molecule.to_string('sdf').split('\n')[:2]))  # Convert to SDF

### Visualization

#### Hermes

The results of a GOLD run set up and run _via_ the API may be visualized in Hermes by loading the GOLD conf file written by the API...

In [51]:
run_hermes('api_gold.conf')

#### Exporting complexes for import into other visualizers

Hermes will automatically adjust rotatable bonds on the protein as each solution is loaded. If another visualizer is to be used, such as PyMOL or YASARA, then the [make_complex](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/docking_api.html#ccdc.docking.Docker.Results.make_complex) method of the `Results` object may be useful. This creates a complex between the protein and a docked solution, adjusting rotatable bonds as required. See [here](https://www.ccdc.cam.ac.uk/support-and-resources/support/case/?caseid=58574edf-72e0-e511-aa29-005056868fc8) for a note on lone pairs in GOLD output.

In [52]:
export_format = 'pdb'  # File format in which to export protein-ligand complexes

complexed = results.make_complex(soln) 

complexed.remove_unknown_atoms()  # Remove lone pairs for export

file_path = f'complexed.{export_format}'

with EntryWriter(file_path) as writer:
    
    writer.write(complexed)

This facility can also be used to export all solutions. In the example below, solutions are exported in descending order of fitness for each input ligand in turn.

In [53]:
complexes_dir = Path('complexes')

complexes_dir.mkdir(exist_ok=True)

In [54]:
for index, df in scores_df.groupby('index'):  # Index is the ligand's position in input file
    
    for rank, soln in enumerate(df.sort_values('fitness', ascending=False)['soln'], 1):  # Rank solutions for ligand by fitness

        complexed = results.make_complex(soln) 

        complexed.remove_unknown_atoms()  # Remove lone pairs for export
        
        file_path = complexes_dir / f'{index:03d}_{rank:02d}.{export_format}'

        with EntryWriter(str(file_path)) as writer:
    
            writer.write(complexed)
        
            logger.info(f"Solution {soln.identifier:30} (fitness {soln.fitness():.1f}) written to file {file_path}.")
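
The file naming scheme used in the loop above (zero-padded ligand index, then zero-padded rank within that ligand) can be illustrated without running a docking. In this sketch `ranked_file_names` is a hypothetical helper and the fitness values are invented:

```python
def ranked_file_names(fitness_by_ligand, extension='pdb'):
    """Return (file_name, fitness) pairs, ranked by descending fitness within each ligand."""
    names = []

    for index in sorted(fitness_by_ligand):  # Ligands in input-file order
        ranked = sorted(fitness_by_ligand[index], reverse=True)

        for rank, fitness in enumerate(ranked, 1):  # Rank 1 is the best solution
            names.append((f'{index:03d}_{rank:02d}.{extension}', fitness))

    return names


# Invented fitness values keyed by ligand index
file_names = ranked_file_names({2: [50.0, 60.0], 1: [40.0]})
```

Zero-padding keeps the files in ligand and rank order when they are sorted alphabetically by a file browser or visualizer.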

### Reloading results

A docking run _via_ the API may also be configured by reading in a `gold.conf` file using the `Docker.Settings.from_file` method (note that the settings may subsequently be modified). Setting up a GOLD docking this way is demonstrated in other notebooks in this directory.

The same method may be used to read in the `gold.conf` file from a completed docking run, which gives access to a recreated `results` object that can be used for analyses of the docking as illustrated above.

In [55]:
settings = Docker.Settings.from_file('api_gold.conf')

docker = Docker(settings=settings)

results = docker.results

In [56]:
len(results.ligands)

In [57]:
soln = results.ligands[0]

soln.identifier