# Preparing ligands for GOLD docking using RDKit

This notebook illustrates how to prepare a set of ligands for GOLD docking using RDKit.

For optimal performance, GOLD requires good-quality 3D ligand structures as input. Only a single conformer per ligand is needed, as GOLD performs flexible docking starting from the input structure. The [Conformer API](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/conformer_api.html) can be used to generate a 3D structure that reflects conformational preferences observed in the CSD and thus provides ideal input for GOLD. This process is illustrated in the notebook [00a_Input_for_GOLD](./00a_Input_for_GOLD.ipynb).

However, if this tool is not available (_e.g._ for licensing reasons), we recommend the use of [RDKit](http://rdkit.org/) to generate an initial 3D structure.

Note that it is assumed that the input structures are all in the desired charge and tautomeric states. No protonation/deprotonation or tautomer standardization/enumeration is done here.

In [None]:
from platform import platform
import sys
import os
from pathlib import Path
import logging
import re
import csv

In [None]:
from rdkit import Chem
from rdkit.Chem import AllChem

In [None]:
import ccdc
from ccdc.molecule import Molecule
from ccdc.entry import Entry
from ccdc.io import EntryWriter

#### Config

The directory containing the input files for docking; directory must exist...

In [None]:
input_dir = Path('input_files')

CSV file of input structures as SMILES with Names...

In [None]:
input_csv = input_dir / 'input.csv'

smiles_col, name_col = 'smiles', 'name'  # Required columns
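
As a rough illustration of the expected input, the sketch below parses a small in-memory CSV with the two required columns plus one extra data column, skipping rows whose first field begins with `#`. The sample rows and the `pIC50` column are hypothetical, made up for this example; only the `smiles` and `name` column names come from the configuration above.

```python
import csv
import io
import re

# Hypothetical sample input: the rows and the 'pIC50' column are invented for illustration.
sample_csv = """smiles,name,pIC50
CCO,ethanol,4.2
# c1ccccc1,benzene,5.0
CC(=O)O,acetic acid,3.9
"""

comment = re.compile(r'^\s*#')  # Same pattern as used later in this notebook

reader = csv.DictReader(io.StringIO(sample_csv))

first_col = reader.fieldnames[0]

# Keep only rows whose first field does not start with '#'
records = [row for row in reader if not comment.match(row[first_col])]

for record in records:
    print(record['name'], record['smiles'], record['pIC50'])
```

Because every column is read into the record dictionary, extra data columns such as the hypothetical `pIC50` travel with each ligand and can be written to the output file alongside the structure.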

Output file for this script (which is the _input_ file for GOLD); note that the file extension determines the format...

In [None]:
output_file = input_dir / 'input.sdf' 

Set `minimization_attempts` to a positive integer to enable MMFF minimization. Note that RDKit no longer [recommends minimisation](https://www.rdkit.org/docs/GettingStartedInPython.html#working-with-3d-molecules).
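
The retry logic that this setting controls relies on Python's `for`/`else` construct, which can be sketched without RDKit as follows. The `fake_optimize` function is a hypothetical stand-in that mimics the return convention of `AllChem.MMFFOptimizeMolecule` (0 when converged, 1 when more iterations are needed); it is not part of RDKit, and `minimize` is likewise a name introduced only for this illustration.

```python
def make_fake_optimizer(converges_on_call):
    """Return a hypothetical optimizer that reports convergence (0) on the given call, else 1."""
    calls = {'n': 0}

    def fake_optimize():
        calls['n'] += 1
        return 0 if calls['n'] >= converges_on_call else 1

    return fake_optimize


def minimize(optimize, attempts):
    """Call optimize() up to 'attempts' times; return True if it reported convergence."""
    for _ in range(attempts):
        if optimize() == 0:
            break  # Converged: the else clause is skipped
    else:
        return False  # Loop ended without a break (or never ran, if attempts is 0)
    return True


print(minimize(make_fake_optimizer(3), attempts=5))  # Converges on the third call
print(minimize(make_fake_optimizer(3), attempts=2))  # Runs out of attempts
print(minimize(make_fake_optimizer(1), attempts=0))  # No attempts made
```

In the notebook itself, the retry loop sits inside an `if minimization_attempts:` check, so no warning is logged when minimization is disabled.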

In [None]:
minimization_attempts = 0

#### Initialization

In [None]:
# Get logger and configure if necessary...

logger = logging.getLogger(__name__)

if not logger.hasHandlers():
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter('[%(asctime)s %(levelname)-7s] %(message)s', datefmt='%y-%m-%d %H:%M:%S'))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

In [None]:
logger.info(f"""
Platform:                     {platform()}

Python exe:                   {sys.executable}
Python version:               {'.'.join(str(x) for x in sys.version_info[:3])}

CSD version:                  {ccdc.io.csd_version()}
CSD directory:                {ccdc.io.csd_directory()}
API version:                  {ccdc.__version__}

CSDHOME:                      {os.environ.get('CSDHOME', 'Not set')}
CCDC_LICENSING_CONFIGURATION: {os.environ.get('CCDC_LICENSING_CONFIGURATION', 'Not set')}
""")

In [None]:
# Check that all required files and directories exist...

for directory in [input_dir]: assert directory.exists(), f"Error! Required directory '{directory}' not found."

for file in [input_csv]: assert file.exists(), f"Error! Required file '{file}' not found."

In [None]:
comment = re.compile(r'^\s*#')  # Pattern to match comment lines in CSV files etc.

### Load SMILES input from CSV file and create a 3D input file for GOLD

Recall that the SMILES and Name columns are required. All columns in the input CSV file are written to the output file, including the SMILES, Name and any data columns that might be present.
This is done because, in practice, it is convenient to keep such data associated with each structure throughout the docking process.
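
The per-ligand conversion performed in the loop below can be sketched in isolation as a small helper. This is a minimal illustration rather than the notebook's own code: the `smiles_to_3d` function name and the fixed `randomSeed` (used only to make the embedding reproducible) are assumptions introduced here.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def smiles_to_3d(smiles, name, random_seed=42):
    """Return an RDKit mol with explicit Hs and one 3D conformer, or None on failure."""
    mol = Chem.MolFromSmiles(smiles)  # Returns None if the SMILES cannot be parsed
    if mol is None:
        return None
    mol = Chem.AddHs(mol)  # Explicit hydrogens are needed for sensible 3D geometry
    if AllChem.EmbedMolecule(mol, randomSeed=random_seed) == -1:  # -1 signals embedding failure
        return None
    mol.SetProp('_Name', name)  # Becomes the title line of the molblock
    return mol


mol = smiles_to_3d('CCO', 'ethanol')
print(mol.GetNumAtoms(), mol.GetNumConformers(), mol.GetConformer().Is3D())
print(Chem.MolToMolBlock(mol).splitlines()[0])  # First line of the molblock is the name
```

Returning `None` for both unparsable SMILES and failed embeddings lets the caller log a warning and skip the ligand, rather than failing partway through a batch.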

In [None]:
logger.info("Starting to process ligands...")

with input_csv.open() as file:
    
    reader = csv.DictReader(file)
    
    assert all(col in reader.fieldnames for col in [smiles_col, name_col]), f"Error! Required column missing from '{input_csv}'."  # Ensure required columns are present
    
    first_col = reader.fieldnames[0]
    
    with EntryWriter(output_file) as writer:

        for index, record in enumerate(x for x in reader if not comment.match(x[first_col])):
            
            smiles, name = record[smiles_col], record[name_col]
            
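            rdk_mol = Chem.MolFromSmiles(smiles)  # Convert SMILES to 2D RDKit mol (None if the SMILES cannot be parsed)

            if rdk_mol is None:

                logger.warning(f"RDKit: could not parse SMILES '{smiles}' for mol '{name}'.")

                continue  # Skip this ligand

            rdk_mol.SetProp('_Name', name)  # _Name is a special property that becomes the name in the molblock header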

            # Convert 2D RDKit mol to 3D...
            
            rdk_mol = Chem.AddHs(rdk_mol)  # Hs are required for 3D structure generation (N.B. also copies the mol)
    
            if AllChem.EmbedMolecule(rdk_mol) == -1:  # Generate 3D coordinates
    
                logger.warning(f"RDKit: embedding failed for mol '{name}'.")
        
                continue  # Skip this ligand, as it has no 3D coordinates
    
            if minimization_attempts:  # Optional MMFF minimization

                for n in range(minimization_attempts):

                    if AllChem.MMFFOptimizeMolecule(rdk_mol) == 0: break

                else:

                    logger.warning(f"RDKit: minimisation did not finish after maximum of {minimization_attempts} attempts for mol '{name}'.")
                    
                    # N.B. We don't currently skip if the (optional) minimisation doesn't finish

            # Convert to API Molecule object via a string representation (i.e. the molblock)...

            api_mol = Molecule.from_string(Chem.MolToMolBlock(rdk_mol))

            # Standardize molecular representation to ensure conformance with CSD conventions.
            # N.B. This is not essential here, as GOLD will perform the normalizations it needs.

            # api_mol.remove_hydrogens()
            # api_mol.assign_bond_types(which='unknown')
            # api_mol.standardise_delocalised_bonds()
            # api_mol.standardise_aromatic_bonds()
            # api_mol.add_hydrogens()
        
            # Create an API entry object from the molecule and write to the SDF file...
        
            entry = Entry.from_molecule(api_mol, index=index, **record)

            writer.write(entry)
            
            logger.info(f"{index:3d}) completed mol '{name}'.")
            
logger.info("Finished.")