```
This script can be used for any purpose without limitation subject to the
conditions at http://www.ccdc.cam.ac.uk/Community/Pages/Licences/v2.aspx

This permission notice and the following statement of attribution must be
included in all copies or substantial portions of this script.

2022-06-01: Made available by the Cambridge Crystallographic Data Centre.

```

# Maximum Common Substructure searching the CSD

The [Descriptors API](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/descriptors_api.html) contains a [Maximum Common Substructure](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/descriptors_api.html?highlight=mcs#ccdc.descriptors.MolecularDescriptors.MaximumCommonSubstructure) (MCS) tool, which we will illustrate in this Notebook.

The Maximum Common Substructure algorithm is intrinsically slow, so it cannot practically be used for database searching on its own. We therefore first do a similarity search and then compute the MCS for each entry in the hitlist. See [this](02_Similarity_searching_the_CSD.ipynb) notebook for more details of similarity searching _via_ the CSD API.

The Tanimoto similarity measure is currently used for the initial search. Note that this is the conventional similarity measure based on a 2D molecular fingerprint: no 3D information is used here. The fingerprint used is path-based and related to the old [Daylight fingerprint](https://daylight.com/dayhtml/doc/theory/theory.finger.html): more details are given in the following publication: https://journals.iucr.org/j/issues/2010/02/00/kk5057/index.html.
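
As a reminder of what the Tanimoto coefficient measures, here is a minimal pure-Python sketch that works on sets of 'on' bit positions. The function name `tanimoto` and the toy fingerprints are assumptions made for illustration; this is not how the CSD API computes or stores its fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    common = len(fp_a & fp_b)  # Bits set in both fingerprints
    union = len(fp_a | fp_b)   # Bits set in either fingerprint
    return common / union if union else 0.0


# Toy fingerprints (hypothetical bit positions, for illustration only)
query_fp = {1, 4, 7, 9, 12, 15}
hit_fp = {1, 4, 7, 9, 20}

print(tanimoto(query_fp, hit_fp))
```

Because the denominator counts bits from both molecules, a hit that contains the whole query plus a large extra group is penalized for the extra bits. This is the sense in which Tanimoto is a whole-molecule measure.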

Tanimoto similarity searching is not the ideal precursor to MCS matching, as it is a whole-molecule measure. The Tversky similarity measure is recognized as being potentially more useful in conjunction with MCS; some example references are given below:

* https://link.springer.com/article/10.1007/s10822-016-9935-y
* https://pubs.acs.org/doi/abs/10.1021/ci5005702
* https://www.frontiersin.org/articles/10.3389/fphar.2016.00266/full
* https://link.springer.com/article/10.1186/s13321-017-0198-y

Unfortunately, Tversky similarity is not yet implemented in our API, so this cannot be investigated further at this time. However, the feature has been requested and, as it should not be difficult to implement, it should be available fairly soon. When it is available, this workflow will be revisited.
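
For reference, the Tversky index can be sketched in a few lines of pure Python. It generalizes Tanimoto by weighting the bits unique to each molecule separately (the weights `alpha` and `beta`). The function name `tversky`, the default weights and the toy fingerprints are assumptions for illustration only; nothing here uses or reflects the CSD API.

```python
def tversky(fp_a, fp_b, alpha=1.0, beta=0.0):
    """Tversky index of two fingerprints given as sets of 'on' bit positions.

    alpha weights bits found only in fp_a; beta weights bits found only in fp_b.
    alpha = beta = 1 reproduces the Tanimoto coefficient.
    """
    common = len(fp_a & fp_b)
    only_a = len(fp_a - fp_b)
    only_b = len(fp_b - fp_a)
    denominator = common + alpha * only_a + beta * only_b
    return common / denominator if denominator else 0.0


# Toy fingerprints (hypothetical): every bit of the fragment is present in the larger molecule
fragment_fp = {1, 4, 7}
molecule_fp = {1, 4, 7, 9, 12, 15, 20, 23}

symmetric = tversky(fragment_fp, molecule_fp, alpha=1.0, beta=1.0)   # Same as Tanimoto
asymmetric = tversky(fragment_fp, molecule_fp, alpha=1.0, beta=0.0)  # Ignores the molecule's extra bits

print(symmetric, asymmetric)
```

With these toy sets the symmetric score is 3/8, whereas the asymmetric score is 1.0 because the extra bits of the larger molecule carry no penalty. That asymmetry is what makes the measure attractive as a pre-filter for substructure-like matching.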

In [None]:
from platform import platform
import sys
import os
from pathlib import Path
import logging

import warnings

In [None]:
import pandas as pd

import plotly.express as px

In [None]:
from IPython.display import HTML

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [None]:
import ccdc
from ccdc.diagram import DiagramGenerator
from ccdc.io import MoleculeReader
from ccdc.search import SimilaritySearch
from ccdc.descriptors import MolecularDescriptors

### Configuration

### Initialization

In [None]:
logger = logging.getLogger(__name__)

if not logger.hasHandlers():
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter('[%(asctime)s %(levelname)-7s] %(message)s', datefmt='%y-%m-%d %H:%M:%S'))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

In [None]:
# Information useful for debugging...

logger.info(f"""
Platform:                     {platform()}

Python exe:                   {sys.executable}
Python version:               {'.'.join(str(x) for x in sys.version_info[:3])}

CSD version:                  {ccdc.io.csd_version()}
CSD directory:                {ccdc.io.csd_directory()}
API version:                  {ccdc.__version__}

CSDHOME:                      {os.environ.get('CSDHOME', 'Not set')}
CCDC_LICENSING_CONFIGURATION: {os.environ.get('CCDC_LICENSING_CONFIGURATION', 'Not set')}
""")

Set up a CCDC Diagram Generator...

In [None]:
diagram_generator = DiagramGenerator()

diagram_generator.settings.return_type = 'SVG'
diagram_generator.settings.explicit_polar_hydrogens = False
diagram_generator.settings.shrink_symbols = False

Utility to help with display in JupyterLab...

In [None]:
show_df = lambda df: HTML(df.to_html(escape=False).replace(r'\n', ''))

# show_df = lambda df: df.style.set_properties(**{'text-align': 'left'})

<a id="mol_prep"></a>

### Query Molecule Preparation

We will use Lapatinib as our query molecule.

First, we use a query loaded from a molfile. As this was generated using MarvinSketch and does not have hydrogens added, we add hydrogens and normalize the bond types to CSD conventions...

In [None]:
with MoleculeReader('Lapatinib.mol') as reader:
    
    query_mol = reader[0]

query_mol.add_hydrogens()

query_mol.assign_bond_types(which='all')

In [None]:
HTML(diagram_generator.image(query_mol))

### Similarity Searching

First, we will use the Search API to search the CSD using a similarity query. 

In [None]:
threshold = 0.5  # Tanimoto similarity threshold to use in the search

Instantiate a similarity-searcher object...

In [None]:
searcher = SimilaritySearch(query_mol, threshold=threshold)

We can set CSD search filters _via_ the searcher's [settings](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/search_api.html#ccdc.search.Search.Settings) object; this ensures hits are of sufficient resolution _etc._ (we use a 'subroutine' to do this as we'll need to apply these settings again below)...

In [None]:
def set_search_filters(searcher):
    
    settings = searcher.settings
    
    settings.has_3d_coordinates = True
    settings.max_r_factor       = 5  # N.B. percentage, not fraction; the equivalent value in ConQuest is 0.05
    settings.no_disorder        = 'Non-hydrogen'
    settings.no_errors          = True
    settings.not_polymeric      = True
    settings.no_ions            = False
    settings.no_powder          = True
    settings.only_organic       = True

In [None]:
set_search_filters(searcher)

Search the CSD using the similarity query...

In [None]:
%%time

hits = searcher.search()

len(hits)

Examine a table of the hits...

In [None]:
hits_df = pd.DataFrame(
            data=[(hit.identifier, hit.similarity, hit.entry.synonyms[0] if hit.entry.synonyms else '', diagram_generator.image(hit.molecule)) for hit in hits],
            columns=['Refcode', 'Similarity', 'Name', 'Depiction']
            )

hits_df.shape

In [None]:
show_df(hits_df)

### MCS matching of similarity-search hits

We will now look at MCS matching on the hits from the similarity search. This set is of a more tractable size, and should contain molecules with relevant structures.

First, we instantiate an MCS searcher object...

In [None]:
mcs = MolecularDescriptors.MaximumCommonSubstructure()

Use this object to examine the first hit...

In [None]:
hit = hits[0]

This is a salt, and therefore contains multiple components...

In [None]:
HTML(diagram_generator.image(hit.molecule))

For convenience, extract the list of component molecules from the hit molecule...

In [None]:
mols = hit.molecule.components

len(mols)

We will examine these components individually.

The first component in the hit is the bioactive component...

In [None]:
atoms, bonds = mcs.search(query_mol, mols[0])

len(atoms), len(bonds)

Note that the values returned by the MCS [search](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/descriptors_api.html?highlight=maximumcommonsubstructure#ccdc.descriptors.MolecularDescriptors.MaximumCommonSubstructure.search) are tuples of pairs of matching atoms or bonds in the MCS shared by the query and the hit. The atoms can be used to highlight the MCS in the query and hit using the Diagram API [diagram generator](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/diagram_api.html?highlight=diagram_generator#ccdc.diagram.DiagramGenerator).
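
To make the shape of these return values concrete, here is a small sketch that splits such a tuple of pairs into two parallel lists. Plain strings stand in for the API's atom objects, and the helper name `split_pairs` is hypothetical. The ordering within each pair (query item first, hit item second) is assumed from the argument order used in this notebook.

```python
def split_pairs(pairs):
    """Split a sequence of (query_item, hit_item) pairs into two parallel lists."""
    query_items = [pair[0] for pair in pairs]
    hit_items = [pair[1] for pair in pairs]
    return query_items, hit_items


# Stand-in for the 'atoms' value: strings are used here in place of atom objects
atom_pairs = (('C1', 'C7'), ('C2', 'C8'), ('N1', 'N3'))

query_atoms, hit_atoms = split_pairs(atom_pairs)

print(query_atoms)
print(hit_atoms)
```

The first list is what gets highlighted on the query depiction and the second on the hit depiction; the number of pairs is the size of the MCS.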

Query...

In [None]:
HTML(diagram_generator.image(query_mol, highlight_atoms=[x[0] for x in atoms]))

Hit...

In [None]:
HTML(diagram_generator.image(mols[0], highlight_atoms=[x[1] for x in atoms]))

The second component in the hit is the counterion; the MCS obtained is thus not really meaningful...

In [None]:
atoms, bonds = mcs.search(query_mol, mols[1])

len(atoms), len(bonds)

Query...

In [None]:
HTML(diagram_generator.image(query_mol, highlight_atoms=[x[0] for x in atoms]))

Hit...

In [None]:
HTML(diagram_generator.image(mols[1], highlight_atoms=[x[1] for x in atoms]))

Thus, the MCS can be used to identify the 'best' component, _i.e._ that which shares the largest MCS with the query mol.

Let us make another table of the hits, this time using MCS-based depictions like those shown above.

For this, we will use a utility function to find the best component for each hit, as described above...

In [None]:
# Count atoms in query for use in calculating 'MCS fraction' (N.B. this is all atoms including hydrogens)...

n_query_atoms = len(query_mol.atoms)

n_query_atoms

In [None]:
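def get_best(hit):
    """Return MCS size, MCS fraction and highlighted depictions for the hit component that best matches the query."""
    
    mcs_size, best_mol, best_atoms = 0, None, None

    for mol in hit.molecule.components:

        atoms, bonds = mcs.search(query_mol, mol)  # Tuples of pairs of matching atoms and bonds
        
        if len(atoms) > mcs_size: 

            mcs_size = len(atoms)

            best_mol, best_atoms = mol, atoms

    if best_mol is None:  # No component shares any atoms with the query, so there is nothing to highlight

        return 0, 0.0, diagram_generator.image(query_mol), ''
            
    mcs_fraction = round(mcs_size / n_query_atoms, 2)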
            
    query_image = diagram_generator.image(query_mol, highlight_atoms=[x[0] for x in best_atoms])
    
    best_image  = diagram_generator.image(best_mol, highlight_atoms=[x[1] for x in best_atoms])
            
    return mcs_size, mcs_fraction, query_image, best_image

Test the utility function on the first hit...

In [None]:
mcs_size, mcs_fraction, query_image, best_image = get_best(hit)

In [None]:
mcs_size, mcs_fraction

In [None]:
HTML(query_image)

In [None]:
HTML(best_image)

This seems satisfactory, so let's create the table...

In [None]:
%%time

hits_df = pd.DataFrame(
            data=[(hit.identifier, hit.similarity,  hit.entry.synonyms[0] if hit.entry.synonyms else '', *get_best(hit)) for hit in hits],
            columns=['Refcode', 'Similarity', 'Name', 'MCS size', 'MCS fraction', 'Query', 'MCS Hit']
            )

hits_df.shape

In [None]:
show_df(hits_df)

We could order the table by metrics other than the initial similarity, such as the MCS fraction...

In [None]:
# show_df(hits_df.sort_values('MCS fraction', ascending=False))

We could also potentially plot summary figures...

In [None]:
with warnings.catch_warnings():
    warnings.filterwarnings(action='ignore', category=DeprecationWarning)  # Ignore current 'distutils Version classes are deprecated' warning
    
    px.scatter(hits_df, x='Similarity', y='MCS fraction')