```
This script can be used for any purpose without limitation subject to the
conditions at http://www.ccdc.cam.ac.uk/Community/Pages/Licences/v2.aspx

This permission notice and the following statement of attribution must be
included in all copies or substantial portions of this script.

2022-06-01: Made available by the Cambridge Crystallographic Data Centre.

```

# Similarity searching the CSD

Similarity searching currently uses the Tanimoto similarity measure. This is the conventional similarity measure based on a 2D molecular fingerprint; no 3D information is used here. The fingerprint is path-based and related to the old [Daylight fingerprint](https://daylight.com/dayhtml/doc/theory/theory.finger.html). More details are given in the following publication: https://journals.iucr.org/j/issues/2010/02/00/kk5057/index.html.
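
To make the measure concrete, here is a minimal sketch of the Tanimoto coefficient on two toy fingerprints, each represented as a set of 'on' bit positions. The function name `tanimoto` and the bit positions are illustrative assumptions; this is not the CSD fingerprint or the CSD Python API implementation, only the underlying formula (shared bits divided by bits set in either fingerprint).

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # Convention chosen for this sketch when both fingerprints are empty
    return len(a & b) / union


# Toy fingerprints (made-up bit positions, for illustration only)
fp_query = {1, 4, 7, 9, 12}
fp_hit = {1, 4, 9, 15}

# 3 shared bits out of 6 distinct bits set in either fingerprint
similarity = tanimoto(fp_query, fp_hit)
print(f"Tanimoto similarity: {similarity:.2f}")
```

Identical fingerprints score 1.0 and fingerprints with no bits in common score 0.0, which is why a threshold between 0 and 1 is chosen for the search later on.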

In [None]:
from platform import platform
import sys
import os
from pathlib import Path
import logging

import warnings

In [None]:
import pandas as pd

In [None]:
from IPython.display import HTML

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [None]:
import ccdc
from ccdc.diagram import DiagramGenerator
from ccdc.io import MoleculeReader
from ccdc.molecule import Molecule
from ccdc.search import SimilaritySearch
from ccdc.conformer import ConformerGenerator

### Configuration

### Initialization

In [None]:
logger = logging.getLogger(__name__)

if not logger.hasHandlers():
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter('[%(asctime)s %(levelname)-7s] %(message)s', datefmt='%y-%m-%d %H:%M:%S'))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

In [None]:
# Information useful for debugging...

logger.info(f"""
Platform:                     {platform()}

Python exe:                   {sys.executable}
Python version:               {'.'.join(str(x) for x in sys.version_info[:3])}

CSD version:                  {ccdc.io.csd_version()}
CSD directory:                {ccdc.io.csd_directory()}
API version:                  {ccdc.__version__}

CSDHOME:                      {os.environ.get('CSDHOME', 'Not set')}
CCDC_LICENSING_CONFIGURATION: {os.environ.get('CCDC_LICENSING_CONFIGURATION', 'Not set')}
""")

Set up a CCDC Diagram Generator...

In [None]:
diagram_generator = DiagramGenerator()

diagram_generator.settings.return_type = 'SVG'
diagram_generator.settings.explicit_polar_hydrogens = False
diagram_generator.settings.shrink_symbols = False

Utility to help with display in JupyterLab...

In [None]:
show_df = lambda df: HTML(df.to_html(escape=False).replace(r'\n', ''))

# show_df = lambda df: df.style.set_properties(**{'text-align': 'left'})

<a id="mol_prep"></a>

### Query Molecule Preparation

We will use Lapatinib as our query molecule.

First, we use a query loaded from a molfile. As this was generated using MarvinSketch and does not have hydrogens added, we add hydrogens and normalize the bond types to CSD conventions...

In [None]:
with MoleculeReader('Lapatinib.mol') as reader:
    
    query_mol = reader[0]

query_mol.add_hydrogens()

query_mol.assign_bond_types(which='all')

In [None]:
HTML(diagram_generator.image(query_mol))

### Similarity Searching

First, we will use the Search API to search the CSD using a similarity query. 

Choose a Tanimoto similarity threshold to use in the search...

In [None]:
threshold = 0.5  

Instantiate a similarity-searcher object...

In [None]:
searcher = SimilaritySearch(query_mol, threshold=threshold)

We will set CSD search filters _via_ the searcher's [settings](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/search_api.html#ccdc.search.Search.Settings) object; this ensures hits are of sufficient resolution _etc._ (we use a 'subroutine' to do this as we'll need to apply these settings again below)...

In [None]:
def set_search_filters(searcher):
    
    settings = searcher.settings
    
    settings.has_3d_coordinates = True
    settings.max_r_factor       = 5  # NB. Percentage, not fraction; equivalent to 0.05 in ConQuest
    settings.no_disorder        = 'Non-hydrogen'
    settings.no_errors          = True
    settings.not_polymeric      = True
    settings.no_ions            = False
    settings.no_powder          = True
    settings.only_organic       = True

In [None]:
set_search_filters(searcher)

Search the CSD using the similarity query...

In [None]:
%%time

hits = searcher.search()

len(hits)

Create a table of the hits...

In [None]:
hits_df = pd.DataFrame(
            data=[(hit.identifier, hit.similarity, hit.entry.synonyms[0] if hit.entry.synonyms else '', diagram_generator.image(hit.molecule)) for hit in hits],
            columns=['Refcode', 'Similarity', 'Name', 'Depiction']
            )

hits_df.shape

In [None]:
show_df(hits_df)

### Query from SMILES

A SMILES string can be used to build a query, although there are some caveats. Due to differences in the aromaticity model used by the CSD and _e.g._ RDKit, care must be taken to prepare the query properly in some cases. For example, Lapatinib contains a furan, which is aromatic in RDKit SMILES but typically is not in the CSD. To get correct bond types for a CSD search, the [assign_bond_types](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/molecule_api.html?highlight=assign_bond_types#ccdc.molecule.Molecule.assign_bond_types) method should be called, which currently means that a conformer must be generated first (which assigns sites).

In [None]:
smiles = 'CS(=O)(=O)CCNCc1ccc(-c2ccc3ncnc(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4)c3c2)o1'  # RDKit SMILES, with aromatic furan

Create a molecule from SMILES...

In [None]:
query_mol_2 = Molecule.from_string(smiles, format='smiles')

Reassign bond orders...

In [None]:
conformer_generator = ConformerGenerator()       # Instantiate a conformer generator
conformer_generator.settings.max_conformers = 1  # Only one conformer will be needed

query_mol_2 = conformer_generator.generate(query_mol_2)[0].molecule  # Assigns sites to atoms

query_mol_2.assign_bond_types(which='all')  # Requires atoms to have sites

Run search...

In [None]:
searcher_2 = SimilaritySearch(query_mol_2, threshold=threshold)

In [None]:
set_search_filters(searcher_2)

In [None]:
%%time

hits_2 = searcher_2.search()

len(hits_2)

In [None]:
hits_2_df = pd.DataFrame(
            data=[(hit.identifier, hit.similarity, hit.entry.synonyms[0] if hit.entry.synonyms else '', diagram_generator.image(hit.molecule)) for hit in hits_2],
            columns=['Refcode', 'Similarity', 'Name', 'Depiction']
            )

hits_2_df.shape

The hits are the same as obtained using the molfile...

In [None]:
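# Note the final call: a bare `.all` (without parentheses) is a bound method, which is always truthy
assert (hits_df[['Refcode', 'Similarity']] == hits_2_df[['Refcode', 'Similarity']]).all().all()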
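
An elementwise `==` between two DataFrames requires identical shape and labels, and compares floating-point similarities exactly. As a hedged alternative, the sketch below compares two hit lists in plain Python, ignoring order and allowing a small tolerance on the similarity values. The helper name `same_hits`, the tolerance and the refcodes in the example are all hypothetical; it also assumes each refcode appears at most once per hit list.

```python
import math


def same_hits(hits_a, hits_b, tol=1e-6):
    """Return True if two iterables of (refcode, similarity) pairs agree.

    Order is ignored; similarities must match within `tol`.
    Assumes each refcode occurs at most once per iterable.
    """
    a = dict(hits_a)
    b = dict(hits_b)
    if a.keys() != b.keys():
        return False  # Different sets of refcodes
    return all(math.isclose(a[ref], b[ref], abs_tol=tol) for ref in a)


# Made-up refcodes and similarities, for illustration only
from_molfile = [('AAAAAA', 1.0), ('BBBBBB', 0.62)]
from_smiles = [('BBBBBB', 0.62), ('AAAAAA', 1.0)]

print(same_hits(from_molfile, from_smiles))
```

With the tables built above, the inputs could be formed as `zip(hits_df['Refcode'], hits_df['Similarity'])` and the equivalent for `hits_2_df`.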