# Similarity searching the CSD

Similarity searching currently uses the Tanimoto similarity measure. This is the conventional measure computed from a 2D molecular fingerprint: no 3D information is used. The fingerprint is path-based and related to the old [Daylight fingerprint](https://daylight.com/dayhtml/doc/theory/theory.finger.html); more details are given in [this publication](https://journals.iucr.org/j/issues/2010/02/00/kk5057/index.html).
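As a rough illustration of the measure itself, the Tanimoto coefficient of two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below assumes fingerprints are represented as plain Python sets of on-bit positions; the function name `tanimoto` and the bit positions are made up for illustration, and this does not reproduce how the CSD fingerprint is generated.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    n_either = len(a | b)  # Bits set in at least one fingerprint
    if n_either == 0:
        return 0.0  # Convention chosen for this sketch when both fingerprints are empty
    return len(a & b) / n_either  # Shared bits divided by bits set in either


# Illustrative (made-up) on-bit positions for a query and a candidate molecule
fp_query = {1, 4, 7, 9, 12}
fp_candidate = {1, 4, 9, 15}

score = tanimoto(fp_query, fp_candidate)
print(score)
```

Here three bits are shared and six distinct bits are set overall, so the score is 0.5. Identical fingerprints score 1.0 and fingerprints with no bits in common score 0.0.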

In [None]:
%run ../Discovery_Notebook_Utils.py

In [None]:
from ccdc.search import SimilaritySearch

### Configuration

### Initialization

In [None]:
logger.info(script_info)

<a id="mol_prep"></a>

### Query Molecule Preparation

We will use Lapatinib as our query molecule (see also the [Conformer API](../04_Conformer_generation/Conformer_generation.ipynb#mol_prep) tutorial).

First, we use a query loaded from a MOL-format molfile...

In [None]:
query_file = 'Lapatinib_from_MarvinSketch.mol'  # Exported from MarvinSketch

with MoleculeReader(query_file) as reader:
    query = reader[0]  # First molecule in the file

In [None]:
# print(query.to_string('mol'))

In [None]:
mol2html(query)

In [None]:
query.add_hydrogens()

query.assign_bond_types(which='all')

In [None]:
# print(query.to_string('mol'))

Depict the CCDC molecule...

In [None]:
mol2html(query)

### Similarity Searching

First, we will use the Search API to search the CSD using a similarity query. 

Instantiate a searcher object...

In [None]:
threshold = 0.5  # Tanimoto similarity threshold to use in the search

In [None]:
searcher = SimilaritySearch(query, threshold=threshold)

We can set CSD search filters _via_ the searcher's [settings](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/search_api.html#ccdc.search.Search.Settings) object, for example to restrict hits to structures of sufficient quality. The filters below are commented out; uncomment them to apply...

In [None]:
# settings = searcher.settings

# settings.has_3d_coordinates = True
# settings.max_r_factor       = 5  # %
# settings.no_disorder        = 'Non-hydrogen' # No disorder in heavy atoms allowed
# settings.no_errors          = True
# settings.not_polymeric      = True
# settings.no_ions            = False
# settings.no_powder          = True
# settings.only_organic       = True

Search the CSD using the similarity query...

In [None]:
%%time

hits = searcher.search()

len(hits)

Examine a table of hits...

In [None]:
hits_df = pd.DataFrame(
    data=[(hit.identifier,
           hit.similarity,
           hit.entry.synonyms[0] if hit.entry.synonyms else '',  # First synonym, if any
           diagram_generator.image(hit.molecule))
          for hit in hits],
    columns=['Refcode', 'Similarity', 'Name', 'Depiction']
)

hits_df.shape

In [None]:
show_dataframe(hits_df)
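When the hit list is long, it can help to rank it by similarity and keep only the closest matches. The following is a minimal sketch using plain Python: `Hit` is a hypothetical stand-in that models only the `identifier` and `similarity` attributes used in the table above, `top_hits` is an illustrative helper (not part of the CCDC API), and the refcodes and scores are placeholders rather than real search results.

```python
from collections import namedtuple

# Hypothetical stand-in for a search hit; only the two attributes used here are modelled
Hit = namedtuple('Hit', ['identifier', 'similarity'])


def top_hits(hits, min_similarity=0.0, limit=None):
    """Return hits with similarity >= min_similarity, most similar first."""
    kept = [hit for hit in hits if hit.similarity >= min_similarity]
    kept.sort(key=lambda hit: hit.similarity, reverse=True)
    return kept if limit is None else kept[:limit]


# Placeholder data, not real CSD entries
example_hits = [
    Hit('REFCODE_A', 0.62),
    Hit('REFCODE_B', 1.00),
    Hit('REFCODE_C', 0.51),
]

best = top_hits(example_hits, min_similarity=0.6)
print([hit.identifier for hit in best])
```

The same helper should work on any objects exposing `identifier` and `similarity` attributes, so it could be tried on the `hits` returned by the searcher.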

### Query from SMILES

The Molecule API does not currently support SMILES as input, so we recommend using [RDKit](http://rdkit.org/) to generate a MOL-format connection table for input to the API. Note that work is underway to rectify this situation, so the CCDC API will be able to handle SMILES correctly in a future release.

In [None]:
smiles, name = 'CS(=O)(=O)CCNCc1ccc(o1)c2ccc3c(c2)c(ncn3)Nc4ccc(c(c4)Cl)OCc5cccc(c5)F', 'Lapatinib_from_SMILES'

In [None]:
rdk_mol = Chem.MolFromSmiles(smiles)  # Convert SMILES to an RDKit molecule object

rdk_mol.SetProp('_Name', name)  # _Name is a special property that gets recorded in the molfile header, which can be convenient

molblock = Chem.MolToMolBlock(rdk_mol)  # Convert RDKit molecule to a string representation (MOL format)

In [None]:
# with open(name + '.mol', 'w') as file:
    
#     file.write(molblock)

We can then create a CCDC molecule from this starting structure, and standardize the molecular representation to ensure conformance with CSD conventions...

In [None]:
query = Molecule.from_string(molblock)  # Make CCDC molecule object from the string representation

In [None]:
mol2html(query)  # Depict the SMILES-derived molecule (before standardization)

In [None]:
query.add_hydrogens()  # Necessary for bond-typing?

query.assign_bond_types(which='unknown')

query.standardise_delocalised_bonds()

query.standardise_aromatic_bonds()

query.remove_hydrogens()  # Remove Hydrogens as these will be re-added using the API below

query.add_hydrogens()

In [None]:
searcher = SimilaritySearch(query, threshold=threshold)

We can set CSD search filters _via_ the searcher's [settings](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/search_api.html#ccdc.search.Search.Settings) object, for example to restrict hits to structures of sufficient quality. The filters below are commented out; uncomment them to apply...

In [None]:
# settings = searcher.settings

# settings.has_3d_coordinates = True
# settings.max_r_factor       = 5  # %
# settings.no_disorder        = 'Non-hydrogen' # No disorder in heavy atoms allowed
# settings.no_errors          = True
# settings.not_polymeric      = True
# settings.no_ions            = False
# settings.no_powder          = True
# settings.only_organic       = True

Search the CSD using the similarity query...

In [None]:
%%time

hits = searcher.search()

len(hits)

Examine a table of hits...

In [None]:
hits_df = pd.DataFrame(
    data=[(hit.identifier,
           hit.similarity,
           hit.entry.synonyms[0] if hit.entry.synonyms else '',  # First synonym, if any
           diagram_generator.image(hit.molecule))
          for hit in hits],
    columns=['Refcode', 'Similarity', 'Name', 'Depiction']
)

hits_df.shape

In [None]:
# show_dataframe(hits_df)