# Molecular Similarity and Substructure on Datasets

This notebook explores substructure searches and similarity measurements in a larger dataset.
Often in cheminformatics, one will want to perform these measurements on large datasets for molecular screening.

In the previous notebook, substructure searches using subgraphs were mentioned. However, for larger datasets, this can become too computationally expensive.
Another option is to do a substructure search using molecular fingerprints.
In this type of substructure search, the "on-bits" for a particular molecular pattern are determined.
Then, any molecules having those bits in the fingerprint are returned as matching.
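
The bit test itself is simple enough to sketch without RDKit. In the toy example below, fingerprints are plain Python integers used as bit masks. The bit positions, molecule names, and the helper `has_all_bits` are invented for illustration; real fingerprints are much longer and are computed from the molecular graph.

```python
def has_all_bits(molecule_fp: int, pattern_fp: int) -> bool:
    """Return True if every on-bit of the pattern is also on in the molecule."""
    return (molecule_fp & pattern_fp) == pattern_fp


# Hypothetical 8-bit fingerprints (bit positions are made up for this sketch).
pattern_fp = 0b00010010          # the pattern sets bits 1 and 4

library = {
    "mol_a": 0b01010110,         # bits 1, 2, 4, 6: contains both pattern bits
    "mol_b": 0b00010001,         # bits 0, 4: missing bit 1
    "mol_c": 0b00010010,         # bits 1, 4: identical to the pattern
}

# Keep every molecule whose fingerprint contains all of the pattern's on-bits.
hits = [name for name, fp in library.items() if has_all_bits(fp, pattern_fp)]
print(hits)
```

Only `mol_a` and `mol_c` pass, because `mol_b` lacks one of the pattern's on-bits.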

This notebook demonstrates the principle using RDKit's PandasTools.

In [None]:
import pandas as pd

from rdkit import Chem
from rdkit.Chem import PandasTools
PandasTools.RenderImagesInAllDataFrames(images=True)

In [None]:
df = pd.read_table("data/chembl_drugs.smi")

In [None]:
df.info()

In [None]:
df.head()

In [None]:
PandasTools.AddMoleculeColumnToFrame(df, "SMILES", "Molecule", includeFingerprints=True)

In [None]:
df.head(3)

In [None]:
benzene = Chem.MolFromSmiles("c1ccccc1")

RDKit provides an operator for performing substructure searches on pandas dataframes.
To use it, write

```python
df["molecule"] >= molecule
```

where `df["molecule"]` is a column of RDKit molecule objects and `molecule` is the RDKit molecule object to search for.

In [None]:
matches = df[df["Molecule"] >= benzene]
matches.head(3)

In [None]:
# how many molecules contain benzene?

In [None]:
matches.info()

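Fingerprint substructure searches are less exact than graph substructure searches: a molecule can have all of the pattern's bits set without actually containing the pattern. For this reason, they are often used as a fast first filter before an exact graph match.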
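
The screen-then-verify idea can be sketched with a toy model. The example below is an assumption-laden illustration, not RDKit code: molecules are represented as sets of named features, the `FEATURE_BIT` table is hypothetical, and `exact_match` is a stand-in for a real graph substructure match.

```python
# Hypothetical feature -> bit assignment. "hydroxyl" and "amine" share bit 4,
# mimicking the collisions that occur when many features map to few bits.
FEATURE_BIT = {"aromatic_ring": 1, "hydroxyl": 4, "amine": 4, "carbonyl": 6}


def fingerprint(features):
    """Return the set of on-bits for a collection of feature names."""
    return {FEATURE_BIT[f] for f in features}


def passes_screen(mol_features, query_features):
    """Cheap test: all query bits are set in the molecule's fingerprint."""
    return fingerprint(query_features) <= fingerprint(mol_features)


def exact_match(mol_features, query_features):
    """Stand-in for the expensive graph match: compare the features themselves."""
    return set(query_features) <= set(mol_features)


query = {"aromatic_ring", "hydroxyl"}
library = {
    "mol_a": {"aromatic_ring", "hydroxyl", "carbonyl"},  # true match
    "mol_b": {"aromatic_ring", "amine"},                 # same bits, wrong feature
    "mol_c": {"carbonyl"},                               # rejected by the screen
}

# Stage 1: fast bit screen. Stage 2: exact check on the survivors only.
candidates = [n for n, feats in library.items() if passes_screen(feats, query)]
confirmed = [n for n in candidates if exact_match(library[n], query)]
print(candidates)
print(confirmed)
```

In this toy model, `mol_b` passes the screen but is rejected by the exact check, while `mol_c` is discarded before the expensive step runs. The screen never loses a true match here, because a molecule that contains the query features always sets the query's bits.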

## Are Fingerprints Unique?
Despite the name "fingerprints", molecular fingerprints are not unique to an individual molecule.
Because of the way fingerprints are calculated, it is possible for two similar (but not identical) molecules
to have the same molecular fingerprint.
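
Two general mechanisms behind this can be shown with a toy sketch. It assumes a hashed bit-vector fingerprint in which integer feature identifiers are folded onto a fixed number of bits with a modulo, and in which a bit records only that a feature is present, not how often it occurs. The identifiers, the tiny bit length, and the helper `fold` are invented for illustration.

```python
N_BITS = 16  # deliberately tiny; real fingerprints commonly use 1024 or 2048 bits


def fold(feature_ids, n_bits=N_BITS):
    """Map integer feature identifiers onto a fixed number of bit positions."""
    return frozenset(fid % n_bits for fid in feature_ids)


# Hypothetical feature identifiers for three "molecules".
mol_a = [3, 21, 40]
mol_b = [3, 37, 40]        # 37 differs from 21, but 37 % 16 == 21 % 16 == 5
mol_c = [3, 21, 21, 40]    # same features as mol_a, one of them repeated

print(fold(mol_a) == fold(mol_b))   # different features, same on-bits
print(fold(mol_a) == fold(mol_c))   # repeated feature, same on-bits
print(sorted(fold(mol_a)))
```

All three feature lists produce the same set of on-bits, even though the underlying "molecules" differ.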

Use the cells below to investigate this: does every molecule in our dataset have a unique Morgan fingerprint?

In [None]:
from rdkit.Chem import AllChem
fpgen = AllChem.GetMorganGenerator()

In [None]:
# Add a Morgan fingerprint column, storing each fingerprint in its binary (bytes) form
df["morgan_fingerprints"] = df["Molecule"].apply(lambda mol: fpgen.GetFingerprint(mol).ToBinary())
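
One possible way to check for shared fingerprints is to group molecules by their serialized fingerprint. The sketch below uses only the standard library, with placeholder byte strings standing in for the `ToBinary()` output; the names, the bytes, and the helper `group_by_fingerprint` are hypothetical.

```python
from collections import defaultdict


def group_by_fingerprint(records):
    """Group names by fingerprint bytes; keep only fingerprints shared by 2+ names."""
    groups = defaultdict(list)
    for name, fp_bytes in records:
        groups[fp_bytes].append(name)
    return {fp: names for fp, names in groups.items() if len(names) > 1}


# Placeholder data: (name, serialized fingerprint).
records = [
    ("mol_a", b"\x01\x02"),
    ("mol_b", b"\x07\x00"),
    ("mol_c", b"\x01\x02"),   # same bytes as mol_a
]

# Fewer unique fingerprints than records means at least one fingerprint is shared.
n_unique = len({fp for _, fp in records})
shared = group_by_fingerprint(records)
print(n_unique, len(records))
print(shared)
```

On the dataframe itself, a similar comparison could be made with pandas Series methods such as `nunique()` and `duplicated()` applied to the `morgan_fingerprints` column.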