## Fingerprinting and Molecular Similarity
The RDKit has a variety of built-in functions for generating molecular fingerprints and using them to calculate molecular similarity.

### Topological Fingerprints

In [7]:
from rdkit import DataStructs
from rdkit import Chem
from rdkit.Chem.Fingerprints import FingerprintMols
ms = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO'), Chem.MolFromSmiles('COC')]
fps = [FingerprintMols.FingerprintMol(x) for x in ms]
# expected 0.6
DataStructs.FingerprintSimilarity(fps[0],fps[1])

0.6

In [8]:
# expected 0.4
DataStructs.FingerprintSimilarity(fps[0],fps[2])

0.4

In [9]:
# expected 0.25
DataStructs.FingerprintSimilarity(fps[1],fps[2])

0.25

The fingerprinting algorithm used is similar to that used in the Daylight fingerprinter: it identifies and hashes topological paths (e.g. along bonds) in the molecule and then uses them to set bits in a fingerprint of user-specified length. After all paths have been identified, the fingerprint is typically folded down until a particular density of set bits is obtained.

The default set of parameters used by the fingerprinter is:

- minimum path size: 1 bond
- maximum path size: 7 bonds
- fingerprint size: 2048 bits
- number of bits set per hash: 2
- minimum fingerprint size: 64 bits
- target on-bit density: 0.3

You can control these by calling RDKFingerprint directly; this will return an unfolded fingerprint that you can then fold to the desired density. The function FingerprintMol (written in Python) shows how this is done.
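As a minimal sketch of that workflow, the cell below calls `Chem.RDKFingerprint` with the default path and size parameters written out explicitly and then folds the result once with `DataStructs.FoldFingerprint`. The choice of molecule and the single fold by a factor of 2 are illustrative assumptions; `FingerprintMol` itself decides how far to fold based on the target density.

```python
from rdkit import Chem, DataStructs

mol = Chem.MolFromSmiles('CCOC')

# Path-based fingerprint with the parameters spelled out explicitly.
fp = Chem.RDKFingerprint(mol, minPath=1, maxPath=7, fpSize=2048, nBitsPerHash=2)

# Fold by a factor of 2: the result is half as long, and a bit is set
# if it was set in either half of the original fingerprint.
folded = DataStructs.FoldFingerprint(fp, 2)

print(fp.GetNumBits(), fp.GetNumOnBits())
print(folded.GetNumBits(), folded.GetNumOnBits())
```

Folding can only keep or reduce the number of set bits, so the on-bit density of the folded fingerprint is at least that of the original.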

The default similarity metric used by FingerprintSimilarity is the Tanimoto similarity. One can use different similarity metrics:

In [10]:
# expected 0.75
DataStructs.FingerprintSimilarity(fps[0],fps[1], metric=DataStructs.DiceSimilarity)

0.75

Available similarity metrics include Tanimoto, Dice, Cosine, Sokal, Russel, Kulczynski, McConnaughey, and Tversky.
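To make the relationship between the two most common metrics concrete, here is a small pure-Python sketch of the Tanimoto and Dice formulas for bit-vector fingerprints. The helper functions and the two sets of on-bit indices are hypothetical, chosen only so that the numbers are easy to follow; this is not how the RDKit implements the metrics internally.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bit indices."""
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def dice(a, b):
    """Dice similarity of two sets of on-bit indices."""
    common = len(a & b)
    return 2 * common / (len(a) + len(b))


# Hypothetical on-bit sets: 4 bits each, 3 of them shared.
bits_a = {1, 5, 9, 12}
bits_b = {1, 5, 9, 20}

t = tanimoto(bits_a, bits_b)  # 3 shared / 5 distinct bits
d = dice(bits_a, bits_b)      # 2 * 3 shared / (4 + 4) bits
print(t, d)
```

For bit vectors the two metrics are monotonically related by Dice = 2T / (1 + T), which is consistent with the Tanimoto value of 0.6 and the Dice value of 0.75 obtained for the same pair of fingerprints above.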

### Morgan Fingerprints (Circular Fingerprints)
This family of fingerprints, better known as circular fingerprints [5], is built by applying the Morgan algorithm to a set of user-supplied atom invariants. When generating Morgan fingerprints, the radius of the fingerprint must also be provided:

In [12]:
from rdkit.Chem import AllChem
m1 = Chem.MolFromSmiles('Cc1ccccc1')
fp1 = AllChem.GetMorganFingerprint(m1,2)
# expected <rdkit.DataStructs.cDataStructs.UIntSparseIntVect object at 0x...>
fp1

<rdkit.DataStructs.cDataStructs.UIntSparseIntVect at 0x10a9b01c0>

In [13]:
m2 = Chem.MolFromSmiles('Cc1ncccc1')
fp2 = AllChem.GetMorganFingerprint(m2,2)
# expected 0.55
DataStructs.DiceSimilarity(fp1,fp2)

0.55

Morgan fingerprints, like atom pairs and topological torsions, use counts by default, but it’s also possible to calculate them as bit vectors:

In [14]:
fp1 = AllChem.GetMorganFingerprintAsBitVect(m1,2,nBits=1024)
# Expected <rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x...>
fp1

<rdkit.DataStructs.cDataStructs.ExplicitBitVect at 0x10ae07df0>

In [15]:
fp2 = AllChem.GetMorganFingerprintAsBitVect(m2,2,nBits=1024)
# expected 0.518
DataStructs.DiceSimilarity(fp1,fp2)

0.5185185185185185

The default atom invariants use connectivity information similar to that used for the well-known ECFP family of fingerprints. Feature-based invariants, similar to those used for the FCFP fingerprints, can also be used. The feature definitions are described in the section Feature Definitions Used in the Morgan Fingerprints. At times this can lead to quite different similarity scores:

In [20]:
m1 = Chem.MolFromSmiles('c1ccccn1')
m2 = Chem.MolFromSmiles('c1ccco1')
fp1 = AllChem.GetMorganFingerprint(m1,2)
fp2 = AllChem.GetMorganFingerprint(m2,2)
ffp1 = AllChem.GetMorganFingerprint(m1,2,useFeatures=True)
ffp2 = AllChem.GetMorganFingerprint(m2,2,useFeatures=True)
# ECFP-like (default connectivity invariants): expected 0.36
DataStructs.DiceSimilarity(fp1,fp2)

0.36363636363636365

In [19]:
# FCFP-like (feature invariants): expected 0.91
DataStructs.DiceSimilarity(ffp1,ffp2)

0.9090909090909091

When comparing the ECFP/FCFP fingerprints and the Morgan fingerprints generated by the RDKit, remember that the 4 in ECFP4 corresponds to the diameter of the atom environments considered, while the Morgan fingerprints take a radius parameter. So the examples above, with radius=2, are roughly equivalent to ECFP4 and FCFP4.
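One way to see the radius convention is to look at which atom environments contribute to a fingerprint. The sketch below assumes the optional `bitInfo` argument of `GetMorganFingerprint`, which fills a dictionary mapping each feature identifier to a tuple of `(atom index, radius)` pairs; the molecule is an arbitrary choice.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('Cc1ccccc1')

# bitInfo is filled in place: feature id -> ((atom index, radius), ...)
info = {}
fp = AllChem.GetMorganFingerprint(mol, 2, bitInfo=info)

# Collect the distinct radii of all environments that set a feature.
radii = sorted({radius for envs in info.values() for (_atom_idx, radius) in envs})
print(radii)
```

No environment extends more than 2 bonds from its central atom, so the largest environments span a diameter of 4 bonds, which is what the 4 in ECFP4 refers to.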

The user can also provide their own atom invariants using the optional invariants argument to GetMorganFingerprint. Here’s a simple example that uses a constant for the invariant; the resulting fingerprints compare the topology of molecules:

In [21]:
m1 = Chem.MolFromSmiles('Cc1ccccc1')
m2 = Chem.MolFromSmiles('Cc1ncncn1')
fp1 = AllChem.GetMorganFingerprint(m1,2,invariants=[1]*m1.GetNumAtoms())
fp2 = AllChem.GetMorganFingerprint(m2,2,invariants=[1]*m2.GetNumAtoms())
fp1==fp2

True

Note that bond order is still considered by default, so toluene and methylcyclohexane are distinguished even with constant atom invariants:

In [24]:
m3 = Chem.MolFromSmiles('CC1CCCCC1')
fp3 = AllChem.GetMorganFingerprint(m3,2,invariants=[1]*m3.GetNumAtoms())
fp1==fp3

False

But this can also be turned off:

In [25]:
fp1 = AllChem.GetMorganFingerprint(m1,2,invariants=[1]*m1.GetNumAtoms(), useBondTypes=False)
fp3 = AllChem.GetMorganFingerprint(m3,2,invariants=[1]*m3.GetNumAtoms(), useBondTypes=False)
fp1==fp3

True

## Picking Diverse Molecules Using Fingerprints
A common task is to pick a small subset of diverse molecules from a larger set. The RDKit provides a number of approaches for doing this in the rdkit.SimDivFilters module. The most efficient of these uses the MaxMin algorithm. [6] Here’s an example:

Start by reading in a set of molecules and generating Morgan fingerprints:



In [28]:
from rdkit import Chem
from rdkit.Chem.rdMolDescriptors import GetMorganFingerprint
from rdkit import DataStructs
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker
# Skip entries that could not be parsed (the supplier yields None for them)
ms = [x for x in Chem.SDMolSupplier('data/actives_5ht3.sdf') if x is not None]
fps = [GetMorganFingerprint(x,3) for x in ms]
nfps = len(fps)

In [29]:
nfps

180

In [30]:
fps[0]

<rdkit.DataStructs.cDataStructs.UIntSparseIntVect at 0x10a9b02b0>

In [31]:
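# Distance callback: the picker passes in two indices into fps
def distij(i,j,fps=fps):
    return 1-DataStructs.DiceSimilarity(fps[i],fps[j])
picker = MaxMinPicker()
# Pick 10 of the nfps molecules; distances are computed on demand
pickIndices = picker.LazyPick(distij,nfps,10,seed=23)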
list(pickIndices)

[93, 109, 154, 6, 95, 135, 151, 61, 137, 139]
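For intuition about what the picker is doing, here is a rough pure-Python sketch of the greedy MaxMin idea: start from one item, then repeatedly add the item whose smallest distance to the already-picked set is largest. The function `maxmin_pick` and the toy one-dimensional data are hypothetical illustrations, not the RDKit implementation, which is written in C++ and optimized to avoid unnecessary distance calculations.

```python
import random


def maxmin_pick(dist, n_items, n_picks, seed=0):
    """Greedy MaxMin selection; assumes n_picks <= n_items and distinct items."""
    rng = random.Random(seed)
    picked = [rng.randrange(n_items)]  # random starting item
    # Smallest distance from every item to the picked set so far.
    min_d = [dist(i, picked[0]) for i in range(n_items)]
    while len(picked) < n_picks:
        # Next pick: the item farthest from everything already picked.
        nxt = max(range(n_items), key=lambda i: min_d[i])
        picked.append(nxt)
        for i in range(n_items):
            min_d[i] = min(min_d[i], dist(i, nxt))
    return picked


# Toy data: three clustered points and two outliers on a line.
points = [0.0, 0.1, 0.2, 5.0, 10.0]
picks = maxmin_pick(lambda i, j: abs(points[i] - points[j]), len(points), 3, seed=23)
print(picks)
```

Whatever the starting item, the two outliers (indices 3 and 4) end up in the selection, because each is far from every other point. The sketch takes the same kind of index-based distance callback as `distij` above.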