# FPSim2 demo

- FPSim2 is a small, simple and easy-to-use Python library for running fast similarity searches.
- Heavy processing is implemented in C++, calling SIMD instructions and taking advantage of [pybind11](https://pybind11.readthedocs.io/en/stable/)'s integration with NumPy.
- The GIL is released most of the time, so multiple threads can be used to speed up a single query.
- Fingerprints are stored in a PyTables table.
- It provides two working modes:
  - In-memory search: the fastest option.
  - On-disk search: for datasets that don't fit in memory.
- It has one clear **limitation**: only integer ids can be used to identify molecules. The library was designed to work with backends that must have integer ids for their data. This example uses ChEMBL's **molregno** as the id.
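
To illustrate the multi-threading point in the list above, here is a minimal pure-Python sketch of splitting one query across worker threads. It is not FPSim2's implementation: the helper names (`tanimoto`, `search_slice`, `threaded_search`) and the toy fingerprints are hypothetical, and because this version holds the GIL it only shows the partition-and-merge pattern, not a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def search_slice(query_fp, fps, start, stop, threshold):
    """Scan fps[start:stop] and keep (mol_id, coeff) pairs above the threshold."""
    hits = []
    for mol_id, fp in fps[start:stop]:
        coeff = tanimoto(query_fp, fp)
        if coeff >= threshold:
            hits.append((mol_id, coeff))
    return hits


def threaded_search(query_fp, fps, threshold, n_workers):
    """Give each worker one contiguous slice, then merge and sort the hits."""
    step = max(1, -(-len(fps) // n_workers))  # ceiling division, at least 1
    bounds = [(i, min(i + step, len(fps))) for i in range(0, len(fps), step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(
            lambda b: search_slice(query_fp, fps, b[0], b[1], threshold), bounds
        )
        hits = [hit for part in parts for hit in part]
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Hypothetical 4-bit fingerprints keyed by integer id.
toy_fps = [(1, 0b1111), (2, 0b0111), (3, 0b0001), (4, 0b1110), (5, 0b1000)]
threaded_hits = threaded_search(0b1111, toy_fps, 0.7, n_workers=2)
```

The result does not depend on the number of workers; only the way the fingerprint list is partitioned changes.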

ChEMBL 25 contains only 1.87 million molecules. The advantage of using multiple threads for a single query is more visible on bigger datasets.

It has already been tested against UniChem (>150 million compounds) and GDB-13 (>970 million compounds).

**Note that Binder performance is not very good.**

## Imports

In [1]:
from FPSim2 import FPSim2Engine


## Load fp db and show fp parameters

In [2]:
fp_filename = 'chembl_25.h5'

fpe = FPSim2Engine(fp_filename)

print('FP type: ', fpe.fp_type)
print('FP parameters: ', fpe.fp_params)
print('RDKit version: ', fpe.rdkit_ver)
print('Num fps:', fpe.fps.shape[0])

FP type:  Morgan
FP parameters:  {'radius': 2, 'nBits': 2048}
RDKit version:  2019.03.2
Num fps: 1870451


## Run a similarity (Tanimoto) search

In small databases like ChEMBL, a significant portion of the search time is spent processing the query molecule.

In [3]:
query = 'CC(=O)Oc1ccccc1C(=O)O'
results = fpe.similarity(query, 0.7, n_workers=1)
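
For reference, the Tanimoto coefficient of two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below assumes fingerprints stored as plain Python integers; `tanimoto`, `similarity_search` and the 8-bit toy data are illustrative names, not part of FPSim2.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    common = bin(a & b).count("1")  # bits set in both fingerprints
    union = bin(a | b).count("1")   # bits set in at least one fingerprint
    return common / union if union else 0.0


def similarity_search(query_fp, db, threshold):
    """Return (mol_id, coeff) pairs with coeff >= threshold, best first."""
    hits = [(mol_id, tanimoto(query_fp, fp)) for mol_id, fp in db.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical 8-bit fingerprints keyed by integer id.
toy_db = {1: 0b10110100, 2: 0b10110000, 3: 0b00001111}
toy_hits = similarity_search(0b10110100, toy_db, 0.7)
```

Like the real search, the sketch returns id and coefficient pairs sorted by decreasing similarity.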

## Results are returned as a structured NumPy array

In [4]:
print(results.shape)
results

(4,)


array([(   1280, 1.        ), (2096455, 0.8888889 ),
       ( 271022, 0.85714287), ( 875057, 0.7       )],
      dtype={'names':['mol_id','coeff'], 'formats':['<u4','<f4'], 'offsets':[4,8], 'itemsize':12})

## Time it!

In [5]:
%%timeit
results = fpe.similarity(query, 0.7, n_workers=1)

8.81 ms ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Tversky asymmetric searches

Tversky is a generalisation of the Tanimoto and Dice coefficients. Setting a and b to the following values gives:
 - a=1, b=1: equivalent to (but slower than) the fpe.similarity function (Tanimoto)
 - a=1, b=0: equivalent to (but slower than) the fpe.substructure function (substructure screenout)
 - a=0.5, b=0.5: the Sørensen–Dice coefficient
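
The relationship between these coefficients can be checked with a small pure-Python sketch. It assumes the usual Tversky definition, c / (c + a * q + b * t), where c counts the bits common to both fingerprints, q the bits set only in the query and t the bits set only in the target. The `tversky` helper below is hypothetical and is not the library function.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tversky(query: int, target: int, a: float, b: float) -> float:
    """Tversky index of two fingerprints stored as Python ints."""
    common = popcount(query & target)
    only_query = popcount(query & ~target)   # bits set in the query only
    only_target = popcount(target & ~query)  # bits set in the target only
    denom = common + a * only_query + b * only_target
    return common / denom if denom else 0.0


q, t = 0b1110, 0b0111  # hypothetical 4-bit fingerprints sharing two bits

tanimoto_like = tversky(q, t, 1, 1)      # 2 / (2 + 1 + 1)
dice_like = tversky(q, t, 0.5, 0.5)      # 2 / (2 + 0.5 + 0.5)
screen_like = tversky(q, 0b1111, 1, 0)   # every query bit is present in the target
```

With a=1 and b=0, extra bits in the target are not penalised, which is what makes the search asymmetric.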

In [6]:
results = fpe.tversky(query, 0.7, 0.5, 0.5, n_workers=1)

In [7]:
print(results.shape)
results

(42,)


array([(   1280, 1.        ), (2096455, 0.9411765 ),
       ( 271022, 0.9230769 ), ( 875057, 0.8235294 ),
       ( 271023, 0.8076923 ), ( 954218, 0.8       ),
       ( 271730, 0.8       ), ( 287927, 0.7916667 ),
       ( 289908, 0.78431374), ( 321840, 0.7826087 ),
       (1737174, 0.7777778 ), (1218400, 0.7777778 ),
       ( 642553, 0.7755102 ), ( 798870, 0.7755102 ),
       (2079586, 0.7692308 ), (1478529, 0.76      ),
       (2096376, 0.754717  ), (1377367, 0.754717  ),
       (1078517, 0.75      ), (1499576, 0.75      ),
       ( 255532, 0.74509805), ( 454071, 0.74509805),
       ( 782905, 0.74509805), ( 271500, 0.7407407 ),
       ( 289408, 0.73913044), (2079585, 0.72727275),
       ( 274086, 0.72727275), ( 783518, 0.72727275),
       ( 876990, 0.7234042 ), (1449653, 0.7234042 ),
       ( 746307, 0.72      ), (1377174, 0.71428573),
       ( 271540, 0.71428573), ( 270959, 0.71428573),
       (1962736, 0.7118644 ), ( 704412, 0.7083333 ),
       ( 696522, 0.7058824 ), (1163322, 0.7017

## On disk search

If the dataset is too large to fit in memory, searches can still be run directly from disk.

In [8]:
query = 'CC(=O)Oc1ccccc1C(=O)O'

fpe = FPSim2Engine(fp_filename, in_memory_fps=False)
results = fpe.on_disk_similarity(query, 0.7, chunk_size=100000, n_workers=1)
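
The chunk_size argument suggests that fingerprints are read and scanned one block at a time. A rough sketch of that pattern follows, assuming an arbitrary iterable of (id, fingerprint) rows in place of the PyTables file. The names `iter_chunks`, `chunked_similarity` and `toy_rows` are hypothetical, and this is not how FPSim2 itself reads the file.

```python
from itertools import islice


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_similarity(query_fp, rows, threshold, chunk_size):
    """Scan rows chunk by chunk so only one chunk is held in memory at a time."""
    hits = []
    for chunk in iter_chunks(rows, chunk_size):
        for mol_id, fp in chunk:
            coeff = tanimoto(query_fp, fp)
            if coeff >= threshold:
                hits.append((mol_id, coeff))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


def toy_rows():
    """Stand-in for rows streamed from disk (hypothetical 4-bit fingerprints)."""
    yield from [(1, 0b1111), (2, 0b0111), (3, 0b0001), (4, 0b1110), (5, 0b1000)]


disk_hits = chunked_similarity(0b1111, toy_rows(), 0.7, chunk_size=2)
```

Only the hits accumulate across chunks, so memory use is bounded by the chunk size plus the size of the result.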

In [9]:
print(results.shape)
results

(4,)


array([(   1280, 1.        ), (2096455, 0.8888889 ),
       ( 271022, 0.85714287), ( 875057, 0.7       )],
      dtype={'names':['mol_id','coeff'], 'formats':['<u4','<f4'], 'offsets':[4,8], 'itemsize':12})

## Time it!

In [10]:
%%timeit
results = fpe.on_disk_similarity(query, 0.7, chunk_size=100000, n_workers=1)

164 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Substructure search

FPSim2 can also run substructure screenouts, which are Tversky searches (a=1, b=0) over fingerprints. Bear in mind that this is NOT a full substructure search. Full substructure search might be implemented in the future.

Using RDKit's PatternFingerprint is recommended for this kind of search.
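
With a=1 and b=0, the Tversky score reaches 1 exactly when every bit set in the query fingerprint is also set in the target. A minimal sketch of that screen, assuming fingerprints stored as Python integers and assuming a hit requires a score of 1; `screenout` and the toy data are hypothetical names.

```python
def screenout(query_fp: int, db: dict) -> list:
    """Return ids whose fingerprint contains every bit set in the query."""
    return [mol_id for mol_id, fp in db.items() if (query_fp & fp) == query_fp]


# Hypothetical 4-bit pattern fingerprints keyed by integer id.
toy_patterns = {10: 0b1111, 11: 0b0101, 12: 0b1011}
candidates = screenout(0b0011, toy_patterns)
```

The ids returned are candidates only. Different substructures can set the same bits, so a real match would still need to be confirmed with an atom-by-atom check such as RDKit's `HasSubstructMatch`.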


In [11]:
fp_filename = 'chembl_25_substructure.h5'

fpe = FPSim2Engine(fp_filename)

print('FP type: ', fpe.fp_type) 
print('FP parameters: ', fpe.fp_params)
print('RDKit version: ', fpe.rdkit_ver)
print('Num fps:', fpe.fps.shape[0])

FP type:  RDKPatternFingerprint
FP parameters:  {'fpSize': 2048, 'atomCounts': [], 'setOnlyBits': None}
RDKit version:  2019.03.2
Num fps: 1870451


In [12]:
query = 'CC(=O)Oc1ccccc1C(=O)O'
results = fpe.substructure(query, n_workers=1)

In [13]:
print(results.shape)
results

(7800,)


array([   1280,  445942, 1476178, ...,  615450,  615448,  615451],
      dtype=uint32)

## Time it!

In [14]:
%%timeit
results = fpe.substructure(query, n_workers=1)

78.8 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
