# FPSim2 demo

- FPSim2 is a small, easy-to-use Python library for running fast similarity searches.
- Heavy processing is implemented in C++ using SIMD instructions and takes advantage of [pybind11](https://pybind11.readthedocs.io/en/stable/)'s awesome integration with NumPy.
- The GIL is released most of the time, so multiple threads can be used to speed up a single query.
- Fingerprints are stored in a PyTables table.
- It provides two working modes:
  - In-memory search: faster.
  - On-disk search: for datasets that don't fit in memory.
- It has one clear **limitation**: only integer ids can be used to identify molecules. The library was designed to work with backends that must have integer ids for their data. This example uses ChEMBL's **molregno** as the id.
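
If your molecules are keyed by strings, one possible workaround for the integer id limitation is to keep a lookup table on your side. The sketch below is only an illustration: `build_id_maps` and the sample identifiers are made up and are not part of FPSim2.

```python
def build_id_maps(identifiers):
    """Assign an integer id (starting at 1) to each distinct identifier."""
    str_to_int = {}
    for ident in identifiers:
        if ident not in str_to_int:
            str_to_int[ident] = len(str_to_int) + 1
    # Reverse table, used to translate search hits back to the original keys.
    int_to_str = {i: s for s, i in str_to_int.items()}
    return str_to_int, int_to_str


# Made-up identifiers, for illustration only.
str_to_int, int_to_str = build_id_maps(["mol-a", "mol-b", "mol-a", "mol-c"])
print(str_to_int)
print(int_to_str[2])
```

The integer ids would be used when building the fingerprint file, and the reverse table would translate the `mol_id` values of a result back to your own identifiers.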

ChEMBL25 contains only 1.87 million molecules. The advantage of using multiple threads for a single query is more visible on bigger datasets.

FPSim2 has already been tested against UniChem (>150 million compounds) and GDB13 (>970 million compounds).

**Note that Binder performance is awful, even compared to a 4-year-old laptop.**

## Imports

In [1]:
from FPSim2 import FPSim2Engine


## Load the fingerprint database and show its parameters

In [2]:
fp_filename = 'chembl_25.h5'

fpe = FPSim2Engine(fp_filename)

print('FP type: ', fpe.fp_type)
print('FP parameters: ', fpe.fp_params)
print('RDKit version: ', fpe.rdkit_ver)
print('Num fps:', fpe.fps.fps.shape[0])

FP type:  Morgan
FP parameters:  {'radius': 2, 'nBits': 2048}
RDKit version:  2019.03.2
Num fps: 1870451


## Run a search
In small databases like ChEMBL, a significant portion of the search time is spent processing the query molecule.

In [3]:
query = 'CC(=O)Oc1ccccc1C(=O)O'
results = fpe.similarity(query, 0.7, n_workers=1)
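
To make the `0.7` threshold concrete, here is a minimal pure-Python sketch of a thresholded similarity search. It assumes the coefficient is Tanimoto and represents fingerprints as plain Python integers. The function names and the toy 8-bit fingerprints are made up for illustration, and this is not how FPSim2 implements the search internally.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints stored as Python ints."""
    common = popcount(a & b)
    total = popcount(a) + popcount(b) - common
    return common / total if total else 0.0


def similarity_search(query_fp, db, threshold):
    """Return (mol_id, coeff) pairs at or above the threshold, best first."""
    hits = [(mol_id, tanimoto(query_fp, fp)) for mol_id, fp in db]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy 8-bit fingerprints with made-up ids.
db = [(1, 0b10110100), (2, 0b10110000), (3, 0b00001111)]
hits = similarity_search(0b10110100, db, 0.7)
print(hits)
```

Only the first two toy molecules share enough bits with the query to reach the threshold.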

## Results are returned as a structured NumPy array

In [4]:
print(results.shape)
results

(4,)


array([(   1280, 1.        ), (2096455, 0.8888889 ),
       ( 271022, 0.85714287), ( 875057, 0.7       )],
      dtype=[('mol_id', '<u4'), ('coeff', '<f4')])
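
A structured array can be queried by field name. The sketch below rebuilds the array shown above by hand (same dtype and values) so it runs without FPSim2. The `0.85` cutoff is an arbitrary value chosen for the example.

```python
import numpy as np

# Same dtype and values as the output above, rebuilt by hand.
results = np.array(
    [(1280, 1.0), (2096455, 0.8888889), (271022, 0.85714287), (875057, 0.7)],
    dtype=[("mol_id", "<u4"), ("coeff", "<f4")],
)

# Column access by field name.
mol_ids = results["mol_id"]

# Boolean mask on one field keeps whole rows.
top = results[results["coeff"] >= 0.85]

print(mol_ids.tolist())
print(top["mol_id"].tolist())
```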

### Time it!

In [5]:
%%timeit
results = fpe.similarity(query, 0.7, n_workers=1)

9.81 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## On disk search

If your dataset doesn't fit in memory, you can still run searches by reading the fingerprints from disk.

In [7]:
query = 'CC(=O)Oc1ccccc1C(=O)O'

fpe = FPSim2Engine(fp_filename, in_memory_fps=False)
results = fpe.on_disk_similarity(query, 0.7, chunk_size=100000, n_workers=2)

In [8]:
print(results.shape)
results

(4,)


array([(   1280, 1.        ), (2096455, 0.8888889 ),
       ( 271022, 0.85714287), ( 875057, 0.7       )],
      dtype=[('mol_id', '<u4'), ('coeff', '<f4')])
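
The `chunk_size` and `n_workers` arguments suggest a chunk-and-merge pattern: score one block of fingerprints at a time and combine the partial hit lists. The sketch below shows that general pattern in pure Python. It is an assumption about the overall shape, not FPSim2's actual implementation, and every name in it is hypothetical. Pure-Python scoring holds the GIL, so the threads here illustrate the structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints stored as Python ints."""
    common = bin(a & b).count("1")
    total = bin(a).count("1") + bin(b).count("1") - common
    return common / total if total else 0.0


def search_chunk(query_fp, chunk, threshold):
    """Score one chunk of (mol_id, fp) rows and keep the hits."""
    hits = []
    for mol_id, fp in chunk:
        coeff = tanimoto(query_fp, fp)
        if coeff >= threshold:
            hits.append((mol_id, coeff))
    return hits


def chunked_search(query_fp, rows, threshold, chunk_size, n_workers):
    """Split rows into chunks, score them in worker threads, merge the hits."""
    # A real on-disk search would read each chunk from the file instead of
    # slicing an in-memory list.
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = pool.map(lambda ch: search_chunk(query_fp, ch, threshold), chunks)
        hits = [hit for part in partial for hit in part]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy 8-bit fingerprints with made-up ids.
rows = [(1, 0b10110100), (2, 0b10110000), (3, 0b00001111), (4, 0b10110110)]
hits = chunked_search(0b10110100, rows, 0.7, chunk_size=2, n_workers=2)
print(hits)
```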

### Time it!

In [9]:
%%timeit
results = fpe.on_disk_similarity(query, 0.7, chunk_size=100000, n_workers=2)

226 ms ± 7.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Substructure search

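FPSim2 can also run Tversky searches on fingerprints, which can be used as a substructure screen. Bear in mind that this is NOT a full substructure search: it only compares fingerprints and does not match atoms and bonds. Full substructure search might be implemented in the future.

The RDKit PatternFingerprint is recommended for this kind of search.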
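
To see why a Tversky search can act as a substructure screen, consider the Tversky index with weights α = 1 and β = 0. It reduces to the fraction of query bits that are also set in the target, so it equals 1.0 exactly when the target contains every query bit. The sketch below demonstrates this on toy 8-bit fingerprints. That FPSim2 uses these particular weights and a threshold of 1.0 is an assumption here, and the function names are made up.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tversky(query_fp, target_fp, alpha, beta):
    """Tversky index between two fingerprints stored as Python ints."""
    common = popcount(query_fp & target_fp)
    only_query = popcount(query_fp) - common
    only_target = popcount(target_fp) - common
    denom = common + alpha * only_query + beta * only_target
    return common / denom if denom else 0.0


def passes_screen(query_fp, target_fp):
    """True when every bit set in the query is also set in the target."""
    # Assumed weights: alpha=1 penalises missing query bits, beta=0 ignores
    # extra target bits.
    return tversky(query_fp, target_fp, alpha=1.0, beta=0.0) == 1.0


query = 0b00110100
print(passes_screen(query, 0b10110110))  # target contains all query bits
print(passes_screen(query, 0b10100110))  # target is missing one query bit
```

Passing such a screen is a necessary condition for a substructure match, not a sufficient one, which is why the results should not be read as a full substructure search.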


In [10]:
fp_filename = 'chembl_25_substructure.h5'

fpe = FPSim2Engine(fp_filename)

print('FP type: ', fpe.fp_type) 
print('FP parameters: ', fpe.fp_params)
print('RDKit version: ', fpe.rdkit_ver)
print('Num fps:', fpe.fps.fps.shape[0])

FP type:  RDKPatternFingerprint
FP parameters:  {'fpSize': 2048, 'atomCounts': [], 'setOnlyBits': None}
RDKit version:  2019.03.2
Num fps: 1870451


In [11]:
query = 'CC(=O)Oc1ccccc1C(=O)O'
results = fpe.substructure(query, n_workers=1)

In [12]:
print(results.shape)
results

(7799,)


array([ 445942, 1476178, 1476175, ...,  615450,  615448,  615451],
      dtype=uint32)

### Time it!

In [13]:
%%timeit
results = fpe.substructure(query, n_workers=1)

78.9 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
