# FPSim2 demo

- FPSim2 is an easy to use, simple and small Python library to run fast similarity searches.
- Heavy processing is implemented in C++, calling SIMD instructions and taking advantage of [pybind11](https://pybind11.readthedocs.io/en/stable/)'s awesome integration with NumPy.
- The GIL is released most of the time, so multiple threads can be used to speed up a single query.
- Fingerprints are stored in a PyTables table.
- Provides 2 working modes:
  - In memory search: Fastest
  - On disk search: In case the dataset doesn't fit in memory.
- It has one **limitation**: only integer ids can be used to identify molecules. This library was designed to work with backends that must have integer ids for their data. We are using ChEMBL's **molregno** as the id in this example.

ChEMBL31 contains only 2.3 million molecules. The advantage of using multiple threads in a single query is more obvious on bigger datasets.

It has already been tested against UniChem (>150 million compounds) and GDB13 (>970 million compounds).

**Notice that Binder performance is not very good.**

## Imports

In [None]:
from FPSim2 import FPSim2Engine


## Load fp db and show fp parameters

In [None]:
fp_filename = 'chembl_31.h5'

fpe = FPSim2Engine(fp_filename)

print('FP type: ', fpe.fp_type)
print('FP parameters: ', fpe.fp_params)
print('RDKit version: ', fpe.rdkit_ver)
print('Num fps:', fpe.fps.shape[0])

## Run a similarity (Tanimoto) search
In small databases like ChEMBL, a significant portion of the search time is spent processing the query molecule.

In [None]:
query = 'CC(=O)Oc1ccccc1C(=O)O'
results = fpe.similarity(query, 0.7, n_workers=1)
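
The `0.7` passed above is the Tanimoto threshold: only molecules whose coefficient against the query is at least 0.7 are returned. As a rough illustration of what that coefficient measures, here is a minimal pure-Python sketch. It is not FPSim2's implementation (which runs in C++ on packed fingerprints); the names `tanimoto` and `toy_similarity`, the toy fingerprints, and the descending sort order are assumptions made for this example.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as Python ints (one bit per feature)."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")   # bits set in at least one fingerprint
    return common / total if total else 0.0


def toy_similarity(query_fp: int, database: dict, threshold: float) -> list:
    """Return (mol_id, coeff) pairs at or above the threshold, best hits first."""
    hits = []
    for mol_id, fp in database.items():
        coeff = tanimoto(query_fp, fp)
        if coeff >= threshold:
            hits.append((mol_id, coeff))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical 6-bit fingerprints keyed by integer molecule id.
database = {1: 0b101101, 2: 0b101100, 3: 0b010010}
hits = toy_similarity(0b101101, database, 0.7)
print(hits)
```

Molecule 3 shares no bits with the query, so it falls below the threshold and is dropped.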

## Results are returned in a nice structured NumPy array

In [None]:
print(results.shape)
results

## Time it!

In [None]:
%%timeit
results = fpe.similarity(query, 0.7, n_workers=1)

## It is also possible to run Tversky asymmetric searches

Tversky is a generalisation of the Tanimoto and Dice coefficients. Setting a and b to the following values gives:
 - a=1, b=1: it's equivalent (but slower) to the fpe.similarity function (Tanimoto)
 - a=1, b=0: it's equivalent (but slower) to the fpe.substructure function (substructure screenout)
 - a=0.5, b=0.5: it calculates the Sørensen–Dice coefficient
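
The special cases above can be checked with a small pure-Python sketch of the Tversky index. This is only an illustration under stated assumptions: `popcount` and `tversky_index` are hypothetical helper names, fingerprints are plain Python ints, and `a` is assumed to weight the bits found only in the query while `b` weights the bits found only in the target, which is the convention consistent with the a=1, b=0 case.

```python
def popcount(value: int) -> int:
    """Number of bits set in a non-negative integer."""
    return bin(value).count("1")


def tversky_index(query_fp: int, target_fp: int, a: float, b: float) -> float:
    """Tversky index; a weights query-only bits, b weights target-only bits."""
    common = popcount(query_fp & target_fp)
    only_query = popcount(query_fp & ~target_fp)
    only_target = popcount(target_fp & ~query_fp)
    denominator = common + a * only_query + b * only_target
    return common / denominator if denominator else 0.0


# Hypothetical 6-bit fingerprints: 3 shared bits, 1 query-only, 1 target-only.
query_fp, target_fp = 0b101101, 0b101110

as_tanimoto = tversky_index(query_fp, target_fp, 1, 1)   # 3 / (3 + 1 + 1)
as_dice = tversky_index(query_fp, target_fp, 0.5, 0.5)   # 3 / (3 + 0.5 + 0.5)
# With a=1, b=0 the score is 1.0 whenever every query bit is present in the target.
as_screen = tversky_index(0b101100, 0b101101, 1, 0)
print(as_tanimoto, as_dice, as_screen)
```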

In [None]:
results = fpe.tversky(query, 0.7, 0.5, 0.5, n_workers=1)

In [None]:
print(results.shape)
results

## On disk search

If your dataset doesn't fit in memory or you're dealing with huge datasets, it's still possible to run searches.

In [None]:
query = 'CC(=O)Oc1ccccc1C(=O)O'

fpe = FPSim2Engine(fp_filename, in_memory_fps=False)
results = fpe.on_disk_similarity(query, 0.7, chunk_size=100000, n_workers=1)
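
The `chunk_size` argument suggests that fingerprints are read and searched one block at a time, so only a bounded number of them is held in memory at once. The sketch below illustrates that general pattern in pure Python. It is an assumption about the approach, not FPSim2's actual code: `iter_chunks`, `tanimoto`, and `chunked_search` are hypothetical names, and a generator stands in for the on-disk table.

```python
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield lists of at most chunk_size records from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    total = bin(fp_a | fp_b).count("1")
    return bin(fp_a & fp_b).count("1") / total if total else 0.0


def chunked_search(query_fp, records, threshold, chunk_size):
    """Search one chunk at a time, keeping only the hits in memory."""
    hits = []
    for chunk in iter_chunks(records, chunk_size):
        for mol_id, fp in chunk:
            coeff = tanimoto(query_fp, fp)
            if coeff >= threshold:
                hits.append((mol_id, coeff))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# A generator stands in for rows streamed from disk (hypothetical toy data).
rows = [(1, 0b101101), (2, 0b101100), (3, 0b010010), (4, 0b101111)]
records = (row for row in rows)
hits = chunked_search(0b101101, records, 0.7, chunk_size=2)
print(hits)
```

A larger chunk means fewer reads but more memory per step, which is the trade-off a chunk size parameter usually controls.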

In [None]:
print(results.shape)
results

## Time it!

In [None]:
%%timeit
results = fpe.on_disk_similarity(query, 0.7, chunk_size=100000, n_workers=1)