# FPSim2 demo

- FPSim2 is a small, simple, easy-to-use 99% Python library for running fast similarity searches.
- Heavy processing is implemented in Cython, which calls SIMD instructions and takes advantage of its tight integration with NumPy.
- The GIL is released most of the time in Cython, so multiple threads can be used to speed up a single query.
- Fingerprints are stored in a PyTables table.
- It provides two working modes:
  - In-memory search: faster.
  - On-disk search: for datasets that don't fit in memory.
- It has one clear **limitation**: only integer ids can be used to identify molecules. The library was designed to work with backends whose data must have integer ids. This example uses **molregno** as the id.
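
If your own data uses string identifiers, one way to work within the integer-id limitation is to keep a lookup table next to the fingerprint file. The sketch below is only an illustration: the identifiers, the 1-based numbering, and the `hit_ids` list are hypothetical and are not part of FPSim2.

```python
# Hypothetical string identifiers; in practice they would come from your backend.
string_ids = ["CHEMBL25", "CHEMBL112", "CHEMBL521"]

# Assign each string id an integer id (1-based here, an arbitrary choice).
str_to_int = {sid: i for i, sid in enumerate(string_ids, start=1)}
int_to_str = {i: sid for sid, i in str_to_int.items()}

# Stand-in for the integer ids returned by a search.
hit_ids = [3, 1]

# Translate the integer ids back to the original identifiers.
print([int_to_str[i] for i in hit_ids])  # ['CHEMBL521', 'CHEMBL25']
```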

ChEMBL24 contains only 1.8 million molecules. The advantage of using multiple threads in a single query is more apparent on bigger datasets.

FPSim2 has been tested with UniChem (>150 million compounds) and GDB-13 (>970 million compounds).

**Binder performance is awful, even compared to a 4-year-old laptop. Try the Docker image if you want to see better performance.**

## Imports

In [1]:
from FPSim2 import FPSim2Engine


## Load fp db and show fp parameters

In [2]:
fp_filename = 'chembl_24.h5'

fpe = FPSim2Engine(fp_filename)

print('FP type: ', fpe.fp_type)
print('FP parameters: ', fpe.fp_params)
print('RDKit version: ', fpe.rdkit_ver)
print('Num fps:', fpe.fps.fps.shape[0])

FP type:  Morgan
FP parameters:  {'radius': 2, 'nBits': 2048, 'useFeatures': False, 'useChirality': False, 'useBondTypes': True}
RDKit version:  2018.03.4
Num fps: 1820001


## Run a search
In small databases like ChEMBL, a significant portion of the search time is spent processing the query molecule.

In [3]:
query = 'CC(=O)Oc1ccccc1C(=O)O'
results = fpe.similarity(query, 0.7, n_workers=1)
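
The second argument to `similarity` (0.7 here) appears to act as a similarity cutoff. Assuming the reported coefficient is the Tanimoto coefficient, which this notebook does not state explicitly, the quantity being thresholded can be sketched on toy 8-bit fingerprints as follows. The `tanimoto` helper is written for illustration and is not FPSim2's implementation.

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto coefficient between two boolean fingerprint vectors."""
    common = np.count_nonzero(a & b)  # bits set in both fingerprints
    total = np.count_nonzero(a | b)   # bits set in at least one fingerprint
    return common / total if total else 0.0

# Toy 8-bit fingerprints (real ones in this notebook have 2048 bits).
query_fp = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=bool)
target_fp = np.array([1, 1, 0, 0, 0, 0, 1, 1], dtype=bool)

coeff = tanimoto(query_fp, target_fp)
print(coeff)         # 0.6 (3 shared bits out of 5 set in either)
print(coeff >= 0.7)  # False: this target would not pass a 0.7 cutoff
```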

## Results in a nice structured NumPy array

In [4]:
print(results.shape)
results

(4,)


array([(   1280, 1.        ), (2096455, 0.8888889 ),
       ( 271022, 0.85714287), ( 875057, 0.7       )],
      dtype=[('mol_id', '<u4'), ('coeff', '<f4')])
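
Because the result is a structured NumPy array, each column can be pulled out by field name and the rows can be filtered with a boolean mask. The sketch below rebuilds the array printed above by hand so that it runs without FPSim2 installed; the 0.85 cutoff is an arbitrary value chosen for the example.

```python
import numpy as np

# Hand-built copy of the result shown above (same values, same dtype).
results = np.array(
    [(1280, 1.0), (2096455, 0.8888889), (271022, 0.85714287), (875057, 0.7)],
    dtype=[("mol_id", "<u4"), ("coeff", "<f4")],
)

# Access a whole column by field name.
mol_ids = results["mol_id"]
print(mol_ids.tolist())  # [1280, 2096455, 271022, 875057]

# Keep only the rows whose coefficient is at least 0.85.
strong = results[results["coeff"] >= 0.85]
print(strong["mol_id"].tolist())  # [1280, 2096455, 271022]

# Row with the highest coefficient.
best = results[np.argmax(results["coeff"])]
print(int(best["mol_id"]))  # 1280
```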

## Time it!

In [5]:
%%timeit
results = fpe.similarity(query, 0.7, n_workers=1)

11 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## On disk search

If your dataset doesn't fit in memory, it's still possible to run searches directly from disk.

In [6]:
%%time
query = 'CC(=O)Oc1ccccc1C(=O)O'

fpe = FPSim2Engine(fp_filename, in_memory_fps=False)
results = fpe.on_disk_similarity(query, 0.7, chunk_size=100000, n_workers=2)


CPU times: user 12.6 ms, sys: 56.1 ms, total: 68.7 ms
Wall time: 200 ms


In [7]:
print(results.shape)
results

(4,)


array([(   1280, 1.        ), (2096455, 0.8888889 ),
       ( 271022, 0.85714287), ( 875057, 0.7       )],
      dtype=[('mol_id', '<u4'), ('coeff', '<f4')])
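
The `chunk_size` argument suggests that the on-disk mode reads fingerprints in blocks rather than all at once. The following is a rough sketch of that general idea only, not FPSim2's actual code: a random boolean NumPy array stands in for the PyTables table, the coefficient is assumed to be Tanimoto, and `chunked_search` is a hypothetical helper.

```python
import numpy as np

def chunked_search(fps, query, threshold, chunk_size):
    """Scan fps block by block and return (row index, coefficient) hits."""
    hits = []
    for start in range(0, len(fps), chunk_size):
        # Only this slice has to be held in memory at any one time.
        chunk = fps[start:start + chunk_size]
        common = np.count_nonzero(chunk & query, axis=1)
        union = np.count_nonzero(chunk | query, axis=1)
        # Tanimoto per row; rows with an empty union get 0.0.
        coeff = np.divide(common, union, out=np.zeros(len(chunk)), where=union > 0)
        for idx in np.nonzero(coeff >= threshold)[0]:
            hits.append((start + int(idx), float(coeff[idx])))
    # Highest coefficients first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Random stand-in data: 1000 fingerprints of 64 bits each.
rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(1000, 64)).astype(bool)
query_fp = fps[42]

hits = chunked_search(fps, query_fp, threshold=0.5, chunk_size=100)
print(hits[0])  # (42, 1.0): the query matches itself exactly
```

The chunk size changes how much data is resident at once, not which hits are found, so a single full-size chunk returns the same list.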

## Substructure search

FPSim2 can also run Tversky searches using fingerprints. Bear in mind that this is NOT a full substructure search; full substructure search might be implemented in the future.

It is recommended to use RDKit PatternFingerprint.
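
To illustrate why a Tversky search can serve as a substructure screen, here is a minimal sketch of the Tversky index on toy boolean fingerprints. The weights `a=1, b=0` are an assumption about how such a screen is commonly set up, not something this notebook confirms about FPSim2. With those weights the index equals 1.0 exactly when every bit set in the query is also set in the target.

```python
import numpy as np

def tversky(query, target, a, b):
    """Tversky index: common / (common + a*only_query + b*only_target)."""
    common = np.count_nonzero(query & target)
    only_query = np.count_nonzero(query & ~target)
    only_target = np.count_nonzero(target & ~query)
    denom = common + a * only_query + b * only_target
    return common / denom if denom else 0.0

query_fp = np.array([1, 0, 1, 0, 0, 1, 0, 0], dtype=bool)
superset = np.array([1, 1, 1, 0, 0, 1, 0, 1], dtype=bool)  # contains every query bit
partial = np.array([1, 0, 0, 0, 0, 1, 1, 0], dtype=bool)   # misses one query bit

print(tversky(query_fp, superset, a=1, b=0))  # 1.0
print(tversky(query_fp, partial, a=1, b=0))   # 0.666... (2 of 3 query bits present)
```

Passing such a screen only means the target's fingerprint covers the query's bits, so it can produce false positives, which is consistent with the note that this is not a full substructure search.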


In [8]:
fp_filename = 'chembl_24_substructure.h5'

fpe = FPSim2Engine(fp_filename)

print('FP type: ', fpe.fp_type) 
print('FP parameters: ', fpe.fp_params)
print('RDKit version: ', fpe.rdkit_ver)
print('Num fps:', fpe.fps.fps.shape[0])

FP type:  RDKPatternFingerprint
FP parameters:  {'fpSize': 2048, 'atomCounts': [], 'setOnlyBits': None}
RDKit version:  2018.03.4
Num fps: 1820001


In [9]:
query = 'CC(=O)Oc1ccccc1C(=O)O'
results = fpe.substructure(query, n_workers=1)

In [10]:
print(results.shape)
results

(7533,)


array([   1280,  445942, 1476178, ...,  615449,  615448,  615450],
      dtype=uint32)

## Time it!

In [11]:
%%timeit
results = fpe.substructure(query, n_workers=1)

134 ms ± 770 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
