# FPSim2 demo

- FPSim2 is a small, easy-to-use, 99% Python library for running fast similarity searches.
- Heavy processing is implemented in Cython, calling SIMD instructions and taking advantage of its excellent integration with NumPy.
- The GIL is released most of the time in Cython, so multiple threads can be used to speed up a single query.
- Fingerprints are stored using PyTables, which natively stores NumPy arrays.
- It provides two working modes:
  - In-memory search: faster.
  - On-disk search: for datasets that don't fit in memory.
- It has one clear **limitation**: only integer ids can be used to identify molecules. The library was designed to work with backends that must have integer ids for their data. We are using molregno as the id in this example.

ChEMBL24 contains only 1.8 million molecules. The advantage of using multiple threads in a single query is more apparent when searching larger datasets.

It has been tested with UniChem (>150 million compounds) and GDB13 (>970 million compounds).

**Binder performance is awful, even compared to a 4-year-old laptop. Try the Docker image if you want to see better performance.**

## Imports

In [1]:
from FPSim2 import run_in_memory_search, run_search
from FPSim2.io import load_query, load_fps
import tables as tb

## Show FP file parameters

In [2]:
fp_filename = 'chembl_24.h5'

with tb.open_file(fp_filename, mode='r') as fp_file:
    config = fp_file.root.config
    print('FP type: ', config[0])
    print('FP parameters: ', config[1])
    print('RDKit version: ', config[2])

FP type:  Morgan
FP parameters:  {'radius': 2, 'nBits': 2048, 'useFeatures': False, 'useChirality': False, 'useBondTypes': True}
RDKit version:  2018.03.4


## Load a query molecule

In [3]:
aspirin = 'CC(=O)Oc1ccccc1C(=O)O'

# Uses regexes to distinguish between SMILES and InChI; anything else is treated as a CTAB
query = load_query(aspirin, fp_filename)



## Load fps into memory


In [4]:
%%time
fps = load_fps(fp_filename)


CPU times: user 1.63 s, sys: 316 ms, total: 1.95 s
Wall time: 1.95 s


## Run a search

In [5]:
%%time
results = run_in_memory_search(query, fps, threshold=0.7, coeff='tanimoto', n_threads=1)


CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.7 ms
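
For readers unfamiliar with the `tanimoto` coefficient and the `threshold` argument used above: the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The following is a minimal pure-Python sketch of that definition on toy 8-bit fingerprints. It is not FPSim2's Cython implementation; the `tanimoto` helper, the toy fingerprints, and the ids are hypothetical and only serve the illustration.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")   # bits set in at least one fingerprint
    return common / total if total else 0.0


# Toy 8-bit fingerprints keyed by a made-up integer id (illustration only).
query_fp = 0b10110110
db_fps = {1: 0b10110110, 2: 0b10110100, 3: 0b00001001}

threshold = 0.7
scored = [(mol_id, tanimoto(query_fp, fp)) for mol_id, fp in db_fps.items()]

# Keep only hits at or above the threshold, best match first.
hits = sorted(
    (hit for hit in scored if hit[1] >= threshold),
    key=lambda hit: hit[1],
    reverse=True,
)
print(hits)
```

Real fingerprints here are 2048 bits long, as shown in the file parameters, but the arithmetic is the same.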


## Results in a nice structured NumPy array

In [6]:
print(results.shape)
results

(4,)


array([(   1280, 1.        ), (2096455, 0.8888889 ),
       ( 271022, 0.85714287), ( 875057, 0.7       )],
      dtype=[('mol_id', '<u8'), ('coeff', '<f4')])
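
Because the result is a structured array, its fields can be accessed by name. The sketch below rebuilds the array shown above by hand (so it runs without FPSim2 or the fingerprint file) and uses only standard NumPy indexing; the variable names `mol_ids`, `strong` and `best` are chosen for illustration.

```python
import numpy as np

# Rebuild the output shown above with the same dtype.
results = np.array(
    [(1280, 1.0), (2096455, 0.8888889), (271022, 0.85714287), (875057, 0.7)],
    dtype=[("mol_id", "<u8"), ("coeff", "<f4")],
)

# Select a single field by name.
mol_ids = results["mol_id"]

# Boolean mask on the similarity field: keep hits with coeff >= 0.85.
strong = results[results["coeff"] >= 0.85]

# Record with the highest similarity.
best = results[np.argmax(results["coeff"])]

print(mol_ids)
print(strong["mol_id"])
print(best["mol_id"])
```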

## On disk search

If your dataset doesn't fit in memory or you're dealing with huge datasets, it's still possible to run searches.

In [7]:
%%time
results = run_search(aspirin, fp_filename, threshold=0.7, coeff='tanimoto', chunk_size=250000, n_processes=2)

CPU times: user 16 ms, sys: 16 ms, total: 32 ms
Wall time: 1.11 s
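
The `chunk_size` argument suggests that the fingerprint file is read in fixed-size blocks of rows, which can then be handed to separate processes. The sketch below only illustrates the idea of splitting a row range into such blocks. `chunk_ranges` is a hypothetical helper, not part of FPSim2, and how the library actually schedules the work is an assumption here.

```python
def chunk_ranges(n_rows, chunk_size):
    """Yield (start, stop) row ranges covering n_rows in steps of chunk_size."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


# Round figure for ChEMBL24 taken from the introduction, not an exact row count.
n_rows = 1_800_000
chunks = list(chunk_ranges(n_rows, 250_000))

print(len(chunks))   # number of blocks to search
print(chunks[0])     # first block
print(chunks[-1])    # last block, shorter than chunk_size
```

With these numbers the last block is shorter than the others, which is why the stop index is clamped to `n_rows`.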


In [8]:
results

array([(   1280, 1.        ), (2096455, 0.8888889 ),
       ( 271022, 0.85714287), ( 875057, 0.7       )],
      dtype=[('mol_id', '<u8'), ('coeff', '<f4')])