# FPSim2 demo

- FPSim2 is a small, simple package, 99% Python, for running fast similarity searches.
- Heavy calculations are implemented in Cython, which calls SIMD instructions and takes advantage of its excellent integration with NumPy.
- The GIL is released most of the time in the Cython code, so multiple threads can be used to speed up a single query.
- Fingerprints are stored with PyTables, which also works with NumPy arrays.
- It provides two modes:
  - In-memory search: faster.
  - On-disk search: for datasets that don't fit in memory.
- It has one well-known LIMITATION: only integer ids can be used to identify molecules. The library was designed to work with backends that require integer ids for their data.
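
If your molecules are keyed by non-integer identifiers, one possible workaround for the integer-id limitation is to keep a lookup table next to the fingerprint file. The sketch below is not part of FPSim2; the identifiers and the `to_external` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical external identifiers; the values are illustrative only.
external_ids = ["cmpd-a", "cmpd-b", "cmpd-c"]

# Assign a stable integer id to each external identifier.
id_to_int = {ext: i for i, ext in enumerate(external_ids, start=1)}
int_to_id = {i: ext for ext, i in id_to_int.items()}


def to_external(mol_ids):
    """Translate integer mol_ids from a result set back to external ids."""
    return [int_to_id[int(m)] for m in mol_ids]


print(to_external([3, 1]))
```

The integer ids are what would be written to the fingerprint file; the mapping is only used to translate the `mol_id` values in the results.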

ChEMBL24 contains only 1.8 million molecules. To get a real advantage from using multiple threads in a single query, consider using larger datasets. FPSim2 has been tested with UniChem (150 million) and GDB13 (>970 million).
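
To give an idea of how one query can be spread over several threads, here is a minimal pure-NumPy sketch. It is not FPSim2's Cython implementation: the functions `popcount` and `search_chunk`, the random fingerprints, and the chunking scheme are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def popcount(rows):
    """Number of set bits in each row of a 2D uint8 array."""
    return np.unpackbits(rows, axis=1).sum(axis=1)


def search_chunk(query, chunk, offset, threshold):
    """Return (row indices, Tanimoto coefficients) of hits in one chunk."""
    common = popcount(chunk & query)
    union = popcount(chunk) + popcount(query[None, :]) - common
    coeff = np.divide(common, union, out=np.zeros(len(chunk)), where=union > 0)
    hits = np.nonzero(coeff >= threshold)[0]
    return offset + hits, coeff[hits]


# Random 64-bit fingerprints stand in for a real dataset.
rng = np.random.default_rng(0)
fps = rng.integers(0, 256, size=(1000, 8), dtype=np.uint8)
query = fps[42]

n_threads = 4
chunks = np.array_split(fps, n_threads)
offsets = np.cumsum([0] + [len(c) for c in chunks[:-1]])

# Each thread scans one chunk; the partial results are concatenated afterwards.
with ThreadPoolExecutor(max_workers=n_threads) as pool:
    parts = list(
        pool.map(search_chunk, [query] * n_threads, chunks, offsets,
                 [0.9] * n_threads)
    )

idx = np.concatenate([p[0] for p in parts])
print(idx)
```

With a dataset this small the threading overhead dominates, which is the same reason larger datasets are needed to see a benefit.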

If you are running this in Binder, don't expect the best performance!

In [1]:
from FPSim2 import run_in_memory_search, run_search
from FPSim2.io import load_query, load_fps
import tables as tb

## Show FP file parameters

In [2]:
fp_filename = 'chembl_24.h5'

with tb.open_file(fp_filename, mode='r') as fp_file:
    config = fp_file.root.config
    print('FP type: ', config[0])
    print('FP parameters: ', config[1])
    print('RDKit version: ', config[2])

FP type:  Morgan
FP parameters:  {'radius': 2, 'nBits': 2048, 'useFeatures': False, 'useChirality': False, 'useBondTypes': True}
RDKit version:  2018.03.4


## Load a query molecule

In [3]:
aspirin = 'CC(=O)Oc1ccccc1C(=O)O'

# load_query uses regexes to distinguish between SMILES and InChI; anything else is treated as a CTAB
query = load_query(aspirin, fp_filename)



## Load fps into memory


In [4]:
%%time
fps = load_fps(fp_filename)


CPU times: user 1.55 s, sys: 320 ms, total: 1.87 s
Wall time: 1.87 s


## Run a search

In [5]:
%%time
results = run_in_memory_search(query, fps, threshold=0.7, coeff='tanimoto', n_threads=1)


CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.3 ms
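
For reference, the Tanimoto coefficient used above is the number of bits set in both fingerprints divided by the number of bits set in either. The following is a minimal NumPy sketch of that formula on packed bit vectors; it is not FPSim2's internal code, and the function name `tanimoto` and the 8-bit toy fingerprints are assumptions for illustration.

```python
import numpy as np


def tanimoto(a, b):
    """Tanimoto coefficient of two bit vectors packed as uint8 arrays."""
    common = int(np.unpackbits(a & b).sum())
    total = int(np.unpackbits(a).sum()) + int(np.unpackbits(b).sum())
    union = total - common
    return common / union if union else 0.0


# Two toy 8-bit fingerprints sharing 2 of their 3 set bits.
a = np.packbits([1, 1, 0, 1, 0, 0, 0, 0])
b = np.packbits([1, 0, 0, 1, 1, 0, 0, 0])

print(tanimoto(a, b))  # 2 common bits / 4 bits in the union = 0.5
```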


## Results are returned as a structured NumPy array

In [6]:
print(results.shape)
results

(4,)


array([(   1280, 1.        ), (2096455, 0.8888889 ),
       ( 271022, 0.85714287), ( 875057, 0.7       )],
      dtype=[('mol_id', '<u8'), ('coeff', '<f4')])
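
Because the result is a structured array, its fields can be accessed by name and filtered with ordinary NumPy indexing. The sketch below rebuilds the array shown above by hand so that it runs without FPSim2; the 0.85 cut-off is an arbitrary value chosen for illustration.

```python
import numpy as np

# Same values and dtype as the search output above.
results = np.array(
    [(1280, 1.0), (2096455, 0.8888889), (271022, 0.85714287), (875057, 0.7)],
    dtype=[("mol_id", "<u8"), ("coeff", "<f4")],
)

# Access a single field by name.
mol_ids = results["mol_id"]

# Keep only the hits above a stricter cut-off.
close = results[results["coeff"] >= 0.85]

print(mol_ids.tolist())
print(close["mol_id"].tolist())
```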

## On disk search

If your dataset doesn't fit in memory, it's still possible to run searches directly against the file on disk.

In [7]:
%%time
results = run_search(aspirin, fp_filename, threshold=0.7, coeff='tanimoto', n_processes=1)

CPU times: user 8 ms, sys: 20 ms, total: 28 ms
Wall time: 1.06 s


In [8]:
results

array([(   1280, 1.        ), (2096455, 0.8888889 ),
       ( 271022, 0.85714287), ( 875057, 0.7       )],
      dtype=[('mol_id', '<u8'), ('coeff', '<f4')])

# Substructure search
- The threshold is always set to 1, regardless of the value you pass.
- Substructure search is done using RDKit PatternFingerprints.
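
Fingerprint-based substructure screening generally relies on a simple bitwise test: a molecule can only contain the query if every bit set in the query fingerprint is also set in the molecule's fingerprint. The sketch below shows that test with NumPy. It is an assumption about the general technique, not a copy of FPSim2's code, and `may_contain` and the toy fingerprints are hypothetical. Screens of this kind can produce false positives, so hits are usually confirmed with a full substructure match; this demo does not state whether that step is performed.

```python
import numpy as np


def may_contain(query_fp, target_fp):
    """True if every bit set in query_fp is also set in target_fp."""
    return bool(np.all((query_fp & target_fp) == query_fp))


# Toy 16-bit fingerprints packed into two bytes each.
query = np.array([0b00010010, 0b10000000], dtype=np.uint8)
hit = np.array([0b01010011, 0b10000100], dtype=np.uint8)
miss = np.array([0b01000011, 0b10000100], dtype=np.uint8)

print(may_contain(query, hit))
print(may_contain(query, miss))
```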

In [9]:
fp_filename = 'chembl_24_substructure.h5'

with tb.open_file(fp_filename, mode='r') as fp_file:
    config = fp_file.root.config
    print('FP type: ', config[0])
    print('FP parameters: ', config[1])
    print('RDKit version: ', config[2])

FP type:  RDKPatternFingerprint
FP parameters:  {'fpSize': 2048, 'atomCounts': [], 'setOnlyBits': None}
RDKit version:  2018.03.4


## Query needs to be reloaded as an RDKPatternFingerprint

In [10]:
%%time
# the query must be reloaded so that it uses the same parameters used to create the substructure fp file
query = load_query(aspirin, fp_filename)

# load fps into memory
fps = load_fps(fp_filename)

CPU times: user 840 ms, sys: 328 ms, total: 1.17 s
Wall time: 1.16 s


In [11]:
%%time
results = run_in_memory_search(query, fps, threshold=1.0, coeff='substructure')


CPU times: user 132 ms, sys: 0 ns, total: 132 ms
Wall time: 130 ms


In [12]:
print(results.shape)
results

(7533,)


array([(2197614, 1.), (2197571, 1.), (2197558, 1.), ..., (   1760, 1.),
       (   1722, 1.), (   1280, 1.)],
      dtype=[('mol_id', '<u8'), ('coeff', '<f4')])