### Parameter optimization for NMSLIB

https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md


- M - the number of bi-directional links created for every new element during construction.

    - M is tightly connected with the intrinsic dimensionality of the data and strongly affects memory consumption (roughly proportional to M).

    - A higher M gives higher accuracy at the cost of a longer run time for fixed ef/ef_construction.

    - **The reasonable range for M is 2-100.**

    - Higher M works better on datasets with high intrinsic dimensionality and/or when high recall is required, while lower M works better on datasets with low intrinsic dimensionality and/or when low recall is acceptable.

    - Memory consumption is roughly M * 8-10 bytes per stored element.

    - For example, for random vectors with dim=4 the optimal M for search is around 6, while high-dimensional datasets (word embeddings, good face descriptors) need a higher M (e.g. M=48-64) for optimal performance at high recall.

    - **The range M=12-48 is fine for most use cases.**
        - **We keep M at 16.**

    - When M is changed, the other parameters have to be updated as well. Nonetheless, ef and ef_construction can be roughly estimated by assuming that M * ef_construction is constant.


- ef - the size of the dynamic list for the nearest neighbors (used during the search). Higher ef leads to more accurate but slower search. 

    - **ef cannot be set lower than the number of queried nearest neighbors k.**
    - The value of ef can be anything between k and the size of the dataset.


- ef_construction - has the same meaning as ef, but controls the tradeoff between index build time and index accuracy.

    - ef_construction controls the search speed/build speed tradeoff of the index.
    - A bigger ef_construction leads to longer construction but better index quality.
    - At some point, increasing ef_construction no longer improves the quality of the index.
    - One way to check whether ef_construction was chosen well is to measure the recall of an M-nearest-neighbor search with ef = ef_construction: if the recall is lower than 0.9, then there is room for improvement.
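
The rules of thumb above reduce to simple arithmetic. The following is a minimal sketch, not part of NMSLIB or hnswlib: the helper names `estimate_link_memory_bytes`, `rescale_ef_construction`, and `ef_is_valid` are hypothetical, and the only inputs are the guidelines quoted above (roughly M * 8-10 bytes per element, M * ef_construction roughly constant, and k <= ef <= dataset size).

```python
def estimate_link_memory_bytes(n_elements, M, bytes_per_link=(8, 10)):
    """Return a (low, high) estimate in bytes using the M * 8-10 bytes per element rule."""
    low, high = bytes_per_link
    return n_elements * M * low, n_elements * M * high


def rescale_ef_construction(M_old, efC_old, M_new):
    """Pick a starting ef_construction for M_new so that M * ef_construction stays roughly constant."""
    return max(1, round(M_old * efC_old / M_new))


def ef_is_valid(ef, k, n_elements):
    """ef must not be lower than k and cannot usefully exceed the dataset size."""
    return k <= ef <= n_elements


print(estimate_link_memory_bytes(1000, 16))   # (128000, 160000)
print(rescale_ef_construction(16, 64, 32))    # 32
print(ef_is_valid(2, k=3, n_elements=1000))   # False
```

These values are starting points only; the recall measurements later in this notebook are what decide whether a setting is good enough.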

https://github.com/nmslib/nmslib/blob/master/manual/methods.md


In [52]:
import sys
import time
import math

from Bio import SeqIO
import nmslib
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
import pandas as pd

import guidemaker



In [53]:
# Build the ground truth: a brute-force index over the one-hot encoded targets
pamobj = guidemaker.core.Pam("NGG", "5prime")
gb = SeqIO.parse("test_data/Carsonella_ruddii.fasta", "fasta")
pamtargets = pamobj.find_targets(seq_record_iter=gb, strand="forward", target_len=20)
tl = guidemaker.core.TargetList(targets=pamtargets, lcp=10, hammingdist=2, knum=2)
tl.find_unique_near_pam()
bintargets = tl._one_hot_encode(tl.targets)

index = nmslib.init(space='bit_hamming',
                    dtype=nmslib.DistType.INT,
                    data_type=nmslib.DataType.OBJECT_AS_STRING,
                    method='brute_force')

index.addDataPointBatch(bintargets)

index.createIndex(print_progress=True)



In [54]:
# Computing gold-standard 
start = time.time()
truth_list = index.knnQueryBatch(bintargets, k=3, num_threads = 4)
end = time.time()
print('brute-force kNN time total=%f (sec), per query=%f (sec)' % 
      (end-start, float(end-start)/len(bintargets)) )


brute-force kNN time total=0.054627 (sec), per query=0.000031 (sec)


In [55]:
def recall(results, truth):
    """Calculate recall on the two smallest kNN distances.

    Recall is computed on the top 2 distances rather than on neighbor labels: what matters is that the
    algorithm estimates the correct distance, not the exact identity of the neighbor, because several
    neighbors can share the same distance.
    """
    assert len(results) == len(truth)
    tot = len(results)
    correct = 0
    for res, tr in zip(results, truth):
        # res[1] and tr[1] are the distance arrays; [0:2] takes the first two distances,
        # and all() is needed because == compares the arrays element by element
        if all(res[1][0:2] == tr[1][0:2]):
            correct += 1
    return correct / tot
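
To make the distance-based definition concrete, here is a small self-contained sketch on toy data. The function `distance_recall` is a hypothetical pure-Python variant of `recall` above (it compares lists instead of NumPy arrays), and the neighbor ids and distances are invented for illustration only.

```python
def distance_recall(results, truth, top=2):
    """Fraction of queries whose first `top` distances match the ground truth."""
    assert len(results) == len(truth)
    correct = 0
    for (_, res_dist), (_, true_dist) in zip(results, truth):
        if list(res_dist[:top]) == list(true_dist[:top]):
            correct += 1
    return correct / len(results)


# Each entry mimics one query result: (neighbor ids, distances)
truth = [([0, 5, 9], [0, 2, 4]),
         ([1, 7, 3], [0, 2, 2]),
         ([2, 4, 8], [0, 6, 8])]
approx = [([0, 6, 9], [0, 2, 4]),   # different neighbor id, same distances: counted as correct
          ([1, 7, 3], [0, 2, 2]),   # identical to the ground truth
          ([2, 8, 4], [0, 8, 8])]   # second distance is wrong: counted as a miss

print(distance_recall(approx, truth))  # 2 of 3 queries match
```

The first query shows why labels are ignored: a different neighbor at the same distance still counts as a correct answer.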

In [56]:
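def test_func(truth, bintargets, M, efC, post, ef, delaunay_type=2, threads=4):
    """Build an HNSW index, query it, and return (recall, elapsed seconds).

    The elapsed time covers both index construction and the batch query.
    delaunay_type is accepted but not passed to the index.
    """
    start = time.time()
    index_params = {'M': M, 'indexThreadQty': threads, 'efConstruction': efC, 'post': post}
    index = nmslib.init(space='bit_hamming',
                        dtype=nmslib.DistType.INT,
                        data_type=nmslib.DataType.OBJECT_AS_STRING,
                        method='hnsw')
    index.addDataPointBatch(bintargets)
    index.createIndex(index_params)
    index.setQueryTimeParams({'efSearch': ef})
    # k=3 matches the brute-force ground truth; query with the same thread count as the build
    results_list = index.knnQueryBatch(bintargets, k=3, num_threads=threads)
    end = time.time()
    rc = recall(results_list, truth)
    return rc, float(end - start)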


In [57]:
# used in guidemaker
test_func(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)

(1.0, 0.21663904190063477)

In [7]:
def sim_nsmlib_para(increase_by: int=10):
    M_range = range(2, 101, increase_by)
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post_range = range(1, 101, increase_by)
    print("Total number of combinations: ", (len(M_range) * len(ef_range) * len(efC_range) * len(post_range)))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    for m in M_range:
        for ef in ef_range:
            for efC in efC_range:
                for post in post_range:
                    rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
                    n +=1
                    if rc > 0.9900000000000000:
                        pdict['run'].append(n)
                        pdict['M'].append(m)
                        pdict['ef'].append(ef)
                        pdict['efC'].append(efC)
                        pdict['post'].append(post)
                        pdict['accuracy'].append(rc)
                        pdict['time'].append(tt)
                        if n % 50 == 0:
                            print(n, m, ef, efC, post, rc, tt)
    return pdict

In [8]:
aa = sim_nsmlib_para(increase_by=50)


Total number of combinations:  144
100 52 103 53 1 1.0 0.1849370002746582
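
The size of the search grid can be checked before running anything. This is a minimal sketch that rebuilds the four ranges from `sim_nsmlib_para` with `itertools.product`; the helper name `build_grid` is hypothetical and nothing here touches NMSLIB.

```python
from itertools import product


def build_grid(increase_by=10):
    """Return every (M, ef, efC, post) combination for the given step size."""
    M_range = range(2, 101, increase_by)
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post_range = range(1, 101, increase_by)
    return list(product(M_range, ef_range, efC_range, post_range))


coarse = build_grid(increase_by=50)
print(len(coarse))             # 144, matching the total printed above
print(coarse[0])               # (2, 3, 3, 1)
print(len(build_grid(10)))     # 67600 with the default step
```

With the default step of 10 the grid has 67,600 combinations, which is why the first pass uses a step of 50.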


In [9]:
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])

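| index | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 2 | 79 | 52 | 3 | 103 | 51 | 0.998876 | 0.090663 |
| 10 | 89 | 52 | 53 | 53 | 51 | 0.999438 | 0.092241 |
| 29 | 110 | 52 | 153 | 3 | 1 | 0.994941 | 0.096118 |
| 4 | 81 | 52 | 3 | 153 | 51 | 0.994941 | 0.108090 |
| 40 | 122 | 52 | 203 | 3 | 1 | 0.998876 | 0.113843 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 47 | 130 | 52 | 203 | 203 | 1 | 1.000000 | 0.364924 |
| 38 | 120 | 52 | 153 | 253 | 1 | 1.000000 | 0.368527 |
| 58 | 142 | 52 | 253 | 203 | 1 | 1.000000 | 0.399264 |
| 49 | 132 | 52 | 203 | 253 | 1 | 1.000000 | 0.419354 |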


In [10]:
def sim_nsmlib_para_optimized(increase_by: int=10):
    M = set([8, 16, 24, 32])
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range) *len(M))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    for m in M:
        for ef in ef_range:
            for efC in efC_range:
                rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
                n +=1
                if rc > 0.9900000000000000:
                    pdict['run'].append(n)
                    pdict['M'].append(m)
                    pdict['ef'].append(ef)
                    pdict['efC'].append(efC)
                    pdict['post'].append(post)
                    pdict['accuracy'].append(rc)
                    pdict['time'].append(tt)
                    if n % 50 == 0:  # progress line, printed only for runs with recall > 0.99
                        print(n, m, ef, efC, post, rc, tt)
    return pdict

In [11]:
aa = sim_nsmlib_para_optimized(increase_by=50)
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])

Total number of combinations:  144
100 32 203 103 1 1.0 0.29650306701660156


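| index | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 42 | 56 | 16 | 153 | 3 | 1 | 0.993817 | 0.091623 |
| 15 | 26 | 8 | 203 | 3 | 1 | 0.993255 | 0.099367 |
| 75 | 92 | 32 | 153 | 3 | 1 | 0.993255 | 0.100104 |
| 108 | 128 | 24 | 153 | 3 | 1 | 0.994379 | 0.104427 |
| 48 | 62 | 16 | 203 | 3 | 1 | 0.996627 | 0.107816 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 113 | 133 | 24 | 153 | 253 | 1 | 1.000000 | 0.395649 |
| 124 | 144 | 24 | 253 | 203 | 1 | 1.000000 | 0.397249 |
| 91 | 108 | 32 | 253 | 203 | 1 | 1.000000 | 0.403382 |
| 125 | 145 | 24 | 253 | 253 | 1 | 1.000000 | 0.424400 |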


In [12]:
def sim_nsmlib_para_optimized2(increase_by: int=10):
    M = 16 # fix this to 16 ~ memory used
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    m=M
    for ef in ef_range:
        for efC in efC_range:
            rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
            n +=1
            if rc > 0.9900000000000000:
                pdict['run'].append(n)
                pdict['M'].append(m)
                pdict['ef'].append(ef)
                pdict['efC'].append(efC)
                pdict['post'].append(post)
                pdict['accuracy'].append(rc)
                pdict['time'].append(tt)
                if n % 100 == 0:  # print progress on every 100th run (only runs with recall > 0.99 get here)
                    print(n, m, ef, efC, post, rc, tt)
    return pdict

In [13]:
aa = sim_nsmlib_para_optimized2(increase_by=10)
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])


Total number of combinations:  676
100 16 33 203 1 1.0 0.23459982872009277
200 16 73 163 1 1.0 0.27737998962402344
300 16 113 123 1 1.0 0.3645968437194824
400 16 153 83 1 1.0 0.27274608612060547
500 16 193 43 1 1.0 0.35472726821899414
600 16 233 3 1 0.9977515458122541 0.12653112411499023


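| index | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 24 | 29 | 16 | 13 | 13 | 1 | 0.997189 | 0.075887 |
| 25 | 30 | 16 | 13 | 23 | 1 | 0.998876 | 0.077744 |
| 0 | 4 | 16 | 3 | 23 | 1 | 0.993817 | 0.078565 |
| 1 | 5 | 16 | 3 | 33 | 1 | 0.996627 | 0.080611 |
| 49 | 55 | 16 | 23 | 13 | 1 | 0.998314 | 0.080780 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 344 | 360 | 16 | 133 | 203 | 1 | 1.000000 | 0.606702 |
| 345 | 361 | 16 | 133 | 213 | 1 | 1.000000 | 0.623796 |
| 419 | 435 | 16 | 163 | 173 | 1 | 1.000000 | 0.649193 |
| 347 | 363 | 16 | 133 | 233 | 1 | 1.000000 | 0.650530 |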


In [15]:
# used in guidemaker

print("Recall and Runtime for the original parameters used in GuideMaker")
print("Recall: %f Run time: %f " % (test_func(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)))


Recall and Runtime for the original parameters used in GuideMaker
Recall: 1.000000 Run time: 0.204982 


In [17]:
# After parameter tuning
print("Recall and Runtime after tuning the parameters")
print("Recall: %f Run time: %f " % (test_func(truth=truth_list, bintargets=bintargets, M=16, efC=13, post=1, ef=13, threads=4)))

Recall and Runtime after tuning the parameters
Recall: 0.996627 Run time: 0.085894 


#### Since the run time was low around efC=13 and ef=13, we can do a finer optimization around those values.
- Change the range of ef to range(3, 20)
- Change the range of efC to range(3, 20)
- Use a step of 1 (the run below passes increase_by=1; the function default is 2)

In [18]:
def sim_nsmlib_para_optimized_finer(increase_by: int=2):
    M = 16 # fix this to 16 ~ memory used
    ef_range = range(3, 20, increase_by)
    efC_range = range(3, 20, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    m=M
    for ef in ef_range:
        for efC in efC_range:
            rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
            n +=1
            if rc > 0.9900000000000000:
                pdict['run'].append(n)
                pdict['M'].append(m)
                pdict['ef'].append(ef)
                pdict['efC'].append(efC)
                pdict['post'].append(post)
                pdict['accuracy'].append(rc)
                pdict['time'].append(tt)
                if n % 40 == 0:  # print progress on every 40th run (only runs with recall > 0.99 get here)
                    print(n, m, ef, efC, post, rc, tt)
    return pdict

In [19]:
aa = sim_nsmlib_para_optimized_finer(increase_by=1)
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])

Total number of combinations:  289
80 16 7 13 1 0.9943788645306352 0.07059407234191895
120 16 9 19 1 0.9977515458122541 0.06818723678588867
200 16 14 14 1 0.9994378864530635 0.07945895195007324
280 16 19 9 1 0.9926925238898258 0.06801819801330566


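| index | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 18 | 66 | 16 | 6 | 16 | 1 | 0.993255 | 0.059769 |
| 11 | 50 | 16 | 5 | 17 | 1 | 0.992130 | 0.060514 |
| 39 | 111 | 16 | 9 | 10 | 1 | 0.991006 | 0.060793 |
| 14 | 62 | 16 | 6 | 12 | 1 | 0.994941 | 0.061535 |
| 19 | 67 | 16 | 6 | 17 | 1 | 0.992693 | 0.061857 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 114 | 235 | 16 | 16 | 15 | 1 | 0.997189 | 0.084442 |
| 145 | 285 | 16 | 19 | 14 | 1 | 0.998314 | 0.085288 |
| 124 | 252 | 16 | 17 | 15 | 1 | 0.998314 | 0.085745 |
| 135 | 269 | 16 | 18 | 15 | 1 | 0.997189 | 0.085809 |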


In [58]:
min_record = df[df.time == df.time.min()]
min_record

| index | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 18 | 66 | 16 | 6 | 16 | 1 | 0.993255 | 0.059769 |
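
The `min_record` selection picks the fastest run among those that passed the 0.99 recall filter. The sketch below shows the same idea without pandas and with an adjustable recall threshold. The helper `fastest_above` is hypothetical, and the rows are copied from the fine-grained sweep output above.

```python
# A few rows from the fine-grained sweep (M=16, post=1)
rows = [
    {"ef": 6, "efC": 16, "accuracy": 0.993255, "time": 0.059769},
    {"ef": 5, "efC": 17, "accuracy": 0.992130, "time": 0.060514},
    {"ef": 6, "efC": 12, "accuracy": 0.994941, "time": 0.061535},
    {"ef": 19, "efC": 14, "accuracy": 0.998314, "time": 0.085288},
    {"ef": 17, "efC": 15, "accuracy": 0.998314, "time": 0.085745},
]


def fastest_above(rows, min_recall):
    """Return the fastest row whose recall is at least min_recall, or None if none qualifies."""
    candidates = [r for r in rows if r["accuracy"] >= min_recall]
    return min(candidates, key=lambda r: r["time"]) if candidates else None


print(fastest_above(rows, 0.99))    # the ef=6, efC=16 row, same as min_record
print(fastest_above(rows, 0.995))   # the ef=19, efC=14 row
```

Raising the threshold from 0.99 to 0.995 moves the choice from ef=6, efC=16 to ef=19, efC=14, so the "optimal" parameters depend on how much recall the application needs.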


In [83]:
# After fine parameter tuning
print("Recall and Runtime after fine tuning of the parameters")
print("Recall: %f Run time: %f " %
      (test_func(truth=truth_list,
                 bintargets=bintargets,
                 M=int(min_record.M.values[0]),
                 efC=int(min_record.efC.values[0]),
                 post=1,
                 ef=int(min_record.ef.values[0]),
                 threads=4)))

Recall and Runtime after fine tuning of the parameters
Recall: 0.993255 Run time: 0.085226 
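
The rerun above took 0.085226 s, while the same parameters took 0.059769 s during the sweep, which suggests that single timings at this scale are noisy. One possible way to get a steadier figure is to repeat the measurement and take the median. This is a minimal sketch with a hypothetical helper `median_runtime`; the workload is a stand-in, and in this notebook it would be a function wrapping the `test_func(...)` call.

```python
import statistics
import time


def median_runtime(func, repeats=5):
    """Call func() `repeats` times and return (last result, median elapsed seconds)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        timings.append(time.perf_counter() - start)
    return result, statistics.median(timings)


# Stand-in workload used only to show the call pattern
result, seconds = median_runtime(lambda: sum(range(10000)), repeats=5)
print(result, seconds)
```

Comparing medians rather than single runs makes it less likely that a parameter set wins the sweep by chance.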


In [84]:
print("Optimized parameters: \n")
print("M:", int(min_record.M.values[0]), 
      "efC:", int(min_record.efC.values[0]), 
      "ef:", int(min_record.ef.values[0]),
     "post:", 1)

Optimized parameters: 

M: 16 efC: 16 ef: 6 post: 1
