### Parameter optimization for NMSLIB

https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md


- M - the number of bi-directional links created for every new element during construction.

    - M is tightly connected with the intrinsic dimensionality of the data and strongly affects memory consumption (~M).

    - Higher M leads to higher accuracy and longer run time at fixed ef/efConstruction.

    - **Reasonable range for M is 2-100.**

    - Higher M works better on datasets with high intrinsic dimensionality and/or when high recall is required, while lower M works better for datasets with low intrinsic dimensionality and/or when low recall is acceptable.

    - The parameter also determines the algorithm's memory consumption, which is roughly M * 8-10 bytes per stored element.

    - As an example, for dim=4 random vectors the optimal M for search is somewhere around 6, while high-dimensional datasets (word embeddings, good face descriptors) require higher M (e.g. M=48-64) for optimal performance at high recall.

    - **The range M=12-48 is OK for most use cases.**
        - **We keep this at 16.**

    - When M is changed, the other parameters have to be updated. Nonetheless, ef and ef_construction can be roughly estimated by assuming that M * ef_construction is a constant.


- ef - the size of the dynamic list for the nearest neighbors (used during the search). Higher ef leads to more accurate but slower search. 

    - **ef cannot be set lower than the number of queried nearest neighbors k.**
    - The value of ef can be anything between k and the size of the dataset.


- ef_construction - has the same meaning as ef, but controls the index build time/index accuracy tradeoff.

    - ef_construction controls the index search speed/build speed tradeoff.
    - Bigger ef_construction leads to longer construction, but better index quality.
    - At some point, increasing ef_construction does not improve the quality of the index.
    - One way to check whether the choice of ef_construction was OK is to measure recall for M nearest neighbor search when ef = ef_construction: if the recall is lower than 0.9, then there is room for improvement.
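
The rules of thumb above can be turned into quick arithmetic. The sketch below is only an illustration: the helper names `estimate_link_memory_bytes` and `rescale_ef_construction` are hypothetical, the 8-10 bytes per link figure is the rough estimate quoted above, and the constant-product rule is the heuristic from the list, not an exact law.

```python
def estimate_link_memory_bytes(n_elements, M, bytes_per_link=(8, 10)):
    """Rough (low, high) estimate of link memory: about M * 8-10 bytes per element."""
    low, high = bytes_per_link
    return n_elements * M * low, n_elements * M * high


def rescale_ef_construction(old_M, old_efC, new_M):
    """Keep M * ef_construction roughly constant when M changes (heuristic only)."""
    return max(1, round(old_M * old_efC / new_M))


# Example: 1,000,000 stored elements with M=16.
low, high = estimate_link_memory_bytes(1_000_000, 16)
print(f"link memory: {low / 1e6:.0f}-{high / 1e6:.0f} MB")

# Example: moving from M=16, efC=64 to M=32 under the constant-product heuristic.
print(rescale_ef_construction(16, 64, 32))
```

The rescaled value is only a starting point; the recall check described in the last bullet is still needed to confirm it.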

https://github.com/nmslib/nmslib/blob/master/manual/methods.md


In [85]:
import sys
import time
import math

from Bio import SeqIO
import nmslib
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
import pandas as pd

import guidemaker



In [86]:
# Build the ground truth: find PAM targets, one-hot encode them, and create a brute-force index
pamobj = guidemaker.core.Pam("NGG", "5prime")
gb = SeqIO.parse("test_data/Carsonella_ruddii.fasta", "fasta")
pamtargets = pamobj.find_targets(seq_record_iter=gb, strand="forward", target_len=20)
tl = guidemaker.core.TargetList(targets=pamtargets, lcp=10, hammingdist=2, knum=2)
tl.find_unique_near_pam()
bintargets = tl._one_hot_encode(tl.targets)

index = nmslib.init(space='bit_hamming',
                    dtype=nmslib.DistType.INT,
                    data_type=nmslib.DataType.OBJECT_AS_STRING,
                    method='brute_force')

index.addDataPointBatch(bintargets)

index.createIndex( print_progress=True)
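
For intuition about what a bit-level Hamming space compares, here is a minimal sketch of one-hot encoding DNA targets into space-separated bit strings. The encoder `one_hot_encode_dna` and its bit layout are assumptions made for illustration; they are not taken from GuideMaker's `_one_hot_encode`, whose exact layout is not shown in this notebook. Under this assumed 4-bits-per-base code, each mismatched base flips two bits, so the bit-level distance is twice the number of mismatched bases.

```python
# Hypothetical 4-bit one-hot code per base; not GuideMaker's actual layout.
ONE_HOT = {"A": "1 0 0 0", "C": "0 1 0 0", "G": "0 0 1 0", "T": "0 0 0 1"}


def one_hot_encode_dna(seq):
    """Encode a DNA string as a space-separated string of 0/1 bits."""
    return " ".join(ONE_HOT[base] for base in seq.upper())


def bit_hamming(a, b):
    """Count differing bits between two space-separated bit strings of equal length."""
    bits_a, bits_b = a.split(), b.split()
    assert len(bits_a) == len(bits_b)
    return sum(x != y for x, y in zip(bits_a, bits_b))


s1, s2 = "ACGT", "ACGA"  # exactly one mismatched base
d = bit_hamming(one_hot_encode_dna(s1), one_hot_encode_dna(s2))
print(d)  # 2: one base mismatch flips two bits under this encoding
```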



In [87]:
# Computing gold-standard 
start = time.time()
truth_list = index.knnQueryBatch(bintargets, k=3, num_threads = 4)
end = time.time()
print('brute-force kNN time total=%f (sec), per query=%f (sec)' % 
      (end-start, float(end-start)/len(bintargets)) )


brute-force kNN time total=0.038202 (sec), per query=0.000021 (sec)


In [88]:
def recall(results, truth):
    """Calculate recall on the top two kNN distances.

    Recall is computed on the top 2 distances rather than on the neighbor labels, because
    what matters is that the algorithm estimates the correct distance, not the identity of
    the neighbor: multiple neighbors can have the same edit distance.
    """
    assert len(results) == len(truth)
    tot = len(results)
    correct = 0
    for res, tr in zip(results, truth):
        # Each entry is (ids, distances); compare the first two distances element-wise.
        # all() is needed because == on arrays returns an array, not a single bool.
        if all(res[1][0:2] == tr[1][0:2]):
            correct += 1
    return correct / tot
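
A toy example makes the distance-based recall concrete. This is a sketch with made-up data: `distance_recall` is a hypothetical standalone variant of the function above, and each result is assumed to be an `(ids, distances)` pair, which is how the results are indexed in this notebook. Plain lists are used so the example runs without any dependencies.

```python
def distance_recall(results, truth, top=2):
    """Fraction of queries whose first `top` distances match the ground truth."""
    assert len(results) == len(truth)
    correct = sum(
        list(res[1][:top]) == list(tr[1][:top]) for res, tr in zip(results, truth)
    )
    return correct / len(results)


# Made-up (ids, distances) pairs for two queries.
truth = [([0, 5, 9], [0, 2, 4]),
         ([1, 7, 3], [0, 2, 2])]
approx = [([0, 6, 9], [0, 2, 4]),   # different neighbor id, same distances: a hit
          ([1, 8, 3], [0, 4, 4])]   # second distance differs: a miss

print(distance_recall(approx, truth))  # 0.5
```

The first query counts as correct even though the second neighbor id differs, which is exactly the behavior the docstring describes.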

In [89]:
def test_func(truth, bintargets, M, efC, post, ef, delaunay_type=2, threads=4):
    """Build an HNSW index, query it, and return (recall, elapsed seconds).

    The elapsed time covers index construction plus the batch query.
    Note: delaunay_type is accepted but not passed to the index.
    """
    start = time.time()
    index_params = {'M': M, 'indexThreadQty': threads, 'efConstruction': efC, 'post': post}
    index = nmslib.init(space='bit_hamming',
                        dtype=nmslib.DistType.INT,
                        data_type=nmslib.DataType.OBJECT_AS_STRING,
                        method='hnsw')
    index.addDataPointBatch(bintargets)
    index.createIndex(index_params)
    index.setQueryTimeParams({'efSearch': ef})
    # Use the same thread count for querying as for index construction.
    results_list = index.knnQueryBatch(bintargets, k=3, num_threads=threads)
    end = time.time()
    rc = recall(results_list, truth)
    return rc, float(end - start)
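
`test_func` reports a single elapsed time that covers index construction and the batch query together. If the two phases need to be compared separately, a small timing helper is one option. The sketch below is generic: `timed` is a hypothetical helper, and the two lambdas are placeholders standing in for the build and query steps so that the example runs without NMSLIB.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Placeholders for the build and query phases (not NMSLIB calls).
index, build_time = timed(lambda: sorted(range(100_000, 0, -1)))
hits, query_time = timed(lambda: [index[i] for i in range(0, 100_000, 1000)])

print(f"build={build_time:.6f}s query={query_time:.6f}s "
      f"total={build_time + query_time:.6f}s")
```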


In [90]:
# Parameters currently used in GuideMaker
test_func(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)

(1.0, 0.2395341396331787)

In [91]:
def sim_nsmlib_para(increase_by: int=10):
    """Grid search over M, ef, efC and post; keep runs with recall above 0.99."""
    M_range = range(2, 101, increase_by)
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post_range = range(1, 101, increase_by)
    print("Total number of combinations: ", (len(M_range) * len(ef_range) * len(efC_range) * len(post_range)))
    pdict = dict(run=[], M=[], ef=[], efC=[], post=[], accuracy=[], time=[])
    n = 1  # incremented before recording, so the first run is stored as run 2
    for m in M_range:
        for ef in ef_range:
            for efC in efC_range:
                for post in post_range:
                    rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
                    n += 1
                    if rc > 0.99:
                        pdict['run'].append(n)
                        pdict['M'].append(m)
                        pdict['ef'].append(ef)
                        pdict['efC'].append(efC)
                        pdict['post'].append(post)
                        pdict['accuracy'].append(rc)
                        pdict['time'].append(tt)
                        # Progress is printed only for runs that passed the recall threshold.
                        if n % 50 == 0:
                            print(n, m, ef, efC, post, rc, tt)
    return pdict
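
The nested loops can also be expressed with `itertools.product`, which keeps the grid definition and the evaluation separate. This is a sketch, not the notebook's implementation: `grid_search` and the `evaluate` callback are hypothetical names, and `fake_evaluate` is a made-up stand-in for `test_func` so the example runs without an index.

```python
from itertools import product


def grid_search(grid, evaluate, min_recall=0.99):
    """Evaluate every combination in `grid`; keep rows with recall above min_recall."""
    names = list(grid)
    rows = []
    for run, values in enumerate(product(*grid.values()), start=1):
        params = dict(zip(names, values))
        rc, tt = evaluate(**params)
        if rc > min_recall:
            rows.append({"run": run, **params, "accuracy": rc, "time": tt})
    return rows


def fake_evaluate(M, ef, efC, post):
    """Made-up scoring function used only to exercise the loop."""
    rc = 1.0 if ef * efC >= 150 else 0.5
    return rc, 0.001 * (ef + efC)


grid = {
    "M": range(2, 101, 50),
    "ef": range(3, 257, 50),
    "efC": range(3, 257, 50),
    "post": range(1, 101, 50),
}
rows = grid_search(grid, fake_evaluate)
total = len(grid["M"]) * len(grid["ef"]) * len(grid["efC"]) * len(grid["post"])
print("Total number of combinations:", total)
print("Combinations kept:", len(rows))
```

`product` iterates in the same order as the nested loops (M outermost, post innermost), and the resulting list of dicts can be passed directly to `pd.DataFrame`.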

In [92]:
aa = sim_nsmlib_para(increase_by=50)


Total number of combinations:  144
100 52 103 53 1 1.0 0.1730508804321289


In [93]:
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])

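| | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 1 | 77 | 52 | 3 | 53 | 51 | 0.991006 | 0.068928 |
| 11 | 89 | 52 | 53 | 53 | 51 | 0.999438 | 0.089482 |
| 3 | 79 | 52 | 3 | 103 | 51 | 0.997189 | 0.095062 |
| 30 | 110 | 52 | 153 | 3 | 1 | 0.992693 | 0.095173 |
| 21 | 101 | 52 | 103 | 53 | 51 | 0.997752 | 0.108642 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 61 | 144 | 52 | 253 | 253 | 1 | 1.000000 | 0.414849 |
| 39 | 120 | 52 | 153 | 253 | 1 | 1.000000 | 0.437992 |
| 40 | 121 | 52 | 153 | 253 | 51 | 1.000000 | 0.438575 |
| 37 | 118 | 52 | 153 | 203 | 1 | 1.000000 | 0.513938 |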


In [94]:
def sim_nsmlib_para_optimized(increase_by: int=10):
    M = set([8, 16, 24, 32])
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range) *len(M))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    for m in M:
        for ef in ef_range:
            for efC in efC_range:
                rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
                n +=1
                if rc > 0.9900000000000000:
                    pdict['run'].append(n)
                    pdict['M'].append(m)
                    pdict['ef'].append(ef)
                    pdict['efC'].append(efC)
                    pdict['post'].append(post)
                    pdict['accuracy'].append(rc)
                    pdict['time'].append(tt)
                    if n % 50 == 0:
                        print(n, m, ef, efC, post, rc, tt)
    return pdict

In [95]:
aa = sim_nsmlib_para_optimized(increase_by=50)
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])

Total number of combinations:  144
100 32 203 103 1 1.0 0.35114383697509766


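| | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 10 | 20 | 8 | 153 | 3 | 1 | 0.991568 | 0.092001 |
| 76 | 92 | 32 | 153 | 3 | 1 | 0.995503 | 0.100844 |
| 43 | 56 | 16 | 153 | 3 | 1 | 0.991006 | 0.102794 |
| 109 | 128 | 24 | 153 | 3 | 1 | 0.992693 | 0.104011 |
| 49 | 62 | 16 | 203 | 3 | 1 | 0.996627 | 0.108185 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 93 | 109 | 32 | 253 | 253 | 1 | 1.000000 | 0.435042 |
| 119 | 138 | 24 | 203 | 203 | 1 | 1.000000 | 0.477343 |
| 106 | 125 | 24 | 103 | 153 | 1 | 1.000000 | 0.479058 |
| 111 | 130 | 24 | 153 | 103 | 1 | 1.000000 | 0.501460 |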


In [96]:
def sim_nsmlib_para_optimized2(increase_by: int=10):
    M = 16 # fix this to 16 ~ memory used
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    m=M
    for ef in ef_range:
        for efC in efC_range:
            rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
            n +=1
            if rc > 0.9900000000000000:
                pdict['run'].append(n)
                pdict['M'].append(m)
                pdict['ef'].append(ef)
                pdict['efC'].append(efC)
                pdict['post'].append(post)
                pdict['accuracy'].append(rc)
                pdict['time'].append(tt)
                if n % 100 == 0:  # print when the run counter is a multiple of 100
                    print(n, m, ef, efC, post, rc, tt)
    return pdict

In [97]:
aa = sim_nsmlib_para_optimized2(increase_by=10)
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])


Total number of combinations:  676
100 16 33 203 1 1.0 0.4325900077819824
200 16 73 163 1 1.0 0.25738072395324707
300 16 113 123 1 1.0 0.25475215911865234
400 16 153 83 1 1.0 0.24026966094970703
500 16 193 43 1 1.0 0.21326303482055664
600 16 233 3 1 0.9971894322653176 0.12473297119140625


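| | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 24 | 29 | 16 | 13 | 13 | 1 | 0.997752 | 0.092312 |
| 25 | 30 | 16 | 13 | 23 | 1 | 1.000000 | 0.096122 |
| 0 | 4 | 16 | 3 | 23 | 1 | 0.990444 | 0.096804 |
| 324 | 340 | 16 | 133 | 3 | 1 | 0.992130 | 0.098905 |
| 74 | 81 | 16 | 33 | 13 | 1 | 0.998876 | 0.100397 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 46 | 51 | 16 | 13 | 233 | 1 | 1.000000 | 0.422731 |
| 93 | 100 | 16 | 33 | 203 | 1 | 1.000000 | 0.432590 |
| 640 | 656 | 16 | 253 | 43 | 1 | 1.000000 | 0.443456 |
| 45 | 50 | 16 | 13 | 223 | 1 | 1.000000 | 0.443880 |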


In [98]:
# Parameters originally used in GuideMaker

print("Recall and runtime for the original parameters used in GuideMaker")
print("Recall: %f Run time: %f " % (test_func(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)))


Recall and runtime for the original parameters used in GuideMaker
Recall: 1.000000 Run time: 0.276107 


In [99]:
# After parameter tuning
print("Recall and runtime after tuning the parameters")
print("Recall: %f Run time: %f " % (test_func(truth=truth_list, bintargets=bintargets, M=16, efC=13, post=1, ef=13, threads=4)))

Recall and runtime after tuning the parameters
Recall: 0.994379 Run time: 0.102167 


#### Since the run time was low around efC=13 and ef=13, we can do a finer optimization around those values.
- Change the range of ef to range(3, 20)
- Change the range of efC to range(3, 20)
- Reduce the step: increase_by defaults to 2, and the run below uses increase_by=1
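
The size of each grid follows directly from the ranges. As a quick check in plain Python, using only the ranges that appear in this notebook, the coarse grid (step 10) and the finer grid (steps 2 and 1) contain:

```python
coarse = range(3, 257, 10)
fine_step2 = range(3, 20, 2)
fine_step1 = range(3, 20, 1)

for name, r in [("coarse", coarse), ("fine, step 2", fine_step2), ("fine, step 1", fine_step1)]:
    # ef and efC share the same range, so the grid size is the square of its length.
    print(f"{name}: {len(r)} values per axis, {len(r) ** 2} combinations")
```

The coarse grid has 26 values per axis (676 combinations); the fine grid has 9 values per axis with step 2 (81 combinations) and 17 with step 1 (289 combinations).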

In [100]:
def sim_nsmlib_para_optimized_finer(increase_by: int=2):
    M = 16 # fix this to 16 ~ memory used
    ef_range = range(3, 20, increase_by)
    efC_range = range(3, 20, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    m=M
    for ef in ef_range:
        for efC in efC_range:
            rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
            n +=1
            if rc > 0.9900000000000000:
                pdict['run'].append(n)
                pdict['M'].append(m)
                pdict['ef'].append(ef)
                pdict['efC'].append(efC)
                pdict['post'].append(post)
                pdict['accuracy'].append(rc)
                pdict['time'].append(tt)
                if n % 40 == 0:  # print when the run counter is a multiple of 40
                    print(n, m, ef, efC, post, rc, tt)
    return pdict

In [101]:
aa = sim_nsmlib_para_optimized_finer(increase_by=1)
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])

Total number of combinations:  289
80 16 7 13 1 0.9938167509836987 0.08776712417602539
120 16 9 19 1 0.9977515458122541 0.08578777313232422
200 16 14 14 1 0.9971894322653176 0.09350228309631348


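| | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 40 | 111 | 16 | 9 | 10 | 1 | 0.992130 | 0.074920 |
| 97 | 213 | 16 | 15 | 10 | 1 | 0.992693 | 0.076067 |
| 128 | 263 | 16 | 18 | 9 | 1 | 0.990444 | 0.076310 |
| 117 | 246 | 16 | 17 | 9 | 1 | 0.991568 | 0.077042 |
| 59 | 146 | 16 | 11 | 11 | 1 | 0.991568 | 0.077956 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 112 | 235 | 16 | 16 | 15 | 1 | 0.998314 | 0.100262 |
| 122 | 251 | 16 | 17 | 14 | 1 | 0.998876 | 0.100481 |
| 123 | 252 | 16 | 17 | 15 | 1 | 0.997752 | 0.103104 |
| 144 | 286 | 16 | 19 | 15 | 1 | 0.998876 | 0.104042 |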


In [102]:
min_record = df[df.time == df.time.min()]
min_record

| | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 40 | 111 | 16 | 9 | 10 | 1 | 0.99213 | 0.07492 |


In [106]:
# After parameter tuning
print("Recall and runtime after fine tuning of the parameters")
print("Recall: %f Run time: %f " %
      (test_func(truth=truth_list,
                 bintargets=bintargets,
                 M=int(min_record.M.values[0]),
                 efC=int(min_record.efC.values[0]),
                 post=1,
                 ef=int(min_record.ef.values[0]),
                 threads=4)))

print("\n")
print("Recall and runtime for the original parameters used in GuideMaker")
print("Recall: %f Run time: %f " % (test_func(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)))



Recall and runtime after fine tuning of the parameters
Recall: 0.986509 Run time: 0.084002 


Recall and runtime for the original parameters used in GuideMaker
Recall: 1.000000 Run time: 0.194305 
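
The fine-tuned configuration scored a recall of 0.992130 during the sweep but 0.986509 when re-run above, so a single measurement per configuration can be noisy. One way to reduce that noise, sketched below, is to repeat each configuration and summarize it with the worst recall and the median time. `repeat_eval` is a hypothetical helper, and `noisy_eval` is a made-up stand-in for `test_func` so the example runs without an index.

```python
import random
import statistics

_rng = random.Random(0)  # fixed seed so the made-up jitter is reproducible


def repeat_eval(evaluate, repeats=5, **params):
    """Run evaluate(**params) several times; return (min recall, median time)."""
    recalls, times = [], []
    for _ in range(repeats):
        rc, tt = evaluate(**params)
        recalls.append(rc)
        times.append(tt)
    return min(recalls), statistics.median(times)


def noisy_eval(M, efC, ef):
    """Made-up evaluator whose recall and time jitter from run to run."""
    return 0.99 + _rng.uniform(-0.005, 0.005), 0.08 + _rng.uniform(0.0, 0.02)


worst_recall, median_time = repeat_eval(noisy_eval, repeats=5, M=16, efC=10, ef=9)
print(f"worst recall={worst_recall:.4f} median time={median_time:.4f}")
```

Filtering on the worst observed recall instead of a single run makes it less likely that a configuration passes the 0.99 threshold by chance.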


In [107]:
print("Optimized parameters: \n")
print("M:", int(min_record.M.values[0]), 
      "efC:", int(min_record.efC.values[0]), 
      "ef:", int(min_record.ef.values[0]),
     "post:", 1)

Optimized parameters: 

M: 16 efC: 10 ef: 9 post: 1
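
For reuse, the tuned values map onto the two parameter dictionaries that `test_func` passes to `createIndex` and `setQueryTimeParams`. The sketch below only builds those dictionaries: `make_hnsw_params` is a hypothetical helper name, the keys are the ones already used in this notebook, and the `ef >= k` check reflects the constraint stated in the parameter notes at the top.

```python
def make_hnsw_params(M, efC, ef, post=1, threads=4, k=3):
    """Return (index_params, query_params) in the shape used by test_func."""
    if ef < k:
        raise ValueError(f"ef ({ef}) cannot be lower than the number of neighbors k ({k})")
    index_params = {"M": M, "indexThreadQty": threads, "efConstruction": efC, "post": post}
    query_params = {"efSearch": ef}
    return index_params, query_params


index_params, query_params = make_hnsw_params(M=16, efC=10, ef=9)
print(index_params)
print(query_params)
# These would then be passed as index.createIndex(index_params) and
# index.setQueryTimeParams(query_params), as in test_func above.
```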
