### Parameter optimization for NMSLIB

https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md


- M - the number of bi-directional links created for every new element during construction. 

    - M is tightly connected with the internal dimensionality of the data and strongly affects memory consumption (roughly proportional to M).
    
    - Higher M leads to higher accuracy and longer run time at fixed ef/ef_construction.
    
    - **Reasonable range for M is 2-100.**
    
    - Higher M works better on datasets with high intrinsic dimensionality and/or when high recall is required, while lower M works better for datasets with low intrinsic dimensionality and/or when low recall is acceptable.
    
    - The parameter also determines the algorithm's memory consumption, which is roughly M * 8-10 bytes per stored element.
    
    - As an example, for dim=4 random vectors the optimal M for search is somewhere around 6, while for high-dimensional datasets (word embeddings, good face descriptors) higher M is required (e.g. M=48-64) for optimal performance at high recall.
    
    - **The range M=12-48 is OK for most use cases.**
        - **We are keeping this at 16.**
    
    - When M is changed, the other parameters have to be updated. Nonetheless, the ef and ef_construction parameters can be roughly estimated by assuming that M * ef_construction is a constant.


- ef - the size of the dynamic list for the nearest neighbors (used during the search). Higher ef leads to more accurate but slower search. 

    - **ef cannot be set lower than the number of queried nearest neighbors k.**
    - The value of ef can be anything between k and the size of the dataset.


- ef_construction - the parameter has the same meaning as ef, but controls the index_time/index_accuracy trade-off.

    - ef_construction controls the trade-off between index search speed and build speed.
    - Bigger ef_construction leads to longer construction, but better index quality. 
    - At some point, increasing ef_construction does not improve the quality of the index. 
    - One way to check whether the selection of ef_construction was OK is to measure recall for M nearest neighbor search when ef = ef_construction: if the recall is lower than 0.9, then there is room for improvement.
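
The rules of thumb above can be turned into a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the 8-10 bytes per link figure is taken from the notes above, and the "M * ef_construction is constant" relation is a rough heuristic rather than an exact law.

```python
def link_memory_bytes(n_elements, M, bytes_per_link=(8, 10)):
    """Rough (low, high) estimate of link memory: ~M * 8-10 bytes per element."""
    low, high = bytes_per_link
    return n_elements * M * low, n_elements * M * high


def rescale_ef_construction(M_old, efC_old, M_new):
    """Keep M * ef_construction roughly constant when M changes."""
    return max(1, round(M_old * efC_old / M_new))


def clamp_ef(ef, k, n_elements):
    """ef has to lie between k and the size of the dataset."""
    return max(k, min(ef, n_elements))


low, high = link_memory_bytes(n_elements=1_000_000, M=16)
print(f"link memory: {low / 1e6:.0f}-{high / 1e6:.0f} MB")
print(rescale_ef_construction(M_old=16, efC_old=64, M_new=32))
print(clamp_ef(ef=2, k=3, n_elements=1000))
```

Doubling M from 16 to 32 roughly doubles the link memory, and under the heuristic it halves the ef_construction needed as a starting point; the result should still be checked against measured recall.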

https://github.com/nmslib/nmslib/blob/master/manual/methods.md


In [1]:
import sys
import time
import math

from Bio import SeqIO
import nmslib
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
import pandas as pd

import guidemaker



In [2]:
# Build the brute-force index used as ground truth
pamobj = guidemaker.core.Pam("NGG", "5prime")
gb = SeqIO.parse("test_data/Carsonella_ruddii.fasta", "fasta")
pamtargets = pamobj.find_targets(seq_record_iter=gb, strand="forward", target_len=20)
tl = guidemaker.core.TargetList(targets=pamtargets, lcp=10, hammingdist=2, knum=2)
tl.find_unique_near_pam()
bintargets = tl._one_hot_encode(tl.targets)

index = nmslib.init(space='bit_hamming',
                    dtype=nmslib.DistType.INT,
                    data_type=nmslib.DataType.OBJECT_AS_STRING,
                    method='brute_force')

index.addDataPointBatch(bintargets)

index.createIndex(print_progress=True)



In [3]:
# Computing the gold standard with brute-force search
start = time.time()
truth_list = index.knnQueryBatch(bintargets, k=3, num_threads=4)
end = time.time()
print('brute-force kNN time total=%f (sec), per query=%f (sec)' %
      (end - start, (end - start) / len(bintargets)))


brute-force kNN time total=0.039777 (sec), per query=0.000022 (sec)


In [4]:
def recall(results, truth):
    """Calculate recall on the top two kNN distances.

    Recall is calculated on the top 2 distances rather than on neighbor labels,
    because what matters is that the algorithm estimates the correct distance,
    not the exact identity of the neighbor: several neighbors can share the
    same distance.
    """
    assert len(results) == len(truth)
    tot = len(results)
    correct = 0
    for res, tr in zip(results, truth):
        # res[1] and tr[1] are distance arrays; compare the first two entries
        # element-wise and require all of them to match.
        if all(res[1][0:2] == tr[1][0:2]):
            correct += 1
    return correct / tot
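
To make the distance-based definition concrete, here is a small self-contained sketch on toy data. It uses plain lists instead of the arrays returned by `knnQueryBatch`, and the name `distance_recall` and the sample values are made up for illustration.

```python
def distance_recall(results, truth, top=2):
    """Fraction of queries whose first `top` distances match the ground truth.

    Each entry is an (ids, distances) pair, mirroring the shape of one
    knnQueryBatch result. Neighbor ids are deliberately ignored.
    """
    if len(results) != len(truth):
        raise ValueError("results and truth must have the same length")
    correct = sum(
        list(res[1][:top]) == list(tr[1][:top])
        for res, tr in zip(results, truth)
    )
    return correct / len(results)


truth = [([0, 5, 9], [0, 2, 4]), ([1, 7, 3], [0, 1, 1])]
approx = [
    ([0, 6, 9], [0, 2, 4]),  # different neighbor id, same distances: counts as correct
    ([1, 3, 8], [0, 2, 2]),  # second distance is wrong: counts as a miss
]
print(distance_recall(approx, truth))
```

The first query is counted as correct even though one neighbor id differs, which is the behavior the docstring above argues for.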

In [5]:
def test_func(truth, bintargets, M, efC, post, ef, delaunay_type=2, threads=4):
    """Build an HNSW index, query it, and return (recall, elapsed seconds).

    The elapsed time covers both index construction and querying.
    Note: delaunay_type is accepted but not passed on to the index.
    """
    start = time.time()
    index_params = {'M': M, 'indexThreadQty': threads, 'efConstruction': efC, 'post': post}
    index = nmslib.init(space='bit_hamming',
                        dtype=nmslib.DistType.INT,
                        data_type=nmslib.DataType.OBJECT_AS_STRING,
                        method='hnsw')
    index.addDataPointBatch(bintargets)
    index.createIndex(index_params)
    index.setQueryTimeParams({'efSearch': ef})
    # Use the threads argument instead of a hard-coded thread count
    results_list = index.knnQueryBatch(bintargets, k=3, num_threads=threads)
    end = time.time()
    rc = recall(results_list, truth)
    return rc, float(end - start)
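
`test_func` reports a single wall-clock figure that covers both index construction and querying. If the two phases need to be compared separately, a small timing helper is one option. The sketch below is generic and does not call nmslib; `timed` and the placeholder `build`/`query` functions are hypothetical stand-ins.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Placeholder stand-ins for index.createIndex(...) and index.knnQueryBatch(...)
def build(data):
    return sorted(data)


def query(index, k):
    return index[:k]


index, build_time = timed(build, [5, 3, 1, 4])
hits, query_time = timed(query, index, k=3)
print(hits, f"build={build_time:.6f}s query={query_time:.6f}s")
```

Separating the two timings would show whether a given parameter change mostly affects build cost (M, efConstruction) or query cost (efSearch).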


In [6]:
# used in guidemaker
test_func(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)

(1.0, 0.23450803756713867)

In [7]:
def sim_nsmlib_para(increase_by: int=10):
    """Grid-search M, ef, efC and post; keep only runs with recall above 0.99."""
    M_range = range(2, 101, increase_by)
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post_range = range(1, 101, increase_by)
    print("Total number of combinations: ", (len(M_range) * len(ef_range) * len(efC_range) * len(post_range)))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    for m in M_range:
        for ef in ef_range:
            for efC in efC_range:
                for post in post_range:
                    rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
                    n +=1
                    if rc > 0.9900000000000000:
                        pdict['run'].append(n)
                        pdict['M'].append(m)
                        pdict['ef'].append(ef)
                        pdict['efC'].append(efC)
                        pdict['post'].append(post)
                        pdict['accuracy'].append(rc)
                        pdict['time'].append(tt)
                        # Progress report every 50 runs (only reached for runs that passed the recall filter)
                        if n % 50 == 0:
                            print(n, m, ef, efC, post, rc, tt)
    return pdict
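
The nested loops above can also be written with `itertools.product`, which keeps the search logic in one place when the set of tuned parameters changes. The following is a sketch under the assumption that the evaluation step is wrapped in a callable returning `(recall, seconds)`; `grid_search` and `fake_evaluate` are hypothetical names, and `fake_evaluate` is a toy stand-in for `test_func`, not a model of real HNSW behavior.

```python
from itertools import product


def grid_search(evaluate, grid, min_recall=0.99):
    """Evaluate every parameter combination and keep those above min_recall."""
    names = list(grid)
    kept = []
    combos = product(*(grid[name] for name in names))
    for run, values in enumerate(combos, start=1):
        params = dict(zip(names, values))
        rc, seconds = evaluate(**params)
        if rc > min_recall:
            kept.append({"run": run, **params, "accuracy": rc, "time": seconds})
    return kept


def fake_evaluate(M, ef, efC, post):
    """Toy stand-in: recall and time both grow with ef and efC."""
    rc = min(1.0, 0.9 + (ef + efC) / 1000)
    return rc, 0.001 * (M + ef + efC + post)


grid = {
    "M": range(2, 101, 50),
    "ef": range(3, 257, 50),
    "efC": range(3, 257, 50),
    "post": range(1, 101, 50),
}
rows = grid_search(fake_evaluate, grid)
fastest = min(rows, key=lambda row: row["time"])
print(len(rows), fastest)
```

The list of dictionaries returned here can be passed directly to `pd.DataFrame` for sorting, in the same way the `pdict` result is used in the cells that follow.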

In [8]:
aa = sim_nsmlib_para(increase_by=50)


Total number of combinations:  144
100 52 103 53 1 1.0 0.1871938705444336


In [9]:
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])

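| | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 2 | 79 | 52 | 3 | 103 | 51 | 0.995503 | 0.090103 |
| 10 | 89 | 52 | 53 | 53 | 51 | 0.996627 | 0.098415 |
| 29 | 110 | 52 | 153 | 3 | 1 | 0.993255 | 0.099304 |
| 4 | 81 | 52 | 3 | 153 | 51 | 0.998314 | 0.110803 |
| 40 | 122 | 52 | 203 | 3 | 1 | 0.997189 | 0.113018 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 38 | 120 | 52 | 153 | 253 | 1 | 1.000000 | 0.383020 |
| 56 | 140 | 52 | 253 | 153 | 1 | 1.000000 | 0.396006 |
| 49 | 132 | 52 | 203 | 253 | 1 | 1.000000 | 0.412261 |
| 58 | 142 | 52 | 253 | 203 | 1 | 1.000000 | 0.419003 |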


In [10]:
def sim_nsmlib_para_optimized(increase_by: int=10):
    M = set([8, 16, 24, 32])
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range) *len(M))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    for m in M:
        for ef in ef_range:
            for efC in efC_range:
                rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
                n +=1
                if rc > 0.9900000000000000:
                    pdict['run'].append(n)
                    pdict['M'].append(m)
                    pdict['ef'].append(ef)
                    pdict['efC'].append(efC)
                    pdict['post'].append(post)
                    pdict['accuracy'].append(rc)
                    pdict['time'].append(tt)
                    if n % 50 == 0:
                        print(n, m, ef, efC, post, rc, tt)
    return pdict

In [11]:
aa = sim_nsmlib_para_optimized(increase_by=50)
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])

Total number of combinations:  144
100 32 203 103 1 1.0 0.29292798042297363


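| | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 103 | 122 | 24 | 103 | 3 | 1 | 0.991006 | 0.091397 |
| 42 | 56 | 16 | 153 | 3 | 1 | 0.996065 | 0.098893 |
| 75 | 92 | 32 | 153 | 3 | 1 | 0.996627 | 0.102565 |
| 48 | 62 | 16 | 203 | 3 | 1 | 0.997752 | 0.113428 |
| 15 | 26 | 8 | 203 | 3 | 1 | 0.993817 | 0.113591 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 91 | 108 | 32 | 253 | 203 | 1 | 1.000000 | 0.411051 |
| 120 | 139 | 24 | 203 | 253 | 1 | 1.000000 | 0.414827 |
| 126 | 145 | 24 | 253 | 253 | 1 | 1.000000 | 0.426082 |
| 86 | 103 | 32 | 203 | 253 | 1 | 1.000000 | 0.427759 |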


In [12]:
def sim_nsmlib_para_optimized2(increase_by: int=10):
    M = 16  # fixed at 16 because M drives memory use
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    m=M
    for ef in ef_range:
        for efC in efC_range:
            rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
            n +=1
            if rc > 0.9900000000000000:
                pdict['run'].append(n)
                pdict['M'].append(m)
                pdict['ef'].append(ef)
                pdict['efC'].append(efC)
                pdict['post'].append(post)
                pdict['accuracy'].append(rc)
                pdict['time'].append(tt)
                if n % 100 == 0:  # print a progress line roughly every 100 iterations
                    print(n, m, ef, efC, post, rc, tt)
    return pdict

In [13]:
aa = sim_nsmlib_para_optimized2(increase_by=10)
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])


Total number of combinations:  676
100 16 33 203 1 1.0 0.2798779010772705
200 16 73 163 1 1.0 0.3524210453033447
300 16 113 123 1 1.0 0.3043479919433594
400 16 153 83 1 1.0 0.2736551761627197
500 16 193 43 1 1.0 0.22202110290527344
600 16 233 3 1 0.9988757729061271 0.132371187210083


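| | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 23 | 29 | 16 | 13 | 13 | 1 | 0.993817 | 0.081747 |
| 24 | 30 | 16 | 13 | 23 | 1 | 0.998876 | 0.089164 |
| 0 | 5 | 16 | 3 | 33 | 1 | 0.996065 | 0.091583 |
| 48 | 55 | 16 | 23 | 13 | 1 | 0.998876 | 0.094752 |
| 49 | 56 | 16 | 23 | 23 | 1 | 0.998876 | 0.099575 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 641 | 659 | 16 | 253 | 73 | 1 | 1.000000 | 0.444239 |
| 193 | 205 | 16 | 73 | 213 | 1 | 1.000000 | 0.458452 |
| 186 | 198 | 16 | 73 | 143 | 1 | 1.000000 | 0.532948 |
| 185 | 197 | 16 | 73 | 133 | 1 | 1.000000 | 0.733624 |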


In [14]:
# Parameters originally used in GuideMaker

print("Recall and runtime for the original parameters used in GuideMaker")
print("Recall: %f Run time: %f " % (test_func(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)))


Recall and runtime for the original parameters used in GuideMaker
Recall: 1.000000 Run time: 0.232781 


In [15]:
# After parameter tuning
print("Recall and runtime after parameter tuning")
print("Recall: %f Run time: %f " % (test_func(truth=truth_list, bintargets=bintargets, M=16, efC=13, post=1, ef=13, threads=4)))

Recall and runtime after parameter tuning
Recall: 0.997752 Run time: 0.087803 


#### Since the run time was low around efC=13 and ef=13, we can do a finer optimization around those values.
- Change the range of ef to range(3, 20)
- Change the range of efC to range(3, 20)
- Set increase_by to 1 (the function below defaults to 2, but it is called with increase_by=1)
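
One way to derive such a finer range programmatically is to centre a window on the best coarse value. This is a sketch only: `refine_range` and its `radius` argument are hypothetical, and the lower bound of 3 is assumed to match the k=3 used in the queries above, since ef cannot be smaller than k.

```python
def refine_range(center, radius, step=1, lower=3):
    """Return a range of about 2 * radius values around center, never below `lower`."""
    start = max(lower, center - radius)
    return range(start, center + radius + 1, step)


ef_range = refine_range(center=13, radius=6)
print(list(ef_range))
```

With a centre of 13 and a radius of 6 this covers 7 to 19, a slightly narrower window than the hand-picked `range(3, 20)`.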

In [16]:
def sim_nsmlib_para_optimized_finer(increase_by: int=2):
    M = 16  # fixed at 16 because M drives memory use
    ef_range = range(3, 20, increase_by)
    efC_range = range(3, 20, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    m=M
    for ef in ef_range:
        for efC in efC_range:
            rc, tt = test_func(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
            n +=1
            if rc > 0.9900000000000000:
                pdict['run'].append(n)
                pdict['M'].append(m)
                pdict['ef'].append(ef)
                pdict['efC'].append(efC)
                pdict['post'].append(post)
                pdict['accuracy'].append(rc)
                pdict['time'].append(tt)
                if n % 40 == 0:  # print a progress line roughly every 40 iterations
                    print(n, m, ef, efC, post, rc, tt)
    return pdict

In [17]:
aa = sim_nsmlib_para_optimized_finer(increase_by=1)
df = pd.DataFrame.from_dict(aa)
df.sort_values(['time'], ascending=[True])

Total number of combinations:  289
80 16 7 13 1 0.9938167509836987 0.08261489868164062
120 16 9 19 1 0.9971894322653176 0.08492398262023926
200 16 14 14 1 0.9983136593591906 0.09398698806762695


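| | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 4 | 33 | 16 | 4 | 17 | 1 | 0.991568 | 0.072631 |
| 5 | 34 | 16 | 4 | 18 | 1 | 0.992130 | 0.073306 |
| 113 | 246 | 16 | 17 | 9 | 1 | 0.992693 | 0.073686 |
| 10 | 50 | 16 | 5 | 17 | 1 | 0.991006 | 0.074906 |
| 1 | 18 | 16 | 3 | 19 | 1 | 0.990444 | 0.075312 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 48 | 132 | 16 | 10 | 14 | 1 | 0.998314 | 0.105870 |
| 97 | 218 | 16 | 15 | 15 | 1 | 0.997189 | 0.109942 |
| 46 | 130 | 16 | 10 | 12 | 1 | 0.996065 | 0.122875 |
| 45 | 129 | 16 | 10 | 11 | 1 | 0.992693 | 0.124207 |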


In [18]:
min_record = df[df.time == df.time.min()]
min_record

| | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 4 | 33 | 16 | 4 | 17 | 1 | 0.991568 | 0.072631 |


In [19]:
# After parameter tuning
print("Recall and runtime after fine-tuning of parameters")
print("Recall: %f Run time: %f " % 
      (test_func(truth=truth_list,
                 bintargets=bintargets,
                 M=int(min_record.M.values[0]),
                 efC=int(min_record.efC.values[0]),
                 post=1,
                 ef=int(min_record.ef.values[0]),
                 threads=4)))

print("\n")
print("Recall and runtime for the original parameters used in GuideMaker")
print("Recall: %f Run time: %f " % (test_func(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)))



Recall and runtime after fine-tuning of parameters
Recall: 0.987634 Run time: 0.104442 


Recall and runtime for the original parameters used in GuideMaker
Recall: 1.000000 Run time: 0.215113 
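
The re-run above lands slightly below the 0.99 threshold and takes longer than the grid-search entry for the same parameters (0.991568 recall, 0.072631 s), which suggests that single measurements vary from run to run. A hedged way to make the comparison more robust is to repeat each configuration and summarize the results. The sketch below assumes an `evaluate` callable returning `(recall, seconds)` like `test_func`; `repeat_measure` and `noisy_evaluate` are hypothetical names, and `noisy_evaluate` only replays fixed, made-up numbers.

```python
import statistics


def repeat_measure(evaluate, repeats=5, **params):
    """Run evaluate several times; report the worst recall and the median time."""
    recalls, times = [], []
    for _ in range(repeats):
        rc, seconds = evaluate(**params)
        recalls.append(rc)
        times.append(seconds)
    return {"min_recall": min(recalls), "median_time": statistics.median(times)}


# Toy stand-in that replays fixed, made-up measurements.
_samples = iter([(0.992, 0.07), (0.988, 0.10), (0.991, 0.08)])


def noisy_evaluate(**params):
    return next(_samples)


summary = repeat_measure(noisy_evaluate, repeats=3, M=16, efC=17, ef=4)
print(summary)
```

Filtering on the worst observed recall instead of a single run would exclude configurations that only pass the threshold by chance.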


In [20]:
print("Optimized parameters: \n")
print("M:", int(min_record.M.values[0]),
      "efC:", int(min_record.efC.values[0]),
      "ef:", int(min_record.ef.values[0]),
      "post:", 1)

Optimized parameters: 

M: 16 efC: 17 ef: 4 post: 1
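
For reuse, the tuned values can be collected into the two dictionaries that are passed to nmslib, mirroring how `test_func` hands them to `createIndex` and `setQueryTimeParams`. This is a sketch: `hnsw_params` is a hypothetical helper, and the values are the ones printed above for this particular dataset, so they may not transfer to other genomes.

```python
def hnsw_params(M, efC, ef, post=1, threads=4):
    """Split tuned values into index-time and query-time parameter dicts."""
    index_params = {"M": M, "indexThreadQty": threads, "efConstruction": efC, "post": post}
    query_params = {"efSearch": ef}
    return index_params, query_params


index_params, query_params = hnsw_params(M=16, efC=17, ef=4)
print(index_params)
print(query_params)
# With an nmslib index object, these would be passed as
# index.createIndex(index_params) and index.setQueryTimeParams(query_params).
```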
