## Purpose

GuideMaker enables users to design guide RNA targets for entire genomes using any PAM and any genome. The most computationally costly step compares the Hamming distance of every potential guide RNA target in the genome to every other target. For a typical bacterial genome and Cas9 (protospacer adjacent motif, PAM, site NGG) this could require (10^6 * (10^6 - 1))/2 ≈ 5 × 10^11 comparisons. To avoid that number of comparisons, we perform approximate nearest neighbor search using Hierarchical Navigable Small World (HNSW) graphs in the NMSLIB package. This is much faster, but it requires constructing an index and selecting index and search parameters that balance index build speed, search speed, and recall.
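
The comparison count follows from the number of unordered pairs among n targets, n(n-1)/2. A minimal sketch of that arithmetic, assuming n = 10^6 targets as in the text (the function name `pairwise_comparisons` is illustrative and not part of GuideMaker):

```python
def pairwise_comparisons(n_targets: int) -> int:
    """Number of unordered pairs among n_targets items: n * (n - 1) / 2."""
    return n_targets * (n_targets - 1) // 2


# Assumed target count for a typical bacterial genome with an NGG PAM
n_targets = 10**6
total = pairwise_comparisons(n_targets)
print(f"{total:.2e} pairwise comparisons")
```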


### Parameter optimization for NMSLIB

https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md


- M - the number of bi-directional links created for every new element during construction.

    - M is tightly connected with the internal dimensionality of the data and strongly affects memory consumption (~M).

    - Higher M leads to higher accuracy/run_time at fixed ef/efConstruction.

    - **Reasonable range for M is 2-100.**

    - Higher M works better on datasets with high intrinsic dimensionality and/or high recall, while low M works better for datasets with low intrinsic dimensionality and/or low recall.

    - The parameter also determines the algorithm's memory consumption, which is roughly M * 8-10 bytes per stored element.

    - As an example, for dim=4 random vectors the optimal M for search is somewhere around 6, while high dimensional datasets (word embeddings, good face descriptors) require higher M (e.g. M=48-64) for optimal performance at high recall.

    - **The range M=12-48 is ok for most use cases.**
        - **We are keeping this at 16.**

    - When M is changed, the other parameters have to be updated. Nonetheless, the ef and ef_construction parameters can be roughly estimated by assuming that M * ef_construction is a constant.


- ef - the size of the dynamic list for the nearest neighbors (used during the search). Higher ef leads to more accurate but slower search. 

    - **ef cannot be set lower than the number of queried nearest neighbors k.**
    - The value of ef can be anything between k and the size of the dataset.


- ef_construction - the parameter has the same meaning as ef, but controls the index_time/index_accuracy.

    - ef_construction controls the tradeoff between index quality (and hence search accuracy) and build speed.
    - Bigger ef_construction leads to longer construction, but better index quality.
    - At some point, increasing ef_construction does not improve the quality of the index.
    - One way to check whether the selection of ef_construction was ok is to measure recall for an M nearest neighbor search when ef = ef_construction: if the recall is lower than 0.9, then there is room for improvement.

https://github.com/nmslib/nmslib/blob/master/manual/methods.md
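
The rules of thumb above can be turned into quick back-of-the-envelope checks. The sketch below is only illustrative: the helper names are hypothetical, the 8-10 bytes per link figure is taken from the parameter notes above, and the rescaling simply assumes that M * ef_construction stays constant.

```python
def link_memory_bytes(n_elements: int, M: int) -> tuple:
    """Rough (low, high) memory estimate, using M * 8-10 bytes per stored element."""
    return n_elements * M * 8, n_elements * M * 10


def rescale_ef_construction(old_M: int, old_efC: int, new_M: int) -> int:
    """Estimate ef_construction for a new M, assuming M * ef_construction is constant."""
    return max(1, round(old_M * old_efC / new_M))


def valid_ef(ef: int, k: int, n_elements: int) -> bool:
    """ef must lie between k (the number of neighbors queried) and the dataset size."""
    return k <= ef <= n_elements


low, high = link_memory_bytes(n_elements=10**6, M=16)
print(f"link memory: {low / 1e6:.0f}-{high / 1e6:.0f} MB")
print("efC for M=32:", rescale_ef_construction(old_M=16, old_efC=64, new_M=32))
print("ef=2 valid for k=3:", valid_ef(ef=2, k=3, n_elements=10**6))
```

These estimates are starting points for a search, not replacements for measuring recall and run time on the actual data.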



In [55]:
import sys
import time
import math

import os
from Bio.Seq import Seq
from Bio import SeqIO


import nmslib
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
import pandas as pd

import guidemaker



In [56]:
# Build the brute-force index used as ground truth
pamobj = guidemaker.core.PamTarget("NGG", "5prime")
gb = SeqIO.parse("test_data/Carsonella_ruddii.fasta", "fasta")
targets = pamobj.find_targets(seq_record_iter=gb, target_len=20)
tl = guidemaker.core.TargetProcessor(targets=targets, lu=10, hammingdist=2, knum=10)
tl.find_unique_near_pam()
# Remove duplicate targets, then one-hot encode them for the bit_hamming space
notduplicated_targets = list(set(tl.targets['target'].tolist()))
bintargets = tl._one_hot_encode(notduplicated_targets)

# method='brute_force' performs an exhaustive search, so its results are exact
index = nmslib.init(space='bit_hamming',
                    dtype=nmslib.DistType.INT,
                    data_type=nmslib.DataType.OBJECT_AS_STRING,
                    method='brute_force')

index.addDataPointBatch(bintargets)

index.createIndex(print_progress=True)
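
Searching encoded targets in a bit-wise Hamming space works because a one-hot encoding turns every nucleotide mismatch into a fixed number of differing bits. The sketch below illustrates that general idea only; it is not GuideMaker's `_one_hot_encode` implementation. The 4-bit code per base, the space-separated bit-string format, and all helper names are assumptions made for illustration.

```python
# Hypothetical 4-bit one-hot code per nucleotide (an assumption, not GuideMaker's table)
ONE_HOT = {"A": "1 0 0 0", "C": "0 1 0 0", "G": "0 0 1 0", "T": "0 0 0 1"}


def one_hot_encode(seq: str) -> str:
    """Encode a DNA string as space-separated bits, four bits per base."""
    return " ".join(ONE_HOT[base] for base in seq.upper())


def nucleotide_hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def bit_hamming(a: str, b: str) -> int:
    """Number of differing bits between two space-separated bit strings."""
    return sum(x != y for x, y in zip(a.split(), b.split()))


t1 = "ACGTACGTACGTACGTACGT"
t2 = "ACGTACGTACGTACGTACAA"  # differs from t1 at the last two positions

print("nucleotide distance:", nucleotide_hamming(t1, t2))
print("bit distance:", bit_hamming(one_hot_encode(t1), one_hot_encode(t2)))
```

Under this assumed encoding each mismatch flips exactly two bits, so a bit-level distance is twice the nucleotide-level Hamming distance.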



In [57]:
# Computing gold-standard 
start = time.time()
truth_list = index.knnQueryBatch(bintargets, k=3, num_threads = 4)
end = time.time()
print('brute-force kNN time total=%f (sec), per query=%f (sec)' % 
      (end-start, float(end-start)/len(bintargets)) )


brute-force kNN time total=0.199856 (sec), per query=0.000052 (sec)


In [58]:
def recall(results, truth):
    """Calculate recall for the top two kNN distances.

    Recall is calculated on the top two distances rather than on labels, because we
    care that the algorithm estimates the correct distance, not the exact identity of
    the neighbor: there can be multiple neighbors at the same distance.
    """
    assert len(results) == len(truth)
    tot = len(results)
    correct = 0
    for res, tr in zip(results, truth):
        # res[1] and tr[1] are distance arrays; compare their first two elements
        # and use all() to reduce the element-wise comparison to a single boolean
        if all(res[1][0:2] == tr[1][0:2]):
            correct += 1
    return correct / tot
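
To make the distance-based definition of recall concrete, here is a small self-contained sketch on toy data. It assumes each query result is an `(ids, distances)` pair, which is the shape the `recall` function above indexes into; the name `recall_top2` and the toy values are illustrative only.

```python
def recall_top2(results, truth):
    """Fraction of queries whose two smallest distances match the ground truth."""
    assert len(results) == len(truth)
    correct = sum(
        list(res[1][:2]) == list(tr[1][:2]) for res, tr in zip(results, truth)
    )
    return correct / len(results)


# Toy (ids, distances) pairs for two queries
truth = [
    ([0, 5, 9], [0, 2, 4]),
    ([1, 7, 3], [0, 2, 2]),
]
approx = [
    ([0, 6, 9], [0, 2, 4]),  # different neighbor id, same distances: counted as correct
    ([1, 3, 8], [0, 4, 4]),  # second distance differs: counted as a miss
]

print(recall_top2(approx, truth))
```

The first query shows why distances are compared instead of labels: a different neighbor at the same distance is still an acceptable answer.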

In [59]:
def KNN_NMSLIB(truth, bintargets, M, efC, post, ef, delaunay_type=2, threads=4):
    """Calculate approximate KNN and compare with ground truth to get recall values.

    Returns (recall, elapsed seconds). The elapsed time covers both index
    construction and querying. delaunay_type is accepted but currently unused.
    """
    start = time.time()
    index_params = {'M': M, 'indexThreadQty': threads, 'efConstruction': efC, 'post': post}
    index = nmslib.init(space='bit_hamming',
                    dtype=nmslib.DistType.INT,
                    data_type=nmslib.DataType.OBJECT_AS_STRING,
                    method='hnsw')
    index.addDataPointBatch(bintargets)
    index.createIndex(index_params)
    index.setQueryTimeParams({'efSearch': ef})
    # Use the threads argument for querying as well, instead of a hard-coded 4
    results_list = index.knnQueryBatch(bintargets, k=3, num_threads=threads)
    end = time.time()
    rc = recall(results_list, truth)
    return rc, float(end-start)


In [60]:
# Hyperparameter values initially used in GuideMaker
KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)

(1.0, 0.6005792617797852)

In [61]:
def simulateNmslibParameter(increase_by: int=10):
    """Grid-search M, ef, efC and post with KNN_NMSLIB.

    Only runs with recall above 0.99 are recorded in the returned dict.
    """
    M_range = range(2, 101, increase_by)
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post_range = range(1, 101, increase_by)
    print("Total number of combinations: ", (len(M_range) * len(ef_range) * len(efC_range) * len(post_range)))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    for m in M_range:
        for ef in ef_range:
            for efC in efC_range:
                for post in post_range:
                    rc, tt = KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
                    n +=1
                    if rc > 0.9900000000000000:
                        pdict['run'].append(n)
                        pdict['M'].append(m)
                        pdict['ef'].append(ef)
                        pdict['efC'].append(efC)
                        pdict['post'].append(post)
                        pdict['accuracy'].append(rc)
                        pdict['time'].append(tt)
                        if n % 50 == 0:
                            print(n, m, ef, efC, post, rc, tt)
    return pdict

In [62]:
parameterBy50 = simulateNmslibParameter(increase_by=50)

Total number of combinations:  144
100 52 103 53 1 1.0 0.565349817276001
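
The combination count printed above can be checked without running any searches. The sketch below rebuilds the same four ranges used in `simulateNmslibParameter` and flattens the nested loops with `itertools.product`; the helper name `parameter_grid` is hypothetical.

```python
from itertools import product


def parameter_grid(increase_by: int = 10):
    """Return every (M, ef, efC, post) combination, in the same order as the nested loops."""
    M_range = range(2, 101, increase_by)
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post_range = range(1, 101, increase_by)
    return list(product(M_range, ef_range, efC_range, post_range))


grid = parameter_grid(increase_by=50)
print("Total number of combinations:", len(grid))
print("first:", grid[0], "last:", grid[-1])
```

With a step of 50 only two values of M (2 and 52) and two values of post (1 and 51) are sampled, so this pass is a coarse scan, not a fine search.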


In [63]:
dfBy50 = pd.DataFrame.from_dict(parameterBy50)
dfBy50.sort_values(['time'], ascending=[True])

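| index | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 2 | 79 | 52 | 3 | 103 | 51 | 0.995543 | 0.317803 |
| 0 | 76 | 52 | 3 | 53 | 1 | 0.999476 | 0.332607 |
| 4 | 81 | 52 | 3 | 153 | 51 | 0.997378 | 0.369949 |
| 10 | 89 | 52 | 53 | 53 | 51 | 0.999213 | 0.400459 |
| 49 | 134 | 52 | 253 | 3 | 1 | 0.994232 | 0.425375 |
| 6 | 83 | 52 | 3 | 203 | 51 | 0.997902 | 0.442574 |
| 9 | 88 | 52 | 53 | 53 | 1 | 1.0 | 0.464898 |
| 14 | 93 | 52 | 53 | 153 | 51 | 1.0 | 0.500976 |
| 8 | 85 | 52 | 3 | 253 | 51 | 0.99764 | 0.503069 |
| 22 | 103 | 52 | 103 | 103 | 51 | 0.999476 | 0.512848 |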


In [64]:
# Now restrict M to a smaller set of values and fix post to 1
def simulateNmslibParameter_ef_efc(increase_by: int=10):
    """Grid-search ef and efC with KNN_NMSLIB for a few values of M, with post fixed to 1.
    """
    M = set([8, 16, 24, 32])
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range) *len(M))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    for m in M:
        for ef in ef_range:
            for efC in efC_range:
                rc, tt = KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
                n +=1
                if rc > 0.9900000000000000:
                    pdict['run'].append(n)
                    pdict['M'].append(m)
                    pdict['ef'].append(ef)
                    pdict['efC'].append(efC)
                    pdict['post'].append(post)
                    pdict['accuracy'].append(rc)
                    pdict['time'].append(tt)
                    if n % 50 == 0:
                            print(n, m, ef, efC, post, rc, tt)
    return pdict

In [65]:
efefc50 = simulateNmslibParameter_ef_efc(increase_by=50)

Total number of combinations:  144
100 32 203 103 1 1.0 0.8907163143157959


In [73]:
df_efefc50 = pd.DataFrame.from_dict(efefc50)
df_efefc50.sort_values(['time'], ascending=[True])

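| index | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 0 | 9 | 8 | 53 | 53 | 1 | 0.999476 | 0.303561 |
| 25 | 39 | 16 | 3 | 53 | 1 | 0.996329 | 0.336757 |
| 87 | 111 | 24 | 3 | 53 | 1 | 0.998165 | 0.357079 |
| 50 | 68 | 16 | 253 | 3 | 1 | 0.990561 | 0.375031 |
| 112 | 140 | 24 | 253 | 3 | 1 | 0.992396 | 0.378774 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 63 | 83 | 32 | 53 | 153 | 1 | 1.000000 | 1.464534 |
| 44 | 61 | 16 | 153 | 253 | 1 | 1.000000 | 1.557293 |
| 117 | 145 | 24 | 253 | 253 | 1 | 1.000000 | 1.563270 |
| 42 | 59 | 16 | 153 | 153 | 1 | 1.000000 | 1.714815 |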


In [74]:
# Now fix M to 16 and post to 1
def NarrowSimulateNmslibParameter_ef_efc(increase_by: int=10):
    """Grid-search ef and efC with KNN_NMSLIB at a finer step, with M and post fixed.
    """
    M = 16 # fix this to 16 ~ memory used
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    m=M
    for ef in ef_range:
        for efC in efC_range:
            rc, tt = KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
            n +=1
            if rc > 0.9900000000000000:
                pdict['run'].append(n)
                pdict['M'].append(m)
                pdict['ef'].append(ef)
                pdict['efC'].append(efC)
                pdict['post'].append(post)
                pdict['accuracy'].append(rc)
                pdict['time'].append(tt)
                if n % 100 == 0: # print progress roughly every 100 iterations
                        print(n, m, ef, efC, post, rc, tt)
    return pdict

In [75]:
small_efefc10 = NarrowSimulateNmslibParameter_ef_efc(increase_by=10)


Total number of combinations:  676
100 16 33 203 1 1.0 0.9061942100524902
200 16 73 163 1 1.0 0.736893892288208
300 16 113 123 1 1.0 0.7345798015594482
400 16 153 83 1 1.0 0.6245269775390625
500 16 193 43 1 1.0 0.6467671394348145



In [82]:
df_small_efefc510 = pd.DataFrame.from_dict(small_efefc10)
df_small_efefc510.sort_values(['time'], ascending=[True])

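| index | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 23 | 29 | 16 | 13 | 13 | 1 | 0.992396 | 0.196054 |
| 24 | 30 | 16 | 13 | 23 | 1 | 0.996067 | 0.209074 |
| 0 | 5 | 16 | 3 | 33 | 1 | 0.994232 | 0.214509 |
| 1 | 6 | 16 | 3 | 43 | 1 | 0.995018 | 0.237045 |
| 73 | 81 | 16 | 33 | 13 | 1 | 0.995281 | 0.262565 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 270 | 285 | 16 | 103 | 233 | 1 | 1.000000 | 1.452703 |
| 521 | 546 | 16 | 203 | 243 | 1 | 1.000000 | 1.465650 |
| 635 | 664 | 16 | 253 | 123 | 1 | 1.000000 | 1.540205 |
| 271 | 286 | 16 | 103 | 243 | 1 | 1.000000 | 2.003754 |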
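
Sorting by time alone picks the fastest run that clears the 0.99 filter, but a stricter recall target may be preferable. The sketch below shows one way to pick the fastest configuration at or above a chosen recall, using the five fastest rows from the table above; the helper `fastest_above` and the 0.995 threshold are illustrative assumptions.

```python
# The five fastest rows from the table above (M=16, post=1 in all of them)
rows = [
    {"ef": 13, "efC": 13, "accuracy": 0.992396, "time": 0.196054},
    {"ef": 13, "efC": 23, "accuracy": 0.996067, "time": 0.209074},
    {"ef": 3, "efC": 33, "accuracy": 0.994232, "time": 0.214509},
    {"ef": 3, "efC": 43, "accuracy": 0.995018, "time": 0.237045},
    {"ef": 33, "efC": 13, "accuracy": 0.995281, "time": 0.262565},
]


def fastest_above(rows, min_recall):
    """Return the fastest row whose recall is at least min_recall, or None if there is none."""
    candidates = [r for r in rows if r["accuracy"] >= min_recall]
    return min(candidates, key=lambda r: r["time"]) if candidates else None


best = fastest_above(rows, min_recall=0.995)
print(best)
```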


In [83]:
# Parameters originally used in GuideMaker

print("Recall and Runtime for original parameters used in GuideMaker")
print("Recall: %f Run time: %f "% (KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)))


Recall and Runtime for original parameters used in GuideMaker
Recall: 1.000000 Run time: 0.632483 


In [84]:
# After parameter tuning
print("Recall and Runtime after tuning parameters")
print("Recall: %f Run time: %f "% (KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=16, efC=13, post=1, ef=13, threads=4)))

Recall and Runtime after tuning parameters
Recall: 0.991610 Run time: 0.213255 
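
The tradeoff between the two runs above can be summarized as a speedup and a recall loss. A minimal sketch using the numbers printed in the two preceding cells (the variable names are illustrative):

```python
original = {"recall": 1.000000, "time": 0.632483}  # M=16, efC=64, ef=256
tuned = {"recall": 0.991610, "time": 0.213255}     # M=16, efC=13, ef=13

speedup = original["time"] / tuned["time"]
recall_loss = original["recall"] - tuned["recall"]
print(f"speedup: {speedup:.2f}x, recall loss: {recall_loss:.4f}")
```

Both timings come from single runs on a small test genome, so the ratio is indicative and should not be read as a precise figure.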


#### Since the run time was low around efC=13 and ef=13, we can do a finer optimization around those values.
- Change the range of ef to range(3, 20)
- Change the range of efC to range(3, 20)
- Use a smaller step: increase_by defaults to 2, and the run below uses increase_by=1

In [85]:
def FinetuneNmslibParameter(increase_by: int=2):
    """Grid-search ef and efC in range(3, 20) with M fixed to 16 and post to 1."""
    M = 16 # fix this to 16 ~ memory used
    ef_range = range(3, 20, increase_by)
    efC_range = range(3, 20, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    m=M
    for ef in ef_range:
        for efC in efC_range:
            rc, tt = KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
            n +=1
            if rc > 0.9900000000000000:
                pdict['run'].append(n)
                pdict['M'].append(m)
                pdict['ef'].append(ef)
                pdict['efC'].append(efC)
                pdict['post'].append(post)
                pdict['accuracy'].append(rc)
                pdict['time'].append(tt)
                if n % 40 == 0: # print progress roughly every 40 iterations
                        print(n, m, ef, efC, post, rc, tt)
    return pdict

In [None]:
finetune = FinetuneNmslibParameter(increase_by=1)

Total number of combinations:  289


In [None]:
df_finetune = pd.DataFrame.from_dict(finetune)
df_finetune.sort_values(['time'], ascending=[True])

In [18]:
min_record = df_finetune[df_finetune.time == df_finetune.time.min()]
min_record

| index | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 4 | 33 | 16 | 4 | 17 | 1 | 0.991568 | 0.072631 |


In [19]:
# After parameter tuning
print("Recall and Runtime after fine tuning of parameters")
print("Recall: %f Run time: %f "% 
      (KNN_NMSLIB(truth=truth_list,
                 bintargets=bintargets,
                 M=int(min_record.M.values[0]),
                 efC=int(min_record.efC.values[0]),
                 post=1,
                 ef=int(min_record.ef.values[0]),
                 threads=4)))

print("\n")
print("Recall and Runtime for original parameters used in GuideMaker")
# KNN_NMSLIB replaces the undefined name test_func
print("Recall: %f Run time: %f "% (KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)))



Recall and Runtime after fine tuning of parameters
Recall: 0.987634 Run time: 0.104442 


Recall and Runtime for original parameters used in GuideMaker
Recall: 1.000000 Run time: 0.215113 
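
The run time reported for the original parameters differs between cells (0.632483 s earlier versus 0.215113 s here), which suggests that single wall-clock measurements are noisy. One hedged way to get steadier numbers is to repeat each measurement and report the median. `median_runtime` below is a hypothetical helper, and the workload is a stand-in for a call to `KNN_NMSLIB`.

```python
import statistics
import time


def median_runtime(func, repeats: int = 5):
    """Call func() `repeats` times; return (last result, median wall-clock seconds)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        timings.append(time.perf_counter() - start)
    return result, statistics.median(timings)


# Stand-in workload; in the notebook this would be a lambda wrapping KNN_NMSLIB(...)
result, seconds = median_runtime(lambda: sum(range(100_000)), repeats=5)
print(result, f"{seconds:.6f} s")
```

Since HNSW index construction can itself vary from run to run, recall may also shift slightly between repeats, as the 0.991568 versus 0.987634 values for the same parameters indicate.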


In [20]:
print("Optimized parameters: \n")
print("M:", int(min_record.M.values[0]), 
      "efC:", int(min_record.efC.values[0]), 
      "ef:", int(min_record.ef.values[0]),
     "post:", 1)

Optimized parameters: 

M: 16 efC: 17 ef: 4 post: 1
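
For reuse, the selected values can be packaged in the same two dictionaries that `KNN_NMSLIB` passes to `createIndex` and `setQueryTimeParams`. The helper below is a hypothetical convenience wrapper, not part of GuideMaker or NMSLIB; it only builds the dictionaries and checks that ef is not lower than k.

```python
def hnsw_params(M: int, efC: int, ef: int, post: int = 1, threads: int = 4, k: int = 3):
    """Build the index-time and query-time parameter dicts used in KNN_NMSLIB."""
    if ef < k:
        raise ValueError(f"ef ({ef}) cannot be lower than the number of neighbors k ({k})")
    index_params = {"M": M, "indexThreadQty": threads, "efConstruction": efC, "post": post}
    query_params = {"efSearch": ef}
    return index_params, query_params


# Values taken from the optimized parameters printed above
index_params, query_params = hnsw_params(M=16, efC=17, ef=4)
print(index_params)
print(query_params)
```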
