# Purpose

GuideMaker enables users to design guide RNA targets across entire genomes, for any PAM and any genome. The most computationally costly step of GuideMaker compares every potential guide RNA target in the genome to all other targets by Hamming distance. For a typical bacterial genome and SpCas9 (protospacer adjacent motif, PAM, site NGG), with roughly $10^{6}$ candidate targets, this could require $ \frac{10^{6} \cdot (10^{6} -1)}{2} \approx 5 \times 10^{11} $ comparisons. To avoid that number of comparisons, we perform an approximate nearest neighbor search using Hierarchical Navigable Small World (HNSW) graphs in the NMSLIB package. This is much faster, but it requires constructing an index and selecting index and search parameters that balance indexing speed, search speed, and recall.
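
To make the scale concrete, the minimal sketch below counts the pairwise comparisons for a given number of targets and shows the Hamming distance that each comparison computes. The function names `n_pairwise_comparisons` and `hamming_distance` are illustrative and are not part of GuideMaker.

```python
def n_pairwise_comparisons(n_targets: int) -> int:
    """Number of unordered pairs among n_targets items: n * (n - 1) / 2."""
    return n_targets * (n_targets - 1) // 2


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Roughly 10^6 candidate targets, as in the bacterial genome example above
print(n_pairwise_comparisons(10**6))  # 499999500000, about 5e11

# Two 20-nt targets that differ at two positions (index 10 and index 19)
print(hamming_distance("ACGTACGTACGTACGTACGT", "ACGTACGTACCTACGTACGA"))  # 2
```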


## Parameter optimization for NMSLIB

https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md


- M - the number of bi-directional links created for every new element during construction.

    - M is tightly connected to the intrinsic dimensionality of the data and strongly affects memory consumption (roughly proportional to M).

    - Higher M leads to higher accuracy and longer run time at fixed ef/ef_construction.

    - **The reasonable range for M is 2-100.**

    - Higher M works better on datasets with high intrinsic dimensionality and/or high recall requirements, while low M works better for datasets with low intrinsic dimensionality and/or low recall requirements.

    - The parameter also determines the algorithm's memory consumption, which is roughly M * 8-10 bytes per stored element.

    - As an example, for dim=4 random vectors the optimal M for search is somewhere around 6, while for high-dimensional datasets (word embeddings, good face descriptors) a higher M (e.g. M=48-64) is required for optimal performance at high recall.

    - **The range M=12-48 is ok for most use cases.**

    - When M is changed, the other parameters have to be updated. Nonetheless, the ef and ef_construction parameters can be roughly estimated by assuming that M * ef_construction is a constant.


- ef - the size of the dynamic list for the nearest neighbors (used during the search). Higher ef leads to more accurate but slower searches.

    - **ef cannot be set lower than the number of queried nearest neighbors k.**
    - The value of ef can be anything between k and the size of the dataset.


- ef_construction - has the same meaning as ef, but controls the tradeoff between index build time and index accuracy.

    - Bigger ef_construction leads to longer construction but better index quality.
    - At some point, increasing ef_construction no longer improves the quality of the index.
    - One way to check whether the chosen ef_construction is adequate is to measure recall for an M-nearest-neighbor search with ef = ef_construction: if the recall is lower than 0.9, then there is room for improvement.
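
The two rules of thumb above (memory of roughly M * 8-10 bytes per stored element, and M * ef_construction held roughly constant) can be turned into quick estimates. This is only a sketch of those heuristics: the helper names are hypothetical, and the memory figure covers the graph links only, not the stored data.

```python
def estimate_link_memory_bytes(n_elements: int, M: int) -> tuple:
    """Rough (low, high) memory for graph links, using M * 8-10 bytes per element."""
    return (n_elements * M * 8, n_elements * M * 10)


def rescale_ef_construction(M_old: int, efc_old: int, M_new: int) -> int:
    """Pick a new ef_construction so that M * ef_construction stays roughly constant."""
    return max(1, round(M_old * efc_old / M_new))


low, high = estimate_link_memory_bytes(n_elements=10**6, M=16)
print(f"{low / 1e6:.0f}-{high / 1e6:.0f} MB")  # 128-160 MB

# Doubling M from 16 to 32 suggests halving ef_construction from 64 to 32
print(rescale_ef_construction(M_old=16, efc_old=64, M_new=32))  # 32
```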

https://github.com/nmslib/nmslib/blob/master/manual/methods.md



# Running your own optimization

To run your own optimization, set the parameters below and run each block in the Jupyter notebook.

__Note that this script requires the Python packages SciPy, scikit-learn, and Jupyter, which are not installed with GuideMaker by default but can be installed with pip or conda.__
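
Before running the cells, it can help to check which packages are missing from the current environment. The sketch below uses only the standard library. The list of import names is an assumption based on the imports used in this notebook (for example, scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names are assumptions taken from the notebook's import cell
REQUIRED = ["scipy", "sklearn", "pandas", "altair", "nmslib", "guidemaker"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```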

In [28]:
FASTA_PATH = "test_data/Carsonella_ruddii.fasta"
PAM = "NGG"
PAM_ORIENTATION = "5prime" 
DTYPE = "hamming"

In [29]:
import sys
import time
import math

import os
from Bio.Seq import Seq
from Bio import SeqIO


import nmslib
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
import pandas as pd
import altair as alt

import guidemaker



# Import the genome to optimize

In [31]:
# Find candidate targets in the genome; these are the input to the ground-truth search
pamobj = guidemaker.core.PamTarget(pam=PAM, pam_orientation=PAM_ORIENTATION, dtype=DTYPE)
gb = SeqIO.parse(FASTA_PATH, "fasta")
targets = pamobj.find_targets(seq_record_iter=gb, target_len=20)
tl = guidemaker.core.TargetProcessor(targets=targets, lsr=10, editdist=2, knum=10)
tl.find_unique_near_pam()
# Deduplicate the target sequences, then one-hot encode them for NMSLIB
notduplicated_targets = list(set(tl.targets['target'].tolist()))
bintargets = tl._one_hot_encode(notduplicated_targets)
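
The one-hot step is what lets a bit-level Hamming space stand in for nucleotide mismatches. The sketch below is a hypothetical stand-in, not GuideMaker's `_one_hot_encode`: it assumes each nucleotide maps to four bits and that the encoded targets are space-separated 0/1 strings. Under this encoding, each nucleotide mismatch flips exactly two bits, so the bit Hamming distance is twice the nucleotide Hamming distance.

```python
# Assumed encoding for illustration; GuideMaker's own mapping may differ
ONE_HOT = {"A": "1 0 0 0", "C": "0 1 0 0", "G": "0 0 1 0", "T": "0 0 0 1"}


def one_hot_encode(seq: str) -> str:
    """Encode a DNA sequence as a space-separated string of 0/1 bits."""
    return " ".join(ONE_HOT[base] for base in seq.upper())


def bit_hamming(a: str, b: str) -> int:
    """Count differing bits between two encoded strings of equal length."""
    return sum(x != y for x, y in zip(a.split(), b.split()))


x = one_hot_encode("ACGT")
y = one_hot_encode("ACGA")  # one nucleotide mismatch at the last position
print(x)                  # 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
print(bit_hamming(x, y))  # 2
```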





# Compute gold-standard

In [5]:
# Computing gold-standard
index = nmslib.init(space='bit_hamming',
                    dtype=nmslib.DistType.INT,
                    data_type=nmslib.DataType.OBJECT_AS_STRING,
                    method='brute_force')

index.addDataPointBatch(bintargets)

index.createIndex( print_progress=True)

start = time.time()
truth_list = index.knnQueryBatch(bintargets, k=3, num_threads = 4)
end = time.time()
print('brute-force kNN time total=%f (sec), per query=%f (sec)' % 
      (end-start, float(end-start)/len(bintargets)) )


brute-force kNN time total=0.067112 (sec), per query=0.000018 (sec)
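
The gold standard is exact because the `brute_force` method compares each query with every stored point. Conceptually, this resembles the pure-Python sketch below, which is illustrative only and far slower than NMSLIB. `brute_force_knn` is a hypothetical helper that returns one `(ids, distances)` pair per query.

```python
def brute_force_knn(points, k):
    """Exact kNN under Hamming distance by comparing every pair of points."""
    results = []
    for query in points:
        dists = [
            (sum(a != b for a, b in zip(query, other)), idx)
            for idx, other in enumerate(points)
        ]
        dists.sort()  # ascending distance, ties broken by index
        top = dists[:k]
        results.append(([idx for _, idx in top], [d for d, _ in top]))
    return results


targets = ["ACGT", "ACGA", "TTTT", "ACCA"]
for ids, distances in brute_force_knn(targets, k=3):
    print(ids, distances)
```

Because the queries are the indexed points themselves, the first neighbor of each query is the point itself at distance 0.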


In [6]:
def recall(results, truth):
    """Calculate recall for the top two kNN distances.

    Recall is calculated on the top two distances rather than on neighbor labels,
    because what matters is that the algorithm estimates the correct distance, not
    the exact identity of the neighbor: several neighbors can share the same edit
    distance.
    """
    assert len(results) == len(truth)
    tot = len(results)
    correct = 0
    for res, tr in zip(results, truth):
        # res[1] and tr[1] are distance arrays; all() requires both of the top two to match
        if all(res[1][0:2] == tr[1][0:2]):
            correct += 1
    return correct / tot
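
As a small worked example of this recall definition, the sketch below uses a list-based variant, `recall_top2` (a hypothetical name), on made-up toy data. Query 0 returns a different neighbor id but the same two smallest distances, so it still counts as correct. Only query 2 misses.

```python
def recall_top2(results, truth):
    """Fraction of queries whose two smallest distances match the ground truth."""
    assert len(results) == len(truth)
    correct = sum(
        1 for res, tr in zip(results, truth)
        if list(res[1][:2]) == list(tr[1][:2])
    )
    return correct / len(results)


# Each entry is (neighbor ids, distances); the values are toy data
truth = [([0, 5, 7], [0, 2, 4]), ([1, 3, 9], [0, 4, 4]), ([2, 8, 6], [0, 2, 6])]
approx = [([0, 6, 7], [0, 2, 4]), ([1, 9, 3], [0, 4, 4]), ([2, 6, 4], [0, 6, 8])]

print(recall_top2(approx, truth))  # 0.6666666666666666
```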

In [7]:
def KNN_NMSLIB(truth, bintargets, M, efC, post, ef, delaunay_type=2, threads=4):
    """Calculate approximate kNN and compare with the ground truth to get recall.

    Returns a tuple (recall, elapsed seconds); the time covers index construction
    plus querying. Note: delaunay_type is accepted but is not passed to the index.
    """
    start = time.time()
    index_params = {'M': M, 'indexThreadQty': threads, 'efConstruction': efC, 'post': post}
    index = nmslib.init(space='bit_hamming',
                        dtype=nmslib.DistType.INT,
                        data_type=nmslib.DataType.OBJECT_AS_STRING,
                        method='hnsw')
    index.addDataPointBatch(bintargets)
    index.createIndex(index_params)
    index.setQueryTimeParams({'efSearch': ef})
    # Query with the same thread count used to build the index
    results_list = index.knnQueryBatch(bintargets, k=3, num_threads=threads)
    end = time.time()
    rc = recall(results_list, truth)
    return rc, float(end-start)
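
`KNN_NMSLIB` reports a single time that includes both index construction and querying. If the two phases need to be compared separately, one option is a small timing helper such as the sketch below. `timed` and the two stand-in phase functions are hypothetical. In the notebook they would wrap the `createIndex` and `knnQueryBatch` calls.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-ins for the two phases (index build, then query)
def build_phase():
    return sorted(range(100_000, 0, -1))


def query_phase(index):
    return index[:3]


index, build_seconds = timed(build_phase)
hits, query_seconds = timed(query_phase, index)
print(f"build={build_seconds:.6f}s query={query_seconds:.6f}s hits={hits}")
```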


In [8]:
# Hyperparameter values initially used in GuideMaker
KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)

(1.0, 0.3290078639984131)

In [9]:
def simulateNmslibParameter(increase_by: int=10):
    """Grid search over M, ef, efC and post; keep the runs with recall above 0.99.
    """
    M_range = range(2, 101, increase_by)
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post_range = range(1, 101, increase_by)
    print("Total number of combinations: ", (len(M_range) * len(ef_range) * len(efC_range) * len(post_range)))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    for m in M_range:
        for ef in ef_range:
            for efC in efC_range:
                for post in post_range:
                    rc, tt = KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
                    n +=1
                    if rc > 0.9900000000000000:
                        pdict['run'].append(n)
                        pdict['M'].append(m)
                        pdict['ef'].append(ef)
                        pdict['efC'].append(efC)
                        pdict['post'].append(post)
                        pdict['accuracy'].append(rc)
                        pdict['time'].append(tt)
                        if n % 50 == 0:
                            print(n, m, ef, efC, post, rc, tt)
    return pdict
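
The four nested loops enumerate a Cartesian product of the parameter ranges. The sketch below rebuilds the same grid with `itertools.product`, which makes it easy to count the combinations before committing to a long run. `parameter_grid` is a hypothetical helper that reuses the ranges from the function above.

```python
from itertools import product


def parameter_grid(increase_by: int):
    """Return an iterator of (M, ef, efC, post) tuples over the ranges used above."""
    M_range = range(2, 101, increase_by)
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post_range = range(1, 101, increase_by)
    return product(M_range, ef_range, efC_range, post_range)


grid = list(parameter_grid(increase_by=50))
print(len(grid))  # 144
print(grid[0])    # (2, 3, 3, 1)
```

With the default `increase_by=10`, the same grid has 10 × 26 × 26 × 10 = 67,600 combinations, which is why a coarse step is used for the first pass.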

In [10]:
parameterBy50 = simulateNmslibParameter(increase_by=50)

Total number of combinations:  144
100 52 103 53 1 1.0 0.23253512382507324


In [11]:
dfBy50 = pd.DataFrame.from_dict(parameterBy50)
dfBy50.sort_values(['time'], ascending=[True])

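|  | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 1 | 77 | 52 | 3 | 53 | 51 | 0.991610 | 0.101679 |
| 11 | 89 | 52 | 53 | 53 | 51 | 0.999476 | 0.122370 |
| 3 | 79 | 52 | 3 | 103 | 51 | 0.995543 | 0.123400 |
| 5 | 81 | 52 | 3 | 153 | 51 | 0.997902 | 0.151634 |
| 21 | 101 | 52 | 103 | 53 | 51 | 1.000000 | 0.151906 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 38 | 120 | 52 | 153 | 253 | 1 | 1.000000 | 0.592271 |
| 44 | 128 | 52 | 203 | 153 | 1 | 1.000000 | 0.603659 |
| 48 | 132 | 52 | 203 | 253 | 1 | 1.000000 | 0.662859 |
| 57 | 142 | 52 | 253 | 203 | 1 | 1.000000 | 0.670264 |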


In [12]:
# Now restrict M to a smaller set of values and fix post to 1
def simulateNmslibParameter_ef_efc(increase_by: int=10):
    """Grid search over ef and efC for a few values of M, with post fixed to 1.
    """
    M = set([8, 16, 24, 32])
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range) *len(M))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    for m in M:
        for ef in ef_range:
            for efC in efC_range:
                rc, tt = KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
                n +=1
                if rc > 0.9900000000000000:
                    pdict['run'].append(n)
                    pdict['M'].append(m)
                    pdict['ef'].append(ef)
                    pdict['efC'].append(efC)
                    pdict['post'].append(post)
                    pdict['accuracy'].append(rc)
                    pdict['time'].append(tt)
                    if n % 50 == 0:
                            print(n, m, ef, efC, post, rc, tt)
    return pdict

In [13]:
efefc50 = simulateNmslibParameter_ef_efc(increase_by=50)

Total number of combinations:  144
100 32 203 103 1 1.0 0.3618149757385254


In [14]:
df_efefc50 = pd.DataFrame.from_dict(efefc50)
df_efefc50.sort_values(['time'], ascending=[True])

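|  | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 25 | 39 | 16 | 3 | 53 | 1 | 0.997116 | 0.150203 |
| 112 | 140 | 24 | 253 | 3 | 1 | 0.995543 | 0.151641 |
| 81 | 104 | 32 | 253 | 3 | 1 | 0.993183 | 0.151707 |
| 87 | 111 | 24 | 3 | 53 | 1 | 0.999213 | 0.153896 |
| 0 | 9 | 8 | 53 | 53 | 1 | 1.000000 | 0.157249 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 116 | 144 | 24 | 253 | 203 | 1 | 1.000000 | 0.500226 |
| 111 | 139 | 24 | 203 | 253 | 1 | 1.000000 | 0.501104 |
| 55 | 73 | 16 | 253 | 253 | 1 | 1.000000 | 0.517626 |
| 86 | 109 | 32 | 253 | 253 | 1 | 1.000000 | 0.537407 |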


In [21]:
alt.Chart(df_efefc50).mark_circle(size=60).encode(
    alt.X('accuracy',
        scale=alt.Scale(zero=False)
    ),
    y='time',
    color='efC',
    tooltip=['M', 'ef', 'efC', 'post']
).interactive()

In [24]:
# Now fix M to 16 and post to 1
def NarrowSimulateNmslibParameter_ef_efc(increase_by: int=10):
    """Grid search over ef and efC on a finer grid, with M fixed to 16 and post to 1.
    """
    M = 16 # fix this to 16 ~ memory used
    ef_range = range(3, 257, increase_by)
    efC_range = range(3, 257, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    m=M
    for ef in ef_range:
        for efC in efC_range:
            rc, tt = KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
            n +=1
            if rc > 0.9900000000000000:
                pdict['run'].append(n)
                pdict['M'].append(m)
                pdict['ef'].append(ef)
                pdict['efC'].append(efC)
                pdict['post'].append(post)
                pdict['accuracy'].append(rc)
                pdict['time'].append(tt)
                if n % 100 == 0: # print progress roughly every 100 iterations
                        print(n, m, ef, efC, post, rc, tt)
    return pdict

In [25]:
small_efefc10 = NarrowSimulateNmslibParameter_ef_efc(increase_by=10)


Total number of combinations:  676
100 16 33 203 1 1.0 0.31408071517944336
200 16 73 163 1 1.0 0.3466639518737793
300 16 113 123 1 1.0 0.275191068649292
400 16 153 83 1 1.0 0.30820798873901367
500 16 193 43 1 1.0 0.23264813423156738


In [26]:
df_small_efefc510 = pd.DataFrame.from_dict(small_efefc10)
df_small_efefc510.sort_values(['time'], ascending=[True])

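|  | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 23 | 29 | 16 | 13 | 13 | 1 | 0.991085 | 0.104714 |
| 24 | 30 | 16 | 13 | 23 | 1 | 0.997378 | 0.106712 |
| 48 | 55 | 16 | 23 | 13 | 1 | 0.993970 | 0.109985 |
| 49 | 56 | 16 | 23 | 23 | 1 | 0.998165 | 0.110339 |
| 74 | 82 | 16 | 33 | 23 | 1 | 0.998165 | 0.118393 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 572 | 599 | 16 | 223 | 253 | 1 | 1.000000 | 0.490191 |
| 595 | 623 | 16 | 233 | 233 | 1 | 1.000000 | 0.498984 |
| 219 | 232 | 16 | 83 | 223 | 1 | 1.000000 | 0.510211 |
| 571 | 598 | 16 | 223 | 243 | 1 | 1.000000 | 0.515706 |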


In [33]:
alt.Chart(df_small_efefc510).mark_circle(size=60).encode(
    alt.X('accuracy',
        scale=alt.Scale(zero=False)
    ),
    y='time',
    color='efC',
    tooltip=['M', 'ef', 'efC', 'post']
).interactive()
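
One way to read an accuracy-versus-time scatter like this is to look for configurations that no other configuration beats on both axes (a Pareto front). This is an optional analysis that the notebook itself does not perform. The sketch below applies it to a handful of rows copied from the sorted table above, and `pareto_front` is a hypothetical helper.

```python
def pareto_front(records):
    """Keep records that no other record beats on both accuracy and time."""
    front = []
    for r in records:
        dominated = any(
            o["accuracy"] >= r["accuracy"] and o["time"] <= r["time"]
            and (o["accuracy"] > r["accuracy"] or o["time"] < r["time"])
            for o in records
        )
        if not dominated:
            front.append(r)
    return front


# Rows copied from the M=16 results table (ef, efC, accuracy, time)
rows = [
    {"ef": 13, "efC": 13, "accuracy": 0.991085, "time": 0.104714},
    {"ef": 13, "efC": 23, "accuracy": 0.997378, "time": 0.106712},
    {"ef": 23, "efC": 13, "accuracy": 0.993970, "time": 0.109985},
    {"ef": 23, "efC": 23, "accuracy": 0.998165, "time": 0.110339},
    {"ef": 33, "efC": 23, "accuracy": 0.998165, "time": 0.118393},
    {"ef": 223, "efC": 253, "accuracy": 1.000000, "time": 0.490191},
]

for r in pareto_front(rows):
    print(r["ef"], r["efC"], r["accuracy"], r["time"])
```

Since each timing comes from a single run, small differences between neighboring points should be treated as noise rather than as a strict ranking.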

In [34]:
# Parameters originally used in GuideMaker

print("Recall and run time for the original parameters used in GuideMaker")
print("Recall: %f Run time: %f "% (KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=16, efC=64, post=1, ef=256, threads=4)))


Recall and run time for the original parameters used in GuideMaker
Recall: 1.000000 Run time: 0.386257 


In [35]:
# After parameter tuning
print("Recall and run time after parameter tuning")
print("Recall: %f Run time: %f "% (KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=16, efC=13, post=1, ef=13, threads=4)))

Recall and run time after parameter tuning
Recall: 0.987415 Run time: 0.137392 


#### Since the run time was low around efC=13 and ef=13, we can run a finer optimization around those values.
- Change the range of ef to range(3, 20)
- Change the range of efC to range(3, 20)
- Set increase_by to 1 (the call below overrides the function default of 2, giving 17 x 17 = 289 combinations)

In [36]:
def FinetuneNmslibParameter(increase_by: int=2):
    M = 16 # fix this to 16 ~ memory used
    ef_range = range(3, 20, increase_by)
    efC_range = range(3, 20, increase_by)
    post = 1
    print("Total number of combinations: ", len(ef_range) * len(efC_range))
    pdict = dict(run=[], M = [], ef = [], efC = [], post = [], accuracy = [], time=[])
    n=1
    m=M
    for ef in ef_range:
        for efC in efC_range:
            rc, tt = KNN_NMSLIB(truth=truth_list, bintargets=bintargets, M=m, efC=efC, post=post, ef=ef, threads=4)
            n +=1
            if rc > 0.9900000000000000:
                pdict['run'].append(n)
                pdict['M'].append(m)
                pdict['ef'].append(ef)
                pdict['efC'].append(efC)
                pdict['post'].append(post)
                pdict['accuracy'].append(rc)
                pdict['time'].append(tt)
                if n % 40 == 0: # print progress roughly every 40 iterations
                        print(n, m, ef, efC, post, rc, tt)
    return pdict

In [37]:
finetune = FinetuneNmslibParameter(increase_by=1)

Total number of combinations:  289
120 16 9 19 1 0.9950183534347142 0.09722208976745605
200 16 14 14 1 0.9926586261143157 0.10931682586669922


In [38]:
df_finetune = pd.DataFrame.from_dict(finetune)
df_finetune.sort_values(['time'], ascending=[True])

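|  | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 33 | 185 | 16 | 13 | 16 | 1 | 0.991610 | 0.089851 |
| 22 | 152 | 16 | 11 | 17 | 1 | 0.991348 | 0.090955 |
| 16 | 135 | 16 | 10 | 17 | 1 | 0.990037 | 0.091945 |
| 7 | 102 | 16 | 8 | 18 | 1 | 0.992134 | 0.091978 |
| 34 | 186 | 16 | 13 | 17 | 1 | 0.991348 | 0.092267 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10 | 116 | 16 | 9 | 15 | 1 | 0.991872 | 0.137269 |
| 78 | 285 | 16 | 19 | 14 | 1 | 0.995018 | 0.137502 |
| 47 | 218 | 16 | 15 | 15 | 1 | 0.995281 | 0.137514 |
| 62 | 252 | 16 | 17 | 15 | 1 | 0.995281 | 0.137609 |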


In [40]:
alt.Chart(df_finetune).mark_circle(size=60).encode(
    alt.X('accuracy',
        scale=alt.Scale(zero=False)
    ),
    y='time',
    color='efC',
    tooltip=['M', 'ef', 'efC', 'post']
).interactive()

In [39]:
min_record = df_finetune[df_finetune.time == df_finetune.time.min()]
min_record

|  | run | M | ef | efC | post | accuracy | time |
|---|---|---|---|---|---|---|---|
| 33 | 185 | 16 | 13 | 16 | 1 | 0.99161 | 0.089851 |
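
The selection above takes the fastest run among those that already passed the 0.99 recall filter. A standard-library sketch of the same idea, with the recall threshold made explicit, is shown below. `fastest_above` is a hypothetical helper, and the rows are the four fastest entries copied from the fine-tuning table.

```python
def fastest_above(records, min_accuracy):
    """Return the fastest record whose accuracy meets the threshold, or None."""
    eligible = [r for r in records if r["accuracy"] >= min_accuracy]
    return min(eligible, key=lambda r: r["time"]) if eligible else None


# The four fastest rows from the fine-tuning results
records = [
    {"run": 185, "ef": 13, "efC": 16, "accuracy": 0.991610, "time": 0.089851},
    {"run": 152, "ef": 11, "efC": 17, "accuracy": 0.991348, "time": 0.090955},
    {"run": 135, "ef": 10, "efC": 17, "accuracy": 0.990037, "time": 0.091945},
    {"run": 102, "ef": 8, "efC": 18, "accuracy": 0.992134, "time": 0.091978},
]

print(fastest_above(records, min_accuracy=0.99)["run"])   # 185
print(fastest_above(records, min_accuracy=0.992)["run"])  # 102
```

Raising the threshold shifts the choice toward a slightly slower configuration, which makes the recall/run-time tradeoff explicit.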


In [41]:
# After parameter fine-tuning
print("Recall and run time after fine-tuning of parameters")
print("Recall: %f Run time: %f "% 
      (KNN_NMSLIB(truth=truth_list,
                 bintargets=bintargets,
                 M=int(min_record.M.values[0]),
                 efC=int(min_record.efC.values[0]),
                 post=1,
                 ef=int(min_record.ef.values[0]),
                 threads=4)))

print("\n")


Recall and run time after fine-tuning of parameters
Recall: 0.990037 Run time: 0.109399 
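
To put the two measurements side by side, the short calculation below compares the original parameters (M=16, efC=64, ef=256) with the fine-tuned ones, using the recall and run-time values printed in this notebook. The timings are single runs and vary between executions: the same tuned configuration was recorded at 0.089851 s during the grid search and 0.109399 s here. The ratio is therefore only indicative.

```python
# Values copied from the notebook outputs
original = {"recall": 1.000000, "time": 0.386257}  # M=16, efC=64, ef=256
tuned = {"recall": 0.990037, "time": 0.109399}     # M=16, efC=16, ef=13

speedup = original["time"] / tuned["time"]
recall_loss = original["recall"] - tuned["recall"]
print(f"speedup: {speedup:.2f}x, recall loss: {recall_loss:.4f}")
# speedup: 3.53x, recall loss: 0.0100
```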




# Final recommendation

In [42]:
print("Optimized parameters: \n")
print("M:", int(min_record.M.values[0]), 
      "efC:", int(min_record.efC.values[0]), 
      "ef:", int(min_record.ef.values[0]),
     "post:", 1)

Optimized parameters: 

M: 16 efC: 16 ef: 13 post: 1
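
As a sketch of how these values map onto the parameter dictionaries used by `KNN_NMSLIB` above, the hypothetical helper `hnsw_params` below builds the index-time and query-time settings. It only constructs the dictionaries, so it runs without NMSLIB installed.

```python
def hnsw_params(M: int, efC: int, ef: int, post: int = 1, threads: int = 4):
    """Build the index-time and query-time parameter dicts used by KNN_NMSLIB."""
    index_params = {"M": M, "indexThreadQty": threads, "efConstruction": efC, "post": post}
    query_params = {"efSearch": ef}
    return index_params, query_params


index_params, query_params = hnsw_params(M=16, efC=16, ef=13)
print(index_params)
print(query_params)
# With NMSLIB installed, these would be passed as in KNN_NMSLIB:
#   index.createIndex(index_params)
#   index.setQueryTimeParams(query_params)
```

These values were tuned on a single small genome (`test_data/Carsonella_ruddii.fasta`), so it may be worth repeating the search on a genome of the size you plan to process.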
