### Prerequisites

You should have completed steps 1-4 of this tutorial before beginning this exercise.  The files required for this notebook are generated by those previous steps.

Creating the search engine for this example is extremely CPU and memory intensive.  We used an AWS `x1.32xlarge` instance (128 cores) to build the search index as quickly as possible.

In [2]:
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import nmslib
from lang_model_utils import load_lm_vocab, Query2Emb
from general_utils import create_nmslib_search_index

input_path = Path('./data/processed_data/')
code2emb_path = Path('./data/code2emb/')
output_path = Path('./data/search')
output_path.mkdir(exist_ok=True)



## Read in Metadata

We need to gather the data that will be displayed with each search result:

1. The original code
2. A link to the original code

For convenience, we will collect this data into a pandas dataframe.

In [8]:
import pandas as pd
df = pd.read_pickle('dataframe_processed.pkl')

In [9]:
def listlen(x):
    if not isinstance(x, list):
        return 0
    return len(x)

# Separate functions with and without docstrings.
# A docstring must contain at least 3 words to be considered valid.

# Count the words once and reuse the result for both filters.
n_docstring_words = df.docstring_tokens.str.split().apply(listlen)
with_docstrings = df[n_docstring_words >= 3]
without_docstrings = df[n_docstring_words < 3]
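
To see what this filter does with edge cases, here is a minimal sketch on a made-up dataframe. The four toy rows are illustrative only and are not taken from the tutorial data. The point is that `.str.split()` leaves a missing docstring as a non-list value, which `listlen` counts as zero words.

```python
import pandas as pd

def listlen(x):
    # Missing docstrings come back from .str.split() as a non-list value.
    if not isinstance(x, list):
        return 0
    return len(x)

# Hypothetical toy data: one valid docstring, one missing, one too short, one empty.
toy_df = pd.DataFrame({
    'docstring_tokens': ['return the sum', None, 'todo', ''],
    'original_function': ['def a(): ...', 'def b(): ...', 'def c(): ...', 'def d(): ...'],
})

word_counts = toy_df.docstring_tokens.str.split().apply(listlen)
toy_with = toy_df[word_counts >= 3]
toy_without = toy_df[word_counts < 3]

print(list(word_counts))
print(len(toy_with), len(toy_without))
```

Only the first row has three or more words, so the other three rows land in the set without docstrings.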

In [10]:
# Keep the original source of each function that lacks a docstring; these are the search results.
ref_df = without_docstrings['original_function']

## Create Search Index For Vectorized Code

First, read in the vectorized code:

In [7]:
nodoc_vecs = np.load('./data/code2emb/nodoc_vecs.npy')
assert nodoc_vecs.shape[0] == ref_df.shape[0]

Now build the search index. **Warning:** this step takes about 18 minutes on an `x1.32xlarge` instance.

In [8]:
search_index = create_nmslib_search_index(nodoc_vecs)
search_index.saveIndex('./data/search/search_index.nmslib')
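
The index is later loaded as an HNSW index over the `cosinesimil` space, so its answers are approximate nearest neighbors. A minimal sketch of the exact computation it approximates can help when sanity-checking results on a small sample. The function name `exact_cosine_knn` and the toy vectors are hypothetical, and the sketch assumes the reported distance is one minus the cosine similarity.

```python
import numpy as np

def exact_cosine_knn(vectors, query, k=5):
    """Return (indices, distances) of the k rows of `vectors` nearest to `query`."""
    vectors = np.asarray(vectors, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64).ravel()
    # Assumed definition: cosine distance = 1 - cosine similarity.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    dists = 1.0 - (vectors @ query) / norms
    order = np.argsort(dists)[:k]
    return order, dists[order]

# Toy 2-d vectors; the real embeddings are much wider.
toy_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idxs, dists = exact_cosine_knn(toy_vecs, np.array([[2.0, 0.0]]), k=2)
print(idxs, dists)
```

Brute force scans every row on every query, which is why an approximate index is worth the build time for a corpus of this size.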

A cached version of this index can be downloaded here:  

# Create A Minimal Search Engine

In [3]:
lang_model = torch.load('./data/lang_model/lang_model_cpu_v2.torch', 
                        map_location=lambda storage, loc: storage)

vocab = load_lm_vocab('./data/lang_model/vocab_v2.cls')
q2emb = Query2Emb(lang_model = lang_model.cpu(),
                  vocab = vocab)

search_index = nmslib.init(method='hnsw', space='cosinesimil')
search_index.loadIndex('./data/search/search_index.nmslib')



`Query2Emb` is a helper class that will vectorize sentences using the language model trained in Part 3.  

Here we call the method `emb_mean`, which averages the hidden states over the time steps to produce a sentence embedding for the query supplied by the user.  

In [4]:
test = q2emb.emb_mean('Hello World!  This is a test.')
test.shape



(1, 500)
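
The internals of `emb_mean` are not shown in this notebook, so the following is only a sketch of the mean-over-time-steps idea. It assumes hidden states laid out as `(batch, time, dim)`, and the random array stands in for real language model output. `mean_pool` is a hypothetical name.

```python
import numpy as np

def mean_pool(hidden_states):
    """Average hidden states over the time axis: (batch, time, dim) -> (batch, dim)."""
    return np.asarray(hidden_states).mean(axis=1)

# Stand-in for the hidden states of one 7-token query with 500 hidden units.
rng = np.random.default_rng(0)
fake_hidden = rng.normal(size=(1, 7, 500))

sentence_vec = mean_pool(fake_hidden)
print(sentence_vec.shape)
```

Averaging removes the time dimension, so queries of any length map to a vector of the same width as the vectors stored in the index.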

### Create an object to make the process of showing search results easier

The class below ties the pieces together so that searching the index and displaying the results takes a single method call.  

In [6]:
class search_engine:
    """Organizes all the necessary elements we need to make a search engine."""
    def __init__(self, 
                 nmslib_index, 
                 ref_df, 
                 query2emb_func):
        """
        Parameters
        ==========
        nmslib_index : nmslib object
            This is pre-computed search index.
        ref_df : pandas.Series or pandas.DataFrame
            Data to display for each search result.  Row i must correspond
            to vector i in the search index.  In this notebook it is the
            Series of original function source code.
        query2emb_func : callable
            This is a function that takes as input a string and returns a vector
            that is in the same vector space as what is loaded into the search index.

        """
        
        self.search_index = nmslib_index
        self.ref_df = ref_df
        self.query2emb_func = query2emb_func
    
    def search(self, str_search, k=5):
        """
        Prints the code of the nearest neighbors (by cosine distance)
        to the search query.
        
        Parameters
        ==========
        str_search : str
            a search query.  Ex: "read data into pandas dataframe"
        k : int
            the number of nearest neighbors to return.  Defaults to 5.
        
        """
        query = self.query2emb_func(str_search)
        idxs, dists = self.search_index.knnQuery(query, k=k)
        
        for idx, dist in zip(idxs, dists):
            code = self.ref_df.iloc[idx]
            print(f'cosine dist:{dist:.4f} \n---------------\n')
            print(code)

In [11]:
se = search_engine(nmslib_index=search_index,
                   ref_df=ref_df,
                   query2emb_func=q2emb.emb_mean)
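
The `search` method prints its results. If you would rather collect them for further processing, one possible variation is sketched below. `collect_results` and `FakeIndex` are hypothetical names introduced for illustration; `FakeIndex` only mimics the `knnQuery(query, k)` call so that the sketch runs without nmslib or the trained language model.

```python
def collect_results(index, snippets, embed, query, k=5):
    """Return [(distance, snippet), ...] for the k nearest neighbors of `query`."""
    idxs, dists = index.knnQuery(embed(query), k=k)
    # Look up each neighbor by position, mirroring ref_df.iloc[idx] in the class above.
    return [(float(d), snippets[int(i)]) for i, d in zip(idxs, dists)]

class FakeIndex:
    """Hypothetical stand-in that always answers with the same precomputed neighbors."""
    def __init__(self, idxs, dists):
        self.idxs, self.dists = idxs, dists

    def knnQuery(self, query, k=5):
        return self.idxs[:k], self.dists[:k]

snippets = ['def a(): pass', 'def b(): pass', 'def c(): pass']
index = FakeIndex(idxs=[2, 0, 1], dists=[0.10, 0.25, 0.40])
results = collect_results(index, snippets, embed=lambda text: [0.0],
                          query='anything', k=2)
print(results)
```

With the real objects, `snippets` would be replaced by positional lookups into `ref_df`, and `embed` by `q2emb.emb_mean`.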

# Run Some Queries Against The Index!!

Now that we have instantiated the search engine, we can use the `search` method to display the results.

**Warning:** some of the displayed links may not work, since this is historical data retrieved from an [open dataset Google hosts on BigQuery](https://cloud.google.com/bigquery/public-data/github).

In [19]:
se.search('plot 3d gaussian')



cosine dist:0.1352 
---------------

def plot_1d(self, f, a, b, grid_size=1000):
    grid = np.linspace(a, b, num=grid_size).reshape((-1, 1))
    mu, sigma = self.utility.mean_and_std(grid)
    plt.plot(grid, f(grid), color='black', linewidth=1.5, label='f')
    plt.plot(grid, mu, color='red', label='mu')
    plt.plot(grid, mu + sigma, color='blue', linewidth=0.4, label='mu+sigma')
    plt.plot(grid, mu - sigma, color='blue', linewidth=0.4)
    plt.plot(self.points, f(np.asarray(self.points)), 'o', color='red')
    plt.xlim([a - 0.5, b + 0.5])
    plt.show()

cosine dist:0.1357 
---------------

def plotIQ(data, Fs):
    if Fs == None:
        plt.plot(np.real(data), label='I')
        plt.plot(np.imag(data), label='Q')
    else:
        plt.plot(np.real(data), label='I')
        plt.plot(np.imag(data), label='Q')
    plt.grid(True)
    plt.legend(loc='upper right', frameon=True)
    plt.show()

cosine dist:0.1366 
---------------

def plot_butterfly(evoked, ax=None, sig=None, color=No

In [20]:
from IPython.core.magic import register_cell_magic

@register_cell_magic
def search(line, cell):
    """Cell magic: run the cell body as a query against the search engine."""
    return se.search(cell)

### Live Semantic Search of Code (Searching Holdout Set Only)

In [23]:
%%search
traverse a list



cosine dist:0.1974 
---------------

def traverse(self, root):
    self.worklist = []
    self.__run(root)

cosine dist:0.2037 
---------------

def traverse_tree(self, name, cat='pre'):
    base = self.data[cat]['roles'][name]
    base_tree = base['tree']
    role_keys = base_tree.keys()
    self.traverse_print_tree(base_tree, role_keys)

cosine dist:0.2062 
---------------

def traverse(self):
    self.initialize()
    self.path = deque()
    self.astar_traverse(None)
    self.form_path()
    return list(self.path), self.steps

cosine dist:0.2091 
---------------

def walk(dag, walk_func):
    return dag.walk(walk_func)

cosine dist:0.2107 
---------------

def walk(self):
    return self.walk_preorder()

