# CS Local search

### Outline

1. Export the model.
2. Run large-batch inference of code embeddings, write to CSV.
3. Load code embeddings into nmslib search index.
4. Using tf.Eager, perform inference for a query in notebook.
5. Search that query embedding against the search index, listing hits in notebook.


### Use tf.Eager

This will be far slower, in part because for now I can't use batch sizes > 1 (for whatever reason). But I know this will work for sure. My first guess was under 5 min to compute embeddings for 10k examples, but 200 examples take about 10-20 s, so 10k is more like 15 min, and an index of 3k examples is about 5 min. So maybe build the index from a set of things that are or aren't known to go together, e.g. two very distinct codebases instead of a random sample of everything. Or maybe just use the t2t library?
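The rough timing estimate above can be written out as arithmetic. This is a minimal sketch, assuming the observed rate of 200 examples in 10-20 s scales linearly; the helper name `estimate_minutes` is hypothetical and not part of the notebook.

```python
def estimate_minutes(num_examples, batch_examples=200, batch_seconds=(10.0, 20.0)):
    """Scale an observed timing range linearly; returns (low, high) in minutes."""
    low, high = batch_seconds
    scale = num_examples / float(batch_examples)
    return (low * scale / 60.0, high * scale / 60.0)

for n in (3000, 10000):
    low_min, high_min = estimate_minutes(n)
    print("%d examples: %.1f to %.1f minutes" % (n, low_min, high_min))
```

Under that assumption, 3k examples land at the upper end near 5 minutes and 10k examples somewhere between roughly 8 and 17 minutes, which is consistent with the "more like 15 min" guess.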

In [1]:

import numpy as np

import functools

import tensorflow as tf

from tensor2tensor.data_generators import problem
from tensor2tensor.layers import common_layers
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry
from tensor2tensor.utils import t2t_model
from tensor2tensor.models.transformer import Transformer

from tensor2tensor.models.transformer import transformer_base

from tensorflow.contrib.eager.python import tfe
tfe.enable_eager_execution()
Modes = tf.estimator.ModeKeys


from tk.models import similarity_transformer
from tk.data_generators import function_docstring

np.random.seed(0)
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline




In [2]:

mp_constrained_embedding = function_docstring.GithubConstrainedEmbedding()

data_dir = "/mnt/nfs-east1-d/data"

hparams = similarity_transformer.similarity_transformer_tiny()
hparams.data_dir = data_dir

p_hparams = mp_constrained_embedding.get_hparams(hparams)

model = similarity_transformer.ConstrainedEmbeddingTransformer(
    hparams, tf.estimator.ModeKeys.PREDICT, p_hparams
)

# Get the encoders from the problem
encoders = mp_constrained_embedding.feature_encoders(data_dir)

# Setup helper functions for encoding and decoding
def encode(input_str, output_str=None):
  """Input str to features dict, ready for inference"""
  inputs = encoders["inputs"].encode(input_str) + [1]  # add EOS id
  batch_inputs = tf.reshape(inputs, [1, -1, 1])  # Make it 3D.
  return {"inputs": batch_inputs}

def decode(integers):
  """List of ints to str
  
  For decoding an integer encoding to its string representation,
  not for decoding an embedding vector into the same.
  """
  integers = list(np.squeeze(integers))
  if 1 in integers:
    integers = integers[:integers.index(1)]
  return encoders["inputs"].decode(np.squeeze(integers))

batch_size = 1
train_dataset = mp_constrained_embedding.dataset(Modes.PREDICT, data_dir)
train_dataset = train_dataset.repeat(None).batch(batch_size)

iterator = tfe.Iterator(train_dataset)


[2018-10-29 17:22:53,757] Setting T2TModel mode to 'infer'
[2018-10-29 17:22:53,761] Setting hparams.layer_prepostprocess_dropout to 0.0
[2018-10-29 17:22:53,763] Setting hparams.symbol_dropout to 0.0
[2018-10-29 17:22:53,767] Setting hparams.attention_dropout to 0.0
[2018-10-29 17:22:53,770] Setting hparams.dropout to 0.0
[2018-10-29 17:22:53,772] Setting hparams.relu_dropout to 0.0
[2018-10-29 17:22:53,839] Reading data files from /mnt/nfs-east1-d/data/github_function_docstring-dev*
[2018-10-29 17:22:53,846] partition: 0 num_data_files: 1
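To make the tensor shape produced by `encode` and the EOS handling in `decode` concrete, here is a minimal NumPy-only sketch. The toy vocabulary and the `toy_encode` / `toy_decode` helpers are hypothetical stand-ins for the real t2t encoders; only the EOS id of 1 and the `[1, -1, 1]` reshape are taken from the cell above.

```python
import numpy as np

EOS_ID = 1  # same EOS id the notebook appends in encode()

# Hypothetical toy vocabulary; ids 0 and 1 are reserved for PAD and EOS.
VOCAB = ["<pad>", "<EOS>", "fetch", "the", "marker", "field"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def toy_encode(text):
    """Whitespace-tokenize, append EOS, and reshape to [batch=1, length, 1]."""
    ids = [TOKEN_TO_ID[tok] for tok in text.split()] + [EOS_ID]
    return np.asarray(ids, dtype=np.int64).reshape([1, -1, 1])

def toy_decode(integers):
    """Flatten, cut at the first EOS, and map ids back to tokens."""
    ids = [int(i) for i in np.asarray(integers).reshape(-1)]
    if EOS_ID in ids:
        ids = ids[:ids.index(EOS_ID)]
    return " ".join(VOCAB[i] for i in ids)

batch = toy_encode("fetch the marker field")
print(batch.shape)
print(toy_decode(batch))
```

The round trip drops the EOS token again, which is why decoded strings in the search results further down carry no end marker.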


In [3]:

def embed_many(dataset_iterator, model, ckpt_path, num=100):
  """Embed the code field of the next `num` examples from the iterator.

  Returns a list of [code_embedding, code_str, docstring_str] entries.
  """
  embeddings = []

  with tfe.restore_variables_on_create(ckpt_path):
    for i in range(num):
      example = dataset_iterator.next()

      # Only the code side is embedded here. The docstring could be embedded
      # the same way with model({"inputs": example["docstring"]}).
      code_emb, _ = model({"inputs": example["code"]})

      embeddings.append([code_emb, decode(example["code"]), decode(example["docstring"])])

      if i % 100 == 0:
        print("Processing step %s" % i)

  print("Finished embedding %s" % num)

  return embeddings


In [4]:
ckpt_path = "gs://kubeflow-rl-checkpoints/comparisons/cs-v7-lvslicenet-ts1-tcs1-tm1-exl1-ntm1/cs-v7-lvslicenet-ts1-tcs1-tm1-exl1-ntm1-j1026-1730-f4d7/output/model.ckpt-215387"

In [5]:
import datetime

In [6]:
print(datetime.datetime.now())

2018-10-29 17:22:53.954837


In [30]:
embeddings = embed_many(iterator, model, ckpt_path, 10000)

Processing step 0
Processing step 100
Processing step 200
Processing step 300
Processing step 400
Processing step 500
Processing step 600
Processing step 700
Processing step 800
Processing step 900
Processing step 1000
Processing step 1100
Processing step 1200
Processing step 1300
Processing step 1400
Processing step 1500
Processing step 1600
Processing step 1700
Processing step 1800
Processing step 1900
Processing step 2000
Processing step 2100
Processing step 2200
Processing step 2300
Processing step 2400
Processing step 2500
Processing step 2600
Processing step 2700
Processing step 2800
Processing step 2900
Processing step 3000
Processing step 3100
Processing step 3200
Processing step 3300
Processing step 3400
Processing step 3500
Processing step 3600
Processing step 3700
Processing step 3800
Processing step 3900
Processing step 4000
Processing step 4100
Processing step 4200
Processing step 4300
Processing step 4400
Processing step 4500
Processing step 4600
Processing step 4700
Proc

In [31]:
print(datetime.datetime.now())

2018-10-29 17:35:52.532747
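The two printed timestamps bracket the embedding run, so the elapsed time can be computed directly. This is a sketch under two assumptions: that all 10,000 requested examples completed (the progress log above is cut off), and that nothing else of significance ran between the two `datetime.now()` cells.

```python
import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
start = datetime.datetime.strptime("2018-10-29 17:22:53.954837", FMT)
end = datetime.datetime.strptime("2018-10-29 17:35:52.532747", FMT)

elapsed = (end - start).total_seconds()
num_requested = 10000  # the count passed to embed_many above

print("elapsed: %.1f s (%.1f min)" % (elapsed, elapsed / 60.0))
print("per example: %.3f s" % (elapsed / num_requested))
```

That works out to roughly 13 minutes in total, in line with the 10-20 s per 200 examples observed earlier.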


In [39]:

numpy_embedding_vectors = np.asarray([thing[0].numpy()[0] for thing in embeddings])

embedding_data_path = "/mnt/nfs-east1-d/tmp/embeddings.csv"

np.savetxt(embedding_data_path, numpy_embedding_vectors, delimiter=",")
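For reloading the CSV later, a round trip through `np.savetxt` and `np.loadtxt` looks like the sketch below. The random vectors and the temporary path are hypothetical stand-ins for `numpy_embedding_vectors` and the NFS path used above.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for numpy_embedding_vectors: 4 vectors of width 8.
vectors = np.random.RandomState(0).randn(4, 8).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "embeddings.csv")
np.savetxt(path, vectors, delimiter=",")

# ndmin=2 keeps a 2-D shape even if the file holds a single row.
loaded = np.loadtxt(path, delimiter=",", ndmin=2).astype(np.float32)

print(loaded.shape)
print(np.allclose(vectors, loaded))
```

Note that the CSV holds only the vectors. The code and docstring text stay in the in-memory `embeddings` list, so row order has to match if the file is reloaded and used for lookups.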


### Build an index

In [53]:

import nmslib
import pprint

# Drop any index left over from a previous run of this cell.
if "index" in globals():
  del index

def build_index(index_save_path, index_data):
  index = nmslib.init(method='hnsw', space='cosinesimil')
  index.addDataPointBatch(index_data)
  index.createIndex({'post': 2}, print_progress=True)
  index.saveIndex(index_save_path)
  return index

def embed_query(model, ckpt_path, query):
  with tfe.restore_variables_on_create(ckpt_path):
    return model(encode(query))[0].numpy()[0]

ckpt_path = "gs://kubeflow-rl-checkpoints/comparisons/cs-v7-lvslicenet-ts1-tcs1-tm1-exl1-ntm1/cs-v7-lvslicenet-ts1-tcs1-tm1-exl1-ntm1-j1026-1730-f4d7/output/model.ckpt-215387"
query_fn = functools.partial(embed_query, model, ckpt_path)

local_index_save_path = "/mnt/nfs-east1-d/tmp/index-004"

k = 10000

def results_for_query(query):
  query_embedding = query_fn(query)

  idxs, dists = index.knnQuery(query_embedding, k=k)

  hits = []
    
  for (i, d) in zip(idxs, dists):
    hits.append((embeddings[i][2], embeddings[i][1], d))
  return hits

def print_hits(hit_subset):
  for hit in hit_subset:
    print(hit[0])
    print(hit[1])
    print(hit[2])
    print("\n")
  print("--")


In [54]:

index = build_index(local_index_save_path, numpy_embedding_vectors)
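HNSW is an approximate nearest-neighbour method, so an exact brute-force search is a useful sanity check at this scale (10k vectors). The sketch below is NumPy only and assumes, to my understanding, that nmslib's `cosinesimil` space reports cosine distance as 1 minus cosine similarity; `brute_force_cosine_knn` and the toy data are hypothetical, and zero-length vectors are not handled.

```python
import numpy as np

def brute_force_cosine_knn(data, query, k):
    """Exact k nearest neighbours by cosine distance (1 - cosine similarity)."""
    data = np.asarray(data, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    # Normalize rows and the query so the dot product is the cosine similarity.
    data_norm = data / np.linalg.norm(data, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    dists = 1.0 - data_norm.dot(query_norm)
    idxs = np.argsort(dists)[:k]
    return idxs, dists[idxs]

# Hypothetical toy data standing in for numpy_embedding_vectors.
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idxs, dists = brute_force_cosine_knn(toy_vectors, np.array([1.0, 0.1]), k=2)
print(idxs, dists)
```

Running the same query through both this function and `index.knnQuery` and comparing the returned ids would show whether a poor ranking comes from the index or from the embeddings themselves.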


In [61]:

hits = results_for_query("fetch the pagination marker field from flask.request.args")


In [62]:

# Most similar
print_hits(hits[0:5])

# Least similar
print_hits(hits[-5:])

# Nope: none of the top hits relate to the query.


when processing an error , consume the response body to ensure it is n't mixed up with the next request in case the connection is kept alive .
staticmethod def _drain response try response read except socket error pass
0.0267815


scrapes the list of modules associated with circuitpython . causes scrapy to follow the links to the module docs and uses a different parser to extract the api information contained therein .
def parse self response for next_page in response css div toctree wrapper li a yield response follow next_page self parse_api
0.027252793


redirect to the detail view after updating the registration
def test_redirect_to_pet self resp self client get reverse meupet update_register args self pet request_key self assertRedirects resp reverse meupet detail args self pet slug
0.028014004


unpin the database from master in the current db .
pytest fixture autouse True def unpin_db request request addfinalizer pinning unpin_this_thread
0.028688014


iterate over accounts this 

In [63]:
for etuple in embeddings:
  if "lask" in etuple[2]:
    print(etuple[2])

required by ` flask - login < https://flask - login.readthedocs.org / en / latest/>`_.
fetch the pagination marker field from flask.request.args .
yield a tuple for each flask handler containing annotated methods .
validate auth0 tokens passed in the request 's header , hence ensuring that the user is authenticated . code copied from : https://github.com/auth0/auth0-python/tree/master/examples/flask-api
create a flask application and load all configuration files


In [64]:

# That exact doc string is present in the dataset.


In [65]:

for etuple in hits:
  if "lask" in etuple[0]:
    print(etuple)
    

('fetch the pagination marker field from flask.request.args .', 'request_field marker raises_coercion_exceptions def marker_field value assert uuidutils is_uuid_like value _ Marker not UUID like return value', 0.04419422)
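To see where the expected result falls in the ranking, rather than only its distance, a small helper can report its position. This is a sketch that assumes `hits` is ordered by increasing distance, as in the output of `results_for_query`; `rank_of_first_match` and the toy hits are hypothetical.

```python
def rank_of_first_match(hits, needle):
    """Return the 0-based rank of the first hit whose docstring contains needle, or None."""
    for rank, (docstring, _code, _dist) in enumerate(hits):
        if needle in docstring:
            return rank
    return None

# Hypothetical hits in the same (docstring, code, distance) layout, sorted by distance.
toy_hits = [
    ("drain the response body", "def _drain", 0.026),
    ("redirect to the detail view", "def test_redirect", 0.028),
    ("fetch the pagination marker field from flask.request.args .", "def marker_field", 0.044),
]

rank = rank_of_first_match(toy_hits, "flask")
print(rank, len(toy_hits))
```

Applied to the real `hits` list, `rank_of_first_match(hits, "lask")` would give the position of the matching docstring among all results.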


In [73]:

np.median([hit[2] for hit in hits])


0.041648388

In [2]:
[hit[2] for hit in hits]

# Not a highly ranked hit: its distance (0.0442) is above the median distance (0.0416).

NameError: name 'hits' is not defined