# 3 - Train Language Model Using FastAI.ipynb

## Pre-Requisite: Make Sure you have the right files prepared from Step 1

You should have these files in the root of the ./data/processed_data/ directory:

    1. {train/valid/test}.function - Python function definitions, tokenized by spaces, one function per line.
    2. {train/valid/test}.docstring - the docstrings for those function definitions, in 1:1 correspondence with the lines in the *.function files.
    3. {train/valid/test}.lineage - each line links back to the original location (GitHub repo link) from which the code was retrieved. The lines correspond 1:1 with the lines in the other two files, which is useful for debugging.
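
Because everything downstream relies on that 1:1 line correspondence, it can be worth checking before training. The following is a minimal sketch, not part of the tutorial's utilities: `count_lines` and `check_alignment` are hypothetical helper names, and the file naming is assumed to follow the list above.

```python
from pathlib import Path

def count_lines(path):
    """Count the lines in a text file without loading it all into memory."""
    with open(path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)

def check_alignment(data_dir, splits=('train', 'valid', 'test'),
                    suffixes=('function', 'docstring', 'lineage')):
    """Return {split: line_count}; raise if the files of a split disagree."""
    data_dir = Path(data_dir)
    counts = {}
    for split in splits:
        # Hypothetical layout: <data_dir>/<split>.<suffix>, as listed above.
        per_file = {s: count_lines(data_dir / f'{split}.{s}') for s in suffixes}
        if len(set(per_file.values())) != 1:
            raise ValueError(f'{split} files are misaligned: {per_file}')
        counts[split] = next(iter(per_file.values()))
    return counts
```

Calling `check_alignment('./data/processed_data')` would then either return the per-split line counts or fail loudly, naming the split whose files disagree.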


### Set the value of use_cache appropriately.

If use_cache = True, data will be downloaded where possible instead of being recomputed. However, we highly recommend setting use_cache = False for this tutorial: it is less confusing, and you will learn more by running these steps yourself. This notebook ran in approximately 8 hours on an AWS p3.8xlarge instance.

In [1]:


# # Optional: you can set what GPU you want to use in a notebook like this.  
# # Useful if you want to run concurrent experiments at the same time on different GPUs.
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="2"



In [1]:
use_cache = False

In [3]:
import sys
sys.path.append("../")

In [None]:
## Download pre-processed data if you want to run the tutorial from this step. ##
from general_utils import get_step2_prerequisite_files

if use_cache:
    get_step2_prerequisite_files(output_directory = './data/processed_data')

## Build Language Model From Docstrings

The goal is to build a language model using the docstrings, and use that language model to generate an embedding for each docstring.

In [2]:
import torch,cv2
from lang_model_utils import lm_vocab, load_lm_vocab, train_lang_model
from general_utils import save_file_pickle, load_file_pickle
import logging
from pathlib import Path
from fastai.text import * # created a symbolic link with the code : ln -s fastai/old/fastai fastai

source_path = Path('./data/processed_data/')

with open(source_path/'train.docstring', 'r') as f:
    trn_raw = f.readlines()

with open(source_path/'valid.docstring', 'r') as f:
    val_raw = f.readlines()
    
with open(source_path/'test.docstring', 'r') as f:
    test_raw = f.readlines()

Using TensorFlow backend.


In [3]:
import fastai.text

Preview what the raw data looks like: here are 10 docstrings

In [4]:
trn_raw[:10]

['docstring_tokens\n',
 'safe deleter of keys : param data : dict : param unwanted_key : str or list : return :\n',
 'no_user - anonymus any_user - loged user mentor organiser admin : param code : int : param msg : str : param access_level : : return : wrapped_function\n',
 '"go through one game , played by a minimaxagent instance"\n',
 'search the next optimal move by the iterative deepening technique\n',
 'the implementation of the minimax search with alpha - beta pruning\n',
 'evaluate the game board based on some pre - defined heuristic functions\n',
 'return the number of empty tiles on the game board\n',
 '"return an significantly large negative when the max tile is not on the desired corner , vice versa"\n',
 'perform point - wise product on the game board and a pre - defined weight matrix\n']

### Pre-process data for language model

We will use the lm_vocab class to prepare our data for the language model.

In [6]:
vocab = lm_vocab(max_vocab=50000,
                 min_freq=10)

# fit the transform on the training data, then transform
trn_flat_idx = vocab.fit_transform_flattened(trn_raw)



Look at the transformed data

In [7]:
trn_flat_idx[:10]

array([  2,  28, 826,  16,  15,   2,   1, 274,  25,  49])

In [8]:
[vocab.itos[x] for x in trn_flat_idx[:10]]

['_xbos_',
 'test',
 'clear',
 '(',
 ')',
 '_xbos_',
 '_pad_',
 'fields',
 'that',
 'are']
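
To make the roles of `max_vocab` and `min_freq` concrete, here is a simplified sketch of a frequency-thresholded vocabulary. It is an illustration only and is not the implementation of `lm_vocab`: the function names are hypothetical, tokenization is plain whitespace splitting, and the special-token order (`_unk_` at index 0) is an assumption. Only `_pad_` at index 1 and `_xbos_` at index 2 are taken from the output above.

```python
from collections import Counter

def build_vocab(docs, max_vocab=50000, min_freq=10,
                specials=('_unk_', '_pad_', '_xbos_')):
    """Map whitespace tokens to integer ids, most frequent tokens first."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    # Keep at most max_vocab tokens, and only those seen at least min_freq times.
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    itos = list(specials) + kept          # id -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> id
    return itos, stoi

def transform_flattened(docs, stoi, bos='_xbos_', unk='_unk_'):
    """Prefix each document with the begin-of-sequence id and concatenate all ids."""
    flat = []
    for doc in docs:
        flat.append(stoi[bos])
        # Tokens outside the vocabulary fall back to the unknown-token id.
        flat.extend(stoi.get(tok, stoi[unk]) for tok in doc.split())
    return flat
```

In this sketch, rare tokens are replaced by the unknown-token id, which keeps the embedding matrix small at the cost of losing those words.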

In [9]:
# apply transform to validation data
val_flat_idx = vocab.transform_flattened(val_raw)





Save file for later use

In [11]:
!mkdir data/lang_model

In [5]:
if not use_cache:
    vocab.save('./data/lang_model/vocab_v2.cls')
    save_file_pickle('./data/lang_model/trn_flat_idx_list.pkl_v2', trn_flat_idx)
    save_file_pickle('./data/lang_model/val_flat_idx_list.pkl_v2', val_flat_idx)


## Train Fast.AI Language Model

This step reads in the files created above and trains a fast.ai language model. The model learns to predict the next word in a sentence using fast.ai's implementation of AWD LSTM.

The goal of training this model is to build a general purpose feature extractor for text that can be used in downstream models. In this case, we will utilize this model to produce embeddings for function docstrings.
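
Next-word prediction on a flattened index array is usually set up by splitting the token stream into `bs` parallel columns and cutting each column into windows of about `bptt` steps, with targets equal to the inputs shifted by one position. The sketch below illustrates that idea in NumPy. It is a simplification under stated assumptions, not fast.ai's data loader (which differs in details and may, for example, vary the window length), and `batchify` and `iter_bptt` are hypothetical names.

```python
import numpy as np

def batchify(flat_idx, bs):
    """Trim the stream to a multiple of bs and lay it out as (n_steps, bs)."""
    flat_idx = np.asarray(flat_idx)
    n_steps = len(flat_idx) // bs
    # Each column holds one contiguous chunk of the original stream.
    return flat_idx[:n_steps * bs].reshape(bs, n_steps).T

def iter_bptt(data, bptt):
    """Yield (inputs, targets) windows; targets are the inputs shifted by one step."""
    for start in range(0, data.shape[0] - 1, bptt):
        seq_len = min(bptt, data.shape[0] - 1 - start)
        x = data[start:start + seq_len]
        y = data[start + 1:start + 1 + seq_len]
        yield x, y
```

With the settings used below (`bs=300`, `bptt=20`), each training step in such a scheme would see 300 parallel windows of 20 tokens.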

In [4]:
vocab = load_lm_vocab('./data/lang_model/vocab_v2.cls')
trn_flat_idx = load_file_pickle('./data/lang_model/trn_flat_idx_list.pkl_v2')
val_flat_idx = load_file_pickle('./data/lang_model/val_flat_idx_list.pkl_v2')



In [5]:
if not use_cache:
    fastai_learner, lang_model = train_lang_model(model_path = './data/lang_model_weights_v2',
                                                  trn_indexed = trn_flat_idx,
                                                  val_indexed = val_flat_idx,
                                                  vocab_size = vocab.vocab_size,
                                                  lr=3e-3,
                                                  em_sz= 500,
                                                  nh= 500,
                                                  bptt=20,
                                                  cycle_len=1,
                                                  n_cycle=3,
                                                  cycle_mult=2,
                                                  bs = 300,
                                                  wd = 1e-6)
    
elif use_cache:    
    logging.warning('Not re-training language model because use_cache=True')

HBox(children=(FloatProgress(value=0.0, description='Epoch', max=7.0, style=ProgressStyle(description_width='i…

epoch      trn_loss   val_loss                                
    0      4.331611   4.266904  
    1      4.155871   4.113194                                
    2      4.088275   4.053085                                
    3      4.042738   4.02029                                 
    4      4.024975   4.001665                                
    5      4.006923   3.988554                                
                                                              

data/lang_model_weights_v2/models/langmodel_best.h5


    6      3.990535   3.979018  



In [6]:
if not use_cache:
    fastai_learner.fit(1e-3, 3, wds=1e-6, cycle_len=2)

HBox(children=(FloatProgress(value=0.0, description='Epoch', max=6.0, style=ProgressStyle(description_width='i…

epoch      trn_loss   val_loss                                
    0      3.985027   3.971901  
    1      3.92511    3.96667                                 
    2      3.955115   3.964583                                
    3      3.930928   3.959876                                
    4      3.949297   3.961016                                
    5      3.924961   3.955268                                



In [7]:
if not use_cache:
    fastai_learner.fit(1e-3, 2, wds=1e-6, cycle_len=3, cycle_mult=2)

HBox(children=(FloatProgress(value=0.0, description='Epoch', max=9.0, style=ProgressStyle(description_width='i…

    1      3.891003   3.949817                                
    2      3.859057   3.950003                                
    3      3.936589   3.958682                                
    4      3.899377   3.955473                                
    5      3.871282   3.946378                                
    6      3.869434   3.939106                                
    7      3.810602   3.938392                                
    8      3.847875   3.940486                                



In [8]:
if not use_cache:
    fastai_learner.fit(1e-3, 2, wds=1e-6, cycle_len=3, cycle_mult=10)

HBox(children=(FloatProgress(value=0.0, description='Epoch', max=33.0, style=ProgressStyle(description_width='…

epoch      trn_loss   val_loss                                
    0      3.906639   3.948537  
    1      3.846952   3.937859                                
    2      3.818487   3.93844                                 
    3      3.890783   3.950487                                
    4      3.864053   3.952091                                
    5      3.868715   3.949238                                
    6      3.855096   3.94863                                 
    7      3.835199   3.948591                                
    8      3.836165   3.943939                                
    9      3.838879   3.940248                                
    10     3.824002   3.943235                                
    11     3.813149   3.939229                                
    12     3.819502   3.937917                                
    13     3.873574   3.930565                                
    14     3.840732   3.929672                                
    15     3.836104   
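
The epoch totals shown in the progress bars above (max=7.0, 6.0, 9.0 and 33.0) are consistent with each cycle being `cycle_mult` times as long as the previous one. The small helper below reproduces that arithmetic; `total_epochs` is a hypothetical name, and the geometric rule is inferred from those totals rather than quoted from the library.

```python
def total_epochs(n_cycle, cycle_len, cycle_mult=1):
    """Sum the lengths of n_cycle cycles, each cycle_mult times the previous one."""
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycle))

print(total_epochs(n_cycle=3, cycle_len=1, cycle_mult=2))   # 1 + 2 + 4 = 7
print(total_epochs(n_cycle=3, cycle_len=2))                 # 2 + 2 + 2 = 6
print(total_epochs(n_cycle=2, cycle_len=3, cycle_mult=2))   # 3 + 6 = 9
print(total_epochs(n_cycle=2, cycle_len=3, cycle_mult=10))  # 3 + 30 = 33
```

This also explains why the last run is so long: with `cycle_mult=10`, the second cycle alone accounts for 30 of the 33 epochs.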

Save language model and learner

In [9]:
if not use_cache:
    fastai_learner.save('lang_model_learner_v2.fai')
    lang_model_new = fastai_learner.model.eval()
    torch.save(lang_model_new, './data/lang_model/lang_model_gpu_v2.torch')
    torch.save(lang_model_new.cpu(), './data/lang_model/lang_model_cpu_v2.torch')

  "type " + obj.__name__ + ". It won't be checked "


## Load Model and Encode All Docstrings

Now that we have trained the language model, the next step is to use the language model to encode all of the docstrings into a vector.

#### Note that checkpointed versions of the language model artifacts are available for download: 


    1. lang_model_cpu_v2.torch :  https://storage.googleapis.com/kubeflow-examples/code_search/data/lang_model/lang_model_cpu_v2.torch
    2. lang_model_gpu_v2.torch : https://storage.googleapis.com/kubeflow-examples/code_search/data/lang_model/lang_model_gpu_v2.torch
    3. vocab_v2.cls : https://storage.googleapis.com/kubeflow-examples/code_search/data/lang_model/vocab_v2.cls


In [11]:
from lang_model_utils import load_lm_vocab
vocab = load_lm_vocab('./data/lang_model/vocab_v2.cls')
idx_docs = vocab.transform(trn_raw + val_raw, max_seq_len=30, padding=False)
lang_model = torch.load('./data/lang_model/lang_model_gpu_v2.torch', 
                        map_location=lambda storage, loc: storage)



In [12]:
lang_model.eval()

SequentialRNN(
  (0): RNN_Encoder(
    (encoder): Embedding(23573, 500, padding_idx=1)
    (encoder_with_dropout): EmbeddingDropout(
      (embed): Embedding(23573, 500, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDrop(
        (module): LSTM(500, 500)
      )
      (1): WeightDrop(
        (module): LSTM(500, 500)
      )
      (2): WeightDrop(
        (module): LSTM(500, 500)
      )
    )
    (dropouti): LockedDropout()
    (dropouths): ModuleList(
      (0): LockedDropout()
      (1): LockedDropout()
      (2): LockedDropout()
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=500, out_features=23573, bias=False)
    (dropout): LockedDropout()
  )
)

#### Note: 
The code below extracts embeddings one docstring at a time, which is very inefficient. Ideally, you would extract embeddings in batches while accounting for padding when extracting the hidden states. For this tutorial we provide only this minimal example; however, you are welcome to improve upon it and submit a PR!

In [13]:
def list2arr(l):
    "Convert list into pytorch Variable."
    return V(np.expand_dims(np.array(l), -1)).cpu()

def make_prediction_from_list(model, l):
    """
    Encode a list of integers that represent a sequence of tokens.  The
    purpose is to encode a sentence or phrase.

    Parameters
    -----------
    model : fastai language model
    l : list
        list of integers, representing a sequence of tokens that you want to encode

    """
    arr = list2arr(l)# turn list into pytorch Variable with bs=1
    model.reset()  # language model is stateful, so you must reset upon each prediction
    hidden_states = model(arr)[-1][-1] # RNN Hidden Layer output is last output, and only need the last layer

    #return avg-pooling, max-pooling, and last hidden state
    return hidden_states.mean(0), hidden_states.max(0)[0], hidden_states[-1]


def get_embeddings(lm_model, list_list_int):
    """
    Vectorize a list of sequences List[List[int]] using a fast.ai language model.

    Parameters
    ---------
    lm_model : fastai language model
    list_list_int : List[List[int]]
        A list of sequences to encode

    Returns
    -------
    tuple: (avg, max, last)
        A tuple containing the average-pooling and max-pooling over time steps, as well as the last time step.
    """
    n_rows = len(list_list_int)
    n_dim = lm_model[0].nhid
    avgarr = np.empty((n_rows, n_dim))
    maxarr = np.empty((n_rows, n_dim))
    lastarr = np.empty((n_rows, n_dim))

    for i in tqdm_notebook(range(len(list_list_int))):
        avg_, max_, last_ = make_prediction_from_list(lm_model, list_list_int[i])
        avgarr[i,:] = avg_.data.numpy()
        maxarr[i,:] = max_.data.numpy()
        lastarr[i,:] = last_.data.numpy()

    return avgarr, maxarr, lastarr
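
As a starting point for the batched extraction mentioned in the note above, the sketch below shows only the padding-aware pooling step in NumPy. It assumes the hidden states for a right-padded batch are already available as an array of shape (seq_len, batch, dim) and that every sequence is non-empty. `pad_batch` and `masked_pool` are hypothetical helpers, and `pad_idx=1` mirrors the `padding_idx=1` shown in the model summary. Feeding the padded batch through the language model is left out.

```python
import numpy as np

def pad_batch(seqs, pad_idx=1):
    """Right-pad variable-length id lists into a (max_len, batch) array."""
    lengths = np.array([len(s) for s in seqs])
    batch = np.full((lengths.max(), len(seqs)), pad_idx, dtype=np.int64)
    for j, s in enumerate(seqs):
        batch[:len(s), j] = s
    return batch, lengths

def masked_pool(hidden, lengths):
    """Pool (seq_len, batch, dim) hidden states while ignoring padded steps."""
    seq_len, bs, _ = hidden.shape
    # mask[t, j] is True when step t is a real token of sequence j.
    mask = np.arange(seq_len)[:, None] < lengths[None, :]
    mask3 = mask[:, :, None]
    avg = (hidden * mask3).sum(0) / lengths[:, None]
    mx = np.where(mask3, hidden, -np.inf).max(0)
    # Last real step of each sequence, not the last row of the padded batch.
    last = hidden[lengths - 1, np.arange(bs)]
    return avg, mx, last
```

Right-padding is chosen here because, for a left-to-right LSTM, the hidden states at real positions do not depend on the padding that follows them; the mask then only has to discard the trailing steps.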

In [14]:
%%time
avg_hs, max_hs, last_hs = get_embeddings(lang_model, idx_docs)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, max=1214497.0), HTML(value='')))


CPU times: user 5d 5h 21min 1s, sys: 12min 46s, total: 5d 5h 33min 47s
Wall time: 4h 41min 30s


### Do the same thing for the test set

In [15]:
idx_docs_test = vocab.transform(test_raw, max_seq_len=30, padding=False)
avg_hs_test, max_hs_test, last_hs_test = get_embeddings(lang_model, idx_docs_test)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, max=187049.0), HTML(value='')))




### Save Language Model Embeddings For Docstrings

In [17]:
!mkdir data/lang_model_emb

In [18]:
savepath = Path('./data/lang_model_emb/')
np.save(savepath/'avg_emb_dim500_v2.npy', avg_hs)
np.save(savepath/'max_emb_dim500_v2.npy', max_hs)
np.save(savepath/'last_emb_dim500_v2.npy', last_hs)

In [19]:
# save the test set embeddings also
np.save(savepath/'avg_emb_dim500_test_v2.npy', avg_hs_test)
np.save(savepath/'max_emb_dim500_test_v2.npy', max_hs_test)
np.save(savepath/'last_emb_dim500_test_v2.npy', last_hs_test)

#### Note that the embeddings saved to disk above have also been cached and are available for download: 

Train + Validation docstrings vectorized:

    1. avg_emb_dim500_v2.npy : https://storage.googleapis.com/kubeflow-examples/code_search/data/lang_model_emb/avg_emb_dim500_v2.npy
    2. max_emb_dim500_v2.npy : https://storage.googleapis.com/kubeflow-examples/code_search/data/lang_model_emb/max_emb_dim500_v2.npy
    3. last_emb_dim500_v2.npy : https://storage.googleapis.com/kubeflow-examples/code_search/data/lang_model_emb/last_emb_dim500_v2.npy

Test set docstrings vectorized:

    1. avg_emb_dim500_test_v2.npy : https://storage.googleapis.com/kubeflow-examples/code_search/data/lang_model_emb/avg_emb_dim500_test_v2.npy
    2. max_emb_dim500_test_v2.npy : https://storage.googleapis.com/kubeflow-examples/code_search/data/lang_model_emb/max_emb_dim500_test_v2.npy
    3. last_emb_dim500_test_v2.npy : https://storage.googleapis.com/kubeflow-examples/code_search/data/lang_model_emb/last_emb_dim500_test_v2.npy


### Evaluate Sentence Embeddings

One popular way of evaluating sentence embeddings is to measure their efficacy in downstream tasks such as sentiment analysis or textual similarity. Usually you can use general-purpose benchmarks such as the examples outlined here to measure the quality of your embeddings. However, since this is a very domain-specific dataset, those general-purpose benchmarks may not be appropriate. Unfortunately, we have not designed downstream tasks that we can open source at this point.

In the absence of these downstream tasks, we can at least sanity check that these embeddings contain semantic information by doing the following:

    1. Manually examine similarity between sentences, by supplying a statement and examining if the nearest phrase found is similar.

    2. Visualize the embeddings.

We will do the first approach, and leave the second approach as an exercise for the reader. It should be noted that this is only a sanity check -- a more rigorous approach is to measure the impact of these embeddings on a variety of downstream tasks and use that to form a more objective opinion about the quality of your embeddings.

Furthermore, there are many different ways of constructing a sentence embedding from the language model. For example, we can take the average, the maximum, or even the last value of the hidden states (or concatenate them all together). For simplicity, we will only evaluate the sentence embedding that is constructed by taking the average over the hidden states, and leave the other possibilities as an exercise for the reader.
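
For readers who want to try the concatenated variant, a minimal sketch is shown below. It assumes three embedding matrices of identical shape (n, dim), such as the arrays saved earlier; `concat_embeddings` is a hypothetical helper name.

```python
import numpy as np

def concat_embeddings(avg, mx, last):
    """Stack three (n, dim) embedding matrices side by side into (n, 3 * dim)."""
    return np.concatenate([avg, mx, last], axis=1)
```

With 500-dimensional hidden states this yields 1500-dimensional sentence embeddings, so any search index built on them would need to be rebuilt for the larger dimension.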

### Create search index using nmslib

nmslib is a great library for doing nearest neighbor lookups, which we will use as a search engine for finding nearest neighbors of comments in vector-space.

The convenience function create_nmslib_search_index builds this search index given a matrix of vectors as input.

In [20]:
from general_utils import create_nmslib_search_index
import nmslib
from lang_model_utils import Query2Emb
from pathlib import Path
import numpy as np
from lang_model_utils import load_lm_vocab
import torch

In [21]:
# Load matrix of vectors
loadpath = Path('./data/lang_model_emb/')
avg_emb_dim500 = np.load(loadpath/'avg_emb_dim500_test_v2.npy')

In [22]:
# Build search index (takes about an hour on a p3.8xlarge)
dim500_avg_searchindex = create_nmslib_search_index(avg_emb_dim500)

In [23]:
# save search index
dim500_avg_searchindex.saveIndex('./data/lang_model_emb/dim500_avg_searchindex.nmslib')
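
To clarify what the index computes, the sketch below performs the same kind of lookup by brute force in NumPy: it ranks every row by cosine distance, taken here as 1 minus cosine similarity. This is an exact but slow reference and not how nmslib works internally (the index loaded later in this notebook uses the approximate `hnsw` method); `cosine_knn` is a hypothetical name, and all vectors are assumed to be non-zero.

```python
import numpy as np

def cosine_knn(query, vectors, k=3):
    """Return (indices, distances) of the k rows of vectors closest to query."""
    q = np.asarray(query, dtype=np.float64).ravel()
    v = np.asarray(vectors, dtype=np.float64)
    # Cosine similarity between the query and every row.
    sims = (v @ q) / (np.linalg.norm(v, axis=1) * np.linalg.norm(q))
    dists = 1.0 - sims
    order = np.argsort(dists)[:k]  # smallest distance first
    return order, dists[order]
```

On a small random sample of the embeddings, comparing these exact neighbors with the results of the approximate index is one way to sanity check the index quality.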



Note that if you did not train your own language model and are downloading the pre-trained model artifacts instead, you can similarly download the pre-computed search index here:

https://storage.googleapis.com/kubeflow-examples/code_search/data/lang_model_emb/dim500_avg_searchindex.nmslib

After you have built this search index with nmslib, you can do fast nearest-neighbor lookups. We use the Query2Emb object to help convert strings to the embeddings:


In [24]:
dim500_avg_searchindex = nmslib.init(method='hnsw', space='cosinesimil')
dim500_avg_searchindex.loadIndex('./data/lang_model_emb/dim500_avg_searchindex.nmslib')

In [25]:
lang_model = torch.load('./data/lang_model/lang_model_cpu_v2.torch')
vocab = load_lm_vocab('./data/lang_model/vocab_v2.cls')

q2emb = Query2Emb(lang_model = lang_model.cpu(),
                  vocab = vocab)



The method Query2Emb.emb_mean lets us use the language model we trained earlier to generate a sentence embedding from a string. In the example below, emb_mean returns a numpy array of shape (1, 500).

In [26]:
query = q2emb.emb_mean('Read data into pandas dataframe')
query.shape



(1, 500)



Make search engine to inspect semantic similarity of phrases. This will take 3 inputs:

    1. nmslib_index - this is the search index we built above. This object takes a vector and will return the index of the closest vector(s) according to cosine distance.
    2. ref_data - this is the data that the index refers to; in this case, the docstrings.
    3. query2emb_func - this is a function that will convert a string into an embedding.



In [27]:
class search_engine:
    def __init__(self, 
                 nmslib_index, 
                 ref_data, 
                 query2emb_func):
        
        self.search_index = nmslib_index
        self.data = ref_data
        self.query2emb_func = query2emb_func
    
    def search(self, str_search, k=3):
        query = self.query2emb_func(str_search)
        idxs, dists = self.search_index.knnQuery(query, k=k)
        
        for idx, dist in zip(idxs, dists):
            print(f'cosine dist:{dist:.4f}\n---------------\n', self.data[idx])

In [28]:
se = search_engine(nmslib_index=dim500_avg_searchindex,
                   ref_data = test_raw,
                   query2emb_func = q2emb.emb_mean)

### Manually Inspect Phrase Similarity

Compare a user-supplied query against the vectorized docstrings of the test set. The nearest neighbors are not exact matches for the query, but they are reasonable.

In [29]:
import logging
logging.getLogger().setLevel(logging.ERROR)

In [30]:
se.search('read csv into pandas dataframe')

cosine dist:0.1519
---------------
 read and parse data in pandas dataframes .

cosine dist:0.1519
---------------
 read and parse data in pandas dataframes .

cosine dist:0.1695
---------------
 read csv into special data dictionary . example csv file :



In [31]:
se.search('train a random forest')

cosine dist:0.0602
---------------
 train a network

cosine dist:0.0675
---------------
 train a lambdamart model

cosine dist:0.0681
---------------
 train a classifier



In [32]:
se.search('download files')

cosine dist:0.1000
---------------
 download s3 files

cosine dist:0.1145
---------------
 download package .

cosine dist:0.1161
---------------
 download report locally



In [33]:
se.search('start webserver')

cosine dist:0.0909
---------------
 start service .

cosine dist:0.0969
---------------
 start the server .

cosine dist:0.0969
---------------
 start the server .



In [34]:
se.search('send out email notification')

cosine dist:0.0821
---------------
 send cia notification

cosine dist:0.0821
---------------
 send cia notification

cosine dist:0.0853
---------------
 sends email message



In [35]:
se.search('save pickle file')

cosine dist:0.1173
---------------
 save .mp3 file successfully

cosine dist:0.1245
---------------
 save chainer model

cosine dist:0.1286
---------------
 saves data structure to pickle file

