# 4 - Train Model To Map Code Embeddings to Language Embeddings.ipynb

## Prerequisites

You should have completed steps 1-3 of this tutorial before beginning this exercise. The files required for this notebook are generated by those previous steps.

This notebook takes approximately 3 hours to run on an AWS p3.8xlarge instance.

In [1]:
# # Optional: you can set what GPU you want to use in a notebook like this.  
# # Useful if you want to run concurrent experiments at the same time on different GPUs.
# import os
# os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
# os.environ["CUDA_VISIBLE_DEVICES"]="2"

In [2]:
from pathlib import Path
import numpy as np
from seq2seq_utils import extract_encoder_model, load_encoder_inputs
from keras.layers import Input, Dense, BatchNormalization, Dropout, Lambda

from keras.models import load_model, Model
from seq2seq_utils import load_text_processor

#where you will save artifacts from this step
OUTPUT_PATH = Path('./data/code2emb/')
OUTPUT_PATH.mkdir(exist_ok=True)

# These are where the artifacts are stored from steps 2 and 3, respectively.
seq2seq_path = Path('./data/seq2seq/')
langemb_path = Path('./data/lang_model_emb/')

# set seeds
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)

Using TensorFlow backend.


## Train Model That Maps Code To Sentence Embedding Space

In step 2, we trained a seq2seq model that can summarize function code using (code, docstring) pairs as the training data.

In this step, we will fine tune the encoder from the seq2seq model to generate code embeddings in the docstring space by using (code, docstring-embeddings) as the training data. Therefore, this notebook will go through the following steps:

    1. Load the seq2seq model and extract the encoder (remember seq2seq models have an encoder and a decoder).
    2. Freeze the weights of the encoder.
    3. Add some dense layers on top of the encoder.
    4. Train this new model by supplying (code, docstring-embeddings) pairs. We will call this model code2emb_model.
    5. Unfreeze the entire model, and resume training. This helps fine tune the model a little more towards this task.
    6. Encode all of the code, including code that does not contain a docstring and save that into a search index for future use.


### Load seq2seq model from Step 2 and extract the encoder

First, load the seq2seq model from Step 2, then extract the encoder (we do not need the decoder).

In [5]:
# load the pre-processed data for the encoder (we don't care about the decoder in this step)
encoder_input_data, doc_length = load_encoder_inputs(seq2seq_path/'py_t_code_vecs_v2.npy')
seq2seq_Model = load_model(str(seq2seq_path/'code_summary_seq2seq_model.h5'))

Shape of encoder input: (1214497, 55)




Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where




In [6]:
# Extract Encoder from seq2seq model
encoder_model = extract_encoder_model(seq2seq_Model)
# Get a summary of the encoder and its layers
encoder_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Encoder-Input (InputLayer)   (None, 55)                0         
_________________________________________________________________
Body-Word-Embedding (Embeddi (None, 55, 800)           16001600  
_________________________________________________________________
Encoder-Batchnorm-1 (BatchNo (None, 55, 800)           3200      
_________________________________________________________________
Encoder-Last-GRU (GRU)       [(None, 1000), (None, 100 5403000   
Total params: 21,407,800
Trainable params: 21,406,200
Non-trainable params: 1,600
_________________________________________________________________
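The parameter counts in the summary above can be reproduced by hand. The sketch below assumes the usual Keras layer formulas: an embedding stores one vector per vocabulary entry, batch normalization stores four vectors per feature (two of them not trainable), and a GRU with the classic `reset_after=False` layout has three gates, each with an input kernel, a recurrent kernel, and a bias. The vocabulary size of 20,002 is the one reported for `py_code_proc_v2.dpkl` later in this notebook.

```python
# Reproduce the encoder's parameter counts from its layer shapes.
vocab_size = 20_002   # vocabulary size reported for py_code_proc_v2.dpkl
emb_dim = 800         # Body-Word-Embedding output width
gru_units = 1_000     # Encoder-Last-GRU output width

embedding_params = vocab_size * emb_dim

# BatchNormalization: gamma and beta are trainable,
# moving mean and moving variance are not.
batchnorm_params = 4 * emb_dim
batchnorm_non_trainable = 2 * emb_dim

# GRU (assumed reset_after=False): 3 gates, each with an input kernel,
# a recurrent kernel, and a bias vector.
gru_params = 3 * (emb_dim * gru_units + gru_units * gru_units + gru_units)

total = embedding_params + batchnorm_params + gru_params
trainable = total - batchnorm_non_trainable

print(embedding_params, batchnorm_params, gru_params)
print(total, trainable, batchnorm_non_trainable)
```

If the totals printed here ever disagree with `encoder_model.summary()`, the most likely cause is a different vocabulary size or GRU variant than the ones assumed above.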




Freeze the encoder


In [7]:


# Freeze Encoder Model
for l in encoder_model.layers:
    l.trainable = False
    print(l, l.trainable)



<keras.engine.input_layer.InputLayer object at 0x7fad99750518> False
<keras.layers.embeddings.Embedding object at 0x7fad99750780> False
<keras.layers.normalization.BatchNormalization object at 0x7fad99750908> False
<keras.layers.recurrent.GRU object at 0x7fad99750940> False


### Load Docstring Embeddings From Step 3

The target for our code2emb model will be docstring-embeddings instead of docstrings. Therefore, we will use the embeddings for docstrings that we computed in step 3. For this tutorial, we will use the average over all hidden states, which is saved in the file avg_emb_dim500_v2.npy.

Note that in our experiments, a concatenation of the average, max, and last hidden state worked better than using the average alone. However, in the interest of simplicity we demonstrate just using the average hidden state. We leave it as an exercise to the reader to experiment with other approaches.
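To make the pooling options concrete, here is a minimal NumPy sketch of the average, max, and last-state summaries and their concatenation. It is an illustration only: the function name `pool_hidden_states` is hypothetical, the toy array stands in for the language model's hidden states, and padding is ignored.

```python
import numpy as np

def pool_hidden_states(hidden_states):
    """Pool a (seq_len, hidden_dim) array into fixed-size sentence vectors.

    Returns the average-pooled vector and the concatenation of the
    average, max, and last hidden state.
    """
    avg = hidden_states.mean(axis=0)
    mx = hidden_states.max(axis=0)
    last = hidden_states[-1]
    return avg, np.concatenate([avg, mx, last])

# Toy input: 4 time steps with 3 hidden units (the tutorial uses 500 units).
rng = np.random.default_rng(0)
states = rng.normal(size=(4, 3))

avg_emb, concat_emb = pool_hidden_states(states)
print(avg_emb.shape, concat_emb.shape)
```

Note that the concatenated vector is three times as wide as the average alone, so a model trained against it would need a correspondingly wider output layer.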

In [8]:
# Load Fitlam Embeddings
fastailm_emb = np.load(langemb_path/'avg_emb_dim500_v2.npy')

# check that the encoder inputs have the same number of rows as the docstring embeddings
assert encoder_input_data.shape[0] == fastailm_emb.shape[0]

fastailm_emb.shape

(1214497, 500)

### Construct code2emb Model Architecture

The code2emb model is the encoder from the seq2seq model with some dense layers added on top. The output of the last dense layer of this model needs to match the dimensionality of the docstring embedding, which is 500 in this case.

In [9]:
#### Encoder Model ####
encoder_inputs = Input(shape=(doc_length,), name='Encoder-Input')
enc_out = encoder_model(encoder_inputs)

# first dense layer with batch norm
x = Dense(500, activation='relu')(enc_out)
x = BatchNormalization(name='bn-1')(x)
out = Dense(500)(x)
code2emb_model = Model([encoder_inputs], out)

In [10]:
code2emb_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Encoder-Input (InputLayer)   (None, 55)                0         
_________________________________________________________________
Encoder-Model (Model)        (None, 1000)              21407800  
_________________________________________________________________
dense_1 (Dense)              (None, 500)               500500    
_________________________________________________________________
bn-1 (BatchNormalization)    (None, 500)               2000      
_________________________________________________________________
dense_2 (Dense)              (None, 500)               250500    
Total params: 22,160,800
Trainable params: 752,000
Non-trainable params: 21,408,800
_________________________________________________________________
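The split between trainable and non-trainable parameters in this summary follows directly from the frozen encoder. The arithmetic below is a small check, assuming the standard formulas for dense layers (weights plus biases) and batch normalization (four vectors per feature, two of them trainable).

```python
# Check the code2emb parameter counts against the summary above.
encoder_params = 21_407_800          # Encoder-Model row, entirely frozen
enc_out_dim, hidden_dim, target_dim = 1_000, 500, 500

dense_1 = enc_out_dim * hidden_dim + hidden_dim   # weights + biases
bn_1 = 4 * hidden_dim                             # gamma, beta, moving mean, moving variance
bn_1_trainable = 2 * hidden_dim                   # only gamma and beta are trained
dense_2 = hidden_dim * target_dim + target_dim    # weights + biases

total = encoder_params + dense_1 + bn_1 + dense_2
trainable = dense_1 + bn_1_trainable + dense_2
non_trainable = total - trainable

print(total, trainable, non_trainable)
```

Only about 3% of the weights are updated in the first training phase, which is why it can use a much larger batch size than the later phase that unfreezes the encoder.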


## Train the code2emb Model

The model we are training is relatively simple: two dense layers on top of the pre-trained encoder. We leave the encoder frozen at first and unfreeze it in a later step.

In [11]:
from keras.callbacks import CSVLogger, ModelCheckpoint
from keras import optimizers

code2emb_model.compile(optimizer=optimizers.Nadam(lr=0.002), loss='cosine_proximity')
script_name_base = 'code2emb_model_'
csv_logger = CSVLogger('{:}.log'.format(script_name_base))
model_checkpoint = ModelCheckpoint('{:}.epoch{{epoch:02d}}-val{{val_loss:.5f}}.hdf5'.format(script_name_base),
                                   save_best_only=True)

batch_size = 20000
epochs = 15
history = code2emb_model.fit([encoder_input_data], fastailm_emb,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.12, callbacks=[csv_logger, model_checkpoint])

Train on 1068757 samples, validate on 145740 samples
Epoch 1/15
 120000/1068757 [==>...........................] - ETA: 23:32 - loss: -0.1553

KeyboardInterrupt: 
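The negative loss in the progress bar above is expected. As far as we understand the Keras 2.x `cosine_proximity` loss, it L2-normalizes both the targets and the predictions and returns the negative of their dot product, so values move toward -1 as the predicted vectors align with the docstring embeddings. The NumPy function below is a re-implementation for illustration, not the Keras source.

```python
import numpy as np

def cosine_proximity(y_true, y_pred, eps=1e-12):
    """Negative mean cosine similarity between matching rows."""
    t = y_true / np.maximum(np.linalg.norm(y_true, axis=-1, keepdims=True), eps)
    p = y_pred / np.maximum(np.linalg.norm(y_pred, axis=-1, keepdims=True), eps)
    return float(-np.mean(np.sum(t * p, axis=-1)))

target = np.array([[1.0, 0.0], [0.0, 2.0]])

same_direction = cosine_proximity(target, 3.0 * target)   # scale does not matter
opposite = cosine_proximity(target, -target)
orthogonal = cosine_proximity(target, target[::-1])

print(same_direction, opposite, orthogonal)
```

Because the loss depends only on direction, the model is free to produce vectors of any magnitude; only the angle to the target embedding is penalized.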

### Unfreeze all Layers of Model and Resume Training


In [None]:
for l in code2emb_model.layers:
    l.trainable = True
    print(l, l.trainable)

In [None]:
code2emb_model.compile(optimizer=optimizers.Nadam(lr=0.0001), loss='cosine_proximity')
script_name_base = 'code2emb_model_unfreeze_'
csv_logger = CSVLogger('{:}.log'.format(script_name_base))
model_checkpoint = ModelCheckpoint('{:}.epoch{{epoch:02d}}-val{{val_loss:.5f}}.hdf5'.format(script_name_base),
                                   save_best_only=True)

batch_size = 2000
epochs = 20
history = code2emb_model.fit([encoder_input_data], fastailm_emb,
          batch_size=batch_size,
          epochs=epochs,
          initial_epoch=16,
          validation_split=0.12, callbacks=[csv_logger, model_checkpoint])
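It can help to work out how much training the two `fit` calls actually request. In Keras, `epochs` is the index of the final epoch rather than a count, so the number of epochs run is `epochs - initial_epoch`. The sketch below also reproduces the train/validation sizes from the log, assuming the split point is computed as `int(n * (1 - validation_split))`.

```python
import math

n_samples = 1_214_497
validation_split = 0.12

# Assumed split rule; it reproduces the sizes printed in the training log.
n_train = int(n_samples * (1.0 - validation_split))
n_val = n_samples - n_train

def steps_per_epoch(n, batch_size):
    """Number of batches needed to cover n samples once."""
    return math.ceil(n / batch_size)

frozen_steps = steps_per_epoch(n_train, 20_000)    # first phase, batch_size=20000
unfrozen_steps = steps_per_epoch(n_train, 2_000)   # second phase, batch_size=2000

frozen_epochs_run = 15 - 0      # epochs=15, initial_epoch defaults to 0
unfrozen_epochs_run = 20 - 16   # epochs=20, initial_epoch=16

print(n_train, n_val)
print(frozen_steps, unfrozen_steps)
print(frozen_epochs_run, unfrozen_epochs_run)
```

With the values in the cell above, the unfrozen phase therefore runs the epochs numbered 17 through 20, each with roughly ten times as many weight updates as an epoch of the frozen phase.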

### Save code2emb model

In [None]:
# Convert the Path to a string, as the other load/save calls in this notebook do
code2emb_model.save(str(OUTPUT_PATH/'code2emb_model.hdf5'))



This file has been cached and is also available for download here:

code2emb_model.hdf5: https://storage.googleapis.com/kubeflow-examples/code_search/data/code2emb/code2emb_model.hdf5


In [12]:
import wget

In [13]:
url = "https://storage.googleapis.com/kubeflow-examples/code_search/data/code2emb/code2emb_model.hdf5"

In [16]:
filename=wget.download(url,out=str(OUTPUT_PATH))

In [17]:
from keras.models import Model, load_model

In [18]:
code2emb_model = load_model(filename)

In [19]:
code2emb_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Encoder-Input (InputLayer)   (None, 55)                0         
_________________________________________________________________
Encoder-Model (Model)        (None, 1000)              21407800  
_________________________________________________________________
dense_1 (Dense)              (None, 500)               500500    
_________________________________________________________________
bn-1 (BatchNormalization)    (None, 500)               2000      
_________________________________________________________________
dense_2 (Dense)              (None, 500)               250500    
Total params: 22,160,800
Trainable params: 752,000
Non-trainable params: 21,408,800
_________________________________________________________________


## Vectorize all of the code without docstrings

We want to vectorize all of the code without docstrings so we can test the efficacy of the search on the code that was never seen by the model.

In [20]:
from keras.models import load_model
from pathlib import Path
import numpy as np
from seq2seq_utils import load_text_processor
code2emb_path = Path('./data/code2emb/')
seq2seq_path = Path('./data/seq2seq/')
data_path = Path('./data/processed_data/')

In [22]:
code2emb_model = load_model(str(code2emb_path/'code2emb_model.hdf5'))
num_encoder_tokens, enc_pp = load_text_processor(str(seq2seq_path/'py_code_proc_v2.dpkl'))

with open(data_path/'without_docstrings.function', 'r') as f:
    no_docstring_funcs = f.readlines()



Size of vocabulary for data/seq2seq/py_code_proc_v2.dpkl: 20,002


### Pre-process code without docstrings for input into code2emb model

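We use the same text processor (`enc_pp`, loaded above from `py_code_proc_v2.dpkl`) that was used to prepare the inputs for the original seq2seq model, so that token ids and sequence length match what the encoder expects.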

In [23]:
# tokenized functions that did not contain docstrings
no_docstring_funcs[:5]

['function_tokens\n',
 'def __init__ self leafs edges self edges edges self leafs sorted leafs\n',
 'def __eq__ self other if isinstance other Node return id self id other or self leafs other leafs and self edges other edges else return False\n',
 'def __repr__ self return Node leafs edges format self leafs self edges\n',
 'staticmethod def _isCapitalized token return len token 1 and token isalpha and token 0 isupper and token 1 islower\n']

In [24]:
encinp = enc_pp.transform_parallel(no_docstring_funcs)
np.save(code2emb_path/'nodoc_encinp.npy', encinp)
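For intuition, the transformation applied here turns each whitespace-separated token string into a fixed-length row of integer ids using a vocabulary fitted earlier. The toy sketch below is not the text processor's implementation: the reserved ids, post-padding, and tail truncation are all assumptions chosen for illustration, and `fit_vocab` and `transform` are hypothetical helpers.

```python
from collections import Counter

PAD_ID, UNK_ID = 0, 1  # assumed reserved ids for this toy example

def fit_vocab(docs, max_words):
    """Assign ids (starting at 2) to the most frequent tokens."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_words))}

def transform(docs, vocab, length):
    """Map tokens to ids, truncate to `length`, and pad on the right."""
    rows = []
    for doc in docs:
        ids = [vocab.get(tok, UNK_ID) for tok in doc.split()][:length]
        rows.append(ids + [PAD_ID] * (length - len(ids)))
    return rows

# The vocabulary is fitted once on the training code...
train_docs = ["def add self x y return x y", "def __repr__ self return name"]
vocab = fit_vocab(train_docs, max_words=50)

# ...and reused unchanged on new code, so unseen tokens map to UNK_ID.
encoded = transform(["def mul self x z return x z\n"], vocab, length=10)
print(encoded)
```

The key point is that the vocabulary is never refitted on the new functions; reusing it is what keeps the ids consistent with the embedding layer inside the encoder.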



### Extract code vectors

In [26]:
from keras.models import load_model
from pathlib import Path
import numpy as np
code2emb_path = Path('./data/code2emb/')
encinp = np.load(code2emb_path/'nodoc_encinp.npy')
code2emb_model = load_model(str(code2emb_path/'code2emb_model.hdf5'))



Use the code2emb model to map the code into the same vector space as natural language


In [27]:
nodoc_vecs = code2emb_model.predict(encinp, batch_size=20000)

KeyboardInterrupt: 

In [None]:
np.save(code2emb_path/'nodoc_vecs.npy', nodoc_vecs)
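Once the code vectors live in the same space as sentence embeddings, searching amounts to finding the code vectors closest to a query embedding. The sketch below shows a brute-force cosine-similarity lookup on random toy data; `top_k_cosine` is a hypothetical helper, and at the scale of this dataset an approximate nearest-neighbor index would usually be preferable to a full matrix product.

```python
import numpy as np

def top_k_cosine(query_vec, code_vecs, k=3):
    """Return indices and similarities of the k rows closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    scores = m @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy stand-ins: 100 random "code vectors" with the tutorial's width of 500.
rng = np.random.default_rng(0)
toy_vecs = rng.normal(size=(100, 500)).astype(np.float32)

# A query that is a slightly perturbed copy of row 42.
query = toy_vecs[42] + 0.01 * rng.normal(size=500).astype(np.float32)

idx, sims = top_k_cosine(query, toy_vecs)
print(idx, sims)
```

In real use, the query would be the language model's embedding of a natural-language search phrase, and the returned indices would be mapped back to the corresponding functions.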

## Cached Files

You can find the files that were created in this notebook below. Please note that if you use one of these files, you should proceed with extreme caution. If you are skipping a step, we recommend using all of the cached files, because using only some of them could result in discrepancies between your models or data and our pre-computed results.

    1. code2emb_model.hdf5: https://storage.googleapis.com/kubeflow-examples/code_search/data/code2emb/code2emb_model.hdf5
    2. nodoc_encinp.npy: https://storage.googleapis.com/kubeflow-examples/code_search/data/code2emb/nodoc_encinp.npy
    3. nodoc_vecs.npy: https://storage.googleapis.com/kubeflow-examples/code_search/data/code2emb/nodoc_vecs.npy


In [None]:
url = "https://storage.googleapis.com/kubeflow-examples/code_search/data/code2emb/nodoc_vecs.npy"

In [28]:
wget.download("https://storage.googleapis.com/kubeflow-examples/code_search/data/code2emb/nodoc_vecs.npy",out=str(code2emb_path))

KeyboardInterrupt: 

In [31]:
!wget https://storage.googleapis.com/kubeflow-examples/code_search/data/code2emb/nodoc_vecs.npy -P ./data/code2emb/

--2020-04-06 01:49:21--  https://storage.googleapis.com/kubeflow-examples/code_search/data/code2emb/nodoc_vecs.npy
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.167.176, 2404:6800:4009:810::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.167.176|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8017436128 (7.5G) [application/octet-stream]
Saving to: ‘./data/code2emb/nodoc_vecs.npy’


2020-04-06 02:05:56 (7.69 MB/s) - ‘./data/code2emb/nodoc_vecs.npy’ saved [8017436128/8017436128]
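The download size gives a rough sanity check on the cached file. The arithmetic below assumes the array is stored as float32 with 500 columns and a 128-byte `.npy` header; none of these is confirmed by the log, so treat the row count as an estimate to compare against `nodoc_vecs.shape` after loading.

```python
# Estimate how many 500-dimensional vectors the downloaded file holds.
file_bytes = 8_017_436_128   # "Length" reported by wget above
header_bytes = 128           # assumed .npy header size
emb_dim = 500                # width of the docstring embedding space
bytes_per_value = 4          # assumed float32 storage

payload = file_bytes - header_bytes
rows, remainder = divmod(payload, emb_dim * bytes_per_value)

print(rows, remainder)
print(round(file_bytes / 2**30, 2), "GiB")
```

A remainder of zero is consistent with the assumptions, and it also indicates how much memory is needed to load the array in one piece.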

