# Step 2: The code encoder 

## Prerequisite: Make sure you have the right files prepared from Step 1

You should have these files in the root of the ./data/processed_data/ directory:

1.    {train,valid,test}.function - Python function definitions, tokenized (space-separated), one function per line.
2.    {train,valid,test}.docstring - the docstrings for those function definitions. Each line corresponds 1:1 with the same line in the matching *.function file.
3.    {train,valid,test}.lineage - each line links back to the original location (GitHub repo link) the code was retrieved from. Lines correspond 1:1 with the other two files, which is useful for debugging.
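The 1:1 correspondence between the three files can be checked before going further. The sketch below is a minimal, standard-library-only check; `count_lines` and `check_alignment` are hypothetical helper names (not part of general_utils), and the demo writes two toy rows to a temporary directory rather than reading the real data.

```python
import tempfile
from pathlib import Path

EXTENSIONS = ('function', 'docstring', 'lineage')


def count_lines(path):
    """Count the lines in a text file without loading it into memory."""
    with open(path, encoding='utf-8') as f:
        return sum(1 for _ in f)


def check_alignment(data_dir, split):
    """Return the shared line count of <split>.function/.docstring/.lineage.

    Raises ValueError if the three files do not have the same number of lines.
    """
    data_dir = Path(data_dir)
    counts = {ext: count_lines(data_dir / f'{split}.{ext}') for ext in EXTENSIONS}
    if len(set(counts.values())) != 1:
        raise ValueError(f'{split} files are misaligned: {counts}')
    return counts['function']


# Demo on toy files (hypothetical contents), so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for ext in EXTENSIONS:
        Path(tmp, f'train.{ext}').write_text('row one\nrow two\n', encoding='utf-8')
    n_rows = check_alignment(tmp, 'train')
```

On the real data, the same call would be `check_alignment('./data/processed_data', 'train')`, repeated for `valid` and `test`.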


## Set the value of use_cache appropriately

If use_cache = True, data will be downloaded where possible instead of being recomputed. However, it is highly recommended that you set use_cache = False.

In [14]:
use_cache = False

In [16]:
# # Optional: you can set what GPU you want to use in a notebook like this.  
# # Useful if you want to run concurrent experiments at the same time on different GPUs.
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="3"

In [3]:
!mkdir -p ./data/seq2seq

In [3]:


# get_step2_prerequisite_files and get_file download cached artifacts when use_cache = True
from pathlib import Path
from general_utils import get_step2_prerequisite_files, read_training_files
from keras.utils import get_file
OUTPUT_PATH = Path('./data/seq2seq/')
# parents=True also creates ./data if it does not exist yet
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)



## Read Text From File

In [4]:
if use_cache:
    get_step2_prerequisite_files(output_directory = './data/processed_data')

# you want to supply the directory where the files are from step 1.
train_code, holdout_code, train_comment, holdout_comment = read_training_files('./data/processed_data/')



In [5]:
# code and comment files should be of the same length.

assert len(train_code) == len(train_comment)
assert len(holdout_code) == len(holdout_comment)

## Tokenize Text

In [6]:
import logging
from ktext.preprocess import processor

if not use_cache:
    # Fit the code vocabulary (keep_n=20000) and vectorize the functions
    code_proc = processor(heuristic_pct_padding=.7, keep_n=20000)
    t_code = code_proc.fit_transform(train_code)

    # Fit the docstring vocabulary (keep_n=14000); pad at the end of each sequence
    comment_proc = processor(append_indicators=True, heuristic_pct_padding=.7, keep_n=14000, padding ='post')
    t_comment = comment_proc.fit_transform(train_comment)

else:
    logging.warning('Not fitting transform function because use_cache=True')

 See full histogram by insepecting the `document_length_stats` attribute.
 See full histogram by insepecting the `document_length_stats` attribute.
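To make the role of `keep_n` and `padding` more concrete, here is a toy stand-in for the fit/transform step. It is not ktext's implementation: `fit_vocab`, `transform`, and the reserved ids 0 (padding) and 1 (unknown token) are assumptions made for illustration, although two reserved ids would be consistent with the vocabulary sizes of 20,002 and 14,002 reported later in this notebook.

```python
from collections import Counter

PAD_ID, UNK_ID = 0, 1  # assumed reserved ids; the real processor may differ


def fit_vocab(docs, keep_n):
    """Map the keep_n most frequent tokens to integer ids, starting at 2."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(keep_n))}


def transform(docs, vocab, max_len, padding='pre'):
    """Turn each document into a fixed-length list of token ids."""
    rows = []
    for doc in docs:
        # Unknown tokens map to UNK_ID; sequences longer than max_len are cut
        ids = [vocab.get(tok, UNK_ID) for tok in doc.split()][:max_len]
        pad = [PAD_ID] * (max_len - len(ids))
        rows.append(ids + pad if padding == 'post' else pad + ids)
    return rows


docs = ["def add a b return a b", "def noop self pass"]
vocab = fit_vocab(docs, keep_n=4)
vectors = transform(docs, vocab, max_len=6, padding='post')
vocab_size = len(vocab) + 2  # kept tokens plus the two reserved ids
```

With `padding='post'`, the zeros land at the end of short sequences, which is the setting used for the docstrings above.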


####  Save tokenized text (You will reuse this for step 4)


In [11]:
import dill as dpickle
import numpy as np

if not use_cache:
    # Save the preprocessor
    with open(OUTPUT_PATH/'py_code_proc_v2.dpkl', 'wb') as f:
        dpickle.dump(code_proc, f)

    with open(OUTPUT_PATH/'py_comment_proc_v2.dpkl', 'wb') as f:
        dpickle.dump(comment_proc, f)

    # Save the processed data
    np.save(OUTPUT_PATH/'py_t_code_vecs_v2.npy', t_code)
    np.save(OUTPUT_PATH/'py_t_comment_vecs_v2.npy', t_comment)



### Arrange data for modeling


In [8]:
%reload_ext autoreload
%autoreload 3
from seq2seq_utils import load_decoder_inputs, load_encoder_inputs, load_text_processor


encoder_input_data, encoder_seq_len = load_encoder_inputs(OUTPUT_PATH/'py_t_code_vecs_v2.npy')
decoder_input_data, decoder_target_data = load_decoder_inputs(OUTPUT_PATH/'py_t_comment_vecs_v2.npy')
num_encoder_tokens, enc_pp = load_text_processor(OUTPUT_PATH/'py_code_proc_v2.dpkl')
num_decoder_tokens, dec_pp = load_text_processor(OUTPUT_PATH/'py_comment_proc_v2.dpkl')

Shape of encoder input: (1214497, 55)
Shape of decoder input: (1214497, 14)
Shape of decoder target: (1214497, 14)
Size of vocabulary for data/seq2seq/py_code_proc_v2.dpkl: 20,002
Size of vocabulary for data/seq2seq/py_comment_proc_v2.dpkl: 14,002
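The decoder input and decoder target have the same shape because, in the usual teacher-forcing arrangement, both are slices of the same docstring vectors, offset by one time step. The sketch below assumes load_decoder_inputs follows that arrangement (its source is not shown here); `split_decoder_sequences` and the toy token ids are hypothetical.

```python
def split_decoder_sequences(vectors):
    """Split padded target sequences into (decoder_input, decoder_target).

    The input drops the last position; the target drops the first, so the
    target at step t is the token the decoder should emit after seeing the
    input up to step t.
    """
    decoder_input = [row[:-1] for row in vectors]
    decoder_target = [row[1:] for row in vectors]
    return decoder_input, decoder_target


# Toy ids (hypothetical): 2 = _start_, 3 = _end_, 0 = padding
vectors = [[2, 7, 8, 3, 0],
           [2, 9, 3, 0, 0]]
dec_in, dec_tgt = split_decoder_sequences(vectors)
```

Under this assumption, stored docstring vectors of width 15 would yield the (1214497, 14) shapes printed above.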


## Build Seq2Seq Model For Summarizing Code

We will build a model to predict the docstring given a function or a method. While this is a very cool task in itself, this is not the end goal of this exercise. The motivation for training this model is to learn a general purpose feature extractor for code that we can use for the task of code search.

In [18]:
from seq2seq_utils import build_seq2seq_model



The convenience function build_seq2seq_model constructs the architecture for a sequence-to-sequence model.

The architecture built for this tutorial is a minimal example with only one layer for the encoder and decoder, and does not include things like attention. We encourage you to try and build different architectures to see what works best for you!


In [19]:
seq2seq_Model = build_seq2seq_model(word_emb_dim=800,
                                    hidden_state_dim=1000,
                                    encoder_seq_len=encoder_seq_len,
                                    num_encoder_tokens=num_encoder_tokens,
                                    num_decoder_tokens=num_decoder_tokens)

In [20]:
seq2seq_Model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Decoder-Input (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
Decoder-Word-Embedding (Embeddi (None, None, 800)    11201600    Decoder-Input[0][0]              
__________________________________________________________________________________________________
Encoder-Input (InputLayer)      (None, 55)           0                                            
__________________________________________________________________________________________________
Decoder-Batchnorm-1 (BatchNorma (None, None, 800)    3200        Decoder-Word-Embedding[0][0]     
__________________________________________________________________________________________________
Encoder-Mo
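The parameter counts in the visible part of the summary can be checked by hand. An embedding layer stores one vector per vocabulary entry, and Keras counts four vectors per feature for batch normalization (gamma, beta, moving mean, moving variance). The values below are taken from the outputs above.

```python
num_decoder_tokens = 14002  # vocabulary size reported for the comment processor
word_emb_dim = 800          # word_emb_dim passed to build_seq2seq_model

# Decoder-Word-Embedding: one 800-dimensional vector per token
embedding_params = num_decoder_tokens * word_emb_dim

# Decoder-Batchnorm-1: gamma, beta, moving mean, moving variance per feature
batchnorm_params = 4 * word_emb_dim

print(embedding_params, batchnorm_params)
```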

### Train Seq2Seq Model


In [21]:
from keras.models import Model, load_model
import pandas as pd
import logging

if not use_cache:

    from keras.callbacks import CSVLogger, ModelCheckpoint
    import numpy as np
    from keras import optimizers

    seq2seq_Model.compile(optimizer=optimizers.Nadam(lr=0.00005), loss='sparse_categorical_crossentropy')

    script_name_base = 'py_func_sum_v9_'
    csv_logger = CSVLogger('{:}.log'.format(script_name_base))

    model_checkpoint = ModelCheckpoint('{:}.epoch{{epoch:02d}}-val{{val_loss:.5f}}.hdf5'.format(script_name_base),
                                       save_best_only=True)

    batch_size = 1100
    epochs = 16
    history = seq2seq_Model.fit([encoder_input_data, decoder_input_data], np.expand_dims(decoder_target_data, -1),
              batch_size=batch_size,
              epochs=epochs,
              validation_split=0.12, callbacks=[csv_logger, model_checkpoint])
    


Train on 1068757 samples, validate on 145740 samples
Epoch 1/16
  29700/1068757 [..............................] - ETA: 1:50:24 - loss: 8.2796

KeyboardInterrupt: 
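The sample counts in the training log follow from validation_split=0.12. Keras holds out the last fraction of the arrays for validation, before shuffling. The rounding rule used below (truncating the training count) is an assumption about the Keras version used here, but it reproduces the numbers in the log.

```python
import math

n_samples = 1214497       # rows in encoder_input_data
validation_split = 0.12
batch_size = 1100

# Assumed rule: the first int(n * (1 - split)) rows are used for training
n_train = int(n_samples * (1.0 - validation_split))
n_val = n_samples - n_train

# Number of gradient updates per epoch at this batch size
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```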

In [24]:
!wget https://storage.googleapis.com/kubeflow-examples/code_search/data/seq2seq/py_func_sum_v9_.epoch16-val2.55276.hdf5

--2020-04-04 09:01:19--  https://storage.googleapis.com/kubeflow-examples/code_search/data/seq2seq/py_func_sum_v9_.epoch16-val2.55276.hdf5
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.167.176, 2404:6800:4009:810::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.167.176|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 624441080 (596M) [application/octet-stream]
Saving to: ‘py_func_sum_v9_.epoch16-val2.55276.hdf5’


2020-04-04 09:01:47 (21.6 MB/s) - ‘py_func_sum_v9_.epoch16-val2.55276.hdf5’ saved [624441080/624441080]



In [22]:
use_cache = True

In [23]:
if use_cache:
    logging.warning('Not re-training function summarizer seq2seq model because use_cache=True')
    # Load model from url
    loc = get_file(fname='py_func_sum_v9_.epoch16-val2.55276.hdf5',
                   origin='https://storage.googleapis.com/kubeflow-examples/code_search/data/seq2seq/py_func_sum_v9_.epoch16-val2.55276.hdf5')
    seq2seq_Model = load_model(loc)
    
    # Load encoder (code) pre-processor from url
    loc = get_file(fname='py_code_proc_v2.dpkl',
                   origin='https://storage.googleapis.com/kubeflow-examples/code_search/data/seq2seq/py_code_proc_v2.dpkl')
    num_encoder_tokens, enc_pp = load_text_processor(loc)
    
    # Load decoder (docstrings/comments) pre-processor from url
    loc = get_file(fname='py_comment_proc_v2.dpkl',
                   origin='https://storage.googleapis.com/kubeflow-examples/code_search/data/seq2seq/py_comment_proc_v2.dpkl')
    num_decoder_tokens, dec_pp = load_text_processor(loc)



Downloading data from https://storage.googleapis.com/kubeflow-examples/code_search/data/seq2seq/py_func_sum_v9_.epoch16-val2.55276.hdf5
Downloading data from https://storage.googleapis.com/kubeflow-examples/code_search/data/seq2seq/py_code_proc_v2.dpkl
Size of vocabulary for /home/ritesh/.keras/datasets/py_code_proc_v2.dpkl: 20,002
Downloading data from https://storage.googleapis.com/kubeflow-examples/code_search/data/seq2seq/py_comment_proc_v2.dpkl
Size of vocabulary for /home/ritesh/.keras/datasets/py_comment_proc_v2.dpkl: 14,002




Note that the above procedure will automatically download a pre-trained model and associated artifacts from https://storage.googleapis.com/kubeflow-examples/code_search/data/seq2seq/ if use_cache = True.

Otherwise, the training code above writes a checkpoint to the current directory, with the prefix py_func_sum_v9_, after each epoch in which the validation loss improves (save_best_only=True).
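The checkpoint file name is built in two formatting passes: the notebook fills in the prefix first, and the doubled braces survive as placeholders that ModelCheckpoint fills in at save time. The sketch below reproduces both passes with plain `str.format`; the epoch and validation loss are the ones in the pre-trained file name.

```python
script_name_base = 'py_func_sum_v9_'

# First pass (done in the notebook): '{{' and '}}' become literal braces
template = '{:}.epoch{{epoch:02d}}-val{{val_loss:.5f}}.hdf5'.format(script_name_base)

# Second pass (done by the checkpoint callback at the end of an epoch)
filename = template.format(epoch=16, val_loss=2.55276)

print(template)
print(filename)
```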


## Evaluate Seq2Seq Model For Code Summarization

To evaluate this model we are going to do two things:


   1. Manually inspect the results of predicted docstrings for code snippets, to make sure they look sensible.
   2. Calculate the BLEU score so that we can quantitatively benchmark different iterations of this algorithm and guide hyper-parameter tuning.
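For intuition about the second item, here is a minimal sentence-level BLEU sketch using only the standard library. It is a simplification for illustration, not the evaluation code used by this tutorial: it scores one candidate against one reference, uses n-grams up to length 2, and applies no smoothing. `ngrams`, `modified_precision`, and `sentence_bleu` are hypothetical helper names.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Fraction of candidate n-grams found in the reference, with clipping."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def sentence_bleu(reference, candidate, max_n=2):
    """Geometric mean of n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Candidates no longer than the reference are penalized for brevity
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


reference = "test rendering solid white image".split()
candidate = "test white image".split()
score = sentence_bleu(reference, candidate)
```

A benchmark would normally aggregate over the whole holdout set (corpus-level BLEU) and use longer n-grams; this sketch only shows how overlap and length enter the score.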


### Manually Inspect Results (on holdout set)

In [25]:
from seq2seq_utils import Seq2Seq_Inference
import pandas as pd

seq2seq_inf = Seq2Seq_Inference(encoder_preprocessor=enc_pp,
                                 decoder_preprocessor=dec_pp,
                                 seq2seq_model=seq2seq_Model)

demo_testdf = pd.DataFrame({'code':holdout_code, 'comment':holdout_comment, 'ref':''})
seq2seq_inf.demo_model_predictions(n=15, df=demo_testdf)




Original Input:
 def test_image_white for img_format in png jpg gif _load_and_check_img canvas_white img_format 1 1 b x00 b x00
 

Original Output:
 test rendering solid white image


****** Predicted Output ******:
 test white image



Original Input:
 def __div__ self v return Coordinates self x v self y v self z v self e v
 

Original Output:
 @rtype : coordinates


****** Predicted Output ******:
 returns the coordinates of the point in the coordinates of the point



Original Input:
 def appendXml self r data json dumps self __nj e etree SubElement r numjobs e text data data json dumps self __iod e etree SubElement r iodepth e text data data json dumps self __runtime e etree SubElement r runtime e text data if self __xargs None data json dumps list self __xargs e etree SubElement r xargs e text data
 

Original Output:
 append the information about options to a xml node . @param root the xml root tag to append the new elements to


****** Predicted Output ******:
 dumps an obje

In [26]:
demo_testdf

Unnamed: 0,code,comment,ref
0,function_tokens\n,docstring_tokens\n,
1,def getall self key default _marker identity s...,return a list of all values matching the key .\n,
2,def getone self key default _marker identity s...,get first value matching the key .\n,
3,def get self key default None return self geto...,get first value matching the key .\n,
4,def keys self return _KeysView self _impl\n,return a new view of the dictionary 's keys .\n,
...,...,...,...
187044,def count_additional_facts_unresolved conversa...,: param conversation : the current conversatio...,
187045,def extract_fact_by_type fact_type intent enti...,returns the relevant information for a particu...,
187046,def extract_month_from_duration extracted_enti...,"""takes a ner_duckling entity duration classifi...",
187047,def is_sufficient self classify_dict if len cl...,"""method which verifies the accuracy of the cla...",


In [27]:
holdout_code

['function_tokens\n',
 'def getall self key default _marker identity self _title key res v for i k v in self _impl _items if i identity if res return res if not res and default is not _marker return default raise KeyError Key not found r key\n',
 'def getone self key default _marker identity self _title key for i k v in self _impl _items if i identity return v if default is not _marker return default raise KeyError Key not found r key\n',
 'def get self key default None return self getone key default\n',
 'def keys self return _KeysView self _impl\n',
 'def items self return _ItemsView self _impl\n',
 'def values self return _ValuesView self _impl\n',
 'def copy self return MultiDict self items\n',
 'def copy self return CIMultiDict self items\n',
 'def copy self cls self __class__ return cls self items\n',
 'def extend self args kwargs self _extend args kwargs extend self _extend_items\n',
 'def clear self self _impl _items clear self _impl incr_version\n',
 'def setdefault self key d

In [29]:
seq2seq_inf.predict("def set_Forum self value super ListUsersInputSet self _set_input Forum value")[1]

'set the value of the forum input for this choreo required string forum short short'

In [30]:
# Passing a pathlib.Path here raised "TypeError: Required Group, str or dict",
# so convert the path to str before saving.
seq2seq_Model.save(str(OUTPUT_PATH/'code_summary_seq2seq_model.h5'))

In [31]:
seq2seq_Model = load_model('./py_func_sum_v9_.epoch16-val2.55276.hdf5')

In [32]:
seq2seq_inf = Seq2Seq_Inference(encoder_preprocessor=enc_pp,
                                 decoder_preprocessor=dec_pp,
                                 seq2seq_model=seq2seq_Model)

demo_testdf = pd.DataFrame({'code':holdout_code, 'comment':holdout_comment, 'ref':''})
seq2seq_inf.demo_model_predictions(n=15, df=demo_testdf)




Original Input:
 def is_valid email
 

Original Output:
 check if an email address if valid .


****** Predicted Output ******:
 check if the email is valid



Original Input:
 

Original Output:
 convert phrasedml string to sedml .


****** Predicted Output ******:
 return a string with the given c code



Original Input:
 property def process self if hasattr self _process return self _process else self _process self _get_process return self _process
 

Original Output:
 "store the actual process in _ process . if it does n't exist yet , create it ."


****** Predicted Output ******:
 get the process s process



Original Input:
 def test_eigh_build self level rlevel rvals 68 60568999 89 57756725 106 67185574 cov array 77 70273908 3 51489954 15 64602427 3 51489954 88 97013878 1 07431931 15 64602427 1 07431931 98 18223512 vals vecs linalg eigh cov assert_array_almost_equal vals rvals
 

Original Output:
 ticket 662 .


****** Predicted Output ******:
 ticket number



Original Input


****** Predicted Output ******:
 convert to a unit vector


In [33]:
seq2seq_inf.predict("def set_Forum self value super ListUsersInputSet self _set_input Forum value")[1]

'set the value of the forum input for this choreo required string forum short short'