
OpenVocabNLMs

Contains the code for our ICSE 2020 submission: an open vocabulary language model for source code that uses the byte pair encoding (BPE) algorithm to learn a segmentation of code tokens into subtokens. See also our paper Maybe Deep Neural Networks are the Best Choice for Modeling Source Code (https://arxiv.org/abs/1903.05734); this is the first open vocabulary language model for code that uses BPE to segment code tokens into subword units.
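
As a rough illustration of what BPE segmentation does, the sketch below applies a tiny, hypothetical merge list to one token; in the real setup the merge operations (here 10000 of them) are learned from the training corpus, and subword boundaries are also marked, which the sketch omits.

# Illustrative sketch of BPE-style subword segmentation (not the project's actual encoder).
def bpe_segment(token, merges):
    """Greedily apply learned merges (lowest rank first) to split a token into subword units."""
    symbols = list(token)
    while True:
        # Find the adjacent symbol pair with the best (lowest) merge rank.
        candidates = [(merges[(a, b)], i)
                      for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
                      if (a, b) in merges]
        if not candidates:
            break
        _, i = min(candidates)
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]
    return symbols

# Hypothetical merge ranks; a trained encoding would contain ~10000 learned merges.
merges = {("g", "e"): 0, ("ge", "t"): 1, ("N", "a"): 2, ("Na", "m"): 3, ("Nam", "e"): 4}
print(bpe_segment("getName", merges))  # -> ['get', 'Name']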

Code Structure

non-ascii_sequences_to_unk.py is a preprocessing script that can be used to remove non-ascii sequences from the data and replace them with a special symbol.

create_subtoken_data.py is also a preprocessing script that can be used to subtokenize data based on the heuristic of Allamanis et al. (2015).

reader.py contains utility functions for reading data and providing batches for training and testing of models.

code_nlm.py contains the implementation of our NLM for code and supports training, perplexity/cross-entropy calculation, and code-completion simulation, as well as dynamic versions of the test scenarios. The updated implementation also has some features that were not previously present in the code: measuring identifier-specific performance for code completion, and a simple n-gram cache for identifiers that better simulates use of the model in an IDE, where such information would be available. To use the identifier features, a file containing identifier information must be provided through the options.
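
For intuition, the subtoken splitting heuristic breaks identifiers on camelCase and underscore boundaries, roughly as in the sketch below; create_subtoken_data.py is the authoritative implementation and this is only an approximation of it.

# Rough approximation of identifier subtokenization on camelCase/underscore boundaries.
import re

def split_identifier(identifier):
    """Split an identifier such as 'parseHTTPResponse_v2' into lowercased subtokens."""
    subtokens = []
    for part in identifier.split("_"):
        # Split on camelCase and acronym boundaries, keeping digit runs separate.
        subtokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [s.lower() for s in subtokens if s]

print(split_identifier("parseHTTPResponse_v2"))  # -> ['parse', 'http', 'response', 'v', '2']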

Installation

Python==3.6 is required!

git clone https://github.com/mast-group/OpenVocabCodeNLM
cd OpenVocabCodeNLM
pip3 install -r requirements.txt

Usage Instructions

If you want to try the implementation, unzip the directory containing the sample data. The sample data contains the small training set, validation set, and test set used in the paper, with a BPE encoding size of 10000.

Option Constants

Let's first define constants pointing to the data and the network parameters. You'll need to modify these to point to your own data and to the hyperparameter values that you want to use.

# Directory that contains train/validation/test data etc.
DATA_HOME=sample_data/java/
# Directory in which the model will be saved.
MODEL_DIR=sample_data/java/model
mkdir $MODEL_DIR

# Filenames
TRAIN_FILE=java_training_slp_pre_enc_bpe_10000
VALIDATION_FILE=java_validation_slp_pre_enc_bpe_10000
TEST_FILE=java_test_slp_pre_enc_bpe_10000
TEST_PROJ_NAMES_FILE=testProjects
ID_MAP_FILE=sample_data/java/id_map_java_test_slp_pre_bpe_10000

# Maximum training epochs
EPOCHS=5 # Normally this would be larger. For instance 30-50
# Initial learning rate
LR=0.1 # This is the default value. You can skip it if you don't want to change it.
# Training batch size
BATCH_SIZE=32 # This is also the default.
# RNN unroll timesteps for gradient calculation.
STEPS=200 # This is also the default.
# 1 - Dropout probability
KEEP_PROB=0.5 # This is also the default.
# RNN hidden state size
STATE_DIMS=512 # This is also the default.
# Checkpoint and validation loss calculation frequency.
CHECKPOINT_EVERY=5000 # This is also the default.


# Understanding boolean options:
# Most boolean options are set to False by default.
# To enable a boolean option, set it to True.
# For instance, to use a GRU instead of an LSTM, add the option --gru True to your command.

We next present the various scenarios supported by our implementation.

Training

The training scenario creates a global model by training on the provided training data. We will train a Java model with a BPE encoding of 10000 using the sample data. In the following training example we set some of the hyperparameters (to their default values, though). Optionally, you can set all of them to the values you intend to use. Since the data is tokenized into subwords, we need to let the script know, so that it can calculate the metrics correctly; for this reason we set the word_level_perplexity flag to True. To also output validation cross-entropy instead of perplexity, we set the cross_entropy option to True.

# Train a small java model for 1 epoch.
python code_nlm.py --data_path $DATA_HOME --train_dir $MODEL_DIR --train_filename $TRAIN_FILE --validation_filename $VALIDATION_FILE --gru True --hidden_size $STATE_DIMS  --batch_size $BATCH_SIZE --word_level_perplexity True --cross_entropy True --steps_per_checkpoint $CHECKPOINT_EVERY --max_epoch $EPOCHS

# Because we are using the default values we could shorten the above command to:
# python code_nlm.py --data_path $DATA_HOME --train_dir $MODEL_DIR --train_filename $TRAIN_FILE --validation_filename $VALIDATION_FILE --gru True --word_level_perplexity True --cross_entropy True --max_epoch $EPOCHS

Test Scenarios

Test Entropy Calculation

# Testing the model (Calculating test set entropy) 
python code_nlm.py --test True --data_path $DATA_HOME --train_dir $MODEL_DIR --test_filename $TEST_FILE --gru True --batch_size $BATCH_SIZE --word_level_perplexity True --cross_entropy True

Dynamically Adapt the Model on Test Data

In order to dynamically adapt the model, the implementation needs to know when it is testing on a new project, so that it can revert the model back to the global one. This is achieved via the test_proj_filename option.

# Batch size must always be set to 1 for this scenario! We go through every file separately.
# In an IDE this could instead be sped up through some engineering.
python code_nlm.py --dynamic_test True --data_path $DATA_HOME --train_dir $MODEL_DIR --test_filename $TEST_FILE --gru True --batch_size 1 --word_level_perplexity True --cross_entropy True --test_proj_filename $TEST_PROJ_NAMES_FILE
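
Conceptually, dynamic adaptation scores each file with the current model and then updates the model on that file, resetting to the global model at every project boundary (which is why the project names file is needed). The toy below mimics that loop with an add-one unigram model standing in for the trained NLM; none of these names are part of code_nlm.py.

# Toy evaluate-then-adapt loop; a trained NLM replaces the unigram model in the real scenario.
import math
from collections import Counter

def cross_entropy(counts, total, vocab_size, tokens):
    """Per-token cross-entropy (bits) under an add-one-smoothed unigram model."""
    return -sum(math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens) / len(tokens)

def dynamic_test(global_counts, projects, vocab_size=10000):
    results = {}
    for name, files in projects.items():
        counts = Counter(global_counts)   # revert to the global model for every new project
        total = sum(counts.values())
        entropies = []
        for tokens in files:
            entropies.append(cross_entropy(counts, total, vocab_size, tokens))  # score first...
            counts.update(tokens)          # ...then adapt on the file that was just scored
            total += len(tokens)
        results[name] = entropies
    return results

global_counts = Counter("public static void main String args".split())
projects = {"projA": [["public", "void", "foo"], ["foo", "bar", "foo"]]}
print(dynamic_test(global_counts, projects))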

Test Code Completion

In this scenario the batch_size option is used to set the beam size.

python code_nlm.py --completion True --data_path $DATA_HOME --train_dir $MODEL_DIR --test_filename $TEST_FILE --gru True --batch_size $BATCH_SIZE
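
Under the hood, completing a token means searching over sequences of subword units until a unit that closes the token, keeping the beam_size best partial sequences at each step. The toy below shows the idea with a hand-written next-subword table; both the table and the "@@" continuation-marker convention are assumptions for illustration only.

# Toy beam search over subword units; the real search queries the trained NLM instead.
import math

# Hypothetical next-subword distributions keyed by the previously emitted unit.
TOY_MODEL = {
    "<start>": {"get@@": 0.6, "set@@": 0.4},
    "get@@":   {"Name": 0.7, "Value": 0.3},
    "set@@":   {"Name": 0.5, "Value": 0.5},
}

def complete_token(beam_size=2, max_units=5):
    beams = [((), 0.0, "<start>")]        # (units so far, log probability, last unit)
    finished = []
    for _ in range(max_units):
        expanded = []
        for units, logp, last in beams:
            for unit, p in TOY_MODEL.get(last, {}).items():
                cand = (units + (unit,), logp + math.log(p), unit)
                # A unit without the "@@" continuation marker completes the token.
                (expanded if unit.endswith("@@") else finished).append(cand)
        if not expanded:
            break
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
    return sorted(finished, key=lambda c: c[1], reverse=True)

for units, logp, _ in complete_token():
    print("".join(u.replace("@@", "") for u in units), round(math.exp(logp), 2))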

Dynamic Code Completion on Test Data

As in the previous dynamic scenario, we need to set the test_proj_filename option.

python code_nlm.py --completion True --dynamic True --data_path $DATA_HOME --train_dir $MODEL_DIR --test_filename $TEST_FILE --gru True --batch_size $BATCH_SIZE  --test_proj_filename $TEST_PROJ_NAMES_FILE

Dynamic Code Completion on Test Data and Measuring Identifier Specific Performance

To run this experiment you need to provide a file containing a mapping that lets the implementation know for each subtoken whether it is part of an identifier or not. This information would easily be present in an IDE. The mapping is provided via the identifier_map option.

python code_nlm.py --completion True --dynamic True --data_path $DATA_HOME --train_dir $MODEL_DIR --test_filename $TEST_FILE --gru True --batch_size $BATCH_SIZE  --test_proj_filename $TEST_PROJ_NAMES_FILE --identifier_map $ID_MAP_FILE
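
For intuition, measuring identifier-specific performance amounts to restricting a completion metric such as MRR to the positions that the identifier map flags as part of an identifier. The snippet below illustrates that with made-up ranks and flags; it says nothing about the actual file format expected by the identifier_map option.

# Illustration of identifier-specific MRR with made-up data (not the script's file format).
def mrr(ranks):
    """Mean reciprocal rank; None means the correct token was not in the suggestions."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks) if ranks else 0.0

ranks    = [1, 3, None, 2, 1, 5]                   # rank of the true token at each position
is_ident = [True, True, False, True, False, True]  # flags taken from the identifier map

print("overall MRR:   ", round(mrr(ranks), 3))
print("identifier MRR:", round(mrr([r for r, f in zip(ranks, is_ident) if f]), 3))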

Adding a Simple Identifier n-gram Cache

In an IDE setting we could improve performance on identifiers by using a simple n-gram cache of identifiers that have already been encountered. The file_cache_weight and cache_order options control the cache's weight and order respectively. By default we use a 6-gram cache with a weight of 0.2.

python code_nlm.py --completion True --dynamic True --data_path $DATA_HOME --train_dir $MODEL_DIR --test_filename $TEST_FILE --gru True --batch_size $BATCH_SIZE --test_proj_filename $TEST_PROJ_NAMES_FILE --identifier_map $ID_MAP_FILE --cache_ids True
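
Conceptually, the cache mixes the NLM's probability with an n-gram cache built from identifiers already seen in the current file, roughly p = (1 - w) * p_nlm + w * p_cache, with w = 0.2 and 6-gram contexts by default. The sketch below is a toy version of that idea, not the implementation in code_nlm.py.

# Toy identifier n-gram cache interpolated with the NLM probability (defaults: order 6, weight 0.2).
from collections import defaultdict

class NGramCache:
    def __init__(self, order=6):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, context, token):
        """Record an identifier subtoken seen after a context of up to order-1 tokens."""
        self.counts[tuple(context[-(self.order - 1):])][token] += 1

    def prob(self, context, token):
        hist = self.counts[tuple(context[-(self.order - 1):])]
        total = sum(hist.values())
        return hist[token] / total if total else 0.0

def combined_prob(p_nlm, cache, context, token, cache_weight=0.2):
    return (1.0 - cache_weight) * p_nlm + cache_weight * cache.prob(context, token)

cache = NGramCache(order=6)
cache.add(["return", "this", "."], "user@@")  # identifier subtoken seen earlier in the file
print(combined_prob(0.05, cache, ["return", "this", "."], "user@@"))  # 0.8 * 0.05 + 0.2 * 1.0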

Predictability

Similar to the test entropy scenario, but it calculates the average entropy per file instead of the average per-token entropy.
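
In other words, each file's mean entropy is computed first and those per-file means are then averaged, rather than averaging over all tokens of the test set at once. A toy example of the difference:

# Per-token averaging vs. the per-file averaging used here (entropy values are made up).
file_entropies = [[2.0, 4.0, 6.0], [1.0]]  # per-token entropies for two test files

per_token_avg = sum(sum(f) for f in file_entropies) / sum(len(f) for f in file_entropies)
per_file_avg = sum(sum(f) / len(f) for f in file_entropies) / len(file_entropies)

print(per_token_avg)  # 13.0 / 4 = 3.25
print(per_file_avg)   # (4.0 + 1.0) / 2 = 2.5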
