CoNaLa Preprocessing Scripts/Baseline

This repository contains preprocessing scripts and a baseline for the CoNaLa Code/Natural Language Challenge.

CoNaLa Preprocessing Scripts

In the preproc directory, there are a number of preprocessing/evaluation scripts that you can use to extract the data and convert it into a format that is easy to use for training models. The best way to see their usage is to take a look at the CoNaLa baseline below, but we'll first briefly describe them here.

  • perform some tokenization on the source code, etc.
  • convert the json file resulting from the previous step to source/target files used by seq2seq models
  • take a decode file outputted by the seq2seq model, and "detokenize" it to the original source code for evaluation
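The first two steps above can be pictured with a minimal sketch. It is not the repository's actual script: the function names tokenize_code and json_to_parallel are hypothetical, and the field names "intent" and "snippet" are assumptions about the layout of the json entries. It uses only Python's standard tokenize module to split a snippet into space-separated tokens, and pairs each tokenized snippet with its English description, one example per line.

```python
import io
import tokenize

# Token types that carry layout rather than content; they are skipped here.
_LAYOUT_TYPES = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
    tokenize.COMMENT,
}


def tokenize_code(snippet):
    """Split a Python snippet into a flat list of token strings."""
    reader = io.StringIO(snippet).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(reader)
        if tok.type not in _LAYOUT_TYPES
    ]


def json_to_parallel(entries):
    """Turn a list of dict entries into parallel source/target lines.

    Assumes (hypothetically) that each entry has an "intent" field with
    the English command and a "snippet" field with the Python code.
    """
    src_lines, tgt_lines = [], []
    for entry in entries:
        # Source side: whitespace-normalized English command.
        src_lines.append(" ".join(entry["intent"].split()))
        # Target side: code tokens separated by single spaces.
        tgt_lines.append(" ".join(tokenize_code(entry["snippet"])))
    return src_lines, tgt_lines


if __name__ == "__main__":
    example = [{"intent": "sum the elements of list x", "snippet": "total = sum(x)"}]
    src, tgt = json_to_parallel(example)
    print(src[0])
    print(tgt[0])
```

A real pipeline also has to handle details this sketch skips, such as string literals that contain spaces. Those are what make the third "detokenize" step necessary before evaluation.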

CoNaLa Baseline

The baseline trains a standard neural machine translation model to generate Python code from English commands.

The baseline requires a machine with a GPU and uses the neural machine translation system xnmt (specifically, commit d9e227b), so install xnmt first. Make sure you also install all the packages listed in xnmt's requirements-extra.txt file by running pip install -r /path/to/xnmt/requirements-extra.txt.

Also, install the requirements for this package itself by running pip install -r requirements.txt.

After everything is installed, you can run


And it should do the rest for you. Data will be downloaded to and preprocessed in the conala-corpus/ directory, and output logs and scores will be written into the results/ directory.

