Statistical NLP (Group 17)
This is the repository for Group 17 of the Statistical Natural Language Processing module at UCL, formed by:
- Talip Ucar (email@example.com)
- Adrian Daniel Szwarc (firstname.lastname@example.org)
- Matthew Lee (email@example.com)
- Adrian Gonzalez-Martin (firstname.lastname@example.org)
This repository implements the Matching Networks architecture (Vinyals et al., 2016) in
pytorch and applies it to a
Language Modelling task. The architecture is flexible enough to allow easy
experimentation with distance metrics, number of labels per episode, number of
examples per label, etc.
More details can be found in the associated paper.
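As a rough illustration of the core idea, a query sentence is classified by
attending over an embedded support set and mixing the support labels with the
resulting attention weights. The sketch below is simplified and illustrative
(the function and argument names are ours), not the implementation found in
src/:

```python
import torch
import torch.nn.functional as F

def match(query_emb, support_emb, support_labels, n_labels, distance="euclidean"):
    """query_emb: (d,), support_emb: (N*k, d), support_labels: (N*k,) int64."""
    if distance == "euclidean":
        # Negative euclidean distance acts as a similarity score.
        scores = -torch.cdist(query_emb.unsqueeze(0), support_emb).squeeze(0)
    else:
        # Cosine similarity, as in the original Matching Networks paper.
        scores = F.cosine_similarity(query_emb.unsqueeze(0), support_emb, dim=-1)
    attention = F.softmax(scores, dim=0)                   # (N*k,)
    one_hot = F.one_hot(support_labels, n_labels).float()  # (N*k, N)
    return attention @ one_hot                             # (N,) label probabilities

# Tiny demo: 2 labels, 2 examples each, 4-dimensional embeddings.
support = torch.randn(4, 4)
labels = torch.tensor([0, 0, 1, 1])
query = torch.randn(4)
print(match(query, support, labels, n_labels=2))
```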
You can experiment with the model using the attached Colab Notebook.
To keep the environments as reproducible as possible, we will use pipenv to
handle dependencies. To install it, just follow the instructions in the pipenv documentation.
The first time, to create the environment and install all required dependencies, just run:
$ pipenv install
This will create a
virtualenv and will install all required dependencies.
Installing new dependencies
To add new dependencies just run:
$ pipenv install numpy
Remember to commit the updated
Pipfile and Pipfile.lock files so that
everyone else can also install them!
Most of the source code can be found under the
src/ folder. However, we also
include a set of command line tools, which should help with sampling, training
and testing models. These can be found under the bin/ folder.
Additionally, you can find the following folders:
wikitext-2/: Raw WikiText-2 data set.
data/: Pre-sampled set of label/sentence pairs and pre-generated vocabulary.
models/: Pre-trained models. The filenames encode the different parameters used to train the model.
results/: Data generated after evaluating the models. It includes predictions on the test set, embeddings and attention maps.
figures/: Figures generated from the data in the results/ folder.
We are using
pytest for writing and running unit tests. You can see some
examples in the src/tests/ folder.
To run all tests, just run the following command:
$ pytest -s src/tests
In the data/ folder you can find the train.csv and test.csv files, which
contain 9000 labels with 10 examples each and 1000 labels with 10 examples
each, respectively.
The data is in CSV format with two columns:
label: The word acting as label which we need to find.
sentence: The sentence acting as input, where the particular word has been replaced with the <blank_token> token.
An example can be seen below:
label,sentence
music,no need to be a hipster to play <blank_token> in vynils
music,nowadays <blank_token> doesn't sound as before
...
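Since the files are plain CSV, they are easy to inspect with standard tools.
For example (assuming pandas is installed):

```python
import pandas as pd

# Load the sampled pairs and check how many distinct labels there are and
# how many example sentences each label has.
pairs = pd.read_csv("data/train.csv")
print(pairs["label"].nunique(), "distinct labels")
print(pairs.groupby("label").size().head())
```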
Sampling new pairs
If you want to sample a new set of pairs from the WikiText-2 dataset, you can use the
bin.sample script. For example, to resample the entire dataset, we could run:
$ python -m bin.sample -N 9000 -k 10 wikitext-2/wiki.train.tokens data/train.csv
$ python -m bin.sample -N 1000 -k 10 wikitext-2/wiki.test.tokens data/test.csv
Note that the file will be processed first, so that the text is as similar as possible to text coming from the PTB dataset.
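For reference, PTB-style normalisation typically lowercases the text and maps
numbers to a placeholder token. The snippet below is an assumed, simplified
illustration of that kind of preprocessing, not the exact steps performed by
bin.sample:

```python
import re

def ptb_normalise(line: str) -> str:
    line = line.lower()
    line = re.sub(r"\d+", "N", line)  # PTB replaces numbers with 'N'
    return line

print(ptb_normalise("The Album Sold 3,500 Copies in 1999"))
```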
To make things easy to replicate, we generate in advance the vocabulary over the
training set and store it in a file, which can then be used later for training
and testing. You can have a look at the format in data/vocab.json.
To re-generate it (after sampling new pairs, for example), you can use the bin.vocab script:
$ python -m bin.vocab data/train.csv data/vocab.json
This command will store the vocabulary's state as a JSON file.
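As an illustration of what that state could contain, the sketch below builds a
plain word-to-index mapping from the training sentences and serialises it as
JSON; the actual bin.vocab script may store additional fields, and the special
tokens used here are an assumption:

```python
import json
from collections import Counter

import pandas as pd

sentences = pd.read_csv("data/train.csv")["sentence"]
counts = Counter(token for sentence in sentences for token in sentence.split())

# Reserve indices 0 and 1 for padding and unknown words (assumed convention).
vocab = {"<pad>": 0, "<unk>": 1}
vocab.update({word: index for index, (word, _) in
              enumerate(counts.most_common(), start=len(vocab))})

with open("data/vocab.json", "w") as f:
    json.dump(vocab, f)
```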
Training of a new model can be performed using the bin.train script:
$ python -m bin.train -N 5 -k 2 -d euclidean data/vocab.json data/train.csv
The N and k parameters control the number of labels and the number of examples we want per
episode, respectively. The remaining arguments specify the distance metric, the
pre-computed vocabulary and the training set.
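To make the N and k parameters concrete, an N-way, k-shot episode can be
assembled from the sampled pairs roughly as follows (illustrative only; the
actual data loading code in src/ may differ):

```python
import pandas as pd

def sample_episode(pairs, n_labels=5, k_examples=2):
    # Pick N distinct labels, then k example sentences for each of them.
    labels = pairs["label"].drop_duplicates().sample(n_labels)
    return (pairs[pairs["label"].isin(labels)]
            .groupby("label", group_keys=False)
            .apply(lambda g: g.sample(k_examples)))

pairs = pd.read_csv("data/train.csv")
print(sample_episode(pairs, n_labels=5, k_examples=2))
```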
After convergence, the best model's
state_dict is stored under the models/
folder, with the different parameters encoded in its name. For example, the model
poincare_vanilla_N=5_k=2_model_7.pth was trained using the poincare distance and the
vanilla embedding, with 5 labels and 2 examples per label
per episode. From the file name it can also be seen that it converged after 7 snapshots.
These details are discussed further in the associated paper.
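A stored snapshot can be loaded back with plain pytorch. In the sketch below
the model class and its constructor are hypothetical placeholders for whatever
is defined in src/:

```python
import torch

# Load the saved parameters onto the CPU.
state_dict = torch.load("models/poincare_vanilla_N=5_k=2_model_7.pth",
                        map_location="cpu")

# model = MatchingNetwork(...)   # hypothetical constructor from src/
# model.load_state_dict(state_dict)
# model.eval()
```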
Accuracy on a test set for a given model's snapshot can be measured using the bin.test script:
$ python -m bin.test -v data/vocab.json -m models/euclidean_vanilla_N\=5_k\=3_model_24.pth data/test.csv
This command has extra flags which allow you to:
-p: Store the predictions in the results/ folder.
-e: Generate embeddings and attention maps for a single episode and store them in the results/ folder.
Some of the already generated data can be seen in the results/ folder.
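As an example of how the stored predictions could be used, the snippet below
computes accuracy from a predictions file; the file name and column names are
assumptions, as the actual layout of the generated files may differ:

```python
import pandas as pd

predictions = pd.read_csv("results/predictions.csv")  # hypothetical file name
accuracy = (predictions["label"] == predictions["prediction"]).mean()
print(f"Accuracy: {accuracy:.3f}")
```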
This repository can be found at https://github.com/adriangonz/statistical-nlp-17.