# Mean-Max AAE

Sentence encoder and training code for the paper *Learning Universal Sentence Representations with Mean-Max Attention Autoencoder* (EMNLP 2018).
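The "mean-max" in the name refers to concatenating mean pooling and max pooling over the encoder's hidden states, which is where the 4096-dimensional embeddings below come from. A minimal NumPy sketch of the pooling step only (the hidden size of 2048 and the matrix `H` are illustrative assumptions, not the authors' code):

```python
import numpy as np

# H: hypothetical encoder hidden states for one sentence,
# shape (num_tokens, hidden_dim); 2048 is an assumed hidden size
H = np.random.randn(10, 2048)

mean_vec = H.mean(axis=0)  # mean pooling over tokens, shape (2048,)
max_vec = H.max(axis=0)    # max pooling over tokens, shape (2048,)

# concatenating the two pooled vectors gives the sentence embedding
sentence_emb = np.concatenate([mean_vec, max_vec])  # shape (4096,)
```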

## Dependencies

This code is written in Python. The required packages are listed in requirements.txt.
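A typical setup with pip (assuming a standard Python environment):

```bash
pip install -r requirements.txt
```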

## Download datasets

The pre-processed Toronto BookCorpus we used for training our model is available here.

To download the GloVe vectors:

```bash
curl -Lo data/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip data/glove.840B.300d.zip -d data/
```

To get all the transfer task datasets, run (from `data/`):

```bash
./get_transfer_data.bash
```

This will automatically download and preprocess the transfer task datasets and store them in `data/`.

## Sentence encoder

We provide a simple interface to encode English sentences. Get started with the following steps:

1) Download our Mean-Max AAE models, put them in the SentEncoding directory, and decompress:

```bash
unzip models.zip
```

2) Make sure you have the NLTK tokenizer by running the following once:

```python
import nltk
nltk.download('punkt')
```

3) Load our pre-trained model:

```python
import os
import master

m = master.Master('conf.json')
m.creat_graph()  # method name as defined in master.py
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # select which GPU to use
m.prepare()
```

4) Build the vocabulary of word vectors (i.e. keep only those needed):

```python
vocab = m.build_vocab(sentences, tokenize=True)
m.build_emb(vocab)
```

where `sentences` is your list of *n* sentences.
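For example (a hypothetical input list):

```python
# a small, hypothetical list of input sentences
sentences = [
    "A man is playing a guitar.",
    "The weather today is lovely.",
]
```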

5) Encode your sentences:

```python
embeddings = m.encode(sentences, tokenize=True)
```

This outputs a NumPy array with *n* vectors of dimension 4096.
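Continuing the hypothetical example above, you can sanity-check the output shape:

```python
print(embeddings.shape)  # expected: (2, 4096) for the two example sentences
```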

## Reference

If you found this code useful, please cite the following paper:

```
@inproceedings{zhang2018learning,
  title={Learning Universal Sentence Representations with Mean-Max Attention Autoencoder},
  author={Zhang, Minghua and Wu, Yunfang and Li, Weikang and Li, Wei},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages={4514--4523},
  year={2018}
}
```