This document describes how to train and use a transformer-based version of ELMo with allennlp
. The model is a port of the the one described in Dissecting Contextual Word Embeddings: Architecture and Representation by Peters et al. You can find a pretrained version of this model here.
-
Obtain training data from https://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz.
export BIDIRECTIONAL_LM_DATA_PATH=$PWD'/1-billion-word-language-modeling-benchmark-r13output' export BIDIRECTIONAL_LM_TRAIN_PATH=$BIDIRECTIONAL_LM_DATA_PATH'/training-monolingual.tokenized.shuffled/*'
-
Obtain vocab.
pip install --user awscli mkdir vocabulary export BIDIRECTIONAL_LM_VOCAB_PATH=$PWD'/vocabulary' cd $BIDIRECTIONAL_LM_VOCAB_PATH aws --no-sign-request s3 cp s3://allennlp/models/elmo/vocab-2016-09-10.txt . cat vocab-2016-09-10.txt | sed 's/<UNK>/@@UNKNOWN@@/' > tokens.txt # Avoid creating garbage namespace. rm vocab-2016-09-10.txt echo '*labels\n*tags' > non_padded_namespaces.txt
-
Run training. Note:
training_config
refers to this directory.# The multiprocess dataset reader and iterator use many file descriptors, # so we increase the relevant ulimit here to help. # See https://pytorch.org/docs/stable/multiprocessing.html#file-descriptor-file-descriptor # for a description of the underlying issue. ulimit -n 4096 # Location of repo for training_config. cd allennlp allennlp train training_config/bidirectional_language_model.jsonnet --serialization-dir output_path
-
Wait. This will take days. (Example results here are from a model trained for just 4 epochs.)
-
Evaluate. There is one gotcha here, which is that we discard 3 sentences for being too long (otherwise we'd exhaust GPU memory). If we wanted to report this number formally (in a paper or similar), we'd need to handle this differently.
allennlp evaluate --cuda-device 0 -o '{"iterator": {"base_iterator": {"maximum_samples_per_batch": ["num_tokens", 500] }}}' output_path/model.tar.gz $BIDIRECTIONAL_LM_DATA_PATH/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100
A model trained for 4 epochs gives:
2018-12-12 05:42:53,711 - INFO - allennlp.commands.evaluate - loss: 3.745238332322373 ipython In [1]: import math; math.exp(3.745238332322373) # To compute perplexity Out[1]: 42.3190920245054
Using Transformer ELMo is essentially the same as using regular ELMo. See this documentation for details on how to do that.
The one exception is that inside the text_field_embedder
block in your training config you should replace
"text_field_embedder": {
"token_embedders": {
"elmo": {
"type": "elmo_token_embedder",
"options_file": "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json",
"weight_file": "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5",
"do_layer_norm": false,
"dropout": 0.5
}
}
},
with
"text_field_embedder": {
"token_embedders": {
"elmo": {
"type": "bidirectional_lm_token_embedder",
"archive_file": std.extVar('BIDIRECTIONAL_LM_ARCHIVE_PATH'),
"dropout": 0.2,
"bos_eos_tokens": ["<S>", "</S>"],
"remove_bos_eos": true,
"requires_grad": false
}
}
},
.
For an example of this see the config for a Transformer ELMo augmented constituency parser and compare with the original ELMo augmented constituency parser.
Of course, you can also directly call the embedder in your programs:
from allennlp.modules.token_embedders.bidirectional_language_model_token_embedder import BidirectionalLanguageModelTokenEmbedder
from allennlp.data.token_indexers.elmo_indexer import ELMoTokenCharactersIndexer
from allennlp.data.tokenizers.token import Token
import torch
lm_model_file = "output_path/model.tar.gz"
sentence = "It is raining in Seattle ."
tokens = [Token(word) for word in sentence.split()]
lm_embedder = BidirectionalLanguageModelTokenEmbedder(
archive_file=lm_model_file,
dropout=0.2,
bos_eos_tokens=["<S>", "</S>"],
remove_bos_eos=True,
requires_grad=False
)
indexer = ELMoTokenCharactersIndexer()
vocab = lm_embedder._lm.vocab
character_indices = indexer.tokens_to_indices(tokens, vocab, "elmo")["elmo"]
# Batch of size 1
indices_tensor = torch.LongTensor([character_indices])
# Embed and extract the single element from the batch.
embeddings = lm_embedder(indices_tensor)[0]
for word_embedding in embeddings:
print(word_embedding)
Note: This sidesteps our data loading and batching mechanisms for brevity. See our main tutorial for an exposition of how they function.