In [1]:
import os
import wget
import numpy as np

from allennlp.modules.elmo import batch_to_ids
from sklearn.metrics.pairwise import cosine_distances

Pre-trained ELMo contextualized embeddings can be downloaded from the [official website](https://allennlp.org/elmo). The available models range from roughly $10^7$ to roughly $10^8$ parameters. For simplicity, we use the smallest model here (hence the fastest to download and the least accurate), which has a 1024-dimensional LSTM hidden state and a 128-dimensional output (projection) size.

In [2]:
weight_url = 'https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_weights.hdf5'
options_url = 'https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_options.json'
weight_file = weight_url.split('/')[-1]
options_file = options_url.split('/')[-1]

if not os.path.exists(weight_file):
    wget.download(weight_url)
if not os.path.exists(options_file):
    wget.download(options_url)

The `batch_to_ids` function transforms a list of tokenized sentences into a tensor of character IDs of shape (number of sentences, number of tokens in the longest sentence, maximum number of characters per token).

In [3]:
sentences = [['I', 'was', 'sitting', 'at', 'the', 'bank', 'of', 'the', 'river', '.'],
             ['That', 'bank', 'was', 'robbed', 'recently', '.'],
             ['I', 'saw', 'a', 'bug', 'flying', 'next', 'to', 'the', 'water', '.'],
             ['The', 'quick', 'brown', 'fox', 'ate', 'a', 'chicken', '.'],
             ['The', 'lazy', 'dog', 'visits', 'the', 'bank', '.']]
character_ids = batch_to_ids(sentences)
character_ids.shape

torch.Size([5, 10, 50])
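
To make the three dimensions of that shape concrete, here is a minimal sketch of the padding logic only. The helper `toy_char_ids` is hypothetical and uses plain Unicode code points with zero padding; it is not AllenNLP's actual character mapping, which `batch_to_ids` implements with its own vocabulary.

```python
import numpy as np

def toy_char_ids(sentences, max_chars=50):
    """Toy stand-in for batch_to_ids: zero-padded code points, NOT the real ELMo mapping."""
    max_tokens = max(len(sentence) for sentence in sentences)
    # One row of max_chars character slots per token, one block of max_tokens rows per sentence
    ids = np.zeros((len(sentences), max_tokens, max_chars), dtype=np.int64)
    for i, sentence in enumerate(sentences):
        for j, token in enumerate(sentence):
            codes = [ord(c) for c in token[:max_chars]]
            ids[i, j, :len(codes)] = codes
    return ids

toy = toy_char_ids([['That', 'bank', 'was', 'robbed', '.'],
                    ['The', 'dog', '.']])
print(toy.shape)  # (2, 5, 50)
```

Shorter sentences are padded with all-zero token rows up to the length of the longest sentence, and shorter tokens are padded with zeros up to `max_chars`.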

Initialize the biLSTM based on the previously downloaded pre-trained model.

In [4]:
from allennlp.commands.elmo import ElmoEmbedder
elmo = ElmoEmbedder(options_file, weight_file)

04/15/2019 21:02:38 - INFO - allennlp.commands.elmo -   Initializing ELMo.


In [5]:
vectors = elmo.embed_sentence("The bank is located at the bank of Thames .".split(' '))
vectorsB = elmo.embed_sentence("The bank is closed on Sundays .".split(' '))
vectorsC = elmo.embed_sentence("Can you see that nice swan at the river bank ?".split(' '))

`ElmoEmbedder` encodes each sentence into a `(3, number of tokens, embedding size)` array. Layer 0 comes from the character-level convolutions, whose output is then fed into a 2-layer biLSTM. For each token position and layer, the biLSTM outputs the concatenated forward and backward states, a vector of size `2 * projection dimension` (here 2 × 128 = 256).  
The bidirectional language model looks as follows:

In [6]:
print(elmo.elmo_bilm)

_ElmoBiLm(
  (_token_embedder): _ElmoCharacterEncoder(
    (char_conv_0): Conv1d(16, 32, kernel_size=(1,), stride=(1,))
    (char_conv_1): Conv1d(16, 32, kernel_size=(2,), stride=(1,))
    (char_conv_2): Conv1d(16, 64, kernel_size=(3,), stride=(1,))
    (char_conv_3): Conv1d(16, 128, kernel_size=(4,), stride=(1,))
    (char_conv_4): Conv1d(16, 256, kernel_size=(5,), stride=(1,))
    (char_conv_5): Conv1d(16, 512, kernel_size=(6,), stride=(1,))
    (char_conv_6): Conv1d(16, 1024, kernel_size=(7,), stride=(1,))
    (_highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
      )
    )
    (_projection): Linear(in_features=2048, out_features=128, bias=True)
  )
  (_elmo_lstm): ElmoLstm(
    (forward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=128, out_features=4096, bias=False)
      (state_linearity): Linear(in_features=128, out_features=4096, bias=True)
      (state_projection): Linear(in_fea

We can compare the vector representations of the same word `bank` in different contexts, across multiple mentions and layers of the biLSTM.

In [7]:
for layer in range(3):
    bank_vectors = np.array([vectors[layer][1], vectors[layer][6], vectorsB[layer][1], vectorsC[layer][9]])
    print(cosine_distances(bank_vectors))
    print('LAYER {}'.format(layer))

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
LAYER 0
[[0.         0.30206925 0.01709425 0.23226154]
 [0.30206925 0.         0.3021521  0.24452692]
 [0.01709425 0.3021521  0.         0.21635085]
 [0.23226154 0.24452692 0.21635085 0.        ]]
LAYER 1
[[0.         0.42234105 0.06816322 0.43650234]
 [0.42234105 0.         0.41525936 0.44247025]
 [0.06816322 0.41525936 0.         0.39149922]
 [0.43650234 0.44247025 0.39149922 0.        ]]
LAYER 2


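Unsurprisingly, the layer 0 representations of the same word are identical, because that layer comes only from the character-level convolutional part of the network, which has no context awareness.  
The representations of the same word at layers 1 and 2, however, differ with the context. The two mentions that follow the same opening, "The bank is", stay closest to each other (distance about 0.017 at layer 1 and 0.068 at layer 2).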
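
As a reminder of what the matrices above contain, cosine distance is one minus the cosine similarity of two vectors. The following is a minimal NumPy sketch of that definition, assuming no row is the zero vector; the helper name `cosine_distance_matrix` and the toy vectors are illustrative and not part of the notebook.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distances between the rows of X (assumes no all-zero rows)."""
    X = np.asarray(X, dtype=float)
    # Scale every row to unit length so that dot products become cosine similarities
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

# Rows 0 and 1 are identical (distance 0); row 2 is orthogonal to both (distance 1)
toy_vectors = np.array([[1.0, 0.0],
                        [1.0, 0.0],
                        [0.0, 1.0]])
toy_distances = cosine_distance_matrix(toy_vectors)
print(toy_distances)
```

Identical vectors, like the layer 0 representations of `bank`, give a distance of 0, which is why the first matrix is all zeros.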

# ELMo for many languages

The [ElmoForManyLangs](https://github.com/HIT-SCIR/ELMoForManyLangs) project provides pre-trained ELMo-style embeddings for many languages, along with utilities for using them.  
Once the package is installed, pass the folder containing the model files when initializing the `Embedder`.

In [8]:
from elmoformanylangs import Embedder
e = Embedder('./hun')

04/15/2019 21:03:18 - INFO - root -   char embedding size: 3407
04/15/2019 21:03:22 - INFO - root -   word embedding size: 343569
04/15/2019 21:03:30 - INFO - root -   Model(
  (token_embedder): ConvTokenEmbedder(
    (word_emb_layer): EmbeddingLayer(
      (embedding): Embedding(343569, 100, padding_idx=3)
    )
    (char_emb_layer): EmbeddingLayer(
      (embedding): Embedding(3407, 50, padding_idx=3404)
    )
    (convolutions): ModuleList(
      (0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
      (1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
      (2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
      (3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
      (4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
      (5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
      (6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
    )
    (highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
        (1): Linear(

In [15]:
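# Hungarian sentences; "vár" can mean "castle" (noun) or "wait" (verb).
# 1. "Many soldiers passed through this castle during the Middle Ages."
# 2. "Mária got tired of waiting and will not wait any longer."
# 3. "Many brave soldiers met their death in the border fortresses."
mondatok = ["Ebben a várban sok katona megfordult a középkor során .".split(),
            "Mária megunta a várakozást , és nem vár tovább .".split(),
            "A végvárakban sok derék katona lelte halálát .".split()
           ]
# output_layer=-2 returns all three layers for every sentence
vectors = e.sents2elmo(mondatok, output_layer=-2)
vectors[0].shape
vectors[0][0][2].shape  # layer 0, token index 2 ("várban")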

04/15/2019 21:08:02 - INFO - root -   1 batches, avg len: 11.3


(1024,)

In [16]:
for layer in range(3):
    var_vectors = np.array([vectors[0][layer][2], vectors[1][layer][7], vectors[2][layer][1]])
    print(var_vectors.shape)
    print(cosine_distances(var_vectors))
    print('LAYER {}'.format(layer))

(3, 1024)
[[0.         0.52950865 0.41278762]
 [0.52950865 0.         0.6484362 ]
 [0.41278762 0.6484362  0.        ]]
LAYER 0
(3, 1024)
[[0.         0.7263053  0.4613278 ]
 [0.7263053  0.         0.81253874]
 [0.4613278  0.81253874 0.        ]]
LAYER 1
(3, 1024)
[[0.         0.8110876  0.42670667]
 [0.8110876  0.         0.87850004]
 [0.42670667 0.87850004 0.        ]]
LAYER 2
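
The token indices used in the comparisons above (for example `vectors[layer][6]` or `vectors[1][layer][7]`) were counted by hand. A small helper can look them up instead; `token_positions` below is a hypothetical convenience function, not part of AllenNLP or ElmoForManyLangs, and it only matches exact surface forms.

```python
def token_positions(tokens, word):
    """Return every index at which `word` occurs in a tokenized sentence."""
    return [i for i, tok in enumerate(tokens) if tok == word]

english = "The bank is located at the bank of Thames .".split(' ')
hungarian = "Mária megunta a várakozást , és nem vár tovább .".split()

print(token_positions(english, 'bank'))   # [1, 6]
print(token_positions(hungarian, 'vár'))  # [7]
```

Inflected forms such as `várban` or `végvárakban` are different tokens, so they have to be looked up by their own surface form.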
