# Word Embeddings in PyTorch

Let's start by creating a pair of very simple embeddings, without any training involved. The idea is to first map the words in a vocabulary to integers, and then to map these integers to dense real-valued vectors. We end up, roughly speaking, with two dictionaries: one mapping words to integers, and one mapping integers to real vectors. Let's begin by importing the few libraries we need.

In [38]:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F

We create a minimalistic vocabulary consisting only of the words `hello` and `world`. We map the first one to the integer 0 and the second one to the integer 1.

In [4]:
vocabulary = ['hello', 'world']
word_to_ix = {word: i for (i, word) in enumerate(vocabulary)}
print(word_to_ix)

{'hello': 0, 'world': 1}
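
The two-dictionary picture from the introduction can be made concrete in plain Python. The sketch below is only illustrative: `ix_to_word` and `toy_vectors` are hypothetical names that do not appear elsewhere in this notebook, and the vector values are arbitrary placeholders rather than the output of any model.

```python
# Illustrative sketch only: `ix_to_word` and `toy_vectors` are hypothetical names.
vocabulary = ['hello', 'world']

# First dictionary: word -> integer index.
word_to_ix = {word: i for i, word in enumerate(vocabulary)}

# Inverse mapping, handy for turning indices back into words.
ix_to_word = {i: word for word, i in word_to_ix.items()}

# Second dictionary: integer index -> dense vector (arbitrary placeholder values).
toy_vectors = {0: [0.1, -0.3, 0.7], 1: [0.5, 0.2, -0.4]}

# Embedding a word amounts to two lookups in a row.
print(toy_vectors[word_to_ix['world']])
print(ix_to_word[1])
```

An embedding layer plays the role of the second dictionary, except that its vectors are stored as rows of a trainable matrix.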


The module performing the embedding is, rather appropriately, `nn.Embedding`. It requires two arguments: the first, `num_embeddings`, is the size of the vocabulary; the second, `embedding_dim`, is the dimension of the embedding space.

In [5]:
embeds = nn.Embedding(num_embeddings=2, embedding_dim=5)
print(embeds)

Embedding(2, 5)
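
One way to see what `nn.Embedding` holds is to inspect its `weight` attribute, a matrix with one row per vocabulary entry. The following sketch assumes PyTorch 0.4 or later, where plain tensors can be passed directly without wrapping them in `Variable`; the name `lookup_demo` is introduced here only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the random initialization repeatable

# `lookup_demo` is a hypothetical name used only in this sketch.
lookup_demo = nn.Embedding(num_embeddings=2, embedding_dim=5)

# The layer stores a (num_embeddings, embedding_dim) matrix of trainable weights.
print(lookup_demo.weight.shape)

# Calling the layer on an index returns the corresponding row of that matrix.
idx = torch.tensor([1])
row = lookup_demo(idx)
print(torch.equal(row[0], lookup_demo.weight[1]))
```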


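We need to pass tensors of indices, wrapped in a `Variable`, to this module. Note that `torch.LongTensor` interprets a bare integer as a size rather than as a value: `torch.LongTensor(0)` creates an empty tensor instead of a tensor holding the index 0, which would not work. The index must therefore be passed inside a list.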

In [13]:
# Fails
# hello_var = Variable(torch.LongTensor(word_to_ix['hello']))

# Succeeds
hello_var = Variable(torch.LongTensor([word_to_ix['hello']]))
print(hello_var)
hello_embed = embeds(hello_var)
print(hello_embed)

Variable containing:
 0
[torch.LongTensor of size 1]

Variable containing:
-0.8702  0.3103  1.5108  0.9291  0.0045
[torch.FloatTensor of size 1x5]
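
The difference between the two calls comes from how the legacy `torch.LongTensor` constructor reads its argument: an integer is taken as a size, while a list is taken as data. A minimal sketch, assuming a PyTorch version in which this legacy constructor is still available; `torch.tensor` (PyTorch 0.4 or later) always interprets its argument as data.

```python
import torch

# An integer argument is interpreted as the number of elements to allocate.
# The contents are uninitialized, so only the shape is meaningful here.
by_size = torch.LongTensor(3)
print(by_size.shape)

# With the index 0 (the index of 'hello') this yields an empty tensor.
empty = torch.LongTensor(0)
print(empty.shape)

# A list argument is interpreted as the values to store.
by_value = torch.LongTensor([3])
print(by_value)

# torch.tensor always treats its argument as data.
also_by_value = torch.tensor([3])
print(torch.equal(by_value, also_by_value))
```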



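These embeddings haven't gone through a training phase, so their values are just the random initialization. We can still compute the cosine similarity between them (stored below as `cosine_distance`, although strictly speaking it is a similarity rather than a distance).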

In [16]:
world_embed = embeds(Variable(torch.LongTensor([word_to_ix['world']])))
print(world_embed)
cosine_distance = torch.dot(hello_embed, world_embed) / (
    torch.norm(hello_embed) * torch.norm(world_embed))
print(cosine_distance)

Variable containing:
-1.5181 -0.5671  0.0890  0.9829 -0.6964
[torch.FloatTensor of size 1x5]

Variable containing:
 0.5417
[torch.FloatTensor of size 1]
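
The quantity computed above is the cosine similarity, which is also available as `F.cosine_similarity`. The sketch below assumes PyTorch 0.4 or later and uses its own freshly initialized embeddings, so the numbers will differ from the output above. It compares the manual formula with the built-in function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embeds = nn.Embedding(num_embeddings=2, embedding_dim=5)

# Each lookup returns a tensor of shape (1, 5).
hello_embed = embeds(torch.tensor([0]))
world_embed = embeds(torch.tensor([1]))

# Manual formula: dot product divided by the product of the norms.
manual = (hello_embed * world_embed).sum() / (
    torch.norm(hello_embed) * torch.norm(world_embed))

# Built-in version, reducing over the embedding dimension.
builtin = F.cosine_similarity(hello_embed, world_embed, dim=1)

print(manual.item(), builtin.item())
```

Both values lie between -1 and 1; for untrained embeddings the result carries no semantic meaning.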



## N-Gram Modeling

In an n-gram model we want to predict the next word given the last n words, which we refer to as the *context*. If we consider only the last two words, we have a context of size 2. In such a case our dataset is composed of trigrams, i.e., tuples containing the last two words and the next word. More generally, given an input text, we can extract the (n+1)-grams as follows.

In [20]:
def extract_ngrams(input_text, n=2):
    # Split on whitespace; punctuation stays attached to the words.
    input_list = input_text.split()
    ngram_list = []
    for i in range(len(input_list) - n):
        # Context: n consecutive words; target: the word that follows them.
        context = input_list[i:i + n]
        target = input_list[i + n]
        ngram_list.append((context, target))
    return ngram_list

test_text = 'One day I woke up, or so I believed, but everything I knew was different.'
extract_ngrams(test_text, n=3)

[(['One', 'day', 'I'], 'woke'),
 (['day', 'I', 'woke'], 'up,'),
 (['I', 'woke', 'up,'], 'or'),
 (['woke', 'up,', 'or'], 'so'),
 (['up,', 'or', 'so'], 'I'),
 (['or', 'so', 'I'], 'believed,'),
 (['so', 'I', 'believed,'], 'but'),
 (['I', 'believed,', 'but'], 'everything'),
 (['believed,', 'but', 'everything'], 'I'),
 (['but', 'everything', 'I'], 'knew'),
 (['everything', 'I', 'knew'], 'was'),
 (['I', 'knew', 'was'], 'different.')]
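
As a cross-check of the window arithmetic, the same pairs can be produced by zipping shifted copies of the token list. This is a sketch of an alternative formulation, not the notebook's own function; the name `zip_ngrams` is hypothetical.

```python
def zip_ngrams(tokens, n=2):
    # Shifted copies of the token list; zip stops at the shortest one,
    # so every window has exactly n + 1 tokens.
    windows = zip(*(tokens[k:] for k in range(n + 1)))
    return [(list(w[:n]), w[n]) for w in windows]

test_text = 'One day I woke up, or so I believed, but everything I knew was different.'
tokens = test_text.split()
pairs = zip_ngrams(tokens, n=3)

print(pairs[0])
print(len(tokens), len(pairs))

# Fewer than n + 1 tokens leaves no complete window.
print(zip_ngrams(['too', 'short'], n=3))
```

A text of T tokens yields T - n pairs: here 15 tokens and n = 3 give the 12 pairs listed above.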

### Defining a Simple N-Gram Language Modeler

We define a very simple model that tries to predict the next word given the previous `context_size` words. The model consists of an embedding layer followed by two fully connected (FC) layers. The first applies a ReLU non-linearity to its output, while the second returns raw scores that are passed to a log-softmax. Note that the embeddings are flattened before being passed to the FC layers.

In [40]:
class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size, debug=False):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)
        self.debug = debug
        
    def forward(self, inputs):
        original_embeds = self.embeddings(inputs)
        embeds = original_embeds.view((1, -1))
        if self.debug:
            print('Before reshaping: {}'.format(original_embeds.size()))
            print('After reshaping:  {}'.format(embeds.size()))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        # `out` has shape (1, vocab_size): normalize over the vocabulary dimension.
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
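
To make the layer sizes concrete, the sketch below traces tensor shapes through the same sequence of operations using standalone layers. It assumes PyTorch 0.4 or later; the sizes and the context indices are arbitrary placeholders chosen for illustration. It also checks that normalizing over the vocabulary dimension (dimension 1 for a `(1, vocab_size)` tensor) gives probabilities that sum to one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim, context_size = 13, 5, 2  # placeholder sizes

embeddings = nn.Embedding(vocab_size, embedding_dim)
linear1 = nn.Linear(context_size * embedding_dim, 128)
linear2 = nn.Linear(128, vocab_size)

context = torch.tensor([4, 2])          # two arbitrary word indices
embedded = embeddings(context)          # shape (2, 5): one row per context word
flat = embedded.view((1, -1))           # shape (1, 10)
hidden = F.relu(linear1(flat))          # shape (1, 128)
scores = linear2(hidden)                # shape (1, 13): one score per word
log_probs = F.log_softmax(scores, dim=1)

for name, t in [('embedded', embedded), ('flat', flat),
                ('hidden', hidden), ('scores', scores)]:
    print(name, tuple(t.shape))

# Exponentiating the log-probabilities recovers a distribution over the vocabulary.
print(log_probs.exp().sum().item())
```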

We take our simple text, create a vocabulary and a mapping between unique words and integers.

In [41]:
vocabulary = set(test_text.split())
vocabulary_size = len(vocabulary)
word_to_ix = {w: i for i, w in enumerate(vocabulary)}
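
Because `vocabulary` is a `set`, the order in which `enumerate` visits the words is not guaranteed, so the indices can change from one run to the next. If reproducible indices are wanted, one option (an assumption of this sketch, not something the notebook does) is to sort the words first. The names `sorted_vocabulary` and `stable_word_to_ix` are hypothetical.

```python
test_text = 'One day I woke up, or so I believed, but everything I knew was different.'

# Sorting the unique tokens fixes the word -> index assignment across runs.
sorted_vocabulary = sorted(set(test_text.split()))
stable_word_to_ix = {w: i for i, w in enumerate(sorted_vocabulary)}

print(len(sorted_vocabulary))
print(stable_word_to_ix['I'], stable_word_to_ix['One'])
```

Uppercase letters sort before lowercase ones, so `I` and `One` receive the two smallest indices.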

We can now create the trigrams, which will then be mapped to their respective indices.

In [42]:
trigrams = extract_ngrams(test_text, n=2)
print(trigrams[0])

(['One', 'day'], 'I')


Let's convert the context of this first trigram into an integer torch variable and pass it through the model.

In [43]:
context = [word_to_ix[word] for word in trigrams[0][0]]
context_var = Variable(torch.LongTensor(context))
print(context_var)

Variable containing:
 4
 2
[torch.LongTensor of size 2]
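
The same conversion applies to every trigram, targets included. The sketch below rebuilds the pieces it needs so that it runs on its own. It uses a sorted vocabulary for repeatable indices (an assumption of this sketch) and keeps the indices as plain Python lists, which would then be wrapped in `torch.LongTensor` as above. The names `encoded`, `context_ix` and `target_ix` are hypothetical.

```python
test_text = 'One day I woke up, or so I believed, but everything I knew was different.'
tokens = test_text.split()

# Sorted for repeatable indices; the notebook itself enumerates an unordered set.
word_to_ix = {w: i for i, w in enumerate(sorted(set(tokens)))}

# (context words, target word) pairs with a context of size 2.
trigrams = [(tokens[i:i + 2], tokens[i + 2]) for i in range(len(tokens) - 2)]

# Replace every word by its integer index.
encoded = [([word_to_ix[w] for w in context], word_to_ix[target])
           for context, target in trigrams]

context_ix, target_ix = encoded[0]
print(trigrams[0], '->', (context_ix, target_ix))
```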



In [44]:
test_ngram = NGramLanguageModeler(vocab_size=vocabulary_size, embedding_dim=5,
                                  context_size=2, debug=True)
test_output = test_ngram(context_var)

Before reshaping: torch.Size([2, 5])
After reshaping:  torch.Size([1, 10])


From the output above we can see that the embeddings are flattened before being processed by the fully connected layers.
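
Flattening with `view((1, -1))` lays the rows of the embedding matrix side by side, so the first `embedding_dim` entries belong to the first context word and the next `embedding_dim` entries to the second. A minimal check, assuming PyTorch 0.4 or later and using a hand-made tensor (`fake_embeds`, a hypothetical name) in place of real embeddings:

```python
import torch

# Stand-in for the (context_size, embedding_dim) output of the embedding layer.
fake_embeds = torch.arange(10.0).view(2, 5)
flat = fake_embeds.view((1, -1))

print(fake_embeds.size(), flat.size())

# The flattened vector is the concatenation of the rows, in order.
print(torch.equal(flat[0], torch.cat([fake_embeds[0], fake_embeds[1]])))
```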