# Word Embeddings

Word embeddings are a featurised representation of each word in our vocabulary. One of their main advantages is that they allow a model to generalise across words. With one-hot encoding, each word is treated in isolation, with no information about how words relate to one another, for example boy vs. girl, apple vs. orange, or old vs. young. In addition, one-hot vectors quickly become impractical as the vocabulary grows. Let's say we have a vocabulary of 1 million words and convert each word into a one-hot vector. Imagine the memory cost of loading a passage 1 billion words long, as well as the computational cost of calculating the softmax function over 1 million classes. As such, we require an alternative representation.
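To make the first point concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration. It shows that the dot product between the one-hot vectors of any two different words is always zero, so the encoding cannot express that "boy" is closer to "girl" than to "apple".

```python
def one_hot(index, vocab_size):
    """Return a list with a 1 at `index` and 0 everywhere else."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy vocabulary, for illustration only
vocab = ["boy", "girl", "apple", "orange"]
word_to_idx = {word: i for i, word in enumerate(vocab)}
vectors = {w: one_hot(i, len(vocab)) for w, i in word_to_idx.items()}

# Distinct words never overlap, whether or not they are related
boy_girl = dot(vectors["boy"], vectors["girl"])
boy_apple = dot(vectors["boy"], vectors["apple"])
print(boy_girl, boy_apple)
```

Each vector also has as many entries as there are words in the vocabulary, which is where the memory cost comes from.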

Word embeddings attempt to solve these issues by mapping words to a smaller set of features, which allows the network to generalise better, especially with unseen words. They capitalise on the hidden semantic relationships between words to reduce the dimensionality of the representation. For example, if we have a vocab size of 10,000 words, each word goes from a 10,000-dimensional one-hot vector to a featurised representation of, let's say, 300 features. With this reduced representation, the model is forced to pack the main differentiating characteristics of each word into these 300 features.
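One way to picture this mapping is as a table with one row of features per word. The sketch below uses plain Python with tiny stand-in sizes (6 words, 3 features) and random values in place of learned ones. The names `E`, `one_hot` and `vec_mat` are illustrative. It shows that multiplying a one-hot vector by this table gives the same result as simply reading off the corresponding row.

```python
import random

random.seed(0)

vocab_size = 6   # stand-in for the 10,000-word vocabulary
n_features = 3   # stand-in for the 300 features

# Embedding table: one row of n_features numbers per word
# (random here; in a trained model these values are learned)
E = [[random.uniform(-1, 1) for _ in range(n_features)] for _ in range(vocab_size)]

def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def vec_mat(vec, mat):
    """Multiply a row vector of length V by a V x D matrix."""
    n_cols = len(mat[0])
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(n_cols)]

word_idx = 4
via_matmul = vec_mat(one_hot(word_idx, vocab_size), E)
via_lookup = E[word_idx]   # same numbers, without the multiplication

# Number of values in the table at the sizes quoted in the text
full_size_params = 10_000 * 300
print(via_matmul, via_lookup, full_size_params)
```

This is why an embedding layer can take a word index directly: the lookup replaces the one-hot multiplication.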

In this series, we will focus on a few approaches to training word embeddings, namely a Neural Language Model, the Skip-Gram (Word2Vec) model and the GloVe model.

# Preparing the data

I will be using the Penn Treebank dataset to train a word embedding using a Neural Language Model. We can download the dataset using the classes available in torchtext.datasets.

In [1]:
from torchtext.datasets import PennTreebank
from torchtext.data import Field
from torchtext.data.utils import get_tokenizer
from torchtext.data import BPTTIterator

In [2]:
# preprocess
prep_text = Field(lower=True, batch_first=True, tokenize="spacy")
# load dataset
train, valid, test = PennTreebank.splits(prep_text, root="../../data/NLP")

In [3]:
# build vocab object
prep_text.build_vocab(train)

In [4]:
# create iterator
seq_len = 30
batch_size = 32

train_iter, valid_iter, test_iter = BPTTIterator.splits([train, valid, test], batch_size=batch_size, bptt_len=seq_len, device="cuda")
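For a language model, the target at each position is simply the next token in the text. The sketch below is a simplified, single-stream illustration of that idea in plain Python, not torchtext's actual implementation. The function name `bptt_windows` and the toy token ids are assumptions made for the example.

```python
def bptt_windows(token_ids, bptt_len):
    """Yield (text, target) windows where target is text shifted one step ahead."""
    for start in range(0, len(token_ids) - 1, bptt_len):
        # The last window may be shorter than bptt_len
        seq_len = min(bptt_len, len(token_ids) - 1 - start)
        text = token_ids[start:start + seq_len]
        target = token_ids[start + 1:start + 1 + seq_len]
        yield text, target

stream = list(range(10))   # hypothetical token ids 0..9
windows = list(bptt_windows(stream, bptt_len=4))
for text, target in windows:
    print(text, "->", target)
```

Because the final window can be shorter than the requested length, code that consumes these batches should not assume every batch holds exactly batch_size * seq_len tokens.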

In [5]:
# sample batch
batch = next(iter(train_iter))

In [6]:
vars(batch).keys()

dict_keys(['batch_size', 'dataset', 'fields', 'text', 'target'])

In [7]:
print(batch.text)
print(batch.text.shape)
print(batch.target)
print(batch.target.shape)

tensor([[   6, 9703, 9704, 9705, 9707, 9708, 9709, 9712, 9713, 9714, 9715,   23,
         5160, 9716, 9718, 9719, 9720, 9721, 9723, 9724, 9725, 9726, 9727, 9728,
           23,  508, 9729, 9730, 9731,    7],
        [   4,   13,   30,   14,  762,  127,    2, 1683,  109,  298,   19,  108,
         1218,   17,  304,    7,    6,   72,  293,  780, 2480,   10,    3,    5,
            4,    2, 3228, 6058,    9, 1642],
        [   4,  122, 7942,    7,    6, 2330,  946,    7,    6,   38,  177,   23,
          891,  234,   19,  595, 4253, 1173,   92,   19, 2965, 1580,  195,    2,
          205,   12,   42,   99,   13,  464],
        [  21,    7,    6,    2,  515,  923,   67,   49,   32, 4557,   22,    2,
         4466,  160,  986,   10, 2008,  676,   35,   21,    7,    6,  113,   49,
           32,    3,    5,    4,   10, 3725],
        [5364, 1149,    7,    6,   10,  456,    3,    5,    4,    3,    5,    4,
            2,    3,    5,    4, 4986,    3,    5,    4,   52, 7950, 3303,    2,
      

# Neural Language Model

In [8]:
import torch
from torch import nn, optim
import torch.nn.functional as F

import matplotlib.pyplot as plt
%matplotlib inline

In [19]:
class Lang_Model(nn.Module):

    def __init__(self, vocab_size, n_embedding, n_hidden, n_layers, drop_prob=0):
        super(Lang_Model, self).__init__()

        # Embedding layer: maps integer word indices to dense vectors
        self.emb = nn.Embedding(vocab_size, n_embedding, sparse=True)

        # RNN layer
        self.rnn = nn.RNN(n_embedding, n_hidden, n_layers, batch_first=True, nonlinearity='relu')

        # Dropout layer
        self.dropout = nn.Dropout(drop_prob)

        # Linear layer: projects each hidden state to one score per vocabulary word
        self.output = nn.Linear(n_hidden, vocab_size)

        # HPs
        self.n_hidden = n_hidden
        self.n_layers = n_layers

    def forward(self, x, hidden):
        # x: word indices of shape (batch, seq_len)
        # hidden: a[0], the initial activation to be fed into the first cell

        # Look up the embedding of each word index
        output = self.dropout(self.emb(x))

        # RNN
        output, hidden = self.rnn(output, hidden)

        # dropout layer
        output = self.dropout(output)

        # flatten to (batch * seq_len, n_hidden), using the attribute rather than a global
        output = output.contiguous().view(-1, self.n_hidden)

        # output layer
        output = self.output(output)

        return output, hidden

    def init_hidden(self, batch_size, cuda): # Initialise our hidden units, a[0]
        # zeros with dimensions layers x batch size x hidden units
        device = 'cuda' if cuda else 'cpu'
        return torch.zeros(self.n_layers, batch_size, self.n_hidden, dtype=torch.float, device=device)

In [10]:
# check for gpu
cuda = torch.cuda.is_available()
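To show what the RNN layer does at each time step, here is a minimal sketch of the recurrence with a ReLU nonlinearity, h_t = relu(W_ih x_t + b_ih + W_hh h_(t-1) + b_hh), in plain Python. The sizes and weight values are made up for illustration, and the helper names `relu`, `mat_vec` and `rnn_step` are not part of PyTorch.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def mat_vec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]

def rnn_step(x_t, h_prev, W_ih, W_hh, b_ih, b_hh):
    """One time step: h_t = relu(W_ih x_t + b_ih + W_hh h_prev + b_hh)."""
    from_input = mat_vec(W_ih, x_t)
    from_hidden = mat_vec(W_hh, h_prev)
    return relu([a + b + c + d for a, b, c, d in zip(from_input, b_ih, from_hidden, b_hh)])

# Hypothetical tiny sizes: 2 input features, 2 hidden units
W_ih = [[1.0, 0.0], [0.0, -1.0]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
b_ih = [0.0, 0.0]
b_hh = [0.0, 0.0]

h = [0.0, 0.0]   # a[0], the initial hidden state
for x_t in [[1.0, 2.0], [3.0, 1.0]]:
    h = rnn_step(x_t, h, W_ih, W_hh, b_ih, b_hh)
    print(h)
```

The hidden state returned after the last step is what gets passed on as the initial state for the next batch.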

# Training the Model

In [14]:
# HPs
vocab_size = len(prep_text.vocab.itos)
n_embedding = 50
n_hidden = 32
n_layers = 1
drop_prob = 0.2

# initialise model
lang_model = Lang_Model(vocab_size, n_embedding, n_hidden, n_layers, drop_prob)
print(lang_model)

if cuda:
    lang_model.cuda()

Lang_Model(
  (emb): Embedding(9732, 50, sparse=True)
  (rnn): RNN(50, 32, batch_first=True)
  (dropout): Dropout(p=0.2, inplace=False)
  (output): Linear(in_features=32, out_features=9732, bias=True)
)
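As a quick sanity check on the printed layer sizes, the number of trainable parameters can be worked out by hand. The sketch below does the arithmetic in plain Python. It assumes the usual layout of a single-layer RNN with two bias vectors (one for the input weights and one for the hidden weights), and the variable names are only for illustration.

```python
vocab_size, n_embedding, n_hidden = 9732, 50, 32

# Embedding table: one row of n_embedding values per word
emb_params = vocab_size * n_embedding

# Single-layer RNN: input weights, hidden weights and two bias vectors
rnn_params = n_hidden * n_embedding + n_hidden * n_hidden + 2 * n_hidden

# Output layer: weights plus one bias per vocabulary word
out_params = n_hidden * vocab_size + vocab_size

total_params = emb_params + rnn_params + out_params
print(emb_params, rnn_params, out_params, total_params)
```

Almost all of the parameters sit in the embedding table and the output layer, both of which scale with the vocabulary size.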


In [15]:
# Loss & optimizer
criterion = nn.CrossEntropyLoss()

# The embedding was created with sparse=True, so its gradients are sparse.
# Adam does not accept sparse gradients, so the embedding gets SparseAdam
# and the remaining (dense) parameters keep Adam.
emb_params = list(lang_model.emb.parameters())
dense_params = [p for name, p in lang_model.named_parameters() if not name.startswith("emb.")]
emb_optimiser = torch.optim.SparseAdam(emb_params, lr=0.01)
optimiser = torch.optim.Adam(dense_params, lr=0.01)

In [18]:
import numpy as np

epochs = 1
clip = 5 # clip gradients to avoid exploding gradients
train_loss = []
val_loss = []

for e in range(epochs):
    print(e)

    # training set
    lang_model.train()
    # initialise hidden state
    hid = lang_model.init_hidden(batch_size, cuda)
    running_train_loss = []

    for batch in train_iter:

        # nn.Embedding and CrossEntropyLoss both take integer word indices,
        # so the batch is used as-is rather than one-hot encoded
        text = batch.text
        targets = batch.target.reshape(-1)

        # zero gradients
        lang_model.zero_grad()

        # forward prop
        output, hid = lang_model(text, hid)
        hid = hid.detach() # carry the state forward, but not its gradient history

        # calculate loss
        loss = criterion(output, targets)
        running_train_loss.append(loss.item())
        loss.backward() # backprop

        # clip the gradients of the dense (RNN and output) parameters
        nn.utils.clip_grad_norm_(dense_params, clip)
        optimiser.step()
        emb_optimiser.step()

    train_loss.append(np.mean(running_train_loss))

    # val set
    lang_model.eval() # switch off dropout
    # initialise hidden state
    hid = lang_model.init_hidden(batch_size, cuda)
    running_val_loss = []

    with torch.no_grad(): # no gradients needed for evaluation
        for batch in valid_iter:
            text = batch.text
            targets = batch.target.reshape(-1)

            # forward prop
            output, hid = lang_model(text, hid)

            # calculate loss
            loss = criterion(output, targets)
            running_val_loss.append(loss.item())

    val_loss.append(np.mean(running_val_loss))

    print(f"Epochs: {int(e+1)}/{int(epochs)} --- Train loss: {np.mean(running_train_loss)} --- Val loss: {np.mean(running_val_loss)}")

Note that nn.Embedding expects integer word indices. If batch.text is one-hot encoded with F.one_hot before being passed to the model, the embedding layer looks up a vector for every 0 and 1 in the one-hot tensor and tries to build a tensor of shape (32, 30, 9732, 50), which on this GPU fails with:

RuntimeError: CUDA out of memory. Tried to allocate 1.74 GiB (GPU 0; 7.79 GiB total capacity; 5.45 GiB already allocated; 1.46 GiB free; 5.47 GiB reserved in total by PyTorch)
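The size in this error message can be checked by hand. The arithmetic below, in plain Python, assumes 4-byte float32 values and uses the batch size, sequence length, vocabulary size and embedding size from this notebook. It is consistent with the embedding layer receiving a one-hot tensor of shape (32, 30, 9732) rather than word indices of shape (32, 30). The helper name `gib` is only for illustration.

```python
batch_size, seq_len = 32, 30
vocab_size, n_embedding = 9732, 50
bytes_per_float32 = 4

def gib(n_elements):
    """Size in GiB of a float32 tensor with n_elements entries."""
    return n_elements * bytes_per_float32 / 2**30

# One-hot input: the embedding returns one vector per entry of the one-hot tensor
one_hot_input = gib(batch_size * seq_len * vocab_size * n_embedding)

# Index input: one vector per word
index_input = gib(batch_size * seq_len * n_embedding)

print(round(one_hot_input, 2), index_input)
```

The one-hot case works out to the 1.74 GiB reported in the error, while the index case needs well under a megabyte for the same batch.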