https://abhinavcreed13.github.io/blog/bengio-trigram-nplm-using-pytorch/

In this notebook, we implement Bengio's Neural Probabilistic Language Model (NPLM), proposed in 2003, using the PyTorch library with GPU acceleration.

Bengio's NPLM Paper: http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

We will use the Brown corpus from NLTK to build this language model.

## Load Brown Corpus

In [1]:
import nltk
import csv
from nltk.corpus import brown
from nltk.corpus import wordnet

nltk.download("brown")
nltk.download("wordnet")

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\avglinsky\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\avglinsky\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
len(brown.paras())

15667

The Brown corpus contains 15,667 paragraphs. We will use the first 12,000 paragraphs as the training set and the remaining 3,667 as the development set for testing.

## Create training and development set

The vocabulary is built from the training set only, and a word is added to it only if its term frequency is at least 5. Every other word is mapped to the `<UNK>` symbol. We first create the vocabulary as shown below.

In [3]:
num_train = 12000
UNK_symbol = "<UNK>"
vocab = set([UNK_symbol])

# create brown corpus again with all words
# no preprocessing, only lowercase
brown_corpus_train = []
for idx,paragraph in enumerate(brown.paras()):
    if idx == num_train:
        break
    words = []
    for sentence in paragraph:
        for word in sentence:
            words.append(word.lower())
    brown_corpus_train.append(words)

# create term frequency of the words
words_term_frequency_train = {}
for doc in brown_corpus_train:
    for word in doc:
        # this will calculate term frequency
        # since we are taking all words now
        words_term_frequency_train[word] = words_term_frequency_train.get(word,0) + 1

# create vocabulary
for doc in brown_corpus_train:
    for word in doc:
        if words_term_frequency_train.get(word,0) >= 5:
            vocab.add(word)

print(len(vocab))

12681
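
The same frequency-threshold idea can be sketched more compactly with `collections.Counter`. This is a minimal illustration on a made-up toy corpus, not the notebook's code: `toy_corpus`, `toy_vocab`, and the threshold of 2 are assumptions chosen so the result is easy to check by hand.

```python
from collections import Counter

UNK_SYMBOL = "<UNK>"
MIN_FREQ = 2  # toy threshold (assumption); the notebook uses 5

# toy stand-in for brown_corpus_train: a list of lowercased token lists
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["a", "cat", "ran"],
]

# count every token across all documents
term_frequency = Counter(word for doc in toy_corpus for word in doc)

# keep tokens that meet the threshold, plus the unknown-word symbol
toy_vocab = {UNK_SYMBOL} | {w for w, c in term_frequency.items() if c >= MIN_FREQ}

print(sorted(toy_vocab))
```

Here "dog", "a", and "ran" each occur once, so they fall below the threshold and would later be looked up as `<UNK>`.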


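Now we create the training and dev sets for our trigram-based NPLM. Each example uses two consecutive words as the input and the word that follows them as the target.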

In [4]:
import numpy as np
# create required lists
x_train = []
y_train = []
x_dev = []
y_dev = []

# create word to id mappings
word_to_id_mappings = {}
for idx,word in enumerate(vocab):
    word_to_id_mappings[word] = idx

# function to get id for a given word
# return <UNK> id if not found
def get_id_of_word(word):
    unknown_word_id = word_to_id_mappings['<UNK>']
    return word_to_id_mappings.get(word,unknown_word_id)

# creating training and dev set
for idx,paragraph in enumerate(brown.paras()):
    for sentence in paragraph:
        for i,word in enumerate(sentence):
            if i+2 >= len(sentence):
                # sentence boundary reached
                # ignoring sentence less than 3 words
                break
            # convert word to id
            x_extract = [get_id_of_word(word.lower()),get_id_of_word(sentence[i+1].lower())]
            y_extract = [get_id_of_word(sentence[i+2].lower())]
            if idx < num_train:
                x_train.append(x_extract)
                y_train.append(y_extract)
            else:
                x_dev.append(x_extract)
                y_dev.append(y_extract)

# making numpy arrays
x_train = np.array(x_train)
y_train = np.array(y_train)
x_dev = np.array(x_dev)
y_dev = np.array(y_dev)

print(x_train.shape)
print(y_train.shape)
print(x_dev.shape)
print(y_dev.shape)

(872823, 2)
(872823, 1)
(174016, 2)
(174016, 1)
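
To make the sliding window concrete, here is a small sketch of the extraction step on a single toy sentence. The helper name `extract_trigrams` is hypothetical and the sketch works on words rather than ids, but the windowing mirrors the loop above: a sentence of n words yields n - 2 examples, and sentences shorter than three words yield none.

```python
def extract_trigrams(sentence):
    """Return (context, target) pairs: two consecutive words predict the third."""
    pairs = []
    for i in range(len(sentence) - 2):
        pairs.append(((sentence[i], sentence[i + 1]), sentence[i + 2]))
    return pairs

toy_sentence = ["the", "cat", "sat", "down"]
for context, target in extract_trigrams(toy_sentence):
    print(context, "->", target)

# a two-word sentence is too short to produce any example
print(extract_trigrams(["hi", "there"]))
```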


## Building Trigram NPLM with PyTorch

Now let's build the trigram neural language model. We'll use the language model described in Bengio's paper as:

$x' = e(x_1) \oplus e(x_2)$

$h = \tanh(W_1 x' + b)$

$y = \mathrm{softmax}(W_2 h)$

where $\oplus$ is the concatenation operation, $x_1$ and $x_2$ are the input words, $e$ is an embedding function, and $y$ is the predicted probability distribution over the target (next) word.

In the code below, the word embeddings have dimension 200 (`EMBEDDING_DIM`), the hidden layer $h$ has dimension 100 (`H`), and the model is trained for 5 epochs with a batch size of 256.
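
The three equations above can be traced by hand on a tiny example. The following is a minimal pure-Python sketch, assuming toy sizes (`V`, `D`, `H_DIM`) and random weights that have nothing to do with the trained model. It only illustrates the order of operations: look up two embeddings, concatenate, apply the tanh layer, then softmax over the vocabulary.

```python
import math
import random

random.seed(0)
V, D, H_DIM = 5, 3, 4  # toy vocabulary, embedding, and hidden sizes (assumed)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(V, D)           # embedding table e(.)
W1 = rand_matrix(H_DIM, 2 * D)  # hidden layer weights
b = [0.0] * H_DIM               # hidden layer bias
W2 = rand_matrix(V, H_DIM)      # output weights (no bias, as in the equations)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(x1, x2):
    x_prime = E[x1] + E[x2]  # list concatenation plays the role of the concat operator
    h = [math.tanh(z + bi) for z, bi in zip(matvec(W1, x_prime), b)]
    logits = matvec(W2, h)
    m = max(logits)          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]  # softmax over the vocabulary

y = forward(1, 3)
print(len(y), sum(y))
```

The output `y` has one entry per vocabulary word and sums to 1, which is what makes it usable as a next-word distribution.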

### Create DataLoaders

In [5]:
# load libraries
import torch
import multiprocessing
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import time

# create parameters
gpu = 0
EMBEDDING_DIM = 200
CONTEXT_SIZE = 2
BATCH_SIZE = 256
H = 100
torch.manual_seed(13013)

# check if gpu is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
available_workers = multiprocessing.cpu_count()

print("--- Creating training and dev dataloaders with {} batch size ---".format(BATCH_SIZE))
train_set = np.concatenate((x_train, y_train), axis=1)
dev_set = np.concatenate((x_dev, y_dev), axis=1)
train_loader = DataLoader(train_set, batch_size = BATCH_SIZE, num_workers = available_workers)
dev_loader = DataLoader(dev_set, batch_size = BATCH_SIZE, num_workers = available_workers)

--- Creating training and dev dataloaders with 256 batch size ---
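
As a quick check on how many iterations to expect per epoch, the batch counts can be worked out from the array shapes printed earlier. This sketch assumes the `DataLoader` default of keeping the final partial batch (`drop_last=False`), which is what the calls above use.

```python
import math

BATCH_SIZE = 256
n_train, n_dev = 872823, 174016  # example counts printed earlier

train_batches = math.ceil(n_train / BATCH_SIZE)
dev_batches = math.ceil(n_dev / BATCH_SIZE)

# size of the final, smaller training batch
last_train_batch = n_train - (train_batches - 1) * BATCH_SIZE

print(train_batches, dev_batches, last_train_batch)
```

With 3410 training batches, a log line every 500 iterations appears at iterations 0 through 3000. With 680 dev batches, only iterations 0 and 500 are logged.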


### Create PyTorch Trigram Model

In [6]:
# Trigram Neural Network Model
class TrigramNNmodel(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(TrigramNNmodel, self).__init__()
        self.context_size = context_size
        self.embedding_dim = embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.linear2 = nn.Linear(h, vocab_size, bias = False)

    def forward(self, inputs):
        # compute x': concatenation of x1 and x2 embeddings
        embeds = self.embeddings(inputs).view((-1,self.context_size * self.embedding_dim))
        # compute h: tanh(W_1.x' + b)
        out = torch.tanh(self.linear1(embeds))
        # compute W_2.h
        out = self.linear2(out)
        # compute y: log_softmax(W_2.h)
        log_probs = F.log_softmax(out, dim=1)
        # return log probabilities
        # BATCH_SIZE x len(vocab)
        return log_probs
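
It can help to know how large this model is. The arithmetic below counts the trainable parameters of the three layers, assuming the vocabulary size of 12681 printed earlier and the hyperparameters set in the previous cell.

```python
vocab_size, embedding_dim, context_size, h = 12681, 200, 2, 100

embedding_params = vocab_size * embedding_dim          # nn.Embedding table
linear1_params = context_size * embedding_dim * h + h  # weights plus bias
linear2_params = h * vocab_size                        # bias=False, weights only

total = embedding_params + linear1_params + linear2_params
print(embedding_params, linear1_params, linear2_params, total)
```

Almost all parameters sit in the embedding table and the output layer, both of which scale with the vocabulary size. The hidden layer is small by comparison.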

### Create training helper functions

In [7]:
# helper function to get accuracy from log probabilities
def get_accuracy_from_log_probs(log_probs, labels):
    probs = torch.exp(log_probs)
    predicted_label = torch.argmax(probs, dim=1)
    acc = (predicted_label == labels).float().mean()
    return acc

# helper function to evaluate model on dev data
def evaluate(model, criterion, dataloader):
    model.eval()

    mean_acc, mean_loss = 0, 0
    count = 0

    with torch.no_grad():
        dev_st = time.time()
        for it, data_tensor in enumerate(dataloader):
            context_tensor = data_tensor[:,0:2]
            target_tensor = data_tensor[:,2].long()
            context_tensor, target_tensor = context_tensor.to(device), target_tensor.to(device)
            log_probs = model(context_tensor)
            mean_loss += criterion(log_probs, target_tensor).item()
            mean_acc += get_accuracy_from_log_probs(log_probs, target_tensor)
            count += 1
            if it % 500 == 0:
                print("Dev Iteration {} complete. Mean Loss: {}; Mean Acc:{}; Time taken (s): {}".format(it, mean_loss / count, mean_acc / count, (time.time()-dev_st)))
                dev_st = time.time()

    return mean_acc / count, mean_loss / count
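
The accuracy helper boils down to "take the highest-scoring word in each row and compare it with the label". Because `exp` is monotonic, the argmax of the log probabilities is the same as the argmax of the probabilities. Below is a minimal pure-Python sketch of that idea on a hand-made batch. The function name and the toy numbers are illustrative assumptions.

```python
import math

def accuracy_from_log_probs(log_probs, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(log_probs, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)

# toy batch: 3 examples over a 3-word vocabulary
toy_log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.6), math.log(0.3)],
    [math.log(0.5), math.log(0.3), math.log(0.2)],
]
toy_labels = [0, 1, 2]

toy_acc = accuracy_from_log_probs(toy_log_probs, toy_labels)
print(toy_acc)
```

The first two rows are predicted correctly and the third is not, so the accuracy is 2/3. Note that `evaluate` averages per-batch means, so a smaller final batch carries the same weight as a full one. For large datasets the difference from the exact per-example mean is likely negligible.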

### Train & Save Model

In [None]:
# Using negative log-likelihood loss
loss_function = nn.NLLLoss()

# create model
model = TrigramNNmodel(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE, H)

# load it to device
model.to(device)

# using ADAM optimizer
optimizer = optim.Adam(model.parameters(), lr = 2e-3)


# ------------------------- TRAIN & SAVE MODEL ------------------------
best_acc = 0
best_model_path = None
for epoch in range(5):
    st = time.time()
    print("\n--- Training model Epoch: {} ---".format(epoch+1))
    for it, data_tensor in enumerate(train_loader):
        context_tensor = data_tensor[:,0:2]
        target_tensor = data_tensor[:,2].long()

        context_tensor, target_tensor = context_tensor.to(device), target_tensor.to(device)

        # zero out the gradients from the old instance
        model.zero_grad()

        # get log probabilities over next words
        log_probs = model(context_tensor)

        # calculate current accuracy
        acc = get_accuracy_from_log_probs(log_probs, target_tensor)

        # compute loss function
        loss = loss_function(log_probs, target_tensor)

        # backward pass and update gradient
        loss.backward()
        optimizer.step()

        if it % 500 == 0:
            print("Training Iteration {} of epoch {} complete. Loss: {}; Acc:{}; Time taken (s): {}".format(it, epoch, loss.item(), acc, (time.time()-st)))
            st = time.time()

    print("\n--- Evaluating model on dev data ---")
    dev_acc, dev_loss = evaluate(model, loss_function, dev_loader)
    print("Epoch {} complete! Development Accuracy: {}; Development Loss: {}".format(epoch, dev_acc, dev_loss))
    if dev_acc > best_acc:
        print("Best development accuracy improved from {} to {}, saving model...".format(best_acc, dev_acc))
        best_acc = dev_acc
        # set best model path
        best_model_path = 'best_model_{}.dat'.format(epoch)
        # saving best model
        torch.save(model.state_dict(), best_model_path)


--- Training model Epoch: 1 ---
Training Iteration 0 of epoch 0 complete. Loss: 9.486557006835938; Acc:0.0; Time taken (s): 53.052590131759644
Training Iteration 500 of epoch 0 complete. Loss: 6.186699867248535; Acc:0.16015625; Time taken (s): 83.95976448059082
Training Iteration 1000 of epoch 0 complete. Loss: 6.067230701446533; Acc:0.15625; Time taken (s): 83.26248860359192
Training Iteration 1500 of epoch 0 complete. Loss: 6.063997268676758; Acc:0.16015625; Time taken (s): 75.96039462089539
Training Iteration 2000 of epoch 0 complete. Loss: 5.937950134277344; Acc:0.12109375; Time taken (s): 59.88210606575012
Training Iteration 2500 of epoch 0 complete. Loss: 6.128658294677734; Acc:0.1484375; Time taken (s): 69.83636498451233
Training Iteration 3000 of epoch 0 complete. Loss: 5.733619689941406; Acc:0.1875; Time taken (s): 65.17669439315796

--- Evaluating model on dev data ---
Dev Iteration 0 complete. Mean Loss: 4.984565258026123; Mean Acc:0.1796875; Time taken (s): 35.965463638305
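
A useful sanity check on these numbers: a model that assigns equal probability to every word in the vocabulary has a negative log-likelihood of $\ln |V|$. The sketch below computes that baseline for the vocabulary size printed earlier, along with perplexity, which is the exponential of the mean loss. The `perplexity` helper is a hypothetical name added for illustration.

```python
import math

vocab_size = 12681

# NLL of a model that assigns probability 1/|V| to every word
uniform_loss = math.log(vocab_size)

def perplexity(mean_nll):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll)

print(round(uniform_loss, 4))
print(round(perplexity(uniform_loss)))
```

The baseline is about 9.45, close to the loss of 9.49 logged at iteration 0, which is consistent with a freshly initialized model predicting almost uniformly. Losses near 6 later in the epoch correspond to a much lower perplexity than the vocabulary size.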

## Load best model & calculate similarities

In [None]:
# ---------------------- Loading Best Model -------------------
best_model = TrigramNNmodel(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE, H)
best_model.load_state_dict(torch.load(best_model_path))
best_model.to(device)

cos = nn.CosineSimilarity(dim=0)

lm_similarities = {}
# word pairs to calculate similarity
words = {('computer','keyboard'),('cat','dog'),('dog','car'),('keyboard','cat')}

# ----------- Calculate LM similarities using cosine similarity ----------
for word_pairs in words:
    w1 = word_pairs[0]
    w2 = word_pairs[1]
    words_tensor = torch.LongTensor([get_id_of_word(w1),get_id_of_word(w2)])
    words_tensor = words_tensor.to(device)
    # get word embeddings from the best model
    words_embeds = best_model.embeddings(words_tensor)
    # calculate cosine similarity between word vectors
    sim = cos(words_embeds[0],words_embeds[1])
    lm_similarities[word_pairs] = sim.item()

print(lm_similarities)
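
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The pure-Python sketch below shows the computation on hand-picked vectors. It is an illustration only: `nn.CosineSimilarity` additionally guards against zero-length vectors with a small `eps`, which this sketch omits.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])

print(parallel, orthogonal, opposite)
```

Values near 1 mean the two embeddings point in the same direction, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions.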