## Language Models Lab ##

In these notebooks, we explore important techniques, approaches, and applications of language models, mainly for Natural Language Processing (NLP) tasks.

We will explore the following:

- Creating Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks
- Word2Vec
    - Continuous Bag-Of-Words (CBOW)
- Using RNNs in practice!
    - Text classification
- Seq2Seq
    - Using Torchtext
    - Machine Translation
- Using pre-trained models!

-------------
## Basic testing of RNN, LSTM, and GRU ##

In [42]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [43]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# What will happen here?
training_data = [
    # Tags are: DET - determiner; NN - noun; V - verb
    # For example, the word "The" is a determiner
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"]),
    ("Everybody does machine learning nowadays".split(), ["NN", "V", "NN", "NN", "ADV"]),
]
word_to_ix = {}
# For each words-list (sentence) and tags-list in each tuple of training_data
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:  # word has not been assigned an index yet
            word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index
            
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2, "ADV": 3}  # Assign each tag with a unique index

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8, 'does': 9, 'machine': 10, 'learning': 11, 'nowadays': 12}
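To make the vocabulary-building loop concrete, here is a minimal pure-Python sketch (no tensors) that rebuilds `word_to_ix` and encodes one sentence. The helper name `encode` is a hypothetical stand-in for `prepare_sequence` without the `torch.tensor` call. Note that "The" and "the" receive different indices because the lookup is case-sensitive.

```python
training_sentences = [
    "The dog ate the apple".split(),
    "Everybody read that book".split(),
    "Everybody does machine learning nowadays".split(),
]

word_to_ix = {}
for sent in training_sentences:
    for word in sent:
        # setdefault only assigns an index the first time a word is seen
        word_to_ix.setdefault(word, len(word_to_ix))


def encode(seq, to_ix):
    """Hypothetical tensor-free version of prepare_sequence."""
    return [to_ix[w] for w in seq]


print(encode("Everybody read that book".split(), word_to_ix))  # [5, 6, 7, 8]
print(word_to_ix["The"], word_to_ix["the"])  # 0 3
```

A word that never appeared in the training sentences would raise a `KeyError` here, which is acceptable for this toy data but would need handling on real text.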


In [44]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 12
VOCAB_SIZE = len(word_to_ix)
NUM_CLASSES = len(tag_to_ix)

In [45]:
def train(model, optimizer, criterion, epochs):
    epoch_loss = []
    for epoch in range(epochs):  # many epochs are fine here only because this is toy data
        final_loss = 0
        for sentence, tags in training_data:
            
            model.zero_grad()

            # get inputs and targets ready for the network!
            sentence_in = prepare_sequence(sentence, word_to_ix)
            targets = prepare_sequence(tags, tag_to_ix)

            # get the tag scores
            tag_scores = model(sentence_in)
            
            loss = criterion(tag_scores, targets)
            loss.backward()
            optimizer.step()
            final_loss += loss.item()
        epoch_loss.append(final_loss)
    
    return epoch_loss


In [46]:
def evaluate(model, test_sequence):
    """Print the gold tags and predicted tag indices for training_data[test_sequence]."""
    with torch.no_grad():
        inputs = prepare_sequence(training_data[test_sequence][0], word_to_ix)
        tag_scores = model(inputs)
        
        outputs = []
        
        print(tag_to_ix)
        print(training_data[test_sequence][0])
        print(training_data[test_sequence][1])
        
        for tag_score in tag_scores:
            outputs.append(tag_score.topk(1).indices.item())
            
        print(outputs)
        print("--------------")

### Recurrent Neural Networks (RNN) ###

In [47]:
class RNNTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(RNNTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The RNN takes word embeddings as inputs and returns the per-step outputs and the final hidden state
        self.rnn = nn.RNN(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        
        embeds = self.word_embeddings(sentence)
        rnn_out, _ = self.rnn(embeds.view(len(sentence), 1, -1))  # nn.RNN expects [sentence_length, batch_size, embedding_dim]
        
        # in this case, rnn_out.view(len(sentence), -1) is the same as doing what function?
        tag_space = self.hidden2tag(rnn_out.view(len(sentence), -1))
        
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
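The question in the `forward` comment can be checked directly. The sketch below uses a random tensor as a stand-in for `rnn_out` (the shape values are assumptions matching this notebook's batch size of 1) and compares `view(len(sentence), -1)` with `squeeze(1)`, which drops the singleton batch dimension.

```python
import torch

seq_len, hidden_dim = 5, 12
# Stand-in for rnn_out with shape [seq_len, batch_size=1, hidden_dim]
rnn_out = torch.randn(seq_len, 1, hidden_dim)

flat_view = rnn_out.view(seq_len, -1)   # what the tagger does
flat_squeeze = rnn_out.squeeze(1)       # remove the batch dimension of size 1

print(flat_view.shape)                       # torch.Size([5, 12])
print(torch.equal(flat_view, flat_squeeze))  # True
```

The two only agree because the batch size is 1; with a larger batch, `view(seq_len, -1)` would merge the batch and hidden dimensions instead.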

In [48]:
model = RNNTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
losses = train(model, optimizer, loss_function, 100)
print(losses)
evaluate(model, 0)
evaluate(model, 1)
evaluate(model, 2)

[4.371445059776306, 3.886128544807434, 3.5148985385894775, 3.22461599111557, 2.98644357919693, 2.7783373594284058, 2.587253510951996, 2.407369375228882, 2.2366693019866943, 2.074359506368637, 1.9197168052196503, 1.7719204127788544, 1.6303001046180725, 1.4946652054786682, 1.3655259013175964, 1.2440376430749893, 1.1315909326076508, 1.0292468965053558, 0.9373834431171417, 0.8557039797306061, 0.7834593057632446, 0.719685323536396, 0.6633712351322174, 0.6135533004999161, 0.5693598762154579, 0.5300257503986359, 0.4948904365301132, 0.4633893370628357, 0.4350424036383629, 0.4094421863555908, 0.3862423822283745, 0.36514855176210403, 0.3459091857075691, 0.3283092975616455, 0.3121637627482414, 0.29731322824954987, 0.28361961990594864, 0.270962618291378, 0.25923753529787064, 0.2483527697622776, 0.23822765797376633, 0.22879091650247574, 0.21997976303100586, 0.21173835918307304, 0.20401702262461185, 0.19677139073610306, 0.18996186554431915, 0.1835528053343296, 0.17751223035156727, 0.1718112826347351

### Long Short-Term Memory (LSTM) ###

In [49]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

In [50]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
losses = train(model, optimizer, loss_function, 100)
print(losses)


[4.0667842626571655, 3.9843209981918335, 3.915029764175415, 3.856734037399292, 3.8075640201568604, 3.7659271955490112, 3.7304794788360596, 3.700093388557434, 3.6738321781158447, 3.6509196758270264, 3.6307178735733032, 3.6127034425735474, 3.5964473485946655, 3.581599473953247, 3.5678738355636597, 3.5550355911254883, 3.542893409729004, 3.5312891006469727, 3.5200921297073364, 3.5091947317123413, 3.4985066652297974, 3.487951397895813, 3.477465271949768, 3.46699321269989, 3.45648729801178, 3.4459067583084106, 3.435214877128601, 3.4243804216384888, 3.413373112678528, 3.402167558670044, 3.3907400369644165, 3.3790682554244995, 3.367132067680359, 3.3549118041992188, 3.3423900604248047, 3.3295494318008423, 3.3163734674453735, 3.302847146987915, 3.2889543771743774, 3.274681806564331, 3.260014533996582, 3.2449395656585693, 3.2294440269470215, 3.2135143280029297, 3.197139620780945, 3.180307924747467, 3.1630082726478577, 3.1452309489250183, 3.1269665956497192, 3.108206868171692, 3.0889447927474976, 

In [51]:
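import math
# Note: each entry in `losses` is the sum of the per-sentence mean NLL over one epoch,
# so exp(loss) is the exponential of that sum, not a per-token perplexity.
for epoch, loss in enumerate(losses):
    print(f"Epoch {epoch}: Perplexity = {math.exp(loss):.2f}")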

Epoch 0: Perplexity = 58.37
Epoch 1: Perplexity = 53.75
Epoch 2: Perplexity = 50.15
Epoch 3: Perplexity = 47.31
Epoch 4: Perplexity = 45.04
Epoch 5: Perplexity = 43.20
Epoch 6: Perplexity = 41.70
Epoch 7: Perplexity = 40.45
Epoch 8: Perplexity = 39.40
Epoch 9: Perplexity = 38.51
Epoch 10: Perplexity = 37.74
Epoch 11: Perplexity = 37.07
Epoch 12: Perplexity = 36.47
Epoch 13: Perplexity = 35.93
Epoch 14: Perplexity = 35.44
Epoch 15: Perplexity = 34.99
Epoch 16: Perplexity = 34.57
Epoch 17: Perplexity = 34.17
Epoch 18: Perplexity = 33.79
Epoch 19: Perplexity = 33.42
Epoch 20: Perplexity = 33.07
Epoch 21: Perplexity = 32.72
Epoch 22: Perplexity = 32.38
Epoch 23: Perplexity = 32.04
Epoch 24: Perplexity = 31.71
Epoch 25: Perplexity = 31.37
Epoch 26: Perplexity = 31.04
Epoch 27: Perplexity = 30.70
Epoch 28: Perplexity = 30.37
Epoch 29: Perplexity = 30.03
Epoch 30: Perplexity = 29.69
Epoch 31: Perplexity = 29.34
Epoch 32: Perplexity = 29.00
Epoch 33: Perplexity = 28.64
Epoch 34: Perplexity = 2
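The values printed above exponentiate the epoch loss as returned by `train`, which is the sum of three per-sentence losses. A minimal sketch of a normalized variant follows; the function name `perplexity_from_epoch_loss` is hypothetical, and it assumes that averaging over sentences is an acceptable approximation of the per-token average (each per-sentence `NLLLoss` is already a mean over that sentence's tokens).

```python
import math


def perplexity_from_epoch_loss(epoch_loss_sum, num_sentences):
    """Exponentiate the average per-sentence NLL instead of the epoch sum."""
    avg_nll = epoch_loss_sum / num_sentences
    return math.exp(avg_nll)


first_epoch_sum = 4.0667842626571655  # first LSTM epoch loss from the output above

print(f"{math.exp(first_epoch_sum):.2f}")                       # 58.37, as printed in the notebook
print(f"{perplexity_from_epoch_loss(first_epoch_sum, 3):.2f}")  # 3.88
```

With four tags, a uniform guess gives a perplexity of 4, so the normalized figure is easier to interpret than the unnormalized one.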

In [52]:
evaluate(model, 0)
evaluate(model, 1)
evaluate(model, 2)

{'DET': 0, 'NN': 1, 'V': 2, 'ADV': 3}
['The', 'dog', 'ate', 'the', 'apple']
['DET', 'NN', 'V', 'DET', 'NN']
[0, 1, 2, 1, 1]
--------------
{'DET': 0, 'NN': 1, 'V': 2, 'ADV': 3}
['Everybody', 'read', 'that', 'book']
['NN', 'V', 'DET', 'NN']
[1, 2, 0, 1]
--------------
{'DET': 0, 'NN': 1, 'V': 2, 'ADV': 3}
['Everybody', 'does', 'machine', 'learning', 'nowadays']
['NN', 'V', 'NN', 'NN', 'ADV']
[1, 1, 1, 1, 1]
--------------
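The index lists printed by `evaluate` are easier to read once mapped back to tag names. The sketch below inverts `tag_to_ix` and scores the first LSTM prediction shown above; `ix_to_tag` and `tag_accuracy` are hypothetical helper names, not part of the notebook.

```python
tag_to_ix = {"DET": 0, "NN": 1, "V": 2, "ADV": 3}
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}


def tag_accuracy(predicted_ixs, gold_tags):
    """Return the predicted tag names and the fraction that match the gold tags."""
    predicted_tags = [ix_to_tag[i] for i in predicted_ixs]
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return predicted_tags, correct / len(gold_tags)


tags, acc = tag_accuracy([0, 1, 2, 1, 1], ["DET", "NN", "V", "DET", "NN"])
print(tags)  # ['DET', 'NN', 'V', 'NN', 'NN']
print(acc)   # 0.8
```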


## Replace LSTM and RNN with GRU ##

Implement a network with nn.GRU, and compare it with the other networks using loss and perplexity. Optionally, extend this toy example with more sentences, or vary the task, to test the networks and observe the differences.

In [53]:
class GRUTagger(nn.Module):
    def __init__(self, input_size, embedding_dim, hidden_size, output_size):
        super(GRUTagger, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_size)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        embedded = self.embedding(x)
        # 2-D input [sentence_length, embedding_dim] is treated as one unbatched
        # sequence (supported by recent PyTorch versions)
        output, hidden = self.gru(embedded)
        output = self.fc(output.view(len(x), -1))
        return F.log_softmax(output, dim=1)

    def initHidden(self):
        # Zero initial hidden state for an unbatched sequence; not used by
        # forward, because nn.GRU defaults to zeros when no state is passed
        return torch.zeros(1, self.hidden_size)


model_gru = GRUTagger(len(word_to_ix), embedding_dim=EMBEDDING_DIM, hidden_size=HIDDEN_DIM, output_size=len(tag_to_ix))
loss_fn = nn.NLLLoss()
optimizer = optim.Adam(model_gru.parameters(), lr=0.001)
losses = train(model_gru, optimizer, loss_fn, 100)


In [54]:
import math
for epoch, loss in enumerate(losses):
    print(f"Epoch {epoch}: Perplexity = {math.exp(loss):.2f}")

Epoch 0: Perplexity = 69.16
Epoch 1: Perplexity = 66.77
Epoch 2: Perplexity = 64.66
Epoch 3: Perplexity = 62.63
Epoch 4: Perplexity = 60.68
Epoch 5: Perplexity = 58.80
Epoch 6: Perplexity = 56.98
Epoch 7: Perplexity = 55.23
Epoch 8: Perplexity = 53.54
Epoch 9: Perplexity = 51.91
Epoch 10: Perplexity = 50.33
Epoch 11: Perplexity = 48.81
Epoch 12: Perplexity = 47.34
Epoch 13: Perplexity = 45.92
Epoch 14: Perplexity = 44.55
Epoch 15: Perplexity = 43.22
Epoch 16: Perplexity = 41.94
Epoch 17: Perplexity = 40.70
Epoch 18: Perplexity = 39.51
Epoch 19: Perplexity = 38.36
Epoch 20: Perplexity = 37.25
Epoch 21: Perplexity = 36.18
Epoch 22: Perplexity = 35.15
Epoch 23: Perplexity = 34.15
Epoch 24: Perplexity = 33.19
Epoch 25: Perplexity = 32.27
Epoch 26: Perplexity = 31.37
Epoch 27: Perplexity = 30.51
Epoch 28: Perplexity = 29.68
Epoch 29: Perplexity = 28.87
Epoch 30: Perplexity = 28.09
Epoch 31: Perplexity = 27.33
Epoch 32: Perplexity = 26.58
Epoch 33: Perplexity = 25.86
Epoch 34: Perplexity = 2

In [55]:

evaluate(model_gru, 0)
evaluate(model_gru, 1)
evaluate(model_gru, 2)

{'DET': 0, 'NN': 1, 'V': 2, 'ADV': 3}
['The', 'dog', 'ate', 'the', 'apple']
['DET', 'NN', 'V', 'DET', 'NN']
[0, 1, 2, 0, 1]
--------------
{'DET': 0, 'NN': 1, 'V': 2, 'ADV': 3}
['Everybody', 'read', 'that', 'book']
['NN', 'V', 'DET', 'NN']
[1, 2, 0, 1]
--------------
{'DET': 0, 'NN': 1, 'V': 2, 'ADV': 3}
['Everybody', 'does', 'machine', 'learning', 'nowadays']
['NN', 'V', 'NN', 'NN', 'ADV']
[1, 2, 1, 1, 3]
--------------


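Comparing the final-epoch values of exp(loss), the GRU ends lower (2.31) than the LSTM (5.85). The predictions agree: the GRU tags all three sentences correctly, while the LSTM mis-tags "the" in the first sentence and predicts NN for every word in the third. Note, however, that the GRU was trained with Adam (lr=0.001) while the RNN and LSTM used SGD (lr=0.1), so the gap reflects the optimizer as well as the architecture.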
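The GRU cell above was trained with Adam (lr=0.001), whereas the RNN and LSTM cells used SGD (lr=0.1). A like-for-like comparison could be sketched as follows. `GenericTagger` and `final_epoch_loss` are hypothetical names introduced for illustration; the choice of SGD with lr=0.1, 100 epochs, and a fixed seed are assumptions, and results on three sentences should be read as indicative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"]),
    ("Everybody does machine learning nowadays".split(), ["NN", "V", "NN", "NN", "ADV"]),
]
word_to_ix = {}
for sent, _ in training_data:
    for word in sent:
        word_to_ix.setdefault(word, len(word_to_ix))
tag_to_ix = {"DET": 0, "NN": 1, "V": 2, "ADV": 3}


class GenericTagger(nn.Module):
    """Hypothetical tagger whose recurrent layer class is passed in."""

    def __init__(self, rnn_cls, vocab_size, tagset_size, embedding_dim=6, hidden_dim=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = rnn_cls(embedding_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.embedding(sentence).unsqueeze(1)  # [seq_len, 1, embedding_dim]
        out, _ = self.rnn(embeds)                       # [seq_len, 1, hidden_dim]
        return F.log_softmax(self.hidden2tag(out.squeeze(1)), dim=1)


def final_epoch_loss(rnn_cls, epochs=100, lr=0.1, seed=0):
    """Train one tagger and return its average per-sentence NLL in the last epoch."""
    torch.manual_seed(seed)  # same seed for every architecture
    model = GenericTagger(rnn_cls, len(word_to_ix), len(tag_to_ix))
    optimizer = optim.SGD(model.parameters(), lr=lr)
    criterion = nn.NLLLoss()
    for _ in range(epochs):
        total = 0.0
        for sentence, tags in training_data:
            inputs = torch.tensor([word_to_ix[w] for w in sentence], dtype=torch.long)
            targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
            total += loss.item()
    return total / len(training_data)


results = {cls.__name__: final_epoch_loss(cls) for cls in (nn.RNN, nn.LSTM, nn.GRU)}
for name, loss in results.items():
    print(f"{name}: average per-sentence NLL in the last epoch = {loss:.3f}")
```

Keeping the optimizer, learning rate, epoch count, and seed fixed means any remaining difference comes from the recurrent layer itself, although a single seed on toy data is still a weak basis for ranking architectures.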