# Part 1: Train N-Gram Language Models and answer questions


This notebook has 20 points.

Here we examine how to build count-based MLE language models.


## Part 1.1: Language models

In [1]:
# load libraries
import nltk
from nltk.corpus import PlaintextCorpusReader

from nltk.util import ngrams
from nltk.lm.preprocessing import pad_both_ends

from tqdm import tqdm

# ngram:
_N = 3

In [2]:
# Download a wikipedia dataset:
! wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
! unzip wikitext-2-raw-v1.zip

zsh:1: command not found: wget
Archive:  wikitext-2-raw-v1.zip
replace wikitext-2-raw/wiki.test.raw? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


## Preprocessing

In [3]:
# create a corpus reader
# this includes: sentence segmentation and word tokenization:
wikitext2 = PlaintextCorpusReader(
    'wikitext-2-raw',
    ['wiki.train.raw', 'wiki.valid.raw', 'wiki.test.raw'],
)
word_tokenizer = wikitext2._word_tokenizer

In [4]:
# training and test split:
train = wikitext2.sents('wiki.train.raw')
test = wikitext2.sents('wiki.test.raw')

# the vocabulary based on the training data:
vocab = nltk.lm.Vocabulary([
    word
    for sent in train
    for word in sent
], unk_cutoff=1)

In [5]:
# build n-grams
def build_ngrams(sent, n):
    # pad both ends for corner-ngrams:
    sent = ['<s>']*(n-1) + sent + ['</s>']*(n-1)
    # build the ngrams:
    return list(ngrams(sent, n))

In [6]:
# run this cell to inspect how it works:
sample = "Minecraft is a sandbox video game developed by Mojang."
sample_tokenized = word_tokenizer.tokenize(sample)
sample_trigrams = build_ngrams(sample_tokenized, n=3)
print('sample_tokenized:')
print(sample_tokenized)
print('sample_trigrams')
print(sample_trigrams)


sample_tokenized:
['Minecraft', 'is', 'a', 'sandbox', 'video', 'game', 'developed', 'by', 'Mojang', '.']
sample_trigrams
[('<s>', '<s>', 'Minecraft'), ('<s>', 'Minecraft', 'is'), ('Minecraft', 'is', 'a'), ('is', 'a', 'sandbox'), ('a', 'sandbox', 'video'), ('sandbox', 'video', 'game'), ('video', 'game', 'developed'), ('game', 'developed', 'by'), ('developed', 'by', 'Mojang'), ('by', 'Mojang', '.'), ('Mojang', '.', '</s>'), ('.', '</s>', '</s>')]


## Training the count model

In [7]:
%%time
# compare these three models:
models = {
    'plain': nltk.lm.MLE, # plain count-based ngrams
    'smoothing': nltk.lm.Laplace, # with laplace smoothing
    'smoothing+interpolation': nltk.lm.KneserNeyInterpolated, # interpolated Kneser-Ney smoothing
}

for lm_name in models:
    # build and train the language model:
    models[lm_name] = models[lm_name](_N, vocabulary=vocab)

    # train on all n-grams (equal or lower order): N, N-1, ..., 1.
    for n in tqdm(range(_N, 0, -1), desc=lm_name):
        models[lm_name].fit([build_ngrams(sent, n) for sent in train])


plain: 100%|██████████| 3/3 [00:18<00:00,  6.22s/it]
smoothing: 100%|██████████| 3/3 [00:18<00:00,  6.23s/it]
smoothing+interpolation: 100%|██████████| 3/3 [00:18<00:00,  6.29s/it]

CPU times: user 54.9 s, sys: 1.27 s, total: 56.1 s
Wall time: 56.3 s





#### Understand the models

In [8]:
# Understand how fit() works:
# the fit() method builds a count dictionary for every n-gram order:
(
    models['plain'].counts[1], # unigrams
    models['plain'].counts[2], # bi-grams for conditional count freq (w_{t} | w_{t-1})
    models['plain'].counts[3], # tri-grams for conditional count freq (w_{t} | w_{t-2} w_{t-1})
)

(FreqDist({'the': 113161, ',': 99925, '.': 78888, 'of': 56889, 'and': 50605, 'in': 39488, 'to': 39190, 'a': 34269, '=': 29570, '"': 28309, ...}),
 <ConditionalFreqDist with 75988 conditions>,
 <ConditionalFreqDist with 701629 conditions>)

In [9]:
# for example:
# Count( word_3='number'  | word_1 = 'A', word_2 ='large' ) = 4
# Count( word_3='variety' | word_1 = 'A', word_2 ='large' ) = 3
# ...
list(models['plain'].counts[3].items())[200]

(('A', 'large'),
 FreqDist({'number': 4, 'variety': 3, 'portion': 3, 'team': 1, 'tent': 1, 'oil': 1, 'pyramid': 1, 'camp': 1, 'rear': 1, 'network': 1, ...}))

In [10]:
# understand this:
models['plain'].counts[3][('A', 'large')]['number'] / sum(models['plain'].counts[3][('A', 'large')].values())

0.15384615384615385

In [11]:
models['plain'].score('number', ('A', 'large'))
# more details in chapter 3 equation 3.12.
# https://web.stanford.edu/~jurafsky/slp3/3.pdf

0.15384615384615385
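
The two cells above compute the same quantity: the maximum-likelihood estimate is the trigram count divided by the total count of its two-word context. Written out for the context `('A', 'large')` (the denominator of 26 is inferred from the printed ratio, since the `FreqDist` display is truncated):

$$
P_{\mathrm{MLE}}(w_t \mid w_{t-2}, w_{t-1}) = \frac{C(w_{t-2}\, w_{t-1}\, w_t)}{\sum_{w} C(w_{t-2}\, w_{t-1}\, w)}, \qquad P(\text{number} \mid \text{A}, \text{large}) = \frac{4}{26} \approx 0.1538
$$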

In [12]:
# You can use the plain model for random language generation:
models['plain'].generate(10)

['fields',
 '.',
 '<UNK>',
 '<UNK>',
 'The',
 'Lifelong',
 'Learning',
 'Society',
 '.',
 '<UNK>']

## Testing

In [13]:
# Inspect log probabilities (logscore returns the base-2 log of score):
models['plain'].logscore('mind')

-14.527499220336294

In [14]:
sample = "Minecraft is a sandbox video game developed by Mojang."
sample_ngrams = [
    None,
    build_ngrams(word_tokenizer.tokenize(sample), n=1), # unigrams
    build_ngrams(word_tokenizer.tokenize(sample), n=2), # bigrams
    build_ngrams(word_tokenizer.tokenize(sample), n=3), # trigrams
]

In [15]:
for model_name in models:
    print(f"{model_name} model:")
    for n in range(1, _N+1):
        print(f"{n}-gram", models[model_name].perplexity(sample_ngrams[n]))
    print()

plain model:
1-gram inf
2-gram inf
3-gram inf

smoothing model:
1-gram 5089.599724571091
2-gram 4387.971551314072
3-gram 13897.075640448404

smoothing+interpolation model:
1-gram 1087.2745339516982
2-gram 334.06919150342225
3-gram 334.7464277684467



### Questions

1. Why do these models have an `<UNK>` token? What is the log-probability of `<UNK>` in the three models?


The `<UNK>` token stands in for every word that is not in the model's vocabulary. Out-of-vocabulary (OOV) words are common in real-world text, and without `<UNK>` the model would have no entry to look up for them. The log-probability of `<UNK>` is not set by hand: each model estimates it from the training counts in the same way as for any other token, so it depends on the dataset, on the vocabulary cutoff, and on the smoothing method, and it differs between the three models. It can be read directly with `logscore`.

```py
# log-probability (base 2) of <UNK> as a unigram under each trained model;
# `models` and `vocab` come from the cells above.
for model_name, lm in models.items():
    log_prob_of_unk = lm.logscore(vocab.unk_label)
    print(f"{model_name}: the log-probability of <UNK> is {log_prob_of_unk}")
```
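
To make the role of `<UNK>` concrete, here is a minimal pure-Python sketch of the mapping that a vocabulary applies before any counting or scoring. The toy corpus, the `cutoff` value and the helper name `lookup` are assumptions for illustration; they mimic the idea behind `nltk.lm.Vocabulary` but are not its implementation.

```py
from collections import Counter

UNK = "<UNK>"
train_tokens = "the cat sat on the mat the cat slept".split()
counts = Counter(train_tokens)

cutoff = 2  # hypothetical threshold: keep words seen at least twice
known = {w for w, c in counts.items() if c >= cutoff}

def lookup(tokens):
    """Replace every token outside the known vocabulary with UNK."""
    return [t if t in known else UNK for t in tokens]

mapped = lookup("the dog sat on the cat".split())
print(mapped)
```

Every rare or unseen word collapses into the same `<UNK>` symbol, so its probability is estimated from the pooled counts of all such words.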

2. Why does the plain count-based MLE model fail to produce perplexities? What are the possible solutions?


A plain count-based maximum likelihood estimation (MLE) model fails to produce finite perplexities because it assigns zero probability to any word or n-gram that never occurred in the training data. The sample sentence contains such n-grams, so the probability of the whole sentence is zero, its log-probability is negative infinity, and the perplexity is reported as `inf`.

One possible solution is a smoothing technique, such as Laplace smoothing or Kneser-Ney smoothing. These methods move some probability mass from seen n-grams to unseen ones, so every n-gram gets a non-zero probability. Backoff or interpolation with lower-order n-grams serves the same purpose. Another solution is a neural language model, which generalizes to unseen contexts through word embeddings instead of relying on exact counts.
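
A small pure-Python sketch shows how a single zero probability affects perplexity. The helper `perplexity` and the probability values are made up for illustration; the sketch uses base-2 logs, like `logscore` in `nltk.lm`.

```py
import math

def perplexity(probs):
    """Perplexity of a sequence given the probability of each token."""
    # log2(0) is undefined, so treat a zero probability as -infinity
    logs = [math.log2(p) if p > 0 else -math.inf for p in probs]
    entropy = -sum(logs) / len(logs)  # average negative log-probability
    return 2 ** entropy

seen = perplexity([0.25, 0.5, 0.125])    # every token has some mass
unseen = perplexity([0.25, 0.0, 0.125])  # one token was never observed
print(seen, unseen)
```

The second call returns infinity regardless of how well the other tokens are predicted, which matches the `inf` values reported for the plain model.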

3. Show with an example why Laplace smoothing can produce a perplexity for unseen words.

Laplace smoothing adjusts the MLE estimates so that no word gets zero probability. 
It adds a constant (one, in the classic version) to every count in the numerator and 
adds that constant times the vocabulary size to the denominator, which keeps the 
probabilities summing to one. An unseen word therefore gets a small but non-zero 
probability, and the perplexity stays finite.

For example, consider a simple unigram language model trained on the following sentence: 
"The cat sat on the mat." The MLE probability of the word "mat" would be calculated as:

```py
P("mat") = count("mat") / total_words
```


With Laplace smoothing, we add a constant `k` to every count before normalizing:
```py
from collections import Counter

sentence = "The cat sat on the mat ."
tokens = sentence.split()
counts = Counter(tokens)  # unigram counts; missing words count as 0

k = 1
# vocabulary: the seen word types plus one slot for unseen words
vocab_size = len(counts) + 1
total_words = len(tokens)
for word in list(counts) + ["dog"]:  # "dog" never occurs in the sentence
    probability = (counts[word] + k) / (total_words + (k * vocab_size))
    print(f"P({word}) = {probability}")

```

In general, the smoothed probability of a word such as "mat" is:
```py
P("mat") = (count("mat") + k) / (total_words + (k * vocab_size))
```

4. Why is the perplexity of bigrams lower than that of unigrams?

- answer here (in English and python)
- use models['smoothing'].counts[2] to show how?

If we compare the unigram and bigram counts in `models['smoothing'].counts[1]` and `models['smoothing'].counts[2]`, we can see that the bigram model has more information to work with. It conditions on the previous word, and after a given word only a small subset of the vocabulary occurs with any frequency, so the probability mass is concentrated on fewer candidates and the observed next word tends to get a higher probability. The unigram model ignores context and spreads its mass over the whole vocabulary by overall frequency alone. Higher probability for the observed words means lower perplexity, which is why the bigram perplexity is below the unigram perplexity.

```py
unigram_counts = models['smoothing'].counts[1]  # FreqDist over all words
bigram_counts = models['smoothing'].counts[2]   # ConditionalFreqDist keyed by context tuple

# after a given context word, far fewer word types compete for the probability mass:
context = ('video',)
print("word types overall:", unigram_counts.B())
print("word types seen after 'video':", bigram_counts[context].B())

# so a typical continuation gets a larger share of the mass:
print("P(game):", models['smoothing'].score('game'))
print("P(game | video):", models['smoothing'].score('game', ['video']))

```
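
The same effect can be reproduced on a toy corpus without NLTK. This is a sketch under stated assumptions: the corpus, the add-one smoothing, and the helper names (`p_unigram`, `p_bigram`, `toy_perplexity`) are all illustrative, and both models are scored on the same target words so that the comparison is fair.

```py
import math
from collections import Counter, defaultdict

toy_tokens = "the cat sat on the mat . the dog sat on the rug .".split()
toy_vocab_size = len(set(toy_tokens))

unigram = Counter(toy_tokens)
bigram = defaultdict(Counter)
for prev, word in zip(toy_tokens, toy_tokens[1:]):
    bigram[prev][word] += 1

def p_unigram(word):
    # add-one smoothed unigram probability
    return (unigram[word] + 1) / (len(toy_tokens) + toy_vocab_size)

def p_bigram(word, prev):
    # add-one smoothed probability of `word` given the previous word
    return (bigram[prev][word] + 1) / (sum(bigram[prev].values()) + toy_vocab_size)

def toy_perplexity(probs):
    # 2 to the power of the average negative base-2 log-probability
    return 2 ** (-sum(math.log2(p) for p in probs) / len(probs))

eval_tokens = "the cat sat on the mat .".split()
pairs = list(zip(eval_tokens, eval_tokens[1:]))  # (previous word, word to predict)
unigram_ppl = toy_perplexity([p_unigram(w) for _, w in pairs])
bigram_ppl = toy_perplexity([p_bigram(w, prev) for prev, w in pairs])
print(unigram_ppl, bigram_ppl)
```

On this corpus the bigram perplexity comes out lower, because each context word is followed by only a few distinct words.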

## Part 1.2 (Optional): neural language models and perplexity of conditional trigrams

There are no points for this part but you might want to complete it if you are interested in this topic. We will discuss it again in the forthcoming courses.

The neural network below is based on Bengio et al. (2003). It is trained on moving windows described in chapter 9 figure 9.1 but with trigrams instead of 4-grams.
https://web.stanford.edu/~jurafsky/slp3/9.pdf

You don't need to train the model. However, a standalone Python script is provided in `bengio_lm.py` if you want to try training it on a GPU.

Read the code below then report the perplexity of the language model on the sample sentence.

In [16]:
import torch # neural network framework

# encoding the tokens:
vocab_list = [word for word, freq in vocab.counts.most_common() if freq > 1]
word2idx = {word: idx for idx, word in enumerate(['<s>', '</s>', vocab.unk_label]+vocab_list)}
idx2word = {idx: word for idx, word in enumerate(['<s>', '</s>', vocab.unk_label]+vocab_list)}

def token_encoder(tokens):
    if type(tokens) in {list, tuple}:
        return [word2idx[token] if token in word2idx else word2idx[vocab.unk_label] for token in tokens]
    elif type(tokens) == str:
        token = tokens
        return word2idx[token] if token in word2idx else word2idx[vocab.unk_label]
    print(type(tokens))

# moving window language model:
# https://jmlr.org/papers/volume3/tmp/bengio03a.pdf
class BengioLM(torch.nn.Module):
    def __init__(self, context_size=2, dim=50):
        super(BengioLM, self).__init__()
        # defining the parameters of the model
        self.C = torch.nn.Embedding(len(word2idx), dim) # C
        self.Hx_d = torch.nn.Linear(context_size*dim, dim) # d, H
        self.tanh = torch.nn.Tanh()
        self.Wx_Uf_b = torch.nn.Linear((context_size + 1) * dim, len(word2idx)) # b, U, W
        self.logsoftmax = torch.nn.LogSoftmax(dim=1)
        self.loss_fn = torch.nn.NLLLoss() # negative-log-likelihood loss
    
    def forward(self, context, target_idx=None):
        # function of the model
        batch_size = context.shape[0]
        x = self.C(context).view(batch_size,-1)
        x = torch.cat([x, self.tanh(self.Hx_d(x))], dim=-1)
        logprob = self.logsoftmax(self.Wx_Uf_b(x))
        
        if target_idx is None:
            return logprob
        else:
            loss = self.loss_fn(logprob, target_idx)
            return logprob, loss
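
Reading `forward` line by line, the network computes the following, where $x$ is the concatenation of the two context embeddings looked up in `C`. The single linear layer `Wx_Uf_b` applied to the concatenation $[x; \tanh(d + Hx)]$ is equivalent to separate weight blocks $W$ and $U$ plus the bias $b$ (the symbol names follow the code comments):

$$
x = [C(w_{t-2});\, C(w_{t-1})], \qquad y = b + W x + U \tanh(d + H x), \qquad \log P(w_t \mid w_{t-2}, w_{t-1}) = \log \operatorname{softmax}(y)_{w_t}
$$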


#### The model was trained with stochastic gradient descent for 10 epochs (skip this):

#### Load the model:

In [17]:
# we ran the training code above on GPU and saved it in model.pt.
# load the pre-trained language model:
device = torch.device('cpu')
model = BengioLM()
model.load_state_dict(torch.load('model.pt', map_location=device))

<All keys matched successfully>

In [18]:
# this is how you can get the conditional log-probabilities of all words in the sentence
# P(target | w0, w1):
for w0, w1, target in build_ngrams(word_tokenizer.tokenize(sample), n=3):
    logprobs = model.forward(torch.tensor([token_encoder([w0,w1])]))
    print(target, logprobs[0, token_encoder(target)])

Minecraft tensor(-3.5543, grad_fn=<SelectBackward0>)
is tensor(-3.8908, grad_fn=<SelectBackward0>)
a tensor(-1.8586, grad_fn=<SelectBackward0>)
sandbox tensor(-3.5061, grad_fn=<SelectBackward0>)
video tensor(-8.3849, grad_fn=<SelectBackward0>)
game tensor(-2.7251, grad_fn=<SelectBackward0>)
developed tensor(-5.6833, grad_fn=<SelectBackward0>)
by tensor(-2.9111, grad_fn=<SelectBackward0>)
Mojang tensor(-2.7338, grad_fn=<SelectBackward0>)
. tensor(-3.0074, grad_fn=<SelectBackward0>)
</s> tensor(-0.0465, grad_fn=<SelectBackward0>)
</s> tensor(-1.6689e-06, grad_fn=<SelectBackward0>)


Write code here to report the perplexity of the sample sentence.

For more information, go to chapter 3, section 3.2.1 and chapter 9, equation 9.12.

https://web.stanford.edu/~jurafsky/slp3/3.pdf

https://web.stanford.edu/~jurafsky/slp3/9.pdf
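
As a reference point, the definition can be applied by hand to the twelve log-probabilities printed by the previous cell. This sketch assumes those rounded values, natural logarithms (the model ends in `LogSoftmax`), and that both `</s>` padding predictions count as tokens; the variable names are illustrative.

```py
import math

# log P(target | w0, w1) copied from the output above (rounded)
printed_logprobs = [-3.5543, -3.8908, -1.8586, -3.5061, -8.3849, -2.7251,
                    -5.6833, -2.9111, -2.7338, -3.0074, -0.0465, -1.6689e-06]

# perplexity = exp(average negative log-likelihood)
avg_nll = -sum(printed_logprobs) / len(printed_logprobs)
sentence_perplexity = math.exp(avg_nll)
print(round(sentence_perplexity, 2))
```

Working in log space avoids multiplying many small probabilities together, which would underflow for longer sentences.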

In [19]:
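# perplexity of a sentence
import math

# collect log P(target | w0, w1) for every trigram in the sentence:
target_logprobs = []
with torch.no_grad():
    for w0, w1, target in build_ngrams(word_tokenizer.tokenize(sample), n=3):
        logprobs = model(torch.tensor([token_encoder([w0, w1])]))
        target_logprobs.append(logprobs[0, token_encoder(target)].item())

# perplexity = exp of the average negative log-likelihood
# (natural log, because the model ends in LogSoftmax):
n = len(target_logprobs)
perplexity = math.exp(-sum(target_logprobs) / n)
print(perplexity)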


Implement a generate function using the pre-trained language model above.

For hints see `bengio_lm.py`, for example how the training loop is implemented. The generation loop would be very similar to that.


In [20]:
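# code here
def generate(model, start_words, token_encoder, max_len=100):
    """Greedy generation, conditioning on the last two tokens at each step."""
    model.eval()
    generated_text = list(start_words)  # copy, so the caller's list is untouched
    # left-pad with '<s>' in case fewer than two start words are given:
    context = (['<s>', '<s>'] + generated_text)[-2:]

    with torch.no_grad():
        for _ in range(max_len):
            logprobs = model(torch.tensor([token_encoder(context)], device=device))
            # pick the most likely next word and map its index back to a string:
            word = idx2word[torch.argmax(logprobs[0]).item()]
            if word == '</s>':  # stop at the end-of-sentence symbol
                break
            generated_text.append(word)
            context = [context[-1], word]  # slide the window by one token

    return ' '.join(generated_text)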