Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [A Comprehensive Guide to Build your own Language Model in Python](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/) by Mohd Sanad Zaki Rizvi.

# N-GRAM LANGUAGE MODELS

N-gram language models are based on computing probabilities for the occurrence of each word given *n-1* previous words.

To "train" such models, we will make use of the [Reuters](https://www.nltk.org/book/ch02.html) corpus, which contains 10,788 news documents totaling 1.3 million words.

In [1]:
from nltk.corpus import reuters

## Unigram model

For starters, let's build a unigram language model.

In [2]:
from collections import defaultdict

# Create a placeholder for the model
model = defaultdict(int)

# Count the frequency of each token
for sentence in reuters.sents():
    for w in sentence:
        model[w] += 1

Now that we have the counts, we need to transform them into probabilities:

In [3]:
total_count = float(sum(model.values()))
for w in model:
    model[w] /= total_count
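
To make the normalization step concrete, here is a minimal sketch on a made-up corpus. The `toy_sents` data and the `unigram` name are assumptions for illustration and are not part of the Reuters corpus. It uses `collections.Counter` and checks that the resulting probabilities sum to 1.

```python
from collections import Counter

# Hypothetical toy corpus: a list of tokenized sentences
toy_sents = [
    ["the", "cat", "sat", "."],
    ["the", "dog", "sat", "."],
    ["the", "cat", "ran", "."],
]

# Count every token across all sentences
counts = Counter(w for sent in toy_sents for w in sent)

# Maximum-likelihood estimate: relative frequency of each token
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}

print(unigram["the"])         # 3 occurrences out of 12 tokens
print(sum(unigram.values()))  # should be (very close to) 1.0
```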

#### Likely words

How likely is the word 'the'?

In [4]:
# your code here
print(model['the'])

0.03384881432399122


What is the most likely word in the corpus?

In [5]:
# your code here
print(max(model, key=model.get))

.


#### Generating text

Based on this unigram language model, we can try generating some text. It will not be pretty, though...

In [6]:
import random

# number of words to generate
total_words = 100
text = []

for i in range(total_words):
    # select a random probability threshold
    r = random.random()

    # select the first word whose cumulative probability reaches the threshold
    accumulator = .0
    for word in model.keys():
        accumulator += model[word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

will jose Revs LONL a billion dlrs earnings MARCH 11 - INC develop 90 , termination . - or to three 3 TARGET head You cost mln that conservative ; Noting replacing THAI It is past Third GNP - offering 25 and , the nearly have slightly . to powers compared Bank outstanding 08 cts April gas caused much . this the want of ; 19 vs Estaing agricultural were dlrs TO data of . said " s Canadian ; The 05 this TI . Hotel at billion to undisclosed cts Mohler income ' the cts day PHILIPS Finance France
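
The cumulative-sum loop above amounts to drawing each word in proportion to its probability. As an alternative sketch, the standard library's `random.choices` accepts weights directly. The `toy_model` distribution and the `sample_words` helper below are hypothetical names introduced only for this example.

```python
import random

# Hypothetical toy unigram distribution (probabilities sum to 1)
toy_model = {"the": 0.5, "cat": 0.25, "sat": 0.125, ".": 0.125}

def sample_words(model, k, seed=None):
    """Draw k words independently, each with probability model[word]."""
    rng = random.Random(seed)
    words = list(model.keys())
    weights = list(model.values())
    return rng.choices(words, weights=weights, k=k)

sample = sample_words(toy_model, 10, seed=0)
print(" ".join(sample))
```

Passing a seed makes the draw reproducible, which can help when comparing models.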


## Bigram model

In a bigram model, we'll compute the probability of each word given the previous word as context. To obtain bigrams, we can use NLTK's [bigrams](https://www.nltk.org/_modules/nltk/util.html#bigrams). When doing so, we can pad the input on the left and right and define our own sequence start and sequence end symbols.
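
To illustrate what padding does, here is a small pure-Python sketch. The `padded_bigrams` helper is a hypothetical stand-in written for this example, intended to mirror what the padded NLTK call yields for a single sentence; it is not NLTK's implementation.

```python
def padded_bigrams(tokens, start="<s>", end="</s>"):
    """Return the bigrams of a token list, padded with start/end symbols."""
    padded = [start] + list(tokens) + [end]
    return list(zip(padded, padded[1:]))

pairs = padded_bigrams(["prices", "rose", "today"])
print(pairs)
# [('<s>', 'prices'), ('prices', 'rose'), ('rose', 'today'), ('today', '</s>')]
```

The padding symbols let the model learn which words tend to start and end a sentence.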

We first need to obtain the counts:

In [7]:
from nltk import bigrams

# Create a placeholder for the model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        model[w1][w2] += 1

As before, we need to transform counts into probabilities. For that, we divide each count by the total number of occurrences of the first word in the bigram.

In [8]:
# your code here
for w in model:
    total_count = float(sum(model[w].values()))
    for w2 in model[w]:
        model[w][w2] /= total_count
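
The same normalization can be checked on a tiny example. This is a minimal sketch under the assumption of a made-up corpus (`toy_sents`); `bigram_counts` and `bigram_probs` are illustrative names. After normalizing, the probabilities of all words following a given context should sum to 1.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: a list of tokenized sentences
toy_sents = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

# Count bigrams, with explicit start/end symbols around each sentence
bigram_counts = defaultdict(Counter)
for sent in toy_sents:
    padded = ["<s>"] + sent + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        bigram_counts[w1][w2] += 1

# P(w2 | w1) = count(w1, w2) / total count of bigrams starting with w1
bigram_probs = {
    w1: {w2: c / sum(followers.values()) for w2, c in followers.items()}
    for w1, followers in bigram_counts.items()
}

print(bigram_probs["the"])  # 'cat' follows 'the' in 2 of 3 cases
```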

#### Likely pairs

What are the probabilities of each word following 'today'?

In [9]:
# your code here
print(model['today'])

defaultdict(<function <lambda>.<locals>.<lambda> at 0x000002ADC3C05A60>, {'.': 0.18636363636363637, 'to': 0.0659090909090909, "'": 0.10681818181818181, 'and': 0.025, 'as': 0.013636363636363636, ',': 0.16363636363636364, 'with': 0.007575757575757576, 'by': 0.020454545454545454, 'when': 0.0030303030303030303, 'on': 0.011363636363636364, 'recommended': 0.0007575757575757576, 'he': 0.005303030303030303, 'its': 0.0022727272727272726, 'for': 0.01893939393939394, 'De': 0.0007575757575757576, 'European': 0.0007575757575757576, 'described': 0.0007575757575757576, 'the': 0.013636363636363636, ',"': 0.007575757575757576, 'they': 0.0015151515151515152, 'issued': 0.0015151515151515152, 'being': 0.0007575757575757576, 'that': 0.03333333333333333, 'quoted': 0.004545454545454545, 'it': 0.015909090909090907, '."': 0.003787878787878788, 'show': 0.0015151515151515152, 'of': 0.009848484848484848, 'at': 0.02878787878787879, 'through': 0.0015151515151515152, 'reported': 0.015151515151515152, '(': 0.00075757
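
Printing the whole dictionary is hard to read. As a sketch, a small helper can list only the most probable followers in descending order. The `top_followers` function and the `toy_probs` values are hypothetical and chosen only for illustration.

```python
def top_followers(cond_probs, context, k=5):
    """Return the k most probable next words after `context`, best first."""
    # .get avoids creating a new entry when cond_probs is a defaultdict
    followers = cond_probs.get(context, {})
    return sorted(followers.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical conditional distribution for the context 'today'
toy_probs = {"today": {".": 0.5, ",": 0.3, "that": 0.15, "at": 0.05}}
print(top_followers(toy_probs, "today", k=2))
# [('.', 0.5), (',', 0.3)]
```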

What are the probabilities for sentence-starting words? What do most of them have in common?

In [10]:
# your code here
print(model['<s>'])

defaultdict(<function <lambda>.<locals>.<lambda> at 0x000002ADC434C550>, {'ASIAN': 7.31047591198187e-05, 'They': 0.008151180641859785, 'But': 0.019263104028072228, 'The': 0.16154324146501936, 'Unofficial': 1.8276189779954676e-05, '"': 0.06559324512025733, 'In': 0.02522114189633745, 'Threat': 3.655237955990935e-05, 'Taiwan': 0.0006944952116382777, 'Retaliation': 5.482856933986402e-05, 'A': 0.013963008991885371, 'Last': 0.0036917903355508444, 'Much': 0.0001462095182396374, 'He': 0.028986036991008116, 'Meanwhile': 0.0007493237809781417, 'Japan': 0.0020286570655749687, 'Deputy': 0.0001462095182396374, 'CHINA': 0.0009138094889977337, 'It': 0.03231230353095987, 'JAPAN': 0.002997295123912567, 'MITI': 0.0002193142773594561, 'Nuclear': 1.8276189779954676e-05, 'THAI': 0.00034724760581913885, 'Thailand': 0.00023759046713941077, 'Export': 0.0002193142773594561, 'Products': 7.31047591198187e-05, 'INDONESIA': 0.00038379998537904815, 'Prices': 0.000785876160538051, 'Harahap': 3.655237955990935e-05, '

#### Generating text

Now that we have a bigram model, we can generate text based on it.

In [11]:
import random

# sequence start symbol
text = ["<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select the first word whose cumulative probability reaches the threshold,
    # conditioned on the previous word text[-1]
    acc = .0
    for word in model[text[-1]].keys():
        acc += model[text[-1]][word]
        if acc >= r:
            text.append(word)
            break


print(' '.join([t for t in text if t]))

<s> The U . </s>
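
Sampling a next word in proportion to its probability relies on a running cumulative sum: the threshold `r` is compared against the accumulated probability, not against each word's individual probability. Below is a minimal sketch on a hand-written model; `toy_bigram`, `sample_next`, `generate` and the `max_len` cap are assumptions made for this example.

```python
import random

# Hypothetical toy bigram model: toy_bigram[w1][w2] = P(w2 | w1)
toy_bigram = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.5, "ran": 0.5},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}

def sample_next(dist, rng):
    """Pick a word with probability dist[word] via a running cumulative sum."""
    r = rng.random()
    acc = 0.0
    word = None
    for word, p in dist.items():
        acc += p
        if acc >= r:
            return word
    return word  # fallback for rounding error: the last word seen

def generate(model, rng, max_len=50):
    """Generate tokens from '<s>' until '</s>' or until max_len tokens."""
    text = ["<s>"]
    while text[-1] != "</s>" and len(text) < max_len:
        text.append(sample_next(model[text[-1]], rng))
    return text

print(" ".join(generate(toy_bigram, random.Random(0))))
```

The `max_len` cap is a safety net so that generation always terminates, even with a model in which the end symbol is rarely reached.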


## Trigram model

In a trigram model, we'll compute the probability of each word given the previous two words as context. To obtain trigrams, we can use NLTK's [trigrams](https://www.nltk.org/_modules/nltk/util.html#trigrams).

In [12]:
from nltk import trigrams

# Create a placeholder for the model
model = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0)))

# Count the frequency of each trigram
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        model[w1][w2][w3] += 1

#### Likely triplets

What are the most likely words following "today the"?
What about "England has"?

In [13]:
# your code here
print(model['today']['the'])
print(model['England']['has'])

defaultdict(<function <lambda>.<locals>.<lambda>.<locals>.<lambda> at 0x000002ADC4031940>, {'public': 1, 'European': 1, 'Bank': 1, 'price': 2, 'emirate': 1, 'overseas': 1, 'newspaper': 1, 'company': 3, 'Turkish': 1, 'increase': 1, 'options': 1, 'Higher': 1, 'pound': 1, 'Italian': 1, 'time': 1})
defaultdict(<function <lambda>.<locals>.<lambda>.<locals>.<lambda> at 0x000002ADC76FFC10>, {'carried': 1, 'been': 2, 'recently': 1})


#### Generating text

Create your text generator based on the trigram model. Does the generated text start to feel a bit more sound?

In [15]:
for w in model:
    for w2 in model[w]:
        total_count = float(sum(model[w][w2].values()))
        for w3 in model[w][w2]:
            model[w][w2][w3] /= total_count
            

import random

# sequence start symbol
text = ["<s>", "<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select the first word whose cumulative probability reaches the threshold,
    # conditioned on the two previous words text[-2] and text[-1]
    acc = .0
    for word in model[text[-2]][text[-1]].keys():
        acc += model[text[-2]][text[-1]][word]
        if acc >= r:
            text.append(word)
            break



print(' '.join([t for t in text if t]))

<s> <s> The company said it will be paid to the U . S . </s>


## N-gram models

For larger *n*, we can use NLTK's [n-grams](https://www.nltk.org/_modules/nltk/util.html#ngrams), which allows us to choose an arbitrary *n*.

Create your own 4-gram model.

In [22]:
from nltk import ngrams

# Create a placeholder for the model
model = defaultdict(lambda:defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0))))

# Count the frequency of each 4-gram
for sentence in reuters.sents():
    for w1, w2, w3, w4 in ngrams(sentence, 4, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        model[w1][w2][w3][w4] += 1

#### Likely tuples

Check the most likely words following "today the public".

In [23]:
print(model['today']['the']['public'])

defaultdict(<function <lambda>.<locals>.<lambda>.<locals>.<lambda>.<locals>.<lambda> at 0x000002ADCF77D8B0>, {'is': 1})


#### Generating text

Create your text generator based on the 4-gram model. Even better, huh?

In [24]:
for w in model:
    for w2 in model[w]:
        for w3 in model[w][w2]:
            total_count = float(sum(model[w][w2][w3].values()))
            for w4 in model[w][w2][w3]:
                model[w][w2][w3][w4] /= total_count
            

import random

# sequence start symbol
text = ["<s>", "<s>", "<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select the first word whose cumulative probability reaches the threshold,
    # conditioned on the three previous words text[-3], text[-2] and text[-1]
    acc = .0
    for word in model[text[-3]][text[-2]][text[-1]].keys():
        acc += model[text[-3]][text[-2]][text[-1]][word]
        if acc >= r:
            text.append(word)
            break



print(' '.join([t for t in text if t]))

<s> <s> <s> The company said it is in talks with the World Bank and International Monetary Fund ( IMF ) to reduce the deficit , such as those set at the Plaza Hotel in 1985 and down from 199 mln dlrs budgeted last year . </s>
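
Nesting one `defaultdict` per context word becomes unwieldy as *n* grows. One possible generalization, sketched here under assumptions, keys the model on a tuple holding the *n-1* context words, so the same code works for any *n*. The `train_ngram` and `generate` functions and the toy corpus are hypothetical, and unlike the cells above this sketch appends a single end symbol rather than *n-1* of them.

```python
import random
from collections import Counter, defaultdict

def train_ngram(sents, n, start="<s>", end="</s>"):
    """Map each (n-1)-word context tuple to a dict of next-word probabilities."""
    counts = defaultdict(Counter)
    for sent in sents:
        padded = [start] * (n - 1) + list(sent) + [end]
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            counts[context][padded[i + n - 1]] += 1
    return {
        ctx: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for ctx, nxt in counts.items()
    }

def generate(model, n, rng, start="<s>", end="</s>", max_len=50):
    """Sample words until the end symbol (or max_len tokens) is reached."""
    text = [start] * (n - 1)
    while len(text) < max_len:
        # the context is the last n-1 tokens (an empty tuple when n == 1)
        dist = model[tuple(text[len(text) - (n - 1):])]
        word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if word == end:
            break
        text.append(word)
    return text[n - 1:]  # drop the start padding

# Hypothetical toy corpus: a list of tokenized sentences
toy_sents = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

toy_model = train_ngram(toy_sents, 3)
print(" ".join(generate(toy_model, 3, random.Random(0))))
```

With the Reuters corpus, `train_ngram(reuters.sents(), 4)` would be the analogous call, though memory use grows quickly with *n*.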
