Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [A Comprehensive Guide to Build your own Language Model in Python](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/) by Mohd Sanad Zaki Rizvi.

# N-GRAM LANGUAGE MODELS

N-gram language models are based on computing probabilities for the occurrence of each word given *n-1* previous words.
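As a point of reference, the probabilities estimated in this notebook are relative frequencies (maximum-likelihood estimates). The notation below is introduced here only for illustration: $C(\cdot)$ denotes how often a word sequence occurs in the training corpus.

```latex
P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
  = \frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i)}
         {\sum_{w} C(w_{i-n+1}, \dots, w_{i-1}, w)}
```

For *n* = 1 the context is empty, so the estimate reduces to the word's count divided by the total number of tokens.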

To "train" such models, we will make use of the [Reuters](https://www.nltk.org/book/ch02.html) corpus, which contains 10,788 news documents totaling 1.3 million words.

In [1]:
from nltk.corpus import reuters

We can check how many sentences the corpus contains. Each sentence is a list of tokens (words and punctuation marks).

In [2]:
print(len(reuters.sents()))

print(reuters.sents()[0])
for w in reuters.sents()[0]:
    print(w, end=' ')

54711
['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . 

## Unigram model

For starters, let's build a unigram language model.

In [3]:
from collections import defaultdict

# Create a placeholder for the model
uni_model = defaultdict(int)

# Count the frequency of each token
for sentence in reuters.sents():
    for w in sentence:
        uni_model[w] += 1

Now that we have the counts, we need to transform them into probabilities:

In [4]:
total_count = float(sum(uni_model.values()))
for w in uni_model:
    uni_model[w] /= total_count
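A quick sanity check on this normalisation step is that the resulting probabilities add up to 1. The sketch below does this on a tiny hand-made corpus; `toy_sents`, `toy_counts` and `toy_uni` are hypothetical names and data used only for illustration, not part of the Reuters corpus.

```python
from collections import Counter

# Toy corpus (hypothetical data, not taken from Reuters)
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]

# Count tokens, then divide each count by the total number of tokens
toy_counts = Counter(w for sent in toy_sents for w in sent)
toy_total = sum(toy_counts.values())
toy_uni = {w: c / toy_total for w, c in toy_counts.items()}

print(toy_uni["the"])         # 2 occurrences out of 7 tokens
print(sum(toy_uni.values()))  # should be (very close to) 1.0
```

The same check can be run on `uni_model` itself; because of floating-point rounding the sum may differ from 1.0 by a tiny amount.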

#### Likely words

How likely is the word 'the'?

In [5]:
# your code here
uni_model['the']

0.03384881432399122

What is the most likely word in the corpus?

In [6]:
# your code here
max(uni_model, key=uni_model.get)

'.'

#### Generating text

Based on this unigram language model, we can try generating some text. It will not be pretty, though...

In [7]:
import random

# number of words to generate
total_words = 100
text = []

for i in range(total_words):
    # select a random probability threshold
    r = random.random()

    # select word above the probability threshold
    accumulator = .0
    for word in uni_model.keys():
        accumulator += uni_model[word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

HALF to new U around . and week with & Ltd 6 three CITIES the > . and held means in James In will guard said company to and NEM Elaborating in NVHomes in was to a some freedom the in agreement the BUYS ' The it Bank 106 , to 15 the met told billion FDO products vs 1987 countries FROM 738 from possibility a Ltd ," of issue 5 says last ," ( 000 505 and The . , to Frawley S July 876 share Foods in 000 for SENATE 76 famine 25 mln one and it '
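The loop above walks through the vocabulary until the accumulated probability passes a random threshold. As an alternative sketch, Python's standard `random.choices` can draw weighted samples directly. The helper `sample_words` and the toy distribution are assumptions made for this example; on the real model one would pass `uni_model` instead.

```python
import random

# Toy unigram distribution (hypothetical probabilities, for illustration only)
toy_uni = {"the": 0.5, "cat": 0.3, "sat": 0.2}

def sample_words(dist, k, seed=None):
    """Draw k words independently, each with probability dist[word]."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    words = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(words, weights=weights, k=k)

sample = sample_words(toy_uni, k=10, seed=0)
print(' '.join(sample))
```

Either way, every word is drawn independently of its neighbours, which is why unigram output reads as a bag of frequent tokens.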


## Bigram model

In a bigram model, we'll compute the probability of each word given the previous word as context. To obtain bigrams, we can use NLTK's [bigrams](https://www.nltk.org/_modules/nltk/util.html#bigrams). When doing so, we can pad the input on the left and right and define our own sequence-start and sequence-end symbols.
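To see what padding does, here is a minimal pure-Python sketch meant to mimic the padded output on a short sentence. The helper `padded_ngrams` and the toy sentence are hypothetical and only illustrative; the notebook itself relies on NLTK's functions.

```python
def padded_ngrams(tokens, n, left="<s>", right="</s>"):
    """Yield n-grams over tokens, padded with n-1 symbols on each side."""
    padded = [left] * (n - 1) + list(tokens) + [right] * (n - 1)
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])

toy_sentence = ["trade", "friction", "grows"]
for pair in padded_ngrams(toy_sentence, 2):
    print(pair)
```

With the padding in place, the first bigram pairs the start symbol with the first word and the last bigram pairs the final word with the end symbol, so the model can learn which words tend to open and close sentences.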

We first need to obtain the counts:

In [8]:
from nltk import bigrams

# Create a placeholder for the model
bi_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        bi_model[w1][w2] += 1

As before, we need to transform counts into probabilities. For that, we divide each count by the total number of occurrences of the first word in the bigram.

In [9]:
# your code here
for w1 in bi_model:
    total_count = float(sum(bi_model[w1].values()))
    for w2 in bi_model[w1]:
        bi_model[w1][w2] /= total_count

#### Likely pairs

What are the probabilities of each word following 'today'?

In [10]:
# your code here
bi_model['today']

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'.': 0.18636363636363637,
             'to': 0.0659090909090909,
             "'": 0.10681818181818181,
             'and': 0.025,
             'as': 0.013636363636363636,
             ',': 0.16363636363636364,
             'with': 0.007575757575757576,
             'by': 0.020454545454545454,
             'when': 0.0030303030303030303,
             'on': 0.011363636363636364,
             'recommended': 0.0007575757575757576,
             'he': 0.005303030303030303,
             'its': 0.0022727272727272726,
             'for': 0.01893939393939394,
             'De': 0.0007575757575757576,
             'European': 0.0007575757575757576,
             'described': 0.0007575757575757576,
             'the': 0.013636363636363636,
             ',"': 0.007575757575757576,
             'they': 0.0015151515151515152,
             'issued': 0.0015151515151515152,
             'being': 0.0007575757575757576,
            

What are the probabilities for sentence-starting words? What do most of them have in common? (Hint: check the *left_pad_symbol* defined above for collecting bigrams.)

In [11]:
# your code here
bi_model['<s>']

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'ASIAN': 7.311144011259162e-05,
             'They': 0.008151925572553965,
             'But': 0.01926486446966789,
             'The': 0.16155800478879934,
             'Unofficial': 1.8277860028147905e-05,
             '"': 0.06559923964102284,
             'In': 0.02522344683884411,
             'Threat': 3.655572005629581e-05,
             'Taiwan': 0.0006945586810696204,
             'Retaliation': 5.483358008444371e-05,
             'A': 0.013964285061504999,
             'Last': 0.0036921277256858768,
             'Much': 0.00014622288022518324,
             'He': 0.028988686004642578,
             'Meanwhile': 0.0007493922611540641,
             'Japan': 0.0020288424631244176,
             'Deputy': 0.00014622288022518324,
             'CHINA': 0.0009138930014073952,
             'It': 0.032315256529765496,
             'JAPAN': 0.0029975690446162563,
             'MITI': 0.00021933432033777484,
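Raw dictionary dumps like the one above are hard to scan. A small helper that sorts by probability makes the most likely continuations easier to read. `top_k` and the toy distribution below are hypothetical, introduced only for illustration; in the notebook one could call it as `top_k(bi_model['<s>'])`.

```python
def top_k(dist, k=5):
    """Return the k (word, probability) pairs with the highest probability."""
    return sorted(dist.items(), key=lambda item: item[1], reverse=True)[:k]

# Toy next-word distribution (hypothetical values, for illustration only)
toy_next = {"It": 0.2, "The": 0.4, "But": 0.11, "He": 0.15, "In": 0.14}
print(top_k(toy_next, k=3))
```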
        

#### Generating text

Now that we have a bigram model, we can generate text based on it.

In [46]:
import random

# sequence start symbol
text = ["<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select word above the probability threshold, conditioned on the previous word text[-1]
    accumulator = .0
    for word in bi_model[text[-1]].keys():
        accumulator += bi_model[text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(t for t in text if t not in ('<s>', '</s>')))

February from the Fed began its agricultural and sorghum production is not doubled its reliance on certain obligations of group , acting as too bullish impact of the company ' s largest independent U .


## Trigram model

In a trigram model, we'll compute the probability of each word given the previous two words as context. To obtain trigrams, we can use NLTK's [trigrams](https://www.nltk.org/_modules/nltk/util.html#trigrams).

In [19]:
# your code here
from nltk import trigrams

# Create a placeholder for the model
tri_model = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0)))

# Count the frequency of each trigram
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        tri_model[w1][w2][w3] += 1

# Convert counts to probabilities: normalise over w3 for each context (w1, w2)
for w1 in tri_model:
    for w2 in tri_model[w1]:
        total_count = float(sum(tri_model[w1][w2].values()))
        for w3 in tri_model[w1][w2]:
            tri_model[w1][w2][w3] /= total_count

#### Likely triplets

What are the most likely words following "today the"?
What about "England has"?

In [20]:
# your code here
tri_model['today']['the']

defaultdict(<function __main__.<lambda>.<locals>.<lambda>.<locals>.<lambda>()>,
            {'public': 0.05555555555555555,
             'European': 0.05555555555555555,
             'Bank': 0.05555555555555555,
             'price': 0.1111111111111111,
             'emirate': 0.05555555555555555,
             'overseas': 0.05555555555555555,
             'newspaper': 0.05555555555555555,
             'company': 0.16666666666666666,
             'Turkish': 0.05555555555555555,
             'increase': 0.05555555555555555,
             'options': 0.05555555555555555,
             'Higher': 0.05555555555555555,
             'pound': 0.05555555555555555,
             'Italian': 0.05555555555555555,
             'time': 0.05555555555555555})

In [21]:
# your code here
tri_model['England']['has']

defaultdict(<function __main__.<lambda>.<locals>.<lambda>.<locals>.<lambda>()>,
            {'carried': 0.25, 'been': 0.5, 'recently': 0.25})

#### Generating text

Create your text generator based on the trigram model. Does the generated text start to sound a bit more coherent?

In [51]:
# your code here
import random

# sequence start symbol
text = ["<s>", "<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()

    # select word above the probability threshold, conditioned on the previous two words
    accumulator = .0
    for word in tri_model[text[-2]][text[-1]].keys():
        accumulator += tri_model[text[-2]][text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(t for t in text if t not in ('<s>', '</s>')))

PHOTRONIC LABS INC & lt ; Europeiska Rejseforsikrings A / C single or tweendecker 16 , 170 , 000 vs loss nine cts Net 6 , 570 NOTE : figures include extraordinary credit represents substantitally the after - tax profit 134 mln Revs 312 . 4 mln Avg shrs 2 , 990 , 000 dlrs in fiscal 1986 , was not aware of the company .


## N-gram models

For larger *n*, we can use NLTK's [n-grams](https://www.nltk.org/_modules/nltk/util.html#ngrams), which allows us to choose an arbitrary *n*.

Create your own 4-gram model.

In [None]:
n = 4

In [55]:
# your code here
from nltk import ngrams

# Create a placeholder for the model
four_model = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0))))

# Count the frequency of each 4-gram
for sentence in reuters.sents():
    for w1, w2, w3, w4 in ngrams(sentence, 4, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        four_model[w1][w2][w3][w4] += 1

# Convert counts to probabilities: normalise over w4 for each context (w1, w2, w3)
for w1 in four_model:
    for w2 in four_model[w1]:
        for w3 in four_model[w1][w2]:
            total_count = float(sum(four_model[w1][w2][w3].values()))
            for w4 in four_model[w1][w2][w3]:
                four_model[w1][w2][w3][w4] /= total_count
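Nesting one `defaultdict` per context word becomes unwieldy as *n* grows. One possible alternative, sketched below under the assumption that sentences are plain lists of tokens, keys the model on a tuple of the *n-1* context words so the same code works for any *n*. The function `build_ngram_model`, its padding logic and the toy sentences are hypothetical and are not part of NLTK or the original notebook.

```python
from collections import Counter, defaultdict

def build_ngram_model(sentences, n, left="<s>", right="</s>"):
    """Map each (n-1)-word context tuple to a dict of next-word probabilities."""
    counts = defaultdict(Counter)
    for sent in sentences:
        # pad with n-1 symbols on each side, as in the cells above
        padded = [left] * (n - 1) + list(sent) + [right] * (n - 1)
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            counts[context][padded[i + n - 1]] += 1
    # normalise each context's counts so they sum to 1
    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        model[context] = {w: c / total for w, c in counter.items()}
    return model

# Toy sentences (hypothetical data, for illustration only)
toy_sents = [["a", "b", "c"], ["a", "b", "d"]]
toy_model = build_ngram_model(toy_sents, 3)
print(toy_model[("a", "b")])
```

On the real corpus this could be called as `build_ngram_model(reuters.sents(), 4)`, with lookups such as `model[('today', 'the', 'public')]`.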

#### Likely tuples

Check the most likely words following "today the public".

In [56]:
# your code here
four_model['today']['the']['public']

defaultdict(<function __main__.<lambda>.<locals>.<lambda>.<locals>.<lambda>.<locals>.<lambda>()>,
            {'is': 1.0})

#### Generating text

Create your text generator based on the 4-gram model. Even better, huh?

In [57]:
# your code here
import random

# sequence start symbol
text = ["<s>", "<s>", "<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()

    # select word above the probability threshold, conditioned on the previous three words
    accumulator = .0
    for word in four_model[text[-3]][text[-2]][text[-1]].keys():
        accumulator += four_model[text[-3]][text[-2]][text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(t for t in text if t not in ('<s>', '</s>')))

Also , Dallas - based oil company ' s specialties and services segment cut its losses because its scaled - down version of legislation to set an annual meeting and require all directors to stand for election .
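The three generators in this notebook differ only in how many previous words they condition on. As a sketch, the loop can be written once for any *n*, assuming the model is stored as a dictionary from context tuples to next-word probabilities rather than as nested `defaultdict`s. The function `generate`, its `max_words` cap and the hand-written toy model are all hypothetical, introduced only for illustration.

```python
import random

def generate(model, n, left="<s>", right="</s>", max_words=50, seed=None):
    """Sample words from a context-tuple model until the end symbol appears."""
    rng = random.Random(seed)
    text = [left] * (n - 1)
    while len(text) - (n - 1) < max_words:
        # the context is the last n-1 generated tokens (empty for n = 1)
        context = tuple(text[len(text) - (n - 1):])
        dist = model.get(context)
        if not dist:  # unseen context: stop rather than loop forever
            break
        word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if word == right:
            break
        text.append(word)
    return text[n - 1:]  # drop the start-of-sequence padding

# Hand-written toy trigram model (hypothetical values, for illustration only)
toy_model = {
    ("<s>", "<s>"): {"prices": 1.0},
    ("<s>", "prices"): {"rose": 1.0},
    ("prices", "rose"): {"today": 1.0},
    ("rose", "today"): {"</s>": 1.0},
}
print(' '.join(generate(toy_model, n=3)))
```

Because every context in the toy model has a single continuation, the output is deterministic. With a model estimated from a real corpus, each call would produce a different sentence unless a seed is fixed.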
