<a href="https://colab.research.google.com/github/cbadenes/notebooks/blob/main/nlp/3gram_language_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trigram model from the Reuters corpus
The [Reuters Corpus](https://www.nltk.org/book/ch02.html) contains 10,788 news documents totaling 1.7 million words. The documents are classified into 90 topics and grouped into two sets, "training" and "test"; for example, the text with fileid 'test/14826' is a document drawn from the test set. The split is intended for training and testing algorithms that automatically detect the topic of a document.

This notebook is based on the exercise proposed [here](https://nlpforhackers.io/language-models/).

## 1) Load Data

In [None]:
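import nltk
import pprint
pp = pprint.PrettyPrinter(indent=4)

# Download the Reuters corpus and the Punkt sentence tokenizer models.
nltk.download('reuters')
nltk.download('punkt')
# Unpack the corpus archive (the path assumes a Colab runtime,
# where NLTK stores its data under /root/nltk_data).
!unzip -o -q /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora
from nltk.corpus import reuters

# Count every token in the corpus.
print("counting words..")
total_count = len(reuters.words())
print("Number of words in corpus:", total_count)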

The five most common words are:

In [None]:
from collections import Counter
counts = Counter(reuters.words())
print(counts.most_common(n=5))
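
Because `reuters.words()` returns every token, punctuation marks are counted as words too. If only alphabetic words are of interest, one option is to filter the tokens before counting. A minimal sketch on a made-up token list (`toy_tokens` is invented for illustration, and lowercasing is an optional choice that the rest of this notebook does not make):

```python
from collections import Counter

# Hypothetical tokens standing in for reuters.words().
toy_tokens = ["The", "dollar", "fell", ",", "and", "the", "yen", "rose", ".", "The", "end", "."]

# Keep alphabetic tokens only and fold case before counting.
word_counts = Counter(t.lower() for t in toy_tokens if t.isalpha())
print(word_counts.most_common(1))
```

The same generator expression could be applied to `reuters.words()`, at the cost of one more pass over the corpus.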

## 2) Get n-grams

In [None]:
from nltk import bigrams, trigrams
 
sentence = reuters.sents()[50]
print(sentence)

Get the bigrams

In [None]:
print(list(bigrams(sentence)))

Get the padded bigrams

In [None]:
print(list(bigrams(sentence, pad_left=True, pad_right=True))) 

Get the trigrams

In [None]:
print(list(trigrams(sentence)))

Get the padded trigrams

In [None]:
print(list(trigrams(sentence, pad_left=True, pad_right=True)))
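
To make the effect of padding concrete, here is a minimal pure-Python sketch of what `pad_left=True, pad_right=True` does, assuming NLTK's default padding symbol `None`. The helper name `padded_ngrams` is hypothetical and is not part of NLTK.

```python
def padded_ngrams(tokens, n):
    """Return the n-grams of tokens after padding each side with n-1 None symbols."""
    pad = [None] * (n - 1)
    padded = pad + list(tokens) + pad
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

# Hypothetical three-word sentence.
toy_trigrams = padded_ngrams(["oil", "prices", "rose"], 3)
for gram in toy_trigrams:
    print(gram)
```

The first trigram, `(None, None, w)`, records which word starts a sentence, and the last, `(w, None, None)`, records where it ends. Both are used later when counting and when generating text.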

## 3) Count the 3-grams

In [None]:
from collections import defaultdict

# model[(w1, w2)][w3] holds how many times w3 follows the word pair (w1, w2).
model = defaultdict(lambda: defaultdict(int))

for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

"economists" follows "what the" 2 times:

In [None]:
print(model["what", "the"]["economists"])

"nonexistingword" never follows "what the", so its count is 0:

In [None]:
print(model["what", "the"]["nonexistingword"])

Number of sentences that start with "The":

In [None]:
print(model[None, None]["The"])
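
The counting step above can be reproduced on a toy corpus small enough to check by hand. This is a sketch under stated assumptions: `toy_corpus` is invented, and `padded_trigrams` is a hypothetical stand-in for NLTK's `trigrams(..., pad_left=True, pad_right=True)`.

```python
from collections import defaultdict

def padded_trigrams(tokens):
    """Return trigrams over tokens padded with two None symbols on each side."""
    padded = [None, None] + list(tokens) + [None, None]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

# Hypothetical corpus of three tokenized sentences.
toy_corpus = [
    ["oil", "prices", "rose"],
    ["oil", "prices", "fell"],
    ["oil", "prices", "rose"],
]

# toy_counts[(w1, w2)][w3] is the number of times w3 follows (w1, w2).
toy_counts = defaultdict(lambda: defaultdict(int))
for toy_sentence in toy_corpus:
    for w1, w2, w3 in padded_trigrams(toy_sentence):
        toy_counts[(w1, w2)][w3] += 1

print(dict(toy_counts[("oil", "prices")]))
print(dict(toy_counts[(None, None)]))
```

Note that looking up a missing key in a `defaultdict` inserts it with the default value, so a query for an unseen word leaves a zero-count entry behind.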

Let's transform the counts into probabilities by dividing each count by the total count of its two-word context:

In [None]:
# Normalize the counts of each context so that they sum to 1.
for w1_w2 in model:
    context_total = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= context_total

Probability that "economists" follows "what the":

In [None]:
print(model["what", "the"]["economists"])

Probability that a sentence starts with "The":

In [None]:
print(model[None, None]["The"])
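
In equation form, the normalization step computes the maximum-likelihood estimate of each trigram probability. Here $C(w_1, w_2, w_3)$ stands for the count stored in `model[(w1, w2)][w3]` before normalization; the symbols $P$ and $C$ are notation introduced for this note, not names used in the notebook.

$$
P(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3)}{\sum_{w} C(w_1, w_2, w)}
$$

Since no smoothing is applied, any continuation that never appears after $(w_1, w_2)$ in the corpus gets probability 0.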

## 4) Generate some text

Most probable words after "company would", in descending order of probability:

In [None]:
x = model["company","would"]
for w in sorted(x, key=x.get, reverse=True):
    print(w, x[w])
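
Some contexts have a long list of continuations, so it can help to print only the top few. A minimal sketch with `heapq.nlargest` from the standard library; the words and probabilities in `toy_distribution` are invented for illustration and are not taken from the corpus.

```python
import heapq

# Hypothetical next-word distribution for a single context.
toy_distribution = {"be": 0.4, "not": 0.3, "have": 0.2, "also": 0.1}

# Pick the three entries with the highest probability.
top_three = heapq.nlargest(3, toy_distribution.items(), key=lambda item: item[1])
for word, probability in top_three:
    print(word, probability)
```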

In [None]:
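import random

# Start from the sentence-start padding. A seed such as ["Today", "the"]
# also works, provided that word pair occurs in the corpus.
#text = ["Today", "the"]
text = [None, None]

sentence_finished = False

while not sentence_finished:
    # Draw the next word from the distribution conditioned on the last
    # two words: walk the cumulative probabilities until they reach r.
    r = random.random()
    accumulator = 0.0
    candidates = model[tuple(text[-2:])]

    for word, probability in candidates.items():
        accumulator += probability
        if accumulator >= r:
            text.append(word)
            break

    # Two consecutive padding symbols mark the end of the sentence.
    # An unseen context has no candidates, so stop there as well.
    if text[-2:] == [None, None] or not candidates:
        sentence_finished = True

print(' '.join([t for t in text if t]))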
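
The loop above can also be wrapped in a reusable function. The sketch below is one possible variant, not the notebook's method: it swaps the manual accumulator for `random.choices` from the standard library, and it adds a length cap and a guard for unseen contexts. The names `generate`, `toy_model`, `seed`, and `max_words` are assumptions introduced for illustration.

```python
import random

def generate(model, seed=None, max_words=50):
    """Sample words from a trigram model until the end-of-sentence padding appears."""
    rng = random.Random(seed)
    text = [None, None]                      # sentence-start padding
    while len(text) < max_words + 2:
        dist = model.get(tuple(text[-2:]))
        if not dist:                         # unseen context: stop
            break
        words = list(dist)
        weights = [dist[w] for w in words]
        text.append(rng.choices(words, weights=weights, k=1)[0])
        if text[-2:] == [None, None]:        # sentence-end padding reached
            break
    return " ".join(t for t in text if t)

# Hypothetical model in which every context has exactly one continuation.
toy_model = {
    (None, None): {"oil": 1.0},
    (None, "oil"): {"prices": 1.0},
    ("oil", "prices"): {"rose": 1.0},
    ("prices", "rose"): {None: 1.0},
    ("rose", None): {None: 1.0},
}
generated = generate(toy_model, seed=0)
print(generated)
```

Calling `generate(model)` on the Reuters model would start from the sentence-start padding just like the cell above, and passing a `seed` makes a run repeatable.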