<a href="https://colab.research.google.com/github/cbadenes/notebooks/blob/main/nlp/3gram_language_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tri-gram model from the Reuters corpus.
The [Reuters Corpus](https://www.nltk.org/book/ch02.html) contains 10,788 news documents totaling 1.7 million words. The documents are classified into 90 topics and grouped into two sets, called "training" and "test"; for example, the text with fileid 'test/14826' is a document drawn from the test set. The split is intended for training and testing algorithms that automatically detect the topic of a document.

This notebook is based on the exercise proposed [here](https://nlpforhackers.io/language-models/).

## 1) Load Data

In [None]:
from collections import Counter
import nltk
from nltk import bigrams, trigrams
nltk.download('reuters')
nltk.download('punkt')
!unzip -o -q /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora
from nltk.corpus import reuters


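def get_bigrams(sentence, pad=False):
    # With pad=True, NLTK adds one None symbol at each end of the sentence.
    return list(bigrams(sentence, pad_left=pad, pad_right=pad))

def get_trigrams(sentence, pad=False):
    # With pad=True, NLTK adds two None symbols at each end of the sentence.
    return list(trigrams(sentence, pad_left=pad, pad_right=pad))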


print("counting words..")
total_count = len(reuters.words())
print("Total Words:", total_count)
counts = Counter(reuters.words())
print("Top5 most common words:", counts.most_common(n=5))



## 2) Get n-grams

Get the bigrams

In [None]:
sentence="Natural language processing is a subfield of linguistics, computer science, and artificial intelligence"

get_bigrams(sentence.split(" "))

Get the padded bigrams

In [None]:
get_bigrams(sentence.split(" "),pad=True)

Get the trigrams

In [None]:
get_trigrams(sentence.split(" "))

Get the padded trigrams

In [None]:
get_trigrams(sentence.split(" "),pad=True)
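To make the effect of padding easier to see, here is a minimal pure-Python sketch that mimics the padded behaviour on a three-word sentence. The helper `padded_ngrams` is a hypothetical name introduced only for illustration; it is not part of NLTK, and it assumes `None` as the padding symbol on both sides.

```python
def padded_ngrams(tokens, n, pad=False):
    """Return the n-grams of tokens as tuples.

    With pad=True, n-1 None symbols are added on each side
    (an illustrative stand-in for the padded NLTK calls above).
    """
    tokens = list(tokens)
    if pad:
        padding = [None] * (n - 1)
        tokens = padding + tokens + padding
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "market", "fell"]
plain = padded_ngrams(tokens, 3)
padded = padded_ngrams(tokens, 3, pad=True)

print(plain)   # [('the', 'market', 'fell')]
for gram in padded:
    print(gram)
```

Without padding, a three-word sentence yields a single trigram. With padding it yields five, including one that records which word opens the sentence and one that records which word closes it.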

## 3) Count occurrences


In [None]:
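from collections import defaultdict

# model[(w1, w2)][w3] holds how often w3 follows the word pair (w1, w2)
model = defaultdict(lambda: defaultdict(int))

for sentence in reuters.sents():
    for w1, w2, w3 in get_trigrams(sentence, pad=True):
        model[(w1, w2)][w3] += 1
# len(model) is the number of distinct two-word contexts, padding included
print("Total bi-grams:", len(model))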

How many times does "economists" follow "what the"?

In [None]:
model["what", "the"]["economists"]

And how often does an unseen word such as "nonexistingword" follow "what the"?

In [None]:
print(model["what", "the"]["nonexistingword"])

How many sentences start with "The"?

In [None]:
model[None, None]["The"]
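The same counting scheme can be traced by hand on a toy corpus. The sketch below is only an illustration: the three sentences in `toy_sents` are made up, and the padding is done with plain lists rather than NLTK so the block runs without the Reuters download.

```python
from collections import defaultdict

# Made-up sentences, used only to illustrate the counting step
toy_sents = [
    ["The", "market", "fell"],
    ["The", "market", "rose"],
    ["The", "dollar", "fell"],
]

toy_counts = defaultdict(lambda: defaultdict(int))
for sent in toy_sents:
    padded = [None, None] + sent + [None, None]
    # zip over three shifted views of the list to walk its trigrams
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        toy_counts[(w1, w2)][w3] += 1

print(toy_counts[(None, None)]["The"])       # 3
print(dict(toy_counts[("The", "market")]))   # {'fell': 1, 'rose': 1}
# .get() reads a count without inserting a zero entry for an unseen word
print(toy_counts[("The", "market")].get("crashed", 0))  # 0
```

Indexing a nested `defaultdict` with an unseen word returns 0 but also stores that word as a new key. Using `.get(word, 0)` avoids growing the table when only a lookup is intended.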

Let's transform the counts into probabilities by dividing each count by the total number of trigrams that share the same two-word context.

In [None]:
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count
print("done!")
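The normalization step is a relative-frequency estimate: the probability of a word given its two-word context is its count divided by the sum of all counts for that context. A minimal sketch on made-up counts follows; `toy_counts` and `normalize` are hypothetical names, and unlike the cell above this version returns a new table instead of overwriting the counts.

```python
# Made-up counts, used only to illustrate the normalization
toy_counts = {
    ("The", "market"): {"fell": 3, "rose": 1},
    (None, None): {"The": 2, "A": 2},
}

def normalize(counts):
    """Return a new dict mapping each context to P(next word | context)."""
    probs = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        probs[context] = {word: c / total for word, c in followers.items()}
    return probs

toy_probs = normalize(toy_counts)
print(toy_probs[("The", "market")])   # {'fell': 0.75, 'rose': 0.25}

# Each context's probabilities should sum to 1 (up to rounding)
for context, dist in toy_probs.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12
```

Keeping counts and probabilities in separate tables is a design choice: it allows both kinds of query after normalization, at the cost of extra memory.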

What is the probability that "economists" follows "what the"?

In [None]:
model["what", "the"]["economists"]

And the probability that a sentence starts with "The"?

In [None]:
model[None, None]["The"]

## 4) Generate text

What are the most probable words to follow "The market"?

In [None]:
words = model["The","market"]
for word in sorted(words, key=words.get, reverse=True)[:5]:
    print(word, words[word])
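Sorting the whole distribution works, but when only a few top entries are needed, `heapq.nlargest` from the standard library is an alternative. The sketch below uses a made-up distribution `toy_dist` and a hypothetical helper `top_k`; the probabilities are illustrative, not taken from the Reuters model.

```python
import heapq

# Made-up next-word distribution for a single context
toy_dist = {"is": 0.4, "was": 0.3, "fell": 0.2, "rose": 0.1}

def top_k(dist, k):
    """Return the k (word, probability) pairs with the highest probability."""
    return heapq.nlargest(k, dist.items(), key=lambda item: item[1])

print(top_k(toy_dist, 2))   # [('is', 0.4), ('was', 0.3)]
```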

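Create a random sentence by repeatedly sampling the next word from the distribution of the last two words, until the end-of-sentence padding is reached.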

In [None]:
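import random

text = ["The", "market"]  # seed words that start the sentence
#text = [None, None]      # alternative: let the model pick the first word

sentence_finished = False

while not sentence_finished:
    # Draw r in [0, 1) and pick the first word whose cumulative probability reaches it
    r = random.random()
    accumulator = 0.0
    context = tuple(text[-2:])

    for word, prob in model[context].items():
        accumulator += prob
        if accumulator >= r:
            text.append(word)
            break

    # Two consecutive padding symbols mark the end of the sentence
    if text[-2:] == [None, None]:
        sentence_finished = True

# Drop the None padding symbols before printing
print(' '.join([t for t in text if t]))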
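The cumulative-sum loop above can also be written with `random.choices`, which samples according to given weights. The following is a sketch under stated assumptions, not a replacement for the cell above: `toy_model` is a tiny hand-written table, `generate` is a hypothetical helper, and the `max_words` cap and the early exit on an unseen context are safeguards added here for illustration.

```python
import random

# Hand-written toy model: context -> {next word: probability}
toy_model = {
    (None, None): {"The": 1.0},
    (None, "The"): {"market": 0.5, "dollar": 0.5},
    ("The", "market"): {"fell": 1.0},
    ("The", "dollar"): {"rose": 1.0},
    ("market", "fell"): {None: 1.0},
    ("dollar", "rose"): {None: 1.0},
    ("fell", None): {None: 1.0},
    ("rose", None): {None: 1.0},
}

def generate(model, start=(None, None), max_words=50, rng=None):
    """Sample words until two padding symbols appear or max_words is reached."""
    rng = rng or random.Random()
    text = list(start)
    while len(text) < max_words + 2:
        followers = model.get(tuple(text[-2:]))
        if not followers:
            break  # unseen context: stop instead of looping forever
        words = list(followers)
        weights = list(followers.values())
        text.append(rng.choices(words, weights=weights, k=1)[0])
        if text[-2:] == [None, None]:
            break
    return " ".join(t for t in text if t)

# A seeded generator makes the sample reproducible across runs
sentence = generate(toy_model, rng=random.Random(0))
print(sentence)
```

With this toy table the only possible outputs are "The market fell" and "The dollar rose". Passing a seeded `random.Random` instance keeps the result reproducible without touching the global random state.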