## Building a Basic Language Model

Let’s build a basic language model using trigrams from the Reuters corpus. The Reuters corpus is a collection of 10,788 news documents totaling 1.3 million words.

In [1]:
from collections import defaultdict

import nltk
from nltk import trigrams
from nltk.corpus import reuters

# Fetch the corpus and the Punkt sentence tokenizer data (skipped if already present).
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\o.adejumobi\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\o.adejumobi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
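# model[(w1, w2)][w3] holds the count, and later the probability, of w3 following (w1, w2).
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count every trigram; padding adds None tokens that mark sentence start and end.
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

# Turn the counts into probabilities by dividing by the total for each two-word context.
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

# Distribution over the words that follow "today the".
print(dict(model['today', 'the']))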

{'public': 0.05555555555555555, 'European': 0.05555555555555555, 'Bank': 0.05555555555555555, 'price': 0.1111111111111111, 'emirate': 0.05555555555555555, 'overseas': 0.05555555555555555, 'newspaper': 0.05555555555555555, 'company': 0.16666666666666666, 'Turkish': 0.05555555555555555, 'increase': 0.05555555555555555, 'options': 0.05555555555555555, 'Higher': 0.05555555555555555, 'pound': 0.05555555555555555, 'Italian': 0.05555555555555555, 'time': 0.05555555555555555}


This gives us every word that followed "today the" in the corpus, along with its probability of coming next.
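
To make the counting and normalizing steps easier to inspect, here is a minimal sketch on a three-sentence toy corpus. It uses only the standard library; `padded_trigrams`, `train_trigram_model`, and the toy sentences are hypothetical names and data introduced for illustration, and the manual padding with two `None` tokens on each side is meant to mirror what `pad_left=True, pad_right=True` does for trigrams.

```python
from collections import Counter, defaultdict


def padded_trigrams(tokens):
    """Yield trigrams after padding the sentence with two None tokens on each side."""
    padded = [None, None] + list(tokens) + [None, None]
    return zip(padded, padded[1:], padded[2:])


def train_trigram_model(sentences):
    """Map each two-word context to a {next_word: probability} dictionary."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for w1, w2, w3 in padded_trigrams(sentence):
            counts[(w1, w2)][w3] += 1

    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {word: count / total for word, count in followers.items()}
    return model


# Toy data (an assumption for illustration, not taken from Reuters).
toy_sentences = [
    ["the", "price", "of", "oil", "rose"],
    ["the", "price", "of", "gold", "fell"],
    ["the", "price", "is", "high"],
]
toy_model = train_trigram_model(toy_sentences)

# "of" follows ("the", "price") in two of three sentences, "is" in one.
print(toy_model[("the", "price")])
```

Each context's probabilities sum to 1, which is what lets us sample from them later.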

Now, if we pick the word "price", we can make a new prediction from the last two words, "the" and "price":

In [4]:
print(dict(model['the', 'price']))

{'yesterday': 0.004651162790697674, 'of': 0.3209302325581395, 'it': 0.05581395348837209, 'effect': 0.004651162790697674, 'cut': 0.009302325581395349, 'for': 0.05116279069767442, 'paid': 0.013953488372093023, 'to': 0.05581395348837209, 'increases': 0.013953488372093023, 'used': 0.004651162790697674, 'climate': 0.004651162790697674, '.': 0.023255813953488372, 'cuts': 0.009302325581395349, 'reductions': 0.004651162790697674, 'limit': 0.004651162790697674, 'now': 0.004651162790697674, 'moved': 0.004651162790697674, 'per': 0.013953488372093023, 'adjustments': 0.004651162790697674, '(': 0.009302325581395349, 'slumped': 0.004651162790697674, 'is': 0.018604651162790697, 'move': 0.004651162790697674, 'evolution': 0.004651162790697674, 'differentials': 0.009302325581395349, 'went': 0.004651162790697674, 'the': 0.013953488372093023, 'factor': 0.004651162790697674, 'Royal': 0.004651162790697674, ',': 0.018604651162790697, 'again': 0.004651162790697674, 'changes': 0.004651162790697674, 'holds': 0.0
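
The raw dictionary is hard to scan, so it can help to rank the candidates. The sketch below assumes a hypothetical helper `top_k` and uses only a handful of entries copied from the printed output above; with the real model you would pass `model['the', 'price']` instead.

```python
def top_k(distribution, k=3):
    """Return the k most probable (word, probability) pairs, highest first."""
    return sorted(distribution.items(), key=lambda item: item[1], reverse=True)[:k]


# A few entries copied from the printed distribution for ('the', 'price');
# the full dictionary has many more.
the_price = {
    "yesterday": 0.004651162790697674,
    "of": 0.3209302325581395,
    "it": 0.05581395348837209,
    "for": 0.05116279069767442,
    "paid": 0.013953488372093023,
}

# The two most likely continuations in this subset are "of" and "it".
print(top_k(the_price, k=2))
```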

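If we keep repeating this process, we build up a sentence one word at a time. Because each word depends only on the previous two, the result tends to be locally plausible rather than coherent as a whole. Here is a script to play around with generating a random piece of text using our n-gram model: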

In [10]:
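import random

# Seed the text with two words; each new word is sampled given the previous two.
text = ["today", "the"]
sentence_finished = False

while not sentence_finished:
    # Draw a threshold in [0, 1) and walk the cumulative distribution until it is reached.
    r = random.random()
    print(f"r: {r}")
    accumulator = 0.0
    candidates = model[tuple(text[-2:])]
    if not candidates:
        break  # unseen context: nothing to sample from

    for word, probability in candidates.items():
        accumulator += probability
        print(f"word: {word}, accumulator: {accumulator}\n")
        if accumulator >= r:
            text.append(word)
            break
    else:
        # Floating-point rounding can leave the total just below r; use the last candidate.
        text.append(word)

    # Two consecutive None tokens are the right-hand padding, i.e. the end of the sentence.
    if text[-2:] == [None, None]:
        sentence_finished = True

# Drop the None padding before printing.
print(' '.join([t for t in text if t]))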

r: 0.2797987467606482
word: public, accumulator: 0.05555555555555555

word: European, accumulator: 0.1111111111111111

word: Bank, accumulator: 0.16666666666666666

word: price, accumulator: 0.2777777777777778

word: emirate, accumulator: 0.33333333333333337

r: 0.3356008957119615
word: ', accumulator: 0.8

r: 0.368195853682852
word: s, accumulator: 1.0

r: 0.8452855769153008
word: exporting, accumulator: 0.00010845986984815618

word: loss, accumulator: 0.0008676789587852494

word: alleged, accumulator: 0.0020607375271149675

word: foreign, accumulator: 0.008459869848156183

word: largest, accumulator: 0.020824295010845987

word: trade, accumulator: 0.03177874186550976

word: biggest, accumulator: 0.0351409978308026

word: concerns, accumulator: 0.03600867678958785

word: two, accumulator: 0.03774403470715835

word: ruling, accumulator: 0.039587852494577004

word: avowed, accumulator: 0.03969631236442516

word: deputy, accumulator: 0.04045553145336225

word: grain, accumulator: 0.04208
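
The accumulator loop above is a hand-written weighted random draw. As an alternative sketch, the standard library's `random.choices` can do the same draw in one call. The function name `generate`, the `max_words` cap, and the tiny hard-coded model are assumptions made for illustration; the toy model gives every context exactly one continuation, so the result is deterministic and easy to check.

```python
import random


def generate(model, seed, max_words=50, rng=random):
    """Extend `seed` one word at a time by sampling from model[last two words]."""
    text = list(seed)
    while len(text) < max_words:
        distribution = model.get(tuple(text[-2:]))
        if not distribution:
            break  # unseen context: nothing to sample from
        words = list(distribution)
        weights = [distribution[word] for word in words]
        # Weighted draw, equivalent in effect to walking the cumulative sum.
        text.append(rng.choices(words, weights=weights, k=1)[0])
        if text[-2:] == [None, None]:
            break  # reached the end-of-sentence padding
    return " ".join(token for token in text if token)


# Hypothetical toy model: each context has exactly one continuation.
toy_model = {
    ("today", "the"): {"price": 1.0},
    ("the", "price"): {"rose": 1.0},
    ("price", "rose"): {None: 1.0},
    ("rose", None): {None: 1.0},
}

print(generate(toy_model, ["today", "the"]))  # prints "today the price rose"
```

With the Reuters model, the same function could be called as `generate(model, ["today", "the"])`. The `max_words` cap is a safety limit that the original script does not have.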