# Building a Language Model in Python

### Language Models are used for:
- Text Summarization
- Text Generation
- Machine Translation
- Similarity Searching
- Semantic Searching

### We can define a language model as a model that learns to predict the probability of a sequence of words

### There are two different types of language models:

1. Statistical Language Models: Traditional statistical approaches (n-grams, hidden Markov models, linguistic rules) with the objective of learning the probability distribution of words

2. Neural Language Models: Application of neural networks to model a language

### N-Gram Models:

### The main objective is to predict the probability of a given n-gram within a sequence of words in the language

    In essence, we are predicting the probability of a word 'w' given a history of previous words 'h':
    
    Probability = p( w | h )        
        
### This probability can be computed in two main steps:

1. Applying the chain rule of probability $P(A \cap B) = P(B \mid A) \, P(A)$

2. Applying a strong simplifying assumption to allow us to compute $P(w_1 \dots w_n)$
    
   With this in mind, the chain rule of probability is:
   
   $p(w_1 \dots w_n) = p(w_1) \, p(w_2 \mid w_1) \, p(w_3 \mid w_1 w_2) \, p(w_4 \mid w_1 w_2 w_3) \dots p(w_n \mid w_1 \dots w_{n-1})$

   The main takeaway here is that the rule tells us how to compute the joint probability of a sequence by multiplying the conditional
   probability of each word given all of the words that precede it.
   
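   For simplification purposes, we can assume that each word depends only on the word immediately before it:
   
   $p(w_k \mid w_1 \dots w_{k-1}) \approx p(w_k \mid w_{k-1})$

   This is the bigram (first-order Markov) assumption. The trigram model built below conditions on the two previous words instead: $p(w_k \mid w_{k-2} w_{k-1})$.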
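
To make the chain rule and the simplifying assumption concrete, here is a minimal sketch on a made-up three-sentence corpus. The corpus and the helper names `bigram_prob` and `sentence_prob` are assumptions introduced for illustration only; they are not part of the Reuters model built later. Under the bigram assumption, the probability of a sentence is the probability of its first word multiplied by the conditional probability of each following word given its predecessor.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
corpus = [
    ["the", "company", "said"],
    ["the", "company", "reported"],
    ["the", "market", "said"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

total_words = sum(unigram_counts.values())

# Number of times each word appears as the first element of a bigram
context_counts = Counter()
for (prev, _), count in bigram_counts.items():
    context_counts[prev] += count


def bigram_prob(prev, word):
    """Estimate p(word | prev) as count(prev, word) / count(prev as context)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def sentence_prob(words):
    """p(w_1) * product over k of p(w_k | w_{k-1}), i.e. the bigram approximation."""
    prob = unigram_counts[words[0]] / total_words
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob


# p(the) * p(company | the) * p(said | company) = 1/3 * 2/3 * 1/2 = 1/9
print(sentence_prob(["the", "company", "said"]))
```

A sentence that contains a bigram never seen in the corpus, such as `["the", "said"]`, gets probability 0 under this estimate.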


### With this in mind, let us go ahead and build a language model:

- Corpus: We will build the model using the Reuters corpus as distributed with NLTK, a collection of 10,788 news documents totaling about 1.3 million words.
- Library: We will use the NLTK package to help develop some of the components of our language model.

In [47]:
import nltk
nltk.download('reuters')
from nltk.corpus import reuters
from nltk import bigrams, trigrams

from collections import Counter, defaultdict

[nltk_data] Downloading package reuters to C:\Users\Saleh
[nltk_data]     Alkhalifa\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


In [48]:
# Instantiate the model as a nested defaultdict that maps a context (w1, w2)
# to a dict of {w3: count}; entries that have not been seen default to 0
languageModelCount = defaultdict(lambda: defaultdict(lambda: 0))
languageModelCount

defaultdict(<function __main__.<lambda>()>, {})

Notice that much of the punctuation remains within the sentences and has not been removed:

In [49]:
# Iterate over the first three sentences in reuters
for sent in reuters.sents()[:3]:
    print(sent, "\n")

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.'] 

['They', 'told', 'Reuter', 'correspondents', 'in', 'Asian', 'capitals', 'a', 'U', '.', 'S', '.', 'Move', 'against', 'Japan', 'might', 'boost', 'protectionist', 'sentiment', 'in', 'the', 'U', '.', 'S', '.', 'And', 'lead', 'to', 'curbs', 'on', 'American', 'imports', 'of', 'their', 'products', '.'] 

['But', 'some', 'exporters', 'said', 'that', 'while', 'the', 'conflict', 'would', 'hurt', 'them', 'in', 'the', 'long', '-', 'run', ',', 'in', 'the', 'short', '-', 'term', 'Tokyo', "'", 's', 'loss', 'might', 'be', 'their', 'gain', '.'] 



In [50]:
# Let us explore the trigrams in the first sentence:
for sent in reuters.sents()[:1]:
    # Capture the trigrams
    for w1, w2, w3 in trigrams(sent, pad_right=True, pad_left=True):
        print("{} | {} | {} ".format(w1,w2,w3))
        # Add to the language model
        languageModelCount[(w1, w2)][w3] += 1

None | None | ASIAN 
None | ASIAN | EXPORTERS 
ASIAN | EXPORTERS | FEAR 
EXPORTERS | FEAR | DAMAGE 
FEAR | DAMAGE | FROM 
DAMAGE | FROM | U 
FROM | U | . 
U | . | S 
. | S | .- 
S | .- | JAPAN 
.- | JAPAN | RIFT 
JAPAN | RIFT | Mounting 
RIFT | Mounting | trade 
Mounting | trade | friction 
trade | friction | between 
friction | between | the 
between | the | U 
the | U | . 
U | . | S 
. | S | . 
S | . | And 
. | And | Japan 
And | Japan | has 
Japan | has | raised 
has | raised | fears 
raised | fears | among 
fears | among | many 
among | many | of 
many | of | Asia 
of | Asia | ' 
Asia | ' | s 
' | s | exporting 
s | exporting | nations 
exporting | nations | that 
nations | that | the 
that | the | row 
the | row | could 
row | could | inflict 
could | inflict | far 
inflict | far | - 
far | - | reaching 
- | reaching | economic 
reaching | economic | damage 
economic | damage | , 
damage | , | businessmen 
, | businessmen | and 
businessmen | and | officials 
and | officials | said 
officials | said | . 
said | . | None 
. | None | None 

In [51]:
len(languageModelCount)

49
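
The `None` entries in the trigram output above come from the `pad_left=True` and `pad_right=True` arguments, which let the model represent the start and end of a sentence. As a rough illustration of the idea, the sketch below pads a token list with two `None` values on each side and slides a window of three over it. The function name `padded_trigrams` is hypothetical, and this is not NLTK's implementation, only an approximation of what the padded call produces.

```python
def padded_trigrams(tokens):
    """Return (w1, w2, w3) tuples with two None pads on each side of the sentence."""
    padded = [None, None] + list(tokens) + [None, None]
    return list(zip(padded, padded[1:], padded[2:]))


example = padded_trigrams(["Tokyo", "gains", "."])
for tri in example:
    print(tri)
```

A sentence of n tokens yields n + 2 padded trigrams. The first one has the context `(None, None)`, which is what allows the model to predict the first word of a sentence, and the last one ends in two `None` values, which is the end-of-sentence signal that the text generation loop checks for.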

In [52]:
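# Count the trigrams for the entire corpus.
# Note: the first sentence was already counted in the previous cell, so its
# trigrams are counted twice here unless languageModelCount is re-created first.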
for sent in reuters.sents():
    # Capture the trigrams
    for w1, w2, w3 in trigrams(sent, pad_right=True, pad_left=True):
        # Add to the language model
        languageModelCount[(w1, w2)][w3] += 1

In [53]:
len(languageModelCount)

398630

### Now that we have captured counts of the trigrams, let us go ahead and calculate the probabilities

In [54]:
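# Convert the counts to probabilities in place:
# p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
for w1_w2 in languageModelCount:
    # Total number of times the context (w1, w2) was observed
    total_count = float(sum(languageModelCount[w1_w2].values()))
    for w3 in languageModelCount[w1_w2]:
        languageModelCount[w1_w2][w3] /= total_count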
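
The normalization step is a relative-frequency estimate: each trigram count is divided by the total count of its two-word context. The sketch below repeats it on a few hand-made counts and checks that every context's distribution sums to 1. The counts and the names `toy_counts` and `toy_probs` are assumptions made up for this example. Unlike the cell above, it writes the probabilities to a separate dictionary.

```python
from collections import defaultdict

# Hand-made trigram counts (hypothetical): context -> {next word: count}
toy_counts = defaultdict(lambda: defaultdict(int))
toy_counts[("many", "of")]["the"] = 3
toy_counts[("many", "of")]["its"] = 1
toy_counts[("of", "the")]["company"] = 2

# Build a separate probability table instead of overwriting the counts
toy_probs = {}
for context, next_words in toy_counts.items():
    total = sum(next_words.values())
    toy_probs[context] = {word: count / total for word, count in next_words.items()}

# Each context's distribution should sum to 1 (up to floating-point error)
for context, dist in toy_probs.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

print(toy_probs[("many", "of")])  # {'the': 0.75, 'its': 0.25}
```

Keeping the counts and the probabilities in separate structures is a design choice: the raw counts stay available, and re-running the counting cell cannot mix counts with already normalized values.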

### With the probabilities calculated, we can now use the model to look up the probability of each possible next word, given the two previous words:

In [55]:
dict(sorted(dict(languageModelCount["many", "of"]).items(), key=lambda item: item[1], reverse=True))

{'the': 0.5151515151515151,
 'its': 0.12121212121212122,
 'them': 0.09090909090909091,
 'Asia': 0.06060606060606061,
 'those': 0.06060606060606061,
 'Trailways': 0.030303030303030304,
 'our': 0.030303030303030304,
 'us': 0.030303030303030304,
 'these': 0.030303030303030304,
 'whom': 0.030303030303030304}

In [56]:
dict(sorted(dict(languageModelCount["of", "the"]).items(), key=lambda item: item[1], reverse=True))

{'company': 0.04645009554608261,
 'U': 0.031162722328384535,
 'total': 0.02675290313097163,
 'year': 0.021020138174334853,
 'dollar': 0.016757312950169044,
 'new': 0.013376451565485815,
 'agreement': 0.0094076142878142,
 'International': 0.008819638394825812,
 'world': 0.008819638394825812,
 'two': 0.008525650448331618,
 'country': 0.007790680582096134,
 'merger': 0.007790680582096134,
 'outstanding': 0.0072027046891077464,
 'acquisition': 0.006614728796119359,
 'transaction': 0.006320740849625165,
 'current': 0.006173746876378068,
 'proposed': 0.006173746876378068,
 'shares': 0.006026752903130972,
 'Exchequer': 0.006026752903130972,
 'offer': 0.005585770983389681,
 'government': 0.005291783036895487,
 'economy': 0.005291783036895487,
 'stock': 0.005291783036895487,
 'deal': 0.005144789063648391,
 'sale': 0.004997795090401293,
 'United': 0.004997795090401293,
 'yen': 0.004850801117154197,
 'market': 0.0047038071439071,
 'workforce': 0.004556813170660003,
 'first': 0.004409819197412906,

In [57]:
dict(sorted(dict(languageModelCount["the", "company"]).items(), key=lambda item: item[1], reverse=True))

{"'": 0.2452344152498712,
 'said': 0.21947449768160743,
 '.': 0.08294693456980938,
 ',': 0.04070066975785677,
 'to': 0.03554868624420402,
 'has': 0.02266872746007213,
 'would': 0.02215352910870685,
 'is': 0.02009273570324575,
 'will': 0.019577537351880475,
 'was': 0.0190623390005152,
 'reported': 0.018031942297784646,
 'had': 0.01751674394641937,
 'and': 0.01648634724368882,
 'for': 0.00824317362184441,
 'earned': 0.0072127769191138585,
 'in': 0.006697578567748583,
 ',"': 0.005667181865018032,
 'at': 0.005667181865018032,
 'by': 0.005151983513652756,
 'with': 0.005151983513652756,
 'expects': 0.005151983513652756,
 'might': 0.005151983513652756,
 'as': 0.004121586810922205,
 'may': 0.004121586810922205,
 'does': 0.004121586810922205,
 'lost': 0.0030911901081916537,
 'could': 0.0030911901081916537,
 'or': 0.0030911901081916537,
 'that': 0.0030911901081916537,
 'private': 0.0030911901081916537,
 'should': 0.0030911901081916537,
 'intends': 0.002575991756826378,
 'added': 0.00257599175682
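
For frequent contexts such as `("of", "the")` the full sorted dictionary is very long. One possible way to keep the output readable is a small helper that returns only the most probable continuations. The name `top_k` and the toy probabilities below are assumptions for illustration; with the real model one would pass `languageModelCount["the", "company"]` instead of the toy dictionary.

```python
def top_k(distribution, k=5):
    """Return the k most probable (word, probability) pairs, highest first."""
    return sorted(distribution.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy distribution (hypothetical values) standing in for one context of the model
toy_distribution = {"'": 0.4, "said": 0.35, ".": 0.15, ",": 0.10}
print(top_k(toy_distribution, k=2))  # [("'", 0.4), ('said', 0.35)]
```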

### Let us now go ahead and generate a random segment of text using our model:

In [None]:
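import random

# Starting words (the seed context)
text = ["There", "are"]
sentence_finished = False

while not sentence_finished:
    # Distribution over the next word, given the last two words
    candidates = languageModelCount.get(tuple(text[-2:]))
    if not candidates:
        # Unseen context: stop here instead of looping forever
        break

    # Draw a random threshold in [0, 1) and walk the cumulative distribution
    r = random.random()
    accumulator = 0.0
    for word, prob in candidates.items():
        accumulator += prob
        # Pick the first word whose cumulative probability reaches the threshold
        if accumulator >= r:
            text.append(word)
            break

    # Two trailing None values are the right padding, i.e. the end of the sentence
    if text[-2:] == [None, None]:
        sentence_finished = True

print(' '.join([t for t in text if t]))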
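
The loop above samples the next word by accumulating probabilities until a random threshold is reached. The standard library's `random.choices` performs the same kind of weighted draw in a single call. The sketch below shows that variant on a tiny hand-written model so that it runs without the Reuters data. The names `sample_next` and `toy_model`, and the probabilities in it, are assumptions for illustration.

```python
import random


def sample_next(distribution, rng=random):
    """Draw one next word in proportion to its probability."""
    words = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(words, weights=weights, k=1)[0]


# Toy trigram model (hypothetical): context -> {next word: probability}
toy_model = {
    (None, None): {"prices": 1.0},
    (None, "prices"): {"rose": 0.5, "fell": 0.5},
    ("prices", "rose"): {None: 1.0},
    ("prices", "fell"): {None: 1.0},
    ("rose", None): {None: 1.0},
    ("fell", None): {None: 1.0},
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
text = [None, None]  # start from the left padding, i.e. the beginning of a sentence
while True:
    dist = toy_model.get(tuple(text[-2:]))
    if not dist:
        break  # unseen context
    text.append(sample_next(dist, rng))
    if text[-2:] == [None, None]:
        break  # right padding reached: the sentence is finished

generated = " ".join(t for t in text if t)
print(generated)
```

Starting from `[None, None]` instead of fixed seed words lets the model choose the first word of the sentence itself, since the padded trigrams include that context.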