In this demo, we explore probabilistic language models and use an N-gram model to generate text. We can then estimate the probability of a sentence using the language model.

A language model learns the probability of word occurrences from text sampled from a document collection, or corpus.
Given the text sequence $x_1,..., x_n$ and the vocabulary $V = \{w_1, w_2, w_3,...,w_m\}$, the next word $x_{n+1}$ in the sequence can be any word $w$ in $V$, and the model assigns a probability to each candidate. <p>
The conditional probability is $P(x_{n+1} = w \mid x_n, x_{n-1},...,x_2, x_1)$
We can also think of language modeling as the task of assigning a probability to a sentence or sequence.

Given the sentence = 'I came by bus' --> tokens = ['I', 'came', 'by', 'bus'] <p>
The probability of this sentence according to our language model is the product of the conditional probabilities of each word given all the words before it (the chain rule). $P = P('I')*P('came'|'I')*P('by'|'came','I')*P('bus'|'by','came','I')$
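
The chain-rule product above can be sketched in a few lines of Python. This is only an illustration: the conditional probabilities in `cond_prob` are invented values, not estimates from any corpus, and the helper name `sentence_probability` is hypothetical.

```python
import math

# Hypothetical conditional probabilities P(word | history), keyed by
# (history tuple, word). The values are invented for illustration only.
cond_prob = {
    ((), "I"): 0.2,
    (("I",), "came"): 0.1,
    (("I", "came"), "by"): 0.5,
    (("I", "came", "by"), "bus"): 0.4,
}

def sentence_probability(tokens, cond_prob):
    """Multiply P(w_i | w_1..w_{i-1}) over all tokens (chain rule)."""
    log_p = 0.0
    for i, word in enumerate(tokens):
        history = tuple(tokens[:i])
        p = cond_prob.get((history, word), 0.0)
        if p == 0.0:
            return 0.0  # an unseen event makes the whole product zero
        log_p += math.log(p)  # sum logs to avoid underflow on long sentences
    return math.exp(log_p)

tokens = "I came by bus".split()
p = sentence_probability(tokens, cond_prob)
print(p)
```

With these made-up values the product is 0.2 * 0.1 * 0.5 * 0.4. Summing log-probabilities instead of multiplying directly is a common choice because the raw product shrinks quickly as sentences get longer.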

An N-gram is a sequence of n consecutive words <p>
unigrams = 'I', 'came', 'by', 'bus' <p>
bigrams = 'I came', 'came by', 'by bus'  <p>
trigrams = 'I came by', 'came by bus'  <p>
4-grams = 'I came by bus'  <p>
The assumption of an n-gram language model is that the next word depends only on the previous $n-1$ words <p>
So given a text sequence of $t$ terms, the probability of the next word is defined as:<p>
$P(x_{t+1} \mid x_t, x_{t-1},...,x_2,x_1) = P(x_{t+1} \mid x_t, x_{t-1},...,x_{t-n+2})$ <p>
So for a unigram LM, the next word doesn’t depend on any of the previous words; each word is treated as independent.
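
As a minimal sketch of the definitions above, the following helpers list the n-grams of a token sequence and pick out the $n-1$ words of history that an n-gram LM conditions on. The function names `ngrams` and `context` are hypothetical and are not part of the notebook's code.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def context(tokens, n):
    """Return the last n-1 tokens, the only history an n-gram LM looks at."""
    if n <= 1:
        return ()  # a unigram LM ignores the history entirely
    return tuple(tokens[max(0, len(tokens) - (n - 1)):])

tokens = ["I", "came", "by", "bus"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
print(context(tokens, 3))
```

A sequence of $t$ tokens yields $t-n+1$ n-grams, which matches the lists above: three bigrams, two trigrams, and one 4-gram for the four-token sentence.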

Suppose we use a 4-gram LM to generate text and the text so far is "If you resort to making fun of someone’s appearance, you lost the ...". Only the last (4–1)=3 words affect the next word, i.e. 'you lost the'. <p>
The probability of the next upcoming word is $P(w_{t+1}|"you\ lost\ the") = \frac{P("you\ lost\ the\ w_{t+1}")} {P("you\ lost\ the")}$ <p>
If, in the corpus, ‘you lost the’ occurred 10000 times and ‘you lost the game’ occurred 2000 times, then, estimating the ratio of probabilities by the ratio of counts, <p>
$P("game" |"you\ lost\ the") = \frac{P("you\ lost\ the\ game")} {P("you\ lost\ the")} \approx \frac{2000}{10000}=0.20$
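
The count-based estimate above can be sketched as follows. The corpus here is a tiny made-up token list, and `next_word_distribution` is a hypothetical helper. It divides the count of each continuation by the number of times the context is followed by any word, so an occurrence of the context at the very end of the corpus is not counted.

```python
from collections import Counter

def next_word_distribution(corpus_tokens, context):
    """Estimate P(w | context) as count(context + w) / count(context + any word)."""
    context = tuple(context)
    k = len(context)
    followers = Counter(
        corpus_tokens[i + k]
        for i in range(len(corpus_tokens) - k)
        if tuple(corpus_tokens[i:i + k]) == context
    )
    total = sum(followers.values())
    if total == 0:
        return {}  # the context never occurs: the sparsity problem
    return {w: c / total for w, c in followers.items()}

# Tiny made-up corpus, for illustration only.
toy = "you lost the game . you lost the bet . you lost the game again".split()
dist = next_word_distribution(toy, ["you", "lost", "the"])
print(dist)
```

In this toy corpus 'you lost the' occurs three times, followed twice by 'game' and once by 'bet', so the estimates are 2/3 and 1/3. An empty result signals that the context was never seen, which is the situation a back-off scheme has to handle.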

Let us try text generation with the help of the Brown Corpus. We start with a 4-gram LM. If the 4-gram LM runs into a sparsity problem when predicting the next word (the last three words never occur together in the corpus), we back off to the trigram LM. If the same problem occurs with the trigram LM, we back off to the bigram LM, and if it occurs there as well, we back off to the unigram LM. Since the unigram LM doesn’t depend on the previous words, it simply chooses a word at random from the corpus.
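
Before turning to the Brown Corpus, the back-off idea can be sketched on a toy corpus. This is a simplified illustration under stated assumptions, not the class defined in the cells that follow: `followers` and `backoff_predict` are hypothetical names, the corpus is made up, and the sketch returns the single most frequent continuation instead of a top-3 list.

```python
import random
from collections import Counter

def followers(corpus_tokens, context):
    """Collect every word that follows `context` in the corpus."""
    context = list(context)
    k = len(context)
    return [
        corpus_tokens[i + k]
        for i in range(len(corpus_tokens) - k)
        if list(corpus_tokens[i:i + k]) == context
    ]

def backoff_predict(corpus_tokens, tokens, max_n=4, rng=random):
    """Try the longest context first and shorten it until a match is found."""
    for n in range(max_n, 1, -1):  # 4-gram, then trigram, then bigram
        if len(tokens) < n - 1:
            continue  # not enough history for this order
        found = followers(corpus_tokens, tokens[-(n - 1):])
        if found:
            word, _ = Counter(found).most_common(1)[0]
            return n, word
    # Unigram fallback: no context matched, so draw a word at random.
    return 1, rng.choice(corpus_tokens)

# Tiny made-up corpus, for illustration only.
toy = "the cat sat on the mat and the cat ran".split()
order, word = backoff_predict(toy, ["a", "dog", "on", "the"])
print(order, word)
```

Here the 4-gram context 'dog on the' never occurs in the toy corpus, so the sketch backs off to the trigram context 'on the', which is followed by 'mat'.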

In [3]:
# import the corpus
import numpy as np
import nltk
nltk.download('brown')
from nltk.corpus import brown
words = list(brown.words())

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


In [22]:
len(words)

1161192

Let us generate the next 10 words or tokens for this sentence <p>
I am planning to ......

In [7]:
start_sentence = 'I am planning to'

In [13]:
# define the N-gram LM class

class NGrams:

  def __init__(self, words, sentence):
    self.words = words
    self.sentence = sentence
    self.tokens = sentence.split()

  def get_tokens(self):
    return self.tokens

  def add_tokens(self,value):
    # append the chosen word so it becomes part of the context
    self.tokens.append(value)
    return self.tokens

  def unigram_model(self):
    # no context is used: draw 3 candidate words at random from the corpus
    self.next_words = np.random.choice(self.words, size=3)
    return self.next_words

  def bigram_model(self):
    # collect every word that follows the last token in the corpus
    next_words = []
    for i in range(len(self.words)-1):
      if self.words[i] == self.tokens[-1]:
        next_words.append(self.words[i+1])
    self.next_words = next_words
    return self.next_words

  def trigram_model(self):
    # collect every word that follows the last two tokens in the corpus
    next_words = []
    for i in range(len(self.words)-2):
      if self.words[i] == self.tokens[-2]:
        if self.words[i+1] == self.tokens[-1]:
          next_words.append(self.words[i+2])
    self.next_words = next_words
    return self.next_words

  def fourgram_model(self):
    # collect every word that follows the last three tokens in the corpus
    next_words = []
    for i in range(len(self.words)-3):
      if self.words[i] == self.tokens[-3]:
        if self.words[i+1] == self.tokens[-2]:
          if self.words[i+2] == self.tokens[-1]:
            next_words.append(self.words[i+3])
    self.next_words = next_words
    return self.next_words

  def get_top_3_next_words(self,next_words):
    # count how often each candidate word follows the context
    next_words_dict = dict()
    for word in next_words:
      next_words_dict[word] = next_words_dict.get(word, 0) + 1

    # convert counts to relative frequencies, rounded to 2 decimals
    for i,j in next_words_dict.items():
      next_words_dict[i] = np.round(j/len(next_words),2)

    # highest probability first; ties are broken by reverse lexicographic order
    return sorted(next_words_dict.items(), key = lambda k:(k[1], k[0]), reverse=True)[:3]

  def model_selection(self):
    # back off from the longest context to the shortest; each model scans
    # the corpus only once and the first one that finds a match is used
    candidates = (("fourgram-model", self.fourgram_model, 3),
                  ("trigram-model", self.trigram_model, 2),
                  ("bigram-model", self.bigram_model, 1))
    for name, model, context_len in candidates:
      if len(self.tokens) < context_len:
        continue  # not enough tokens to form this context
      next_words = model()
      if len(next_words) > 0:
        top_words = self.get_top_3_next_words(next_words)
        print(name)
        return top_words
    # no context matched: fall back to random words from the corpus
    top_words = self.unigram_model()
    print("unigram-model")
    return top_words




In [14]:
model = NGrams(words=words, sentence=start_sentence)

In [16]:
for i in range(10):
  values = model.model_selection()
  print(values)
  value = input()
  model.add_tokens(value)

fourgram-model
[('price', 1.0)]
price
fourgram-model
[('for', 1.0)]
for
fourgram-model
[('the', 0.6), ('prime', 0.2), ('common', 0.2)]
the
fourgram-model
[('month', 0.5), ('oil', 0.25), ("Indians'", 0.25)]
month
fourgram-model
[('in', 1.0)]
in
fourgram-model
[('which', 1.0)]
which
fourgram-model
[('the', 1.0)]
the
fourgram-model
[('sale', 0.03), ('outcome', 0.03), ('gospel', 0.03)]
sale
fourgram-model
[('occurred', 1.0)]
occurred
fourgram-model
[('as', 1.0)]
as


In [17]:
print(model.get_tokens())

['I', 'am', 'planning', 'to', 'use', 'the', 'U.S.', 'mails', 'to', 'defraud', 'as', 'long', 'as', 'the', 'market', 'price', 'for', 'the', 'month', 'in', 'which', 'the', 'sale', 'occurred', 'as']


In [18]:
# join the tokens for a complete text
print(" ".join(model.get_tokens()))

I am planning to use the U.S. mails to defraud as long as the market price for the month in which the sale occurred as
