# Practice with Language Modeling #

In [1]:
import nltk
from collections import Counter
from pprint import pprint

## The Chain Rule From Probability Theory ##


$P(w_{1} w_{2} \ldots w_{n}) = \prod_{i=1}^{n} P(w_{i} | w_{1} w_{2} \ldots w_{i-1})$

```
P(its water is so transparent) = 
P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)
```

Let's assume the conditional probabilities shown in the code below.

In [2]:
its = .05
water_given_its = .01
is_given_its_water = .001
so_given_its_water_is = .0008
transparent_given_its_water_is_so = .00000001

By the chain rule, the probability of the full sentence is the product of these conditional probabilities.

In [3]:
p = its * water_given_its * is_given_its_water * so_given_its_water_is * transparent_given_its_water_is_so

print("P(its water is so transparent) = {}".format(p))

P(its water is so transparent) = 4e-18
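
Products of many small probabilities shrink quickly and can underflow for longer sentences. A common workaround, sketched here and not part of the original notebook, is to add log probabilities instead of multiplying the raw values. The helper name `sentence_log_prob` is hypothetical, and the numbers are the same assumed probabilities used above.

```python
import math

# Assumed conditional probabilities, copied from the cell above.
probs = [.05, .01, .001, .0008, .00000001]

def sentence_log_prob(probs):
    """Sum natural-log probabilities (equivalent to multiplying the raw values)."""
    return sum(math.log(p) for p in probs)

log_p = sentence_log_prob(probs)
print("log P = {:.4f}".format(log_p))
# Exponentiating recovers the product, up to floating-point rounding.
print("P     = {:.3e}".format(math.exp(log_p)))
```

Working in log space keeps the intermediate values in a comfortable numeric range; the final probability is only recovered if it is actually needed.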


## The Maximum Likelihood Estimate ##

$P(w_{i}|w_{i-1}) = \dfrac{count(w_{i-1}, w_{i})}{count(w_{i-1})}$
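
As a worked instance, take the three-sentence corpus defined in the next cell: the word "I" occurs three times and is followed by "am" twice, so

$P(am|I) = \dfrac{count(I, am)}{count(I)} = \dfrac{2}{3} \approx 0.667$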

In [4]:
# pre-tokenized sentences
sents = [["<s>", "I", "am", "Sam", "</s>"],
         ["<s>", "Sam" , "I", "am", "</s>"],
         ["<s>", "I", "do", "not", "like", "green", "eggs","and", "ham", "</s>"]]
        
# count bigrams across all sentences
bigram = [list(nltk.bigrams(s)) for s in sents]
flat_bigram = [x for row in bigram for x in row]
numerator = Counter(flat_bigram)
pprint(numerator)

{('<s>', 'I'): 2,
 ('<s>', 'Sam'): 1,
 ('I', 'am'): 2,
 ('I', 'do'): 1,
 ('Sam', '</s>'): 1,
 ('Sam', 'I'): 1,
 ('am', '</s>'): 1,
 ('am', 'Sam'): 1,
 ('and', 'ham'): 1,
 ('do', 'not'): 1,
 ('eggs', 'and'): 1,
 ('green', 'eggs'): 1,
 ('ham', '</s>'): 1,
 ('like', 'green'): 1,
 ('not', 'like'): 1}


In [5]:
# collect the distinct words that occur first in a bigram
words = {x for (x, y) in numerator}
words

{'<s>', 'I', 'Sam', 'am', 'and', 'do', 'eggs', 'green', 'ham', 'like', 'not'}

In [6]:
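def denominator_counts(words, sents):
    '''
    count how many times each word occurs across all sentences,
    i.e. count(w_{i-1}) in the MLE formula
    '''
    denominator = {}
    for w in words:
        # list.count tallies every occurrence, so a word repeated
        # within one sentence is counted each time
        denominator[w] = sum(s.count(w) for s in sents)
    return denominator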

In [7]:
denominator = denominator_counts(words, sents)
denominator

{'<s>': 3,
 'I': 3,
 'Sam': 2,
 'am': 2,
 'and': 1,
 'do': 1,
 'eggs': 1,
 'green': 1,
 'ham': 1,
 'like': 1,
 'not': 1}
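
For this corpus, the denominators match a plain count of token occurrences. The sketch below shows one way to get them with `Counter`; the `sents` list is repeated so the block runs on its own, and `unigram_counts` is an illustrative name not used elsewhere in the notebook.

```python
from collections import Counter

# Same pre-tokenized sentences as above.
sents = [["<s>", "I", "am", "Sam", "</s>"],
         ["<s>", "Sam", "I", "am", "</s>"],
         ["<s>", "I", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"]]

# Count every token occurrence across all sentences.
unigram_counts = Counter(w for s in sents for w in s)

# Keep only words that can start a bigram (everything except "</s>").
denominator = {w: c for w, c in unigram_counts.items() if w != "</s>"}
print(denominator)
```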

In [8]:
# divide each bigram count by the count of its first word to get the MLE;
# build a new dict so the raw counts in numerator are left unchanged
MLE = {bigram: count / denominator[bigram[0]]
       for bigram, count in numerator.items()}

In [9]:
pprint(MLE)

{('<s>', 'I'): 0.6666666666666666,
 ('<s>', 'Sam'): 0.3333333333333333,
 ('I', 'am'): 0.6666666666666666,
 ('I', 'do'): 0.3333333333333333,
 ('Sam', '</s>'): 0.5,
 ('Sam', 'I'): 0.5,
 ('am', '</s>'): 0.5,
 ('am', 'Sam'): 0.5,
 ('and', 'ham'): 1.0,
 ('do', 'not'): 1.0,
 ('eggs', 'and'): 1.0,
 ('green', 'eggs'): 1.0,
 ('ham', '</s>'): 1.0,
 ('like', 'green'): 1.0,
 ('not', 'like'): 1.0}
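
These estimates can be combined to score a whole sentence if each word is assumed to depend only on the previous word (a bigram approximation of the chain rule, which the notebook does not state explicitly). The following is a minimal sketch under that assumption; `bigram_prob` and `sentence_prob` are hypothetical helper names, and the bigrams are built with `zip` so the block does not need `nltk`.

```python
from collections import Counter

# Same pre-tokenized sentences as above.
sents = [["<s>", "I", "am", "Sam", "</s>"],
         ["<s>", "Sam", "I", "am", "</s>"],
         ["<s>", "I", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"]]

# Bigram counts and counts of each word as a bigram's first element.
bigram_counts = Counter(b for s in sents for b in zip(s, s[1:]))
context_counts = Counter(w for s in sents for w in s[:-1])

def bigram_prob(prev, word):
    """MLE of P(word | prev); returns 0.0 for unseen bigrams or contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_prob(tokens):
    """Multiply the bigram estimates over a tokenized sentence."""
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob(["<s>", "I", "am", "Sam", "</s>"]))
print(sentence_prob(["<s>", "I", "like", "ham", "</s>"]))
```

The first sentence appears in the corpus and gets a non-zero score. The second contains the unseen bigram `('I', 'like')`, so its unsmoothed estimate is zero.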
