----
Exercises: Language Modeling
====

Let's practice bigrams, MLE, and Laplace smoothing.

In [3]:
corpus = "en såg såg en såg en såg såg , en annan sågade en sågen sågen såg . </s>".split()
vocabulary = set(corpus)
len(vocabulary)

8

In [4]:
import nltk

In [7]:
cfd = nltk.ConditionalFreqDist(nltk.bigrams(corpus))
cfd

ConditionalFreqDist(nltk.probability.FreqDist,
                    {',': FreqDist({'en': 1}),
                     '.': FreqDist({'</s>': 1}),
                     'annan': FreqDist({'sågade': 1}),
                     'en': FreqDist({'annan': 1, 'såg': 3, 'sågen': 1}),
                     'såg': FreqDist({',': 1, '.': 1, 'en': 2, 'såg': 2}),
                     'sågade': FreqDist({'en': 1}),
                     'sågen': FreqDist({'såg': 1, 'sågen': 1})})

Describe cfd in your own words:

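cfd is a conditional frequency distribution built from the bigrams of the corpus. For each first word (the condition) it tallies how often each word follows it, so cfd[a][b] is the count of the bigram (a, b) and cfd[a].N() is the total number of bigrams that start with a.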
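
To make that structure concrete, here is a minimal sketch that rebuilds the same table with the standard library only. The names `build_cfd` and `manual_cfd` are made up for this illustration and are not part of NLTK.

```python
from collections import Counter, defaultdict

corpus = "en såg såg en såg en såg såg , en annan sågade en sågen sågen såg . </s>".split()

def build_cfd(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    table = defaultdict(Counter)
    # zip pairs every token with its successor, which yields the bigrams.
    for first, second in zip(tokens, tokens[1:]):
        table[first][second] += 1
    return table

manual_cfd = build_cfd(corpus)

# Counts of the words seen directly after "en".
print(manual_cfd["en"])
# Total number of bigrams that start with "såg" (the analogue of cfd["såg"].N()).
print(sum(manual_cfd["såg"].values()))
```

An unseen continuation such as `manual_cfd["såg"]["sågade"]` simply returns 0, because a `Counter` reports zero for missing keys.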

In [11]:
sentence = "såg såg sågade en sågen ".split()

In [12]:
# The corpus counts of each bigram in the sentence:
print("word 1", "word 2", "bigram count", sep="\t")
[print(a, b, cfd[a][b], sep="\t") for (a,b) in nltk.bigrams(sentence)];

word 1	word 2	bigram count
såg	såg	2
såg	sågade	0
sågade	en	1
en	sågen	1


In [21]:
# The corpus counts for each word in the sentence:
# from collections import Counter

# counts = Counter(sentence)
# counts

corpus_count_unigram = [corpus.count(word) for word in sentence]

corpus_count_unigram

[6, 6, 1, 5, 2]

In [22]:
assert corpus_count_unigram == [6, 6, 1, 5, 2]

In [23]:
# The MLE probability for each bigram:
print("word 1", "word 2", "MLE probability", sep="\t")
[print(a, b, (cfd[a][b]/cfd[a].N()), sep="\t") for (a,b) in nltk.bigrams(sentence)];

word 1	word 2	MLE probability
såg	såg	0.3333333333333333
såg	sågade	0.0
sågade	en	1.0
en	sågen	0.2


In [56]:
# Repeat using the built-in FreqDist.freq method for the MLE probability:
[cfd[a].freq(b) for (a,b) in nltk.bigrams(sentence)]

[0.3333333333333333, 0.0, 1.0, 0.2]

In [48]:
# The probability of the sentence is the product of all bigram probabilities:
from functools import reduce

prob_bigram = [cfd[a][b]/cfd[a].N() for (a,b) in nltk.bigrams(sentence)]
reduce(lambda x,y:x*y, prob_bigram)

0.0

That is not a great model, because it assigns probability zero to a perfectly possible sentence just because one of its bigrams (såg, sågade) never occurs in the corpus.
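
The zero comes from a single factor. As a small sketch, the snippet below hard-codes the MLE values from the table above (bigram count divided by the number of bigrams starting with the first word) and compares the product with and without the unseen bigram. It assumes Python 3.8 or later for `math.prod`; the variable names are illustrative only.

```python
import math

# MLE probabilities copied from the table above: count(a, b) / count(a, *).
mle_probs = [2 / 6, 0 / 6, 1 / 1, 1 / 5]

# A single zero factor forces the whole product to zero.
sentence_prob = math.prod(mle_probs)

# Dropping the unseen bigram (såg, sågade) shows the other factors are non-zero.
seen_only = math.prod(p for p in mle_probs if p > 0)

print(sentence_prob)
print(seen_only)
```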

In [50]:
# Laplace smoothing of each bigram count:
[1 + cfd[a][b] for (a,b) in nltk.bigrams(sentence)]

[3, 1, 2, 2]

In [52]:
# We need to normalise the counts for each word:
[len(vocabulary) + cfd[a].N() for (a,b) in nltk.bigrams(sentence)]

[14, 14, 9, 13]
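
Put together, the two lists above are the numerator and denominator of the add-one estimate. The notation below is an assumption made for this note and does not come from the notebook: C(·) stands for a corpus count and V for the vocabulary size, here `len(vocabulary)` = 8.

```latex
P_{\text{Laplace}}(w_2 \mid w_1) = \frac{C(w_1\, w_2) + 1}{C(w_1\, \cdot) + V}

% Worked example for the unseen bigram (såg, sågade):
P_{\text{Laplace}}(\text{sågade} \mid \text{såg}) = \frac{0 + 1}{6 + 8} = \frac{1}{14} \approx 0.0714
```

Here C(w_1 ·) is the number of bigrams that start with w_1, which is what `cfd[a].N()` returns.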

In [61]:
#TODO: Calculate and print the smoothed Laplace probability for each bigram:
print("word 1", "word 2", "Laplace smoothed probability", sep="\t")
# [print(a, b, (cfd[a].freq(b)), sep="\t") for (a,b) in nltk.bigrams(sentence)];
smoothed_out = [(1 + cfd[a][b])/(len(vocabulary) + cfd[a].N()) for (a,b) in nltk.bigrams(sentence)]
smoothed_out

word 1	word 2	Laplace smoothed probability


[0.21428571428571427,
 0.07142857142857142,
 0.2222222222222222,
 0.15384615384615385]

In [62]:
assert smoothed_out == [0.21428571428571427,
 0.07142857142857142,
 0.2222222222222222,
 0.15384615384615385]
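
A quick sanity check on the normalisation: for a fixed first word, the smoothed probabilities over the whole vocabulary should still sum to one. The sketch below verifies this for the context "såg" using the standard library only; `followers` and `laplace` are hypothetical names chosen for this example.

```python
import math
from collections import Counter

corpus = "en såg såg en såg en såg såg , en annan sågade en sågen sågen såg . </s>".split()
vocabulary = set(corpus)

# Counts of every word that directly follows "såg" in the corpus.
followers = Counter(b for a, b in zip(corpus, corpus[1:]) if a == "såg")

def laplace(word):
    """Add-one estimate of P(word | "såg")."""
    return (1 + followers[word]) / (len(vocabulary) + sum(followers.values()))

total = sum(laplace(w) for w in vocabulary)
# Probability mass handed to words never seen after "såg".
unseen_mass = sum(laplace(w) for w in vocabulary if followers[w] == 0)

print(math.isclose(total, 1.0))
print(unseen_mass)
```

The mass in `unseen_mass` is what smoothing takes away from the observed bigrams in order to give unseen ones a non-zero probability.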

In [63]:
# The smoothed probability of the sentence:
reduce(lambda x,y:x*y, smoothed_out)

0.0005232862375719518

In [64]:
assert round(reduce(lambda x,y:x*y, smoothed_out),6) == 0.000523

How can we interpret this probability?

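This is the probability that the Laplace-smoothed bigram model, estimated from the corpus, assigns to the sentence. It is the product of the smoothed probabilities of the sentence's bigrams, and it is no longer zero because the unseen bigram (såg, sågade) now gets a small share of the probability mass.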
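
For longer sentences, a product of many small probabilities can underflow, so a common alternative is to add log probabilities instead. This is a minimal sketch of that idea, not something the exercise requires; the smoothed values are written as the fractions computed earlier and the variable names are illustrative.

```python
import math

# Laplace-smoothed bigram probabilities from above, written as fractions.
smoothed = [3 / 14, 1 / 14, 2 / 9, 2 / 13]

# Summing logs is equivalent to taking the log of the product.
log_prob = sum(math.log(p) for p in smoothed)

print(log_prob)
# Exponentiating recovers (approximately) the sentence probability.
print(math.exp(log_prob))
```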

------
Here is how it would look all together in a grown-up codebase.

In [66]:
# MLEProbDist is the unsmoothed probability distribution:
cpd_mle = nltk.ConditionalProbDist(cfd,
                                   nltk.MLEProbDist,
                                   bins=len(vocabulary))

In [67]:
# Now we can get the MLE probabilities by using the .prob method:
print("word 1", "word 2", "MLE probability", sep="\t")
[print(a, b, cpd_mle[a].prob(b), sep="\t") for (a,b) in nltk.bigrams(sentence)];

word 1	word 2	MLE probability
såg	såg	0.3333333333333333
såg	sågade	0.0
sågade	en	1.0
en	sågen	0.2


In [68]:
# LaplaceProbDist is the add-one smoothed ProbDist:
cpd_laplace = nltk.ConditionalProbDist(cfd, 
                                       nltk.LaplaceProbDist, 
                                       bins=len(vocabulary))

In [69]:
# Getting the Laplace probabilities is the same as for MLE:
print("word 1", "word 2", "Laplace smoothed probability", sep="\t")
[print(a, b, cpd_laplace[a].prob(b), sep="\t") for (a,b) in nltk.bigrams(sentence)];

word 1	word 2	Laplace smoothed probability
såg	såg	0.21428571428571427
såg	sågade	0.07142857142857142
sågade	en	0.2222222222222222
en	sågen	0.15384615384615385


----
