## Smoothing assignment

To start this assignment, download the Brown corpus.

In [5]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

## Background

In the Pollard assignment you computed a unigram frequency distribution for the Brown corpus. You will need that for this assignment.

This time you will do a bigram distribution:

In [6]:
import nltk
from nltk.corpus import brown
from nltk import bigrams
brown_bigrams = list(bigrams(brown.words()))

It is instructive to compare `brown.words()`, which we used in the last assignment, with `brown_bigrams`:

In [8]:
brown.words()[:10]
#['The', 'Fulton', 'County', 'Grand', 'Jury', 'said',
# 'Friday', 'an', 'investigation', 'of']

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']

In [9]:
brown_bigrams[:10]
#[('The', 'Fulton'), ('Fulton', 'County'), ('County', 'Grand'),
#('Grand', 'Jury'), ('Jury', 'said'), ('said', 'Friday'), ('Friday', 'an'),
#('an', 'investigation'), ('investigation', 'of'), ('of', "Atlanta's")]

[('The', 'Fulton'),
 ('Fulton', 'County'),
 ('County', 'Grand'),
 ('Grand', 'Jury'),
 ('Jury', 'said'),
 ('said', 'Friday'),
 ('Friday', 'an'),
 ('an', 'investigation'),
 ('investigation', 'of'),
 ('of', "Atlanta's")]

So `brown.words()` returns a list of the words, while `brown_bigrams` is a list of word pairs. Notice that the second word of the first pair becomes the first word of the second pair, the second word of the second pair becomes the first word of the third, and so on. Since every word in Brown except the last becomes the first word of a bigram, there is exactly one more word token than there are bigram tokens:

In [10]:
len(brown_bigrams)
#1161191

1161191

In [11]:
len(brown.words())
#1161192

1161192
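
The same off-by-one relationship holds for any token sequence, not just Brown. Here is a minimal sketch on a made-up sentence, using plain `zip` as a stand-in for `nltk.bigrams`; the names `toy_words` and `toy_bigrams` are illustrative and not part of the assignment.

```python
# Toy sentence (hypothetical, not taken from Brown).
toy_words = ['the', 'jury', 'said', 'the', 'jury', 'met']

# Pair each word with its successor; the last word starts no pair.
toy_bigrams = list(zip(toy_words, toy_words[1:]))

print(toy_bigrams)
print(len(toy_words), len(toy_bigrams))
```

Six word tokens give five bigram tokens, because 'met' is never the first member of a pair.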

## Questions

Create a new frequency distribution of the Brown bigrams. Plot the cumulative frequency distribution of the top 50 bigrams.

Then do add-one smoothing on the bigrams. This requires adding one to all the bigram counts, including those that previously had count 0. You will also need to change the unigram counts appropriately. You will compute all possible bigrams using the known vocabulary, so use the keys of the unigram Brown distribution you created before to compute the set of possible bigrams. The vocabulary size from that exercise should be 49815. Then, having added 1 to all the bigram counts, you must compute at least the following probabilities:


1. P(the | in) before and after smoothing ($P_{\text{mle}}$ and $P_{\text{laplace}}$);

2. P(in the) before and after smoothing;

3. P(said the) before and after smoothing;

4. P(the | said) before and after smoothing.

In some cases you will need to use the unigram counts to compute these probabilities. Remember that the unigram counts must change too when smoothing.
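
To see how the counts change, here is a minimal sketch of add-one smoothing on a six-word toy corpus rather than Brown. The function names and the toy data are assumptions made for illustration. The sketch uses the unigram count of the first word as the denominator of the conditional probability, and it assumes that adding 1 to every possible bigram raises each unigram count by V and the bigram total by V squared.

```python
from collections import Counter

# Toy corpus (hypothetical); the assignment uses Brown instead.
toy_words = ['the', 'jury', 'said', 'the', 'jury', 'met']
toy_bigrams = list(zip(toy_words, toy_words[1:]))

unigram_counts = Counter(toy_words)
bigram_counts = Counter(toy_bigrams)
V = len(unigram_counts)          # number of word types
toy_bigram_N = len(toy_bigrams)  # number of bigram tokens

def p_cond_mle(w1, w2):
    # P_mle(w2 | w1) = c(w1, w2) / c(w1)
    return bigram_counts[w1, w2] / unigram_counts[w1]

def p_cond_laplace(w1, w2):
    # Each of the V bigrams starting with w1 gains 1, so c(w1) grows by V.
    return (bigram_counts[w1, w2] + 1) / (unigram_counts[w1] + V)

def p_joint_mle(w1, w2):
    # P_mle(w1 w2) = c(w1, w2) / (number of bigram tokens)
    return bigram_counts[w1, w2] / toy_bigram_N

def p_joint_laplace(w1, w2):
    # All V**2 possible bigrams gain 1, so the total grows by V**2.
    return (bigram_counts[w1, w2] + 1) / (toy_bigram_N + V ** 2)

print(p_cond_mle('the', 'jury'), p_cond_laplace('the', 'jury'))
print(p_cond_mle('met', 'the'), p_cond_laplace('met', 'the'))
print(p_joint_mle('the', 'jury'), p_joint_laplace('the', 'jury'))
```

Smoothing takes probability away from the seen bigram ('the', 'jury') and gives a nonzero probability to the unseen bigram ('met', 'the'). With a 49,815-word vocabulary the same effect is far more drastic.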

Turn in these values and the Python code you used to compute them.

## Helpful Code

In [12]:
import nltk
from nltk.corpus import brown
from collections import defaultdict, Counter

wds = brown.words()
N = len(wds)
print(N)

1161192


We make a unigram frequency distribution and the sequence of bigrams.

In [13]:
mle_unigram_dist = nltk.FreqDist([w.lower() for w in wds])
bigram_seq = list(nltk.bigrams(wds))
bigram_N = len(bigram_seq)
print(bigram_N)              

1161191


`bigram_N` = `N - 1`.  Here's why.

In [14]:
wds[:10]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']

In [15]:
bigram_seq[:10]

[('The', 'Fulton'),
 ('Fulton', 'County'),
 ('County', 'Grand'),
 ('Grand', 'Jury'),
 ('Jury', 'said'),
 ('said', 'Friday'),
 ('Friday', 'an'),
 ('an', 'investigation'),
 ('investigation', 'of'),
 ('of', "Atlanta's")]

The first bigram starts with the first word, the second with the second word, and so on.  But there is no bigram
that starts with the last word.

We make a frequency distribution for bigrams.

In [16]:
# MLE stands for Maximum Likelihood Estimate
mle_bigram_dist = nltk.FreqDist((x.lower(),y.lower()) for (x,y) in bigram_seq)

In [17]:
print(mle_unigram_dist)
print(mle_unigram_dist['the'])
print(mle_bigram_dist)
print(mle_bigram_dist['the','only'])

<FreqDist with 49815 samples and 1161192 outcomes>
69971
<FreqDist with 436003 samples and 1161191 outcomes>
258


The information printed about `mle_unigram_dist`: The vocabulary has 49,815 word types.  The Brown corpus has 1,161,192 word tokens.

The information printed about `mle_bigram_dist`: The "vocabulary" (of bigrams) has 436,003 bigram types.  The Brown corpus has 1,161,191 bigram tokens.

Notice how many more bigram **types** there are than unigram types (436,003 vs. 49,815).  Make sure you understand **why** that is.  Every time a word is followed by some word it has never been followed by before, that's a new bigram type.  So we see above that the bigram 'the only' has occurred 258 times in Brown (that's quite high for a bigram).  But 'the' also occurs in all of the following bigram types, each with a different count.

In [18]:
print(mle_bigram_dist['the','time'])
print(mle_bigram_dist['the','boy'])
print(mle_bigram_dist['the','red'])

251
81
44
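
The way a single first word fans out into many bigram types can be seen on a small scale. This is a sketch on a made-up word sequence; `toy_words`, `toy_bigram_counts` and `followers` are illustrative names, not part of the assignment code.

```python
from collections import Counter

# Made-up sequence in which 'the' is followed by three different words.
toy_words = ['the', 'time', 'the', 'boy', 'the', 'red', 'the', 'time']
toy_bigram_counts = Counter(zip(toy_words, toy_words[1:]))

# Collect every bigram type whose first word is 'the', with its count.
followers = {w2: c for (w1, w2), c in toy_bigram_counts.items() if w1 == 'the'}

print(followers)
print(len(set(toy_words)), len(toy_bigram_counts))
```

Even here, 4 word types produce 6 bigram types, since each new (first word, second word) combination counts as its own type.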


Since there are 49,815 word types in the vocabulary, there are

In [21]:
print(49815**2)
print(f'{49815**2:,}')


2481534225
2,481,534,225


($49815^2$) **possible bigram types** for this vocabulary, but in the 1.2 million words of Brown, we see
only 436,003 actual bigram types. That's

In [22]:
print(436003/(49815**2))
print(f'{436003/(49815**2):.3%}')

0.00017569896703721667
0.018%


0.018% of the possible bigrams, a tiny fraction.
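
This sparsity is why smoothing matters: every possible bigram that never occurs in Brown gets probability 0 under the MLE. The short calculation below uses only the figures printed earlier; the variable names are illustrative, and the smoothed total assumes that 1 is added to every possible bigram type.

```python
V = 49815                # unigram types in Brown (lowercased)
seen_types = 436003      # bigram types actually observed
bigram_N = 1161191       # bigram tokens in Brown

possible_types = V ** 2
unseen_types = possible_types - seen_types

# After add-one smoothing, each possible bigram type contributes one extra count.
smoothed_total = bigram_N + possible_types

print(f'{unseen_types:,}')
print(f'{smoothed_total:,}')
```

The added counts outnumber the observed bigram tokens by a factor of more than two thousand, so expect the smoothed probabilities of frequent bigrams to drop sharply.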