## Smoothing assignment

In the Pollard assignment you computed a unigram frequency distribution for the Brown corpus. You will need that for this assignment.

This time you will do a bigram distribution:

In [None]:
import nltk
from nltk.corpus import brown
from nltk import bigrams
brown_bigrams = list(bigrams(brown.words()))

It is instructive to compare `brown.words()`, which we used in the last assignment, with `brown_bigrams`:

In [None]:
brown.words()[:10]
#['The', 'Fulton', 'County', 'Grand', 'Jury', 'said',
# 'Friday', 'an', 'investigation', 'of']

In [None]:
brown_bigrams[:10]
#[('The', 'Fulton'), ('Fulton', 'County'), ('County', 'Grand'),
#('Grand', 'Jury'), ('Jury', 'said'), ('said', 'Friday'), ('Friday', 'an'),
#('an', 'investigation'), ('investigation', 'of'), ('of', "Atlanta's")]

So `brown.words()` returns a list of the words, while `bigrams(brown.words())` yields the word pairs, which we collected above into the list `brown_bigrams`. Notice that the second word of the first pair becomes the first word of the second pair, the second word of the second pair becomes the first word of the third, and so on. Since every word in Brown except the last becomes the first word of a bigram, there is exactly one more word token than there are bigram tokens:

In [None]:
len(brown_bigrams)
#1161191

In [None]:
len(brown.words())
#1161192
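
The off-by-one relationship can be checked on a tiny example without downloading Brown. The sketch below uses a made-up token list (an assumption for illustration, not corpus data) and plain `zip` in place of `nltk.bigrams`:

```python
# Toy token list standing in for brown.words() (illustrative only).
words = "the jury said the jury praised the city".split()

# Pair each token with its successor; the last token has no successor.
bigram_seq = list(zip(words, words[1:]))

print(len(words))       # number of word tokens
print(len(bigram_seq))  # number of bigram tokens, one fewer
print(bigram_seq[:3])
```

Each pair's second word is the next pair's first word, which is why only the final token fails to start a bigram.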

Create a new frequency distribution of the Brown bigrams. Plot the cumulative frequency distribution of the top 50 bigrams.
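
As a rough illustration of what "cumulative" means here, the following sketch computes the running totals that such a plot would display. It uses a made-up token list and `collections.Counter` rather than Brown and `nltk.FreqDist`, and `top_n` is a name chosen for this example:

```python
from collections import Counter
from itertools import accumulate

# Toy token list standing in for brown.words() (illustrative only).
words = "the jury said the jury praised the city and the city thanked the jury".split()
bigram_counts = Counter(zip(words, words[1:]))

top_n = 3  # the assignment asks for 50; 3 suits this tiny corpus
top = bigram_counts.most_common(top_n)

# Running total of the counts, from most to least frequent bigram.
cumulative = list(accumulate(count for _, count in top))

for (bigram, count), running in zip(top, cumulative):
    print(f"{' '.join(bigram):<12s} {count:>3d} {running:>4d}")
```

With an NLTK `FreqDist`, calling `plot(50, cumulative=True)` should draw the same kind of running total directly.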

Then do add-one smoothing on the bigrams. This will require adding one to all the bigram counts, including those that previously had count 0. You will also need to change the unigram counts appropriately. You will compute all possible bigrams using the known vocabulary, so use the keys of the unigram Brown distribution you created before to compute the set of possible bigrams. The vocabulary size from that exercise should be 49815. Then, having added 1 to all the bigram counts, you must compute at least the following probabilities:


1. P(the | in) before and after smoothing ($P_{\text{mle}}$ and $P_{\text{laplace}}$);

2. P(in the) before and after smoothing;

3. P(said the) before and after smoothing;

4. P(the | said) before and after smoothing.

In some cases you will need to use the unigram counts to compute these probabilities. Remember that the unigram counts must change too when smoothing.

Turn in these values and the Python code you used to compute them.

## Helpful Code

In [None]:
import nltk
from nltk.corpus import brown
from collections import defaultdict, Counter

wds = brown.words()
N = len(wds)
print(N)

We make a unigram frequency distribution and the sequence of bigrams.

In [7]:
mle_unigram_dist = nltk.FreqDist([w.lower() for w in wds])
bigram_seq = list(nltk.bigrams(wds))
bigram_N = len(bigram_seq)
print(bigram_N)              

1161192
1161191


`bigram_N` = `N - 1`.  Here's why.

In [3]:
wds[:10]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']

In [2]:
bigram_seq[:10]

[('The', 'Fulton'),
 ('Fulton', 'County'),
 ('County', 'Grand'),
 ('Grand', 'Jury'),
 ('Jury', 'said'),
 ('said', 'Friday'),
 ('Friday', 'an'),
 ('an', 'investigation'),
 ('investigation', 'of'),
 ('of', "Atlanta's")]

The first bigram starts with the first word, the second with the second word, and so on.  But there is no bigram
that starts with the last word.

We make a frequency distribution for bigrams.

In [14]:
# MLE stands for Maximum Likelihood Estimate
mle_bigram_dist = nltk.FreqDist((x.lower(),y.lower()) for (x,y) in bigram_seq)

In [8]:
print(mle_unigram_dist)
print(mle_unigram_dist['the'])
print(mle_bigram_dist)
print(mle_bigram_dist['the','only'])

<FreqDist with 49815 samples and 1161192 outcomes>
69971
<FreqDist with 436003 samples and 1161191 outcomes>
258


The information printed about `mle_unigram_dist`: The vocabulary has 49,815 word types.  The Brown corpus has 1,161,192 word tokens.

The information printed about `mle_bigram_dist`: The "vocabulary" (of bigrams) has 436,003 bigram types.  The Brown corpus has 1,161,191 bigram tokens.

Notice how many more bigram **types** there are than unigram types (436,003 vs. 49,815).  Make sure you understand **why** that is.  Every time a word is followed by some word it's never been followed by before, that's a new bigram type.  So we see above that the bigram 'the only' has occurred 258 times in Brown (that's quite high for a bigram).  But 'the' also occurs in all the following bigram types, each with a different count.

In [10]:
print(mle_bigram_dist['the','time'])
print(mle_bigram_dist['the','boy'])
print(mle_bigram_dist['the','red'])

251
81
44


Since there are 49,815 word types in the vocabulary, there are

In [16]:
print(49815**2)
print(f'{49815**2:,}')


2481534225
2,481,534,225


($49815^2$) **possible bigram types** for this vocabulary, but in the 1.2 million words of Brown, we see
only 436,003 actual bigram types.  That's

In [13]:
print(436003/(49815**2))
print(f'{436003/(49815**2):.3%}')

0.00017569896703721667
0.018%


0.018% of the possible bigrams, a very tiny fraction.
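
The same type-versus-possible-type comparison can be tried on a small scale. This is a minimal sketch on a made-up token list (not Brown), using `collections.Counter` in place of `nltk.FreqDist`; the variable names `seen_types` and `possible_types` are chosen for this example:

```python
from collections import Counter

# Toy token list standing in for brown.words() (illustrative only).
words = "the jury said the jury praised the city and the city thanked the jury".split()

unigram_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))

V = len(unigram_counts)         # number of word types
seen_types = len(bigram_counts) # number of distinct bigrams actually observed
possible_types = V ** 2         # every ordered pair of vocabulary words

print(f"{V} word types, {seen_types} bigram types seen, {possible_types} possible")
print(f"fraction seen: {seen_types / possible_types:.1%}")
```

Even in a corpus this small, most ordered word pairs never occur; with a realistic vocabulary the gap becomes far more extreme, as the Brown numbers show.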

This is the solution. 

### Maximum Likelihood Probs

Our events are bigrams.

In general, we are solving the problem of predicting the **next** word knowing
the **previous** words.  When we use a bigram model to do that, we are predicting the **next** word knowing
only the **previous** word.  

The notation 

```
P(the | in)
```

means 

the probability that "the" is the **second** word in a bigram given that "in" is the **first**, so the bigram event we are looking for is "in the".

This is a **conditional probability**.  It is different from a joint probability.

```
P(A | B)  = P(A,B)/P(B)
```

The joint probability, P(A,B), is the numerator on the right.

What does an MLE (Maximum likelihood estimate) look like?

$
P_{mle}(A \mid B) = \frac{P(A,B)}{P(B)} = \frac{\mbox{count}(A,B)/N}{\mbox{count}(B)/N} = \frac{\mbox{count}(A,B)}{\mbox{count}(B)}
$

Notice the following **difference**:

$
P_{mle}(A \mid B) = \frac{\mbox{count}(A,\,B)}{\mbox{count}(B)}
$

$
P_{mle}(A,\, B) = \frac{\mbox{count}(A,\,B)}{N}
$

In terms of bigrams:

$
P_{mle}(the \mid in) = \frac{\mbox{count}(in\; the)}{\mbox{count}(in)}
$

$
P_{mle}(in\; the) = \frac{\mbox{count}(in\; the)}{N}
$

That is, in the conditional probability, we are restricting our attention to a sample space
in which the first word of a bigram event is "in", and the maximum likelihood probability
is the proportion of those events that are "in the" events.

In the joint probability, by contrast, we are looking at the entire sample space
of bigram events, and the maximum likelihood probability
is the proportion of all bigram events that are "in the" events.  Here N is the
size of the corpus, the total number of bigram events.
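
To make the difference between the two estimates concrete, here is a minimal sketch on a made-up token list (not Brown). The function names `p_mle_conditional` and `p_mle_joint` are hypothetical helpers introduced for this illustration:

```python
from collections import Counter

# Toy token list standing in for brown.words() (illustrative only).
words = "in the house in the garden in a house the cat sat in the sun".split()

unigram_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))
bigram_N = len(words) - 1  # total number of bigram tokens


def p_mle_conditional(second, first):
    """P_mle(second | first) = count(first second) / count(first)."""
    return bigram_counts[(first, second)] / unigram_counts[first]


def p_mle_joint(first, second):
    """P_mle(first second) = count(first second) / N."""
    return bigram_counts[(first, second)] / bigram_N


print(p_mle_conditional("the", "in"))  # share of "in ..." bigrams that are "in the"
print(p_mle_joint("in", "the"))        # share of all bigrams that are "in the"
```

The numerator is the same in both cases; only the denominator changes, so the conditional estimate is never smaller than the joint one.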


### Laplace Probs

We add 1 to every possible bigram event (possible given the vocab we know).  We recompute 
probs using the same logic as before, but taking into account the new counts.

Count of every bigram: goes up by 1

Count of every unigram as a first word in a possible bigram:  goes up by V, the size of the
vocabulary, since it is the first word in V possible bigrams, and the count of each of those
has gone up by 1.

N: the total number of bigram events has become $N + V^{2}$, because we added 1 to every possible 
bigram and there are $V^{2}$ possible bigrams.


For example, consider $P_{laplace}(the \mid in)$

In terms of bigrams:

$
P_{laplace}(the \mid in) = \frac{\mbox{count}(in\; the)\, + \,1}{\mbox{count}(in)\, +\,V}
$

$
P_{laplace}(in\; the) = \frac{\mbox{count}(in\; the)\,+\,1}{N\, +\, V^{2}}
$
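
One way to convince yourself that these denominators are right is to check that the smoothed probabilities still sum to 1. The sketch below does this on a made-up token list (not Brown); the helper names are hypothetical, and the first word tested is deliberately not the corpus-final token, so its unigram count equals its count as a bigram first word:

```python
from collections import Counter
from itertools import product

# Toy token list standing in for brown.words() (illustrative only).
words = "in the house in the garden in a house the cat sat in the sun".split()

unigram_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))
vocab = sorted(unigram_counts)
V = len(vocab)
bigram_N = len(words) - 1


def p_laplace_conditional(second, first):
    """(count(first second) + 1) / (count(first) + V)."""
    return (bigram_counts[(first, second)] + 1) / (unigram_counts[first] + V)


def p_laplace_joint(first, second):
    """(count(first second) + 1) / (N + V**2)."""
    return (bigram_counts[(first, second)] + 1) / (bigram_N + V ** 2)


# Sum over every possible second word after "in".
cond_total = sum(p_laplace_conditional(w, "in") for w in vocab)
# Sum over every possible bigram in the vocabulary.
joint_total = sum(p_laplace_joint(a, b) for a, b in product(vocab, repeat=2))

print(cond_total, joint_total)
print(p_laplace_conditional("sun", "in"))  # an unseen bigram now gets nonzero mass
```

If the denominator of the conditional estimate were left at count(in), the first sum would exceed 1, which is why the unigram count has to grow by V.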


In [None]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
wds = brown.words()
N = len(wds)
print(N)
mle_unigram_dist = nltk.FreqDist([w.lower() for w in wds])

bigram_seq = list(nltk.bigrams(wds))
# MLE stands for Maximum Likelihood Estimate
mle_bigram_dist = nltk.FreqDist((x.lower(),y.lower()) for (x,y) in bigram_seq)
bigram_N = len(bigram_seq)

ct_in_the = mle_bigram_dist[('in','the')]
#6025

p_the_given_in = float(mle_bigram_dist[('in','the')])/mle_unigram_dist['in']
#0.2823733420818297


p_in_the = mle_bigram_dist[('in','the')]/float(bigram_N)
#0.0051886382

ct_said_the = mle_bigram_dist[('said','the')]
#74

p_the_given_said = float(mle_bigram_dist[('said','the')])/float(mle_unigram_dist['said'])
#0.03773584905660377

p_said_the = float(mle_bigram_dist[('said','the')])/float(bigram_N)
#0.0000637277

p_sewer_brother = 0.0

mle_probs = [p_the_given_in,p_in_the,p_the_given_said,p_said_the,p_sewer_brother]

In [18]:
from collections import defaultdict

# vocab_size
V = len(mle_unigram_dist)
bigram_size = V**2
sm_bigram_N = bigram_N + bigram_size

#Fix bigram cts
# A counter dictionary that returns 1 for unseen keys
laplace_bigram_dist  = defaultdict(lambda: 1)
for big in mle_bigram_dist:
    laplace_bigram_dist[big] = mle_bigram_dist[big] + 1
    
# A first word of count 0 would have smoothed count V (one for each bigram it could start)
laplace_uni = defaultdict(lambda: V)
#Fix nonzero unigram cts to reflect what we just did to bigram counts
for x in mle_unigram_dist:
    laplace_uni[x] = mle_unigram_dist[x] +  V


# An unseen bigram
print ('Laplace count of "sewer brother": {0}'.format(laplace_bigram_dist[('sewer','brother')]))
print()

# Q: We added 1 event for every possible bigram.
#    How much did the total size of the corpus go up by?


sm_ct_in_the = laplace_bigram_dist[('in','the')]
#6026

sm_p_the_given_in = float(laplace_bigram_dist[('in','the')])/laplace_uni['in']
# 0.08469192714189341

sm_p_in_the = laplace_bigram_dist[('in','the')]/float(sm_bigram_N)
# 0.0000024

sm_ct_said_the = laplace_bigram_dist[('said','the')]
# 75

sm_p_the_given_said = float(laplace_bigram_dist[('said','the')])/laplace_uni['said']
#0.0014485

sm_p_said_the = float(laplace_bigram_dist[('said','the')])/float(sm_bigram_N)
#3.020910237987889e-08

sm_p_sewer_brother = float(laplace_bigram_dist[('sewer','brother')])/float(sm_bigram_N)

laplace_probs = [sm_p_the_given_in,sm_p_in_the,sm_p_the_given_said,sm_p_said_the,sm_p_sewer_brother]


print('{0:<16s} {1:^9s}  {2:^9s}'.format('','MLE','Smooth'))
for (i,p) in enumerate(['the | in','in the','the | said','said the','sewer brother']):
    print('{0:<16s} {1:6.7f}  {2:6.7f}'.format(p,mle_probs[i],laplace_probs[i]))

print()
print('{0:<16s} {1:^14s}  {2:^9s}'.format('','MLE','Smooth'))
# Joint probabilities only, with more digits; they sit at indices 1, 3 and 4 of the lists
for (i,p) in [(1,'in the'),(3,'said the'),(4,'sewer brother')]:
    print('{0:<16s} {1:6.10f}  {2:6.10f}'.format(p,mle_probs[i],laplace_probs[i]))



Laplace count of "sewer brother": 1

                    MLE      Smooth  
the | in         0.2823733  0.0846919
in the           0.0051886  0.0000024
the | said       0.0377358  0.0014485
said the         0.0000637  0.0000000
sewer brother    0.0000000  0.0000000

                      MLE         Smooth  
in the           0.0051886382  0.0000024272
said the         0.0000637277  0.0000000302
sewer brother    0.0000000000  0.0000000004


# Normalized bigram cts

This is just clarifying some stuff in the text.

We normalize (rescale) the counts in the smoothed model so that the number of events adds back up to N.

Computing the smoothed probs then uses exactly the same formula as the mle probs, but using the normalized smoothed counts.

In [28]:

# Rescale the Laplace counts so that the total number of bigram events
# adds back up to bigram_N.  Both kinds of count use the same factor,
# so ratios of rescaled counts equal the Laplace probabilities.
bigram_norm_factor = float(bigram_N)/(bigram_N + V**2)

# Rescaled first-word counts: (count + V) * factor
new_cts = defaultdict(lambda: V * bigram_norm_factor)

for w in mle_unigram_dist:
    new_cts[w] = (mle_unigram_dist[w] + V) * bigram_norm_factor

# Rescaled bigram counts: (count + 1) * factor; an unseen bigram has count 0
new_bigram_cts = defaultdict(lambda:bigram_norm_factor)

for big in mle_bigram_dist:
    new_bigram_cts[big] = (mle_bigram_dist[big] + 1) * bigram_norm_factor

new_laplace_cts = [float(new_bigram_cts[('in','the')]),
                     float(new_bigram_cts[('said','the')]),
                     float(new_bigram_cts[('sewer','brother')]),
                     ]
old_cts = [float(mle_bigram_dist[('in','the')]),
                     float(mle_bigram_dist[('said','the')]),
                     float(mle_bigram_dist[('sewer','brother')]),
                     ]


print()
print('{0:<16s} {1:^9s}  {2:^9s}'.format('','Raw cts','Smthd Cts'))
for (i,p) in enumerate(['in_the','said_the','sewer_brother']):
    print('{0:<16s} {1:<6}  {2:<6.7f}'.format(p,old_cts[i],new_laplace_cts[i]))
    
    
print()
print('{0:<16s} {1:^9s}  {2:^9s}'.format('','Raw cts','Smthd Cts'))
for (i,p) in enumerate(['in']):
    print('{0:<16s} {1:<6}  {2:<6.7f}'.format(p,mle_unigram_dist[p],new_cts[p]))

 
print()
print('P(the | in)', end = '  ')
# Same formula as the MLE conditional, applied to the rescaled counts
print('{0:.7f}'.format(new_bigram_cts['in','the']/new_cts['in']))


                  Raw cts   Smthd Cts
in_the           6025.0  2.8184436
said_the         74.0    0.0350785
sewer_brother    0.0     0.0004677

                  Raw cts   Smthd Cts
in               21337   33.2787750

P(the | in)  0.0846919
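
The rescaling idea can also be checked on a toy corpus. In this sketch (made-up tokens, hypothetical helper names), each add-one count is multiplied by N / (N + V²), assuming N is the number of bigram tokens; the rescaled bigram counts then sum back to N, and the ratio of rescaled counts reproduces the Laplace conditional probability:

```python
from collections import Counter
from itertools import product

# Toy token list standing in for brown.words() (illustrative only).
words = "in the house in the garden in a house the cat sat in the sun".split()

unigram_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))
vocab = sorted(unigram_counts)
V = len(vocab)
N = len(words) - 1  # number of bigram tokens

scale = N / (N + V ** 2)


def adjusted_bigram_count(first, second):
    """Add-one bigram count rescaled so that all bigram counts sum to N."""
    return (bigram_counts[(first, second)] + 1) * scale


def adjusted_first_word_count(first):
    """Add-V first-word count rescaled by the same factor."""
    return (unigram_counts[first] + V) * scale


total = sum(adjusted_bigram_count(a, b) for a, b in product(vocab, repeat=2))
ratio = adjusted_bigram_count("in", "the") / adjusted_first_word_count("in")

print(total)  # matches N up to floating-point rounding
print(ratio)  # matches (count(in the) + 1) / (count(in) + V)
```

Because the same factor multiplies numerator and denominator, it cancels in the ratio; rescaling changes how the counts look, not the probabilities they imply.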
