# Week 02: Smoothing

**1. Calculate the probability of the sentence `i want chinese food`. Give two probabilities,
one using Fig. 3.2 and another using the add-1 smoothed table in Fig. 3.7. Assume
the additional add-1 smoothed probabilities `P(i|<s>) = 0.19` and
`P(</s>|food) = 0.40`.**

**NOTE**: also consider the following probabilities for the first table:
`P(i|<s>) = 0.25` and `P(</s>|food) = 0.68`.

In [24]:
p_zero_probs = 0.25 * 0.33 * 0.0065 * 0.52 * 0.68
p_zero_probs

0.00018961800000000004

In [25]:
p_add_1_smoothed = 0.19 * 0.21 * 0.0029 * 0.052 * 0.4
p_add_1_smoothed

2.4067679999999995e-06
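
Multiplying many small probabilities can underflow for longer sentences, so the same computation is often done in log space. The following is a minimal sketch under that assumption; the helper name `sentence_log_prob` is hypothetical, and the probabilities are the ones used in the two cells above.

```python
import math

# Bigram probabilities copied from the two cells above, in the order
# i|<s>, want|i, chinese|want, food|chinese, </s>|food.
unsmoothed = [0.25, 0.33, 0.0065, 0.52, 0.68]
add_one = [0.19, 0.21, 0.0029, 0.052, 0.40]


def sentence_log_prob(bigram_probs):
    """Sum of natural-log bigram probabilities (hypothetical helper)."""
    return sum(math.log(p) for p in bigram_probs)


for name, probs in [("unsmoothed", unsmoothed), ("add-1", add_one)]:
    log_p = sentence_log_prob(probs)
    # exp(log_p) recovers the plain product, up to floating-point error
    print(f"{name}: log P = {log_p:.4f}, P = {math.exp(log_p):.3e}")
```

Exponentiating the log probability should agree with the direct products above up to floating-point rounding.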

**2. Which of the two probabilities you computed in the previous exercise is higher,
unsmoothed or smoothed? Explain why.**

The probability computed from the first table (unsmoothed, with zero probabilities) is higher than the
probability computed from the add-1 smoothed table.

This is because add-1 smoothing shaves off some probability mass from the observed (more frequent)
bigrams and gives it to the bigrams with zero probability. Every bigram in this sentence was observed
in the corpus, so each of its probabilities is lower in the second table, and their product, the
probability of the whole sentence, is lower as well.
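
To make the discounting concrete, the sketch below compares each bigram's smoothed probability with its unsmoothed one, using only the values from exercise 1. The variable names are illustrative assumptions.

```python
# Bigram probabilities from exercise 1 (Fig. 3.2 values vs. add-1 values).
bigrams = ["i|<s>", "want|i", "chinese|want", "food|chinese", "</s>|food"]
unsmoothed = [0.25, 0.33, 0.0065, 0.52, 0.68]
add_one = [0.19, 0.21, 0.0029, 0.052, 0.40]

# Ratio below 1 means smoothing took probability mass away from that bigram.
ratios = [s / u for s, u in zip(add_one, unsmoothed)]

for name, u, s, r in zip(bigrams, unsmoothed, add_one, ratios):
    print(f"P({name}): {u} -> {s} (ratio {r:.2f})")

all_discounted = all(r < 1 for r in ratios)
print("every bigram discounted:", all_discounted)
```

Since every factor shrinks, the product over the sentence has to shrink too.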


**3. We are given the following corpus, modified from the example Dr Seuss corpus:**
```
<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
```
**Using a bigram language model with add-one smoothing, what is `P(Sam|am)`? Include `<s>` and `</s>` in your counts just like any other token.**

In [30]:
import re
from collections import Counter, OrderedDict

import numpy as np

corpus = '''
<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
'''

# match the sentence markers <s> and </s> first, then ordinary word tokens
corpus = re.findall(r'</?s>|\w+', corpus)
vocabulary = Counter(corpus)
vocabulary

vocab2index = OrderedDict([ (key, i) for i, key in enumerate(vocabulary.keys())])
vocab2index

index2vocab = OrderedDict([ (i, key) for i, key in enumerate(vocabulary.keys())])
index2vocab

OrderedDict([(0, '<s>'),
             (1, 'I'),
             (2, 'am'),
             (3, 'Sam'),
             (4, '</s>'),
             (5, 'do'),
             (6, 'not'),
             (7, 'like'),
             (8, 'green'),
             (9, 'eggs'),
             (10, 'and')])

In [27]:
def find_consecutive_tuples(lst, n = 2):
    tuples = []
    for i in range(len(lst) - n + 1):
        tuple_n = tuple(lst[i:i+n])
        tuples.append(tuple_n)
    tuples = Counter(tuples)
    return tuples

tuples_n_counters = find_consecutive_tuples(corpus)
tuples_n_counters

Counter({('<s>', 'I'): 3,
         ('I', 'am'): 3,
         ('am', 'Sam'): 2,
         ('Sam', '</s>'): 3,
         ('</s>', '<s>'): 3,
         ('<s>', 'Sam'): 1,
         ('Sam', 'I'): 1,
         ('am', '</s>'): 1,
         ('I', 'do'): 1,
         ('do', 'not'): 1,
         ('not', 'like'): 1,
         ('like', 'green'): 1,
         ('green', 'eggs'): 1,
         ('eggs', 'and'): 1,
         ('and', 'Sam'): 1})

In [28]:
# only bigram

def unsmoothed_bigram_probs(vocabulary, vocab2index, tuples_n_counters):
    # rows index the first word, columns the second word
    bigram_matrix = np.zeros((len(vocabulary), len(vocabulary)))

    for tuple_n, count in tuples_n_counters.items():
        first_word, second_word = tuple_n

        idx_1 = vocab2index[first_word]
        idx_2 = vocab2index[second_word]

        # maximum-likelihood estimate: c(w1, w2) / c(w1)
        bigram_matrix[idx_1, idx_2] = count / vocabulary[first_word]

    return bigram_matrix

print(unsmoothed_bigram_probs(vocabulary, vocab2index, tuples_n_counters))

[[0.         0.75       0.         0.25       0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.         0.75       0.         0.         0.25
  0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.66666667 0.33333333 0.
  0.         0.         0.         0.         0.        ]
 [0.         0.25       0.         0.         0.75       0.
  0.         0.         0.         0.         0.        ]
 [0.75       0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  1.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         1.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         1.         0.         0.        ]
 [0.         0.         0.         0.         0.      

In [33]:
# only bigram

def add_1_smoothed_bigram_probs(vocabulary,  index2vocab, tuples_n_counters):
    bigram_matrix = np.zeros((len(vocabulary), len(vocabulary)))
    
    for i in range(len(vocabulary)):
        for j in range(len(vocabulary)):

            first_word = index2vocab[i]
            second_word = index2vocab[j]

            # Counter returns 0 for bigrams that never occurred
            count = tuples_n_counters[(first_word, second_word)]

            # add-one smoothing: (c(w1, w2) + 1) / (c(w1) + V)
            bigram_matrix[i,j] = (count + 1) / (vocabulary[first_word] + len(vocabulary))

    return bigram_matrix

bigram_matrix = add_1_smoothed_bigram_probs(vocabulary,  index2vocab, tuples_n_counters)
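
A quick sanity check on an add-one table is that each row (each conditioning word) sums to 1. The sketch below is a self-contained assumption-laden version: it rebuilds the counts from the corpus as one flat token stream, as this notebook does, and the helper `add_one_prob` is hypothetical. Because the stream is flat, the final `</s>` is counted as a unigram but has no successor, so its row falls slightly short of 1.

```python
from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and Sam </s>",
]

# Flat token stream, so ('</s>', '<s>') is counted across sentence boundaries.
tokens = " ".join(sentences).split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # number of word types, including <s> and </s>


def add_one_prob(prev, word):
    """Add-one estimate (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)


# Sum each row over the whole vocabulary.
row_sums = {prev: sum(add_one_prob(prev, w) for w in unigrams) for prev in unigrams}

for prev, total in row_sums.items():
    print(f"sum over w of P(w|{prev}) = {total:.4f}")
print("P(Sam|am) =", add_one_prob("am", "Sam"))
```

If unseen bigrams were given any count other than 0 before adding 1, the rows would no longer sum to 1, which makes this check useful for catching counting mistakes.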

In [34]:
bigram = ("am","Sam")
index1 = vocab2index[bigram[0]]
index2 = vocab2index[bigram[1]]

print(f"P({bigram[1]}|{bigram[0]})", bigram_matrix[index1,index2])

P(Sam|am) 0.21428571428571427


In [36]:
# From my hand calculation I got
3/14

0.21428571428571427
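
Written out, the hand calculation uses the counts shown in the notebook outputs above: the bigram (am, Sam) occurs 2 times, "am" occurs 3 times, and the vocabulary has V = 11 types (including `<s>` and `</s>`):

$$
P_{\text{add-1}}(\text{Sam} \mid \text{am})
= \frac{c(\text{am}, \text{Sam}) + 1}{c(\text{am}) + V}
= \frac{2 + 1}{3 + 11}
= \frac{3}{14} \approx 0.214
$$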

**4. Suppose we train a trigram language model with add-one smoothing on a given
corpus. The corpus contains V word types. Express a formula for estimating
`P(w3|w1, w2)`, where w3 is a word which follows the bigram (w1,w2), in terms of
various n-gram counts and V. Use the notation c(w1,w2,w3) to denote the number
of times that trigram (w1,w2,w3) occurs in the corpus, and so on for bigrams and
unigrams.**

NOTE: The answer was worked out by hand on an iPad.
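
For reference, the standard add-one estimate for a trigram follows the same pattern as the bigram case in exercise 3: add 1 to the trigram count and add V to the count of the conditioning bigram. In the question's notation, a sketch of the formula is:

$$
P_{\text{add-1}}(w_3 \mid w_1, w_2) = \frac{c(w_1, w_2, w_3) + 1}{c(w_1, w_2) + V}
$$

The V in the denominator accounts for the one extra count given to each of the V possible words that could follow (w1,w2), so the smoothed probabilities still sum to 1 over w3.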
