Lab: Language Modeling on a Corpus
-----

We use the words in a corpus to build a [generative model](https://en.wikipedia.org/wiki/Generative_model) of language.

We know that language is very complicated, but we can create a simplified model of language that captures part of the complexity. 

In the [bag of words model](https://en.wikipedia.org/wiki/Bag-of-words_model), we ignore the order of words and count only how often each one occurs.

Think of it this way: take all the words from the text, and throw them into a bag.  Shake the bag, and then generating a sentence consists of pulling words out of the bag one at a time.  

Chances are it won't be grammatical or sensible, but it will have words in roughly the right proportions.  
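The bag metaphor can be sketched on a toy text before touching the real corpus. This is a minimal illustration, not the lab solution: `toy_text` and `sample_from_bag` are hypothetical names, and the sketch assumes each word goes back into the bag after it is drawn (sampling with replacement).

```python
import random
import re
from collections import Counter

# Toy text standing in for the corpus (an assumption, for illustration only).
toy_text = "the cat sat on the mat and the dog sat on the log"

# Throw every word token into the bag; word order stops mattering once we count.
toy_bag = re.findall(r"[a-z]+", toy_text.lower())
toy_counts = Counter(toy_bag)

def sample_from_bag(bag, n=10, seed=None):
    """Draw n words at random, with replacement, and join them into a 'sentence'."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return " ".join(rng.choice(bag) for _ in range(n))

print(toy_counts.most_common(3))
print(sample_from_bag(toy_bag, n=6, seed=0))
```

Because frequent words occupy more slots in the list, drawing uniformly from the list already draws words in proportion to their counts.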

In [8]:
from quilt.data.spiering.shakespeare import shakespeare

In [9]:
with open(shakespeare._data()) as f:
    text = f.read()

In [10]:
import re 

def words(text):
    "List all the word tokens (consecutive letters) in a text. Normalize to lowercase."
    return re.findall('[a-z]+', text.lower()) 

In [11]:
text = words(text)
text[:5]

['the', 'sonnets', 'by', 'william', 'shakespeare']

TODO: Sample a random n-word sentence from the model described by the bag of words.

In [12]:
def sample(bag, n=10):
    "Sample a random n-word sentence from the model described by the bag of words."
    pass

In [13]:
# XXX: There can't be a deterministic unit test since this is a random function (unless we set the seed). The output should look something like...
sample(bag) #=> 'any monarchy i must people come and but error him'

NameError: name 'bag' is not defined

What type of n-gram model is this? Why?

1. unigram
2. bigram
3. trigram
4. none of the above

-----

Let's look at the most common words
-----

In [14]:
from collections import Counter
from pprint import pprint

In [15]:
counts = Counter(text)
pprint(counts.most_common(10))

[('the', 27595),
 ('and', 26735),
 ('i', 22538),
 ('to', 19771),
 ('of', 18132),
 ('a', 14725),
 ('you', 13826),
 ('my', 12490),
 ('that', 11535),
 ('in', 11112)]


In [16]:
# Print the 10 least common words
pprint(counts.most_common(len(counts))[-10:]) 

[('extincture', 1),
 ('daffed', 1),
 ('plenitude', 1),
 ('cautels', 1),
 ('hurting', 1),
 ('preached', 1),
 ('unexperient', 1),
 ('hovered', 1),
 ('lovered', 1),
 ('glowed', 1)]


In [17]:
print(f'{"word":20}  {"count"}')
print('-'*30)
for word in words('there are common and neverseen words'):
    print(f'{word:20}  {counts[word]:,}')

word                  count
------------------------------
there                 2,210
are                   3,880
common                154
and                   26,735
neverseen             0
words                 421


In [18]:
# TODO: Calculate the probability of each word
print(f'{"word":20}  {"probability"}')
print('-'*30)
for word in words('there are common and neverseen words'):
    print(f'{word:20}  {None:.2}')

word                  probability
------------------------------


TypeError: unsupported format string passed to NoneType.__format__

In [20]:
# TODO: Turn that into a function
def word_prob(counts: dict, word: str) -> float:
    "Calculate the probability of a word based on evidence from a Counter."
    pass

In [21]:
assert round(word_prob(counts, "the"), 4)  == 0.0298
assert round(word_prob(counts, "king"), 4) == 0.0033

TypeError: type NoneType doesn't define __round__ method
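One common way to estimate a word's probability is its relative frequency: its count divided by the total number of tokens. The sketch below shows the idea on a toy counter; `relative_frequency` and `toy_counts` are hypothetical names, and this is one possible approach rather than the expected lab answer.

```python
from collections import Counter

# Toy counts (an assumption, for illustration only): 6 tokens, "the" appears twice.
toy_counts = Counter("the cat sat on the mat".split())

def relative_frequency(counts, word):
    """Estimate P(word) as count(word) / total number of tokens."""
    total = sum(counts.values())
    # .get() also works for a plain dict, where a missing key would raise KeyError.
    return counts.get(word, 0) / total

print(relative_frequency(toy_counts, "the"))  # seen word
print(relative_frequency(toy_counts, "dog"))  # unseen word
```

Note that an unseen word gets an estimate of exactly zero under this scheme.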

Now, what is the probability of a *sequence* of words?  Expand the joint probability with the chain rule:

$P(w_1 \ldots w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1 w_2) \times \ldots \times P(w_n \mid w_1 \ldots w_{n-1})$

The *bag of words* model assumes that each word is drawn from the bag *independently* of the others.  This gives us the following (incorrect) approximation:
    
$P(w_1 \ldots w_n) = P(w_1) \times P(w_2) \times P(w_3) \times \ldots \times P(w_n)$

The approximation is wrong, but it is good enough to move forward.
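One way to see what the independence assumption gives up: under it, any reordering of the same words receives the same probability. The sketch below checks this on a toy token list; `toy_tokens` and `unigram_phrase_prob` are hypothetical names introduced only for this illustration.

```python
from collections import Counter
from math import prod

# Toy token list (an assumption, for illustration only): 10 tokens.
toy_tokens = "man bites dog and dog bites man and man sleeps".split()
toy_counts = Counter(toy_tokens)
total = sum(toy_counts.values())

def unigram_phrase_prob(phrase):
    """Product of per-word relative frequencies (the independence assumption)."""
    return prod(toy_counts[w] / total for w in phrase.split())

p_forward = unigram_phrase_prob("dog bites man")
p_shuffled = unigram_phrase_prob("man dog bites")

# Multiplication is commutative, so word order cannot change the result
# (beyond floating-point rounding).
print(p_forward, p_shuffled)
```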

In [22]:
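# numpy.product was deprecated and then removed in NumPy 2.0; numpy.prod is the supported name.
from numpy import prod as product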

In [23]:
def prob_words_in_phrase(phrase):
    "Probability of a phrase, assuming each word is independent of the others."
    return product([word_prob(counts, word) for word in words(phrase)])

In [24]:
phrases = ['the',
           'the the',
           'the the the', 
           'the sonnets by',
           'this is a neverbeforeseen word']

print(f'{"phrase":30}  {"probability"}')
print('-'*50)
for phrase in phrases:
    print(f'{phrase:30}  {prob_words_in_phrase(phrase):.6}')

phrase                          probability
--------------------------------------------------


TypeError: unsupported format string passed to NoneType.__format__

TODO: Why is the nonsense phrase "`the the the`" more likely than "`the sonnets by`", a phrase that actually appears in the corpus?

What would we have to add to our model to reduce the likelihood of nonsense phrases?
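One possible direction, offered as a hint rather than the expected answer, is to condition each word on the word before it (a bigram model). The sketch below uses a toy token list; `toy_tokens` and `bigram_prob` are hypothetical names, and the estimate is the plain count ratio with no smoothing.

```python
from collections import Counter

# Toy token list (an assumption, for illustration only).
toy_tokens = "the sonnets by the bard and the plays by the bard".split()

unigram_counts = Counter(toy_tokens)
# Pair each token with its successor to count adjacent word pairs.
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))

def bigram_prob(prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0  # the context word was never seen
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "bard"))  # "the bard" occurs in the toy text
print(bigram_prob("the", "the"))   # "the the" never occurs
```

In this toy text, "the" is never followed by "the", so the repeated-word phrase is no longer favored just because "the" is frequent.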

This language model predicts a probability of `0` for any phrase that contains a new word.

In your own words, why is this not a useful model?

TODO: Add Laplace smoothing to counts to allow for modeling novel words

In [None]:
def word_prob_smoothed(counts: dict, word: str) -> float:
    """Calculate the probability of a word based on evidence from a Counter,
    using Laplace (add-one) smoothing.
    """
    pass

In [None]:
assert round(word_prob_smoothed(counts, "the"), 4)  == 0.0291
assert round(word_prob_smoothed(counts, "king"), 4) == 0.0032
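A common formulation of Laplace (add-one) smoothing adds 1 to every count and adds the vocabulary size to the denominator. The sketch below applies that formulation to a toy counter; `add_one_prob` and `toy_counts` are hypothetical names, and the exact denominator is an assumption (some variants reserve one extra vocabulary slot for unknown words), so it may not match the values expected by the asserts above.

```python
from collections import Counter

# Toy counts (an assumption, for illustration only): 6 tokens, 5 distinct words.
toy_counts = Counter("the cat sat on the mat".split())

def add_one_prob(counts, word):
    """Add-one estimate: (count(word) + 1) / (total tokens + vocabulary size)."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return (counts.get(word, 0) + 1) / (total + vocab_size)

print(add_one_prob(toy_counts, "the"))  # seen word: slightly lower than 2/6
print(add_one_prob(toy_counts, "dog"))  # unseen word: small but nonzero
```

Seen words give up a little probability mass so that unseen words no longer force a whole phrase to probability zero.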

Recalculate the probability for selected words

In [None]:
print(f'{"word":20}  {"probability"}')
print('-'*30)
for word in words('there are common and neverseen words'):
    print(f'{word:20}  {word_prob_smoothed(counts, word):.2}')

Recalculate the probability for each phrase

In [None]:
def prob_words_smoothed_in_phrase(phrase):
    "Probability of a phrase from smoothed word probabilities, assuming each word is independent of the others."
    return product([word_prob_smoothed(counts, word) for word in words(phrase)])

In [None]:
print(f'{"phrase":30}  {"probability"}')
print('-'*50)
for phrase in phrases:
    print(f'{phrase:30}  {prob_words_smoothed_in_phrase(phrase):.6}')

In your own words, summarize how Laplace smoothing improves language modeling.

<br>
<br> 
<br>

----