<h1 align="center">Language Models</h1>

Before we can do things with words, we need some words. Here we'll use the NLTK library to download its Gutenberg corpus, a commonly used collection of text files.

In [4]:
import nltk
nltk.download('gutenberg')

from nltk.corpus import gutenberg
gutenberg.fileids()

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [5]:
import re
words = gutenberg.words('austen-emma.txt')

# filter out numbers, etc.
words = [w.lower() for w in words if re.match('^[a-zA-Z]+$', w)]
words[:10]

['emma',
 'by',
 'jane',
 'austen',
 'volume',
 'i',
 'chapter',
 'i',
 'emma',
 'woodhouse']

A bag of words can also serve as a generative model of text. We know that language is very complicated, but we can create a simplified model of language that captures part of the complexity. In the bag of words model, we ignore the order of words but keep their frequencies. Think of it this way: take all the words from the text and throw them into a bag. Shake the bag; generating a sentence then consists of pulling words out of the bag one at a time. Chances are the result won't be grammatical or sensible, but it will have words in roughly the right proportions.

Here's a function to sample an n-word sentence from a bag of words:

In [6]:
import random

def sample(w, n=10):
    """Sample n words from a list of words, w."""
    return [random.choice(w) for _ in range(n)]

sample(words)

['upon', 'hour', 'think', 'she', 'and', 'he', 'of', 'your', 'of', 'talk']

## From Bag of Words to Probabilities

From the bag of words, we can construct a frequency table:

In [7]:
from collections import Counter

word_counts = Counter(words)
word_counts.most_common(10)

[('to', 5239),
 ('the', 5201),
 ('and', 4896),
 ('of', 4291),
 ('i', 3178),
 ('a', 3129),
 ('it', 2528),
 ('her', 2469),
 ('was', 2398),
 ('she', 2340)]

In [8]:
len(word_counts)

7079

And from the frequency table, a probability distribution:

In [18]:
import pandas as pd

def pdist(counter):
    """Make a probability distribution from a Counter."""
    n_words = sum(counter.values())
    dist = [(word, count/n_words) for word, count in counter.items()]
    return pd.DataFrame(data=dist, columns=['words', 'probs'])

word_probs = pdist(word_counts)
word_probs.sort_values('probs', ascending=False).head(5)

|      | words | probs    |
|------|-------|----------|
| 4527 | to    | 0.03242  |
| 3157 | the   | 0.032184 |
| 1603 | and   | 0.030297 |
| 3939 | of    | 0.026553 |
| 1266 | i     | 0.019666 |


In [11]:
word_probs['probs'].sum()

1.0000000000000049
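
The frequency table can also be used generatively: drawing words in proportion to their counts is equivalent to pulling them out of the bag with replacement. The sketch below is a minimal illustration on a toy word list; the helper name `sample_from_counts` and its `seed` parameter are assumptions introduced here, not part of the notebook.

```python
import random
from collections import Counter

def sample_from_counts(counter, n=10, seed=None):
    """Draw n words with probability proportional to their counts (hypothetical helper)."""
    rng = random.Random(seed)          # local generator so the seed does not affect global state
    vocab = list(counter.keys())
    weights = list(counter.values())   # relative weights; random.choices normalizes them
    return rng.choices(vocab, weights=weights, k=n)

# Toy data for illustration only
toy_counts = Counter('the cat sat on the mat the end'.split())
print(sample_from_counts(toy_counts, n=5, seed=0))
```

Because `random.choices` accepts unnormalized weights, the raw counts can be passed directly without first converting them to probabilities.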

Let's consolidate what we've learned so far into a function.

In [19]:
def words_to_pdist(words):
    """Make a probability distribution from a list of words."""
    
    wc = Counter(words)
    n_words = sum(wc.values())
    dist = [(word, count, count/n_words) for word, count in wc.items()]
    
    return pd.DataFrame(data=dist, columns=['words', 'counts', 'probs'])

unigram_probs = words_to_pdist(words)
unigram_probs.sort_values('probs', ascending=False).head(5)

|      | words | counts | probs    |
|------|-------|--------|----------|
| 4527 | to    | 5239   | 0.03242  |
| 3157 | the   | 5201   | 0.032184 |
| 1603 | and   | 4896   | 0.030297 |
| 3939 | of    | 4291   | 0.026553 |
| 1266 | i     | 3178   | 0.019666 |


Using the probability/count table above, can we assign a probability to a sentence?

## The Bag of Words Model

The bag of words model assumes that all the words are independent, so the probability of a sentence is the product of the probabilities of its words:

    P(to, the, end) = P(to)P(the)P(end)

In [26]:
def unigram_model(unigram_probs, sent):
    """Calculate unigram probability of a string of words, sent."""    
    idx_uni = unigram_probs.set_index('words')
    return idx_uni.loc[sent, 'probs'].prod()

sent = ['to', 'the', 'end']
p_sent = unigram_model(unigram_probs, sent)

print('Probability of %s is [%.8f]' % (sent, p_sent))

Probability of ['to', 'the', 'end'] is [0.00000031]


## The Bigram Model

We can use the chain rule of probability to break down the full joint distribution:

$P(w_1 \ldots w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1 w_2) \times \ldots \times P(w_n \mid w_1 \ldots w_{n-1})$

To make this more useful, we make the Markov assumption that each word depends only on the word immediately before it:

$P(w_1 \ldots w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \ldots \times P(w_n \mid w_{n-1})$
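
To see what the Markov assumption changes, it can help to list the factors each formula produces for a short sentence. The following is a small sketch; the function names `chain_rule_factors`, `markov_factors`, and `fmt` are hypothetical helpers introduced only for this illustration.

```python
def chain_rule_factors(sent):
    """Return (word, history) pairs for the full chain rule: each word sees all earlier words."""
    return [(w, tuple(sent[:i])) for i, w in enumerate(sent)]

def markov_factors(sent):
    """Return (word, history) pairs under the Markov assumption: each word sees one earlier word."""
    return [(w, tuple(sent[max(i - 1, 0):i])) for i, w in enumerate(sent)]

def fmt(word, history):
    """Render a factor as P(word) or P(word | history)."""
    return 'P(%s)' % word if not history else 'P(%s | %s)' % (word, ' '.join(history))

sent = ['to', 'the', 'end']
print(' x '.join(fmt(w, h) for w, h in chain_rule_factors(sent)))
# P(to) x P(the | to) x P(end | to the)
print(' x '.join(fmt(w, h) for w, h in markov_factors(sent)))
# P(to) x P(the | to) x P(end | the)
```

The two factorizations agree on the first two terms and differ from the third word onward, where the Markov version drops everything except the previous word.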

In [31]:
def bigram_gen(words):
    """Given a word list, generate successive pairs of words."""
    for i in range(len(words)-1):
        yield tuple(words[i:i+2])

bigram_probs = words_to_pdist(bigram_gen(words))
bigram_probs.sort_values('probs', ascending=False).head(5)

|       | words     | counts | probs    |
|-------|-----------|--------|----------|
| 9457  | (to, be)  | 607    | 0.003756 |
| 53401 | (of, the) | 566    | 0.003502 |
| 56444 | (it, was) | 448    | 0.002772 |
| 14667 | (in, the) | 446    | 0.00276  |
| 61833 | (i, am)   | 395    | 0.002444 |


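How do we calculate $P(w_i \mid w_{i-1})$? It is simply the count of the bigram $(w_{i-1}, w_i)$ divided by the number of bigrams that start with $w_{i-1}$, which equals the unigram count of $w_{i-1}$ (except for the very last word of the text, which starts one fewer bigram): $P(w_i \mid w_{i-1}) = \mathrm{count}(w_{i-1}, w_i) / \mathrm{count}(w_{i-1})$.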
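
The count-based estimate of a conditional bigram probability can be sketched on a toy word list, without pandas. The helper name `conditional_prob` and the choice to return `0.0` for a context word that never starts a bigram are assumptions made for this illustration.

```python
from collections import Counter

def conditional_prob(tokens, prev, word):
    """Estimate P(word | prev) as count(prev, word) / number of bigrams starting with prev."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    starts_with_prev = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    if starts_with_prev == 0:
        return 0.0  # assumption: an unseen context gets probability 0
    return bigram_counts[(prev, word)] / starts_with_prev

# Toy data for illustration only
toy = 'to be or not to be that is the question'.split()
print(conditional_prob(toy, 'to', 'be'))  # 1.0: both occurrences of 'to' are followed by 'be'
print(conditional_prob(toy, 'be', 'or'))  # 0.5: 'be' is followed once by 'or' and once by 'that'
```

Note that the conditional probabilities for a fixed previous word sum to 1 over all possible next words, which is not true of the joint bigram probabilities in the table above.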

In [36]:
def bigram_model(unigram_probs, bigram_probs, sent):
    """Calculate bigram probability of a list of words, sent."""
    idx_uni = unigram_probs.set_index('words')
    idx_bi = bigram_probs.set_index('words')
    
    p_sent = idx_uni.loc[sent[0], 'probs']  # P(sent[0])
    
    for w1, w2 in bigram_gen(sent):
        # Conditional probability: P(w2 | w1) = count(w1, w2) / count(w1).
        # The 'probs' column of bigram_probs holds the joint P(w1, w2), so use counts here.
        c_w1_w2 = idx_bi.loc[[(w1, w2),], 'counts'].values[0]
        p_w2_w1 = c_w1_w2 / idx_uni.loc[w1, 'counts']
        print('P(%s | %s) is [%.8f]' % (w2, w1, p_w2_w1))
        p_sent *= p_w2_w1
                         
    return p_sent

bigram_model(unigram_probs, bigram_probs, ['to', 'be', 'or', 'not', 'to', 'be'])

## Missing Words and Smoothing