# Table of Contents
>## 1. Probabilistic Language Model
* 1.1. Unigram Model
* 1.2. Bigram Model (a.k.a. Markov Model)
* 1.3. N-gram Model

>## 2. Bigram Model
* 2.1. Probability Estimation
* 2.2. Example - `movie_reviews`
* 2.3. Most Common Words
* 2.4. Probability of Context-Word Pair
* 2.5. Probability of a Sentence
* 2.6. Generate Sentences

# 1. Probabilistic Language Model

>$$\text{Word Sequence} \;\; w_1, w_2, \ldots, w_m \;\; \text{is given}$$
>
>$$\rightarrow \text{calculate} \;\; P(w_1, w_2, \ldots, w_m)$$
>
>$$\text{(i.e., estimate how plausible the word sequence is as a sentence)}$$

* **Conditional Probability**

>$$
\begin{eqnarray}
P(w_1, w_2, \ldots, w_m) &=& P(w_1, w_2, \ldots, w_{m-1}) \cdot P(w_m\;|\; w_1, w_2, \ldots, w_{m-1}) \\
&=& P(w_1, w_2, \ldots, w_{m-2}) \cdot P(w_{m-1}\;|\; w_1, w_2, \ldots, w_{m-2}) \cdot P(w_m\;|\; w_1, w_2, \ldots, w_{m-1}) \\
&=& P(w_1) \cdot P(w_2 \;|\; w_1) \cdot P(w_3 \;|\; w_1, w_2) \cdot P(w_4 \;|\; w_1, w_2, w_3) \cdots P(w_m\;|\; w_1, w_2, \ldots, w_{m-1})
\end{eqnarray}
$$
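
As a short worked instance of the expansion above, take $m = 3$ and the word sequence *I am happy* (chosen here purely for illustration):

>$$ P(\text{I}, \text{am}, \text{happy}) = P(\text{I}) \cdot P(\text{am}\;|\;\text{I}) \cdot P(\text{happy}\;|\;\text{I}, \text{am}) $$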

* **Context**: $w_1, w_2, \ldots, w_{m-1}$
* **Usage**:

>* Spell Correction
>* Speech Recognition
>* Machine Translation
>* Summarization
>* Question-Answering

## 1.1. Unigram Model
* Each word is assumed to occur **independently** of all other words

>$$ P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^m P(w_i) $$
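
The product above can be sketched in a few lines of Python. This is a minimal illustration on a made-up toy corpus, not on `movie_reviews`; the names `toy_tokens`, `unigram_prob`, and `unigram_sentence_prob` are hypothetical and appear nowhere else in this notebook.

```python
from collections import Counter
from math import prod

# Hypothetical toy corpus (an assumption for illustration only)
toy_tokens = "i am a boy . i am a girl .".split()

counts = Counter(toy_tokens)
total = sum(counts.values())

def unigram_prob(word):
    """Relative frequency of `word` in the toy corpus."""
    return counts[word] / total

def unigram_sentence_prob(words):
    """Product of the individual word probabilities (independence assumption)."""
    return prod(unigram_prob(w) for w in words)

in_order = unigram_sentence_prob(["i", "am", "a", "boy"])
shuffled = unigram_sentence_prob(["boy", "a", "am", "i"])
```

Because the unigram model ignores word order, `in_order` and `shuffled` are equal up to floating-point rounding, which is exactly the weakness the bigram model addresses.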

## 1.2. Bigram Model (a.k.a. Markov Model)

* Each word **depends only on the previous word**

>$$ P(w_1, w_2, \ldots, w_m) = P(w_1) \prod_{i=2}^{m} P(w_{i}\;|\; w_{i-1}) $$

## 1.3. N-gram Model
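* Each word **depends only on the previous $N-1$ words** ($N=1$ gives the unigram model, $N=2$ the bigram model)

>$$ P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_{i}\;|\; w_{i-N+1}, \ldots, w_{i-1}) $$
>
>Context positions before $w_1$ are dropped, so the first factor is simply $P(w_1)$.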
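
To make the notion of context concrete, the sketch below lists every run of `n` consecutive tokens in a sentence. In each tuple the last item is the word being predicted and the preceding items are its context. The helper name `ngrams` is an assumption for illustration (it is defined here, not imported from a library).

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["SS", "i", "am", "a", "boy", "SE"]
bigrams = ngrams(tokens, 2)   # pairs: (context word, predicted word)
trigrams = ngrams(tokens, 3)  # triples: (two context words, predicted word)
```

A sentence of 6 tokens yields 5 bigrams and 4 trigrams; longer contexts leave fewer examples per context, which is one reason small values of `n` are common.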

# 2. Bigram Model
## 2.1. Probability Estimation
* Token $SS$: Sentence Start
* Token $SE$: Sentence End

>$$ P(\text{SS I am a boy SE}) = P(\text{I}\;|\; \text{SS}) \cdot P(\text{am}\;|\; \text{I}) \cdot P(\text{a}\;|\; \text{am}) \cdot P(\text{boy}\;|\; \text{a}) \cdot P(\text{SE}\;|\; \text{boy}) $$

* **Conditional Probability**:

>$$ P(w_{i}\;|\; w_{i-1}) = \dfrac{C(w_{i}, w_{i-1})}{C(w_{i-1})} $$
>
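>* $C(w_{i}, w_{i-1})$: $\text{Count }(w_{i}, w_{i-1})$, the number of times $w_{i-1}$ is immediately followed by $w_{i}$ in the corpus
>* $C(w_{i-1})$: $\text{Count }(w_{i-1})$, the number of times $w_{i-1}$ occurs as a context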
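
As a worked instance with made-up counts (not taken from `movie_reviews`): if the word *i* occurs as a context 3 times and is followed by *am* in 2 of those cases, then

>$$ P(\text{am}\;|\;\text{i}) = \dfrac{C(\text{am}, \text{i})}{C(\text{i})} = \dfrac{2}{3} \approx 0.667 $$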

## 2.2. Example - `movie_reviews`

### Step 1. Download Corpus

In [1]:
import nltk
nltk.download('movie_reviews')
nltk.download('punkt')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/dockeruser/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /home/dockeruser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Step 2. Create List of Sentences

In [2]:
from nltk.corpus import movie_reviews

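sentences = []
for s in movie_reviews.sents():
    # Wrap each tokenized sentence in the start/end tokens
    s.insert(0, "SS")
    s.append("SE")
    # Keep only sentences with at least 3 tokens besides SS and SE
    if len(s) > 4:
        sentences.append(s)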

In [3]:
sentences[1]

['SS', 'they', 'get', 'into', 'an', 'accident', '.', 'SE']

### Step 3. Probability Estimation

In [4]:
from collections import Counter

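def calculate_bigram(sentences):
    # bigram[context][w] holds a count first, then the probability P(w | context)
    bigram = {}
    for s in sentences:
        context = "SS"
        for w in s[1:]:
            if context not in bigram:
                bigram[context] = Counter()
            # Note: a pair seen for the first time is set to 1 and then
            # incremented, so every observed pair ends up with (true count + 1).
            if bigram[context][w] == 0:
                bigram[context][w] = 1
            bigram[context][w] += 1
            context = w
    # Normalize the counts of each context so that they sum to 1
    for context in bigram.keys():
        total = sum(bigram[context].values())
        for w in bigram[context]:
            bigram[context][w] /= total
    return bigram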
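
For comparison, here is a more compact variant that uses plain relative frequencies, with each pair counted exactly once per occurrence. It is a sketch under that assumption, not the function used in this notebook: `calculate_bigram` above starts every newly seen pair at a count of 2, so the numbers produced by this variant will not match the outputs shown in the following sections. The name `calculate_bigram_exact` and the toy input are hypothetical.

```python
from collections import Counter, defaultdict

def calculate_bigram_exact(sentences):
    """Hypothetical variant: P(w | context) from exact pair counts."""
    counts = defaultdict(Counter)
    for s in sentences:
        # zip(s, s[1:]) walks over consecutive (context, word) pairs
        for context, w in zip(s, s[1:]):
            counts[context][w] += 1
    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = Counter({w: c / total for w, c in followers.items()})
    return model

# Toy input, already padded with SS/SE (illustrative only)
toy_model = calculate_bigram_exact([
    ["SS", "i", "am", "a", "boy", "SE"],
    ["SS", "i", "am", "a", "girl", "SE"],
])
```

Keeping each row as a `Counter` preserves the `most_common()` interface and returns 0 for pairs that were never observed.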

### Step 4. Create Bigram Model

In [5]:
bigram = calculate_bigram(sentences)

## 2.3. Most Common Words

#### Example) First word of the sentence

In [6]:
bigram["SS"].most_common(10)

[('the', 0.11231263830320237),
 ('it', 0.043575076893101194),
 ('i', 0.03379121261464379),
 ('but', 0.02523207103391647),
 ('and', 0.024160438673402642),
 ('he', 0.023269731256871668),
 ('in', 0.023102723616272112),
 ('this', 0.022963550582439148),
 ('there', 0.0180507424881355),
 ('as', 0.013249272820898222)]

#### Example) What comes after "we"

In [7]:
bigram["we"].most_common(10)

[("'", 0.12985751295336787),
 ('are', 0.07674870466321243),
 ('see', 0.059261658031088085),
 ('get', 0.052461139896373056),
 ('have', 0.05116580310880829),
 ('can', 0.0391839378238342),
 ('don', 0.03756476683937824),
 ('know', 0.03432642487046632),
 ('never', 0.01878238341968912),
 ('learn', 0.018458549222797927)]

## 2.4. Probability of Context-Word Pair

In [8]:
bigram["i"]["was"]

0.053622421998942356

In [9]:
bigram["i"]["am"]

0.017556848228450557

In [10]:
bigram["i"]["is"]

0.00031729243786356425

In [11]:
bigram["i"]["are"]

0.00021152829190904283

In [12]:
bigram["."]["SE"]

0.9612387969875893

In [13]:
bigram["."]

Counter({'SE': 0.9612387969875893,
         "'": 0.0010735373054213634,
         '"': 0.02922949299760894,
         ')': 0.00821418695814831,
         "''": 6.506286699523415e-05,
         ']': 0.0001789228842368939})
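
Each row of the model is a conditional distribution, so its values should sum to 1. A quick sanity check on the six values printed above (the names `after_period` and `sums_to_one` are introduced here for illustration):

```python
import math

# Values copied from the bigram["."] output shown above
after_period = {
    "SE": 0.9612387969875893,
    "'": 0.0010735373054213634,
    '"': 0.02922949299760894,
    ")": 0.00821418695814831,
    "''": 6.506286699523415e-05,
    "]": 0.0001789228842368939,
}

total = sum(after_period.values())
# Allow a small tolerance for floating-point rounding
sums_to_one = math.isclose(total, 1.0, abs_tol=1e-9)
```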

## 2.5. Probability of a Sentence

In [14]:
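import numpy as np

def sentence_score(s):
    # Accumulate log-probabilities to avoid underflow from multiplying small numbers
    p = 0.0
    for i in range(len(s) - 1):
        c = s[i]
        w = s[i + 1]
        # eps keeps log() finite when the pair (c, w) was never observed
        p += np.log(bigram[c][w] + np.finfo(float).eps)
    return np.exp(p)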

In [15]:
test_sentence = ["i", "am", "a", "boy", "."]
sentence_score(test_sentence)

3.288036438066686e-08

In [16]:
test_sentence = ["i", "is", "boy", "a", "."]
sentence_score(test_sentence)

1.9683389110380156e-38
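
The gap between the two scores comes from word pairs that never occur in the corpus: each one contributes a factor close to machine epsilon. The sketch below reproduces this effect on a hand-written toy model and returns the score in log space, which is easier to compare when probabilities become very small. The model `toy_bigram`, its probabilities, and the function `log_sentence_score` are assumptions for illustration.

```python
import math
import sys

# Hypothetical toy model: toy_bigram[context][word] = P(word | context)
toy_bigram = {
    "i": {"am": 0.5, "was": 0.5},
    "am": {"a": 1.0},
    "a": {"boy": 0.25, "girl": 0.75},
}

EPS = sys.float_info.epsilon  # same value as np.finfo(float).eps

def log_sentence_score(words, model):
    """Sum of log P(w | c) over consecutive pairs; unseen pairs get log(EPS)."""
    log_p = 0.0
    for c, w in zip(words, words[1:]):
        p = model.get(c, {}).get(w, 0.0)
        log_p += math.log(p + EPS)
    return log_p

seen = log_sentence_score(["i", "am", "a", "boy"], toy_bigram)
unseen = log_sentence_score(["i", "a", "am", "boy"], toy_bigram)
```

Here `seen` is the log of roughly 0.5 × 1.0 × 0.25, while every pair in the scrambled sentence is unobserved, so `unseen` is three times `log(EPS)`.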

## 2.6. Generate Sentences

In [17]:
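def generate_sentence(seed=None):
    # Fix the NumPy random seed so that a given seed reproduces the same sentence
    if seed is not None:
        np.random.seed(seed)
    c = "SS"  # current context; generation starts at the sentence-start token
    sentence = []
    while True:
        if c not in bigram:
            break
        # Draw the next word from P(w | c) with a single multinomial trial
        words, probs = zip(*[(k, v) for k, v in bigram[c].items()])
        idx = np.argmax(np.random.multinomial(1, probs, (1,)))
        w = words[idx]

        # Stop at the sentence-end token; restore capitalization for a few
        # known tokens (the corpus tokens are lowercase)
        if w == "SE":
            break
        elif w in ["i", "ii", "iii"]:
            w2 = w.upper()
        elif w in ["mr", "luc", "robin", "williams", "cindy", "crawford"]:
            w2 = w.title()
        else:
            w2 = w

        # Capitalize the first word; omit the leading space after opening
        # quotes or brackets and before punctuation
        if c == "SS":
            sentence.append(w2.title())
        elif c in ["`", "\"", "'", "("]:
            sentence.append(w2)
        elif w in ["'", ".", ",", ")", ":", ";", "?"]:
            sentence.append(w2)
        else:
            sentence.append(" " + w2)

        c = w
    return "".join(sentence)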

In [18]:
generate_sentence(82)

'Alexandre dumas may suspect he at being can be honest here goes awol, but he trusts affleck - see this documentary.'
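
The core of the generator is a random walk: start at `SS`, sample the next word from the current context's distribution, and stop at `SE`. The sketch below isolates that loop using the standard library's `random.choices` instead of `np.random.multinomial`, so it will not reproduce the sentence above for the same seed. The toy model, the function `sample_sentence`, and the `max_len` safety limit are assumptions for illustration.

```python
import random

# Hypothetical toy model: toy_bigram[context][word] = P(word | context)
toy_bigram = {
    "SS": {"i": 1.0},
    "i": {"am": 0.5, "like": 0.5},
    "am": {"a": 1.0},
    "like": {"a": 1.0},
    "a": {"boy": 0.5, "girl": 0.5},
    "boy": {"SE": 1.0},
    "girl": {"SE": 1.0},
}

def sample_sentence(model, rng, max_len=20):
    """Walk the chain from SS, sampling each next word from P(. | context)."""
    context, words = "SS", []
    while context in model and len(words) < max_len:
        candidates = list(model[context])
        weights = [model[context][w] for w in candidates]
        w = rng.choices(candidates, weights=weights, k=1)[0]
        if w == "SE":
            break
        words.append(w)
        context = w
    return words

# A dedicated Random instance keeps the sketch reproducible without
# touching the global random state
sample = sample_sentence(toy_bigram, random.Random(0))
```

Since each word is chosen from its predecessor alone, generated text tends to be locally plausible but globally incoherent, as the sample sentence above shows.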