**Table of contents**

* [Random sentences](#snt)
* [Language models](#lm)
* [Implementation](#imp)
    * [Unigram LM](#imp-uni)
    * [A general n-gram language model](#imp-n)
* [Evaluation](#eval)
* [Smoothing](#smooth)
* [Interpolation](#inter)
    
**Table of Exercises**

* Theory (10 points)
    * [Exercise 3-1](#ex3-1)
    * [Exercise 3-2](#ex3-2)
    * [Exercise 3-3](#ex3-3)
    * [Exercise 3-4](#ex3-4)
* Practice (20 points)
    
    * [Exercise 3-5](#ex3-5)
    * [Exercise 3-6](#ex3-6)
    * [Exercise 3-7](#ex3-7)
    * [Exercise 3-8](#ex3-8)
    * [Exercise 3-9](#ex3-9)

**General notes**

* In this notebook you are expected to use $\LaTeX$. 
* Use Python 3.

# <a name="snt"></a> Random sentences

Given a sentence $S$ (for example ***He went to the store*** in English), a language model (LM) can tell us whether it resembles a natural sentence.
Can we learn a model to assess the fluency of sentences generated by an automatic system?

For example, such a model must prefer a sentence like ***He went to the store*** to a sentence like ***He store go***.

A language model is an attempt to quantify how good (or bad) any given sentence is. A natural way to represent this degree of goodness is as a probability value $P$: if the model assigns a high probability to *He went to the store* and a low probability to *He store go*, we conclude that the former is much more likely to be a fluent English sentence.

We model a sentence $S$ as a sequence of random words, so let's first define a random variable $X$ that represents a random word:

<a name="ex3-1" style="color:red">**Exercise 3-1**</a> **[2 points]** Define a categorical random variable $X$ for words sampled from a closed vocabulary $\Sigma$ (assume the size of the vocabulary is denoted by $v$). In your answer make sure you indicate what is the sample space and the precise support $\mathcal X$ of the categorical random variable. 

TODO, perhaps $X(t) = \sum^v_i \Sigma^{\delta i t}_i$

$\mathcal X = \sum^v_i \Sigma_i$ 

A **sentence** corresponds to any sequence of words in $\Sigma^*$.

We denote a random sentence $S$ by the sequence $\langle X_1, \ldots, X_n \rangle$ or the shorthand $X_1^n$.
The following definition is also useful:

* for the $i$th random word $X_i$, the prefix $\langle x_1, \ldots, x_{i-1} \rangle$ (also denoted $x_{<i}$) is called a random history.
* we use $H$ to denote an arbitrary random history and $H_i$ to denote the $i$th random history

A **generative story** is a stochastic procedure that we define as a means to explain the process by which we believe data are generated. For random sentences we define the following generative story

1. Sample the sequence length from a distribution $P_N$
    * $N \sim P_N$
2. Then for each position $i=1 , \ldots, n$ sample the $i$th word from the distribution  $P_{X|H}$
    * $X_i|x_{<i} \sim P_{X|H=x_{<i}}$

Here is an example for a sentence with $3$ words:

$P_S(\langle x_1, x_2, x_3 \rangle) = P_N(3) P_{X|H}(x_1) P_{X|H}(x_2|\langle x_1 \rangle) P_{X|H}(x_3 | \langle x_1, x_2 \rangle)$

For our example sentence *He went to the store* this means:

$P(\text{"He went to the store"}) = P_N(5) \times P(\text{He}) \times P(\text{went}|\langle \text{He} \rangle) \times P(\text{to}|\langle \text{He}, \text{went} \rangle) \times P(\text{the}|\langle \text{He},  \text{went}, \text{to} \rangle) \times P(\text{store}|\langle \text{He},  \text{went}, \text{to}, \text{the} \rangle) $

* where with some abuse of notation we use the words themselves instead of their corresponding indices. 
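The generative story above can be sketched in a few lines of Python. This is only an illustration: the function `sample_sentence`, the toy length distribution and the toy word distribution are made up for this example and are not part of the lab code.

```python
import random


def sample_sentence(length_probs, next_word_probs, rng):
    """Sample a sentence following the two-step generative story.

    length_probs: dict mapping a length n to P_N(n)
    next_word_probs: function mapping a history (tuple of words)
        to a dict {word: P(word | history)}
    rng: a random.Random instance
    """
    # Step 1: sample the sentence length N ~ P_N
    lengths = list(length_probs)
    n = rng.choices(lengths, weights=[length_probs[k] for k in lengths])[0]
    # Step 2: for each position sample X_i | x_{<i} ~ P_{X|H}
    sentence = []
    for _ in range(n):
        cpd = next_word_probs(tuple(sentence))
        words = list(cpd)
        word = rng.choices(words, weights=[cpd[w] for w in words])[0]
        sentence.append(word)
    return sentence


# Toy (made-up) distributions, for illustration only
toy_length_probs = {2: 0.5, 3: 0.5}


def toy_next_word_probs(history):
    # this toy cpd ignores the history (the unigram special case)
    return {'he': 0.4, 'went': 0.3, 'store': 0.3}


sampled = sample_sentence(toy_length_probs, toy_next_word_probs, random.Random(0))
print(sampled)
```

A real model would return a different distribution for each history; here the history is ignored to keep the sketch short.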




<a name="ex3-2" style="color:red">**Exercise 3-2**</a> **[3 points]**  Write down the general rule for the probability $P_S$ of a sentence $x_1^n$. For this exercise please use subscripts to indicate the precise random variable associated with every distribution (that is, for example, $P_S$ is correct while $P$ is wrong). 

TODO: check, perhaps $P_S(x_1^n) = P_N(n) \prod^n_{i=1} P_{X|H} (x_i | x_{<i})$


# <a name="lm"></a> Language models

Here we quickly revisit the material discussed in class about n-gram LMs.

We start with the simplest unigram language model. The idea is to forget the history therefore making a strong independence assumption:


\begin{equation}
(1) \qquad P_S(x_1^n) \approx P_N(n) \prod_{i=1}^n P_X(x_i)
\end{equation}

* we assume $P_N(n)$ to be some constant $c$, that is, a uniform distribution over lengths
* and we assume $P_X$ to be a Categorical distribution

Thus, the final <a name="eq-unigram-lm">unigram LM definition</a> is 

\begin{equation}
(2) \qquad P_S(x_1^n; c, \theta_1^v) \triangleq c \prod_{i=1}^n \text{Cat}(X=x_i|\theta_1, \ldots, \theta_v)
\end{equation}

Note that we have introduced the Categorical pmf, which you have learnt about in Lab2. 

<a name="ex3-3" style="color:red">**Exercise 3-3**</a> **[3 points]**  Complete the categorical pmf and the conditions below:

$\text{Cat}(X=a|\theta_1, \ldots, \theta_v) = \prod^v_{x=1}\theta_x^{\delta_{ax}}$

where $\delta_{ax}$ is $1$ if $a = x$ and $0$ otherwise, and $\theta_1^v$ are the categorical parameters for which it must hold

1. $0 \leq \theta_x \leq 1$ for $x = 1, \ldots, v$
2. $\sum^v_{x=1} \theta_x = 1$


**Maximum likelihood estimation**

Suppose we are given a corpus containing $m$ sentences

* $\langle x_1^{(k)}, \ldots, x_{n_k}^{(k)} \rangle$ for $k=1, \ldots, m$
* where $n_k$ is the length of the $k$th sentence

The MLE solution for the unigram LM is based on gathering counts and computing the relative frequency of word types:

\begin{equation}
(3) \qquad \theta_x = \frac{\text{count}(x)}{\text{number of tokens}}
\end{equation}

Note that the *number of tokens* is simply the sum of the length of the sentences $\sum_{k=1}^m n_k$.


More generally, for a conditional probability distribution (cpd), we have that 

* $P_{X|H}(x|h) = \text{Cat}(X=x|\theta_1^{(h)}, \ldots, \theta_v^{(h)})$

where $h$ uniquely indexes a history and $P_{X|H}(x|h) = \theta_x^{(h)}$ is the $x$th probability value in the $h$th cpd.

Then the MLE solution is simply

\begin{equation}
(4) \qquad \theta_x^{(h)} = \frac{\text{count}(h \circ \langle x \rangle)}{\text{count}(h)}
\end{equation}

where  $h \circ \langle x \rangle$ is the concatenation of history and word.
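As a small worked illustration of Equation (4), the sketch below estimates cpds from a toy corpus by counting how often each word follows each history. The function name `conditional_mle` and the toy corpus are assumptions made for this example, and padding with boundary tokens is deliberately left out to keep it short.

```python
from collections import defaultdict


def conditional_mle(sentences, order=1):
    """Estimate P(x | h) = count(h followed by x) / count(h) for histories of fixed length."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        for i in range(order, len(sentence)):
            history = tuple(sentence[i - order:i])
            counts[history][sentence[i]] += 1
    probs = {}
    for history, word_counts in counts.items():
        # count(h): number of times this history was extended by some word
        total = sum(word_counts.values())
        probs[history] = {w: c / total for w, c in word_counts.items()}
    return probs


toy_corpus = [['a', 'b', 'a', 'c'], ['a', 'b']]
toy_cpds = conditional_mle(toy_corpus, order=1)
print(toy_cpds[('a',)])  # 'b' follows 'a' twice and 'c' once
```

Each cpd is normalised separately, so the probabilities sum to one per history rather than over the whole table.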

Now that we know how to estimate general cpds we can define the n-gram LM.

An <a name="eq-ngram-lm">$n$-gram LM</a> is a Markov model of order $o=n-1$ where we truncate the complete history $x_{<i}$ so that it contains only the $o$ most recent words $x_{i-o}^{i-1}$.

\begin{equation}
(5) \qquad P_S(x_1^n; c, \boldsymbol \theta) \triangleq c \prod_{i=1}^n P_{X|H}(x_i|x_{i-o}^{i-1}; \boldsymbol \theta)
\end{equation}

where $P_{X|H=h; \boldsymbol \theta}$ is $\text{Cat}(\theta_1^{(h)}, \ldots, \theta_v^{(h)})$ 


***Example***

Consider the sentence *He went to the store*, its probability under the unigram LM is

$P_S(\langle \text{He, went, to, the, store} \rangle) \propto P_X(\text{He}) \times P_X(\text{went}) \times P_X(\text{to}) \times P_X(\text{the}) \times P_X(\text{store})$

which can also be seen as 

$P_S(\langle \text{He, went, to, the, store} \rangle) \propto \theta_{\text{He}} \times \theta_{\text{went}} \times \theta_{\text{to}} \times \theta_{\text{the}} \times \theta_{\text{store}}$

where again we use the words instead of their indices and we use the proportionality symbol to ignore the probability of the length.

<a name="ex3-4" style="color:red">**Exercise 3-4**</a> **[2 points]**  Write down the probability of the sentence 

    He went to the store
    
under a bigram language model. Tip: recall that *the* is a word while $\langle \text{the} \rangle$ is a sequence. 

TODO: check, perhaps $P_S(\langle \text{He, went, to, the, store} \rangle) \propto P_{X|H}(\text{He} | \langle \rangle) \times P_{X|H}(\text{went}|\langle \text{He} \rangle) \times P_{X|H}(\text{to} | \langle \text{went} \rangle) \times P_{X|H}(\text{the} | \langle \text{to} \rangle) \times P_{X|H}(\text{store}| \langle \text{the} \rangle)$


# <a name="imp"></a> Implementation

We will start by showing you how to implement the unigram LM. 

Consider the PTB dataset as our training data: *sec02-21.raw*. We will estimate the categorical parameters and then query the LM with some sentences to find out their probability.

Notes: 

1. For *memory efficiency*, rather than vectors we will use sparse data structures (such as a Python `dict`), so that we spend no memory on events that have never occurred.
2. We lowercase the data to collect better statistics (otherwise 'He' and 'he' would correspond to different words)
3. Recall from the lecture that we pad sentences with 1 EOS token (which becomes part of the sequence) and $n-1$ BOS tokens (which are there just to make the history size constant). 
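To make the padding convention in note 3 concrete, here is a minimal sketch that lists the (history, word) pairs an $n$-gram model would see for one sentence. The helper name `padded_ngrams` is hypothetical and is not used elsewhere in this notebook.

```python
def padded_ngrams(sentence, n, bos='<s>', eos='</s>'):
    """Return the (history, word) pairs of an n-gram model for one sentence."""
    # n - 1 BOS tokens keep the history size constant; 1 EOS token ends the sequence
    padded = [bos] * (n - 1) + list(sentence) + [eos]
    pairs = []
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - (n - 1):i])
        pairs.append((history, padded[i]))
    return pairs


toy_pairs = padded_ngrams(['he', 'went'], n=2)
print(toy_pairs)
# [(('<s>',), 'he'), (('he',), 'went'), (('went',), '</s>')]
```

With `n=1` every history is the empty tuple, which matches the unigram case where the history is ignored.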

## <a name="imp-uni"></a> Unigram LM

We start with the unigram language model, whose factorisation is shown in [Equation (2)](#eq-unigram-lm).

First, we start by **loading and pre-processing data**. In the code below we use [python generators](https://wiki.python.org/moin/Generators); check the link if you are not familiar with them. 

In [1]:
from collections import defaultdict


def preprocess(file_path, min_count=1, char_level=False):
    """
    Returns a generator (a data stream) that yields one pre-processed sentence at a time.
    A preprocessed sentence is:
        - a list of tokens (each token a string)
            - where tokens are lowercased
                - and possibly replaced by '<unk>' if infrequent 
        
    :param file_path: path to a text corpus
    :param min_count: minimum number of occurrences 
        if a token occurs fewer times than this value we replace it by '<unk>'
    :param char_level: if True, tokens are characters rather than whitespace-separated words
    :returns: a generator of sentences
        A generator is an object that can be used in `for` loops
    """
    count = defaultdict(int)
    # First we count the number of occurrences of each token
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # we skip empty lines
            if char_level:
                sentence = [ch for ch in line.lower()] 
            else:
                sentence = line.lower().split()
            for token in sentence:
                count[token] += 1
    # then we yield one preprocessed sentence at a time
    # making sure we map infrequent tokens to <unk>
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # we skip empty lines
            if char_level:
                sentence = [ch for ch in line.lower()] 
            else:
                sentence = line.lower().split()
            preprocessed_sentence = [token if count[token] >= min_count else '<unk>' for token in sentence]                    
            yield preprocessed_sentence
        

In [2]:
# Let's test our preprocessed sentence generator
for k, sentence in enumerate(preprocess('eleanor-rigby.txt'), 1):
    print(k, sentence)

1 ['ah', 'look', 'at', 'all', 'the', 'lonely', 'people']
2 ['ah', 'look', 'at', 'all', 'the', 'lonely', 'people']
3 ['eleanor', 'rigby', ',', 'picks', 'up', 'the', 'rice']
4 ['in', 'the', 'church', 'where', 'a', 'wedding', 'has', 'been']
5 ['lives', 'in', 'a', 'dream']
6 ['waits', 'at', 'the', 'window', ',', 'wearing', 'the', 'face']
7 ['that', 'she', 'keeps', 'in', 'a', 'jar', 'by', 'the', 'door']
8 ['who', 'is', 'it', 'for']
9 ['all', 'the', 'lonely', 'people']
10 ['where', 'do', 'they', 'all', 'come', 'from', '?']
11 ['all', 'the', 'lonely', 'people']
12 ['where', 'do', 'they', 'all', 'belong', '?']
13 ['father', 'mckenzie', ',', 'writing', 'the', 'words']
14 ['of', 'a', 'sermon', 'that', 'no', 'one', 'will', 'hear']
15 ['no', 'one', 'comes', 'near']
16 ['look', 'at', 'him', 'working', ',', 'darning', 'his', 'socks']
17 ['in', 'the', 'night', 'when', 'there', "'s", 'nobody', 'there']
18 ['what', 'does', 'he', 'care']
19 ['all', 'the', 'lonely', 'people']
20 ['where', 'do', 'they', '

In [3]:
# Let's see what happens if we prune words that happen only once
for k, sentence in enumerate(preprocess('eleanor-rigby.txt', min_count=2), 1):
    print(k, sentence)

1 ['ah', 'look', 'at', 'all', 'the', 'lonely', 'people']
2 ['ah', 'look', 'at', 'all', 'the', 'lonely', 'people']
3 ['eleanor', 'rigby', ',', '<unk>', '<unk>', 'the', '<unk>']
4 ['in', 'the', 'church', 'where', 'a', '<unk>', '<unk>', '<unk>']
5 ['<unk>', 'in', 'a', '<unk>']
6 ['<unk>', 'at', 'the', '<unk>', ',', '<unk>', 'the', '<unk>']
7 ['that', '<unk>', '<unk>', 'in', 'a', '<unk>', '<unk>', 'the', '<unk>']
8 ['<unk>', '<unk>', '<unk>', '<unk>']
9 ['all', 'the', 'lonely', 'people']
10 ['where', 'do', 'they', 'all', 'come', 'from', '?']
11 ['all', 'the', 'lonely', 'people']
12 ['where', 'do', 'they', 'all', 'belong', '?']
13 ['father', 'mckenzie', ',', '<unk>', 'the', '<unk>']
14 ['<unk>', 'a', '<unk>', 'that', 'no', 'one', '<unk>', '<unk>']
15 ['no', 'one', '<unk>', '<unk>']
16 ['look', 'at', '<unk>', '<unk>', ',', '<unk>', 'his', '<unk>']
17 ['in', 'the', '<unk>', '<unk>', 'there', '<unk>', 'nobody', 'there']
18 ['<unk>', '<unk>', 'he', '<unk>']
19 ['all', 'the', 'lonely', 'people']

Now we show you **how to count unigrams**.

In [4]:
from collections import defaultdict


def count_unigrams(sentence_stream):
    """
    input: a generator of preprocessed sentences
        - a preprocessed sentence is a list of lowercased tokens
          where rare tokens were possibly replaced by <unk>
    output: 
        unigram_count: dictionary of frequency of each word
    """    
    unigram_counts = defaultdict(int)
    for sentence in sentence_stream:
        sentence = sentence + ["</s>"]  # add end of sentence
        for token in sentence:
            unigram_counts[token.lower()] += 1  # frequency of each word
    return unigram_counts

# Let's test our procedure and check how many times 'cat', 'mat' and 'church' happen in the PTB training corpus
unigram_count_table = count_unigrams(preprocess('sec02-21.raw'))
print('unigram=cat count=%d' % unigram_count_table['cat'])
print('unigram=mat count=%d' % unigram_count_table['mat'])
print('unigram=church count=%d' % unigram_count_table['church'])

unigram=cat count=1
unigram=mat count=0
unigram=church count=35


Now we show you **how to get the MLE solution for the unigram distribution**

In [5]:
def unigram_mle(unigram_counts):
    """
    input: unigram_counts: dictionary of frequency of each word
           
    output: unigram_probs: dictionary with the probability of each word 
            (parameters of the model)
    """
    # normalise by the counts passed in as argument (not by a global table)
    total_count = sum(unigram_counts.values())
    unigram_probs = defaultdict(float)
    for word, count in unigram_counts.items():
        unigram_probs[word] = float(count) / total_count
    return unigram_probs

# Let's check the MLE parameters associated with 'cat' and 'mat' by querying their unigram probabilities
unigram_prob_table = unigram_mle(unigram_count_table)
print('unigram=cat prob=%f' % unigram_prob_table['cat'])
print('unigram=mat prob=%f' % unigram_prob_table['mat'])

unigram=cat prob=0.000001
unigram=mat prob=0.000000


And finally we show you **how to compute the log-probability** of a sentence under the unigram LM

In [6]:
import numpy as np


def calculate_sentence_unigram_log_probability(sentence, word_probs):
    """
    input: 
        sentence: list of words in a sentence
        word_probs: MLE parameters
    output:
        sentence_probability_sum: log probability of the sentence
    """
    sentence_probability_sum = 0.
    # we first get the probability of unknown words
    #  which by default is 0. in case '<unk>' is not in the support
    unk_probability = word_probs.get('<unk>', 0.)
    for word in sentence:
        # this will return `unk_probability` if the word is not in the support
        word_probability = word_probs.get(word, unk_probability)  
        # it is a sum of log probabilities
        # we use np.log because it returns float('-inf') for log(0) (with a warning) instead of raising
        sentence_probability_sum += np.log(word_probability)
    return sentence_probability_sum  
    
sent_prob = calculate_sentence_unigram_log_probability(['the', 'cat', 'sat', 'on', 'the', 'cat'], unigram_prob_table)
print(sent_prob)

-49.5345713414


**Unseen words**

However, note that if we want the probability of sentences containing words that are not present in the training corpus, we will have an unpleasant surprise. 

For example: *the cat sat on the mat*


In [7]:
calculate_sentence_unigram_log_probability(['the', 'cat', 'sat', 'on', 'the', 'mat'], unigram_prob_table)



-inf

Of course that would not happen if we pre-process the data to map infrequent words to `<unk>`, as we illustrate below by setting `min_count=2`.

In [8]:
unigram_prob_table2 = unigram_mle(count_unigrams(preprocess('sec02-21.raw', min_count=2))  )
calculate_sentence_unigram_log_probability(['the', 'cat', 'sat', 'dick', 'the', 'mat'], unigram_prob_table2)

-35.975849474286996

This is a very rudimentary *smoothing technique* and you will see other techniques later in this notebook.

## <a name="imp-n"></a> A general n-gram language model

We now turn to a general $n$-gram LM, whose factorisation is shown in [Equation (5)](#eq-ngram-lm).

<a name="ex3-5" style="color:red">**Exercise 3-5**</a> **[10 points]** In this exercise you will build a general $n$-gram language model where $n \ge 1$. We provide you with a skeleton class on which to build. 

a) Implementation **[3 points]**

* Start by implementing the method `count_ngrams`, see the documentation of the method for specification. Tip: expand upon the procedure implemented in the function `count_unigrams` above; remember to handle BOS tokens and EOS tokens correctly. Use `<s>` for BOS token and `</s>` for EOS token.
* Now implement the method `solve_mle`, see the documentation of the method for specification.
* Finally, implement the `log_prob` method, see the documentation of the method for specification.

b) Toy data **[1 point]**: Train 3 models (unigram, bigram, trigram) using Eleanor Rigby's lyrics (`eleanor-rigby.txt`) and show that you can reproduce the output of the code below.

```python
unigram_lm = LM(order=0)
bigram_lm = LM(order=1)
trigram_lm = LM(order=2)

unigram_lm.count_ngrams(preprocess('eleanor-rigby.txt'))
unigram_lm.solve_mle()
bigram_lm.count_ngrams(preprocess('eleanor-rigby.txt'))
bigram_lm.solve_mle()
trigram_lm.count_ngrams(preprocess('eleanor-rigby.txt'))
trigram_lm.solve_mle()
print(unigram_lm.log_prob("where do they all belong ?".split()))
print(bigram_lm.log_prob("where do they all belong ?".split()))
print(trigram_lm.log_prob("where do they all belong ?".split()))
```

which should produce

```python
-23.5871446234
-3.56272816879
-2.42774823595
```

c) PTB data (`sec02-21.raw`): train 3 models (unigram, bigram, and trigram) and report the probability of: <span style="color:blue">the new rate will be payable feb. 15 .</span>

* unigram model **[2 points]**
* bigram model **[2 points]**
* trigram model **[2 points]**

The excerpt below

```python
unigram_lm = LM(order=0)
bigram_lm = LM(order=1)
trigram_lm = LM(order=2)

unigram_lm.count_ngrams(preprocess('sec02-21.raw'))
unigram_lm.solve_mle()
bigram_lm.count_ngrams(preprocess('sec02-21.raw'))
bigram_lm.solve_mle()
trigram_lm.count_ngrams(preprocess('sec02-21.raw'))
trigram_lm.solve_mle()
print(unigram_lm.log_prob("the new rate will be payable feb. 15 .".split()))
print(bigram_lm.log_prob("the new rate will be payable feb. 15 .".split()))
print(trigram_lm.log_prob("the new rate will be payable feb. 15 .".split()))
```

should produce

```python
-63.0944350135
-35.0096672791
-20.6963911844
```

*Help with debugging?*

We have provided a toy corpus called `eleanor-rigby.txt`. For that corpus we also provide the output of `print_count_table` and `print_prob_table` for a correct implementation of the LM class, with the order varying from 0 to 2:

* `eleanor-rigby-unigram-counts.txt`
* `eleanor-rigby-unigram-cpd.txt`
* `eleanor-rigby-bigram-cpds.txt`
* `eleanor-rigby-bigram-counts.txt`
* `eleanor-rigby-trigram-cpds.txt`
* `eleanor-rigby-trigram-counts.txt`



In [64]:
from collections import defaultdict
import sys

class LM:
    
    def __init__(self, order):
        self._order = order
        self._count_table = dict()
        self._prob_table = dict()
        self._vocab = set()
        
    def order(self):
        return self._order
        
    def print_count_table(self, output_stream=sys.stdout):
        """Prints the count table for visualisation"""
        for history, ngrams in sorted(self._count_table.items(), key=lambda pair: pair[0]):
            for word, count in sorted(ngrams.items(), key=lambda pair: pair[0]):
                print('history="%s" word=%s count=%d' % (' '.join(history), word, count), file=output_stream)
                
    def print_prob_table(self, output_stream=sys.stdout):
        """Prints the tabular cpd for visualisation"""
        for history, ngrams in sorted(self._prob_table.items(), key=lambda pair: pair[0]):
            for word, prob in sorted(ngrams.items(), key=lambda pair: pair[0]):
                print('history="%s" word=%s prob=%f' % (' '.join(history), word, prob), file=output_stream)
                
    def preprocess_history(self, history):
        """
        This function pre-process an arbitrary history to match the order of this language model.
        :param history: a sequence of words
        :return: a tuple containing exactly as many elements as the order of the model
            - if the input history is too short we pad it with <s> 
        """
        if len(history) == self._order:
            return tuple(history)
        elif len(history) > self._order:
            length = len(history)            
            return tuple(history[length - self._order: length]) #NOTE: we fixed the bug!
        else:  # here the history is too short
            missing = self._order - len(history)
            return tuple(['<s>'] * missing) + tuple(history)
                
    def get_parameter(self, history, word):
        """
        This function returns the categorical parameter associated with a certain word given a certain history.
        :param history: a sequence of words (a tuple)
        :param word: a word (a str)
        :return: a float representing P(word|history)
        """
        history = self.preprocess_history(history)
        cpd = self._prob_table.get(history, None)
        if cpd is None:
            return 0.
        else:
            # we either return P(x|h)
            #  or P(unk|h) in case x is not in the support of this cpd
            #   or 0. in case neither x nor unk are in the support of this cpd
            unk_probability = cpd.get('<unk>', 0.)
            return cpd.get(word, unk_probability)
        
    def cpd_items(self, history):
        history = self.preprocess_history(history)
        # if the history is unseen we return an empty cpd
        return self._prob_table.get(history, dict()).items()
        
    def count_ngrams(self, data_stream):
        """
        This function should populate the attribute _count_table which should be understood as 
            - a python dict 
                - whose key is a history (a tuple of words)
                - and whose value is itself a python dict (or defaultdict)
                    - which maps a word (a string) to a count (an integer)
        
        This function will add counts to whatever counts are already stored in _count_table.
        
        This function also maintains a unique set of words in the vocabulary using the attribute _vocab
        
        :param data_stream: a generator as produced by `preprocess`
        """
        for sentence in data_stream:
            sentence = sentence + ["</s>"]  # add end of sentence
            # every sentence starts from a history made of BOS tokens only
            history = tuple(['<s>'] * self._order)
            for token in sentence:
                t = token.lower()
                self._vocab.add(t)
                # create the inner dict on first use, then update the count of (history, t)
                #  (assigning a new dict here would wipe the counts of other words)
                word_counts = self._count_table.setdefault(history, dict())
                word_counts[t] = word_counts.get(t, 0) + 1
                # slide the history: append t and keep the `order` most recent words
                history = self.preprocess_history(history + (t,))
                
                
                    
    def solve_mle(self):
        """
        This function should compute the attribute _prob_table which has the exact same structure as _count_table
         but stores probability values instead of counts. 
        It can be seen as the collection of cpds of our model, that is, _prob_table
            - maps a history (a tuple of words) to a dict where
                - a key is a word (that extends the history forming an ngram)
                - and the value is the probability P(word|history)                
                
        This function will replace whatever value _prob_table currently stores by the newly computed MLE solution.
        """
        
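        # build a fresh table so that _count_table is left untouched
        #  (a shallow copy would share, and overwrite, the inner count dicts)
        self._prob_table = dict()
        for history, word_counts in self._count_table.items():
            # count(h): each cpd is normalised by the total count of its own history
            total = sum(word_counts.values())
            self._prob_table[history] = {
                word: count / total for word, count in word_counts.items()
            }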
        
        
    def log_prob(self, sentence):
        """
        Compute the log probability of a sentence under this model. 
                
        input: 
            sentence: a sequence of tokens
        output:
            log probability
        """
        
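        # we work with a sum of log probabilities to avoid numerical underflow
        log_probability = 0.
        history = []
        # the EOS token is part of the sequence, just as in `count_ngrams`
        for word in list(sentence) + ['</s>']:
            # `get_parameter` truncates or pads the history to match the order
            parameter = self.get_parameter(history, word)
            # np (numpy) was imported in an earlier cell; np.log(0.) is float('-inf')
            log_probability += np.log(parameter)
            history.append(word)
        return log_probability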
    


In [65]:
unigram_lm = LM(order=0)
bigram_lm = LM(order=1)
trigram_lm = LM(order=2)

unigram_lm.count_ngrams(preprocess('eleanor-rigby.txt'))
unigram_lm.solve_mle()
bigram_lm.count_ngrams(preprocess('eleanor-rigby.txt'))
bigram_lm.solve_mle()
trigram_lm.count_ngrams(preprocess('eleanor-rigby.txt'))
trigram_lm.solve_mle()
print(unigram_lm.log_prob("where do they all belong ?".split()))
print(bigram_lm.log_prob("where do they all belong ?".split()))
print(trigram_lm.log_prob("where do they all belong ?".split()))


## <a name="eval"></a> Evaluation

One way to evaluate an LM is to plug it into a final application and measure how much the application's score improves. This is called *extrinsic* evaluation. We can also test the LM independently of any application, which is called *intrinsic* evaluation. In this course, we study the intrinsic evaluation of LMs.

To test an LM we prepare 3 datasets:

* Training is used for estimating $\boldsymbol \theta$ (we use boldface to indicate a collection of parameters).
* Development is used to make choices across models.
* Test is used for measuring the performance of the model.
   
For n-gram LMs, evaluation is defined in terms of the **likelihood** of the model with respect to the test dataset.
The likelihood of the parameters $\boldsymbol \theta$ on the test dataset is the probability that the model assigns to the dataset.

We assume the test data $\mathcal T$ consists of $m$ independent sentences, each denoted $\langle x_1^{(k)}, \ldots, x_{n_k}^{(k)} \rangle$ 

$P(\mathcal T; \boldsymbol \theta) = \prod_{k=1}^m P_S(\langle x_1^{(k)}, \ldots, x_{n_k}^{(k)} \rangle; \boldsymbol \theta)$

Or in form of the log-likelihood:

$\log P(\mathcal T; \boldsymbol \theta) = \sum_{k=1}^m \left[ \log P_N(n_k) + \log P_{S|N}(\langle x_1^{(k)}, \ldots, x_{n_k}^{(k)} \rangle|n_k; \boldsymbol \theta) \right]$

We assume the length probability to be constant, so in comparing different models that probability does not make a difference. Thus we drop it and define the log-likelihood as follows:

$\mathcal L(\boldsymbol \theta) = \sum_{k=1}^m \log P_{S|N}(\langle x_1^{(k)}, \ldots, x_{n_k}^{(k)} \rangle|n_k; \boldsymbol \theta)$


Then the model that assigns the higher $\mathcal L$ to the test set is the one that best fits the data. In other words, given two probabilistic models, the better model is the one that assigns a higher probability to the test data. One detail we need to abstract away from is that models may factorise differently, which can make their likelihoods not directly comparable; for that we will define *perplexity* below. 

The log likelihood is used because the probability of a particular sentence according to the LM can be a very small number, and the product of these small numbers can become even smaller, and it will cause numerical
precision problems. 
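The underflow problem is easy to reproduce. The snippet below uses made-up per-token probabilities (100 tokens, each with probability $10^{-5}$) purely for illustration: the product collapses to zero in floating point, while the sum of logs stays perfectly representable.

```python
import math

token_probs = [1e-5] * 100  # made-up per-token probabilities

# multiplying probabilities: 1e-500 is below the smallest positive float
product = 1.0
for p in token_probs:
    product *= p

# summing log probabilities instead
log_sum = sum(math.log(p) for p in token_probs)

print(product)  # 0.0, the product has underflowed
print(log_sum)  # about -1151.29
```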


**Perplexity** of a language model on a test set is the inverse probability of the test set, normalized
by the number of tokens. Perplexity is a notion of average branching factor, thus a LM with low perplexity can be thought of as a *less confused* LM. That is, each time it introduces a word given some history it picks from a reduced subset of the entire vocabulary (in other words, it is more certain of how to continue). 

If a dataset contains $t$ tokens where $t = \sum_{k=1}^m n_k$, then the perplexity of the dataset is

\begin{equation}
(6) \qquad \text{PP}(\mathcal T) = \left( \prod_{k=1}^m P_{S|N}(\langle x_1^{(k)}, \ldots, x_{n_k}^{(k)} \rangle|n_k; \boldsymbol \theta) \right)^{-1/t}
\end{equation}

where we have already discarded the length distribution (since it's held constant across models).

It's again convenient to use log and define log-perplexity

\begin{equation}
(7) \qquad \log \text{PP}(\mathcal T) = - \frac{1}{t} \sum_{k=1}^m \log P_{S|N}(\langle x_1^{(k)}, \ldots, x_{n_k}^{(k)} \rangle|n_k; \boldsymbol \theta) 
\end{equation}

You can compare models in terms of the log-perplexity they assign to the same test data. The lower the perplexity, the better the model is.
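As a rough illustration of Equation (7), the sketch below computes log-perplexity from any function that returns the log-probability of a sentence. The names `log_perplexity_sketch` and `uniform_log_prob` are hypothetical, the toy model is uniform over 4 symbols, and counting the EOS token towards $t$ is an assumption of this sketch (controlled by `count_eos`), not something the text above prescribes.

```python
import math


def log_perplexity_sketch(sentences, sentence_log_prob, count_eos=True):
    """Negative average log-probability per token.

    sentences: iterable of token lists
    sentence_log_prob: function mapping a token list to its log-probability
    count_eos: whether the EOS token counts towards t (an assumption of this sketch)
    """
    total_log_prob = 0.0
    total_tokens = 0
    for sentence in sentences:
        total_log_prob += sentence_log_prob(sentence)
        total_tokens += len(sentence) + (1 if count_eos else 0)
    return -total_log_prob / total_tokens


def uniform_log_prob(sentence):
    # hypothetical model: uniform over 4 symbols (3 words plus EOS)
    return (len(sentence) + 1) * math.log(1 / 4)


toy_test = [['a', 'b'], ['c']]
toy_log_ppl = log_perplexity_sketch(toy_test, uniform_log_prob)
print(toy_log_ppl, math.exp(toy_log_ppl))
```

For a uniform model the perplexity equals the number of symbols it chooses from, which matches the branching-factor intuition.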




In [11]:
# Let's quickly make a helper function to load test data
#  and segment lines into sequences of lowercased tokens
def make_test_generator(path, char_level=False):
    """Return a generator for test sentences"""
    with open(path, 'r') as fi:
        for line in fi:
            line = line.strip()
            if not line:
                continue  # we skip empty lines, as in `preprocess`
            if char_level:
                yield [ch for ch in line.lower()]
            else:
                yield line.lower().split()


<a name="ex3-6" style="color:red">**Exercise 3-6**</a> Implement the log-perplexity function below. See the function documentation for specifications.

* Two sentences test **[2 points]**: If you run the excerpt below for models trained on PTB

```python
two_sentences_data = [
    "Ms. Haag plays Elianti .".lower().split(),
    "Apparently the commission did not really believe in this ideal .".lower().split()
]

log_ppl = log_perplexity(two_sentences_data, unigram_lm)
print(log_ppl)
log_ppl = log_perplexity(two_sentences_data, bigram_lm)
print(log_ppl)
log_ppl = log_perplexity(two_sentences_data, trigram_lm)
print(log_ppl)
```

and your implementation is correct, you will get
```
7.32267906044
3.87958613355
2.15917055083
```

At this point, if you try to evaluate the perplexity of the PTB test set `sec00.raw` 
```python

print(log_perplexity(make_test_generator('sec00.raw'), unigram_lm))
print(log_perplexity(make_test_generator('sec00.raw'), bigram_lm))
print(log_perplexity(make_test_generator('sec00.raw'), trigram_lm))
```

you will get `inf` for all models. That's because you need to implement smoothing.


In [12]:
def log_perplexity(data_stream, lm):
    """
    Calculates the log-perplexity of the given text, as in Equation (7).
    This is simply the negative log-likelihood of the text divided by its number of tokens.
    
    This function can make use of `lm.order()`, `lm.get_parameter()`, and `lm.log_prob()` 

    :param data_stream: generator of sentences (each sentence is a list of words)
    :param lm: an instance of the class LM
    """
    
    
    
    pass






two_sentences_data = [
    "Ms. Haag plays Elianti .".lower().split(),
    "Apparently the commission did not really believe in this ideal .".lower().split()
]

log_ppl = log_perplexity(two_sentences_data, unigram_lm)
print(log_ppl)
log_ppl = log_perplexity(two_sentences_data, bigram_lm)
print(log_ppl)
log_ppl = log_perplexity(two_sentences_data, trigram_lm)
print(log_ppl)

## <a name="smooth"> Smoothing

Note that MLE will fail if we evaluate on sentences containing n-grams that the model has never seen during training. For example, if some bigrams of *He went to the store* are not present in the training corpus, the whole sentence is assigned a probability of *zero*.

The words we haven't seen before are called unknown words, or out of vocabulary (OOV) words.
We will now map them to a special symbol such as `<unk>`.
    
To keep the LM from assigning zero probability to these unseen events (ngrams), we’ll have to steal some of the probability mass from some more frequent events and give it to the events we've never seen.
This is called **smoothing** or **discounting**.

The simplest form of smoothing is called **Laplace smoothing**, whereby we add `<unk>` to the support of the distribution and then add one to all counts before we normalize them into probabilities. 
All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so on. 

We can also generalise it and add $\alpha$ instead of $1$. Then for $P_{X|H=h} = \text{Cat}(\theta_1^{(h)}, \ldots, \theta_v^{(h)})$ we get the MLE solution:

\begin{equation}
(7) \qquad \theta_x^{(h)} = \frac{ \text{count}(h \circ \langle x \rangle) + \alpha}{\text{count}(h) + v \alpha}
\end{equation}

Since there are $v$ words in the vocabulary and each count was incremented by $\alpha$, we also need to adjust the denominator to take into account the extra $v\alpha$ observations.
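As a small numerical illustration of the add-$\alpha$ estimate, the sketch below applies the formula to the counts of a single history. The counts are made-up values (not PTB statistics) and the variable names are chosen for this example only.

```python
# Toy counts for a single history h (made-up numbers, not PTB statistics).
counts = {'the': 3, 'store': 1, '<unk>': 0}
alpha = 1.0
v = len(counts)               # vocabulary size, including '<unk>'
total = sum(counts.values())  # count(h)

# Add-alpha estimate: (count + alpha) / (count(h) + v * alpha)
theta = {w: (c + alpha) / (total + v * alpha) for w, c in counts.items()}
print(theta)
```

The unseen outcome `<unk>` now receives probability $1/7$ instead of $0$, and the parameters still sum to one.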

<a name="ex3-7" style="color:red">**Exercise 3-7**</a> **[3 points]**

Complete the `LaplaceLM` class below. Note that it must extend `LM`. Implement the 3 modifications below in order to obtain add-$\alpha$ smoothing. 

1. **[1 point]** Modify `count_ngrams` to add `<unk>` to the support of every cpd (that is, for every possible history, including the empty history, an `<unk>` outcome with count 0 should exist).
2. **[1 point]** Modify `solve_mle` so that it adds $\alpha$ to every count before normalisation.
3. **[1 point]** Modify `get_parameter` so that it returns $1/v$ when the history is unknown, that is, when $\text{count}(h)$ is $0$.

To get all points you need to show that your code can reproduce the following result.

If your implementation is correct, for add $1$ smoothing, the following excerpt of code
```python
unigram_lm_laplace = LaplaceLM(order=0, alpha=1.)
bigram_lm_laplace = LaplaceLM(order=1, alpha=1.)

unigram_lm_laplace.count_ngrams(preprocess('sec02-21.raw'))
unigram_lm_laplace.solve_mle()
bigram_lm_laplace.count_ngrams(preprocess('sec02-21.raw'))
bigram_lm_laplace.solve_mle()


print(log_perplexity(make_test_generator('sec00.raw'), unigram_lm_laplace))
print(log_perplexity(make_test_generator('sec00.raw'), bigram_lm_laplace))
```

should produce

```python
7.06497838227
4.70167458342
```

As you can see, Laplace smoothing improved the language models by assigning a non-zero probability to sentences with unseen words and/or bigrams.



In [13]:
class LaplaceLM(LM):
    
    def __init__(self, order, alpha=1.):
        super(LaplaceLM, self).__init__(order)
        self._alpha = alpha   
        # in Laplace smoothing we always add '<unk>' to the vocabulary
        self._vocab.add('<unk>')
        
    def get_parameter(self, history, word):
        """
        This function returns the categorical parameter associated with a certain word given a certain history.
        :param history: a sequence of words (a tuple)
        :param word: a word (a str)
        :return: a float representing P(word|history)
        """
        # ***TYPE YOUR SOLUTION***
        pass

    def count_ngrams(self, data_stream):
        """
        This function should populate the attribute _count_table which should be understood as 
            - a python dict 
                - whose key is a history (a tuple of words)
                - and whose value is itself a python dict (or defaultdict)
                    - which maps a word (a string) to a count (an integer)
        
        This function will add counts to whatever counts are already stored in _count_table.
        
        :param data_stream: a generator as produced by `preprocess`
        """
        # ***TYPE YOUR SOLUTION***
        pass
                    
    def solve_mle(self):
        """
        This function should compute the attribute _prob_table which has the exact same structure as _count_table
         but stores probability values instead of counts. 
        It can be seen as the collection of cpds of our model, that is, _prob_table
            - maps a history (a tuple of words) to a dict where
                - a key is a word (that extends the history forming an ngram)
                - and the value is the probability P(word|history)                
                
        This function will replace whatever value _prob_table currently stores by the newly computed MLE solution.
        """
        # ***TYPE YOUR SOLUTION***
        pass



# <a name="inter"> Interpolation

Laplace smoothing deals with unseen words for a seen history, but it cannot deal with unseen histories.
This means that Laplace smoothing is not sufficient to avoid 0 probabilities. A simple idea is to use language model interpolation. 

We interpolate language models $\mathcal M_0, \ldots, \mathcal M_o$, where $\mathcal M_j$ is a Markov model of order $j$, to obtain an interpolated $(o+1)$-gram language model. For the interpolation we use coefficients $\lambda_0, \ldots, \lambda_o$ where

* $0 < \lambda_j \le 1$
* $\sum_{j=0}^{o} \lambda_j = 1$

The probability of a sentence $x_1^n$ under the <a name="inter-snt-prob">interpolated model</a> is

\begin{equation}
(8) \qquad P_{S|N}(x_1^n|n; \mathcal M_0, \ldots, \mathcal M_o) = \prod_{i=1}^n P_{X|H}(x_i|x_{<i}; \mathcal M_0, \ldots, \mathcal M_o)
\end{equation}

where the <a name="inter-factor">interpolated factor is </a>

\begin{equation}
(9) \qquad P_{X|H}(x_i|x_{<i}; \mathcal M_0, \ldots, \mathcal M_o) = \sum_{j=0}^{o} \lambda_j \times P_{X|H}(x_i|x_{i-j}^{i-1}; \mathcal M_j)
\end{equation}

and $ P_{X|H}(x|h; \mathcal M_j)$ is the probability of the $(j+1)$-gram suffix of $h \circ \langle x \rangle$ under a model of order $j$.
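As a rough numerical illustration of the interpolated factor, the sketch below combines three made-up probabilities for the same word, one per model order. The values in `probs` and `lambdas` are assumptions chosen for the example, not outputs of trained models.

```python
# Made-up probabilities of the same word under models of order 0, 1 and 2.
probs = [0.01, 0.10, 0.0]  # here the order-2 model assigns zero probability
lambdas = [0.5, 0.3, 0.2]  # interpolation weights, which sum to 1

# Interpolated factor: weighted sum of the per-order probabilities.
interpolated = sum(l * p for l, p in zip(lambdas, probs))
print(interpolated)
```

Even though the highest-order model assigns zero probability in this toy case, the interpolated factor stays positive because the lower-order models contribute some mass.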



For example, consider the sentence `here comes the sun`. For a $3$-gram LM (order $2$) we pad it as `BOS BOS here comes the sun EOS` and compute the interpolated factors:

\begin{align}
P(\text{here} \mid \langle \text{BOS, BOS} \rangle) &= \lambda_0 \times P(\text{here} \mid \langle \rangle; \mathcal M_0) \\
&+ \lambda_1 P(\text{here}\mid \langle \text{BOS} \rangle; \mathcal M_1) \\
&+ \lambda_2 P(\text{here} \mid \langle \text{BOS, BOS} \rangle; \mathcal M_2) \\
P(\text{comes}\mid \langle \text{BOS, here} \rangle) &= \lambda_0 \times P(\text{comes}\mid \langle \rangle; \mathcal M_0) \\
&+ \lambda_1 P(\text{comes}\mid\langle \text{here} \rangle; \mathcal M_1) \\
&+ \lambda_2 P(\text{comes}\mid \langle \text{BOS, here} \rangle; \mathcal M_2) \\
P(\text{the}\mid \langle \text{here, comes} \rangle) &= \lambda_0 \times P(\text{the}\mid\langle \rangle; \mathcal M_0) \\
&+ \lambda_1 P(\text{the}\mid \langle \text{comes} \rangle; \mathcal M_1) \\
&+ \lambda_2 P(\text{the}\mid\langle \text{here, comes} \rangle; \mathcal M_2) \\
P(\text{sun}\mid \langle \text{comes, the} \rangle) &= \lambda_0 \times P(\text{sun}\mid \langle \rangle; \mathcal M_0) \\
&+ \lambda_1 P(\text{sun}\mid \langle \text{the} \rangle; \mathcal M_1) \\
&+ \lambda_2 P(\text{sun}\mid \langle \text{comes, the} \rangle; \mathcal M_2)  \\
P(\text{EOS}\mid \langle \text{the, sun} \rangle) &= \lambda_0 \times P(\text{EOS}\mid \langle \rangle; \mathcal M_0) \\
&+ \lambda_1 P(\text{EOS}\mid \langle \text{sun} \rangle; \mathcal M_1) \\
&+ \lambda_2 P(\text{EOS}\mid \langle \text{the, sun} \rangle; \mathcal M_2) 
\end{align}

Then the probability of the sentence, conditioned on its length, under the interpolated model is

\begin{align}
P_{S|N}(\langle \text{here, comes, the, sun, EOS}\rangle|n) 
&= P(\text{here} \mid \langle \text{BOS, BOS} \rangle) \\
&\times P(\text{comes}\mid \langle \text{BOS, here} \rangle)  \\
&\times P(\text{the}\mid \langle \text{here, comes} \rangle) \\
&\times P(\text{sun}\mid \langle \text{comes, the} \rangle) \\
&\times P(\text{EOS}\mid \langle \text{the, sun} \rangle)
\end{align}
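The histories used in the factors above can be enumerated mechanically. The following sketch assumes padding with the strings `BOS` and `EOS` as in the example. The helper name `history_suffixes` is hypothetical and not part of the `LM` class.

```python
def history_suffixes(sentence, max_order):
    """Yield (word, [history for order 0, ..., history for order max_order])."""
    padded = ['BOS'] * max_order + list(sentence) + ['EOS']
    for i in range(max_order, len(padded)):
        # The order-j model conditions on the last j words before position i.
        histories = [tuple(padded[i - j:i]) for j in range(max_order + 1)]
        yield padded[i], histories

for word, histories in history_suffixes('here comes the sun'.split(), max_order=2):
    print(word, histories)
```

Each printed line pairs a word with the empty history, the one-word history, and the two-word history, matching the three terms of the corresponding interpolated factor.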

Let's try to implement it.

<a name="ex3-8" style="color:red">**Exercise 3-8**</a> **[2 points]** Complete the class below which implements an interpolated language model.

1. **[1 point]** start by completing the method `get_parameter` which computes the interpolated factor $P_{X|H}$ as shown in [Equation (9)](#inter-factor);
2. **[1 point]** then complete the method `log_prob` which should use `get_parameter` to compute the log of the interpolated probability $P_{S|N}(x_1^n|n)$ as defined in [Equation (8)](#inter-snt-prob)

If your implementation is correct you should be able to reproduce the following result

```python
lms = [
    LaplaceLM(order=0),  # unigram LM
    LaplaceLM(order=1),  # bigram LM
    LaplaceLM(order=2),  # trigram LM
    LaplaceLM(order=3)   # 4-gram LM
]
# train our models
for lm in lms:
    lm.count_ngrams(preprocess('sec02-21.raw'))
    lm.solve_mle()

print(log_perplexity(make_test_generator('sec00.raw'), InterpolatedLM(lms[0:1], [1.])))
print(log_perplexity(make_test_generator('sec00.raw'), InterpolatedLM(lms[0:2], [0.5, 0.5])))
print(log_perplexity(make_test_generator('sec00.raw'), InterpolatedLM(lms[0:3], [0.5, 0.3, 0.2])))
print(log_perplexity(make_test_generator('sec00.raw'), InterpolatedLM(lms, [0.4, 0.3, 0.15, 0.15])))
```

which should produce

```python
7.06497838227
5.03719156331
4.52821212328
4.50461015933
```

In [14]:
class InterpolatedLM(LM):
    
    def __init__(self, lms, weights):
        """
        This class interpolates language models,
            which must satisfy the conditions below.
            
        :param lms: a list of language models where lms[i] should have order i
        :param weights: a list of positive weights that should sum to 1.0        
        """
        if not lms:
            raise ValueError('I need at least 1 language model')
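        # each weight must lie in (0, 1] and the weights must sum to 1 (up to rounding error)
        if any(w <= 0 or w > 1 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
            raise ValueError('LM weights must be positive and sum to 1')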
        # Let's check that we have the LMs we need
        for i, lm in enumerate(lms):
            if lm.order() != i:
                raise ValueError('Interpolation requires the ith LM to be of order i-1')
        self._max_order = lms[-1].order()  # the maximum order
        self._lms = lms
        self._weights = weights
        
    def order(self):
        return self._max_order
    
    def print_count_table(self, output_stream=sys.stdout):
        raise NotImplementedError('You do not need to use or implement this method')
                
    def print_prob_table(self, output_stream=sys.stdout):
        raise NotImplementedError('You do not need to use or implement this method')
                
    def preprocess_history(self, history):
        raise NotImplementedError('You do not need to use or implement this method')
            
    def cpd_items(self, history):
        raise NotImplementedError('You do not need to use or implement this method')
        
    def count_ngrams(self, data_stream):
        raise NotImplementedError('You do not need to use or implement this method')
                    
    def solve_mle(self):
        raise NotImplementedError('You do not need to use or implement this method')
                
    def get_parameter(self, history, word):
        """
        This function should return the interpolated factor P(X=w|H=h) as defined in Equation (9) above.
    
        :param history: a sequence of words (a tuple)
        :param word: a word (a str)
        :return: a float representing P(word|history) in the interpolated model
        """       
        # ***TYPE YOUR SOLUTION***
        pass
    
    def log_prob(self, sentence):
        """
        Compute the log probability of a sentence under this model. 
                
        input: 
            sentence: a sequence of tokens
        output:
            log probability
        """
        # ***TYPE YOUR SOLUTION***
        pass

Note that we start with a unigram LM alone. We then interpolate a unigram LM and a bigram LM, which improves perplexity considerably. While the unigram model chooses the next word from, on average, `np.exp(7.06)` words, the interpolation of unigram and bigram models chooses from, on average, `np.exp(5.03)` words.

Further interpolating a trigram LM improves results even further. 

Curiously, further interpolating a 4-gram LM does not really help much. This again has to do with data sparsity: our training corpus is not very large and therefore most 4-grams are quite rare. If there's little overlap between training and test data in terms of 4-grams, the 4-gram terms in the interpolation will contribute very little.
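To make the comparison more tangible, the log-perplexities reported above can be converted back to perplexities. This is a minimal sketch that assumes the values are natural-log perplexities, as suggested by the use of `np.exp` in the text. The dictionary keys are labels chosen for this example.

```python
import math

# Log-perplexities reported above for the PTB test set `sec00.raw`.
log_ppl = {
    'unigram': 7.06497838227,
    'unigram + bigram': 5.03719156331,
    'unigram + bigram + trigram': 4.52821212328,
    'unigram + bigram + trigram + 4-gram': 4.50461015933,
}

# Perplexity is the exponential of the (natural-log) log-perplexity.
for name, value in log_ppl.items():
    print(f'{name}: {math.exp(value):.1f}')
```

The gap between the last two rows is small compared to the earlier improvements, which is the effect discussed above.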

<a name="ex3-9" style="color:red">**Exercise 3-9**</a> **[3 points]** 

* **[1 point]** Train two interpolated models: one on PTB data `sec02-21.raw`, another on Beatles lyrics `beatles.txt` (this file includes neither the song Eleanor Rigby nor Ask Me Why).
    * Interpolate models of order 0, 1, and 2
    * Use Laplace models with alpha 1.0
    * Use interpolation weights [0.1, 0.3, 0.6]


* **[1 point]** Compare the log-perplexity each model assigns to 
    * `sec00.raw` 
    * `eleanor-rigby.txt`
    * `ask-me-why.txt`


* **[1 point]** Explain the results you obtain. In particular, try to explain why the PTB model performs quite differently when tested on `Ask Me Why` or `Eleanor Rigby`.


