In [4]:
%load_ext autoreload
%autoreload 2
# %cd ..
import sys
sys.path.append("..")



<!---
Latex Macros
-->
$$
\newcommand{\prob}{p}
\newcommand{\vocab}{V}
$$

# Language Models
Language models (LMs) calculate the probability of seeing a given sequence of words, as defined through a [tokenization](todo) algorithm, in a given language or sub-language/domain/genre. For example, an English language model may assign a higher probability to the sequence "How are you?" than to "Wassup' dawg?", whereas for a hip-hop language model this ordering may be reversed. <span class="summary">Language models (LMs) calculate the probability of seeing a given sequence of words.</span>

There are several use cases for such models: 

* To filter out bad translations in machine translation.
* To rank speech recognition output. 
* In concept-to-text generation.

More formally, a language model is a stochastic process that models the probability \\(\prob(w_1,\ldots,w_d)\\) of observing sequences of words \\(w_1,\ldots,w_d\\). We can, without loss of generality, decompose the probability of such sequences into<span class="summary">Without loss of generality</span>  

$$
\prob(w_1,\ldots,w_d) = \prob(w_1) \prod_{i = 2}^d \prob(w_i|w_1,\ldots,w_{i-1}).
$$

This means that a language model can be defined by how it models the conditional probability $\prob(w_i|w_1,\ldots,w_{i-1})$ of seeing a word \\(w_i\\) after having seen the *history* of previous words $w_1,\ldots,w_{i-1}$. We also have to model the prior probability $\prob(w_1)$, but it is easy to reduce this prior to a conditional probability as well.
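As a minimal sketch of this decomposition, the snippet below multiplies conditional probabilities along a sequence. The table `toy_conditionals`, the padding symbol `START`, and the function `sequence_probability` are hypothetical names introduced only for illustration, and the probabilities are made up rather than estimated from a corpus. Padding every sequence with a start symbol is one common way to turn the prior $\prob(w_1)$ into a conditional probability.

```python
# Assumed padding symbol: p(w_1) is modelled as p(w_1 | START).
START = "<s>"

# Hypothetical toy table of p(word | history); the numbers are made up.
toy_conditionals = {
    (START,): {"how": 0.5, "are": 0.25, "you": 0.25},
    (START, "how"): {"how": 0.125, "are": 0.75, "you": 0.125},
    (START, "how", "are"): {"how": 0.125, "are": 0.125, "you": 0.75},
}

def sequence_probability(words, conditionals):
    """Chain rule: multiply p(w_i | w_1, ..., w_{i-1}) over all positions i."""
    history = (START,)
    prob = 1.0
    for word in words:
        prob *= conditionals[history][word]
        history = history + (word,)  # the sampled word extends the next history
    return prob

p = sequence_probability(["how", "are", "you"], toy_conditionals)
print(p)  # 0.5 * 0.75 * 0.75 = 0.28125
```

Note that the table needs one entry per distinct history, which is exactly the sparsity problem that motivates grouping histories.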

In practice it is common to define language models based on *equivalence classes* of histories instead of having different conditional distributions for each possible history. This overcomes sparsity and efficiency problems when working with full histories.

## N-gram Language Models

The most common type of equivalence class relies on *truncating* histories $w_1,\ldots,w_{i-1}$ to length $n-1$:
$$
\prob(w_i|w_1,\ldots,w_{i-1}) = \prob(w_i|w_{i-n+1},\ldots,w_{i-1}).
$$

That is, the probability of a word depends only on the previous $n-1$ words. We will refer to such a model as an *n-gram language model*.
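The truncation step can be sketched as follows. The helper `truncate_history` is a hypothetical name used only for illustration; it is not part of the book's code. It keeps the last $n-1$ words of a history, so that histories sharing those words fall into the same equivalence class.

```python
def truncate_history(history, n):
    """Keep only the last n-1 words of a history (illustrative helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return ()  # a unigram model conditions on nothing
    return tuple(history[-(n - 1):])

# Two different full histories that end in the same two words ...
h1 = ["it", "s", "been", "so"]
h2 = ["i", "have", "been", "so"]

# ... map to the same truncated history under a 3-gram model.
print(truncate_history(h1, 3))  # ('been', 'so')
print(truncate_history(h2, 3))  # ('been', 'so')
print(truncate_history(h1, 1))  # ()
```

Histories shorter than $n-1$ words are returned unchanged, so the sketch also works at the beginning of a sequence.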

## A Uniform Baseline LM

*Unigram* models are the simplest n-gram language models, with $n=1$. That is, they model the conditional probability of a word using the prior probability of seeing that word:
$$
\prob(w_i|w_1,\ldots,w_{i-1}) = \prob(w_i).
$$

To set up datasets, and as a baseline for more complex language models, we first introduce the simplest instantiation of a unigram model: a *uniform* language model, which assigns the same prior probability to each word. That is, given a *vocabulary* of words \\(\vocab\\), the uniform LM is defined as:

$$
\prob(w_i|w_1,\ldots,w_{i-1}) = \frac{1}{|\vocab|}.
$$
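As a small worked example with made-up numbers, assume a toy vocabulary of $|\vocab| = 4$ words. Every word then has probability $1/4$ regardless of its history, so any sequence of three in-vocabulary words receives the same probability:

$$
\prob(w_1,w_2,w_3) = \prod_{i=1}^{3} \frac{1}{|\vocab|} = \left(\frac{1}{4}\right)^{3} = \frac{1}{64}.
$$

More generally, under the uniform LM the probability of a sequence depends only on its length $d$ and equals $|\vocab|^{-d}$.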

Let us "train" and test such a language model on the OHHLA corpus. First we need to load this corpus. Below we focus on a subset to make our code more responsive and to allow us to test models more quickly. Check the [loading from OHHLA](load_ohhla.ipynb) notebook to see how `load_albums` and `words` are defined. 

In [26]:
import statnlpbook.util as util
util.execute_notebook('load_ohhla.ipynb')
docs = load_albums(j_live)
trainDocs, testDocs = docs[:len(docs)//2], docs[len(docs)//2:] 
train = words(trainDocs)
test = words(testDocs)
" ".join(train[0:35])

'[BAR] Can t even call this a blues song [/BAR] [BAR] It s been so long [/BAR] [BAR] Neither one of us was wrong or anything like that [/BAR] [BAR] It seems like yesterday [/BAR]'

We can now create a uniform language model. Language models in this book implement the `LanguageModel` [abstract base class](https://docs.python.org/3/library/abc.html). 

In [None]:
import abc 
class LanguageModel(metaclass=abc.ABCMeta):
    """
    Args:
        vocab: the vocabulary underlying this language model. Should be a set of words.
        order: the n-gram order n, i.e. the history length plus one.
    """
    def __init__(self, vocab, order):
        self.vocab = vocab
        self.order = order
        
    @abc.abstractmethod
    def probability(self, word, *history):
        """
        Args:
            word: the word we need the probability of
            history: words to condition on.
        Returns:
            the probability p(w|history)
        """
        pass

The most important method we have to provide is `probability(word, *history)`, which returns the probability of a word given a history. Let us implement a uniform LM using this class.

In [34]:
class UniformLM(LanguageModel):
    def __init__(self, vocab):
        super().__init__(vocab, 1)
    def probability(self, word, *history):
        # Look the word up in the model's own vocabulary, not the global `vocab`.
        return 1.0 / len(self.vocab) if word in self.vocab else 0.0
    
vocab = set(train)
baseline = UniformLM(vocab)
baseline.probability("call")

0.0003933910306845004

## Sampling
It is instructive and easy to sample language from a language model. In many, but not all, cases, the more natural the language generated by an LM looks, the better the LM is.

To sample from an LM, one iteratively samples from the LM's conditional distribution over words and adds each newly sampled word to the next history. The only challenge in implementing this is to sample from a categorical distribution over words. Here we provide this functionality via `sample_categorical`