# Language Model Exercises


In this exercise, you will build a system for automatically generating sentences using an n-gram language model.

You will:
- Build a unigram language model 
- Evaluate your unigram language model
- Build a bigram language model and compare it to the unigram model (generation and perplexity)


Key concepts:

* n-gram (unigram, bigram, trigram, etc.)
* n-gram history
* n-gram probability 
* intrinsic LM evaluation - perplexity
* OOV words and how to handle them

<!---
Latex Macros
-->
$$
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\weights}{\mathbf{w}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\bar}{\,|\,}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\Pulp}{\text{Pulp}}
\newcommand{\Fiction}{\text{Fiction}}
\newcommand{\PulpFiction}{\text{Pulp Fiction}}
\newcommand{\pnb}{\prob^{\text{NB}}}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

## <font color='green'>Step 0</font>: Setup

To work on this assignment, you will need at least [Python 3.6](https://www.python.org/downloads/) and the following libraries. Most, if not all, of these are part of [Anaconda](https://www.continuum.io/downloads), so a good starting point is to install that on your laptop. For this exercise you can use your laptop.

- [jupyter](http://jupyter.readthedocs.org/en/latest/install.html)
- numpy 
- [nosetests](https://nose.readthedocs.org/en/latest/) which is a library for unit testing 
- [pandas](http://pandas.pydata.org/) Dataframes

Here is some help on installing packages in Python: https://packaging.python.org/installing/. You can use ```pip install --user``` to install locally without sudo.

## <font color='green'>Step 1</font>: Load Libraries

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import sys
import nose
import numpy as np
import pandas as pd


from importlib import reload

print('My Python version')

print('python: {}'.format(sys.version))

My Python version
python: 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]


- Most of your coding will be in the python source files in the directory ```snlp```.
- The directory ```tests``` contains unit tests that you can use to check parts of your assignment, using ```nosetests```. You should run them as you work on the assignment to see that you're on the right track. You are free to look at their source code, if that helps -- though most of the relevant code is also here in this notebook. Learn more about running unit tests at http://pythontesting.net/framework/nose/nose-introduction/
- You may want to add more tests, but that is completely optional. 

In [2]:
print('My library versions')

print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('nose: {}'.format(nose.__version__))


My library versions
pandas: 1.0.0
numpy: 1.18.1
nose: 1.3.7


Run the following command on the command line, to test whether your libraries are set up correctly:

`nosetests tests/test_environment.py`

### Running command-line UNIX commands in the notebook

You can prefix a cell in a notebook with `!` to tell the notebook to run what follows as shell command. For example:


In [None]:
! nosetests tests/test_environment.py

If you set up your system successfully, you should see output like the following:

```
nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
test_environment.test_library_versions ... ok

----------------------------------------------------------------------
Ran 1 test in 0.001s

OK
```


##  <font color='blue'>Task 1</font>: Load the Data, Preprocessing (Tokenization)

We will first read the data, a tiny toy dataset. Here are two ways to do that:

In [3]:
# using Pandas to read the csv
df_train = pd.read_csv('data/corpus.csv')
df_train.head()


Unnamed: 0,sentences
0,the dog barks
1,this is a test sentence .
2,another test sentence
3,wow this is a nice idea
4,this is a dog


In [None]:
# using a unix command
! cat data/corpus.csv

In [None]:
# what does the following command do?
! cat data/corpus.csv | wc -l

In [None]:
assert len(df_train) == 7

## Simple tokenizer
Your first task is to convert the text into the representation that is used internally. For this data, most of the preprocessing is already done: the text is lower-cased, and punctuation is removed. You only need to create a `list` of words for each instance. Words are separated by single spaces.

This first part also familiarizes you with the structure of the assignment: modifying the code in the `snlp` directory, loading it in the Jupyter notebook, and testing your solution with `nosetests`. If you run the code below before implementing the function, you will get a `NotImplementedError`.

- **Deliverable 1.1**: Complete the function `preproc.space_tokenizer`. 
- **Test**: `nosetests tests/test_preproc.py:test_space_tok`
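
As a rough illustration of what splitting on spaces produces, here is a minimal sketch. The helper `toy_space_tokenize` is a hypothetical name used only for this example; it is not the reference solution, and the unit test may check details that this sketch does not cover.

```python
# Illustration only: `toy_space_tokenize` is a hypothetical helper,
# not the reference implementation of `preproc.space_tokenizer`.
def toy_space_tokenize(text):
    """Split an already pre-processed sentence into a list of words on spaces."""
    return text.split(" ")

tokens = toy_space_tokenize("the dog barks")
print(tokens)  # ['the', 'dog', 'barks']
```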

In [None]:
from snlp import preproc

In [None]:
# run this block to update the notebook as you change the preproc library
reload(preproc);

In [None]:
x_train = preproc.read_data('data/corpus.csv',preprocessor=preproc.space_tokenizer)

In [None]:
# use ! to run shell commands in notebook
! nosetests tests/test_preproc.py:test_space_tok

## Create the vocabulary
Your second task is to extract the vocabulary from the tokenized text.

- **Deliverable 1.2**: Complete the function `preproc.create_vocab`. 
- **Test**: `nosetests tests/test_preproc.py:test_create_vocab`

Hint: It might be helpful to check out the test code to get a feeling of what you are asked to implement.
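
To make the idea of a vocabulary concrete, the sketch below collects the distinct tokens of a made-up, already tokenized corpus. The helper `toy_vocab` is hypothetical; the tests for `preproc.create_vocab` may expect a different return type or additional symbols, so treat this as an illustration under those assumptions.

```python
# Illustration only: `toy_vocab` is a hypothetical helper; the tests for
# `preproc.create_vocab` may expect something different.
def toy_vocab(docs):
    """Collect the set of distinct tokens over a list of tokenized sentences."""
    return {word for sentence in docs for word in sentence}

toy_docs = [["the", "dog", "barks"], ["this", "is", "a", "dog"]]
vocab_example = toy_vocab(toy_docs)
print(sorted(vocab_example))  # ['a', 'barks', 'dog', 'is', 'the', 'this']
```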

In [None]:
reload(preproc);
# use ! to run shell commands in notebook
! nosetests tests/test_preproc.py:test_create_vocab

In [None]:
# manually inspecting the vocabulary
print(preproc.create_vocab(x_train))

## <font color='blue'>Task 2</font>: Create a uniform n-gram LM (unigram) and evaluate it



We can now create a uniform language model. Before solving this task, make sure you have read all up to Section 3.1 of [J&M Chapter 3](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

Language models in this exercise implement the `LanguageModel` [abstract base class](https://docs.python.org/3/library/abc.html). See the definition of `class LanguageModel(metaclass=abc.ABCMeta)` in `snlp/lm.py`, copied below for convenience.

In [None]:
from snlp import lm
import abc

class LanguageModel(metaclass=abc.ABCMeta):
    """
    Abstract class for a language model
    Args:
        vocab: the vocabulary underlying this language model. Should be a set of words.
        order: the n-gram order, i.e. history length + 1.
    """

    def __init__(self, vocab, order):
        ## start and end symbols
        self.vocab = vocab
        self.order = order
        self.START_SYMBOL = "<START>"
        self.STOP_SYMBOL = "<STOP>"

    @abc.abstractmethod
    def probability(self, word, *history):
        """
        Args:
            word: the word we need the probability of
            history: words to condition on.

        Returns:
            the probability p(w|history)
        """
        pass

The most important method we have to provide is `probability(word, *history)`, which returns the probability of a word given a history. Let us implement a uniform LM using this class. For a uniform LM, the history is simply empty: `()`.
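
The following sketch shows how the `*history` argument behaves when `probability` is called. It is self-contained, so it defines its own stand-in base class `ToyLanguageModel` and a hypothetical `TableLM` that looks up fixed probabilities; neither name exists in `snlp/lm.py`.

```python
import abc

class ToyLanguageModel(metaclass=abc.ABCMeta):
    """Stand-in for `LanguageModel` so that this sketch runs on its own."""

    def __init__(self, vocab, order):
        self.vocab = vocab
        self.order = order

    @abc.abstractmethod
    def probability(self, word, *history):
        pass

class TableLM(ToyLanguageModel):
    """Hypothetical model: fixed probabilities from a table, history ignored."""

    def __init__(self, table):
        super().__init__(vocab=set(table), order=1)
        self.table = table

    def probability(self, word, *history):
        # `history` arrives as a (possibly empty) tuple of preceding words.
        return self.table.get(word, 0.0)

toy_lm = TableLM({"the": 0.5, "dog": 0.3, "barks": 0.2})
print(toy_lm.probability("dog"))         # empty history: history == ()
print(toy_lm.probability("dog", "the"))  # one-word history: history == ("the",)
```

Because the history is collected with `*history`, callers pass the preceding words as separate positional arguments, or unpack a list with `lm.probability(word, *some_list)`.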

In [None]:
reload(lm);

### Create a uniform LM

- **Deliverable 2.1**: Complete the function `probability` in `lm.UniformLM`. It should return the probability of a word under a (stupid) uniform language model.
- **Test**: `nosetests tests/test_lms.py:test_uniform`

In [None]:
reload(lm);
! nosetests tests/test_lms.py:test_uniform

#### Sampling from the Uniform LM


Now that you have completed the code for the first and simplest LM, let's use it and sample from it to create example sentences and get an intuitive feeling on how good it is.


- **Deliverable 2.2** (in the notebook here):  Create a `uniformLM` object and inspect it. How big is the vocabulary? What is the probability of the word `the`? What is the probability of the word `dog`?

In [None]:
## TODO:
from snlp import lm
x_train = preproc.read_data('data/corpus.csv',preprocessor=preproc.space_tokenizer)

# instantiate a uniform LM
vocab = None      # TODO: create the vocabulary from x_train
uniformLM = None  # TODO: create the uniform LM from the vocabulary

# get size of vocab and probabilities of words


- **Deliverable 2.3** (in the notebook here): Use the uniform language model we just created and sample from it. What is an issue with this language model? Discuss disadvantages of a uniform LM.


In [None]:
## Now sample a sentence from the uniform LM
sentence_len = 10

' '.join(sample(uniformLM, ["the"],sentence_len))
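
The `sample` function called above is not defined in the cells shown here; it presumably comes with the course code. As a rough sketch of what such a sampler could look like, the example below extends an initial word list by repeatedly drawing the next word according to `lm.probability`. The names `toy_sample` and `ToyUniformLM`, the `seed` parameter, and the convention that `amount` counts newly generated words are all assumptions made for this illustration.

```python
import random

class ToyUniformLM:
    """Hypothetical stand-in exposing `vocab`, `order` and `probability`."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.order = 1

    def probability(self, word, *history):
        return 1.0 / len(self.vocab) if word in self.vocab else 0.0

def toy_sample(lm, init, amount, seed=None):
    """Append `amount` words to `init`, each drawn from lm.probability given the history."""
    rng = random.Random(seed)
    words = sorted(lm.vocab)  # fixed order so that a seed gives reproducible output
    result = list(init)
    for _ in range(amount):
        # The last (order - 1) words form the history; it is empty for a unigram model.
        history = result[-(lm.order - 1):] if lm.order > 1 else []
        weights = [lm.probability(w, *history) for w in words]
        result.append(rng.choices(words, weights=weights, k=1)[0])
    return result

toy_lm = ToyUniformLM({"the", "dog", "barks", "a", "test"})
print(" ".join(toy_sample(toy_lm, ["the"], 10, seed=0)))
```

Under a uniform model every word is equally likely at every position, so the generated sequence carries no information about word frequency or word order.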

## Evaluation

Read [Section 3.2. of J&M](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

How do we determine the quality of an (n-gram) LM? 

One way is through *extrinsic* evaluation: assess how much the LM improves performance on *downstream tasks* such as machine translation or speech recognition. Arguably this is the most important measure of LM quality, but it can be costly as re-training such systems may take days, and when we seek to develop general-purpose LMs we may have to evaluate performance on several tasks. This is problematic when one wants to iteratively improve LMs and test new models and parameters. It is hence useful to find *intrinsic* means of evaluation that assess the stand-alone quality of LMs with minimal overhead.

An intrinsic way to measure the quality of a language model is to train it on the training corpus and then evaluate it on a held-out set. But what performance measure should we use to decide which n-gram language model is better?

"The answer is simple: whichever
model assigns a higher probability to the test set—meaning it more accurately
predicts the test set—is a better model. Given two probabilistic models, the better
model is the one that has a tighter fit to the test data or that better predicts the details
of the test data, and hence will assign a higher probability to the test data." (J&M)

In practice we don't use raw probability as our metric for evaluating language models, but a variant called **perplexity**. The perplexity (sometimes called PP for short) of a language model on a held-out set is the inverse probability of the held-out set, normalized by the number of words. In more detail:

Given a test sequence \\(w_1,\ldots,w_T\\) of \\(T\\) words, we calculate the perplexity \\(\perplexity\\) as follows:

$$
\perplexity(w_1,\ldots,w_T) = \prob(w_1,\ldots,w_T)^{-\frac{1}{T}} = \sqrt[T]{\prod_{i=1}^T \frac{1}{\prob(w_i|w_{i-n+1},\ldots,w_{i-1})}}
$$
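
Equivalently, writing \\(h_i\\) for the history that \\(w_i\\) is conditioned on, the product can be rewritten as a sum of log-probabilities. This form is usually preferred in code, because multiplying many small probabilities quickly underflows:

$$
\perplexity(w_1,\ldots,w_T) = \left(\prod_{i=1}^T \prob(w_i|h_i)\right)^{-\frac{1}{T}} = \exp\left(-\frac{1}{T}\sum_{i=1}^T \log \prob(w_i|h_i)\right)
$$

The second equality follows from \\(x = \exp(\log x)\\) and \\(\log \prod_i x_i = \sum_i \log x_i\\).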

We can implement a perplexity function based on the `LanguageModel` interface. 

In [None]:
import math
def perplexity(lm, data):
    """
    Calculate the perplexity of the language model given the provided data.
    Args:
        lm: a language model.
        data: the data to calculate perplexity on.

    Returns:
        the perplexity of `lm` on `data`.

    """
    log_prob = 0.0
    history_order = lm.order - 1
    
    # flatten data
    sdata = [word for sentence in data for word in  sentence] 
    for i in range(history_order, len(sdata)):
        history = sdata[i - history_order : i]
        word = sdata[i]
        p = lm.probability(word, *history)
        log_prob += math.log(p) if p > 0.0 else float("-inf")
    return math.exp(-log_prob / (len(sdata) - history_order))

Let's see how the uniform model does on our held-out dataset. 

- **Deliverable 2.4** (in the notebook here): Calculate the perplexity of the uniform LM on the dev data. First, inspect the `corpus_dev.txt` file and manually calculate the perplexity. Explain the perplexity that you get. Then, use the function above and evaluate perplexity  on the data file `corpus_dev.txt`. Does it match your expectation? 

In [None]:
### read dev data and calculate uniformLM perplexity; explain what you get



## <font color='blue'>Task 3</font>: Create a unigram n-gram LM (non-uniform)

Admittedly, the uniform language model is not very useful. Hence we now improve it by estimating the actual probabilities of the words in the corpus using maximum likelihood estimates. To do so, we extend the `CountLM` class (copied for convenience below) and create a `UnigramLM` class.

- **Deliverable 3.1**: Complete the function `__init__` in `lm.UnigramLM` (in the `snlp/lm.py` file). It goes over each sentence in the training corpus and counts how often each unigram occurs. *Note*: Since we will soon generalize this class, we store the count of a word under a tuple key consisting of the word followed by its history. For a unigram the history is the empty tuple `()`, so the count of a word `w` is stored in the default dictionary under the key `(w,)`.
- **Test**: `nosetests tests/test_lms.py:test_unigram`
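
The tuple key convention can be illustrated on a made-up corpus. This is only a sketch of the bookkeeping: the names `toy_counts` and `toy_norm` are hypothetical, and the real `UnigramLM` may differ, for example in how it treats start and stop symbols.

```python
from collections import defaultdict

# Made-up corpus, not data/corpus.csv.
toy_docs = [["a", "b", "a"], ["b", "a"]]

toy_counts = defaultdict(float)  # maps (word,) + history to a count
toy_norm = 0.0                   # total number of unigram tokens seen

for sentence in toy_docs:
    for word in sentence:
        toy_counts[(word,)] += 1.0  # key is (word,) because the history is ()
        toy_norm += 1.0

# Maximum likelihood estimate of p("a"): 3 occurrences out of 5 tokens.
print(toy_counts[("a",)] / toy_norm)  # 0.6
```

With this layout, a bigram model can later reuse the same dictionary shape by storing keys of the form `(word, previous_word)`, which matches the `(word,) + sub_history` lookup in `CountLM.probability`.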

In [None]:
reload(lm);

In [None]:
class CountLM(LanguageModel):
    """
    A Language Model that uses counts of events 
    and histories to calculate probabilities of words in context.
    """

    @abc.abstractmethod
    def counts(self, word_and_history):
        pass

    @abc.abstractmethod
    def norm(self, history):
        pass

    def probability(self, word, *history):
        if word not in self.vocab:
            return 0.0
        sub_history = tuple(history[-(self.order - 1):]) if self.order > 1 else ()
        norm = self.norm(sub_history)
        if norm == 0:
            return 1.0 / len(self.vocab)
        else:
            return self.counts((word,) + sub_history) / self.norm(sub_history)
     


In [None]:
reload(lm);
! nosetests tests/test_lms.py:test_unigram

- **Deliverable 3.2** (in the notebook): Sample a sentence from the unigram LM. 

- **Deliverable 3.3** (in the notebook): Calculate the perplexity of the dev data `corpus_dev.csv` under both the uniform and the unigram LM. Which perplexity should be lower?
- **Test**: `nosetests tests/test_lms.py:test_ppl`

In [None]:
## calculate perplexity of both LMs (uniform and unigram) on the dev data


In [None]:
! nosetests tests/test_lms.py:test_ppl

## <font color='blue'>Task 4</font>: Create a bigram n-gram language model


The language models so far ignore the fact that certain words are more likely to follow certain other words: the conditioning history has always been empty.


- **Deliverable 4.1**  Implement a bigram LM. Before doing so, manually estimate the probability of all words that can follow `is` according to the corpus in `data/corpus.csv`. Then instantiate the bigram LM you just implemented as `bigramLM` and use it to check your calculations. Finally, generate (sample) sentences from the bigram LM. Did the LM improve? How can you tell that this LM is better than the one before?
- **Test**: `nosetests tests/test_lms.py:test_bigram`
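
The manual estimate asked for above is a maximum likelihood estimate: the count of the bigram divided by the count of the history word. The sketch below works this out on a made-up corpus that is deliberately different from `data/corpus.csv`. The helper `toy_bigram_prob` is hypothetical, and for simplicity the sketch ignores start and stop symbols, which your implementation may need to handle.

```python
from collections import Counter

# Made-up corpus, deliberately different from data/corpus.csv.
toy_docs = [
    ["it", "was", "cold"],
    ["it", "was", "late"],
    ["it", "was", "cold"],
    ["was", "it"],
]

bigram_counts = Counter()   # keyed by (word, previous_word)
history_counts = Counter()  # how often a word occurs as a history

for sentence in toy_docs:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(word, prev)] += 1
        history_counts[prev] += 1

def toy_bigram_prob(word, prev):
    """MLE estimate: count(prev followed by word) / count(prev as a history)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(word, prev)] / history_counts[prev]

for w in ["cold", "late", "it"]:
    print(w, toy_bigram_prob(w, "was"))
```

Here `was` occurs four times as a history, followed twice by `cold` and once each by `late` and `it`, so the three estimates are 0.5, 0.25 and 0.25 and sum to one.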

In [None]:
!cat data/corpus.csv

In [None]:
# TODO


In [None]:
reload(lm);
! nosetests tests/test_lms.py:test_bigram

In [None]:
' '.join(sample(bigramLM, ["<START>"],sentence_len))

## <font color='blue'>Task 5</font>: Out-of-Vocabulary Words - Theoretical question
A problem with the models above is that they assign zero probability to words that are not in the vocabulary. Test sets will usually contain such words, and this leads to infinite perplexity. For example, a word such as `blue` that does not appear in the training set vocabulary `vocab` receives probability 0, in contrast to an in-vocabulary word such as `the`.

In [None]:
vocab = preproc.create_vocab(x_train)
unigramLM = lm.UnigramLM(vocab, x_train)
print(unigramLM.vocab)

print(unigramLM.probability("the"))
print(unigramLM.probability("blue"))
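
To see why a single zero-probability word is enough to make perplexity infinite, consider the following sketch. The probability lists are made up, and `toy_perplexity` is a hypothetical helper that applies the same log-space computation as the `perplexity` function to a plain list of per-word probabilities.

```python
import math

# Made-up per-word probabilities for a 4-word test sequence.
probs_all_known = [0.2, 0.1, 0.2, 0.1]
probs_with_oov = [0.2, 0.1, 0.0, 0.1]  # the third word is out of vocabulary

def toy_perplexity(probs):
    """Perplexity from a list of per-word probabilities, computed in log space."""
    log_prob = sum(math.log(p) if p > 0.0 else float("-inf") for p in probs)
    return math.exp(-log_prob / len(probs))

print(toy_perplexity(probs_all_known))  # finite
print(toy_perplexity(probs_with_oov))   # inf
```

The zero contributes a log-probability of negative infinity, and no amount of probability mass on the other words can compensate for it.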

- **Deliverable 5.1**  What are possible ways to deal with the problem of OOVs?


##  <font color='blue'>Task 6</font>: Language Model - Pen & Paper

- **Deliverable 6.1** Solve Exercise 3.4 from J&M, copied below for convenience.

```
We are given the following corpus, modified from the one in the chapter:

<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>

Using a bigram language model with add-one smoothing, what is
P(Sam | am)? Include <s> and </s> in your counts just like any other token.
```
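
As a reminder, and without working out the answer, the add-one (Laplace) smoothed bigram estimate adds one to every bigram count and compensates in the denominator with the vocabulary size. Here \\(c(\cdot)\\) denotes a count in the corpus and \\(|\vocab|\\) is the number of distinct word types, which for this exercise includes `<s>` and `</s>`:

$$
\prob_{\text{add-1}}(w_i|w_{i-1}) = \frac{c(w_{i-1}\, w_i) + 1}{c(w_{i-1}) + |\vocab|}
$$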
