<div class="alert alert-danger">
**Due date:** 2017-02-03
</div>

# Lab 2: Language Modelling

**Students:** Johan Lindström (johli160), Jonathan Sjölund (jonsj507)

## Introduction

In this lab you will experiment with $n$-gram models. You will test various parameters that influence these models&rsquo; quality and practice estimating models with additive smoothing.

The following lines of code import the Python modules needed for this lab:

In [1]:
import nlp2
import ngrams

The data for this lab consists of Arthur Conan Doyle&rsquo;s novels about Sherlock Holmes: *The Adventures of Sherlock Holmes*, *The Memoirs of Sherlock Holmes*, *The Return of Sherlock Holmes*, *His Last Bow* and *The Case-Book of Sherlock Holmes*. The next piece of code loads the first three of these as training data:

In [2]:
training_data = nlp2.read_data("/home/TDDE09/labs/nlp2/data/advs.txt",
                               "/home/TDDE09/labs/nlp2/data/mems.txt",
                               "/home/TDDE09/labs/nlp2/data/retn.txt")

The data is represented as a list of sentences, where one sentence is represented as a list of tokens (strings). The next line prints the 101st sentence:

In [3]:
print(training_data[100])

["'", 'Let', 'us', 'glance', 'at', 'our', 'Continental', 'Gazetteer', '.']


## Relation between a model’s quality and its order

In the first part of this lab you will examine the relation between an $n$-gram model’s quality and its **order**, i.e. the value of&nbsp;$n$. You will do both a qualitative and quantitative evaluation with the help of the entropy measure.

### Qualitative evaluation

The following line trains an $n$-gram model of the class `ngrams.Model` on the training data. The second argument is the model’s order; here it is set to 10.

In [4]:
model = nlp2.train(ngrams.Model, 10, training_data)

With this model you are able to generate random sentences. Every time you run the following code cell a new sentence is generated.

In [5]:
print(" ".join(model.generate()))

" " It is a privilege to be associated with you in the handling of a case , " said the inspector , warmly .


Look at the sentences. Do they sound natural?

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Train a unigram-, bigram-, trigram-, and quadrigram-model, and generate random sentences with each. How does the quality of the sentences change with the model’s order? Explain your observations using your understanding of how an $n$-gram model works. Use some generated sentences to illustrate your discussion. What would the sentences look like for higher values of $n$, such as $n=10$?
</div>
</div>

In [6]:
# TODO: Insert your code here
model1 = nlp2.train(ngrams.Model, 1, training_data)
model2 = nlp2.train(ngrams.Model, 2, training_data)
model3 = nlp2.train(ngrams.Model, 3, training_data)
model4 = nlp2.train(ngrams.Model, 4, training_data)
model10 = nlp2.train(ngrams.Model, 10, training_data)

The quality of the sentences increases sharply from the unigram to the quadrigram model. From quadrigram to 10-gram the change is smaller, but the 10-gram sentence is still clearly the best. Here is one generated sentence per model:

Unigram: problem them invisible better She all did ; must between a , gave as kind-hearted to have

Bigram: He had a day , ' " Pray tell us as usual size , and to the thought , strong , that I am a very old man , Watson , as silent course the form of the professional commission of such a boot-lace .

Trigram: " " You'll never persuade me to hear the tones that he had described .

Quadrigram: You made some small job in my lady's room -- you and your confederate Cusack -- and you managed that he should retain the young Swiss messenger with him as guide and companion while I returned to Baker Street , and remember the advice which I have for this fellow .

10-gram: The tip had been cut off , not bitten off , but the cut was not a clean one , so I deduced a blunt pen-knife .

The unigram sentence is completely incoherent. The bigram sentence can at least be read, but it is still incoherent as a whole. The trigram sentence is mostly understandable: a drastic change from the unigram and clearly better than the bigram. The quadrigram sentence is longer, so it is easier to see that its coherence is still a bit off and that a couple of word choices are odd. The step from trigram to quadrigram is smaller than the steps from unigram to bigram and from bigram to trigram. The 10-gram sentence is clearly the best, because each word is chosen given the 9 previous words, which for many sentences covers almost the whole sentence. The reason for these changes is that the number of possible continuations for a given context decreases as $n$ increases. For large $n$ a context typically occurs only once in the training data, so the model essentially reproduces stretches of the training text and the generated sentence is very likely to be coherent.
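The effect of the order on the number of possible continuations can be illustrated with a small stand-alone sketch. It does not use the lab's `ngrams.Model`; the toy corpus, the boundary markers and the helper names `count_ngrams` and `generate` are assumptions made only for this illustration.

```python
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # hypothetical boundary markers, not taken from the lab code


def count_ngrams(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        padded = [BOS] * (n - 1) + list(sentence) + [EOS]
        for i in range(n - 1, len(padded)):
            ctxt = tuple(padded[i - n + 1:i])
            table[ctxt][padded[i]] += 1
    return table


def generate(table, n, rng):
    """Sample one sentence word by word until the end marker is drawn."""
    ctxt = (BOS,) * (n - 1)
    words = []
    while True:
        followers = table[ctxt]
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == EOS:
            return words
        words.append(word)
        # Slide the context window; a unigram model has an empty context.
        ctxt = (ctxt + (word,))[1:] if n > 1 else ()


# Toy corpus (an assumption for illustration only).
toy = [["I", "saw", "the", "dog"],
       ["you", "saw", "the", "cat"],
       ["I", "heard", "the", "cat"]]

bigram_table = count_ngrams(toy, 2)
trigram_table = count_ngrams(toy, 3)
fourgram_table = count_ngrams(toy, 4)

# After "the" a bigram model can choose between two words ...
print(dict(bigram_table[("the",)]))
# ... but after "heard the" a trigram model has only one option left.
print(dict(trigram_table[("heard", "the")]))
# With n = 4 every context in this corpus has a single continuation after the
# second word, so the model can only reproduce a training sentence.
print(" ".join(generate(fourgram_table, 4, random.Random(0))))
```

On a corpus this small the 4-gram model already copies its training sentences, which mirrors what the 10-gram model does on the Sherlock Holmes texts.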

### Quantitative evaluation

In order to do a quantitative evaluation of a model we can compute its **entropy** on held-out data. We will use the first part of the novel *The Adventures of Sherlock Holmes* for this. It is loaded by the following command:

In [7]:
test_data = nlp2.read_data("/home/TDDE09/labs/nlp2/data/test.txt")

The next piece of code trains a bigram-model and computes its entropy on the test data:

In [8]:
model = nlp2.train(ngrams.Model, 2, training_data)
nlp2.evaluate(model, test_data)

3.426862596420277

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Compute the entropy for the four models you created for the previous problem. How does the model’s entropy change with the model’s order? Explain using your knowledge of the entropy measure.
</div>
</div>

In [9]:
# TODO: Insert your code here
ent1=nlp2.evaluate(model1, test_data)
ent2=nlp2.evaluate(model2, test_data)
ent3=nlp2.evaluate(model3, test_data)
ent4=nlp2.evaluate(model4, test_data)
ent10=nlp2.evaluate(model10, test_data)
print(ent1)
print(ent2)
print(ent3)
print(ent4)
print(ent10)

7.337551182974018
3.426862596420277
1.4289533769461726
0.5436027106964166
0.3321634449144811


In general, entropy is a measure of disorder or uncertainty in a system. Here it can be read as a measure of how "surprised" the model is by each word of the test data. High entropy means that the model is often surprised, i.e. the words that actually occur were given low probability. Low entropy means that the model was fairly sure about the words that occur; this is what makes it possible, for example, to type a message on a phone without looking at the keyboard and still get the intended words. As the results show, the higher the model's order, the lower the entropy: because the model conditions on more previous words, the next word surprises it less. The entropy drops a lot from unigram to bigram and from bigram to trigram; after that the decrease is smaller, as seen from quadrigram to 10-gram. For high orders the surprise therefore changes little, so at some point increasing $n$ further is not worth it because it barely affects the result.
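The "surprise" reading of entropy can be made concrete with a small sketch. It assumes that entropy is computed as the average negative base-2 logarithm of the probabilities the model assigns to the test words; the exact computation inside `nlp2.evaluate` is not shown in this lab, and the probabilities below are made up.

```python
import math


def entropy(probabilities):
    """Average negative log2 probability (bits per word) of the observed words."""
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)


# Hypothetical probabilities that two models assign to the same four test words.
confident = [0.5, 0.5, 0.5, 0.5]          # the words were expected
uncertain = [0.125, 0.125, 0.125, 0.125]  # the words were surprising

print(entropy(confident))  # 1.0
print(entropy(uncertain))  # 3.0
```

The model that gives the observed words higher probability ends up with the lower entropy.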

## Relation between a model’s quality and the estimation method

In the second part of this lab you will implement and evaluate various estimation methods. In order to do that you will need to know how the lab system is built up.

### The content of a model

When you call the `train()` function (like you did above), the system creates an $n$-gram model of the given class (so far: `ngrams.Model`) and with the given order (the value of $n$) and trains the model on the given data set. For the second part of this lab you will use your own model class. We start with defining the class in such a way that it simply calls the corresponding methods of the superclass:

In [10]:
class Model(ngrams.Model):
    
    def order(self):
        """Return the order of this model (an integer)."""
        return super().order()
    
    def vocabulary(self):
        """Return this model's vocabulary (a set)."""
        return super().vocabulary()
    
    def freq(self, ctxt, word):
        """Return the number of occurrences of `word` (a string) after `ctxt` (a tuple of strings)."""
        return super().freq(ctxt, word)
    
    def total(self, ctxt):
        """Return the total number of ngrams that start with `ctxt` (a tuple of strings)."""
        return super().total(ctxt)
    
    def prob(self, ctxt, word):
        """Return the probability for `word` (a string) given `ctxt` (a tuple of strings)."""
        return super().prob(ctxt, word)

The next piece of code trains a bigram-model of the class `Model` and prints the model’s order (an integer) and the size of its vocabulary (a set of strings, represented by Python’s `set` type).

In [11]:
model = nlp2.train(Model, 2, training_data)
print("order of the model:", model.order())
print("number of words in the model's vocabulary:", len(model.vocabulary()))

order of the model: 2
number of words in the model's vocabulary: 15339


#### Look up an n-gram’s absolute frequency

A trained model consists primarily of a table with absolute frequencies for all $n$-grams that appear in the text it was trained on. In order to look up an $n$-gram’s absolute frequency one can use the method `freq()`. An $n$-gram is divided into two parts: an $(n-1)$-gram called **context** (`ctxt`) and a final unigram (`word`). In Python the context is represented as a tuple of strings and the unigram as a normal string.

If you want to train a trigram model and then know the absolute frequency for the trigram *Mr. Sherlock Holmes* you can write:

In [12]:
model = nlp2.train(Model, 3, training_data)
model.freq(("Mr.", "Sherlock"), "Holmes")

50

For training a bigram model and looking up the absolute frequency for the bigram *Baker Street* you can write the following. Note that the context of a bigram model is a 1-tuple of strings, which has a special notation in Python.

In [13]:
model = nlp2.train(Model, 2, training_data)
model.freq(("Baker",), "Street")

67
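The special notation for a 1-tuple is the trailing comma. A short plain-Python check, independent of the lab modules, shows that parentheses alone do not create a tuple:

```python
one_tuple = ("Baker",)    # trailing comma: a tuple with one element
not_a_tuple = ("Baker")   # parentheses only: still a plain string

print(type(one_tuple).__name__)    # tuple
print(type(not_a_tuple).__name__)  # str
print(len(one_tuple))              # 1
```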

#### Look up the absolute frequency of an n-gram with a given context

The method `total()` returns the absolute frequency of $n$-grams with the given context. Here is an example for a trigram model:

In [14]:
model = nlp2.train(Model, 3, training_data)
model.total(("Mr.", "Sherlock"))

50

<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
Train a bigram model and use it to calculate the following values, using the methods shown above.
</div>
</div>

In [15]:
model = nlp2.train(Model, 2, training_data)

**3.1.** the absolute frequency for the bigram *Sherlock Holmes*

In [16]:
# TODO: Insert your code here
model.freq(("Sherlock",),"Holmes")

195

**3.2.** the absolute frequency of bigrams with the context *Sherlock*

In [17]:
# TODO: Insert your code here
model.total(("Sherlock",))

210

**3.3.** the absolute frequency for the unigram *Sherlock*

In [18]:
# TODO: Insert your code here
model.total(("Sherlock",))

210

**3.4.** the number of words in the vocabulary

In [19]:
# TODO: Insert your code here
len(model.vocabulary())

15339

**3.5.** a list with all the words following the context *Sherlock*

In [20]:
# TODO: Insert your code here
# Keep every vocabulary word that occurs at least once after "Sherlock".
listOfWords = [word for word in model.vocabulary()
               if model.freq(("Sherlock",), word) > 0]
print(listOfWords)


["Holmes's", 'looked', 'everywhere', 'Holmes', '?', ',', '!', 'has', '.']


(For the last exercise you will need to write a bit more than a simple function call.)

### Estimate probabilities with the Maximum Likelihood method

The method `prob()` returns the estimated conditional probability $P(w|c)$ for a word $w$ given a context $c$. The following code snippet trains a trigram model and estimates the probability for *Holmes* given the context *Mr. Sherlock*:

In [21]:
model = nlp2.train(Model, 3, training_data)
model.prob(("Mr.", "Sherlock"), "Holmes")

1.0

(What does the returned value imply?)

<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
Write your own implementation of the method `prob()`. The method should estimate probabilities using the Maximum Likelihood method. Test your implementation by redoing Problem&nbsp;2 with the new class; you should get the same result as before. Use the code you wrote in Problem&nbsp;3 in order to solve the problem.
</div>
</div>

In [22]:
class Model(ngrams.Model):
    
    def prob(self, ctxt, word):
        """Return the probability for `word` (a string) given `ctxt` (a tuple of strings)."""
        # TODO: Replace the next line with your own code 
        freqOfngram = self.freq(ctxt,word)
        freqOfTuplegram = self.total(ctxt)
            
        return freqOfngram/freqOfTuplegram

# TODO: Insert your testing code here
model1 = nlp2.train(Model, 1, training_data)
model2 = nlp2.train(Model, 2, training_data)
model3 = nlp2.train(Model, 3, training_data)
model4 = nlp2.train(Model, 4, training_data)

ent1=nlp2.evaluate(model1, test_data)
ent2=nlp2.evaluate(model2, test_data)
ent3=nlp2.evaluate(model3, test_data)
ent4=nlp2.evaluate(model4, test_data)

print(ent1)
print(ent2)
print(ent3)
print(ent4)


7.337551182974018
3.426862596420277
1.4289533769461726
0.5436027106964166


In order to solve this exercise you will need to turn the formula for Maximum Likelihood estimation into code. We illustrate the formula for a bigram model. If we write $f(w_1w_2)$ for the number of occurrences of the bigram  $w_1w_2$ and $f(w_1)$ for the number of occurrences of the unigram $w_1$, then the probability for observing $w_2$ given $w_1$ is
$$
P(w_2|w_1) = \frac{f(w_1w_2)}{f(w_1)}
$$
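The formula can be tried out on a toy text with plain Python counters. This is only a sketch: the token list and the helper name `mle_prob` are made up for illustration and do not use the lab's `freq()` and `total()` methods.

```python
from collections import Counter

# Toy text (an assumption for illustration only).
tokens = "the dog saw the cat and the dog ran".split()

unigram_freq = Counter(tokens)                  # f(w1)
bigram_freq = Counter(zip(tokens, tokens[1:]))  # f(w1 w2)


def mle_prob(w1, w2):
    """Maximum Likelihood estimate P(w2 | w1) = f(w1 w2) / f(w1)."""
    return bigram_freq[(w1, w2)] / unigram_freq[w1]


# "the" occurs 3 times and is followed by "dog" twice and by "cat" once.
print(mle_prob("the", "dog"))
print(mle_prob("the", "cat"))
# A bigram that never occurs gets probability 0.
print(mle_prob("dog", "the"))
```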

### Problems with Maximum Likelihood estimation

The file `yoda.txt` contains the same text as `test.txt`, but in the jumbled [Yoda-language](http://itre.cis.upenn.edu/~myl/languagelog/archives/002173.html).

In [23]:
yoda_data = nlp2.read_data("/home/TDDE09/labs/nlp2/data/yoda.txt")

<div class="panel panel-primary">
<div class="panel-heading">Problem 5</div>
<div class="panel-body">
Redo the evaluation of the four previous models with `yoda.txt` as test data. Something unexpected happens for models with $n>1$. Why? Explain the problem with your knowledge of Maximum Likelihood estimation.
</div>
</div>

In [24]:
# TODO: Insert your code here
ent1=nlp2.evaluate(model1, yoda_data)
#ent2=nlp2.evaluate(model2, yoda_data)
#ent3=nlp2.evaluate(model3, yoda_data)
#ent4=nlp2.evaluate(model4, yoda_data)

print(ent1)
#print(ent2)
#print(ent3)
#print(ent4)


7.2441064060866385


For $n = 1$ the model is a unigram model, so every word gets a non-zero probability as long as it occurs in the training data. Since `yoda.txt` contains the same text as `test.txt`, only in a different order, the evaluation works and the entropy is close to the one obtained before. For $n > 1$ the jumbled word order produces at least one word combination that never occurs in the training data, and Maximum Likelihood estimation gives such an $n$-gram the probability 0. To avoid underflow, the evaluation takes the logarithm of each probability, and the logarithm of 0 is undefined. A single zero probability therefore ruins the whole evaluation, which is what happens here for $n > 1$.
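The failure can be reproduced in isolation. The sketch below assumes that the evaluation sums base-2 logarithms of the word probabilities, which is not confirmed by the lab code; the three probabilities are made up.

```python
import math

# Hypothetical probabilities for a three-word test sentence. The last n-gram
# was never seen in training, so its Maximum Likelihood estimate is 0.
probs = [0.25, 0.5, 0.0]

# Multiplying the probabilities directly gives 0 for the whole sentence.
product = 1.0
for p in probs:
    product *= p
print(product)

# Summing logarithms instead fails, because the logarithm of 0 is undefined.
try:
    log_sum = sum(math.log2(p) for p in probs)
except ValueError as err:
    log_sum = None
    print("cannot evaluate:", err)
```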

### Estimate probabilities with additive smoothing

For the next problem you are going to do Maximum Likelihood estimation, but with additive smoothing.

<div class="panel panel-primary">
<div class="panel-heading">Problem 6</div>
<div class="panel-body">
<p>
Write a new implementation of the method `prob()`, such that it estimates probabilities with additive smoothing.</p>
<p>
Evaluate the system with the new class using the entropy measure from Problem&nbsp;2. Choose the following values for the smoothing constant $k$: 0.00, 1.00, 0.10, 0.01. For $k=0$ you should get the same result as in Problem&nbsp;4.
</p>
<p>
How and why does the smoothing constant influence the model’s entropy? Connect to the distribution of the probability mass between observed and imaginary occurrences.
</p>
</div>
</div>

In [25]:
class Model(ngrams.Model):
    
    def prob(self, ctxt, word):
        """Return the probability for `word` (a string) given `ctxt` (a tuple of strings)."""
        # TODO: Replace the next line with your own code
        freqOfngram = self.freq(ctxt,word)
        freqOfTuplegram = self.total(ctxt)
        k = 0.01
            
        return (freqOfngram+k)/(freqOfTuplegram+k*len(self.vocabulary()))

# TODO: Insert your testing code here
model1 = nlp2.train(Model, 1, training_data)
model2 = nlp2.train(Model, 2, training_data)
model3 = nlp2.train(Model, 3, training_data)
model4 = nlp2.train(Model, 4, training_data)

ent1=nlp2.evaluate(model1, test_data)
ent2=nlp2.evaluate(model2, test_data)
ent3=nlp2.evaluate(model3, test_data)
ent4=nlp2.evaluate(model4, test_data)

print(ent1)
print(ent2)
print(ent3)
print(ent4)

7.3366010522328216
4.813092274317304
4.767917212034177
4.854491062575155


Fill the table below with your entropy measures.

<table>
<tr><td></td><td>k = 0.00</td><td>k = 1.00</td><td>k = 0.10</td><td>k = 0.01</td></tr>
<tr><td>n = 1</td><td>7.337551182974018</td><td>7.273171406469247</td><td>7.328460956945743</td><td>7.3366010522328216</td></tr>
<tr><td>n = 2</td><td>3.426862596420277</td><td>7.383667454674711</td><td>5.983399629244003</td><td>4.813092274317304</td></tr>
<tr><td>n = 3</td><td>1.4289533769461726</td><td>8.457347218116801</td><td>6.733205552260075</td><td>4.767917212034177</td></tr>
<tr><td>n = 4</td><td>0.5436027106964166</td><td>8.657735266369459</td><td>6.955629963762169</td><td>4.854491062575155</td></tr>
</table>

The smoothing constant $k$ adds $k$ imaginary occurrences of every vocabulary word to every context. This takes a share of the probability mass away from the $n$-grams that were actually observed and redistributes it equally over all words, so the probability of an observed word decreases. The larger $k$ is, the more mass moves from observed to imaginary occurrences. The table shows this clearly: for $k = 1$ the entropy increases for every model above the unigram, and as $k$ decreases the entropy decreases again. The unigram model is hardly affected, presumably because its total count is large compared with $k$ times the vocabulary size.
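How much probability mass moves can be checked with a few lines of arithmetic. The sketch uses the same formula as the `prob()` implementation above, the vocabulary size reported earlier (15339), and a made-up context that was seen 10 times and always followed by the same word; the helper name `smoothed_prob` is hypothetical.

```python
def smoothed_prob(freq, total, vocab_size, k):
    """Additive smoothing: (f + k) / (total + k * |V|)."""
    return (freq + k) / (total + k * vocab_size)


vocab_size = 15339  # vocabulary size reported for the lab's model

# Hypothetical context: seen 10 times, always followed by the same word.
for k in (1.0, 0.1, 0.01):
    seen = smoothed_prob(10, 10, vocab_size, k)
    unseen = smoothed_prob(0, 10, vocab_size, k)
    print(f"k = {k}: seen word {seen:.5f}, each unseen word {unseen:.7f}")
```

Without smoothing the seen word would have probability 1. The smaller $k$ is, the more of that mass the seen word keeps, which matches the lower entropies in the columns for smaller $k$.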

### An unseen test set

Your last exercise is to redo the evaluation on a previously unseen test set, texts from the collection *His Last Bow*.

In [26]:
unseen_data = nlp2.read_data("/home/TDDE09/labs/nlp2/data/lstb.txt")

<div class="panel panel-primary">
<div class="panel-heading">Problem 7</div>
<div class="panel-body">
Redo the evaluation from Problem 6 with the new test data. Explain what happens given the differences between `test.txt` and `lstb.txt`.
</div>
</div>

In [27]:
# TODO: Insert your code here
class Model(ngrams.Model):
    
    def prob(self, ctxt, word):
        """Return the probability for `word` (a string) given `ctxt` (a tuple of strings)."""
        # TODO: Replace the next line with your own code
        vocab = self.vocabulary()
        k = 1.0
        # Smooth only when the word and every context token are in the vocabulary.
        known = word in vocab and all(token in vocab for token in ctxt)

        if known:
            freqOfngram = self.freq(ctxt, word)
            freqOfTuplegram = self.total(ctxt)
            return (freqOfngram + k) / (freqOfTuplegram + k * len(vocab))
        else:
            # Otherwise fall back to a uniform distribution over the vocabulary.
            return k / (k * len(vocab))

# TODO: Insert your testing code here
model1 = nlp2.train(Model, 1, training_data)
model2 = nlp2.train(Model, 2, training_data)
model3 = nlp2.train(Model, 3, training_data)
model4 = nlp2.train(Model, 4, training_data)

ent1=nlp2.evaluate(model1, unseen_data)
ent2=nlp2.evaluate(model2, unseen_data)
ent3=nlp2.evaluate(model3, unseen_data)
ent4=nlp2.evaluate(model4, unseen_data)

print(ent1)
print(ent2)
print(ent3)
print(ent4)

6.160241085953113
6.81990718604588
8.421607621810862
8.860184828789535


The difference between `test.txt` and `lstb.txt` is that `lstb.txt` contains words that did not appear in the training data. Compared with the results for `test.txt`, the entropy on `lstb.txt` is lower for models 1, 2 and 3 when $k = 1$, for models 1 and 2 when $k = 0.1$, and only for model 1 when $k = 0.01$. This shows that smoothing gives a model that evaluates better on unseen data. The models of higher order seem to have more trouble when the test data contains unknown words. Increasing the smoothing constant $k$ alleviates this problem.
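One way to quantify the difference between the two test sets is to measure how many of their tokens are missing from the model's vocabulary. The sketch below uses a made-up vocabulary and made-up sentences; the helper name `oov_rate` is hypothetical. Since `model.vocabulary()` returns a set and the data sets are lists of token lists, the same function could presumably be called as `oov_rate(model.vocabulary(), unseen_data)`.

```python
def oov_rate(vocabulary, sentences):
    """Fraction of tokens in `sentences` that are missing from `vocabulary`."""
    tokens = [token for sentence in sentences for token in sentence]
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


# Toy vocabulary and test sentences (assumptions for illustration only).
vocabulary = {"the", "dog", "barks"}
test_sentences = [["the", "dog", "barks"], ["the", "wolf", "howls"]]

# Two of the six tokens ("wolf" and "howls") are unknown.
print(oov_rate(vocabulary, test_sentences))
```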