# Language Model Exercises
In these exercises you will extend and develop language models. We will use the code from the notes, packaged as the Python module [`lm`](http://localhost:8888/edit/statnlpbook/lm.py).

## <font color='green'>Setup 1</font>: Load Libraries

In [2]:
import sys, os
_snlp_book_dir = ".."  # directory that contains the statnlpbook package
sys.path.append(_snlp_book_dir)  # make statnlpbook importable from this notebook
from statnlpbook.lm import *
from statnlpbook.ohhla import *

## <font color='green'>Setup 2</font>: Load Data

In [3]:
docs = load_all_songs("../data/ohhla/train/www.ohhla.com/anonymous/j_live/")
trainDocs, testDocs = docs[:len(docs)//2], docs[len(docs)//2:] 
train = words(trainDocs)
test = words(testDocs)

## <font color='blue'>Task 1</font>: Optimal Pseudo Count 

Find the optimal pseudo-count `alpha` for [Laplace smoothing](https://github.com/uclmr/stat-nlp-book/blob/python/statnlpbook/lm.py#L180) on the given data, that is, the value that minimises perplexity.

In [12]:
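# Inject OOV tokens into the training data, then map test words outside the training vocabulary to OOV
oov_train = inject_OOVs(train)
oov_vocab = set(oov_train)
oov_test = replace_OOVs(oov_vocab, test)
# Bigram model with Laplace (add-alpha) smoothing, scored by perplexity on the test data
bigram = NGramLM(oov_train,2)
laplace_bigram = LaplaceLM(bigram, alpha=1.0)
perplexity(laplace_bigram, oov_test)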

125.89823282979303
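
One way to approach this task is a simple grid search: evaluate perplexity for several candidate values of `alpha` and keep the smallest. The sketch below is self-contained and does not use the `statnlpbook` classes. The helper names `laplace_bigram_perplexity` and `best_alpha`, the toy corpus, and the candidate grid are all assumptions made for illustration, and it assumes every evaluation token already appears in the training vocabulary (as after OOV replacement).

```python
import math
from collections import Counter

def laplace_bigram_perplexity(train_tokens, eval_tokens, alpha):
    """Perplexity of an add-alpha smoothed bigram model on eval_tokens (illustrative helper)."""
    vocab = set(train_tokens)
    bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))
    history_counts = Counter(train_tokens[:-1])
    log_prob, n = 0.0, 0
    for prev, word in zip(eval_tokens, eval_tokens[1:]):
        # Add alpha to every bigram count and alpha * |V| to the history count.
        p = (bigram_counts[(prev, word)] + alpha) / (history_counts[prev] + alpha * len(vocab))
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

def best_alpha(train_tokens, eval_tokens, candidates):
    """Return (perplexity, alpha) for the candidate with the lowest perplexity."""
    return min((laplace_bigram_perplexity(train_tokens, eval_tokens, a), a) for a in candidates)

# Toy data, purely for illustration.
toy_train = "the cat sat on the mat the cat ate".split()
toy_dev = "the cat sat on the cat".split()
candidates = [0.01, 0.1, 0.5, 1.0, 2.0]
print(best_alpha(toy_train, toy_dev, candidates))
```

With the notebook's own objects, the same loop would call `perplexity(LaplaceLM(bigram, alpha=a), oov_test)` for each candidate `a`. Selecting `alpha` on the test set gives an optimistic estimate, so a separate held-out split is preferable when one is available.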

## <font color='blue'>Task 2</font>: Sanity Check LM
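Implement a function that tests whether a language model defines a valid probability distribution for a given history: every probability is non-negative and the probabilities sum to 1 over the vocabulary.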

In [10]:
def sanity_check(lm, *history):
    """Returns True if lm defines a valid probability distribution for all words in the vocabulary."""  
    pass  # todo

unigram = NGramLM(oov_train,1)
stupid = StupidBackoff(bigram, unigram, 0.1)
print(sum([stupid.probability(word, 'the') for word in stupid.vocab]))
print("Is normalized:", sanity_check(stupid, 'the'))

1.0647115579930901
Is normalized: None
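
A possible starting point, not the reference solution: sum the probabilities over the vocabulary and compare the total with 1 up to a small tolerance. The sketch relies only on the `vocab` attribute and the `probability(word, *history)` method used in the cell above. The name `sanity_check_sketch`, the tolerance `tol`, and the `ToyUniformLM` stand-in are assumptions introduced for illustration.

```python
import math

def sanity_check_sketch(lm, *history, tol=1e-6):
    """True if lm.probability(., *history) is non-negative and sums to 1 over lm.vocab."""
    probs = [lm.probability(word, *history) for word in lm.vocab]
    non_negative = all(p >= 0.0 for p in probs)
    # Compare with a tolerance because floating-point sums are rarely exactly 1.0.
    return non_negative and math.isclose(sum(probs), 1.0, abs_tol=tol)

class ToyUniformLM:
    """Hypothetical stand-in exposing only `vocab` and `probability`."""
    def __init__(self, vocab, scale=1.0):
        self.vocab = set(vocab)
        self.scale = scale  # scale != 1.0 deliberately breaks normalisation

    def probability(self, word, *history):
        return self.scale / len(self.vocab)

print(sanity_check_sketch(ToyUniformLM(["a", "b", "c"]), "the"))
print(sanity_check_sketch(ToyUniformLM(["a", "b", "c"], scale=1.0647), "the"))
```

An exact comparison such as `sum(probs) == 1.0` would reject many correctly normalised models, which is why the tolerance is part of the signature.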


## <font color='blue'>Task 3</font>: Normalisation of Stupid LM
Develop and implement a version of the [stupid backoff language model](https://github.com/uclmr/stat-nlp-book/blob/python/statnlpbook/lm.py#L205) whose probabilities sum to 1 for any given history.

In [11]:
class StupidBackoffNormalized(LanguageModel):
    def __init__(self, main, backoff, alpha):
        super().__init__(main.vocab, main.order)
        self.main = main
        self.backoff = backoff
        self.alpha = alpha

    def probability(self, word, *history):
        return 0.0 # todo
        
less_stupid = StupidBackoffNormalized(bigram, unigram, 0.1)
print(sum([less_stupid.probability(word, 'the') for word in less_stupid.vocab]))
print("Is normalized:", sanity_check(less_stupid, 'the'))

0.0
Is normalized: None
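
One simple option, among several, is brute-force renormalisation: divide each backoff score by the sum of scores over the vocabulary for the same history. The sketch below wraps any object that exposes `vocab` and `probability(word, *history)`. The classes `RenormalizedSketch` and `ToyStupidBackoff` are hypothetical stand-ins written for this example, not part of `statnlpbook`, and the toy backoff rule (bigram relative frequency if the bigram was seen, otherwise `alpha` times the unigram frequency) is an assumption about how such a model behaves.

```python
from collections import Counter

class ToyStupidBackoff:
    """Hypothetical stand-in: bigram relative frequency if seen, else alpha * unigram frequency."""
    def __init__(self, tokens, alpha):
        self.vocab = set(tokens)
        self.alpha = alpha
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.histories = Counter(tokens[:-1])
        self.total = len(tokens)

    def probability(self, word, *history):
        prev = history[-1]
        if self.bigrams[(prev, word)] > 0:
            return self.bigrams[(prev, word)] / self.histories[prev]
        return self.alpha * self.unigrams[word] / self.total

class RenormalizedSketch:
    """Divides each score by the per-history sum of scores so the result sums to 1."""
    def __init__(self, scorer):
        self.scorer = scorer
        self.vocab = scorer.vocab
        self._norm_cache = {}  # history tuple -> normaliser, computed once per history

    def probability(self, word, *history):
        if history not in self._norm_cache:
            self._norm_cache[history] = sum(
                self.scorer.probability(w, *history) for w in self.vocab)
        z = self._norm_cache[history]
        if z == 0.0:
            return 1.0 / len(self.vocab)  # fall back to uniform if every score is zero
        return self.scorer.probability(word, *history) / z

toy = ToyStupidBackoff("the cat sat on the mat".split(), alpha=0.1)
fixed = RenormalizedSketch(toy)
print(sum(toy.probability(w, "the") for w in toy.vocab))
print(sum(fixed.probability(w, "the") for w in fixed.vocab))
```

Renormalising costs one pass over the vocabulary per distinct history, which the cache amortises. A cheaper alternative is to scale only the backoff term by the probability mass the main model leaves unassigned, at the price of a slightly more involved derivation.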


## <font color='blue'>Task 4</font>: Subtract Count LM
Develop and implement a language model that subtracts a count $d\in[0,1]$ from each non-zero count in the training set.


In [7]:
class SubtractCount(CountLM):        
    def __init__(self, base_lm, d):
        super().__init__(base_lm.vocab, base_lm.order)
        self.base_lm = base_lm
        self.d = d

    def counts(self, word_and_history):
        pass

    def norm(self, history):
        pass
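
To see what subtracting a count does to the distribution, the following standalone sketch discounts bigram counts for one history and spreads the freed probability mass uniformly over the vocabulary. It does not use the `CountLM` interface above. The function name `discounted_bigram_probs` and the uniform redistribution are assumptions chosen for illustration, and other ways of handling the freed mass are equally reasonable.

```python
from collections import Counter

def discounted_bigram_probs(tokens, history, d):
    """Distribution over the vocabulary after subtracting d from each non-zero bigram count."""
    vocab = sorted(set(tokens))
    # Counts of words that follow `history` in the token sequence.
    followers = Counter(w for prev, w in zip(tokens, tokens[1:]) if prev == history)
    total = sum(followers.values())
    if total == 0:
        # Unseen history: nothing to discount, so fall back to a uniform distribution.
        return {w: 1.0 / len(vocab) for w in vocab}
    # Each of the len(followers) seen word types gives up d counts.
    freed = d * len(followers) / total
    return {w: max(followers[w] - d, 0.0) / total + freed / len(vocab) for w in vocab}

toy_tokens = "the cat sat on the mat the cat".split()
probs = discounted_bigram_probs(toy_tokens, "the", d=0.5)
print(probs)
print(sum(probs.values()))
```

If the discounted counts are divided by the original history count without redistributing the freed mass, the probabilities sum to less than 1, so the choice made in `norm` determines whether the model passes the sanity check from Task 2.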