## **>>> You were told not to change the setup and/or assessment cells, yet you did so. The magnitude of change doesn't matter. Next time you'll lose points for that. <<<**

# Assignment 1

In this assignment you will build a language model for the [OHHLA corpus](http://ohhla.com/) we are using in the book. You will train the model on the available training set, and can tune it on the development set. After submission we will run your notebook on a different test set. Your mark will depend on 

* whether your language model is **properly normalized**,
* its **perplexity** on the unseen test set,
* your **description** of your approach. 

To develop your model you have access to:

* The training and development data in `data/ohhla`.
* The code of the lecture, stored in a python module [here](/edit/statnlpbook/lm.py).
* Libraries on the [docker image](https://github.com/uclmr/stat-nlp-book/blob/python/Dockerfile) which contains everything in [this image](https://github.com/jupyter/docker-stacks/tree/master/scipy-notebook), including scikit-learn and tensorflow. 

As we have to run the notebooks of all students, and because writing efficient code is important, **your notebook should run in 5 minutes at most**, on your machine. Further comments:

* We have tested a possible solution on the Azure VMs and it ran in seconds, so it is possible to train a reasonable LM on the data in reasonable time. 

* Try to run your parameter optimisation offline, such that in your answer notebook the best parameters are already set and don't need to be searched.

## Setup Instructions
It is important that this file is placed in the **correct directory**. It will not run otherwise. The correct directory is

    DIRECTORY_OF_YOUR_BOOK/assignments/2017/assignment1/problem/
    
where `DIRECTORY_OF_YOUR_BOOK` is a placeholder for the directory you downloaded the book to. After you placed it there, **rename the file** to your UCL ID (of the form `ucxxxxx`). 

## General Instructions
This notebook will be used by you to provide your solution, and by us to both assess your solution and enter your marks. It contains three types of sections:

1. **Setup** Sections: these sections set up code and resources for assessment. **Do not edit these**. 
2. **Assessment** Sections: these sections are used for both evaluating the output of your code, and for markers to enter their marks. **Do not edit these**. 
3. **Task** Sections: these sections require your solutions. They may contain stub code, and you are expected to edit this code. For free text answers simply edit the markdown field.  

Note that you are free to **create additional notebook cells** within a task section. 

Please **do not share** this assignment publicly, by uploading it online, emailing it to friends etc. 


## Submission Instructions

To submit your solution:

* Make sure that your solution is fully contained in this notebook. 
* **Rename this notebook to your UCL ID** (of the form "ucxxxxx"), if you have not already done so.
* Download the notebook in Jupyter via *File -> Download as -> Notebook (.ipynb)*.
* Upload the notebook to the Moodle submission site.


## <font color='green'>Setup 1</font>: Load Libraries
This cell loads libraries important for evaluation and assessment of your model. **Do not change it.**

In [1]:
#! SETUP 1
import sys, os
_snlp_book_dir = "../../../../"
sys.path.append(_snlp_book_dir) 
import statnlpbook.lm as lm
import statnlpbook.ohhla as ohhla
import math
import numpy as np
import collections

## <font color='green'>Setup 2</font>: Load Training Data

This cell loads the training data. We use this data for assessment to define the reference vocabulary: the union of the words of the training and set set. You can use the dataset to train your model, but you are also free to load the data in a different way, or focus on subsets etc. However, when you do this, still **do not edit this setup section**. Instead refer to the variables in your own code, and slice and dice them as you see fit.   

In [2]:
#! SETUP 2
_snlp_train_dir = _snlp_book_dir + "/data/ohhla/train"
_snlp_dev_dir = _snlp_book_dir + "/data/ohhla/dev"
_snlp_train_song_words = ohhla.words(ohhla.load_all_songs(_snlp_train_dir))
_snlp_dev_song_words = ohhla.words(ohhla.load_all_songs(_snlp_dev_dir))
assert(len(_snlp_train_song_words)==1041496)

Could not load ../../../..//data/ohhla/train/www.ohhla.com/anonymous/nas/distant/tribal.nas.txt.html


Due to file encoding issues this code produces one error `Could not load ...`. **Ignore this error**.

## <font color='blue'>Task 1</font>: Develop and Train the Model

This is the core part of the assignment. You are to code up, train and tune a language model. Your language model needs to be subclass of the `lm.LanguageModel` class. You can use some of the existing language models developed in the lecture, or develop your own extensions. 

Concretely, you need to return a better language model in the `create_lm` function. This function receives a target vocabulary `vocab`, and it needs to return a language model defined over this vocabulary. 

The target vocab will be the union of the training and test set (hidden to you at development time). This vocab will contain words not in the training set. One way to address this issue is to use the `lm.OOVAwareLM` class discussed in the lecture notes.

## Below is the code for N-gram

In [3]:
import numpy as np
import collections

class NGramLM(lm.CountLM):
    def __init__(self, train, order):
        """
        Create an NGram language model.
        Args:
            train: list of training tokens.
            order: order of the LM.
        """
        super().__init__(set(train), order)
        self._counts = collections.defaultdict(float)
        self._norm = collections.defaultdict(float)
        self._norm_counts = collections.defaultdict(float)
        seen = set()
        for i in range(self.order, len(train)):
            history = tuple(train[i - self.order + 1: i])
            word = train[i]
            self._counts[(word,) + history] += 1.0
            self._norm[history] += 1.0
#             print (" ".join((word,) + history))
            if "".join((word,) + history) not in seen:
                self._norm_counts[history] += 1.0
                seen.add("".join((word,) + history))

    def counts(self, word_and_history):
        return self._counts[word_and_history]

    def norm(self, history):
        return self._norm[history]
    
class KneserNey(lm.CountLM):
    def __init__(self, train, order):
        super().__init__(set(train), order)
        self._counts = collections.defaultdict(float)
        self._norm = collections.defaultdict(float)
        self._word_counts = collections.defaultdict(float)
        self.total_bigrams = 0
        bigram_seen = set()
        word_seen = set()
        for i in range(self.order, len(train)):
            history = tuple(train[i - self.order: i])
            word = train[i]
            self._counts[(word,) + history] += 1.0
            self._norm[history] += 1.0
#             print (" ".join((word,) + history))
            if "".join((word,) + history) not in word_seen:
                self._word_counts[word] += 1.0
                word_seen.add("".join((word,) + history))
            if "".join((word,) + history) not in bigram_seen:
                self.total_bigrams += 1
                bigram_seen.add("".join((word,) + history))
                
    def probability(self, word, *history):
        # word = word.lower()
        if word not in self.vocab:
            return 0.0
        return self._word_counts[word] / self.total_bigrams

    def counts(self, word_and_history):
        return self._counts[word_and_history]

    def norm(self, history):
        return self._norm[history]

In [4]:
class AbsoluteDiscounting(lm.LanguageModel):
    def __init__(self, main, backoff, d,missing_words):
        super().__init__(main.vocab, main.order)
        self.main = main
        self.backoff = backoff
        self.d = d
        self.missing_words = missing_words
        
    def probability(self, word, *history):
        is_oov = False
        if word in self.main.vocab:
            word = word
            is_oov = False
        elif word in self.missing_words:
            word = lm.OOV
            is_oov = True

        sub_history = tuple(history[-(self.order - 1):]) if self.order > 1 else ()        
        if self.main.norm(sub_history) > 0 and self.main._norm_counts[sub_history] > 0:
            a = max( self.main.counts((word,) + sub_history) - self.d ,0 )/self.main.norm(sub_history) 
            normalizer = ( self.d / self.main.norm(sub_history)  ) * self.main._norm_counts[sub_history]            
            b = normalizer * self.backoff.probability(word, *history) 
            if is_oov:
                return (a + b)/len(self.missing_words)
            return a + b
        else:
            if is_oov:
                return self.backoff.probability(word, *history)/len(self.missing_words)
            return self.backoff.probability(word, *history) 
            

In [5]:
## You should improve this cell
def create_lm(vocab):
    """
    Return an instance of `lm.LanguageModel` defined over the given vocabulary.
    Args:
        vocab: the vocabulary the LM should be defined over. It is the union of the training and test words.
    Returns:
        a language model, instance of `lm.LanguageModel`.
    """
    train_corpus = lm.inject_OOVs(_snlp_train_song_words)
    _snlp_train_vocab = set(train_corpus)
    missing_words = vocab - _snlp_train_vocab 
    
    lm1 = NGramLM(train_corpus,2)
    lm2 = KneserNey(train_corpus,1)
    lm3 = NGramLM(_snlp_train_song_words,3)
    lm4 = NGramLM(_snlp_train_song_words,4)
    lm5 = NGramLM(_snlp_train_song_words,5)
    lm6 = NGramLM(_snlp_train_song_words,6)
    lm7 = NGramLM(_snlp_train_song_words,7)
    lm8 = NGramLM(_snlp_train_song_words,8)
    lm9 = NGramLM(_snlp_train_song_words,9)

    full_lm = AbsoluteDiscounting(lm1,lm2,0.9,missing_words)
    full_lm = AbsoluteDiscounting(lm3,full_lm,0.95,missing_words)
    full_lm = AbsoluteDiscounting(lm4,full_lm,0.95,missing_words)
    full_lm = AbsoluteDiscounting(lm5,full_lm,0.75,missing_words)
    full_lm = AbsoluteDiscounting(lm6,full_lm,0.6,missing_words)
    full_lm = AbsoluteDiscounting(lm7,full_lm,0.5,missing_words)
    full_lm = AbsoluteDiscounting(lm8,full_lm,0.45,missing_words)
    full_lm = AbsoluteDiscounting(lm9,full_lm,0.55,missing_words)

    return full_lm


## Below is the code for LSTM.
Just a try, not using for create_lm() function

In [6]:
import tensorflow as tf
def word2id(_snlp_train_vocab):
    return dict(zip(_snlp_train_vocab,range(len(_snlp_train_vocab)))),dict(zip(range(len(_snlp_train_vocab)),_snlp_train_vocab))

class LSTMModel():
    def __init__(self,train_set,dev_set,order,vocab,batch_size,is_preprocess,embedding_size=50,lstm_hidden_size=50,hidden_size=50,learning_rate=1e-4):
        self.train_vocab,count = lm.inject_OOVs(train_set)
        
        self.train_set = self.train_vocab
        self.dev_set = dev_set
        self.order = order
        self.vocab = set(self.train_set)
        
        self._snlp_train_vocab = set(self.train_set)
        self.vocab_size = len(self._snlp_train_vocab)
        self.missing_words = vocab - self._snlp_train_vocab
        print (self.vocab_size)
        self.batch_size = batch_size
        
        self.embedding_size = embedding_size
        self.lstm_hidden_size = lstm_hidden_size
        self.learning_rate = learning_rate
        
        self.w2id,self.id2w = word2id(self._snlp_train_vocab)
        self.num_batchs = int(len(train_set)/self.batch_size)

        if is_preprocess:
            train_set = [self.w2id[word] for word in self.train_set]
            def get_batch():
                x_batch = np.zeros([self.batch_size,self.order-1],dtype=np.int32)
                y_batch = np.zeros([self.batch_size],dtype=np.int32)
                for i in range(self.batch_size):
                    index = np.random.randint(self.order+1,len(train_set))
                    y_batch[i] = train_set[index]
                    x_batch[i,:] = train_set[index-self.order+1:index]
                return x_batch,y_batch

            self.batchs = []
            for i in range(int(len(train_set)/self.batch_size)):
                x_batch,y_batch = get_batch()
                self.batchs.append([x_batch,y_batch])
            
            file = open("batchs.pkl","wb")
            pickle.dump(self.batchs,file, protocol=pickle.HIGHEST_PROTOCOL) 
            file.close()
            print ("batchs loaded: ",len(train_set),"batches")
        else:
            file = open("batchs.pkl","rb")
            self.batchs = pickle.load(file) 
            file.close()
            print ("batchs loaded")
        
        
        self.graph = tf.Graph()
        with self.graph.as_default():
            with tf.device('/cpu:0'):
                self.x = tf.placeholder(dtype=tf.int32)
                self.y = tf.placeholder(dtype=tf.int32)
                self.keep_prob = tf.placeholder(dtype=tf.float32)
                
            with tf.device('/cpu:0'):
                self.embedding = tf.get_variable(
                  "embedding", [self.vocab_size, embedding_size])

                inputs = tf.nn.embedding_lookup(self.embedding, self.x)
                self.inputs = inputs
                
            with tf.device('/cpu:0'):
                limV = np.sqrt(6. / (embedding_size + lstm_hidden_size * 2))
                limG = limV * 4
                
                def randMatrix(rng, shape, lim):
                    return np.asarray(
                        rng.uniform(
                            low=-lim,
                            high=lim,
                            size=shape
                        ),
                        dtype=np.float32
                    )
                rng = np.random
                # Parameters:
                # Input gate: input, previous output, and bias.
                ix_value = randMatrix(rng, (embedding_size, lstm_hidden_size), limG)
                ix = tf.Variable(ix_value)
                im_value = randMatrix(rng, (lstm_hidden_size, lstm_hidden_size), limG)
                im = tf.Variable(im_value)
                ib = tf.Variable(tf.zeros([1, lstm_hidden_size]))
                # Forget gate: input, previous output, and bias.
                fx_value = randMatrix(rng, (embedding_size, lstm_hidden_size), limG)
                fx = tf.Variable(fx_value)
                fm_value = randMatrix(rng, (lstm_hidden_size, lstm_hidden_size), limG)
                fm = tf.Variable(fm_value)
                fb = tf.Variable(tf.zeros([1, lstm_hidden_size]))
                # Memory cell: input, state and bias.    
                cx_value = randMatrix(rng, (embedding_size, lstm_hidden_size), limV)
                cx = tf.Variable(cx_value)
                cm_value = randMatrix(rng, (lstm_hidden_size, lstm_hidden_size), limV)
                cm = tf.Variable(cm_value)
                cb = tf.Variable(tf.zeros([1, lstm_hidden_size]))
                
                # Output gate: input, previous output, and bias.
                ox_value = randMatrix(rng, (embedding_size, lstm_hidden_size), limG)
                ox = tf.Variable(ox_value)
                om_value = randMatrix(rng, (lstm_hidden_size, lstm_hidden_size), limG)
                om = tf.Variable(om_value)
                ob = tf.Variable(tf.zeros([1, lstm_hidden_size]))

                
                W_atten_values = np.asarray(
                    rng.uniform(
                        low=-np.sqrt(6. / (lstm_hidden_size + 100)),
                        high=np.sqrt(6. / (lstm_hidden_size + 100)),
                        size=(lstm_hidden_size, 100)
                    ),
                    dtype=np.float32
                )
                w_atten = tf.Variable(W_atten_values)
                b_atten = tf.Variable(tf.zeros([100]))
                
                v_values = np.asarray(
                    rng.normal(scale=0.1, size=(100,)),
                    dtype=np.float32
                )
                v = tf.Variable(v_values)
            
                # Classifier weights and biases.
                
                w_1 = tf.Variable(tf.truncated_normal([lstm_hidden_size, hidden_size], -0.1, 0.1))
                b_1 = tf.Variable(tf.zeros([hidden_size]))
                
                w = tf.Variable(tf.truncated_normal([hidden_size, self.vocab_size], -0.1, 0.1))
                b = tf.Variable(tf.zeros([self.vocab_size]))

                ## LSTM layer
                def lstm_cell(i, prev, state):
                    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(prev, im) + ib)
                    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(prev, fm) + fb)
                    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(prev, om) + ob)
                    update = tf.tanh(tf.matmul(i, cx) + tf.matmul(prev , cm) + cb)

                    CC = forget_gate * state + input_gate * update
                    h = output_gate * tf.tanh(CC)
                    return h, CC

                output, state = tf.zeros_like(tf.matmul(inputs[:,0,:], ix)), tf.zeros_like(tf.matmul(inputs[:,0,:], ix))
                outputs = []
                for i in range(self.order-1):
                    output, state = lstm_cell(inputs[:,i,:], output, state)
                    output_3 = tf.expand_dims(output,1)
                    outputs.append(output_3)
                outputs = tf.concat(outputs,axis=1)
                
                ## attention layer
                outputs = tf.transpose(outputs, perm=[1, 0, 2]) 
                atten = tf.nn.tanh(tf.matmul(outputs,w_atten) + b_atten)
                w_atten = tf.expand_dims(w_atten,0)
                w_atten = tf.tile(w_atten,[self.order-1,1,1])
                atten = tf.nn.tanh(outputs @ w_atten + b_atten)
                atten = tf.reduce_sum(atten * v, axis=2)
                atten = tf.nn.softmax(tf.transpose(atten, perm=[1, 0]))
                atten = tf.expand_dims(tf.transpose(atten, perm=[1, 0]),2)
                lstm_output = tf.reduce_sum(atten * outputs, axis = 0)
                
                self.lstm_output = lstm_output
                
                ## fully connected layer
                hidden = tf.nn.relu(tf.matmul(lstm_output,w_1) + b_1)
                hidden = tf.nn.dropout(hidden,keep_prob = self.keep_prob)
                
                logits = tf.matmul(hidden,w) + b 
                self.predicted = tf.nn.softmax(logits=logits)
            with tf.device('/cpu:0'):
                y_ = tf.one_hot(self.y,depth=self.vocab_size,dtype=tf.int64)
            with tf.device('/cpu:0'):
                ## cost function
                self.cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits , labels=y_))
                self.train_op = tf.train.AdamOptimizer(learning_rate=self.learning_rate).minimize(self.cost)
        
    def trianing(self,num_iteration):
        
        val_set = [self.w2id[word] if word in self.vocab else self.w2id[lm.OOV] for word in self.dev_set]
        
        self.session = tf.InteractiveSession(graph=self.graph)
        tf.initialize_all_variables().run()
        
        saver = tf.train.Saver()
        
        min_cost = 10000000
        cost_list = []
        
        for i in range(num_iteration):
            for _ in range(500):
                n = np.random.randint(self.num_batchs)
                x_batch,y_batch = self.batchs[n]
                feed_dict = {self.x:x_batch,self.y:y_batch,self.keep_prob : 0.5}
                _, cost = self.session.run([self.train_op, self.cost], feed_dict=feed_dict)
                cost_list.append(cost)
        
        
            def get_batch(data_set):
                x_batch = np.zeros([self.batch_size,self.order-1],dtype=np.int32)
                y_batch = np.zeros([self.batch_size],dtype=np.int32)
                for i in range(self.batch_size):
                    index = np.random.randint(self.order+1,len(data_set))
                    y_batch[i] = data_set[index]
                    x_batch[i,:] = data_set[index-self.order+1:index]
                return x_batch,y_batch
            val_cost_list = []
            for _ in range(50):
                val_x_batch,val_y_batch = get_batch(val_set)
                feed_dict = {self.x:val_x_batch,self.y:val_y_batch,self.keep_prob : 1}
                cost = self.session.run(self.cost, feed_dict=feed_dict)
                val_cost_list.append(cost)
            print (str(i)+"th","validation set cost: ",np.mean(val_cost_list))
        
            val_cost = np.mean(val_cost_list)
            if val_cost <= min_cost:
                save_path = saver.save(self.session, "./model")
                print ("model saved")
                min_cost = val_cost
        return cost_list
    
    def loadmodel(self):
        saver = tf.train.Saver()
        self.test_sess = tf.InteractiveSession(graph=self.graph)
        saver.restore(self.test_sess, "./model")
    
    def probability(self,word,*history):
        if self.test_sess == None:
            print ("Need to load model first")
        is_missing = False
        if word not in self.vocab:
            word = lm.OOV
            is_missing = True
            
        history = list(history)
        history = [w if w in self.vocab else lm.OOV for w in history ]
            
        x_batch = np.zeros([1,self.order-1])
        x_batch[0,:] = [self.w2id[w] for w in history]
        
        feed_dict = {self.x:x_batch,self.keep_prob : 1}
        predicted = self.test_sess.run(self.predicted,feed_dict=feed_dict)
        ans = predicted[0,self.w2id[word]]
        if is_missing:
            ans = ans/len(self.missing_words)
        return ans
        

In [7]:
# model initialization
train_set = _snlp_train_song_words
dev_set = _snlp_dev_song_words
order = 31
vocab = set(_snlp_train_song_words) | set(_snlp_dev_song_words)
batch_size = 20
lstm_lm = LSTMModel(train_set,
                    dev_set,
                    order,
                    vocab,
                    batch_size,
                    True,
                    embedding_size=200,
                    lstm_hidden_size=650,
                    hidden_size=650,
                    learning_rate=1e-3)
                    
# model training
cost_list = lstm_lm.trianing(100)
plt.plot(cost_list)
plt.show()

# model testing
lstm_lm.loadmodel()
lm.perplexity(lstm_lm, _snlp_dev_song_words)      
    

ValueError: too many values to unpack (expected 2)

## <font color='green'>Setup 3</font>: Specify Test Data
This cell defines the directory to load the test songs from. Currently, this points to the dev set but when we evaluate your notebook we will point this directory elsewhere and use a **hidden test set**.  

In [6]:
#! SETUP 3
_snlp_test_dir = _snlp_book_dir + "/data/ohhla/test"

## <font color='green'>Setup 4</font>: Load Test Data and Prepare Language Model
In this section we load the test data, prepare the reference vocabulary and then create your language model based on this vocabulary.

In [7]:
#! SETUP 4
_snlp_test_song_words = ohhla.words(ohhla.load_all_songs(_snlp_test_dir))
_snlp_test_vocab = set(_snlp_test_song_words)
_snlp_dev_vocab = set(_snlp_dev_song_words)
_snlp_train_vocab = set(_snlp_train_song_words)
_snlp_vocab = _snlp_test_vocab | _snlp_train_vocab | _snlp_dev_vocab
_snlp_lm = create_lm(_snlp_vocab)

FileNotFoundError: [Errno 2] No such file or directory: '../../../..//data/ohhla/test'

## <font color='red'>Assessment 1</font>: Test Normalization (20 pts)
Here we test whether the conditional distributions of your language model are properly normalized. If probabilities sum up to $1$ you get full points, you get half of the points if probabilities sum up to be smaller than 1, and 0 points otherwise. Due to floating point issues we will test with respect to a tolerance $\epsilon$ (`_eps`).

Points:
* 10 pts: $\leq 1 + \epsilon$
* 20 pts: $\approx 1$

In [10]:
#! ASSESSMENT 1
_snlp_test_token_indices = [100, 1000, 10000]
_eps = 0.000001
approx_1 = []
leq_1 = []
for i in _snlp_test_token_indices:
    result = sum([_snlp_lm.probability(word, *_snlp_test_song_words[i-_snlp_lm.order+1:i]) for word in _snlp_vocab])
    approx_1.append(abs(result - 1.0) < _eps)
    leq_1.append(result - _eps <= 1.0)
    
    print("Sum: {sum}, ~1: {approx_1}, <=1: {leq_1}".format(sum=result, 
                                                            approx_1=abs(result - 1.0) < _eps, 
                                                            leq_1=result - _eps <= 1.0))
(sum(approx_1) == 3, sum(leq_1) == 3)

Sum: 1.0000000000000626, ~1: True, <=1: True
Sum: 1.0000000000004305, ~1: True, <=1: True
Sum: 0.999999999999878, ~1: True, <=1: True


(True, True)

The above solution is marked with **
<!-- ASSESSMENT 2: START_POINTS -->
20
<!-- ASSESSMENT 2: END_POINTS --> 
points **.

### <font color='red'>Assessment 2</font>: Apply to Test Data (50 pts)

We assess how well your LM performs on some unseen test set. Perplexities are mapped to points as follows.

* 0-10 pts: uniform perplexity > perplexity > 550, linear
* 10-30 pts: 550 > perplexity > 140, linear
* 30-50 pts: 140 > perplexity > 105, linear

The **linear** mapping maps any perplexity value between the lower and upper bound linearly to a score. For example, if uniform perplexity is $U$ and your model's perplexity is $P\leq550$, then your score is $10\frac{P-U}{550-U}$. 

In [11]:
lm.perplexity(_snlp_lm, _snlp_test_song_words)

inf

The above solution is marked with **
<!-- ASSESSMENT 3: START_POINTS -->
0
<!-- ASSESSMENT 3: END_POINTS --> points**. 

## <font color='blue'>Task 2</font>: Describe your Approach

< Enter a 500 words max description of your model and the way you trained and tuned it here >

## <font color='red'>Assessment 3</font>: Assess Description (30 pts) 

We will mark the description along the following dimensions: 

* Clarity (10pts: very clear, 0pts: we can't figure out what you did)
* Creativity (10pts: we could not have come up with this, 0pts: Use the unigram model from the lecture notes)
* Substance (10pts: implemented complex state-of-the-art LM, 0pts: Use the unigram model from the lecture notes)

# <font color='o'>Introduction</font>:

I have developed the language models in two ways:
 * N-gram models with absolute discounting and Kneser Ney smoothing.
 * Neural network (LSTM) model built with tensorflow. 
 
The perplexity of first method is about 122 on validation set, the perplexity of the second method is about 160 on validation set.
 
## <font color='red'>1. N-gram model</font>:

#### Absolute discounting and Kneser Ney
The idea of this model is to combine different order N-gram models to obtain better performance than single N-gram. Theoretically, higher order N-gram model leads to more precise probability, but the problem is that along with the order higher of N-gram model, the required data grows exponentially, the small dataset will lead to sparse history record. The way I used to balance the problem is absolute discounting, which combines different order N-gram models using a constant discount $d$ and a normolizer $\lambda$, the constant discount $d$ is a hyperparameter and the normolizer $\lambda$ is calculated during producing the probability. AbsoluteDiscounting() and NGramLM() are the implementation of absolute discounting, the reason of rewrite NGramLM() is count first seen history which is used to calculate $\lambda$ in absolute discounting. To improve the performance of the unigram in absolute discounting, I use Kneser Ney smoothing to calculate the unigram model. The implementation of Kneser Ney is in KneserNey() function which will count the total number of unique bigram pairs and count the number of each word behind any word once. 

#### Tune process
The way to find the how many different order models need to be used and the suitable constant discount $d$ is greedy search. The process of greedy search is first build a unigram and bigram, try every $d$ in a candidates list and calculate corresponding perplexity, find the best $d$ which leads the smallest perplexity. After that, keep $d_{1,2}$ do not change and add trigram into language model and do the same thing to find the best $d_{2,3}$. Iterate these steps until add a new order model leads to bad performance. Then tune each $d_{i,j}$ where $i != j$ and $i=j-1$ while $d_{n,m}$ where $n,m != i,j$ not change. The final language model contains 9 order models from unigram to 9-gram.

#### Out of Vocab problem
There is possible to receive a word which not in the training set. In order to not return 0, I inject the training set and when facing a OOV word during calculate the probability, replace the word with lm.OOV and divide the probability with the length of missing words in order to norm to 1.


## <font color='red'>2. Neural network model</font>:
The basic architecture of the neural network language model is a single long short term memory(LSTM) layer with learnable word embedding as input definded in class LSTMModel(). In the forward process, each word of training vocabulary has been embedded into a fixed size vector, initialzed with uniform(6/(f_in+f_out)). The output of the LSTM layer send to attention layer to combine these hidden state. Then using two fully connected layer with dropout and softmax with cross entropy as cost function.

The above solution is marked with **
<!-- ASSESSMENT 1: START_POINTS -->
20
<!-- ASSESSMENT 1: END_POINTS --> points**. 
    
Notes:
1-9 gram (2-gram uses KN) + absoluyr discounting; greedy grid search for hyperparameter selection; also tried: lstm.
Clarity/Creativity/Substance:10/4/6