<!---
Latex Macros
-->
$$
\newcommand{\bar}{\,|\,}
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\weights}{\mathbf{w}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

# Assignment 3

## Introduction

In this last assignment, you will apply deep learning methods to a particular story understanding problem. Automatic understanding of stories is an important task in natural language understanding [[1]](http://anthology.aclweb.org/D/D13/D13-1020.pdf). Specifically, you will develop a model that, given a sequence of sentences, learns to sort these sentences so that they yield a coherent story [[2]](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/short-commonsense-stories.pdf). This sounds (and to an extent is) trivial for humans; however, it is quite a difficult task for machines, as it involves commonsense knowledge and temporal understanding.

## Goal

You are given a dataset of 45502 instances, each consisting of 5 sentences. Your system needs to output a sequence of numbers which represents the predicted order of these sentences. For example, given a story:

    He went to the store.
    He found a lamp he liked.
    He bought the lamp.
    Jan decided to get a new lamp.
    Jan's lamp broke.

your system needs to provide an answer in the following form:

    2	3	4	1	0

where the numbers correspond to the zero-based index of each sentence in the correctly ordered story. So "`2`" for "`He went to the store.`" means that this sentence should come 3rd in the correctly ordered target story. In this particular example, this order of indices corresponds to the following target story:

    Jan's lamp broke.
    Jan decided to get a new lamp.
    He went to the store.
    He found a lamp he liked.
    He bought the lamp.
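
As a quick illustration of this convention, the sketch below reorders the shuffled sentences using the list of indices. The helper name `reorder_story` is ours and is not part of the provided code.

```python
def reorder_story(story, order):
    """Place each sentence at the position given by its zero-based target index."""
    ordered = [None] * len(story)
    for sentence, position in zip(story, order):
        ordered[position] = sentence
    return ordered


story = [
    "He went to the store.",
    "He found a lamp he liked.",
    "He bought the lamp.",
    "Jan decided to get a new lamp.",
    "Jan's lamp broke.",
]
order = [2, 3, 4, 1, 0]

# Prints the five sentences in their target order, one per line.
for sentence in reorder_story(story, order):
    print(sentence)
```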

## Resources

To develop your model(s), we provide a training and a development dataset. The test dataset will be held out, and we will use it to evaluate your models. The test set comes from the same task distribution, so you do not need to expect drastic changes in it.

You will use [TensorFlow](https://www.tensorflow.org/) to build a deep learning model for the task. We provide a very crude system which solves the task with a low accuracy, and a set of additional functions you will have to use to save and load the model you create so that we can run it.

As we have to run the notebooks of each submission, and as deep learning models take a long time to train, your notebook **NEEDS** to conform to the following requirements:
* You **NEED** to run your parameter optimisation offline, and provide your final model saved by using the provided function
* The maximum size of a zip file you can upload to moodle is 160MB. We will **NOT** allow submissions larger than that.
* We do not have time to train your models from scratch! You **NEED** to provide the full code you used for the training of your model, but by all means you **CANNOT** call the training method in the notebook you will send to us.
* We will run these notebooks automatically. If your notebook runs the training procedure, in addition to loading the model, and we need to edit your code to stop the training, you will be penalised with **-20 points**.
* If you do not provide a pretrained model, and rely on training your model on our machines, you will get **0 points**.
* Your submissions will be tested on the stat-nlp-book Docker image to ensure that it does not have any dependencies outside of those that we provide. If your submission fails to adhere to this requirement, you will get **0 points**.

Running time and memory issues:
* We have tested a possible solution on a mid-2014 MacBook Pro, and a few epochs of the model run in less than 3min. Thus it is possible to train a model on the data in reasonable time. However, be aware that you will need to run these models many times over, for a larger number of epochs (more elaborate models, trained on much larger datasets can train for weeks! However, this shouldn't be the case here.). If you find training times too long for your development cycle you can reduce the training set size. Once you have found a good solution you can increase the size again. Caveat: model parameters tuned on a smaller dataset may not be optimal for a larger training set.
* In addition to this, as your submission is capped by size, feel free to experiment with different model sizes, numeric values of different precisions, filtering the vocabulary size, downscaling some vectors, etc.
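
To get a feel for the size cap, a back-of-the-envelope estimate of the footprint of an embedding matrix can help. The sketch below is only arithmetic: the vocabulary sizes and dimensions are made-up illustrative values, and a saved model will contain other variables and some overhead on top of this.

```python
def embedding_size_mb(vocab_size, dim, bytes_per_value=4):
    """Approximate size in megabytes of a vocab_size x dim matrix."""
    return vocab_size * dim * bytes_per_value / float(1024 ** 2)


# Hypothetical configurations, for illustration only:
# (number of words, embedding dimension, bytes per stored value)
configurations = [
    (30000, 300, 4),  # 32-bit floats, large vectors
    (30000, 50, 4),   # 32-bit floats, small vectors
    (30000, 300, 2),  # 16-bit floats, large vectors
]

for vocab_size, dim, bytes_per_value in configurations:
    size = embedding_size_mb(vocab_size, dim, bytes_per_value)
    print("%d words x %d dims x %d bytes = %.1f MB" % (vocab_size, dim, bytes_per_value, size))
```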

## Hints

A non-exhaustive list of things you might want to give a try:
- better tokenization
- experiment with pre-trained word representations such as [word2vec](https://code.google.com/archive/p/word2vec/), or [GloVe](http://nlp.stanford.edu/projects/glove/). Be aware that these representations might take a lot of parameters in your model. Be sure you use only the words you expect in the training/dev set and account for OOV words. When saving the model parameters, pre-trained word embeddings can simply be used in the word embedding matrix of your model. As said, make sure that this word embedding matrix does not contain all of word2vec or GloVe. Your submission is limited in size, and we will not allow uploading or using the whole representation set (up to 3GB!)
- reduced sizes of word representations
- bucketing and batching (our implementation is deliberately not a good one!)
  - make sure to draw random batches from the data! (we do not provide this in our code!)
- better models:
  - stacked RNNs (see tf.contrib.rnn.MultiRNNCell)
  - bi-directional RNNs
  - attention
  - word-by-word attention
  - conditional encoding
  - get model inspirations from papers on [nlp.stanford.edu/projects/snli/](https://nlp.stanford.edu/projects/snli/)
  - sequence-to-sequence encoder-decoder architecture for producing the right ordering
- better training procedure:
  - different training algorithms
  - dropout on the input and output embeddings (see tf.nn.dropout)
  - L2 regularization (see tf.nn.l2_loss)
  - gradient clipping (see tf.clip_by_value or tf.clip_by_norm)
- model selection:
  - early stopping
- hyper-parameter optimization (e.g. random search or grid search (expensive!))
    - initial learning rate
    - dropout probability
    - input and output size
    - L2 regularization
    - gradient clipping value
    - batch size
    - ...
- post-processing
  - for incorporating consistency constraints
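
One way to read the post-processing hint: a model that predicts the position of each sentence independently can assign the same position to two sentences. Since a story has only five sentences, a sketch like the following can simply search all 120 permutations for the highest-scoring consistent assignment. It is written in plain Python, the `scores` matrix is a hypothetical stand-in for per-sentence log-probabilities, and this is only one of several possible approaches.

```python
from itertools import permutations


def best_consistent_order(scores):
    """Return the permutation of positions with the highest total score.

    scores[i][p] is the model's score (e.g. a log-probability) for putting
    sentence i at position p. Every position is used exactly once.
    """
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Hypothetical scores for a 3-sentence story: an independent argmax would pick
# position 0 for both sentence 0 and sentence 1, which is not a valid ordering.
scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.1, 0.8],
]
print(best_consistent_order(scores))
```

For five sentences the exhaustive search is cheap; for longer sequences an assignment algorithm would be the more scalable choice.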

## Setup Instructions
It is important that this file is placed in the **correct directory**. It will not run otherwise. The correct directory is

    DIRECTORY_OF_YOUR_BOOK/assignments/2017/assignment3/problem/group_X/
    
where `DIRECTORY_OF_YOUR_BOOK` is a placeholder for the directory you downloaded the book to, and `X` in `group_X` is the number of your group.

After you placed it there, **rename the notebook file** to `group_X.ipynb`.

The notebook is pre-set to save models in

    DIRECTORY_OF_YOUR_BOOK/assignments/2017/assignment3/problem/group_X/model/

Be sure not to tinker with that directory - we expect your submission to contain a `model` subdirectory with a single saved model! 
The saving procedure might overwrite the latest save, or not. Make sure you understand what it does, and upload only a single model! (for more details check tf.train.Saver)

## General Instructions
This notebook will be used by you to provide your solution, and by us to both assess your solution and enter your marks. It contains three types of sections:

1. **Setup** Sections: these sections set up code and resources for assessment. **Do not edit, move nor copy these cells**.
2. **Assessment** Sections: these sections are used for both evaluating the output of your code, and for markers to enter their marks. **Do not edit, move, nor copy these cells**.
3. **Task** Sections: these sections require your solutions. They may contain stub code, and you are expected to edit this code. For free text answers simply edit the markdown field.  

**If you edit, move or copy any of the setup, assessments and mark cells, you will be penalised with -20 points**.

Note that you are free to **create additional notebook cells** within a task section. 

Please **do not share** this assignment nor the dataset publicly, by uploading it online, emailing it to friends etc.

## Submission Instructions

To submit your solution:

* Make sure that your solution is fully contained in this notebook. Make sure you do not use any additional files other than your saved model.
* Make sure that your solution runs linearly from start to end (no execution hops). We will run your notebook in that order.
* **Before you submit, make sure your submission is tested on the stat-nlp-book Docker setup to ensure that it does not have any dependencies outside of those that we provide. If your submission fails to adhere to this requirement, you will get 0 points**.
* **If running your notebook produces a trivially fixable error that we spot, we will correct it and penalise you with -20 points. Otherwise you will get 0 points for that solution.**
* **Rename this notebook to your `group_X`** (where `X` is the number of your group), and adhere to the directory structure requirements, if you have not already done so. **Failure to do so will result in -1 point.**
* Download the notebook in Jupyter via *File -> Download as -> Notebook (.ipynb)*.
* Your submission should be a zip file containing the `group_X` directory, containing `group_X.ipynb` notebook, and the `model` directory with the saved model
* Upload that file to the Moodle submission site.

## <font color='green'>Setup 1</font>: Load Libraries
This cell loads libraries important for evaluation and assessment of your model. **Do not change, move or copy it.**

In [1]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
#! SETUP 1 - DO NOT CHANGE, MOVE NOR COPY
import sys, os
_snlp_book_dir = "../../../../../"
sys.path.append(_snlp_book_dir)
# docker image contains tensorflow 0.10.0rc0. We will support execution of only that version!
import statnlpbook.nn as nn

import tensorflow as tf
import numpy as np

## <font color='green'>Setup 2</font>: Load Training Data

This cell loads the training data. **Do not edit the next cell, nor copy/duplicate it**. Instead refer to the variables in your own code, and slice and dice them as you see fit (but do not change their values). 
For example, no one stops you from introducing, in the corresponding task section, `my_train` and `my_dev` variables that split the data into different folds.   

In [2]:
#! SETUP 2 - DO NOT CHANGE, MOVE NOR COPY
data_path = _snlp_book_dir + "data/nn/"
data_train = nn.load_corpus(data_path + "train.tsv")
data_dev = nn.load_corpus(data_path + "dev.tsv")
assert(len(data_train) == 45502)

In [3]:
from IPython.display import clear_output, Image, display, HTML

def strip_consts(graph_def, max_const_size=32):
    """Strip large constant values from graph_def."""
    strip_def = tf.GraphDef()
    for n0 in graph_def.node:
        n = strip_def.node.add() 
        n.MergeFrom(n0)
        if n.op == 'Const':
            tensor = n.attr['value'].tensor
            size = len(tensor.tensor_content)
            if size > max_const_size:
                # tensor_content is a bytes field, so encode the placeholder text
                tensor.tensor_content = ("<stripped %d bytes>" % size).encode("utf-8")
    return strip_def

def show_graph(graph_def, max_const_size=32):
    """Visualize TensorFlow graph."""
    if hasattr(graph_def, 'as_graph_def'):
        graph_def = graph_def.as_graph_def()
    strip_def = strip_consts(graph_def, max_const_size=max_const_size)
    code = """
        <script>
          function load() {{
            document.getElementById("{id}").pbtxt = {data};
          }}
        </script>
        <link rel="import" href="https://tensorboard.appspot.com/tf-graph-basic.build.html" onload=load()>
        <div style="height:600px">
          <tf-graph-basic id="{id}"></tf-graph-basic>
        </div>
    """.format(data=repr(str(strip_def)), id='graph'+str(np.random.rand()))

    iframe = """
        <iframe seamless style="width:1200px;height:620px;border:0" srcdoc="{}"></iframe>
    """.format(code.replace('"', '&quot;'))
    display(HTML(iframe))

### Data Structures

Notice that the data is loaded from tab-separated files. The files are easy to read, and we provide the loading functions that load it into a simple data structure. Feel free to check details of the loading.

The data structure at hand is an array of dictionaries, each containing a `story` and an `order` entry. `story` is a list of strings, and `order` is a list of integer indices:

In [4]:
data_train[0]

{'order': [3, 2, 1, 0, 4],
 'story': ['His parents understood and decided to make a change.',
  'The doctors told his parents it was unhealthy.',
  'Dan was overweight as well.',
  "Dan's parents were overweight.",
  'They got themselves and Dan on a diet.']}

## <font color='blue'>Task 1</font>: Model implementation

Your primary task in this assignment is to implement a model that produces the right order of the sentences in the dataset.

### Preprocessing pipeline

First, we construct a preprocessing pipeline, in our case the `pipeline` function, which takes care of:
- out-of-vocabulary words
- building a vocabulary (on the train set), and applying the same unaltered vocabulary on other sets (dev and test)
- making sure that the length of input is the same for the train and dev/test sets (for fixed-sized models)

You are free (and encouraged!) to write your own input processing function. Should you experiment with recurrent neural networks, you will find that you will need to do so.
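
To illustrate the fixed-length requirement, here is a minimal pure-Python sketch that brings lists of token ids to a common length by truncating or by padding with the `<PAD>` id. The helper `pad_or_truncate` is a hypothetical name and not part of the provided code; the pad id of 0 is assumed to match the vocabulary built by the pipeline.

```python
PAD_ID = 0  # assumed id of the <PAD> token


def pad_or_truncate(token_ids, max_len, pad_id=PAD_ID):
    """Return a list of exactly max_len ids: clip long inputs, pad short ones."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


short_sentence = [5, 8, 2]
long_sentence = [5, 8, 2, 9, 4, 7]

print(pad_or_truncate(short_sentence, 5))
print(pad_or_truncate(long_sentence, 5))
```

Truncation matters for the dev and test sets: a sentence there can be longer than the longest training sentence.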

In [10]:
# TODO create tokenizer so it handles the following:
# - (n't 's 'm 're 've 'll 'd) word endings to be separated (shouldn't -> (should, n't))
# - punctuation at the end of sentences (This is a sentence. -> (..., a, sentence, .))
#     - try to make sure if possible to filter out only sentence ending punctuation (U.S. etc. Mr. St.) should be kept
#     - also words like e.g. a.m. p.m.
# - separate numbers from others ($5 -> $, 5)
#
# OR find a library that is included in the docker image (nltk and spacy aren't...) that does that for us
#
# decide what to do with words not in GloVe (random embedding?)

# tokenisation
punctuation = ".,:;?!"
endings_2 = ["'s", "'m", "'d"]
endings_3 = ["n't", "'re", "'ve", "'ll"]
valid_words = ['e.g.', 'a.m.', 'p.m.', 'U.S.', 'etc.', 'i.e.', 'Mr.', 'Mrs.', 'Ms.', 'St.']
currency = '$€£'

def tokenize_word(word):    
    if len(word) == 1:
        return [word]
    
    if word.isalpha():
        return [word]

    if word in valid_words:
        return [word]
    
    if word[0] in currency:
        tokens = tokenize_word(word[1:])
        sign = word[0]
        tokens.insert(0, sign)
        return tokens
    
    if word[-1] in punctuation:
        tokens = tokenize_word(word[:-1])
        tokens.append(word[-1])
        return tokens
    
    if len(word) > 2 and word[-2:] in endings_2:
        tokens = tokenize_word(word[:-2])
        tokens.append(word[-2:])
        return tokens
    
    if len(word) > 3 and word[-3:] in endings_3:
        tokens = tokenize_word(word[:-3])
        tokens.append(word[-3:])
        return tokens
    
    return [word]
        
def tokenize_sent(sent):
    sent = sent.split(' ')
    ret = [tokens for word in sent if len(word) > 0 for tokens in tokenize_word(word) ]
    return ret

In [11]:
# preprocessing pipeline, used to load the data into a structure required by the model
def pipeline(data, vocab=None, max_sent_len_=None):
    is_ext_vocab = True
    if vocab is None:
        is_ext_vocab = False
        vocab = {'<PAD>': 0, '<OOV>': 1}

    max_sent_len = -1
    data_sentences = []
    data_orders = []
    for instance in data:
        sents = []
        for sentence in instance['story']:
            sent = []
            tokenized = tokenize_sent(sentence)
            for token in tokenized:
                if not is_ext_vocab and token not in vocab:
                    vocab[token] = len(vocab)
                if token not in vocab:
                    token_id = vocab['<OOV>']
                else:
                    token_id = vocab[token]
                sent.append(token_id)
            if len(sent) > max_sent_len:
                max_sent_len = len(sent)
            sents.append(sent)
        data_sentences.append(sents)
        data_orders.append(instance['order'])

    if max_sent_len_ is not None:
        max_sent_len = max_sent_len_
    out_sentences = np.full([len(data_sentences), 5, max_sent_len], vocab['<PAD>'], dtype=np.int32)

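    for i, elem in enumerate(data_sentences):
        for j, sent in enumerate(elem):
            # truncate sentences longer than max_sent_len (possible when max_sent_len_ is given)
            sent = sent[:max_sent_len]
            out_sentences[i, j, 0:len(sent)] = sent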

    out_orders = np.array(data_orders, dtype=np.int32)

    return out_sentences, out_orders, vocab

In [6]:
# convert train set to integer IDs
train_stories, train_orders, vocab = nn.pipeline(data_train)

In [13]:
vocab

{'<PAD>': 0,
 '<OOV>': 1,
 'His': 2,
 'parents': 3,
 'understood': 4,
 'and': 5,
 'decided': 6,
 'to': 7,
 'make': 8,
 'a': 9,
 'change': 10,
 '.': 11,
 'The': 12,
 'doctors': 13,
 'told': 14,
 'his': 15,
 'it': 16,
 'was': 17,
 'unhealthy': 18,
 'Dan': 19,
 'overweight': 20,
 'as': 21,
 'well': 22,
 "'s": 23,
 'were': 24,
 'They': 25,
 'got': 26,
 'themselves': 27,
 'on': 28,
 'diet': 29,
 'She': 30,
 'did': 31,
 "n't": 32,
 'have': 33,
 'bike': 34,
 'of': 35,
 'her': 36,
 'own': 37,
 'Carrie': 38,
 'had': 39,
 'just': 40,
 'learned': 41,
 'how': 42,
 'ride': 43,
 'nervous': 44,
 'hill': 45,
 'crashed': 46,
 'into': 47,
 'wall': 48,
 'frame': 49,
 'bent': 50,
 'deep': 51,
 'gash': 52,
 'leg': 53,
 'would': 54,
 'sneak': 55,
 'rides': 56,
 'sister': 57,
 'Morgan': 58,
 'propose': 59,
 'boyfriend': 60,
 'Her': 61,
 'upset': 62,
 'he': 63,
 'first': 64,
 'After': 65,
 'walking': 66,
 'for': 67,
 'over': 68,
 'mile': 69,
 ',': 70,
 'something': 71,
 'happened': 72,
 'go': 73,
 'long': 74,

You need to make sure that the `pipeline` function returns the necessary data for your computational graph feed - the required inputs in this case, as we will call this function to process your dev and test data. If you do not make sure that the same pipeline applied to the train set is applied to other datasets, your model may not work with that data!

In [7]:
# get the length of the longest sentence
max_sent_len = train_stories.shape[2]

# convert dev set to integer IDs, based on the train vocabulary and max_sent_len
dev_stories, dev_orders, _ = nn.pipeline(data_dev, vocab=vocab, max_sent_len_=max_sent_len)

You can take a look at the result of the `pipeline` with the `show_data_instance` function to make sure that your data loaded correctly:

In [8]:
nn.show_data_instance(dev_stories, dev_orders, vocab, 155)

Input:
 Story:
  The manager decided to offer John the job.
  During the interview he was very <OOV> and <OOV>
  He went to the interview very prepared and nicely dressed.
  John was excited to have a job interview.
  The manager of the company was really impressed by John's comments.
 Order:
  [4 2 1 0 3]

Desired story:
  John was excited to have a job interview.
  He went to the interview very prepared and nicely dressed.
  During the interview he was very <OOV> and <OOV>
  The manager of the company was really impressed by John's comments.
  The manager decided to offer John the job.


In [16]:
import pickle
import csv
import sys
import numpy as np
from collections import defaultdict
import time
csv.field_size_limit(sys.maxsize)

131072

In [17]:
PAD_TOKEN = '<PAD>'
OOV_TOKEN = '<OOV>'

In [18]:

def readAndDumpCSV(txtfile,output_name,save=True):
    mydict = {}
    with open(txtfile,'r') as f:
        for line in f:
            mydict[line.split(' ')[0]] = np.array([float(n) for n in line.split(' ')[1:]])
        if(save):
            pickle_out = open("{}.pickle".format(output_name),"wb")
            pickle.dump(mydict, pickle_out)
            pickle_out.close()
        return mydict
            


In [19]:

glove_50_D = readAndDumpCSV("./glove/glove.6B.50d.txt","glove50D")

In [20]:

def findMeanEmbedding(embeddingDict):
    vectorList = list(embeddingDict.values())
    return np.mean(vectorList,axis=0), np.std(vectorList,axis=0)

In [21]:
mean_glove_50_D, std_glove_50_D= findMeanEmbedding(glove_50_D)
# randn draws zero-mean Gaussian noise; rand would add a strictly positive offset to the mean
mean_glove_50_D_with_noise = mean_glove_50_D + np.random.randn(50)*std_glove_50_D

In [22]:

####      NOTE    ######
# By taking embeddings from glove that are not in the given vocab, we effectively
# increase the length of our vocab: a word that is not in the train vocab but is in the test set
# will then have an embedding. For this to work, every such extra word needs an index of its own,
# so it is appended to the global `vocab` in the second loop below.


def createWordEmbeddings(pre_learned_embeddings, total_size):
    '''
    Using a pre-trained word embeddings dictionary, create our (reduced in size) dictionary of embeddings.
    Make sure that all the words in our vocab are embedded, and also use the embeddings of the most popular words,
    as long as the total_size of the dictionary is not exceeded.
    pre_learned_embeddings : Dictionary of word embeddings, can come from glove or word2vec
    total_size : Length of our output embeddings dictionary
    Returns : A dictionary mapping word indices (as in vocab) to embeddings
    '''
    
    if(total_size >=len(pre_learned_embeddings)):
        #Undefined behaviour in the above case
        raise ValueError("Total size is too big")
    
    #Get the dimension of the embeddings
    dim = len(pre_learned_embeddings["the"])
    
    #Compute the mean vector and std vector for the glove embeddings
    mean_glove, std_glove = findMeanEmbedding(pre_learned_embeddings)

    #Initialise the embeddings dict (word index -> vector)
    embeddings = dict()
    
    #Iterate over all the words in our vocabulary
    for word, word_index in vocab.items():
        #Stop if we reach our total desired size : 
        #Careful : words left without an embedding may cause bugs later -> Maybe its better not to allow total_size<vocab_size
        if(len(embeddings) >= total_size):
            print("Warning: Total size reached before full vocab was embedded")
            break
            
            
        if(word == PAD_TOKEN):
            #Set the <PAD> token to 0
            embeddings[word_index] = np.zeros(dim)
        elif(word == OOV_TOKEN ):
            #Initialize the <OOV> token to 1. Update later (c.f. below)
            embeddings[word_index] = np.ones(dim)
        elif(word in pre_learned_embeddings):
            #If the word is in the glove dictionary, use this embedding
            embeddings[word_index] = pre_learned_embeddings[word]
        else:
            #If not, set its embedding to a random vector with
            #mean the average glove embedding and std the std of the glove embeddings
            #TODO : think if there is a better way to assign vectors in our vocab that are not in Glove
            embeddings[word_index] = mean_glove + np.random.randn(dim)*std_glove
            
            
    #Make some more embeddings than the words that are in our vocab:
    #Iterate over the glove Word embeddings
    for word, pre_learned_embedding in pre_learned_embeddings.items():
        #If we exceed our total desired length, stop
        if(len(embeddings) >= total_size):
            break
        
        #Add words that are not already in the vocab, giving each one a fresh index
        #(this assumes the vocab indices are contiguous, as produced by the pipeline)
        if(word not in vocab):
            vocab[word] = len(vocab)
            embeddings[vocab[word]] = pre_learned_embedding
   

    #Update the OOV embedding : The idea is to set it the average value of the unused Glove embeddings
    #To do so set the OOV value to the mean of all glove embeddings - mean of our embeddings 
    #ALERT : This mean is corrupted by the noise we are adding in the else clause above and by the embeddings
    # for the <PAD> and <OOV> tokens -> Not sure if we should care or not
    mean_embeddings , std_embeddings = findMeanEmbedding(embeddings)
    OOV_value = mean_glove - mean_embeddings
    embeddings[vocab[OOV_TOKEN]] = OOV_value
    
    return embeddings



In [None]:
embeddings = createWordEmbeddings(glove_50_D, 100000)

In [23]:
# indices of the words in our vocab that have no pre-trained GloVe vector
not_in_glove = [word_index for word, word_index in vocab.items()
                if word not in glove_50_D and word not in (PAD_TOKEN, OOV_TOKEN)]

In [None]:
# print out words not in GloVe (they need to be preprocessed better)
words_in_vocab = list(vocab.keys())
for i in not_in_glove:
    print(words_in_vocab[i])

In [None]:
#with open('glove.840B.300d/glove.840B.300d.txt', 'r') as f:
#    for i in range(10000):
#        word = f.readline().split(" ")[0]
#        if not word.isalpha() and not all(char.isdigit() for char in word) and any(c.isalpha() for c in word):
#            print(word)

In [None]:
embeddings

In [None]:
import pickle
with open('word_embeddings.pkl', 'wb') as f:
        pickle.dump(embeddings, f, pickle.HIGHEST_PROTOCOL)

### Model

The model we provide is a rudimentary, non-optimised model that essentially represents every word in a sentence with a fixed vector, sums these vectors up (per sentence) and puts a softmax at the end which aims to guess the order of sentences independently.

First we define the model parameters:

In [9]:
 #Imports
from tensorflow.contrib import rnn 

In [10]:

### MODEL PARAMETERS ###
target_size = 5
vocab_size = len(vocab)
input_size = 10
# n = len(train_stories)
output_size = 5

n_hidden_1 = 128
n_hidden_2 = 128
num_hidden_lstm = 16
sentence_embedding_dim = 32 

and then we define the model:

In [11]:
def leaky_relu(x):
    return tf.maximum(x, 0.1*x) 

In [12]:
def RNN(x, weights, biases, fwd_cell, sent_lens):

    # Get lstm cell output - dynamic_rnn allows for different sequence lengths: past the end of a
    # sequence the outputs are set to 0 and the state is simply carried over
    outputs, final_states = tf.nn.dynamic_rnn(fwd_cell, x, dtype=tf.float32, sequence_length=sent_lens)
    
    # for an LSTM cell final_states is an LSTMStateTuple (c, h); h is the output at the last valid time step
    final_output = final_states.h

    return tf.matmul(final_output, weights) + biases   


In [13]:
def BiRNN(x, weights, biases, fwd_cell, bwd_cell, sent_lens):
    
    outputs, final_states = tf.nn.bidirectional_dynamic_rnn(fwd_cell, bwd_cell, x, dtype=tf.float32, sequence_length=sent_lens)
    
    # final_states is a tuple (forward_state, backward_state); each one is an LSTMStateTuple (c, h)
    fwd_state, bwd_state = final_states
    
    output_concat = tf.concat([fwd_state.h, bwd_state.h], 1) # concatenate the forward pass and backwards pass
    
    return tf.matmul(output_concat, weights) + biases

In [14]:
### MODEL ###
tf.reset_default_graph() 


## PLACEHOLDERS
story = tf.placeholder(tf.int64, [None, None, None], "story")        # [batch_size x 5 x max_length]
order = tf.placeholder(tf.int64, [None, None], "order")              # [batch_size x 5]
dropout_prob = tf.placeholder(tf.float32) # keep probability for dropout (a placeholder so that it can be set to 1.0, i.e. no dropout, at prediction time)

batch_size = tf.shape(story)[0]
max_length = tf.shape(story)[2]

sentences = [tf.reshape(x, [batch_size, -1]) for x in tf.split(axis=1, num_or_size_splits=5, value=story)]  # 5 times [batch_size x max_length]

# We need the lengths of each sentence as an input to the dynamic_rnn - find the number of non zero elements
# in each sentence
sentence_lengths = [tf.count_nonzero(sentence, 1) for sentence in sentences] 

# Word embeddings
initializer = tf.glorot_uniform_initializer()
embeddings = tf.get_variable("W", [vocab_size, input_size], initializer=initializer)

sentences_embedded = [tf.nn.embedding_lookup(embeddings, sentence)   # [batch_size x max_seq_length x input_size]
                      for sentence in sentences]

# Define lstm cells
fwd_cell = rnn.BasicLSTMCell(num_hidden_lstm, forget_bias=1.0) #forward lstm cell
bwd_cell = rnn.BasicLSTMCell(num_hidden_lstm, forget_bias=1.0) # backwards lstm cell for bidirectional
fwd_cell = rnn.DropoutWrapper(fwd_cell, output_keep_prob=dropout_prob)
bwd_cell = rnn.DropoutWrapper(bwd_cell, output_keep_prob=dropout_prob) 

# Weights and biases to be applied to the final output of each sentence embedding
weights = {
    'lstm' : tf.Variable(tf.random_normal([num_hidden_lstm, sentence_embedding_dim])),
    'bi' : tf.Variable(tf.random_normal([2*num_hidden_lstm, sentence_embedding_dim])) # 2 times num_hidden_lstm because we concat the outputs of the 2 directions
}
biases = {
    'lstm' : tf.Variable(tf.random_normal([sentence_embedding_dim])),
    'bi' : tf.Variable(tf.random_normal([sentence_embedding_dim]))
}

# Get a list of sentence embeddings: 5 times [batch_size x sentence_embedding_dim]
# Forward LSTM
#sentence_codes = [RNN(sentences_embedded[i], weights['lstm'], biases['lstm'], fwd_cell, sentence_lengths[i]) for i in range(0,5)]
# Bidirectional LSTM
bi = [BiRNN(sentences_embedded[i], weights['bi'], biases['bi'], fwd_cell, bwd_cell, sentence_lengths[i]) for i in range(0,5)]

h_temp = tf.concat(axis=1, values=bi)    # [batch_size x 5*sentence_embedding_dim]
h = tf.reshape(h_temp, [batch_size, 5*sentence_embedding_dim])

# Currently ignoring fully connected layers
#dense1 = tf.contrib.layers.fully_connected(inputs=h, num_outputs=n_hidden_1, activation_fn=lambda x:leaky_relu(x))
#dropout1 = tf.layers.dropout(inputs=dense1, rate=1.0)
#dense2 = tf.contrib.layers.fully_connected(inputs=dropout1, num_outputs=n_hidden_2, activation_fn=lambda x:leaky_relu(x))
#dropout2 = tf.layers.dropout(inputs=dense2, rate=1.0)

logits_flat = tf.contrib.layers.fully_connected(inputs=h, num_outputs=5*target_size, activation_fn=None)  # [batch_size x 5*target_size]
logits = tf.reshape(logits_flat, [-1, 5, target_size])        # [batch_size x 5 x target_size]

# loss 
temp = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=order)
loss = tf.reduce_sum(temp)

# prediction function
unpacked_logits = [tensor for tensor in tf.unstack(logits, axis=1)]
softmaxes = [tf.nn.softmax(tensor) for tensor in unpacked_logits]
softmaxed_logits = tf.stack(softmaxes, axis=1)

predict = tf.arg_max(softmaxed_logits, 2)

We have built our model, together with the loss and the prediction function. All that is left now is to build an optimiser on the loss:

In [15]:
opt_op = tf.train.AdamOptimizer(0.1).minimize(loss) 

In [16]:
show_graph(tf.get_default_graph().as_graph_def())    

### Model training 

We have defined the preprocessing pipeline and set the model up, so we can finally train the model:

In [None]:
# Defined again so easy to change size in cell below
train_stories, train_orders, vocab = nn.pipeline(data_train)

In [None]:
BATCH_SIZE = 256
#train_stories = train_stories[:1000]
#train_orders = train_orders[:1000]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # replaces the deprecated tf.initialize_all_variables()
    n = train_stories.shape[0]
    
    
    for epoch in range(50):
        print('----- Epoch', epoch, '-----')
        total_loss = 0
        perm = np.random.permutation(n)
        for i in range(n // BATCH_SIZE):
            indices = perm[i * BATCH_SIZE: (i + 1) * BATCH_SIZE]
            inst_story = train_stories[indices]
            inst_order = train_orders[indices]
            feed_dict = {story: inst_story, order: inst_order, dropout_prob: 0.5}
            _, current_loss= sess.run([opt_op, loss], feed_dict=feed_dict)
            if i % 10 == 0:
                print("Current Epoch %: " + str(round(i / (n // BATCH_SIZE)*100)))
            total_loss += current_loss

        print(' Train loss:', total_loss / n)

        train_feed_dict = {story: train_stories, order: train_orders, dropout_prob: 1.0}
        train_predicted = sess.run(predict, feed_dict=train_feed_dict)
        train_accuracy = nn.calculate_accuracy(train_orders, train_predicted)
        print(' Train accuracy:', train_accuracy)
        
        dev_feed_dict = {story: dev_stories, order: dev_orders, dropout_prob: 1.0}
        dev_predicted = sess.run(predict, feed_dict=dev_feed_dict)
        dev_accuracy = nn.calculate_accuracy(dev_orders, dev_predicted)
        print(' Dev accuracy:', dev_accuracy)

        
    
    nn.save_model(sess)

----- Epoch 0 -----
Current Epoch %: 0
Current Epoch %: 6
Current Epoch %: 11
Current Epoch %: 17
Current Epoch %: 23
Current Epoch %: 28
Current Epoch %: 34
Current Epoch %: 40
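
The train and dev accuracies printed by the training loop come from `nn.calculate_accuracy`, whose implementation is not shown in this section. Assuming it measures accuracy at the sentence level (the fraction of sentences assigned to their correct position), a minimal sketch could look like the following. The function name `sentence_level_accuracy` and the toy orders are hypothetical, and the provided helper may differ in detail.

```python
def sentence_level_accuracy(gold_orders, predicted_orders):
    """Fraction of individual sentences placed at their gold position.

    Both arguments are sequences of stories, each story being a
    sequence of 5 position indices. This is an illustrative stand-in,
    not the actual nn.calculate_accuracy.
    """
    correct = 0
    total = 0
    for gold, pred in zip(gold_orders, predicted_orders):
        for g, p in zip(gold, pred):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0

# Two toy stories: the first has two misplaced sentences, the second is perfect.
gold = [[3, 2, 4, 1, 0], [0, 1, 2, 3, 4]]
pred = [[3, 2, 4, 0, 1], [0, 1, 2, 3, 4]]
accuracy = sentence_level_accuracy(gold, pred)  # 8 of 10 positions match
```

Under this definition a story with a single swapped pair still earns credit for its three correctly placed sentences, which is more forgiving than requiring the whole order to be right.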


## <font color='red'>Assessment 1</font>: Assess Accuracy (40 pts) 

We assess how well your model performs on an unseen test set. We will look at the accuracy of the predicted sentence order, at the sentence level, and will score it as follows:

* 0 - 10 pts: 45% <= accuracy < 50%, linear
* 10 - 20 pts: 50% <= accuracy < 55%, linear
* 20 - 40 pts: 55% <= accuracy < 60%, linear
* extra 0 - 10 pts: 60% <= accuracy < 70%, linear

The **linear** mapping maps any accuracy value between the lower and upper bound linearly to a score. For example, if your model's accuracy score is $acc=54.5\%$, then your score is $10 + 10\frac{acc-50}{55-50}$.
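
The mapping from accuracy to points can be sketched as a small function. This is only an illustration of the bands listed above, not the marking script: the name `accuracy_to_points` is made up, and the behaviour outside the listed bands (0 points below 45%, a cap of 50 points from 70% upwards) is an assumption.

```python
def accuracy_to_points(acc):
    """Map an accuracy in percent to points, band by band (illustrative only)."""
    # (lower accuracy, upper accuracy, points at lower bound, points at upper bound)
    bands = [
        (45.0, 50.0, 0.0, 10.0),
        (50.0, 55.0, 10.0, 20.0),
        (55.0, 60.0, 20.0, 40.0),
        (60.0, 70.0, 40.0, 50.0),  # the 40 regular points plus 0 - 10 extra
    ]
    if acc < 45.0:
        return 0.0  # assumption: no points below the first band
    for low, high, pts_low, pts_high in bands:
        if low <= acc < high:
            # Linear interpolation within the band.
            return pts_low + (pts_high - pts_low) * (acc - low) / (high - low)
    return 50.0  # assumption: capped at 40 + 10 extra points

example_points = accuracy_to_points(54.5)  # 10 + 10 * (54.5 - 50) / (55 - 50)
```

For the example accuracy of 54.5% this gives 19 points, in line with the formula above.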

Change the following lines so that they construct the test set in the same way you constructed the dev set in the code above. We will insert the test set instead of the dev set here. **The `test_feed_dict` variable must keep its name**.

In [None]:
# LOAD THE DATA
data_test = nn.load_corpus(data_path + "dev.tsv")
# make sure you process this with the same pipeline as you processed your dev set
test_stories, test_orders, _ = nn.pipeline(data_test, vocab=vocab, max_sent_len_=max_sent_len)

# THIS VARIABLE MUST BE NAMED `test_feed_dict`
test_feed_dict = {story: test_stories, order: test_orders}

The following code loads your model, computes accuracy, and exports the result. **DO NOT** change this code.

In [None]:
#! ASSESSMENT 1 - DO NOT CHANGE, MOVE NOR COPY
with tf.Session() as sess:
    # LOAD THE MODEL
    saver = tf.train.Saver()
    saver.restore(sess, './model/model.checkpoint')
    
    # RUN TEST SET EVALUATION
    dev_predicted = sess.run(predict, feed_dict=test_feed_dict)
    dev_accuracy = nn.calculate_accuracy(dev_orders, dev_predicted)

dev_accuracy

## <font color='orange'>Mark</font>:  Your solution to Task 1 is marked with ** __ points**. 
---

## <font color='blue'>Task 2</font>: Describe your Approach

Enter a description of your approach (1000 words maximum) **in this cell**.
Make sure to provide:
- an **error analysis** of the types of errors your system makes
- a **comparison** of your system with the model we provide, focusing on the differences and drawing useful comparisons between them

Should you need to include figures in your report, make sure they are Python-generated (matplotlib, seaborn, and bokeh are all included in the stat-nlp-book Docker image). For that, feel free to create new cells after this cell (before the Assessment 2 cell). Link online images at your own risk.

...WRITE YOUR DESCRIPTION HERE...

## <font color='red'>Assessment 2</font>: Assess Description (60 pts) 

We will mark the description along the following dimensions: 

* Clarity (10pts: very clear, 0pts: we can't figure out what you did, or you did nothing)
* Creativity (25pts: we could not have come up with this, 0pts: Use only the provided model)
* Substance (25pts: implemented complex state-of-the-art classifier, compared it to a simpler model, 0pts: Only use what is already there)

## <font color='orange'>Mark</font>:  Your solution to Task 2 is marked with ** __ points**.
---

## <font color='orange'>Final mark</font>: Your solution to Assignment 3 is marked with ** __ points**. 