<!---
Latex Macros
-->
$$
\newcommand{\bar}{\,|\,}
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\weights}{\mathbf{w}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

# Assignment 3

## Introduction

In this last assignment, you will apply deep learning methods to solve a particular story understanding problem. Automatic understanding of stories is an important task in natural language understanding [[1]](http://anthology.aclweb.org/D/D13/D13-1020.pdf). Specifically, you will develop a model that, given a sequence of sentences, learns to sort these sentences so that they yield a coherent story [[2]](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/short-commonsense-stories.pdf). This sounds trivial for humans (and to an extent it is), but it is quite a difficult task for machines, as it involves commonsense knowledge and temporal understanding.

## Goal

You are given a dataset of 45502 instances, each consisting of 5 sentences. Your system needs to output a sequence of numbers which represents the predicted order of these sentences. For example, given a story:

    He went to the store.
    He found a lamp he liked.
    He bought the lamp.
    Jan decided to get a new lamp.
    Jan's lamp broke.

your system needs to provide an answer in the following form:

    2	3	4	1	0

where the numbers correspond to the zero-based index of each sentence in the correctly ordered story. So "`2`" for "`He went to the store.`" means that this sentence should come 3rd in the correctly ordered target story. In this particular example, this order of indices corresponds to the following target story:

    Jan's lamp broke.
    Jan decided to get a new lamp.
    He went to the store.
    He found a lamp he liked.
    He bought the lamp.
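
The mapping between the index sequence and the target story can be sketched in a few lines of plain Python. The helper name `reorder_story` is ours, chosen for illustration; it is not part of the provided code.

```python
def reorder_story(story, order):
    """Place each sentence at the zero-based target position given in `order`."""
    ordered = [None] * len(story)
    for sentence, position in zip(story, order):
        ordered[position] = sentence
    return ordered


story = [
    "He went to the store.",
    "He found a lamp he liked.",
    "He bought the lamp.",
    "Jan decided to get a new lamp.",
    "Jan's lamp broke.",
]
order = [2, 3, 4, 1, 0]

for sentence in reorder_story(story, order):
    print(sentence)
```

Running this prints the five sentences in the order of the target story shown above.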

## Resources

To develop your model(s), we provide training and development datasets. The test dataset is held out, and we will use it to evaluate your models. The test set comes from the same task distribution, so you should not expect drastic differences in it.

You will use [TensorFlow](https://www.tensorflow.org/) to build a deep learning model for the task. We provide a very crude system which solves the task with a low accuracy, and a set of additional functions you will have to use to save and load the model you create so that we can run it.

As we have to run the notebooks of each submission, and as deep learning models take a long time to train, your notebook **NEEDS** to conform to the following requirements:
* You **NEED** to run your parameter optimisation offline, and provide your final model saved by using the provided function
* The maximum size of a zip file you can upload to moodle is 160MB. We will **NOT** allow submissions larger than that.
* We do not have time to train your models from scratch! You **NEED** to provide the full code you used for the training of your model, but you **CANNOT**, under any circumstances, call the training method in the notebook you send to us.
* We will run these notebooks automatically. If your notebook runs the training procedure, in addition to loading the model, and we need to edit your code to stop the training, you will be penalised with **-20 points**.
* If you do not provide a pretrained model, and rely on training your model on our machines, you will get **0 points**.
* Your submission will be tested on the stat-nlp-book Docker image to ensure that it does not have any dependencies outside of those that we provide. If your submission fails to adhere to this requirement, you will get **0 points**.

Running time and memory issues:
* We have tested a possible solution on a mid-2014 MacBook Pro, and a few epochs of the model run in less than 3 minutes, so it is possible to train a model on the data in reasonable time. Be aware, however, that you will need to run these models many times over, and for a larger number of epochs. (More elaborate models trained on much larger datasets can train for weeks, but this shouldn't be the case here.) If you find training times too long for your development cycle, you can reduce the training set size, and increase it again once you have found a good solution. Caveat: model parameters tuned on a smaller dataset may not be optimal for a larger training set.
* In addition to this, as your submission is capped by size, feel free to experiment with different model sizes, numeric values of different precisions, filtering the vocabulary size, downscaling some vectors, etc.
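
To get a feel for the size cap, the storage needed by a dense embedding matrix can be estimated as vocabulary size × vector dimension × bytes per value. The sketch below uses hypothetical vocabulary sizes and dimensions purely for illustration; the helper `embedding_size_mb` is not part of the provided code.

```python
def embedding_size_mb(vocab_size, dim, bytes_per_value):
    """Approximate size of a dense embedding matrix in megabytes (1 MB = 10**6 bytes)."""
    return vocab_size * dim * bytes_per_value / 1e6


# Hypothetical sizes, for illustration only.
full_pretrained = embedding_size_mb(3000000, 300, 4)  # large pretrained set, float32
filtered = embedding_size_mb(30000, 50, 4)            # filtered vocabulary, float32
filtered_half = embedding_size_mb(30000, 50, 2)       # same matrix stored as float16

print(full_pretrained, filtered, filtered_half)
```

Under these assumptions, filtering the vocabulary has a far larger effect on the submission size than reducing the numeric precision.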

## Hints

A non-exhaustive list of things you might want to give a try:
- better tokenization
- experiment with pre-trained word representations such as [word2vec](https://code.google.com/archive/p/word2vec/) or [GloVe](http://nlp.stanford.edu/projects/glove/). Be aware that these representations might add a lot of parameters to your model. Be sure to use only the words you expect in the training/dev set, and account for OOV words. When saving the model parameters, pre-trained word embeddings can simply be stored in the word embedding matrix of your model. As said, make sure that this word embedding matrix does not contain all of word2vec or GloVe. Your submission size is limited, and we will not allow uploading or using the whole set of representations (up to 3GB!)
- reduced sizes of word representations
- bucketing and batching (our implementation is deliberately not a good one!)
  - make sure to draw random batches from the data! (we do not provide this in our code!)
- better models:
  - stacked RNNs (see tf.contrib.rnn.MultiRNNCell)
  - bi-directional RNNs
  - attention
  - word-by-word attention
  - conditional encoding
  - get model inspirations from papers on [nlp.stanford.edu/projects/snli/](https://nlp.stanford.edu/projects/snli/)
  - sequence-to-sequence encoder-decoder architecture for producing the right ordering
- better training procedure:
  - different training algorithms
  - dropout on the input and output embeddings (see tf.nn.dropout)
  - L2 regularization (see tf.nn.l2_loss)
  - gradient clipping (see tf.clip_by_value or tf.clip_by_norm)
- model selection:
  - early stopping
- hyper-parameter optimization (e.g. random search or grid search (expensive!))
    - initial learning rate
    - dropout probability
    - input and output size
    - L2 regularization
    - gradient clipping value
    - batch size
    - ...
- post-processing
  - for incorporating consistency constraints

## Setup Instructions
It is important that this file is placed in the **correct directory**. It will not run otherwise. The correct directory is

    DIRECTORY_OF_YOUR_BOOK/assignments/2017/assignment3/problem/group_X/
    
where `DIRECTORY_OF_YOUR_BOOK` is a placeholder for the directory you downloaded the book to, and `X` in `group_X` is the number of your group.

After you have placed it there, **rename the notebook file** to `group_X.ipynb`.

The notebook is pre-set to save models in

    DIRECTORY_OF_YOUR_BOOK/assignments/2017/assignment3/problem/group_X/model/

Be sure not to tinker with that directory - we expect your submission to contain a `model` subdirectory with a single saved model! 
The saving procedure might overwrite the latest save, or not. Make sure you understand what it does, and upload only a single model! (for more details check tf.train.Saver)

## General Instructions
This notebook will be used by you to provide your solution, and by us to both assess your solution and enter your marks. It contains three types of sections:

1. **Setup** Sections: these sections set up code and resources for assessment. **Do not edit, move nor copy these cells**.
2. **Assessment** Sections: these sections are used for both evaluating the output of your code, and for markers to enter their marks. **Do not edit, move, nor copy these cells**.
3. **Task** Sections: these sections require your solutions. They may contain stub code, and you are expected to edit this code. For free text answers simply edit the markdown field.  

**If you edit, move or copy any of the setup, assessments and mark cells, you will be penalised with -20 points**.

Note that you are free to **create additional notebook cells** within a task section. 

Please **do not share** this assignment nor the dataset publicly, by uploading it online, emailing it to friends etc.

## Submission Instructions

To submit your solution:

* Make sure that your solution is fully contained in this notebook. Make sure you do not use any additional files other than your saved model.
* Make sure that your solution runs linearly from start to end (no execution hops). We will run your notebook in that order.
* **Before you submit, make sure your submission is tested on the stat-nlp-book Docker setup to ensure that it does not have any dependencies outside of those that we provide. If your submission fails to adhere to this requirement, you will get 0 points**.
* **If running your notebook produces a trivially fixable error that we spot, we will correct it and penalise you with -20 points. Otherwise you will get 0 points for that solution.**
* **Rename this notebook to your `group_X`** (where `X` is the number of your group), and adhere to the directory structure requirements, if you have not already done so. **Failure to do so will result in -1 point.**
* Download the notebook in Jupyter via *File -> Download as -> Notebook (.ipynb)*.
* Your submission should be a zip file containing the `group_X` directory, which contains the `group_X.ipynb` notebook and the `model` directory with the saved model.
* Upload that file to the Moodle submission site.

## <font color='green'>Setup 1</font>: Load Libraries
This cell loads libraries important for evaluation and assessment of your model. **Do not change, move or copy it.**

In [1]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
#! SETUP 1 - DO NOT CHANGE, MOVE NOR COPY
import sys, os
_snlp_book_dir = "../../../../../"
sys.path.append(_snlp_book_dir)
# docker image contains tensorflow 0.10.0rc0. We will support execution of only that version!
import statnlpbook.nn as nn

import tensorflow as tf
import numpy as np
import tensorflow.contrib as keras


## <font color='green'>Setup 2</font>: Load Training Data

This cell loads the training data. **Do not edit the next cell, nor copy/duplicate it**. Instead refer to the variables in your own code, and slice and dice them as you see fit (but do not change their values). 
For example, no one stops you from introducing, in the corresponding task section, `my_train` and `my_dev` variables that split the data into different folds.   

In [2]:
#! SETUP 2 - DO NOT CHANGE, MOVE NOR COPY
data_path = _snlp_book_dir + "data/nn/"
data_train = nn.load_corpus(data_path + "train.tsv")
data_dev = nn.load_corpus(data_path + "dev.tsv")
assert(len(data_train) == 45502)
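
A custom split into `my_train` and `my_dev`, as suggested above, could look like the following sketch. It assumes that each instance is a dictionary with a `story` list and an `order` list (see Data Structures below) and works on a toy stand-in rather than on `data_train` itself; the helper names are ours.

```python
import random

# Toy stand-in for `data_train`; the real list is loaded by nn.load_corpus.
toy_data = [
    {"story": ["s%d_%d" % (i, j) for j in range(5)], "order": [1, 0, 3, 4, 2]}
    for i in range(10)
]


def is_valid_instance(instance):
    """Check that `order` is a permutation of the sentence indices."""
    n = len(instance["story"])
    return sorted(instance["order"]) == list(range(n))


def split_data(data, dev_fraction=0.2, seed=0):
    """Shuffle indices and split into two new lists, leaving `data` untouched."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    n_dev = int(len(data) * dev_fraction)
    dev = [data[i] for i in indices[:n_dev]]
    train = [data[i] for i in indices[n_dev:]]
    return train, dev


assert all(is_valid_instance(inst) for inst in toy_data)
my_train, my_dev = split_data(toy_data)
print(len(my_train), len(my_dev))  # 8 2
```

Because the split only builds new lists of references, the values of the original variables are not changed.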

### Data Structures

Notice that the data is loaded from tab-separated files. The files are easy to read, and we provide the loading functions that load it into a simple data structure. Feel free to check details of the loading.

The data structure at hand is a list of dictionaries, each containing a `story` and an `order` entry. `story` is a list of strings, and `order` is a list of integer indices:

In [3]:
data_train[1]

{'order': [1, 0, 3, 4, 2],
 'story': ["She didn't have a bike of her own.",
  'Carrie had just learned how to ride a bike.',
  'She got nervous on a hill and crashed into a wall.',
  'The bike frame bent and Carrie got a deep gash on her leg.',
  "Carrie would sneak rides on her sister's bike."]}

## <font color='blue'>Task 1</font>: Model implementation

Your primary task in this assignment is to implement a model that produces the right order of the sentences in the dataset.

### Preprocessing pipeline

First, we construct a preprocessing pipeline, in our case the `pipeline` function, which takes care of:
- out-of-vocabulary words
- building a vocabulary (on the train set), and applying the same unaltered vocabulary on other sets (dev and test)
- making sure that the length of input is the same for the train and dev/test sets (for fixed-sized models)

You are free (and encouraged!) to write your own input processing function. Should you experiment with recurrent neural networks, you will find that you need to.

In [4]:
def tokenize(input):
    return input.split(' ')

In [5]:
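import re
def tokenize(input):
    # split off the clitics n't, 's, 'm, 'll and punctuation as separate tokens;
    # a raw string keeps \w intact for the regex engine
    token = re.compile(r"[\w]+(?=n't)|n't|'s|'m|'ll|[\w]+|[.?!;,:]")
    tokens = token.findall(input)
    return tokens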
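
To see what the regular-expression tokenizer does, here is a small self-contained usage sketch. It uses the same pattern as the cell above, written as a raw string and compiled once; the example sentences are taken from the training instance shown earlier.

```python
import re

# Same pattern as in the cell above, compiled once at module level.
TOKEN_RE = re.compile(r"[\w]+(?=n't)|n't|'s|'m|'ll|[\w]+|[.?!;,:]")


def tokenize(text):
    """Split text into words, clitics (n't, 's, 'm, 'll) and punctuation."""
    return TOKEN_RE.findall(text)


print(tokenize("She didn't have a bike of her own."))
# ['She', 'did', "n't", 'have', 'a', 'bike', 'of', 'her', 'own', '.']
print(tokenize("Carrie would sneak rides on her sister's bike."))
# ['Carrie', 'would', 'sneak', 'rides', 'on', 'her', 'sister', "'s", 'bike', '.']
```

Unlike splitting on spaces, this keeps punctuation and contractions as separate tokens, which reduces the number of distinct word forms in the vocabulary.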

In [6]:
def newpipeline(data, vocab=None, max_sent_len_=None):
    is_ext_vocab = True
    if vocab is None:
        is_ext_vocab = False
        vocab = {'<PAD>': 0, '<OOV>': 1}

    max_sent_len = -1
    data_sentences = []
    data_orders = []
    data_length = []
    for instance in data:
        sents = []
        story_length = []
        for sentence in instance['story']:
            sent = []
            tokenized = tokenize(sentence)
            temp_length = np.shape(tokenized)[0]
            for token in tokenized:
                if not is_ext_vocab and token not in vocab:
                    vocab[token] = len(vocab)
                if token not in vocab:
                    token_id = vocab['<OOV>']
                else:
                    token_id = vocab[token]
                sent.append(token_id)
            if len(sent) > max_sent_len:
                max_sent_len = len(sent)
            sents.append(sent)
            story_length.append(temp_length)
        data_sentences.append(sents)
        data_orders.append(instance['order'])
        data_length.append(story_length)

    if max_sent_len_ is not None:
        max_sent_len = max_sent_len_
    out_sentences = np.full([len(data_sentences), 5, max_sent_len], vocab['<PAD>'], dtype=np.int32)

    for i, elem in enumerate(data_sentences):
        for j, sent in enumerate(elem):
            # truncate sentences longer than the fixed length (possible on dev/test)
            sent = sent[:max_sent_len]
            out_sentences[i, j, 0:len(sent)] = sent

    out_orders = np.array(data_orders, dtype=np.int32)
    # clip the true lengths to the padded length so they remain valid sequence lengths
    out_length = np.minimum(np.array(data_length, dtype=np.int32), max_sent_len)
    return out_sentences, out_orders, vocab, out_length

In [7]:
# convert train set to integer IDs
train_stories, train_orders, vocab,train_length = newpipeline(data_train[0:10000])

In [8]:
train_length.shape

(10000, 5)

In [9]:
def loadWord2Vec(filename):
    """Read a text embedding file with one `word v1 v2 ...` entry per line."""
    vocab = {}
    embd = []
    with open(filename, 'r') as file:
        for line in file:
            row = line.strip().split(' ')
            # the first field is the word, the rest are its vector components
            vocab[row[0]] = len(vocab)
            embd.extend(float(value) for value in row[1:])
    print('Successfully load.')
    return vocab, embd
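
Following the hint about not shipping the whole pre-trained set, a loader can keep only the vectors for words that actually occur in the training data. The sketch below is one way to do this and makes its own assumptions: the function name `load_filtered_embeddings` is hypothetical, the input is a tiny made-up file, and `<PAD>` and `<OOV>` are added as zero vectors rather than read from the file.

```python
import io


def load_filtered_embeddings(lines, wanted_words, dim):
    """Keep only pretrained vectors for words in `wanted_words`.

    `lines` yields text rows of the form "word v1 v2 ... v_dim".
    Index 0 is reserved for <PAD> and index 1 for <OOV> (zero vectors here).
    """
    vocab = {"<PAD>": 0, "<OOV>": 1}
    vectors = [[0.0] * dim, [0.0] * dim]
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in wanted_words and word not in vocab and len(values) == dim:
            vocab[word] = len(vocab)
            vectors.append([float(v) for v in values])
    return vocab, vectors


# Tiny made-up file with 3-dimensional vectors.
fake_file = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "lamp 0.4 0.5 0.6\n"
    "zebra 0.7 0.8 0.9\n"
)
train_words = {"the", "lamp", "store"}
vocab, vectors = load_filtered_embeddings(fake_file, train_words, dim=3)
print(vocab)         # {'<PAD>': 0, '<OOV>': 1, 'the': 2, 'lamp': 3}
print(len(vectors))  # 4
```

Words from the training set without a pre-trained vector ("store" above) would then be mapped to `<OOV>` by the pipeline.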

In [11]:
def pipeline2(data,vocab=None,embd=None,max_sent_len_=None):
    is_ext_vocab = True
    if vocab is None and embd is None: 
        is_ext_vocab = False
        #load word2vect
        filename = 'word2vec.txt'
        vocab, vec = loadWord2Vec(filename)
        embd = np.array(vec,dtype=np.float32)
        embd = np.reshape(embd,(len(vocab),50))
    
    max_sent_len = -1
    data_sentences = []
    data_orders = []
    full_length = []
    for instance in data:
        sents = []
        story_length = []
        for sentence in instance['story']:
            sent = []
            tokenized = tokenize(sentence)
            temp_length = np.shape(tokenized)[0]
            for token in tokenized:
                token = token.lower()
                # map tokens missing from the embedding vocabulary to <OOV>
                # (a plain vocab[token] lookup raises KeyError on unseen training tokens)
                token_id = vocab.get(token, vocab['<OOV>'])
                sent.append(token_id)
            if len(sent) > max_sent_len:
                max_sent_len = len(sent)
            sents.append(sent)
            story_length.append(temp_length)
        data_sentences.append(sents)
        data_orders.append(instance['order'])
        full_length.append(story_length)

    if max_sent_len_ is not None:
        max_sent_len = max_sent_len_
    out_sentences = np.full([len(data_sentences), 5, max_sent_len], vocab['<PAD>'], dtype=np.int32)

    for i, elem in enumerate(data_sentences):
        for j, sent in enumerate(elem):
            # truncate sentences longer than the fixed length (possible on dev/test)
            sent = sent[:max_sent_len]
            out_sentences[i, j, 0:len(sent)] = sent

    out_orders = np.array(data_orders, dtype=np.int32)
    # clip the true lengths to the padded length so they remain valid sequence lengths
    full_length = np.minimum(np.array(full_length, dtype=np.int64), max_sent_len)
    return out_sentences, out_orders, vocab, embd, full_length

In [12]:
train_stories, train_orders, vocab, embd, train_length= pipeline2(data_train)

Successfully load.


In [16]:
train_length.shape

(45502, 5)

You need to make sure that the `pipeline` function returns all the data needed to feed your computational graph (the required inputs in this case), as we will call this function to process your dev and test data. If the pipeline applied to the train set is not also applied to the other datasets, your model may not work with that data!

In [12]:
# get the length of the longest training sentence
max_sent_len = train_stories.shape[2]

# convert dev set to integer IDs, based on the train vocabulary and max_sent_len
dev_stories, dev_orders, _ ,dev_length= newpipeline(data_dev, vocab=vocab, max_sent_len_=max_sent_len)
max_sent_len

20

In [39]:
#Word Embedding
max_sent_len = train_stories.shape[2]
dev_stories, dev_orders, _, _ ,dev_length= pipeline2(data_dev, vocab=vocab, max_sent_len_=max_sent_len)

You can take a look at the result of the `pipeline` with the `show_data_instance` function to make sure that your data loaded correctly:

In [40]:
nn.show_data_instance(dev_stories, dev_orders, vocab, 155)

Input:
 Story:
  the manager decided to offer john the job .
  during the interview he was very talkative and <OOV> .
  he went to the interview very prepared and nicely dressed .
  john was excited to have a job interview .
  the manager of the company was really impressed by john 's comments .
 Order:
  [4 2 1 0 3]

Desired story:
  john was excited to have a job interview .
  he went to the interview very prepared and nicely dressed .
  during the interview he was very talkative and <OOV> .
  the manager of the company was really impressed by john 's comments .
  the manager decided to offer john the job .


### Model

The model we provide is a rudimentary, non-optimised model that essentially represents every word in a sentence with a fixed vector, sums these vectors up (per sentence) and puts a softmax at the end which aims to guess the order of sentences independently.
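
The description above can be made concrete with a framework-free sketch of the forward pass. This reflects our reading of the description, not the provided implementation: the sizes, the random untrained parameters, and the omission of bias terms are all assumptions made for illustration.

```python
import math
import random

random.seed(0)
VOCAB_SIZE, EMB_SIZE, NUM_SENTS = 10, 4, 5

# Randomly initialised, untrained parameters.
embeddings = [[random.uniform(-0.1, 0.1) for _ in range(EMB_SIZE)]
              for _ in range(VOCAB_SIZE)]
W = [[random.uniform(-0.1, 0.1) for _ in range(NUM_SENTS * NUM_SENTS)]
     for _ in range(NUM_SENTS * EMB_SIZE)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(story_ids):
    """story_ids: 5 lists of token ids. Returns a 5 x 5 list of position probabilities."""
    # 1. sum the word vectors of each sentence
    sent_vecs = [[sum(embeddings[t][d] for t in sent) for d in range(EMB_SIZE)]
                 for sent in story_ids]
    # 2. concatenate the five sentence vectors
    h = [x for vec in sent_vecs for x in vec]
    # 3. one linear layer producing 5 * 5 logits
    logits = [sum(h[i] * W[i][j] for i in range(len(h)))
              for j in range(NUM_SENTS * NUM_SENTS)]
    # 4. an independent softmax over positions for each sentence
    return [softmax(logits[k * NUM_SENTS:(k + 1) * NUM_SENTS])
            for k in range(NUM_SENTS)]


story_ids = [[2, 3, 4], [5, 6], [7, 8, 9], [2, 5], [3, 3, 1]]
probs = forward(story_ids)
predicted = [row.index(max(row)) for row in probs]
print(predicted)
```

Because each sentence gets its own softmax, nothing forces the five predicted positions to form a valid permutation.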

First we define the model parameters:

In [41]:
#training parameter
num_hidden = 8
target_size = 5
vocab_size = len(vocab)
input_size = 50
timesteps = 5
num_layers = 2

and then we define the model

In [42]:
tf.reset_default_graph()
#creat tensor holder
story = tf.placeholder(tf.int32, [None,None,None],"story")
order = tf.placeholder(tf.int32, [None,None],"order")
length = tf.placeholder(tf.int64, [None, None], "sentence_length") #[batch_size x 5]

weights = {
    'out': tf.Variable(tf.random_normal([num_hidden, target_size]),dtype=tf.float32)
}
biases = {
    'out': tf.Variable(tf.random_normal([target_size]),dtype=tf.float32)
}

batch_size = tf.shape(story)[0]
sentences = [tf.reshape(x, [batch_size, -1]) for x in tf.split(axis=1, num_or_size_splits=5, value=story)]
#initializer = tf.random_uniform_initializer(-0.1, 0.1)
#embeddings = tf.get_variable("W", [vocab_size, input_size], initializer=initializer)

embeddings = tf.constant(embd)
sentences_embedded = [tf.nn.embedding_lookup(embeddings, sentence)    # 5 x batch_size x max_length x input_size
                    for sentence in sentences]
# hs = [tf.reduce_sum(sentence, 1) for sentence in sentences_embedded]
hs = []
#cell = tf.contrib.rnn.BasicLSTMCell(num_hidden, forget_bias=1.0,state_is_tuple=True)
hidden_units = [20,5]
rnn_layers = [tf.contrib.rnn.LSTMCell(size, activation= tf.nn.tanh) for size in hidden_units]
cell = tf.contrib.rnn.MultiRNNCell(rnn_layers, state_is_tuple=True)
cell = tf.contrib.rnn.DropoutWrapper(
      cell, output_keep_prob=0.4)


i = 0
for batch in sentences_embedded:
    output,final_state = tf.nn.dynamic_rnn(cell,batch,sequence_length = length[:,i],dtype=tf.float32)
    #output, _ = tf.nn.bidirectional_dynamic_rnn(cell_fw=cell,cell_bw=cell,inputs = batch, 
    #                                                      sequence_length= length[:,1], dtype = tf.float32)
    
    hs.append(final_state[-1].h)
    #hs.append(tf.matmul(final_state.h, weights['out']) + biases['out'])
    i+=1
# hs is a list of 5 tensors, each of shape [batch_size x hidden_units[-1]]

h = tf.concat(axis=1, values=hs)    # [batch_size x 5*hidden_units[-1]]
h = tf.reshape(h, [batch_size, 5*hidden_units[-1]])
logits_flat = tf.contrib.layers.linear(h, 5 * target_size)    # [batch_size x 5*target_size]
logits = tf.reshape(logits_flat, [-1, 5, target_size])        # [batch_size x 5 x target_size]
#logits = tf.transpose(hs,perm=[1,0,2])
# # # loss 
loss = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=order))

#beta = 0.1
#regularizers = tf.nn.l2_loss(weights['out'])

#loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
#    logits=logits, labels=order))

#loss = loss + beta * regularizers


# prediction function
unpacked_logits = [tensor for tensor in tf.unstack(logits, axis=1)]
softmaxes = [tf.nn.softmax(tensor) for tensor in unpacked_logits]
softmaxed_logits = tf.stack(softmaxes, axis=1)
predict = tf.argmax(softmaxed_logits, 2)
#opt_op = tf.train.AdamOptimizer(0.1).minimize(loss)

In [43]:
max_gradient_norm = 1

# Calculate and clip gradients
params = tf.trainable_variables()
gradients = tf.gradients(loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(
gradients, max_gradient_norm)

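# Optimization
#optimizer = tf.train.AdamOptimizer(0.001)
optimizer = tf.train.RMSPropOptimizer(0.001)
update_step = optimizer.apply_gradients(
    zip(clipped_gradients, params))
# NOTE: minimize() computes its own, unclipped gradients, so training with opt_op
# (as the training loop does) bypasses the clipping above; run update_step to apply it.
opt_op = optimizer.minimize(loss)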
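
To make clear what clipping by global norm computes, here is a plain-Python sketch of the rule: every gradient is scaled by `clip_norm / max(global_norm, clip_norm)`, where the global norm is taken over all gradients jointly. The function `clip_global_norm_sketch` is a hypothetical stand-in written for illustration and operates on nested lists rather than tensors.

```python
import math


def clip_global_norm_sketch(grads, clip_norm):
    """Scale all gradients by clip_norm / max(global_norm, clip_norm)."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # global norm is 5
clipped, norm = clip_global_norm_sketch(grads, clip_norm=1.0)
print(norm)
print(clipped)
```

Gradients whose global norm is already below the threshold are left unchanged, and the relative direction of the update is preserved in both cases.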

In [51]:
1000//25

40

In [44]:
tf.set_random_seed(1234)

We have now built the model together with the loss and the prediction function. The only remaining piece is an optimiser on the loss, which is defined in the cells above.

### Model training 

Having defined the preprocessing pipeline and set up the model, we can finally train it.

In [50]:
np.random.choice(range(len(train_stories)),20,replace = False)

array([28803,   100, 40534, 34152,  9407, 40087, 17993, 30877, 43566,
       14355,  7876, 10433, 42544, 21522,  8430, 43914, 36772, 34517,
       23116, 24648])
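
Drawing random batches can be wrapped in a small generator. The sketch below is one possible variant, with a hypothetical helper name: it shuffles the example indices once per epoch and also yields the final, smaller batch, whereas a loop over `n // BATCH_SIZE` batches drops the remainder.

```python
import random


def iterate_minibatches(num_examples, batch_size, rng):
    """Yield lists of shuffled indices covering every example once per epoch."""
    indices = list(range(num_examples))
    rng.shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(1234)
batches = list(iterate_minibatches(num_examples=10, batch_size=4, rng=rng))
print([len(b) for b in batches])  # [4, 4, 2]
```

Each yielded index list can be used to index NumPy arrays such as `train_stories`, `train_orders` and `train_length` to build one feed dictionary.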

In [55]:
BATCH_SIZE = 512
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    n = train_stories.shape[0]

    for epoch in range(1000):
        #BATCH_SIZE = 
        print('----- Epoch', epoch, '-----')
        total_loss = 0
        idx = np.random.permutation(train_stories.shape[0])
        for i in range(n // BATCH_SIZE):
            inst_story = train_stories[idx[i * BATCH_SIZE: (i + 1) * BATCH_SIZE]]
            inst_order = train_orders[idx[i * BATCH_SIZE: (i + 1) * BATCH_SIZE]]
            inst_length = train_length[idx[i * BATCH_SIZE: (i + 1) * BATCH_SIZE]]
            feed_dict = {story: inst_story, order: inst_order, length: inst_length}
            _, current_loss = sess.run([opt_op, loss], feed_dict=feed_dict)
            total_loss += current_loss

        print(' Train loss:', total_loss / n)

        train_feed_dict = {story: train_stories, order: train_orders, length: train_length}
        train_predicted = sess.run(predict, feed_dict=train_feed_dict)
        train_accuracy = nn.calculate_accuracy(train_orders, train_predicted)
        print(' Train accuracy:', train_accuracy)
        
        dev_feed_dict = {story: dev_stories, order: dev_orders,length:dev_length}
        dev_predicted = sess.run(predict, feed_dict=dev_feed_dict)
        dev_accuracy = nn.calculate_accuracy(dev_orders, dev_predicted)
        print(' Dev accuracy:', dev_accuracy)

        
    
    nn.save_model(sess)

----- Epoch 0 -----
 Train loss: 7.823172919
 Train accuracy: 0.318073930816
 Dev accuracy: 0.308818813469
----- Epoch 1 -----
 Train loss: 7.21117543812
 Train accuracy: 0.349615401521
 Dev accuracy: 0.349011223944
----- Epoch 2 -----
 Train loss: 6.84138643094
 Train accuracy: 0.363346666081
 Dev accuracy: 0.366862640299
----- Epoch 3 -----
 Train loss: 6.66900403245
 Train accuracy: 0.375390092743
 Dev accuracy: 0.379155531801
----- Epoch 4 -----
 Train loss: 6.55370595832
 Train accuracy: 0.387481868929
 Dev accuracy: 0.392089791555
----- Epoch 5 -----
 Train loss: 6.45837628231
 Train accuracy: 0.40535361083
 Dev accuracy: 0.39914484233
----- Epoch 6 -----
 Train loss: 6.36129633982
 Train accuracy: 0.420021097974
 Dev accuracy: 0.400962052378
----- Epoch 7 -----
 Train loss: 6.25840915346
 Train accuracy: 0.43580501956
 Dev accuracy: 0.40833778728
----- Epoch 8 -----
 Train loss: 6.15665607118
 Train accuracy: 0.447386928047
 Dev accuracy: 0.413682522715
----- Epoch 9 -----
 ...
 Train loss: 5.0077764117
 Train accuracy: 0.565350973584
 Dev accuracy: 0.519294494923
----- Epoch 76 -----
 Train loss: 5.00618272102
 Train accuracy: 0.566709155642
 Dev accuracy: 0.521111704971
----- Epoch 77 -----
 Train loss: 4.99963373541
 Train accuracy: 0.565830073403
 Dev accuracy: 0.519294494923
----- Epoch 78 -----
 Train loss: 4.99844282686
 Train accuracy: 0.567030020658
 Dev accuracy: 0.522394441475
----- Epoch 79 -----
 Train loss: 4.99432918449
 Train accuracy: 0.564915827876
 Dev accuracy: 0.524853019776
----- Epoch 80 -----
 Train loss: 4.99155470458
 Train accuracy: 0.567069579359
 Dev accuracy: 0.521432389097
----- Epoch 81 -----
 Train loss: 4.98599351681
 Train accuracy: 0.567122324293
 Dev accuracy: 0.523570283271
----- Epoch 82 -----
 Train loss: 4.98328715183
 Train accuracy: 0.563966419059
 Dev accuracy: 0.52613575628
----- Epoch 83 -----
 Train loss: 4.98097280043
 Train accuracy: 0.568704672322
 Dev accuracy: 0.523035809727
----- Epoch 84 -----
 ...
 Train accuracy: 0.579534965496
 Dev accuracy: 0.53607696419
----- Epoch 150 -----
 Train loss: 4.84385310866
 Train accuracy: 0.57844050811
 Dev accuracy: 0.525494388028
----- Epoch 151 -----
 Train loss: 4.84355573699
 Train accuracy: 0.579983297437
 Dev accuracy: 0.533939070016
----- Epoch 152 -----
 Train loss: 4.84027903628
 Train accuracy: 0.580449211024
 Dev accuracy: 0.533297701764
----- Epoch 153 -----
 Train loss: 4.84156242982
 Train accuracy: 0.579886598391
 Dev accuracy: 0.533297701764
----- Epoch 154 -----
 Train loss: 4.83751756233
 Train accuracy: 0.580097578128
 Dev accuracy: 0.531373597007
----- Epoch 155 -----
 Train loss: 4.8354936182
 Train accuracy: 0.578313041185
 Dev accuracy: 0.527097808658
----- Epoch 156 -----
 Train loss: 4.83584088358
 Train accuracy: 0.577820755132
 Dev accuracy: 0.529021913415
----- Epoch 157 -----
 Train loss: 4.83611578457
 Train accuracy: 0.580959078722
 Dev accuracy: 0.531694281133
----- Epoch 158 -----
 Train loss: 4.83294743998
 ...
 Train accuracy: 0.58453254802
 Dev accuracy: 0.535970069482
----- Epoch 224 -----
 Train loss: 4.76933433107
 Train accuracy: 0.58608412817
 Dev accuracy: 0.53693212186
----- Epoch 225 -----
 Train loss: 4.76874722819
 Train accuracy: 0.58737637906
 Dev accuracy: 0.532549438803
----- Epoch 226 -----
 Train loss: 4.76534749138
 Train accuracy: 0.582523845106
 Dev accuracy: 0.527097808658
----- Epoch 227 -----
 Train loss: 4.76528142607
 Train accuracy: 0.584928135027
 Dev accuracy: 0.530625334046
----- Epoch 228 -----
 Train loss: 4.76602867817
 Train accuracy: 0.585174278054
 Dev accuracy: 0.536504543025
----- Epoch 229 -----
 Train loss: 4.76479020453
 Train accuracy: 0.586734649026
 Dev accuracy: 0.535328701229
----- Epoch 230 -----
 Train loss: 4.76455978347
 Train accuracy: 0.585323722034
 Dev accuracy: 0.529556386959
----- Epoch 231 -----
 Train loss: 4.76601875201
 Train accuracy: 0.585503933893
 Dev accuracy: 0.531587386424
----- Epoch 232 -----
 Train loss: 4.76157103263
 ...
 Train loss: 4.722867307
 Train accuracy: 0.59007076612
 Dev accuracy: 0.535435595938
----- Epoch 299 -----
 Train loss: 4.72276446125
 Train accuracy: 0.58988615885
 Dev accuracy: 0.534152859433
----- Epoch 300 -----
 Train loss: 4.71992571884
 Train accuracy: 0.588273042943
 Dev accuracy: 0.533297701764
----- Epoch 301 -----
 Train loss: 4.71846083216
 Train accuracy: 0.590703705332
 Dev accuracy: 0.537359700695
----- Epoch 302 -----
 Train loss: 4.72268219752
 Train accuracy: 0.590760845677
 Dev accuracy: 0.541207910208
----- Epoch 303 -----
 Train loss: 4.71785648459
 Train accuracy: 0.590817986023
 Dev accuracy: 0.537894174238
----- Epoch 304 -----
 Train loss: 4.71828489768
 Train accuracy: 0.589525735133
 Dev accuracy: 0.534152859433
----- Epoch 305 -----
 Train loss: 4.71819379698
 Train accuracy: 0.590677332864
 Dev accuracy: 0.540887226082
----- Epoch 306 -----
 Train loss: 4.71732475358
 Train accuracy: 0.591037756582
 Dev accuracy: 0.537359700695
----- Epoch 307 -----
 ...
 Train accuracy: 0.594184870995
 Dev accuracy: 0.540780331374
----- Epoch 373 -----
 Train loss: 4.68520021592
 Train accuracy: 0.594444200255
 Dev accuracy: 0.5411010155
----- Epoch 374 -----
 Train loss: 4.68527842333
 Train accuracy: 0.593147553954
 Dev accuracy: 0.539497594869
----- Epoch 375 -----
 Train loss: 4.68224437701
 Train accuracy: 0.593846424333
 Dev accuracy: 0.539176910743
----- Epoch 376 -----
 Train loss: 4.68433610877
 Train accuracy: 0.594149707705
 Dev accuracy: 0.540780331374
----- Epoch 377 -----
 Train loss: 4.68236035212
 Train accuracy: 0.59269043119
 Dev accuracy: 0.5411010155
----- Epoch 378 -----
 Train loss: 4.68295724703
 Train accuracy: 0.594039822425
 Dev accuracy: 0.539818278995
----- Epoch 379 -----
 Train loss: 4.68268309721
 Train accuracy: 0.594545294712
 Dev accuracy: 0.541849278461
----- Epoch 380 -----
 Train loss: 4.68360876265
 Train accuracy: 0.594000263725
 Dev accuracy: 0.538214858365
----- Epoch 381 -----
 Train loss: 4.68196621916
 ...
 Train loss: 4.65486553132
 Train accuracy: 0.596479275636
 Dev accuracy: 0.542490646713
----- Epoch 448 -----
 Train loss: 4.65743252228
 Train accuracy: 0.596184783086
 Dev accuracy: 0.539070016034
----- Epoch 449 -----
 Train loss: 4.6556990498
 Train accuracy: 0.596035339106
 Dev accuracy: 0.544628540887
----- Epoch 450 -----
 Train loss: 4.65389073944
 Train accuracy: 0.597129796492
 Dev accuracy: 0.544628540887
----- Epoch 451 -----
 Train loss: 4.65552507908
 Train accuracy: 0.596655092084
 Dev accuracy: 0.543987172635
----- Epoch 452 -----
 Train loss: 4.65633582641
 Train accuracy: 0.595965012527
 Dev accuracy: 0.540994120791
----- Epoch 453 -----
 Train loss: 4.65359426917
 Train accuracy: 0.596888048877
 Dev accuracy: 0.543987172635
----- Epoch 454 -----
 Train loss: 4.6540485171
 Train accuracy: 0.596430926113
 Dev accuracy: 0.541956173169
----- Epoch 455 -----
 Train loss: 4.65392670969
 Train accuracy: 0.596747395719
 Dev accuracy: 0.542918225548
----- Epoch 456 -----
 ...
 Train loss: 4.63083141005
 Train accuracy: 0.597296822118
 Dev accuracy: 0.543238909674
----- Epoch 522 -----
 Train loss: 4.63426237403
 Train accuracy: 0.596532020571
 Dev accuracy: 0.548049171566
----- Epoch 523 -----
 Train loss: 4.6308860254
 Train accuracy: 0.597635268779
 Dev accuracy: 0.541528594335
----- Epoch 524 -----
 Train loss: 4.63420135765
 Train accuracy: 0.599142894818
 Dev accuracy: 0.544842330305
----- Epoch 525 -----
 Train loss: 4.63065922605
 Train accuracy: 0.596413344468
 Dev accuracy: 0.552324959914
----- Epoch 526 -----
 Train loss: 4.63375848456
 Train accuracy: 0.597257263417
 Dev accuracy: 0.545056119722
----- Epoch 527 -----
 Train loss: 4.62942095093
 Train accuracy: 0.598509955606
 Dev accuracy: 0.549866381614
----- Epoch 528 -----
 Train loss: 4.62987315998
 Train accuracy: 0.596905630522
 Dev accuracy: 0.546231961518
----- Epoch 529 -----
 Train loss: 4.62968328594
 Train accuracy: 0.597512197266
 Dev accuracy: 0.541956173169
----- Epoch 530 -----
 ...

 Train accuracy: 0.596712232429
 Dev accuracy: 0.548049171566
----- Epoch 596 -----
 Train loss: 4.61534353594
 Train accuracy: 0.600817546481
 Dev accuracy: 0.54441475147
----- Epoch 597 -----
 Train loss: 4.6127857146
 Train accuracy: 0.600435145708
 Dev accuracy: 0.54687332977
----- Epoch 598 -----
 Train loss: 4.61315392151
 Train accuracy: 0.600545030988
 Dev accuracy: 0.547407803314
----- Epoch 599 -----
 Train loss: 4.61109855169
 Train accuracy: 0.600386796185
 Dev accuracy: 0.546125066809
----- Epoch 600 -----
 Train loss: 4.61484542511
 Train accuracy: 0.60047909982
 Dev accuracy: 0.545376803848
----- Epoch 601 -----
 Train loss: 4.61140953025
 Train accuracy: 0.599810997319
 Dev accuracy: 0.546231961518
----- Epoch 602 -----
 Train loss: 4.61396572047
 Train accuracy: 0.599687925805
 Dev accuracy: 0.547300908605
----- Epoch 603 -----
 Train loss: 4.61236734572
 Train accuracy: 0.600215375148
 Dev accuracy: 0.548049171566
----- Epoch 604 -----
 Train loss: 4.61254730433
 ...

 Train loss: 4.59489455606
 Train accuracy: 0.601261483012
 Dev accuracy: 0.542597541422
----- Epoch 671 -----
 Train loss: 4.59504233781
 Train accuracy: 0.600861500593
 Dev accuracy: 0.543666488509
----- Epoch 672 -----
 Train loss: 4.59816665854
 Train accuracy: 0.601243901367
 Dev accuracy: 0.546766435061
----- Epoch 673 -----
 Train loss: 4.5978330537
 Train accuracy: 0.601437299459
 Dev accuracy: 0.543132014965
----- Epoch 674 -----
 Train loss: 4.5971187189
 Train accuracy: 0.601125225265
 Dev accuracy: 0.545483698557
----- Epoch 675 -----
 Train loss: 4.59699470091
 Train accuracy: 0.601129620676
 Dev accuracy: 0.543773383218
----- Epoch 676 -----
 Train loss: 4.59896900887
 Train accuracy: 0.601868049756
 Dev accuracy: 0.546445750935
----- Epoch 677 -----
 Train loss: 4.59816995832
 Train accuracy: 0.601595534262
 Dev accuracy: 0.544628540887
----- Epoch 678 -----
 Train loss: 4.59903754231
 Train accuracy: 0.600659311679
 Dev accuracy: 0.537466595404
----- Epoch 679 -----
 ...

 Train accuracy: 0.602426266977
 Dev accuracy: 0.542811330839
----- Epoch 745 -----
 Train loss: 4.58365505719
 Train accuracy: 0.601828491055
 Dev accuracy: 0.547087119188
----- Epoch 746 -----
 Train loss: 4.58581913724
 Train accuracy: 0.602430662388
 Dev accuracy: 0.546445750935
----- Epoch 747 -----
 Train loss: 4.58313854808
 Train accuracy: 0.602698782471
 Dev accuracy: 0.543666488509
----- Epoch 748 -----
 Train loss: 4.58554364605
 Train accuracy: 0.602268032174
 Dev accuracy: 0.542169962587
----- Epoch 749 -----
 Train loss: 4.58428165009
 Train accuracy: 0.603278976748
 Dev accuracy: 0.545376803848
----- Epoch 750 -----
 Train loss: 4.58387341126
 Train accuracy: 0.603098764889
 Dev accuracy: 0.541849278461
----- Epoch 751 -----
 Train loss: 4.58588295441
 Train accuracy: 0.602892180563
 Dev accuracy: 0.54270443613
----- Epoch 752 -----
 Train loss: 4.58518825194
 Train accuracy: 0.603239418048
 Dev accuracy: 0.544307856761
----- Epoch 753 -----
 Train loss: 4.58079939738
 ...

 Train loss: 4.57200520011
 Train accuracy: 0.600791174014
 Dev accuracy: 0.547087119188
----- Epoch 820 -----
 Train loss: 4.57198829344
 Train accuracy: 0.604509691882
 Dev accuracy: 0.541849278461
----- Epoch 821 -----
 Train loss: 4.57276564595
 Train accuracy: 0.604544855171
 Dev accuracy: 0.5411010155
----- Epoch 822 -----
 Train loss: 4.57137436846
 Train accuracy: 0.602316381698
 Dev accuracy: 0.544735435596
----- Epoch 823 -----
 Train loss: 4.5706085464
 Train accuracy: 0.601472462749
 Dev accuracy: 0.54687332977
----- Epoch 824 -----
 Train loss: 4.57256518043
 Train accuracy: 0.602285613819
 Dev accuracy: 0.53693212186
----- Epoch 825 -----
 Train loss: 4.56988608824
 Train accuracy: 0.603208650169
 Dev accuracy: 0.543773383218
----- Epoch 826 -----
 Train loss: 4.57389599418
 Train accuracy: 0.604034987473
 Dev accuracy: 0.539711384286
----- Epoch 827 -----
 Train loss: 4.57234565668
 Train accuracy: 0.604017405828
 Dev accuracy: 0.545697487974
----- Epoch 828 -----
 ...

 Train accuracy: 0.603920706782
 Dev accuracy: 0.542383752004
----- Epoch 894 -----
 Train loss: 4.56211651754
 Train accuracy: 0.605340424597
 Dev accuracy: 0.543238909674
----- Epoch 895 -----
 Train loss: 4.56236377015
 Train accuracy: 0.604747044086
 Dev accuracy: 0.541849278461
----- Epoch 896 -----
 Train loss: 4.56168416618
 Train accuracy: 0.604623972573
 Dev accuracy: 0.547835382149
----- Epoch 897 -----
 Train loss: 4.5622389849
 Train accuracy: 0.605863478528
 Dev accuracy: 0.545163014431
----- Epoch 898 -----
 Train loss: 4.56042675236
 Train accuracy: 0.603854775614
 Dev accuracy: 0.542276857296
----- Epoch 899 -----
 Train loss: 4.56147105957
 Train accuracy: 0.603744890334
 Dev accuracy: 0.543880277926
----- Epoch 900 -----
 Train loss: 4.56260117709
 Train accuracy: 0.605476682344
 Dev accuracy: 0.541207910208
----- Epoch 901 -----
 Train loss: 4.56166950229
 Train accuracy: 0.605032745813
 Dev accuracy: 0.542383752004
----- Epoch 902 -----
 Train loss: 4.56271966325
 ...

 Train accuracy: 0.605788756538
 Dev accuracy: 0.538535542491
----- Epoch 968 -----
 Train loss: 4.55423684695
 Train accuracy: 0.606329392115
 Dev accuracy: 0.542490646713
----- Epoch 969 -----
 Train loss: 4.55589164522
 Train accuracy: 0.603208650169
 Dev accuracy: 0.540887226082
----- Epoch 970 -----
 Train loss: 4.55403015746
 Train accuracy: 0.605599753857
 Dev accuracy: 0.544200962052
----- Epoch 971 -----
 Train loss: 4.55137446418
 Train accuracy: 0.60631181047
 Dev accuracy: 0.544094067344
----- Epoch 972 -----
 Train loss: 4.5542072187
 Train accuracy: 0.606012922509
 Dev accuracy: 0.542597541422
----- Epoch 973 -----
 Train loss: 4.55330595015
 Train accuracy: 0.605661289614
 Dev accuracy: 0.542276857296
----- Epoch 974 -----
 Train loss: 4.55351194213
 Train accuracy: 0.6030636016
 Dev accuracy: 0.538642437199
----- Epoch 975 -----
 Train loss: 4.55378534615
 Train accuracy: 0.604865720188
 Dev accuracy: 0.541742383752
----- Epoch 976 -----
 Train loss: 4.55282782578
 ...

## <font color='red'>Assessment 1</font>: Assess Accuracy (40 pts) 

We assess how well your model performs on an unseen test set. We look at the accuracy of the predicted sentence order at the sentence level and score it as follows:

* 0 - 10 pts: 45% <= accuracy < 50%, linear
* 10 - 20 pts: 50% <= accuracy < 55%, linear
* 20 - 40 pts: 55% <= accuracy < 60%, linear
* extra 0 - 10 pts: 60% <= accuracy < 70%, linear

The **linear** mapping maps any accuracy value between the lower and upper bound of a band linearly to a score between that band's lower and upper point values. For example, if your model's accuracy is $acc=54.5\%$, then your score is $10 + 10\frac{acc-50}{55-50} = 19$.
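
As a rough sanity check, the bands above can be written as a small Python function. This is only a sketch and not the marking code: `accuracy_to_points` is a hypothetical helper, the extra points are assumed to be added on top of the 40 regular points, and the handling of accuracies below 45% or at 70% and above is an assumption, since the list does not cover those cases.

```python
def accuracy_to_points(acc):
    """Map a test accuracy (in percent) to points using the linear bands above."""
    # Each band: (lower accuracy, upper accuracy, points at lower, points at upper).
    # Assumption: the "extra 0 - 10 pts" band continues from 40 up to 50 points.
    bands = [
        (45.0, 50.0, 0.0, 10.0),
        (50.0, 55.0, 10.0, 20.0),
        (55.0, 60.0, 20.0, 40.0),
        (60.0, 70.0, 40.0, 50.0),
    ]
    for lower, upper, pts_lower, pts_upper in bands:
        if lower <= acc < upper:
            # Linear interpolation inside the band.
            return pts_lower + (pts_upper - pts_lower) * (acc - lower) / (upper - lower)
    # Assumption: below 45% earns nothing; 70% and above is capped at the maximum.
    return 0.0 if acc < 45.0 else 50.0


print(accuracy_to_points(54.5))  # 19.0, matching the worked example
```

Note that the function expects the accuracy in percent, so a value such as `0.545` returned by the notebook would first need to be multiplied by 100.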

Change the following lines so that they construct the test set in the same way you constructed the dev set in the code above. We will insert the test set instead of the dev set here. **The `test_feed_dict` variable must keep this name**.

In [28]:
# LOAD THE DATA
data_test = nn.load_corpus(data_path + "dev.tsv")
# make sure you process this with the same pipeline as you processed your dev set
test_stories, test_orders, _ = nn.pipeline(data_test, vocab=vocab, max_sent_len_=max_sent_len)

# THIS VARIABLE MUST BE NAMED `test_feed_dict`
test_feed_dict = {story: test_stories, order: test_orders}

The following code loads your model, computes accuracy, and exports the result. **DO NOT** change this code.

In [29]:
#! ASSESSMENT 1 - DO NOT CHANGE, MOVE NOR COPY
with tf.Session() as sess:
    # LOAD THE MODEL
    saver = tf.train.Saver()
    saver.restore(sess, './model/model.checkpoint')
    
    # RUN TEST SET EVALUATION
    dev_predicted = sess.run(predict, feed_dict=test_feed_dict)
    dev_accuracy = nn.calculate_accuracy(dev_orders, dev_predicted)

dev_accuracy

0.38995189738107966

## <font color='orange'>Mark</font>:  Your solution to Task 1 is marked with ** __ points**. 
---

## <font color='blue'>Task 2</font>: Describe your Approach

Enter a description of your approach, 1000 words at most, **in this cell**.
Make sure to provide:
- an **error analysis** of the types of errors your system makes
- a **comparison** of your system with the model we provide, focusing on the differences and drawing useful comparisons between them
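
One possible starting point for the error analysis is to break accuracy down by sentence and to count which positions get confused with which. The sketch below is only an illustration: `per_position_accuracy` and `confusion_counts` are hypothetical helpers, the toy orders are made up rather than taken from the dataset, and it assumes that gold and predicted orders are equal-length sequences of position labels, one sequence per story.

```python
from collections import Counter


def per_position_accuracy(gold_orders, predicted_orders):
    """For each input sentence slot, the fraction of stories labelled correctly."""
    n_slots = len(gold_orders[0])
    correct = [0] * n_slots
    for gold, pred in zip(gold_orders, predicted_orders):
        for i, (g, p) in enumerate(zip(gold, pred)):
            correct[i] += int(g == p)
    return [c / len(gold_orders) for c in correct]


def confusion_counts(gold_orders, predicted_orders):
    """Count (gold position, predicted position) pairs for misplaced sentences only."""
    return Counter(
        (g, p)
        for gold, pred in zip(gold_orders, predicted_orders)
        for g, p in zip(gold, pred)
        if g != p
    )


# Toy data (made up): two stories of five sentences each.
toy_gold = [[3, 4, 0, 1, 2], [0, 1, 2, 3, 4]]
toy_pred = [[3, 4, 0, 2, 1], [0, 1, 2, 3, 4]]

print(per_position_accuracy(toy_gold, toy_pred))
print(confusion_counts(toy_gold, toy_pred).most_common())
```

In the notebook, the same functions could be applied to the dev orders and the dev predictions, provided both are converted to lists of per-story label sequences first.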

Should you need to include figures in your report, make sure they are Python-generated (matplotlib, seaborn, and bokeh are all included in the stat-nlp-book Docker image). To do so, feel free to create new cells after this cell (before the Assessment 2 cell). Link to online images at your own risk.

...WRITE YOUR DESCRIPTION HERE...

## <font color='red'>Assessment 2</font>: Assess Description (60 pts) 

We will mark the description along the following dimensions: 

* Clarity (10pts: very clear, 0pts: we can't figure out what you did, or you did nothing)
* Creativity (25pts: we could not have come up with this, 0pts: you only used the provided model)
* Substance (25pts: you implemented a complex state-of-the-art classifier and compared it to a simpler model, 0pts: you only used what is already there)

## <font color='orange'>Mark</font>:  Your solution to Task 2 is marked with ** __ points**.
---

## <font color='orange'>Final mark</font>: Your solution to Assignment 3 is marked with ** __ points**. 