<!---
Latex Macros
-->
$$
\newcommand{\bar}{\,|\,}
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\weights}{\mathbf{w}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

# Assignment 3

## Introduction

In this last assignment, you will apply deep learning methods to a story understanding problem. Automatic understanding of stories is an important task in natural language understanding [[1]](http://anthology.aclweb.org/D/D13/D13-1020.pdf). Specifically, you will develop a model that, given a sequence of sentences, learns to sort these sentences into a coherent story [[2]](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/short-commonsense-stories.pdf). This sounds (and to an extent is) trivial for humans, but it is quite difficult for machines because it involves commonsense knowledge and temporal understanding.

## Goal

You are given a dataset of 45502 instances, each consisting of 5 sentences. Your system needs to output a sequence of numbers that represents the predicted order of these sentences. For example, given a story:

    He went to the store.
    He found a lamp he liked.
    He bought the lamp.
    Jan decided to get a new lamp.
    Jan's lamp broke.

your system needs to provide an answer in the following form:

    2	3	4	1	0

where the numbers correspond to the zero-based index of each sentence in the correctly ordered story. So "`2`" for "`He went to the store.`" means that this sentence should come 3rd in the correctly ordered target story. In this particular example, this order of indices corresponds to the following target story:

    Jan's lamp broke.
    Jan decided to get a new lamp.
    He went to the store.
    He found a lamp he liked.
    He bought the lamp.
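
To make the index convention concrete, here is a minimal sketch in plain Python that places each sentence at its predicted position and recovers the target story from the example above. The helper name `reorder_story` is ours for illustration and is not part of the provided code.

```python
def reorder_story(sentences, order):
    """Place sentences[i] at position order[i] of the output story."""
    ordered = [None] * len(sentences)
    for sentence, position in zip(sentences, order):
        ordered[position] = sentence
    return ordered


story = [
    "He went to the store.",
    "He found a lamp he liked.",
    "He bought the lamp.",
    "Jan decided to get a new lamp.",
    "Jan's lamp broke.",
]
order = [2, 3, 4, 1, 0]

for sentence in reorder_story(story, order):
    print(sentence)
```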

## Resources

To develop your model(s), we provide a training dataset and a development dataset. The test dataset will be held out, and we will use it to evaluate your models. The test set comes from the same task distribution, so you do not need to expect drastic changes in it.

You will use [TensorFlow](https://www.tensorflow.org/) to build a deep learning model for the task. We provide a very crude system which solves the task with a low accuracy, and a set of additional functions you will have to use to save and load the model you create so that we can run it.

As we have to run the notebook of each submission, and as deep learning models take a long time to train, your notebook **NEEDS** to conform to the following requirements:
* You **NEED** to run your parameter optimisation offline, and provide your final model saved by using the provided function
* The maximum size of a zip file you can upload to moodle is 160MB. We will **NOT** allow submissions larger than that.
* We do not have time to train your models from scratch! You **NEED** to provide the full code you used to train your model, but you **CANNOT** call the training method in the notebook you send to us.
* We will run these notebooks automatically. If your notebook runs the training procedure, in addition to loading the model, and we need to edit your code to stop the training, you will be penalised with **-20 points**.
* If you do not provide a pretrained model, and rely on training your model on our machines, you will get **0 points**.
* Your notebook needs to be tested on the stat-nlp-book Docker setup to ensure that it does not have any dependencies outside of those that we provide. If your submission fails to adhere to this requirement, you will get **0 points**.

Running time and memory issues:
* We have tested a possible solution on a mid-2014 MacBook Pro, and a few epochs of the model run in less than 3min, so it is possible to train a model on the data in reasonable time. However, be aware that you will need to run these models many times over, for a larger number of epochs. (More elaborate models trained on much larger datasets can train for weeks, but this shouldn't be the case here.) If you find training times too long for your development cycle, you can reduce the training set size. Once you have found a good solution, you can increase the size again. Caveat: model parameters tuned on a smaller dataset may not be optimal for a larger training set.
* In addition to this, as your submission is capped by size, feel free to experiment with different model sizes, numeric values of different precisions, filtering the vocabulary size, downscaling some vectors, etc.
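
As a rough guide to the size trade-offs mentioned above, the sketch below estimates how much space a dense embedding matrix takes for a few settings. The vocabulary sizes and dimensions are hypothetical values chosen for illustration, and the estimate covers only the raw matrix, not the size of the saved model after zipping.

```python
def embedding_size_mb(vocab_size, embedding_dim, bytes_per_value):
    """Approximate size of a dense embedding matrix in megabytes (1 MB = 10**6 bytes)."""
    return vocab_size * embedding_dim * bytes_per_value / 10**6


# Hypothetical vocabulary sizes and dimensions, chosen only for illustration.
for vocab_size, dim, nbytes, label in [
    (30000, 300, 4, "float32"),
    (30000, 300, 2, "float16"),
    (30000, 50, 4, "float32"),
]:
    size = embedding_size_mb(vocab_size, dim, nbytes)
    print("%d x %d %s: %.1f MB" % (vocab_size, dim, label, size))
```

Halving the precision or shrinking the embedding dimension scales the matrix size linearly, which is why these are natural knobs when the upload limit is tight.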

## Hints

A non-exhaustive list of things you might want to give a try:
- better tokenization
- experiment with pre-trained word representations such as [word2vec](https://code.google.com/archive/p/word2vec/) or [GloVe](http://nlp.stanford.edu/projects/glove/). Be aware that these representations might take a lot of parameters in your model. Be sure you use only the words you expect in the training/dev set and account for OOV words. When saving the model parameters, pre-trained word embeddings can simply be used in the word embedding matrix of your model. As said, make sure that this word embedding matrix does not contain all of word2vec or GloVe. Your submission size is limited, and we will not allow uploading or using the whole representation set (up to 3GB!).
- reduced sizes of word representations
- bucketing and batching (our implementation is deliberately not a good one!)
  - make sure to draw random batches from the data! (we do not provide this in our code!)
- better models:
  - stacked RNNs (see tf.nn.rnn_cell.MultiRNNCell)
  - bi-directional RNNs
  - attention
  - word-by-word attention
  - conditional encoding
  - get model inspirations from papers on nlp.stanford.edu/projects/snli/
  - sequence-to-sequence encoder-decoder architecture for producing the right ordering
- better training procedure:
  - different training algorithms
  - dropout on the input and output embeddings (see tf.nn.dropout)
  - L2 regularization (see tf.nn.l2_loss)
  - gradient clipping (see tf.clip_by_value or tf.clip_by_norm)
- model selection:
  - early stopping
- hyper-parameter optimization (e.g. random search or grid search (expensive!))
    - initial learning rate
    - dropout probability
    - input and output size
    - L2 regularization
    - gradient clipping value
    - batch size
    - ...
- post-processing
  - for incorporating consistency constraints
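
One possible reading of the post-processing hint: if a model scores each sentence for each position independently, taking the argmax per sentence can assign two sentences to the same position. The sketch below, which is only an illustration with hypothetical scores and our own function name, enforces a valid ordering by searching over all permutations. With 5 sentences there are only 120 candidates, so brute force is cheap.

```python
from itertools import permutations


def best_permutation(scores):
    """Return the assignment order[i] = position that maximises the total score.

    scores[i][p] is the model's score for putting sentence i at position p.
    """
    n = len(scores)
    return list(max(
        permutations(range(n)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(n)),
    ))


# Hypothetical scores for 3 sentences: an independent argmax would pick
# position 0 for both sentence 0 and sentence 1, which is not a valid order.
scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.15, 0.05],
    [0.1, 0.2, 0.7],
]
print(best_permutation(scores))
```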

## Setup Instructions
It is important that this file is placed in the **correct directory**. It will not run otherwise. The correct directory is

    DIRECTORY_OF_YOUR_BOOK/assignments/2016/assignment3/problem/group_X/
    
where `DIRECTORY_OF_YOUR_BOOK` is a placeholder for the directory you downloaded the book to, and `X` in `group_X` is the number of your group.

After you have placed it there, **rename the notebook file** to `group_X`.

The notebook is pre-set to save models in

    DIRECTORY_OF_YOUR_BOOK/assignments/2016/assignment3/problem/group_X/model/

Be sure not to tinker with that - we expect your submission to contain a `model` subdirectory with a single saved model! 
The saving procedure might overwrite the latest save, or not. Make sure you understand what it does, and upload only a single model! (for more details check tf.train.Saver)

## General Instructions
This notebook will be used by you to provide your solution, and by us to both assess your solution and enter your marks. It contains three types of sections:

1. **Setup** Sections: these sections set up code and resources for assessment. **Do not edit, move nor copy these cells**.
2. **Assessment** Sections: these sections are used for both evaluating the output of your code, and for markers to enter their marks. **Do not edit, move, nor copy these cells**.
3. **Task** Sections: these sections require your solutions. They may contain stub code, and you are expected to edit this code. For free text answers simply edit the markdown field.  

**If you edit, move or copy any of the setup, assessments and mark cells, you will be penalised with -20 points**.

Note that you are free to **create additional notebook cells** within a task section. 

Please **do not share** this assignment nor the dataset publicly, by uploading it online, emailing it to friends etc.

## Submission Instructions

To submit your solution:

* Make sure that your solution is fully contained in this notebook. Make sure you do not use any additional files other than your saved model.
* Make sure that your solution runs linearly from start to end (no execution hops). We will run your notebook in that order.
* **Before you submit, make sure your submission is tested on the stat-nlp-book Docker setup to ensure that it does not have any dependencies outside of those that we provide. If your submission fails to adhere to this requirement, you will get 0 points**.
* **If running your notebook produces a trivially fixable error that we spot, we will correct it and penalise you with -20 points. Otherwise you will get 0 points for that solution.**
* **Rename this notebook to your `group_X`** (where `X` is the number of your group), and adhere to the directory structure requirements, if you have not already done so. **Failure to do so will result in -1 point.**
* Download the notebook in Jupyter via *File -> Download as -> Notebook (.ipynb)*.
* Your submission should be a zip file containing the `group_X` directory, containing `group_X.ipynb` notebook, and the `model` directory with _____
* Upload that file to the Moodle submission site.
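
Before uploading, it may be worth checking the archive against the size cap stated in the requirements. The following is a minimal sketch using only the standard library. The helper name is ours, and treating "160MB" as 160 * 1024 * 1024 bytes is an assumption, so leave some margin.

```python
import os
import tempfile

SIZE_LIMIT_BYTES = 160 * 1024 * 1024  # assumption: "160MB" means 160 MiB


def within_size_limit(path, limit_bytes=SIZE_LIMIT_BYTES):
    """Return True if the file at `path` is no larger than `limit_bytes`."""
    return os.path.getsize(path) <= limit_bytes


# Demonstration on a small temporary file standing in for the real archive.
with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as handle:
    handle.write(b"0" * 1024)
    demo_path = handle.name

print(within_size_limit(demo_path))
os.remove(demo_path)
```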

## <font color='green'>Setup 1</font>: Load Libraries
This cell loads libraries important for evaluation and assessment of your model. **Do not change, move or copy it.**

In [10]:
%%capture
%load_ext autoreload
%autoreload 2
#%matplotlib inline
#! SETUP 1 - DO NOT CHANGE, MOVE NOR COPY
import sys, os
_snlp_book_dir = "../../../../../"
sys.path.append(_snlp_book_dir)
# docker image contains tensorflow 0.10.0rc0. We will support execution of only that version!
import statnlpbook.nn as nn
import random
import tensorflow as tf
import numpy as np

## <font color='green'>Setup 2</font>: Load Training Data

This cell loads the training data. **Do not edit the next cell, nor copy/duplicate it**. Instead refer to the variables in your own code, and slice and dice them as you see fit (but do not change their values). 
For example, no one stops you from introducing, in the corresponding task section, `my_train` and `my_dev` variables that split the data into different folds.   
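
A minimal sketch of such a split, assuming the data is a list of dictionaries as loaded by the setup cell. The function name, the fraction, and the toy data are ours for illustration. The split builds new lists and leaves the original untouched.

```python
import random


def split_folds(data, dev_fraction=0.1, seed=0):
    """Return (my_train, my_dev) as new lists; `data` itself is not modified."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    n_dev = int(len(data) * dev_fraction)
    my_dev = [data[i] for i in indices[:n_dev]]
    my_train = [data[i] for i in indices[n_dev:]]
    return my_train, my_dev


# Toy stand-in for `data_train` (a list of dictionaries).
toy_data = [{"story": ["s%d" % i], "order": [0]} for i in range(20)]
my_train, my_dev = split_folds(toy_data, dev_fraction=0.25)
print(len(my_train), len(my_dev))
```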

In [11]:
#! SETUP 2 - DO NOT CHANGE, MOVE NOR COPY
data_path = _snlp_book_dir + "data/nn/"
data_train = nn.load_corpus(data_path + "train.tsv")
data_dev = nn.load_corpus(data_path + "dev.tsv")
assert(len(data_train) == 45502)

### Data Structures

Notice that the data is loaded from tab-separated files. The files are easy to read, and we provide the loading functions that load it into a simple data structure. Feel free to check details of the loading.

The data structure at hand is a list of dictionaries, each containing a `story` and an `order` entry. `story` is a list of strings, and `order` is a list of integer indices:

In [12]:
data_train[0]

{'order': [3, 2, 1, 0, 4],
 'story': ['His parents understood and decided to make a change.',
  'The doctors told his parents it was unhealthy.',
  'Dan was overweight as well.',
  "Dan's parents were overweight.",
  'They got themselves and Dan on a diet.']}

## <font color='blue'>Task 1</font>: Model implementation



###  Modified preprocessing pipeline function

In [13]:
import numpy as np
import re
import collections

def tokenize(text):
    # Separate "'s" and basic punctuation from words, then split on hyphens and spaces.
    spaced = text.replace("'s"," 's").replace('.',' .').replace(',',' ,').replace('?',' ?').replace('!',' !')
    return re.split("-| ", spaced)

def pipeline(data, vocab=None, max_sent_len_=None):
    is_ext_vocab = True
    if vocab is None:
        is_ext_vocab = False
        vocab = {'<PAD>': 0, '<OOV>': 1}

    max_sent_len = -1
    data_sentences = []
    data_sentences_correctseq = []
    data_crqseq_lengths = []
    data_orders = []
    data_lengths = []
    for instance in data:
        correctseq=collections.defaultdict(list)
        sents = []
        lengths = []
        for index,sentence in enumerate(instance['story']):
            sent = []
            length = []
            tokenized = tokenize(sentence)
            for token in tokenized:
                token = token.lower()
                if not is_ext_vocab and token not in vocab:
                    vocab[token] = len(vocab)
                if token not in vocab:
                    token_id = vocab['<OOV>']
                else:
                    token_id = vocab[token]
                sent.append(token_id)
            if len(sent) > max_sent_len:
                max_sent_len = len(sent)
            sents.append(sent)
            correctseq[instance['order'][index]]=sent          
            lengths.append(len(sent))
        data_lengths.append(lengths)
        data_sentences.append(sents)
        # sort by target position so the sentences come out in the correct story order
        ordered = [sent for _, sent in sorted(correctseq.items())]
        data_sentences_correctseq.append(ordered)
        data_crqseq_lengths.append([len(sent) for sent in ordered])
        data_orders.append(instance['order'])

    if max_sent_len_ is not None:
        max_sent_len = max_sent_len_
    out_sentences = np.full([len(data_sentences), 5, max_sent_len], vocab['<PAD>'], dtype=np.int32)
    out_sentences_correctseq = np.full([len(data_sentences), 5, max_sent_len], vocab['<PAD>'], dtype=np.int32)
    
    for i, elem in enumerate(data_sentences):
        for j, sent in enumerate(elem):
            out_sentences[i, j, 0:len(sent)] = sent
   
    for i, elem in enumerate(data_sentences_correctseq):
        for j, sent in enumerate(elem):
            out_sentences_correctseq[i, j, 0:len(sent)] = sent
    
    sentence_place_order=[]
    for lst in data_orders:
        temp_lst=[]
        for i in range(5):
            temp_lst.append(lst.index(i))
        sentence_place_order.append(temp_lst)
        
    out_orders = np.array(data_orders, dtype=np.int32)
    out_lengths = np.array(data_lengths, dtype=np.int32)
    out_sentence_place_order=np.array(sentence_place_order, dtype=np.int32)
    out_crqseq_lengths=np.array(data_crqseq_lengths, dtype=np.int32)
    return out_sentences,out_sentences_correctseq,out_lengths,out_crqseq_lengths, out_orders,out_sentence_place_order, vocab
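
To see what the tokenizer in this cell produces, here is a small standalone sketch that applies the same splitting rule to two inputs. It is a copy for illustration only. Note that consecutive spaces yield empty-string tokens, which then receive their own vocabulary entry.

```python
import re


def tokenize(text):
    # Same splitting rule as the `tokenize` function in the cell above.
    spaced = (text.replace("'s", " 's").replace('.', ' .').replace(',', ' ,')
                  .replace('?', ' ?').replace('!', ' !'))
    return re.split("-| ", spaced)


print(tokenize("Jan's lamp broke."))
print(tokenize("a  b"))
```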



In [14]:
# convert train set to integer IDs
train_stories,train_stories_seq,train_lengths,train_crtseq_lengths, train_orders,train_positions, vocab = \
pipeline(data_train)
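
The last array returned by `pipeline` (`train_positions` here) is the inverse of `order`: for each slot of the correctly ordered story it stores the index of the input sentence that belongs there. A minimal sketch of that inversion on the example from the Goal section, with a helper name of our own:

```python
def positions_from_order(order):
    """Invert the permutation: result[p] is the index of the sentence placed at position p."""
    return [order.index(p) for p in range(len(order))]


order = [2, 3, 4, 1, 0]   # the example from the Goal section
print(positions_from_order(order))
```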

In [15]:
# get the length of the longest sentence
max_sent_len = train_stories.shape[2]

# convert dev set to integer IDs, based on the train vocabulary and max_sent_len
dev_stories, dev_stories_seq, dev_lengths, dev_crtseq_lengths, dev_orders,dev_positions, _ = \
pipeline(data_dev, vocab=vocab, max_sent_len_=max_sent_len)
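
The hints recommend drawing random batches, which the provided code does not do. One way to do it, sketched here with the standard library only and a generator name of our own, is to shuffle the example indices once per epoch and slice them into batches:

```python
import random


def random_batches(num_examples, batch_size, seed=None):
    """Yield lists of example indices covering a shuffled epoch."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]


batches = list(random_batches(num_examples=10, batch_size=4, seed=0))
print([len(batch) for batch in batches])
```

Each yielded index list can be used to select rows from the NumPy arrays produced above (for example `train_stories[batch]`), so that stories, lengths and orders stay aligned.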

## Model

### Set up the model parameters

In [16]:
target_size = 5
vocab_size = len(vocab)
input_size = 50*6
n, sentence_set_length,max_length = train_stories.shape
projection_size = 5 
output_size = 5
attention_size = 300
encoder_hidden_size = 500
decoder_hidden_size = 500

dropout_rate=0.5

### Creating the GloVe word representation

### Load the GloVe embeddings

In [17]:
#W = np.load('Glove_840B_300.npy')[:vocab_size]

In [18]:
def calculate_accuracy(orders_gold, orders_predicted, i):
    """Return the fraction of stories whose i-th column is predicted correctly."""
    num_columns = np.shape(orders_predicted)[1]
    matches = orders_predicted == orders_gold
    num_correct = np.sum(np.split(matches, num_columns, axis=1)[i])
    num_total = orders_gold.shape[0]
    return num_correct / num_total
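
For intuition, here is a pure-Python sketch of the same per-column accuracy, together with a stricter whole-story variant. Both helper names and the toy orders are ours, and the division assumes Python 3.

```python
def position_accuracy(gold, predicted, i):
    """Fraction of stories whose i-th entry matches the gold order."""
    correct = sum(1 for g, p in zip(gold, predicted) if g[i] == p[i])
    return correct / len(gold)


def exact_match_accuracy(gold, predicted):
    """Fraction of stories whose whole predicted order matches the gold order."""
    correct = sum(1 for g, p in zip(gold, predicted) if list(g) == list(p))
    return correct / len(gold)


gold = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
pred = [[0, 1, 2, 4, 3], [4, 3, 2, 1, 0]]
print(position_accuracy(gold, pred, 0))
print(position_accuracy(gold, pred, 3))
print(exact_match_accuracy(gold, pred))
```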

### Define the model
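
The model cell scores each sentence vector against a recurrent hidden state, normalises the scores with a softmax, and sums the sentence vectors with those weights. Below is a minimal pure-Python sketch of that arithmetic for a single example. The function names and toy values are ours, and the notebook itself performs these steps on batched TensorFlow tensors.

```python
import math


def score(s, h, weight, bias):
    """sum_k (h[k] * weight[k] + bias) * s[k] for one sentence vector s."""
    return sum((hk * wk + bias) * sk for sk, hk, wk in zip(s, h, weight))


def softmax(values):
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_summary(sentence_vectors, h, weight, bias):
    """Softmax-weighted sum of the sentence vectors."""
    weights = softmax([score(s, h, weight, bias) for s in sentence_vectors])
    dim = len(sentence_vectors[0])
    return [sum(w * s[k] for w, s in zip(weights, sentence_vectors))
            for k in range(dim)]


# Tiny hypothetical vectors (2 sentences, hidden size 2).
sentence_vectors = [[1.0, 0.0], [0.0, 1.0]]
summary = attention_summary(sentence_vectors, h=[1.0, 1.0],
                            weight=[1.0, 1.0], bias=0.0)
print(summary)
```

In this toy case both sentences receive the same score, so the softmax weights are equal and the summary is the plain average of the two vectors.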

In [20]:
story = tf.placeholder(tf.int64, [None, None, None], "story")        # [batch_size x 5 x max_length]
length = tf.placeholder(tf.int64, [None, None], "length")             # [batch_size x 5] 
order = tf.placeholder(tf.int64, [None, None], "order")              # [batch_size x 5]
  
batch_size = tf.shape(story)[0]

sentences = [tf.reshape(x, [batch_size, -1]) for x in tf.split(1, 5, story)]  # 5 times [batch_size x max_length]
reshape_length = [tf.reshape(x,[-1]) for x in tf.split(1,5, length)]   # 5 times [batch_size]

## https://www.tensorflow.org/api_docs/python/array_ops/shapes_and_shaping#reshape
# Word embeddings
initializer = tf.random_uniform_initializer(-0.1, 0.1)
embeddings = tf.get_variable("W", [vocab_size, input_size], initializer=initializer,trainable= False)
#embeddings = embeddings.assign(W) 

sentences_embedded = [tf.nn.embedding_lookup(embeddings, sentence)  # [batch_size x max_seq_length x input_size]  
                      for sentence in sentences]              # 5 times[batch_size x max_seq_length x input_size]

sentences_embedded_1 = sentences_embedded[0] #[batch_size x max_seq_length x input_size]
sentences_embedded_2 = sentences_embedded[1]
sentences_embedded_3 = sentences_embedded[2]
sentences_embedded_4 = sentences_embedded[3]
sentences_embedded_5 = sentences_embedded[4]

sequence_length_1 = reshape_length[0]
sequence_length_2 = reshape_length[1]
sequence_length_3 = reshape_length[2]
sequence_length_4 = reshape_length[3]
sequence_length_5 = reshape_length[4]

lstm_cell = tf.nn.rnn_cell.LSTMCell(encoder_hidden_size, state_is_tuple= True)
with tf.variable_scope("sentence_encoder") as varscope:        
    #stacked_cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * 2, state_is_tuple=False)
    _, sentences_5_final_state = tf.nn.dynamic_rnn(lstm_cell, sentences_embedded_5,\
                                                   sequence_length=sequence_length_5, dtype=tf.float32)        
    sentences_5 = sentences_5_final_state.h  
    
    varscope.reuse_variables()  
    _, sentences_4_final_state = tf.nn.dynamic_rnn(lstm_cell, sentences_embedded_4,  \
                                                   sequence_length=sequence_length_4, dtype=tf.float32)        
    sentences_4 = sentences_4_final_state.h

    varscope.reuse_variables() 
    _, sentences_3_final_state = tf.nn.dynamic_rnn(lstm_cell, sentences_embedded_3, \
                                                   sequence_length=sequence_length_3, dtype=tf.float32)        
    sentences_3 = sentences_3_final_state.h

    varscope.reuse_variables()
    _, sentences_2_final_state = tf.nn.dynamic_rnn(lstm_cell, sentences_embedded_2,  \
                                                   sequence_length=sequence_length_2, dtype=tf.float32)        
    sentences_2 = sentences_2_final_state.h

    varscope.reuse_variables() 
    _, sentences_1_final_state = tf.nn.dynamic_rnn(lstm_cell, sentences_embedded_1, \
                                                   sequence_length=sequence_length_1, dtype=tf.float32)        
    sentences_1 = sentences_1_final_state.h
    
#---------------------------------------------------------------------------------------------------------------------------------------#        
#---------------------------------------------------------------------------------------------------------------------------------------# 

story_crtseq = tf.placeholder(tf.int64, [None, None, None], "story_crtseq")        # [batch_size x 5 x max_length]
crt_length = tf.placeholder(tf.int64, [None, None], "crt_length")             # [batch_size x 5] 

sentences_crtseq = [tf.reshape(x, [batch_size, -1]) for x in tf.split(1, 5, story_crtseq)]  # use the correctly ordered story, not the shuffled one
sentences_crtseq_embedded = [tf.nn.embedding_lookup(embeddings, sentence)  # [batch_size x max_seq_length x input_size]  
                      for sentence in sentences_crtseq] 
    
reshape_crt_length = [tf.reshape(x,[-1]) for x in tf.split(1,5, crt_length)]   # 5 times [batch_size]

sentences_crtseq_embedded_1 = sentences_crtseq_embedded[0] 
sentences_crtseq_embedded_2 = sentences_crtseq_embedded[1]
sentences_crtseq_embedded_3 = sentences_crtseq_embedded[2]
sentences_crtseq_embedded_4 = sentences_crtseq_embedded[3]
sentences_crtseq_embedded_5 = sentences_crtseq_embedded[4]        

crt_sequence_length_1 = reshape_crt_length[0]
crt_sequence_length_2 = reshape_crt_length[1]
crt_sequence_length_3 = reshape_crt_length[2]
crt_sequence_length_4 = reshape_crt_length[3]
crt_sequence_length_5 = reshape_crt_length[4] 

with tf.variable_scope("sentence_encoder") as varscope:  
    varscope.reuse_variables() 
    _, sentences_crtseq_5_final_state = tf.nn.dynamic_rnn(lstm_cell, sentences_crtseq_embedded_5, \
                                                   sequence_length=crt_sequence_length_5, dtype=tf.float32)        
    sentences_crtseq_5 = sentences_crtseq_5_final_state.h
    
    varscope.reuse_variables()  
    _, sentences_crtseq_4_final_state = tf.nn.dynamic_rnn(lstm_cell, sentences_crtseq_embedded_4,  \
                                                   sequence_length=crt_sequence_length_4, dtype=tf.float32)        
    sentences_crtseq_4 = sentences_crtseq_4_final_state.h

    varscope.reuse_variables() 
    _, sentences_crtseq_3_final_state = tf.nn.dynamic_rnn(lstm_cell, sentences_crtseq_embedded_3, \
                                                   sequence_length=crt_sequence_length_3, dtype=tf.float32)        
    sentences_crtseq_3 = sentences_crtseq_3_final_state.h

    varscope.reuse_variables()
    _, sentences_crtseq_2_final_state = tf.nn.dynamic_rnn(lstm_cell, sentences_crtseq_embedded_2,  \
                                                   sequence_length=crt_sequence_length_2, dtype=tf.float32)        
    sentences_crtseq_2 = sentences_crtseq_2_final_state.h

    varscope.reuse_variables() 
    _, sentences_crtseq_1_final_state = tf.nn.dynamic_rnn(lstm_cell, sentences_crtseq_embedded_1,\
                                                   sequence_length=crt_sequence_length_1, dtype=tf.float32)        
    sentences_crtseq_1 = sentences_crtseq_1_final_state.h

#---------------------------------------------------------------------------------------------------------------------------------------#        
#---------------------------------------------------------------------------------------------------------------------------------------# 
sentence_pack = tf.pack([sentences_1,sentences_2,sentences_3,sentences_4,sentences_5],axis=1)    
ones = tf.pack([tf.ones([batch_size,encoder_hidden_size], dtype=tf.float32)], axis=1) 
zeros = tf.pack([tf.zeros([batch_size,encoder_hidden_size], dtype=tf.float32)], axis=1) 

seq_decoder_1 = tf.pack([sentences_crtseq_1], axis=1) 
seq_decoder_2 = tf.pack([sentences_crtseq_2], axis=1) 
seq_decoder_3 = tf.pack([sentences_crtseq_3], axis=1) 
seq_decoder_4 = tf.pack([sentences_crtseq_4], axis=1) 
seq_decoder_5 = tf.pack([sentences_crtseq_5], axis=1) 

def score_functions(s,h, weight, bias):
    # h :[batch_size, hidden_size(500)]
    inside_part =tf.add(h*weight,bias) 
    return tf.reduce_sum(inside_part * s,1)

score_weight = tf.Variable(tf.random_normal([decoder_hidden_size]),name="score_weight")

def score_sumup(s,a):
    a_reshape = tf.pack([a],axis=2)
    dot_product = s*a_reshape
    return tf.reduce_sum(dot_product,1)    

score_weight = tf.Variable(tf.random_normal([decoder_hidden_size]),name="score_weight")
score_bias   = tf.Variable(tf.random_normal([1]),name="score_bias")


with tf.variable_scope("encoder") as varscope:  
    _, encoder_hidden_state_initial = tf.nn.dynamic_rnn(lstm_cell,zeros, dtype=tf.float32)  
    encoder_hidden_state_0_h = encoder_hidden_state_initial.h
    
en_0_1 = score_functions(sentences_1,encoder_hidden_state_0_h,score_weight, score_bias)
en_0_2 = score_functions(sentences_2,encoder_hidden_state_0_h,score_weight, score_bias)
en_0_3 = score_functions(sentences_3,encoder_hidden_state_0_h,score_weight, score_bias)
en_0_4 = score_functions(sentences_4,encoder_hidden_state_0_h,score_weight, score_bias)
en_0_5 = score_functions(sentences_5,encoder_hidden_state_0_h,score_weight, score_bias)

en_0 = tf.nn.softmax(tf.concat(1,tf.pack([en_0_1,en_0_2,en_0_3,en_0_4,en_0_5], axis=1)))
s = tf.pack([score_sumup(sentence_pack,en_0)],1)


for i in range(5):
    with tf.variable_scope("encoder") as varscope:  
        varscope.reuse_variables()  
        _, encoder_hidden_state = tf.nn.dynamic_rnn(lstm_cell,s,initial_state = encoder_hidden_state_initial, dtype=tf.float32)  
        encoder_hidden_state_h_hat = encoder_hidden_state.h
        encoder_hidden_state_c = encoder_hidden_state.c
    
    en_1_1 = score_functions(sentences_1,encoder_hidden_state_h_hat,score_weight, score_bias)
    en_1_2 = score_functions(sentences_2,encoder_hidden_state_h_hat,score_weight, score_bias)
    en_1_3 = score_functions(sentences_3,encoder_hidden_state_h_hat,score_weight, score_bias)
    en_1_4 = score_functions(sentences_4,encoder_hidden_state_h_hat,score_weight, score_bias)
    en_1_5 = score_functions(sentences_5,encoder_hidden_state_h_hat,score_weight, score_bias)

    en_1 = tf.nn.softmax(tf.concat(1,tf.pack([en_1_1,en_1_2,en_1_3,en_1_4,en_1_5], axis=1)))
    s = tf.pack([score_sumup(sentence_pack,en_1)],1)
    
    encoder_hidden_state_initial = encoder_hidden_state
    
#---------------------------------------------------------------------------------------------------------------------------------------#        
#---------------------------------------------------------------------------------------------------------------------------------------#     
initial_decoder_parameters = tf.Variable(tf.random_normal([encoder_hidden_size]),name="initial_decoder_parameters")
initial_decoder = ones * initial_decoder_parameters

lstm_decoder_cell = tf.nn.rnn_cell.LSTMCell(decoder_hidden_size, state_is_tuple= True)
lstm_decoder_cell = tf.nn.rnn_cell.DropoutWrapper(lstm_decoder_cell, output_keep_prob=dropout_rate)
with tf.variable_scope("decoder") as varscope:         
    _, decoder_final_state_5= tf.nn.dynamic_rnn(lstm_decoder_cell, zeros,\
                                                 initial_state =  encoder_hidden_state, dtype=tf.float32)
    decoder_hidden_5 = decoder_final_state_5.h
    
    varscope.reuse_variables()  
    _, decoder_final_state_4 = tf.nn.dynamic_rnn(lstm_decoder_cell, seq_decoder_1,\
                                                 initial_state = decoder_final_state_5,dtype=tf.float32)
    decoder_hidden_4 = decoder_final_state_4.h
    
    varscope.reuse_variables() 
    _, decoder_final_state_3= tf.nn.dynamic_rnn(lstm_decoder_cell, seq_decoder_2,\
                                                initial_state = decoder_final_state_4,dtype=tf.float32)
    decoder_hidden_3 = decoder_final_state_3.h
    
    varscope.reuse_variables() 
    _, decoder_final_state_2= tf.nn.dynamic_rnn(lstm_decoder_cell, seq_decoder_3,\
                                                initial_state = decoder_final_state_3,dtype=tf.float32)
    decoder_hidden_2 = decoder_final_state_2.h
    
    varscope.reuse_variables() 
    _, decoder_final_state_1= tf.nn.dynamic_rnn(lstm_decoder_cell, seq_decoder_4,\
                                                initial_state = decoder_final_state_2,dtype=tf.float32)
    decoder_hidden_1 = decoder_final_state_1.h
    

decoder_hidden_5=tf.nn.dropout(decoder_hidden_5,dropout_rate)
decoder_hidden_4=tf.nn.dropout(decoder_hidden_4,dropout_rate)
decoder_hidden_3=tf.nn.dropout(decoder_hidden_3,dropout_rate)
decoder_hidden_2=tf.nn.dropout(decoder_hidden_2,dropout_rate)
decoder_hidden_1=tf.nn.dropout(decoder_hidden_1,dropout_rate)

e_5_1 = score_functions(sentences_1,decoder_hidden_5, score_weight, score_bias)
e_5_2 = score_functions(sentences_2,decoder_hidden_5, score_weight, score_bias)
e_5_3 = score_functions(sentences_3,decoder_hidden_5, score_weight, score_bias)
e_5_4 = score_functions(sentences_4,decoder_hidden_5, score_weight, score_bias)
e_5_5 = score_functions(sentences_5,decoder_hidden_5, score_weight, score_bias)   
a_5 = tf.concat(1,tf.pack([e_5_1,e_5_2,e_5_3,e_5_4,e_5_5], axis=1))
a_5 = tf.nn.dropout(a_5,0.75)

e_4_1 = score_functions(sentences_1,decoder_hidden_4, score_weight, score_bias)
e_4_2 = score_functions(sentences_2,decoder_hidden_4, score_weight, score_bias)
e_4_3 = score_functions(sentences_3,decoder_hidden_4, score_weight, score_bias)
e_4_4 = score_functions(sentences_4,decoder_hidden_4, score_weight, score_bias)
e_4_5 = score_functions(sentences_5,decoder_hidden_4, score_weight, score_bias)
a_4 = tf.concat(1,tf.pack([e_4_1,e_4_2,e_4_3,e_4_4,e_4_5], axis=1))
a_4 = tf.nn.dropout(a_4,0.75)

e_3_1 = score_functions(sentences_1,decoder_hidden_3, score_weight, score_bias)
e_3_2 = score_functions(sentences_2,decoder_hidden_3, score_weight, score_bias)
e_3_3 = score_functions(sentences_3,decoder_hidden_3, score_weight, score_bias)
e_3_4 = score_functions(sentences_4,decoder_hidden_3, score_weight, score_bias)
e_3_5 = score_functions(sentences_5,decoder_hidden_3, score_weight, score_bias)
a_3 = tf.concat(1,tf.pack([e_3_1,e_3_2,e_3_3,e_3_4,e_3_5], axis=1))
a_3 = tf.nn.dropout(a_3,0.75)

e_2_1 = score_functions(sentences_1,decoder_hidden_2, score_weight, score_bias)
e_2_2 = score_functions(sentences_2,decoder_hidden_2, score_weight, score_bias)
e_2_3 = score_functions(sentences_3,decoder_hidden_2, score_weight, score_bias)
e_2_4 = score_functions(sentences_4,decoder_hidden_2, score_weight, score_bias)
e_2_5 = score_functions(sentences_5,decoder_hidden_2, score_weight, score_bias)
a_2 = tf.concat(1,tf.pack([e_2_1,e_2_2,e_2_3,e_2_4,e_2_5], axis=1))
a_2 = tf.nn.dropout(a_2,0.75)

e_1_1 = score_functions(sentences_1,decoder_hidden_1, score_weight, score_bias)
e_1_2 = score_functions(sentences_2,decoder_hidden_1, score_weight, score_bias)
e_1_3 = score_functions(sentences_3,decoder_hidden_1, score_weight, score_bias)
e_1_4 = score_functions(sentences_4,decoder_hidden_1, score_weight, score_bias)
e_1_5 = score_functions(sentences_5,decoder_hidden_1, score_weight, score_bias)
a_1 = tf.concat(1,tf.pack([e_1_1,e_1_2,e_1_3,e_1_4,e_1_5], axis=1))
a_1 = tf.nn.dropout(a_1,0.75)

#---------------------------------------------------------------------------------------------------------------------------------------#        
#---------------------------------------------------------------------------------------------------------------------------------------#     

position = tf.placeholder(tf.int64, [None, None], "position")             # [batch_size x 5]    

reshape_position = [tf.reshape(x,[-1]) for x in tf.split(1,5, position)] 
position_1 = reshape_position[0]
position_2 = reshape_position[1]
position_3 = reshape_position[2]
position_4 = reshape_position[3]
position_5 = reshape_position[4]

loss_1 = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(a_5, position_1))
loss_2 = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(a_4, position_2))
loss_3 = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(a_3, position_3))
loss_4 = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(a_2, position_4))
loss_5 = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(a_1, position_5))

loss_2 =  0.8*loss_1+ 0.8* loss_2 + 1.5*loss_3 + 1.5*loss_4+ 1.2*loss_5 
opt_op_2 = tf.train.AdamOptimizer(1e-3).minimize(loss_2)

#---------------------------------------------------------------------------------------------------------------------------------------#        
#---------------------------------------------------------------------------------------------------------------------------------------#     
sentence_encoder = tf.pack([sentences_1,sentences_2,sentences_3,sentences_4,sentences_5], axis=1)  #[batch_size x 5 x encoder_hidden_size]

def setence_select(logits, indices):
    batch_size = tf.shape(logits)[0]
    rows_per_batch = tf.shape(logits)[1]
    indices_per_batch = tf.shape(indices)[1]

    # Offset to add to each row in indices. We use `tf.expand_dims()` to make 
    # this broadcast appropriately.
    offset = tf.expand_dims(tf.range(0, batch_size) * rows_per_batch, 1)
    
    # Convert indices and logits into appropriate form for `tf.gather()`. 
    flattened_indices = tf.reshape(indices + offset, [-1])
    flattened_logits = tf.reshape(logits, tf.concat(0, [[-1], tf.shape(logits)[2:]]))
    
    selected_rows = tf.gather(flattened_logits, flattened_indices)

    return tf.reshape(selected_rows,tf.concat(0, [tf.pack([batch_size, indices_per_batch]), tf.shape(logits)[2:]]))

#---------------------------------------------------------------------------------------------------------------------------------------#        
#---------------------------------------------------------------------------------------------------------------------------------------#     
        
with tf.variable_scope("decoder") as varscope:    
    varscope.reuse_variables() 
    _, decoder_predict_5= tf.nn.dynamic_rnn(lstm_decoder_cell, zeros,\
                                            initial_state =  encoder_hidden_state, dtype=tf.float32)
    predict_hidden_state_5 = decoder_predict_5.h

p_5_1 = score_functions(sentences_1, predict_hidden_state_5, score_weight, score_bias)
p_5_2 = score_functions(sentences_2, predict_hidden_state_5, score_weight, score_bias)
p_5_3 = score_functions(sentences_3, predict_hidden_state_5, score_weight, score_bias)
p_5_4 = score_functions(sentences_4, predict_hidden_state_5, score_weight, score_bias)
p_5_5 = score_functions(sentences_5, predict_hidden_state_5, score_weight, score_bias)

p_5 = tf.pack([tf.argmax(tf.nn.softmax(tf.concat(1,tf.pack([p_5_1,p_5_2,p_5_3,p_5_4,p_5_5], axis=1))),1)], axis=1)
p_5 = tf.cast(p_5, tf.int32)   

predicted_sentence_5 = setence_select(sentence_encoder, p_5)
predicted_sentence_5.set_shape([None, None, encoder_hidden_size])


with tf.variable_scope("decoder") as varscope:    
    varscope.reuse_variables() 
    _, decoder_predict_4= tf.nn.dynamic_rnn(lstm_decoder_cell, predicted_sentence_5, \
                                            initial_state = decoder_predict_5, dtype=tf.float32)
    predict_hidden_state_4 = decoder_predict_4.h

p_4_1 = score_functions(sentences_1, predict_hidden_state_4, score_weight, score_bias)
p_4_2 = score_functions(sentences_2, predict_hidden_state_4, score_weight, score_bias)
p_4_3 = score_functions(sentences_3, predict_hidden_state_4, score_weight, score_bias)
p_4_4 = score_functions(sentences_4, predict_hidden_state_4, score_weight, score_bias)
p_4_5 = score_functions(sentences_5, predict_hidden_state_4, score_weight, score_bias)

p_4 = tf.pack([tf.argmax(tf.nn.softmax(tf.concat(1,tf.pack([p_4_1,p_4_2,p_4_3,p_4_4,p_4_5], axis=1))),1)], axis=1)  
p_4 = tf.cast(p_4, tf.int32)

predicted_sentence_4 = setence_select(sentence_encoder, p_4)
predicted_sentence_4.set_shape([None, None, encoder_hidden_size])

with tf.variable_scope("decoder") as varscope:    
    varscope.reuse_variables() 
    _, decoder_predict_3= tf.nn.dynamic_rnn(lstm_decoder_cell, predicted_sentence_4, \
                                            initial_state = decoder_predict_4 ,dtype=tf.float32)
    predict_hidden_state_3 = decoder_predict_3.h

p_3_1 = score_functions(sentences_1,predict_hidden_state_3, score_weight, score_bias)
p_3_2 = score_functions(sentences_2,predict_hidden_state_3, score_weight, score_bias)
p_3_3 = score_functions(sentences_3,predict_hidden_state_3, score_weight, score_bias)
p_3_4 = score_functions(sentences_4,predict_hidden_state_3, score_weight, score_bias)
p_3_5 = score_functions(sentences_5,predict_hidden_state_3, score_weight, score_bias)

p_3 = tf.pack([tf.argmax(tf.nn.softmax(tf.concat(1,tf.pack([p_3_1,p_3_2,p_3_3,p_3_4,p_3_5], axis=1))),1)], axis=1) 
p_3 = tf.cast(p_3, tf.int32)

predicted_sentence_3 = setence_select(sentence_encoder, p_3)
predicted_sentence_3.set_shape([None, None, encoder_hidden_size])


with tf.variable_scope("decoder") as varscope:    
    varscope.reuse_variables() 
    _, decoder_predict_2= tf.nn.dynamic_rnn(lstm_decoder_cell, predicted_sentence_3, \
                                            initial_state = decoder_predict_3, dtype=tf.float32)
    predict_hidden_state_2 = decoder_predict_2.h

p_2_1 = score_functions(sentences_1,predict_hidden_state_2, score_weight, score_bias)
p_2_2 = score_functions(sentences_2,predict_hidden_state_2, score_weight, score_bias)
p_2_3 = score_functions(sentences_3,predict_hidden_state_2, score_weight, score_bias)
p_2_4 = score_functions(sentences_4,predict_hidden_state_2, score_weight, score_bias)
p_2_5 = score_functions(sentences_5,predict_hidden_state_2, score_weight, score_bias)

p_2 = tf.pack([tf.argmax(tf.nn.softmax(tf.concat(1,tf.pack([p_2_1,p_2_2,p_2_3,p_2_4,p_2_5], axis=1))),1)], axis=1)
p_2 = tf.cast(p_2, tf.int32)

predicted_sentence_2 = setence_select(sentence_encoder, p_2)
predicted_sentence_2.set_shape([None, None, encoder_hidden_size])

with tf.variable_scope("decoder") as varscope:    
    varscope.reuse_variables() 
    _, decoder_predict_1= tf.nn.dynamic_rnn(lstm_decoder_cell, predicted_sentence_2, \
                                            initial_state = decoder_predict_2, dtype=tf.float32)
    predict_hidden_state_1 = decoder_predict_1.h

p_1_1 = score_functions(sentences_1,predict_hidden_state_1, score_weight, score_bias)
p_1_2 = score_functions(sentences_2,predict_hidden_state_1, score_weight, score_bias)
p_1_3 = score_functions(sentences_3,predict_hidden_state_1, score_weight, score_bias)
p_1_4 = score_functions(sentences_4,predict_hidden_state_1, score_weight, score_bias)
p_1_5 = score_functions(sentences_5,predict_hidden_state_1, score_weight, score_bias)

p_1 = tf.pack([tf.argmax(tf.nn.softmax(tf.concat(1,tf.pack([p_1_1,p_1_2,p_1_3,p_1_4,p_1_5], axis=1))),1)], axis=1)
p_1 = tf.cast(p_1, tf.int32)

predict = tf.concat(1,[p_5, p_4, p_3, p_2, p_1])   

   

TypeError: Input 'split_dim' of 'Split' Op has type int64 that does not match expected type of int32.

## <font color='red'>Assessment 1</font>: Assess Accuracy (50 pts) 

We assess how well your model performs on an unseen test set. We will look at the accuracy of the predicted sentence order at the sentence level and score it as follows:

* 0 - 20 pts: 45% <= accuracy < 50%, linear
* 20 - 40 pts: 50% <= accuracy < 55%, linear
* 40 - 70 pts: 55% <= accuracy < Best Result, linear

The **linear** mapping maps any accuracy value between the lower and upper bound linearly to a score. For example, if your model's accuracy score is $acc=54.5\%$, then your score is $20 + 20\frac{acc-50}{55-50}$.
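As a quick check of this mapping, here is a minimal Python sketch. The function name `linear_score` and its arguments are illustrative and not part of the assessment code; it only restates the formula for a single band.

```python
def linear_score(acc, acc_lo, acc_hi, pts_lo, pts_hi):
    """Map an accuracy in [acc_lo, acc_hi) linearly onto [pts_lo, pts_hi)."""
    if not acc_lo <= acc < acc_hi:
        raise ValueError("accuracy lies outside this band")
    fraction = (acc - acc_lo) / (acc_hi - acc_lo)
    return pts_lo + (pts_hi - pts_lo) * fraction


# The worked example from the text: 54.5% falls in the 50%-55% band (20-40 pts).
example = linear_score(54.5, 50.0, 55.0, 20.0, 40.0)
print(example)
```

For $acc = 54.5\%$ the formula gives $20 + 20 \cdot 0.9 = 38$ points.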

The *Best Result* accuracy is the maximum of the best accuracy the course organiser achieved and the submitted accuracy scores.

Change the following lines so that they construct the test set in the same way you constructed the dev set in the code above. We will insert the test set instead of the dev set here. The `test_feed_dict` variable must keep its name.

In [None]:
# LOAD THE DATA
data_test = nn.load_corpus(data_path + "dev.tsv")
# make sure you process this with the same pipeline as you processed your dev set
test_stories, test_stories_seq, test_lengths, test_crtseq_lengths,_,test_orders, _ = \
pipeline(data_test, vocab=vocab, max_sent_len_=max_sent_len)

# THIS VARIABLE MUST BE NAMED `test_feed_dict`
test_feed_dict = {story: test_stories, length: test_lengths}

The following code loads your model, computes accuracy, and exports the result. **DO NOT** change this code.

In [None]:
#! ASSESSMENT 1 - DO NOT CHANGE, MOVE NOR COPY
with tf.Session() as sess:
    # LOAD THE MODEL
    saver = tf.train.Saver()   
    saver.restore(sess, './model/model.checkpoint')
    
    # RUN TEST SET EVALUATION
    test_predicted = sess.run(predict, feed_dict=test_feed_dict)
    test_accuracy = nn.calculate_accuracy(test_orders, test_predicted)

test_accuracy

## <font color='orange'>Mark</font>:  Your solution to Task 1 is marked with ** __ points**. 
---

## <font color='blue'>Task 2</font>: Describe your Approach

Enter a description of your approach (750 words max) **in this cell**.
Make sure to provide:
- an **error analysis** of the types of errors your system makes
- a comparison of your system with the model we provide, focusing on the differences and drawing useful comparisons between them

Should you need to include figures in your report, make sure they are Python-generated. For that, feel free to create new cells after this cell (before Assessment 2 cell). Link online images at your risk.


##  Model Description:
The basic idea of our model is to use pre-trained word embeddings to capture semantics and feed them into an RNN for sequence modelling. The approach is inspired by the method introduced in [Sentence Ordering using Recurrent Neural Networks (Logeswaran et al., 2017)](https://arxiv.org/pdf/1611.02654v1.pdf). The model comprises a sentence encoder RNN, an encoder RNN and a decoder RNN:

**Sentence Encoder**: The pipeline function is modified so that it produces the correct order and lengths of the sentences for the sequence-to-sequence model. After the pipeline has run, we construct an RNN as the sentence encoder, which takes the words of a sentence $s$ sequentially as input and computes the sentence representation.

**Encoder**: We use an RNN as the encoder, which attends to the sentence embeddings, computes an attention readout at each step, and appends it to the current hidden state.

The structure is defined by Equations $(1)-(5)$. Initially the LSTM takes a zero vector as input. After updating the regular LSTM state $(h_{enc}^t, c_{enc}^t)$ (Equation $(1)$), we score each sentence embedding against the hidden state with a scoring function $f$ (Equations $(2)$ and $(5)$), take the $softmax$ of the scores to obtain attention probabilities (Equation $(3)$), and compute the attention readout vector $s_{att}^t$ as the probability-weighted sum of the sentence embeddings (Equation $(4)$). The readout $s_{att}^t$ is then used as the LSTM input for the next time step. The process is repeated for a fixed number of steps.


$$(1)\  h_{enc}^t,c_{enc}^t = LSTM (h_{enc}^{t-1},c_{enc}^{t-1},s_{att}^{t-1}) $$

$$(2)\  e_{enc}^{t,i}=f(s_i,{h}_{enc}^t);i\in \{1,...,n\}$$

$$(3)\  a_{enc}^{t} = Softmax(e_{enc}^t)$$

$$(4)\  s_{att}^t = \sum_{i=1}^n a_{enc}^{t,i} s_i $$

$$(5)\  f(s,h) = s^T (Wh)$$
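
Equations $(2)-(5)$ can be sketched in plain Python as follows. This is an illustrative, dependency-free sketch and not the notebook's TensorFlow code: the helper names (`softmax`, `bilinear_score`, `attention_readout`) and the toy values are assumptions, and the score follows Equation $(5)$ literally.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (Equation 3)."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_score(s, W, h):
    """f(s, h) = s^T (W h) for list-based vectors and a row-major W (Equation 5)."""
    Wh = [sum(w * h_j for w, h_j in zip(row, h)) for row in W]
    return sum(s_k * v for s_k, v in zip(s, Wh))


def attention_readout(sentences, W, h):
    """Return the attention weights and the readout vector (Equations 2-4)."""
    scores = [bilinear_score(s, W, h) for s in sentences]
    weights = softmax(scores)
    dim = len(sentences[0])
    readout = [sum(a * s[k] for a, s in zip(weights, sentences)) for k in range(dim)]
    return weights, readout


# Toy example: two 2-dimensional sentence embeddings and an identity W.
sentences = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
h = [1.0, 0.0]
weights, readout = attention_readout(sentences, W, h)
print(weights, readout)
```

In this toy case the first sentence is aligned with $h$, so it receives the larger attention weight and dominates the readout.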

**Decoder**: We construct a separate RNN as the decoder, which produces the target sequence conditioned on the representation produced by the encoder. The decoder uses its attention weights for prediction.

The structure of the decoder is given by Equations $(6)-(8)$. The LSTM takes the embedding of the previous sentence as input. The attention probability $a_{dec}^{t,i}$ is computed in the same way as in the encoder. The decoder LSTM state is initialised with the final hidden state of the encoder. During training the correct order of sentences is used as input, while during prediction we feed in the previously predicted sentences. $x^0$ is set to a zero vector.

$$(6)\  h_{dec}^t,c_{dec}^t  =LSTM (h_{dec}^{t-1},c_{dec}^{t-1},x^{t-1}) $$

$$(7)\  e_{dec}^{t,i} = f(s_i,h_{dec}^t); i \in \{1,...,n\}$$

$$(8)\  a_{dec}^t = Softmax(e_{dec}^t) $$
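
The prediction-time behaviour of the decoder (feeding the previously predicted sentence back in as the next input) can be sketched as a greedy loop. This is a hedged illustration rather than the notebook's implementation: `step_fn` and `score_fn` stand in for the LSTM update of Equation $(6)$ and the scoring function $f$, and the `forbid_repeats` option is an assumption added here, since a plain argmax at every step can select the same sentence twice.

```python
def greedy_order(sentences, init_state, step_fn, score_fn, forbid_repeats=False):
    """Greedily pick one sentence index per decoding step.

    step_fn(state, x) -> new state   (stand-in for the decoder LSTM)
    score_fn(s, state) -> float      (stand-in for f(s, h))
    """
    state = init_state
    x = [0.0] * len(sentences[0])  # the first input is a zero vector
    chosen = []
    for _ in range(len(sentences)):
        state = step_fn(state, x)
        scores = [score_fn(s, state) for s in sentences]
        if forbid_repeats:
            # Hypothetical extension: mask sentences that were already selected.
            for i in chosen:
                scores[i] = float("-inf")
        best = max(range(len(scores)), key=scores.__getitem__)
        chosen.append(best)
        x = sentences[best]  # feed the predicted sentence into the next step
    return chosen


# Toy stand-ins, chosen only to make the example runnable.
def add_step(state, x):
    return [a + b for a, b in zip(state, x)]


def dot(s, state):
    return sum(a * b for a, b in zip(s, state))


toy_sentences = [[1.0, 0.0], [0.0, 1.0]]
plain = greedy_order(toy_sentences, [1.0, 0.0], add_step, dot)
masked = greedy_order(toy_sentences, [1.0, 0.0], add_step, dot, forbid_repeats=True)
print(plain, masked)
```

With these toy stand-ins the unmasked loop picks sentence 0 twice, while the masked variant returns a valid permutation.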
 
 
$$
\  
$$
The **Model overview** is shown below ([flickr link](https://www.flickr.com/photos/147273529@N04/31935764933/in/dateposted-public/)):

<a data-flickr-embed="true"  href="https://www.flickr.com/photos/147273529@N04/31935764933/in/dateposted-public/" title="Seq2Seq"><img src="https://c1.staticflickr.com/1/630/31935764933_346c8793e5_k.jpg" width="2048" height="1152" alt="Seq2Seq"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>

    
## Training and Prediction:
**Model Training:** Pre-trained 300-dimensional GloVe vectors are used as word embeddings, and all LSTM cells in the encoder and decoder have a hidden layer size of 500. The number of iterations in the encoder cell is set to 5. We minimise the cross-entropy loss using the Adam optimiser with a learning rate of 1e-4 and a batch size of 25. Regularisation is implemented by applying dropout to the sentence representations and the decoder cells with a rate of 0.5. Early stopping is also applied for regularisation.
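
Written out for a single story, the cross-entropy objective over the decoder's attention probabilities can be expressed as below. This is a sketch under assumptions: $o_t$ (the index of the gold sentence for decoding step $t$) and the per-step weights $\lambda_t$ are notation introduced here. Setting $\lambda_t = 1$ gives the plain cross entropy, while the code above combines the five per-step losses with weights between 0.8 and 1.5.

$$\mathcal{L} = -\sum_{t=1}^{n} \lambda_t \log a_{dec}^{t,o_t}$$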

**Model Predictions:** During testing, we found that passing the sentences to the decoder in reverse order during training (ending sentence first, starting sentence last) gave better predictions for the last two positions, whereas passing them in forward order gave better predictions for the first two positions. We therefore combined one forward and one backward model, which share no parameters, into our final model: the first three sentences are predicted by the forward model and the last two by the backward model. This improved the accuracy by 2%.
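
The combination step can be illustrated with a short sketch. It assumes, hypothetically, that both models' outputs have already been brought into the same story-order layout (one predicted sentence index per position); the function name `combine_predictions` and the `split` parameter are illustrative and not taken from the notebook.

```python
def combine_predictions(forward_pred, backward_pred, split=3):
    """Take the first `split` positions from the forward model and the rest
    from the backward model. Both inputs list one sentence index per
    position, in story order."""
    if len(forward_pred) != len(backward_pred):
        raise ValueError("predictions must have the same length")
    return list(forward_pred[:split]) + list(backward_pred[split:])


forward = [3, 0, 1, 2, 4]   # made-up forward-model prediction
backward = [0, 3, 2, 1, 4]  # made-up backward-model prediction
combined = combine_predictions(forward, backward)
print(combined)
```

Because the two halves come from independent models, the combined sequence is not guaranteed to be a permutation, as the made-up example shows (sentence 1 appears twice).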

## Comparison:
$\bullet$ Our model uses a completely different prediction structure compared to the provided model: instead of directly predicting which position each sentence belongs to, it predicts which sentence belongs to each position. In other words, it points positions to sentences instead of pointing sentences to positions.

$\bullet$ Instead of the direct discriminative approach of the provided model, our model predicts the order in a more interpretable way: the sentence encoder first creates the sentence representations, the encoder then reads through the whole set of sentences, and finally the decoder predicts the sentences one by one, in order.

$\bullet$ Our model generalises better than the original model thanks to its sequence-to-sequence design; it can easily be extended to predict the order of variable-length sets of sentences.

$\bullet$ Finally, the provided model predicts the position of each sentence independently and therefore fails to capture the correlations between sentences. In our model, each prediction is conditioned on the previously predicted sentences.

## Error Analysis:
| Type | 1st sentence | 2nd sentence | 3rd sentence | 4th sentence | 5th sentence |
|------|---|---|---|---|---|
| **Dev Accuracy** | 0.890967397114 | 0.595403527525 | 0.432923570283 | 0.390700160342 | 0.539283805452 |

The model performs quite well at predicting the first sentence and reasonably well at predicting the second and last sentences, but it is poor at predicting the third and fourth sentences. Moreover, 65.3% of the misclassified 3rd sentences actually belong in 4th position, while 68.5% of the misclassified 4th sentences belong in 3rd position. This is consistent with how humans predict sentence order: it is easy to identify the first and last sentences without any prior knowledge, but predicting the third and fourth sentences is not an easy task.
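
Per-position figures of this kind can be computed with a few lines of Python. The sketch below rests on an assumption about the data layout (each row holds the sentence indices of one story, position by position, for the gold and the predicted order); `per_position_accuracy` and `confusion_share` are hypothetical helpers, not part of the provided `nn` module.

```python
def per_position_accuracy(gold, predicted):
    """Fraction of stories whose entry at each position matches the gold entry."""
    n_positions = len(gold[0])
    hits = [0] * n_positions
    for g_row, p_row in zip(gold, predicted):
        for k in range(n_positions):
            hits[k] += int(g_row[k] == p_row[k])
    return [h / len(gold) for h in hits]


def confusion_share(gold, predicted, pos, other):
    """Among errors at position `pos`, the share whose predicted sentence
    truly belongs at position `other`."""
    errors = [(g, p) for g, p in zip(gold, predicted) if g[pos] != p[pos]]
    if not errors:
        return 0.0
    swapped = sum(1 for g, p in errors if g.index(p[pos]) == other)
    return swapped / len(errors)


# Made-up toy data: two stories of five sentences each.
gold = [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]
pred = [[0, 1, 3, 2, 4], [0, 2, 1, 3, 4]]
accuracy = per_position_accuracy(gold, pred)
share = confusion_share(gold, pred, pos=2, other=3)
print(accuracy, share)
```

Positions are zero-based here, so `pos=2, other=3` asks how often an error in the 3rd position is really the 4th sentence.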


## References:
Logeswaran, L., Lee, H. and Radev, D. (2017). Sentence Ordering using Recurrent Neural Networks.

Sutskever, I., Vinyals, O. and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks.

Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kočiský, T. and Blunsom, P. (2015). Reasoning about Entailment with Neural Attention.

Liu, Y., Sun, C., Lin, L. and Wang, X. (2016). Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention.


$$
\
\
$$

## <font color='red'>Assessment 2</font>: Assess Description (30 pts) 

We will mark the description along the following dimensions: 

* Clarity (10pts: very clear, 0pts: we can't figure out what you did, or you did nothing)
* Creativity (10pts: we could not have come up with this, 0pts: Use only the provided model)
* Substance (10pts: implemented complex state-of-the-art classifier, compared it to a simpler model, 0pts: Only use what is already there)

## <font color='orange'>Mark</font>:  Your solution to Task 2 is marked with ** __ points**.
---

## <font color='orange'>Final mark</font>: Your solution to Assignment 3 is marked with ** __ points**. 