[View in Colaboratory](https://colab.research.google.com/github/cyyeh/2018AI_summer_school/blob/master/MeMN2N_Lab.ipynb)

# End-To-End Memory Networks
## Tutorial in TensorFlow
---
This tutorial introduces an implementation of <a href="https://arxiv.org/pdf/1503.08895.pdf">End-To-End Memory Networks</a>, based on the GitHub repository <a href="https://github.com/domluna/memn2n">memn2n</a>.

We will cover synthetic question answering experiments on the bAbI dataset.

### bAbI Dataset

The <a href="https://arxiv.org/pdf/1502.05698.pdf">bAbI dataset</a> is a set of proxy tasks that evaluate reading comprehension via question answering. The tasks measure understanding in several ways: whether a system can answer questions via chaining facts, simple induction, deduction, and many more.

The following is an example from one of the 20 bAbI tasks:

> 1 Mary moved to the bathroom.  
2 John went to the hallway.  
3 Where is Mary? 	bathroom	1  
4 Daniel went back to the hallway.  
5 Sandra moved to the garden.  
6 Where is Daniel? 	hallway	4  
7 John moved to the office.  
8 Sandra journeyed to the bathroom.  
9 Where is Daniel? 	hallway	4  


### Introduction to End-To-End Memory Networks

![MeMN2N Picture](https://camo.githubusercontent.com/ba1c7dbbccc5dd51d4a76cc6ef849bca65a9bf4d/687474703a2f2f692e696d6775722e636f6d2f6e7638394a4c632e706e67 "MeMN2N")
Illustration of End-To-End Memory Networks Architecture.

#### Model Inputs and Outputs
---
The model takes a discrete set of inputs $x_1, ..., x_n$ that are to be stored in the memory, a query $q$, and outputs an answer $a$. Each of the $x_i$, $q$, and $a$ contains symbols coming from a dictionary with $V$ words. The model writes all $x$ to the memory up to a fixed buffer size, and then finds a continuous representation for the $x$ and $q$. The continuous representation is then processed via multiple hops to output $a$. 

Note that $x_i$, $q$, and $a$ can be seen as vectors of dimension $V$.
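
To make the vector view concrete, here is a minimal sketch of how a sentence can become a vector of dimension $V$. The five-word dictionary, the bag-of-words encoding and the names `toy_vocab`, `toy_index` and `bag_of_words` are illustrative assumptions, not part of the lab code.

```python
import numpy as np

# Hypothetical toy dictionary; the lab builds its own vocabulary from the data.
toy_vocab = ["bathroom", "mary", "moved", "the", "to"]
toy_index = {w: i for i, w in enumerate(toy_vocab)}

def bag_of_words(tokens, index):
    """Return a V-dimensional count vector for a list of tokens."""
    x = np.zeros(len(index), dtype=np.float32)
    for w in tokens:
        x[index[w]] += 1.0
    return x

x_1 = bag_of_words(["mary", "moved", "to", "the", "bathroom"], toy_index)
print(x_1)  # one entry per dictionary word
```
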
#### Input memory representation
---
1. Given $\{x_i\}$ and an *embedding matrix $A$* of size $d\times V$, the **memory vectors $m_i$** of dimension $d$ are derived from $m_i = A x_i$.

2. Given $q$ and an *embedding matrix $B$* of size $d\times V$, the **initial internal state $u$** of dimension $d$ is obtained from $u = B q$.

3. Each component $p_i$ of the **match probability vector $p$** measures the match confidence between $u$ and $m_i$ and is obtained from $p_i = Softmax(u^T m_i)$, where $Softmax(z_i) = e^{z_i} / \sum_j e^{z_j}$.

#### Output memory representation
---
1. Given $\{x_i\}$ and an *embedding matrix $C$* of size $d\times V$, the **output vectors $c_i$** of dimension $d$ are derived from $c_i = C x_i$.

2. The **response vector $o$** is then a sum over the transformed inputs $c_i$, weighted by the probability vector from the input: $o = \sum_i p_i c_i$.

3. The **final prediction $\hat{a} = Softmax(W(o+u))$**.
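
The input and output memory steps can be sketched in plain NumPy as a single hop. This is only an illustration under stated assumptions: the toy sizes, the random weights and the helper names `softmax` and `single_hop` are hypothetical, and the lab itself builds the model in TensorFlow.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def single_hop(x, q, A, B, C, W):
    """One memory hop.

    x: (n, V) input vectors, q: (V,) query vector,
    A, B, C: (d, V) embedding matrices, W: (V, d) answer matrix.
    """
    m = x @ A.T                   # memory vectors m_i = A x_i, shape (n, d)
    u = B @ q                     # internal state u = B q, shape (d,)
    p = softmax(m @ u)            # match probabilities p_i, shape (n,)
    c = x @ C.T                   # output vectors c_i = C x_i, shape (n, d)
    o = p @ c                     # response o = sum_i p_i c_i, shape (d,)
    a_hat = softmax(W @ (o + u))  # prediction over the V words
    return p, a_hat

rng = np.random.default_rng(0)
V, d, n = 6, 4, 3                 # toy sizes chosen only for illustration
x = rng.integers(0, 2, size=(n, V)).astype(np.float64)
q = rng.integers(0, 2, size=V).astype(np.float64)
A, B, C = (rng.normal(scale=0.1, size=(d, V)) for _ in range(3))
W = rng.normal(scale=0.1, size=(V, d))
p, a_hat = single_hop(x, q, A, B, C, W)
print(p.shape, a_hat.shape)
```

Both `p` and `a_hat` are probability distributions: the first over the $n$ memory slots, the second over the $V$ dictionary words.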

#### Multiple Layers
---
The End-To-End Memory Networks can be extended to handle $K$ hop operations.  The memory layers are stacked $K$ times in the following way:

* $u^{k+1} = u^k + o^k$
* Each layer has its own embedding matrices $A^k$, $C^k$, used to embed the inputs $\{x_i\}$
* $\hat{a}= Softmax(W u^{K+1}) = Softmax(W(o^K + u^K))$

Note that the superscript $k$ denotes layer $k$, and $K$ denotes the last layer. In the original paper, there are two types of weight tying within the model:

1. **Adjacent**: the output embedding for one layer is the input embedding for the one above, i.e. $A^{k+1} = C^k$. The authors also constrain (a) the answer prediction matrix to be the same as the final output embedding, i.e. $W^T = C^K$, and (b) the question embedding to match the input embedding of the first layer, i.e. $B = A^1$.
2. **Layer-wise (RNN-like)**: the input and output embeddings are the same across different layers, i.e. $A^1 = A^2 = ... = A^K$ and $C^1 = C^2 = ... = C^K$. The authors found it useful to add a linear mapping $H$ to the update of $u$ between hops; that is, $u^{k+1} = H u^k + o^k$. This mapping is learned along with the rest of the parameters and is used throughout the paper's experiments with layer-wise weight tying.
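
The adjacent scheme can be sketched as a $K$-hop forward pass over a single list of $K + 1$ embedding matrices. This is a NumPy illustration under stated assumptions: the name `multi_hop_adjacent`, the toy sizes and the random weights are hypothetical, and position and temporal encodings are left out.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def multi_hop_adjacent(x, q, embeddings):
    """K-hop forward pass with adjacent weight tying.

    embeddings: list of K + 1 matrices E_0, ..., E_K, each of shape (d, V).
    Tying: B = A^1 = E_0, A^{k+1} = C^k = E_k, and W^T = C^K = E_K.
    """
    K = len(embeddings) - 1
    u = embeddings[0] @ q              # u^1 = B q with B = A^1
    for k in range(K):                 # 0-based loop index, hop number k + 1
        m = x @ embeddings[k].T        # input embedding of hop k + 1
        c = x @ embeddings[k + 1].T    # output embedding of hop k + 1
        p = softmax(m @ u)             # match probabilities over memories
        o = p @ c                      # response vector of this hop
        u = u + o                      # u^{k+1} = u^k + o^k
    W = embeddings[K].T                # W^T = C^K
    return softmax(W @ u)

rng = np.random.default_rng(0)
V, d, n, K = 6, 4, 3, 3                # toy sizes chosen only for illustration
x = rng.integers(0, 2, size=(n, V)).astype(np.float64)
q = rng.integers(0, 2, size=V).astype(np.float64)
embeddings = [rng.normal(scale=0.1, size=(d, V)) for _ in range(K + 1)]
a_hat = multi_hop_adjacent(x, q, embeddings)
print(a_hat.shape)
```

Storing one list of matrices makes the tying explicit: each matrix is the output embedding of one hop and the input embedding of the next.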

#### Lab Experiment Setting
---
During training, all three embedding matrices $A$, $B$ and $C$, as well as $W$, are jointly learned by minimizing a standard cross-entropy loss between $\hat{a}$ and the true label $a$.
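
As a small illustration of this loss, the sketch below compares the cross-entropy of a confident and an unsure prediction against a one-hot label. The numbers and the helper name `cross_entropy` are made up for the example.

```python
import numpy as np

def cross_entropy(a_hat, a_true, eps=1e-12):
    """Cross-entropy between a predicted distribution and a one-hot label."""
    return -np.sum(a_true * np.log(a_hat + eps))

a_true = np.array([0.0, 1.0, 0.0])           # the correct answer is word 1
confident = np.array([0.05, 0.90, 0.05])     # most mass on the correct word
unsure = np.array([0.40, 0.30, 0.30])        # mass spread over all words

loss_confident = cross_entropy(confident, a_true)
loss_unsure = cross_entropy(unsure, a_true)
print(loss_confident, loss_unsure)
```

The loss is smaller when more probability is placed on the correct word, which is what gradient descent on the embeddings and $W$ pushes towards.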

In the following lab experiment, we will use the setting below:

1. All experiments use a $K = 3$ hops model with the **adjacent weight sharing scheme**. For all tasks that output lists (i.e. the answers are multiple words), we take each possible combination of possible outputs and record them as a separate answer vocabulary word.
2. **Position Encoding (PE)**:  
$m_i = \sum_j l_j \cdot A x_{ij}$  
$c_i = \sum_j l_j \cdot Cx_{ij}$  
$u =\sum_j l_j \cdot Bq_j$  
where $\cdot$ is an element-wise multiplication. $l_j$ is a column vector for the $j$th word with the structure $l_{kj} = 1 + 4 (k-(d+1)/2)(j-(J+1)/2)/(Jd)$ (assuming 1-based indexing), where $J$ is the number of words in the sentence and $d$ is the dimension of the embedding.
3. **Temporal Encoding**: To enable the model to capture temporal context, we modify the memory vector so that $m_i = \sum_j A x_{ij} + T_A(i)$, where $T_A(i)$ is the $i$th row of a special matrix $T_A$ that encodes temporal information. The output embedding is augmented in the same way with a matrix $T_C$ (i.e. $c_i = \sum_j C x_{ij} + T_C(i)$). Both $T_A$ and $T_C$ are learned during training. They are also subject to the same sharing constraints as $A$ and $C$. Note that sentences are indexed in reverse order, reflecting their relative distance from the question, so that $x_1$ is the last sentence of the story.
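
As a small worked example of the position weights (an illustration with sizes chosen for convenience, not values used in the lab), take $J = d = 3$. Both centering terms then equal 2, and the weights reduce to $l_{kj} = 1 + \frac{4}{9}(k-2)(j-2)$:

$$
\begin{pmatrix} l_{11} & l_{12} & l_{13} \\ l_{21} & l_{22} & l_{23} \\ l_{31} & l_{32} & l_{33} \end{pmatrix}
=
\begin{pmatrix} 13/9 & 1 & 5/9 \\ 1 & 1 & 1 \\ 5/9 & 1 & 13/9 \end{pmatrix}
$$

The middle word and the middle embedding dimension keep a weight of 1, while the first and last words are scaled in opposite directions across the embedding dimensions. This is what lets the weighted sum depend on word order, unlike a plain bag-of-words sum.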

### Download bAbI Dataset

Run the code cell below to download and extract the bAbI dataset.

In [0]:
!nvidia-smi
!wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz
!tar xzf ./tasks_1-20_v1-2.tar.gz

### Library and Parameter Setting
The libraries are imported and all hyper-parameters are defined here. In the rest of this lab, we will choose one of the 20 bAbI tasks to demonstrate the network implementation and training process.

In [0]:
import tensorflow as tf
import numpy as np

from sklearn import model_selection, metrics
from itertools import chain
from six.moves import range, reduce
import re
import os

class Parameter(object):
    learning_rate = 0.01     # Learning rate for SGD.
    anneal_rate = 25         # Number of epochs between halving the learning rate.
    anneal_stop_epoch = 100  # Epoch number to end annealed lr schedule.
    max_grad_norm = 40.0     # Clip gradients to this norm.
    evaluation_interval = 10 # Evaluate and print results every x epochs.
    batch_size = 32          # Batch size for training.
    hops = 3                 # Number of hops in the Memory Network.
    epochs = 100             # Number of epochs to train for.
    embedding_size = 20      # Embedding size for embedding matrices.
    memory_size = 50         # Maximum size of memory.
    task_id = 1              # bAbI task id, 1 <= id <= 20
    random_state = None      # Random state.
    name = 'MemN2N'          # Model name.
    data_dir = './tasks_1-20_v1-2/en/' # Directory containing bAbI tasks

param = Parameter()

print("Started Task:", param.task_id)

### Load Dataset
1. ```load_task``` loads the train and test data for the ```task_id```-th bAbI task.
2. ```parse_stories``` parses content such as:  
`1 Mary moved to the bathroom.`  
`2 John went to the hallway.`  
`3 Where is Mary? 	bathroom	1`  
`4 Daniel went back to the hallway.`  
`5 Sandra moved to the garden.`  
`6 Where is Daniel? 	hallway	4`  
`7 John moved to the office.`  
`8 Sandra journeyed to the bathroom.`  
`9 Where is Daniel? 	hallway	4`  
A complete story, together with its queries and answers, runs from a line numbered 1 up to the line before the next line numbered 1. Each query line carries its answer and the id of the supporting fact within the current story. The output of ```parse_stories``` is ```[(substory, q, a),...]```.
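
As a minimal illustration of the line format (a standalone sketch, not the lab's parser), a question line can be split into its line id, question, answer and supporting fact ids as follows:

```python
# A question line has three tab-separated fields: question, answer, supporting ids.
line = "3 Where is Mary? \tbathroom\t1\n"

nid, rest = line.split(" ", 1)                      # line id within the story
question, answer, supporting = rest.split("\t")     # the tab marks a question line
supporting_ids = [int(i) for i in supporting.split()]

print(int(nid), question.strip(), answer, supporting_ids)
```

Regular sentences contain no tab character, which is how the parser tells statements and questions apart.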

In [0]:
def load_task(data_dir, task_id, only_supporting=False):
    '''Load the nth task. There are 20 tasks in total.
    Returns a tuple containing the training and testing data for the task.
    '''
    assert task_id > 0 and task_id < 21

    files = [os.path.join(data_dir, f) for f in os.listdir(data_dir)]
    s = 'qa{}_'.format(task_id)
    train_file = [f for f in files if s in f and 'train' in f][0]
    test_file = [f for f in files if s in f and 'test' in f][0]
    train_data = get_stories(train_file, only_supporting)
    test_data = get_stories(test_file, only_supporting)
    return train_data, test_data

def get_stories(f, only_supporting=False):
    '''Given a file name, read the file and parse its lines into a list of
    (substory, query, answer) tuples, one per question.
    '''
    with open(f) as f:
        return parse_stories(f.readlines(), only_supporting=only_supporting)

def parse_stories(lines, only_supporting=False):
    '''Parse stories provided in the bAbI tasks format
    If only_supporting is true, only the sentences that support the answer are kept.
    '''
    data = []
    story = []
    for line in lines:
        line = str.lower(line)
        nid, line = line.split(' ', 1)
        nid = int(nid)
        if nid == 1:
            story = []
        if '\t' in line: # question
            q, a, supporting = line.split('\t')
            q = tokenize(q)
            #a = tokenize(a)
            # answer is one vocab word even if it's actually multiple words
            a = [a]
            substory = None

            # remove question marks
            if q[-1] == "?":
                q = q[:-1]

            if only_supporting:
                # Only select the related substory
                supporting = map(int, supporting.split())
                substory = [story[i - 1] for i in supporting]
            else:
                # Provide all the substories
                substory = [x for x in story if x]

            data.append((substory, q, a))
            story.append('')
        else: # regular sentence
            # remove periods
            sent = tokenize(line)
            if sent[-1] == ".":
                sent = sent[:-1]
            story.append(sent)
    return data
    
def tokenize(sent):
    '''Return the word tokens of a sentence; punctuation is dropped.
    >>> tokenize('Bob dropped the apple. Where is the apple?')
    ['Bob', 'dropped', 'the', 'apple', 'Where', 'is', 'the', 'apple']
    '''
    return [x.strip() for x in re.split(r'\W+', sent) if x.strip()]

# task data
train, test = load_task(param.data_dir, param.task_id)

### Preprocessing
1. Create vocabulary.
2. Perform statistical analysis on data.


In [0]:
data = train + test

vocab = sorted(reduce(lambda x, y: x | y, (set(list(chain.from_iterable(s)) + q + a) for s, q, a in data)))
word_idx = dict((c, i + 1) for i, c in enumerate(vocab))

max_story_size = max(map(len, (s for s, _, _ in data)))
mean_story_size = int(np.mean([ len(s) for s, _, _ in data ]))
sentence_size = max(map(len, chain.from_iterable(s for s, _, _ in data)))
query_size = max(map(len, (q for _, q, _ in data)))
memory_size = min(param.memory_size, max_story_size)

Note that in this implementation, we append the **temporal encoding matrix** to the **embedding matrix**. For that reason, we add 'time' words to the vocabulary.

In [0]:
print("Vocabulary size before adding time words", len(word_idx))
# Add time words/indexes
for i in range(memory_size):
    word_idx['time{}'.format(i+1)] = 'time{}'.format(i+1)

vocab_size = len(word_idx) + 1 # +1 for nil word
sentence_size = max(query_size, sentence_size) # for the position encoding
sentence_size += 1  # +1 for time words

print("Vocabulary size after adding time words", len(word_idx))
print("Longest sentence length", sentence_size)
print("Longest story length", max_story_size)
print("Average story length", mean_story_size)

### Encode Data
```vectorize_data``` encodes stories, queries and answers.   
For stories, we encode each list of words as an array of indices, pad it to the sentence size, and put the 'time' word index in the last position of each sentence. We then pad each story with all-zero sentences up to the memory size.

For queries, we also encode and pad lists of words into index arrays. Answers are encoded as one-hot vectors.

Stories' shape = ```(num_queries, memory_size, sentence_size)```  
Queries' shape = ```(num_queries, sentence_size)```  
Answers' shape = ```(num_queries, vocab_size)```

On lines 23-24 of the cell below, the time indices of the sentences are assigned in reverse order, which corresponds to the **temporal encoding** setting.
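
The padding and time-word indexing can be illustrated on a toy story. This is a sketch under stated assumptions: the four-word vocabulary, the sizes and the `toy_` names are hypothetical, and the time indices are assumed to follow the ordinary word indices, as in the vocabulary built above.

```python
import numpy as np

# Hypothetical toy setup: 4 ordinary words, followed by the time words.
toy_word_idx = {"john": 1, "mary": 2, "moved": 3, "went": 4}
n_words, toy_memory_size, toy_sentence_size = 4, 3, 3  # last slot holds the time word

story = [["mary", "moved"], ["john", "went"]]          # oldest sentence first

rows = []
for sentence in story:
    ids = [toy_word_idx[w] for w in sentence]
    rows.append(ids + [0] * (toy_sentence_size - len(ids)))   # pad with nil (0)

# Time words occupy indices n_words + 1, n_words + 2, ...; the most recent
# sentence gets the smallest time index, mirroring the reversed ordering.
for i in range(len(rows)):
    rows[i][-1] = n_words + len(rows) - i

# Pad the story with empty memories up to toy_memory_size.
rows += [[0] * toy_sentence_size] * (toy_memory_size - len(rows))
encoded = np.array(rows)
print(encoded)
```

The resulting array has shape `(toy_memory_size, toy_sentence_size)`, matching one entry of the stories array described above.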

In [0]:
def vectorize_data(data, word_idx, sentence_size, memory_size):
    """
    Vectorize stories and queries.
    If a sentence length < sentence_size, the sentence will be padded with 0's.
    If a story length < memory_size, the story will be padded with empty memories.
    Empty memories are 1-D arrays of length sentence_size filled with 0's.
    The answer array is returned as a one-hot encoding.
    """
    S = []
    Q = []
    A = []
    for story, query, answer in data:
        ss = []
        for i, sentence in enumerate(story, 1):
            ls = max(0, sentence_size - len(sentence))
            ss.append([word_idx[w] for w in sentence] + [0] * ls)

        # take only the most recent sentences that fit in memory
        ss = ss[::-1][:memory_size][::-1]

        # Make the last word of each sentence the time 'word' which 
        # corresponds to vector of lookup table
        for i in range(len(ss)):
            ss[i][-1] = len(word_idx) - memory_size - i + len(ss)

        # pad to memory_size
        lm = max(0, memory_size - len(ss))
        for _ in range(lm):
            ss.append([0] * sentence_size)

        lq = max(0, sentence_size - len(query))
        q = [word_idx[w] for w in query] + [0] * lq

        y = np.zeros(len(word_idx) + 1) # 0 is reserved for nil word
        for a in answer:
            y[word_idx[a]] = 1

        S.append(ss)
        Q.append(q)
        A.append(y)
    return np.array(S), np.array(Q), np.array(A)


# train/validation/test sets
S, Q, A = vectorize_data(train, word_idx, sentence_size, memory_size)

### Data Split
We split the data into train, validation and test sets. Note that the train and validation sets are both split from the data in the train file: the train set contains 90% of that data and the validation set contains the remaining 10%.

In [0]:
trainS, valS, trainQ, valQ, trainA, valA = model_selection.train_test_split(S, Q, A, test_size=.1, random_state=param.random_state)
testS, testQ, testA = vectorize_data(test, word_idx, sentence_size, memory_size)

print(testS[0])

print("Training set shape", trainS.shape)

# params
n_train = trainS.shape[0]
n_test = testS.shape[0]
n_val = valS.shape[0]

print("Training Size", n_train)
print("Validation Size", n_val)
print("Testing Size", n_test)

train_labels = np.argmax(trainA, axis=1)
test_labels = np.argmax(testA, axis=1)
val_labels = np.argmax(valA, axis=1)

### Create Model
---
#### Build Inputs
The model inputs include stories, queries, answers, and learning rate. From **Encode Data** section, we know that:  
Stories' shape = ```(num_queries, memory_size, sentence_size)```  
Queries' shape = ```(num_queries, sentence_size)```  
Answers' shape = ```(num_queries, vocab_size)```

In this section, we prepare our inputs with **tf.placeholder**. Be aware that the input data contains only vocabulary indices, so the data type should be integer.

In [0]:
tf.reset_default_graph()
tf.set_random_seed(param.random_state)

_stories = 'Your code here'
_queries = 'Your code here'
_answers = 'Your code here'
_lr = tf.placeholder(tf.float32, [], name="learning_rate")

#### Build Variables
The model has weight variables, and we will define them here. Recall that the input and output memory representations use the embedding matrices $A$, $B$, $C$, and the answer matrix $W$. Since we are using the **adjacent weight sharing scheme**, they should satisfy $A^{k+1} = C^k$, $W^T = C^K$, and $B = A^1$.

In [0]:
#################################################################################
# TODO:                                                                         #
# Create weight variables for model. Please follow the description above.       #
#################################################################################
initializer=tf.random_normal_initializer(stddev=0.1)
with tf.variable_scope(param.name):
    nil_word_slot = tf.zeros([1, param.embedding_size])
    # hint: create temporary tensor. nil word should be at index 0.
    #       you can use tf.concat to concat a randomly initialized weight matrix with the nil word vector.
    A = 'Your code here'
    _C = 'Your code here'
    
    # hint: use A to create variable
    A_1 = 'Your code here'

    C = []

    for hopn in range(param.hops):
        with tf.variable_scope('hop_{}'.format(hopn)):
            # hint: use _C to create variable
            C.append('Your code here')

_nil_vars = set([A_1.name] + [x.name for x in C])

#### Create Pipeline
In this section, we will build up our model. Note that we have to take the **adjacent weight sharing scheme** into account.

1. Use  **Position Encoding (PE)** to encode sentences.   
>**Position Encoding (PE)**:  
$m_i = \sum_j l_j \cdot A x_{ij}$  
$c_i = \sum_j l_j \cdot Cx_{ij}$  
$u =\sum_j l_j \cdot Bq_j$  
where $\cdot$ is an element-wise multiplication. $l_j$ is a column vector for the $j$th word with the structure  
$l_{kj} = 1 + 4 (k-(d+1)/2)(j-(J+1)/2)/(Jd)$ (assuming 1-based indexing),  
where $J$ is the number of words in the sentence and $d$ is the dimension of the embedding.

2. Obtain **initial internal state $u = \sum_j l_j \cdot Bq_j$**.
3. Obtain **memory vector $m_i = \sum_j l_j \cdot A x_{ij}$**.
4. Obtain **output vector $c_i = \sum_j l_j \cdot Cx_{ij}$**.
5. Obtain **response vector $o = \sum_i p_i c_i$**, where $p_i = Softmax(u^T m_i)$.
6. Repeat steps 3-5 $K$ times, updating $u^{k+1} = u^k + o^k$ after each hop.
7. Obtain **final prediction $\hat{a} = Softmax(W(o+u))$**.
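
In a batched implementation, the products $u^T m_i$ and the weighted sum $\sum_i p_i c_i$ are computed for all memories at once. The NumPy sketch below shows one way to do this with broadcasting and checks it against an explicit `einsum`. The toy shapes and random values are assumptions for illustration, and this is not the TensorFlow code expected in the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, memory, d = 2, 5, 4                 # toy sizes for illustration
m = rng.normal(size=(batch, memory, d))    # memory vectors
c = rng.normal(size=(batch, memory, d))    # output vectors
u = rng.normal(size=(batch, d))            # internal state

# Broadcast u against every memory slot, then sum over the embedding axis.
u_temp = np.transpose(np.expand_dims(u, -1), (0, 2, 1))   # (batch, 1, d)
dotted = np.sum(m * u_temp, axis=2)                        # (batch, memory)

# The same scores written as an explicit batched dot product.
dotted_einsum = np.einsum("bmd,bd->bm", m, u)

# Softmax over the memory axis, then the weighted sum o = sum_i p_i c_i.
e = np.exp(dotted - dotted.max(axis=1, keepdims=True))
p = e / e.sum(axis=1, keepdims=True)
o = np.einsum("bm,bmd->bd", p, c)
print(dotted.shape, o.shape)
```
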


![MeMN2N Picture](https://camo.githubusercontent.com/ba1c7dbbccc5dd51d4a76cc6ef849bca65a9bf4d/687474703a2f2f692e696d6775722e636f6d2f6e7638394a4c632e706e67 "MeMN2N")

In [0]:
def position_encoding(sentence_size, embedding_size):
    """
    Position Encoding described in section 4.1 [1]
    """
    #################################################################################
    # TODO:                                                                         #
    # Implementation of position encoding. Return the position encoding numpy array #
    # with size (sentence_size, embedding_size).                                    #
    #################################################################################
    encoding = np.ones('Your code here', dtype=np.float32)
    ls = sentence_size+1
    le = embedding_size+1
    for i in range(1, le):
        for j in range(1, ls):
            encoding[i-1, j-1] = 'Your code here'
    encoding = 1 + 4 * 'Your code here' / 'Your code here' / 'Your code here'
    # Make position encoding of time words identity to avoid modifying them 
    encoding[:, -1] = 1.0
    return np.transpose(encoding)

_PE = tf.constant(position_encoding(sentence_size, param.embedding_size), name="encoding")

with tf.variable_scope(param.name):
    # Use A_1 for the question embedding as per Adjacent Weight Sharing
    # Use tf.nn.embedding_lookup to retrieve embedding vector
    q_emb = 'Your code here'
    # Remember to multiply the embedding by the position encoding and 
    # sum the embedding vectors over the sentence dimension
    u_0 = 'Your code here'
    u = [u_0]

    for hopn in range(param.hops):
        if hopn == 0:
            m_emb_A = 'Your code here'
            m_A = 'Your code here'

        else:
            with tf.variable_scope('hop_{}'.format(hopn - 1)):
                m_emb_A = 'Your code here'
                m_A = 'Your code here'

        # hack to get around no reduce_dot
        u_temp = tf.transpose(tf.expand_dims(u[-1], -1), [0, 2, 1])
        dotted = tf.reduce_sum(m_A * u_temp, 2)

        # Calculate probabilities
        probs = 'Your code here'

        probs_temp = tf.transpose(tf.expand_dims(probs, -1), [0, 2, 1])
        with tf.variable_scope('hop_{}'.format(hopn)):
            m_emb_C = 'Your code here'
        m_C = 'Your code here'

        c_temp = tf.transpose(m_C, [0, 2, 1])
        o_k = 'Your code here'

        u_k = u[-1] + o_k

        u.append(u_k)

    # Use last C for output (transposed)
    with tf.variable_scope('hop_{}'.format(param.hops)):
        logits = 'Your code here'

### Prepare Training
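
The gradient pipeline in the cell below clips each gradient by its norm, adds a small amount of Gaussian noise, and zeroes the row that belongs to the nil word. The NumPy sketch that follows mirrors those three steps on a toy gradient. The helper names `clip_by_norm` and `zero_first_row` and the toy values are illustrative assumptions, not the TensorFlow ops themselves.

```python
import numpy as np

def clip_by_norm(g, clip_norm):
    """Rescale g so that its L2 norm is at most clip_norm."""
    norm = np.sqrt(np.sum(g ** 2))
    return g * clip_norm / max(norm, clip_norm)

def zero_first_row(g):
    """Zero the gradient row of the nil word (index 0) so it is never updated."""
    g = g.copy()
    g[0, :] = 0.0
    return g

rng = np.random.default_rng(0)
grad = rng.normal(scale=100.0, size=(5, 3))    # deliberately large toy gradient
clipped = clip_by_norm(grad, 40.0)             # norm is now at most 40
noisy = clipped + rng.normal(scale=1e-3, size=clipped.shape)
final = zero_first_row(noisy)
print(np.linalg.norm(clipped), final[0])
```

Zeroing the nil row after the noise step matters: otherwise the noise alone would move the padding embedding away from zero.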


In [0]:
def zero_nil_slot(t, name=None):
    """
    Overwrites the nil_slot (first row) of the input Tensor with zeros.
    The nil_slot is a dummy slot and should not be trained and influence
    the training algorithm.
    """
    with tf.name_scope("zero_nil_slot", name, [t]) as name:
        t = tf.convert_to_tensor(t, name="t")
        s = tf.shape(t)[1]
        z = tf.zeros(tf.stack([1, s]))
        return tf.concat(axis=0, values=[z, tf.slice(t, [1, 0], [-1, -1])], name=name)

def add_gradient_noise(t, stddev=1e-3, name=None):
    """
    Adds gradient noise as described in http://arxiv.org/abs/1511.06807 [2].
    The input Tensor `t` should be a gradient.
    The output will be `t` + gaussian noise.
    0.001 was said to be a good fixed value for memory networks [2].
    """
    with tf.name_scope("add_gradient_noise", name, [t, stddev]) as name:
        t = tf.convert_to_tensor(t, name="t")
        gn = tf.random_normal(tf.shape(t), stddev=stddev)
        return tf.add(t, gn, name=name)

# optimizer
opt = tf.train.GradientDescentOptimizer(learning_rate=_lr)
# cross entropy
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=tf.cast(_answers, tf.float32), name="cross_entropy")
# loss
loss = tf.reduce_sum(cross_entropy, name="cross_entropy_sum")
# gradient pipeline
grads_and_vars = opt.compute_gradients(loss)
grads_and_vars = [(tf.clip_by_norm(g, param.max_grad_norm) if g is not None else g, v) for g,v in grads_and_vars]
grads_and_vars = [(add_gradient_noise(g) if g is not None else g, v) for g,v in grads_and_vars]
nil_grads_and_vars = []
for g, v in grads_and_vars:
    if v.name in _nil_vars:
        nil_grads_and_vars.append((zero_nil_slot(g), v))
    else:
        nil_grads_and_vars.append((g, v))
train_op = opt.apply_gradients(nil_grads_and_vars, name="train_op")

predict_op = tf.argmax(logits, 1, name="predict_op")
predict_proba_op = tf.nn.softmax(logits, name="predict_proba_op")
predict_log_proba_op = tf.log(predict_proba_op, name="predict_log_proba_op")

batch_size = param.batch_size

batches = zip(range(0, n_train-batch_size, batch_size), range(batch_size, n_train, batch_size))
batches = [(start, end) for start, end in batches]

### Start Training and Evaluation
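The stepped learning rate used in the training loop halves the rate every `anneal_rate` epochs until `anneal_stop_epoch` is reached. A minimal standalone sketch of that schedule, using the default values from `Parameter` (the function name `annealed_lr` is hypothetical):

```python
def annealed_lr(epoch, base_lr=0.01, anneal_rate=25, anneal_stop_epoch=100):
    """Learning rate for a 1-based epoch, halved every anneal_rate epochs."""
    steps = min(epoch - 1, anneal_stop_epoch) // anneal_rate
    return base_lr / (2.0 ** steps)

# Print the rate at the first epoch of each annealing stage.
for epoch in (1, 26, 51, 76):
    print(epoch, annealed_lr(epoch))
```

With these defaults, the rate stays at its base value for epochs 1 to 25 and is halved at epochs 26, 51 and 76.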

In [0]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
#     model = MemN2N(batch_size, vocab_size, sentence_size, memory_size, param.embedding_size, session=sess,
#                    hops=param.hops, max_grad_norm=param.max_grad_norm)
    for t in range(1, param.epochs+1):
        # Stepped learning rate
        if t - 1 <= param.anneal_stop_epoch:
            anneal = 2.0 ** ((t - 1) // param.anneal_rate)
        else:
            anneal = 2.0 ** (param.anneal_stop_epoch // param.anneal_rate)
        lr = param.learning_rate / anneal

        np.random.shuffle(batches)
        total_cost = 0.0
        for start, end in batches:
            s = trainS[start:end]
            q = trainQ[start:end]
            a = trainA[start:end]
            cost_t, _ = sess.run([loss, train_op], 
                                 feed_dict={_stories: s, _queries: q, _answers: a, _lr: lr})
#             cost_t = model.batch_fit(s, q, a, lr)
            total_cost += cost_t

        if t % param.evaluation_interval == 0:
            train_preds = []
            for start in range(0, n_train, batch_size):
                end = start + batch_size
                s = trainS[start:end]
                q = trainQ[start:end]
#                 pred = model.predict(s, q)
                pred = sess.run(predict_op, feed_dict={_stories: s, _queries: q})
                train_preds += list(pred)
            
            val_preds = sess.run(predict_op, feed_dict={_stories: valS, _queries: valQ})
#             val_preds = model.predict(valS, valQ)
            train_acc = metrics.accuracy_score(np.array(train_preds), train_labels)
            val_acc = metrics.accuracy_score(val_preds, val_labels)

            print('-----------------------')
            print('Epoch', t)
            print('Total Cost:', total_cost)
            print('Training Accuracy:', train_acc)
            print('Validation Accuracy:', val_acc)
            print('-----------------------')
    test_preds = sess.run(predict_op, feed_dict={_stories: testS, _queries: testQ})
#     test_preds = model.predict(testS, testQ)
    test_acc = metrics.accuracy_score(test_preds, test_labels)
    print("Testing Accuracy:", test_acc)

### Homework

1. In the original paper, there is a training scheme called linear start (LS) training:  
The softmax in each memory layer is removed, making the model entirely linear except for the final softmax for answer prediction. When the validation loss stops decreasing, the softmax layers are re-inserted and training resumes.  
Please implement this training scheme.   
(hint: put tf.nn.softmax and tf.identity in tf.cond with a boolean scalar. Once the validation loss stops decreasing, negate the boolean scalar.)
2. Implement the layer-wise (RNN-like) weight tying scheme. Compare the performance of the adjacent weight sharing scheme and the layer-wise (RNN-like) weight tying scheme.
3. Use bag-of-words instead of position encoding for sentence representations. Compare the performance of bag-of-words and position encoding.