# Introduction

<center><h3>**Welcome to the Summarization Notebook.**</h3></center>

In this assignment, you are going to train a neural network to summarize news articles.
Your neural network is going to learn from example, as we provide you with (article, summary) pairs.
We provide you with a **toy dataset** made only of articles about police-related news.
Typical datasets can be 20x larger, but we have reduced this one to keep the computation manageable.

You will do this using a Transformer network, from the __[Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)__ paper.
In this assignment you will:
- Learn to process text into sub-word tokens, to avoid fixed vocabulary sizes, and UNK tokens.
- Implement the key conceptual blocks of a Transformer.
- Use a Transformer to read a news article, and produce a summary.
- Perform operations on learned word-vectors to examine what the model has learned.

    
**Before you start**

You should read the Attention is all you need paper.
We are providing you with skeleton code for the Transformer, but you will have to implement 5 conceptual blocks of the Transformer yourself:
- AttentionQKV: the Query, Key, Value attention mechanism at the center of the Transformer.
- MultiHeadAttention: the multiple heads that enable each input to attend to many places at once.
- PositionEmbedding: the sinusoid-based position embedding of the Transformer.
- Encoder & Decoder: the encoder (which reads inputs, such as news articles) and the decoder (which produces the output summary, one token at a time).
- Full Transformer: piecing it all together.

You should get the dataset from Google Drive, as instructed in the README of the project.

All dataset files should be placed in the `dataset/` folder of this assignment.

If you are using Google Colab, follow the instructions to mount your Google Drive onto the remote machine.

# Library imports

In [1]:
from transformer import Transformer
import sentencepiece as spm
import tensorflow as tf
import numpy as np
import json
import capita
%load_ext autoreload
%autoreload 2

root_folder = ""

In [2]:
# Load the word piece model that will be used to tokenize the texts into
# word pieces with a vocabulary size of 10000

sp = spm.SentencePieceProcessor()
sp.Load(root_folder+"dataset/wp_vocab10000.model")

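# Read the vocabulary file (one "piece<TAB>score" entry per line) and close it afterwards
with open(root_folder+"dataset/wp_vocab10000.vocab", "r") as f:
    vocab = [line.split('\t')[0] for line in f]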
pad_index = vocab.index('#')

def pad_sequence(numerized, pad_index, to_length):
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    mask = [w != pad_index for w in padded]
    return padded, mask
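
As a quick illustration of what `pad_sequence` returns, here is a small standalone sketch. The token ids and the pad index `0` are made up for the example; in the notebook the real pad index comes from `vocab.index('#')`.

```python
def pad_sequence(numerized, pad_index, to_length):
    # Truncate to at most `to_length` tokens, then right-pad with `pad_index`.
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    # True marks a real token, False marks a padding position.
    mask = [w != pad_index for w in padded]
    return padded, mask

# Hypothetical token ids; 0 stands in for the real pad index.
short_padded, short_mask = pad_sequence([5, 8, 13], pad_index=0, to_length=5)
long_padded, long_mask = pad_sequence([5, 8, 13, 21, 34, 55], pad_index=0, to_length=5)

print(short_padded, short_mask)  # [5, 8, 13, 0, 0] [True, True, True, False, False]
print(long_padded, long_mask)    # [5, 8, 13, 21, 34] [True, True, True, True, True]
```

Note that the mask is computed by comparing against `pad_index`, so a real token that happens to equal the pad index would also be masked out.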

# Building blocks of a Transformer


**TODO**:

Implement the 5 blocks of the Transformer. To finish this section, you should get a very small error (<1e-7) on each of the 5 checks in this section.


The Transformer is split into 3 files: `transformer_attention.py`, `transformer_layers.py` and `transformer.py`.

Each section below gives you directions and a way to verify your code works properly.

You do not need to modify the rest of the code provided, but you should read it to understand the overall architecture.

Our Transformer is built as a Keras model, a standard that is good for you to get accustomed to.



## (1) Implementing the Query-Key-Value Attention (AttentionQKV)

This part is located in AttentionQKV in transformer_attention.py. You must implement the call function of the class.
You will need to implement the mathematical procedure of AttentionQKV that is described in the [Attention is all you need paper](https://arxiv.org/pdf/1706.03762.pdf).
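
For reference, the formula from the paper is $\mathrm{softmax}(QK^T / \sqrt{d_k})\,V$. The following is a minimal NumPy sketch of that formula, not the TensorFlow code expected in `transformer_attention.py`. The function name `attention_qkv` and the optional boolean `mask` argument are assumptions made for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_qkv(queries, keys, values, mask=None):
    # queries: (..., n_queries, depth_k), keys: (..., n_keyval, depth_k),
    # values: (..., n_keyval, depth_v)
    depth_k = queries.shape[-1]
    # Scaled dot-product similarity, shape (..., n_queries, n_keyval).
    logits = np.matmul(queries, np.swapaxes(keys, -1, -2)) / np.sqrt(depth_k)
    if mask is not None:
        # Positions where mask is False get a very negative logit (assumed convention).
        logits = np.where(mask, logits, -1e9)
    weights = softmax(logits, axis=-1)
    # Weighted sum of the values, shape (..., n_queries, depth_v).
    output = np.matmul(weights, values)
    return output, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 2))
k = rng.normal(size=(2, 5, 2))
v = rng.normal(size=(2, 5, 2))
out, w = attention_qkv(q, k, v)
print(out.shape, w.shape)  # (2, 3, 2) (2, 3, 5)
```

Each row of `w` sums to 1, since it is a softmax over the keys for one query.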

In [3]:
from transformer_attention import AttentionQKV

batch_size = 2
n_queries = 3
n_keyval = 5
depth_k = 2
depth_v = 2

with open(root_folder+"transformer_checks/attention_qkv_io.json", "r") as f:
    io = json.load(f)
    queries = np.array(io['queries'])
    keys = np.array(io['keys'])
    values = np.array(io['values'])
    expected_output  = np.array(io['output'])
    expected_weights = np.array(io['weights'])

attn_qkv = AttentionQKV()
queries_h = tf.placeholder(tf.float32,shape=(None, n_queries, depth_k), name="queries")
keys_h = tf.placeholder(tf.float32,shape=(None, n_keyval, depth_k), name="keys")
values_h = tf.placeholder(tf.float32,shape=(None, n_keyval, depth_v), name="values")

attn_output, attn_weights = attn_qkv(queries_h, keys_h, values_h)

with tf.Session() as sess:
    output, weights = sess.run([attn_output, attn_weights], feed_dict={queries_h: queries, keys_h: keys, values_h: values})

print("Total error on the output:",np.sum(np.abs(expected_output-output)), "(should be 0.0 or close to 0.0)")
print("Total error on the weights:",np.sum(np.abs(expected_weights-weights)), "(should be 0.0 or close to 0.0)")

Total error on the output: 1.9371509552001953e-07 (should be 0.0 or close to 0.0)
Total error on the weights: 1.2945383787155151e-07 (should be 0.0 or close to 0.0)


## (2) Implementing Multi-head attention

This part is located in the class MultiHeadProjection in transformer_attention.py.
You must implement the call, \_split_heads, and \_combine_heads functions.

**Procedure**

The objective is to leverage the AttentionQKV class you already wrote.

Your inputs are the queries, keys and values, as 3-d tensors (batch_size, sequence_length, feature_size).

Split them into 4-d tensors (batch_size, n_heads, sequence_length, new_feature_size), where:
$$feature\_size = n\_heads \times new\_feature\_size.$$
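
The split described above, and the inverse step that merges the heads back, can be sketched in NumPy as follows. This is only an illustration of the reshape and transpose logic. The helper names `split_heads` and `combine_heads` are chosen for this example, and the TensorFlow methods in the skeleton may be organized differently.

```python
import numpy as np

def split_heads(x, n_heads):
    # (batch_size, sequence_length, feature_size)
    #   -> (batch_size, n_heads, sequence_length, new_feature_size)
    batch_size, seq_len, feature_size = x.shape
    assert feature_size % n_heads == 0, "feature_size must be divisible by n_heads"
    new_feature_size = feature_size // n_heads
    x = x.reshape(batch_size, seq_len, n_heads, new_feature_size)
    return x.transpose(0, 2, 1, 3)

def combine_heads(x):
    # (batch_size, n_heads, sequence_length, new_feature_size)
    #   -> (batch_size, sequence_length, n_heads * new_feature_size)
    batch_size, n_heads, seq_len, new_feature_size = x.shape
    x = x.transpose(0, 2, 1, 3)
    return x.reshape(batch_size, seq_len, n_heads * new_feature_size)

x = np.arange(2 * 3 * 8, dtype=np.float32).reshape(2, 3, 8)
heads = split_heads(x, n_heads=4)
restored = combine_heads(heads)
print(heads.shape)     # (2, 4, 3, 2)
print(restored.shape)  # (2, 3, 8)
```

Combining is the exact inverse of splitting, so `restored` equals `x`.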

You can then feed the split qkv to your implemented AttentionQKV, which will treat each head as an independent attention function.

Then the output must be combined back into a 3-d tensor.
You can test the validity of your implementation in the cell below.

In [4]:
from transformer_attention import MultiHeadProjection
tf.reset_default_graph()

batch_size = 2
n_queries = 3
n_heads = 4
n_keyval = 5
depth_k = 8
depth_v = 8

with open(root_folder+"transformer_checks/multihead_io.json", "r") as f:
    io = json.load(f)
    queries = np.array(io['queries'])
    keys = np.array(io['keys'])
    values = np.array(io['values'])
    expected_output  = np.array(io['output'])


attn_qkv = MultiHeadProjection(n_heads)
queries_h = tf.placeholder(tf.float32,shape=(None, n_queries, depth_k), name="queries")
keys_h = tf.placeholder(tf.float32,shape=(None, n_keyval, depth_k), name="keys")
values_h = tf.placeholder(tf.float32,shape=(None, n_keyval, depth_v), name="values")

multihead_output = attn_qkv((queries_h, keys_h, values_h))

with tf.Session() as sess:
    output = sess.run(multihead_output, feed_dict={queries_h: queries, keys_h: keys, values_h: values})


print("Total error on the output:",np.sum(np.abs(expected_output-output)), "(should be 0.0 or close to 0.0)")

Total error on the output: 8.437782526016235e-07 (should be 0.0 or close to 0.0)


## (3) Position Embedding 

You must implement the PositionEmbedding class in transformer.py.
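
As a reminder, the paper defines the encoding as $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$. The NumPy sketch below follows that formula literally, interleaving sines and cosines. This is an assumption: the skeleton and the check file may expect a different layout (for example, all sines followed by all cosines), so treat it as an illustration rather than the expected solution.

```python
import numpy as np

def sinusoid_position_encoding(sequence_length, dim):
    # Returns an array of shape (sequence_length, dim); dim is assumed to be even.
    assert dim % 2 == 0
    positions = np.arange(sequence_length)[:, None]           # (L, 1)
    even_dims = np.arange(0, dim, 2)[None, :]                  # (1, dim / 2)
    angles = positions / np.power(10000.0, even_dims / dim)    # (L, dim / 2)
    encoding = np.zeros((sequence_length, dim))
    encoding[:, 0::2] = np.sin(angles)  # even feature indices
    encoding[:, 1::2] = np.cos(angles)  # odd feature indices
    return encoding

def add_position_embedding(inputs):
    # inputs: (batch_size, sequence_length, dim); the encoding is broadcast over the batch.
    _, sequence_length, dim = inputs.shape
    return inputs + sinusoid_position_encoding(sequence_length, dim)[None, :, :]

inputs = np.zeros((2, 3, 4))
output = add_position_embedding(inputs)
print(output.shape)  # (2, 3, 4)
print(output[0, 0])  # [0. 1. 0. 1.]
```

Position 0 always maps to alternating 0 and 1 in this layout, because sin(0) = 0 and cos(0) = 1.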


The cell below helps you verify the validity of your implementation.


In [5]:
from transformer import PositionEmbedding

batch_size = 2
sequence_length = 3
dim = 4

with open(root_folder+"transformer_checks/position_embedding_io.json", "r") as f:
    io = json.load(f)
    inputs = np.array(io['inputs'])
    expected_output  = np.array(io['output'])

inputs_h = tf.placeholder(tf.float32,shape=(None, sequence_length, dim), name="inputs")
pos_emb = PositionEmbedding()
output_t = pos_emb(inputs_h)

with tf.Session() as sess:
    output = sess.run(output_t, feed_dict={inputs_h: inputs})

print("Total error on the output:",np.sum(np.abs(expected_output-output)), "(should be 0.0 or close to 0.0)")

Total error on the output: 1.1920928955078125e-07 (should be 0.0 or close to 0.0)


## (4) Transformer Encoder / Transformer Decoder

You now have all the blocks needed to implement the Transformer.
For this part, you have to fill in 2 classes in the transformer.py file: TransformerEncoderBlock, TransformerDecoderBlock.
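
One detail from the paper that separates the decoder from the encoder is that decoder self-attention must not look at future positions, since the summary is produced one token at a time. Below is a minimal NumPy sketch of such a causal mask and its effect on the attention weights. The helper names are made up for this example and are not part of the skeleton, which may build or apply its mask differently.

```python
import numpy as np

def causal_mask(length):
    # Entry [i, j] is True when query position i may attend to key position j (j <= i).
    return np.tril(np.ones((length, length), dtype=bool))

def masked_softmax(logits, mask):
    # Disallowed positions get a very negative logit, so their weight is effectively 0.
    logits = np.where(mask, logits, -1e9)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = causal_mask(3)
# With all-zero logits, each position attends uniformly over itself and earlier positions.
weights = masked_softmax(np.zeros((3, 3)), mask)
print(mask.astype(int))
print(np.round(weights, 3))
```

The first row puts all of its weight on position 0, the second row splits it evenly over positions 0 and 1, and so on.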

The code below will verify the accuracy of each block.

In [6]:
from transformer import TransformerEncoderBlock

batch_size = 2
sequence_length = 5
hidden_size = 6
filter_size = 12
n_heads = 2

with open(root_folder+"transformer_checks/transformer_encoder_block_io.json", "r") as f:
    io = json.load(f)
    inputs = np.array(io['inputs'])
    expected_output = np.array(io['output'])
#print(inputs.shape)
tf.reset_default_graph()
inputs_h = tf.placeholder(tf.float32,shape=(None, sequence_length, hidden_size), name="inputs")
enc_block = TransformerEncoderBlock(n_heads=n_heads, filter_size=filter_size, hidden_size=hidden_size)
output_t = enc_block(inputs_h)
saver = tf.train.Saver()

with tf.Session() as sess:
    saver.restore(sess, root_folder+"transformer_checks/transformer_encoder_block")
    output = sess.run(output_t, feed_dict={inputs_h: inputs})
    
    #print(output)
print("Total error on the output:",np.sum(np.abs(expected_output-output)), "(should be 0.0 or close to 0.0)")

INFO:tensorflow:Restoring parameters from transformer_checks/transformer_encoder_block
Total error on the output: 5.58607280254364e-06 (should be 0.0 or close to 0.0)


In [7]:
from transformer import TransformerDecoderBlock

batch_size = 2
encoder_length = 5
decoder_length = 3
hidden_size = 6
filter_size = 12
n_heads = 2

with open(root_folder+"transformer_checks/transformer_decoder_block_io.json", "r") as f:
    io = json.load(f)
    decoder_inputs = np.array(io['decoder_inputs'])
    encoder_output = np.array(io['encoder_output'])
    expected_output = np.array(io['expected_output'])

tf.reset_default_graph()
decoder_inputs_h = tf.placeholder(tf.float32,shape=(None, decoder_length, hidden_size), name="dec_inputs")
encoder_output_h = tf.placeholder(tf.float32,shape=(None, encoder_length, hidden_size), name="enc_out")

dec_block = TransformerDecoderBlock(n_heads=n_heads, filter_size=filter_size, hidden_size=hidden_size)
output_t = dec_block(decoder_inputs_h, encoder_output_h)
saver = tf.train.Saver()

with tf.Session() as sess:
    saver.restore(sess, root_folder+"transformer_checks/transformer_decoder_block")
    output = sess.run(output_t, feed_dict={decoder_inputs_h: decoder_inputs, encoder_output_h: encoder_output})
    
print("Total error on the output:",np.sum(np.abs(expected_output-output)), "(should be 0.0 or close to 0.0)")

INFO:tensorflow:Restoring parameters from transformer_checks/transformer_decoder_block
Total error on the output: 5.379319190979004e-06 (should be 0.0 or close to 0.0)


## (5) Transformer

This is the final high-level function that pieces it all together.

You have to implement the call function of the Transformer class in the `transformer.py` file.

The block below verifies your implementation is correct.

In [8]:
from transformer import Transformer

batch_size = 2
vocab_size = 11
n_layers = 3
n_heads = 4
d_model = 8
d_filter = 16
input_length = 5
output_length = 3

with open(root_folder+"transformer_checks/transformer_io.json", "r") as f:
    io = json.load(f)
    enc_input = np.array(io['enc_input'])
    dec_input = np.array(io['dec_input'])
    enc_mask = np.array(io['enc_mask'])
    dec_mask = np.array(io['dec_mask'])
    expected_output = np.array(io['output'])
    
tf.reset_default_graph()
enc_input_h = tf.placeholder(tf.int32,shape=(None, input_length), name="enc_inp")
dec_input_h = tf.placeholder(tf.int32,shape=(None, output_length), name="dec_inp")
enc_mask_h = tf.placeholder(tf.bool,shape=(None,input_length),name="encoder_mask")
dec_mask_h = tf.placeholder(tf.bool, shape=(None,output_length),name="decoder_mask")

transfo = Transformer(vocab_size=vocab_size, n_layers=n_layers, n_heads=n_heads, d_model=d_model, d_filter=d_filter)
output_t = transfo(enc_input_h, target_sequence=dec_input_h, encoder_mask=enc_mask_h, decoder_mask=dec_mask_h)

saver = tf.train.Saver()

with tf.Session() as sess:
    saver.restore(sess, root_folder+"transformer_checks/transformer")
    output = sess.run(output_t, feed_dict={enc_input_h: enc_input, dec_input_h: dec_input, enc_mask_h: enc_mask, dec_mask_h: dec_mask})

print("Total error on the output:",np.sum(np.abs(expected_output-output)), "(should be 0.0 or close to 0.0)")

INFO:tensorflow:Restoring parameters from transformer_checks/transformer
Total error on the output: 4.5239925384521484e-05 (should be 0.0 or close to 0.0)


# Creating a Transformer

Now that all the blocks of the Transformer are implemented, we can create a full model with placeholders and a loss.

We provide the placeholders and the loss, as they are similar to the ones in the previous assignment.

In [9]:
# We are giving you the trainer, as it is similar to the one
# you created in the Language Modeling assignment.

class TransformerTrainer():

    def __init__(self, vocab_size, d_model, input_length, output_length, n_layers, d_filter, learning_rate=1e-3):

        self.source_sequence = tf.placeholder(tf.int32,shape=(None,input_length), name="source_sequence")
        self.target_sequence = tf.placeholder(tf.int32, shape=(None,output_length),name="target_sequence")
        self.encoder_mask = tf.placeholder(tf.bool,shape=(None,input_length),name="encoder_mask")
        self.decoder_mask = tf.placeholder(tf.bool, shape=(None,output_length),name="decoder_mask")

        self.model = Transformer(vocab_size=vocab_size, d_model=d_model, n_layers=n_layers, d_filter=d_filter)

        self.decoded_logits = self.model(self.source_sequence, self.target_sequence, encoder_mask=self.encoder_mask, decoder_mask=self.decoder_mask)
        self.global_step = tf.train.get_or_create_global_step()
        
        # Summarization loss
        self.loss = tf.losses.sparse_softmax_cross_entropy(self.target_sequence, self.decoded_logits, tf.cast(self.decoder_mask, tf.float32))
        self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        self.train_op = self.optimizer.minimize(self.loss, global_step=self.global_step)
        self.saver = tf.train.Saver()
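
To make the role of `decoder_mask` in the loss more concrete, here is a small NumPy sketch of a masked cross-entropy. It assumes that the weighted loss is averaged over the positions with non-zero weight, which is our understanding of the default reduction of `tf.losses.sparse_softmax_cross_entropy` in TensorFlow 1.x. The function name and the toy inputs are made up for the illustration.

```python
import numpy as np

def masked_sparse_cross_entropy(targets, logits, mask):
    # targets: (batch, length) int ids; logits: (batch, length, vocab); mask: (batch, length) bool
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the target token at every position.
    token_nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    weights = mask.astype(np.float64)
    # Average only over real (unmasked) tokens; padding contributes nothing.
    return (token_nll * weights).sum() / max(weights.sum(), 1.0)

targets = np.array([[1, 2, 0]])
mask = np.array([[True, True, False]])   # last position is padding
logits = np.zeros((1, 3, 4))             # uniform prediction over a 4-token vocabulary
loss = masked_sparse_cross_entropy(targets, logits, mask)

# Changing the logits at the padded position must not change the loss.
logits_changed = logits.copy()
logits_changed[0, 2, 3] = 50.0
loss_changed = masked_sparse_cross_entropy(targets, logits_changed, mask)
print(loss, loss_changed)  # both equal log(4), about 1.386
```

A uniform prediction over 4 tokens gives a loss of log(4), and the padded position has no influence on the result.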

We now instantiate the Transformer with a set of hyperparameters specific to the task of summarization.
In summarization, we go from articles of up to 400 word pieces to summaries of up to 100 word pieces.
The vocabulary size is set for you at 10,000 word pieces (we are using WordPieces; [here is a paper about subword encoding](http://aclweb.org/anthology/P18-1007), if you are interested).

In [10]:
# Dataset related parameters
vocab_size = len(vocab)
ilength = 400 # Length of the article
olength  = 100 # Length of the summaries

# Model related parameters, feel free to modify these.
n_layers = 6
d_model  = 128
d_filter = 416
tf.reset_default_graph()
model = TransformerTrainer(vocab_size, d_model, ilength, olength, n_layers, d_filter)

# Training the model

Your objective is to train the model on the dataset you are provided to reach a **validation loss <= 4.50**

Careful: we will be testing this loss on an unreleased test set, so make sure to evaluate properly on a validation set and not overfit.

You must save the model you want us to test under: models/final_transformer_summarization (the .index, .meta and .data files)

**Advice**:
- It should be possible to attain validation loss <= 4.50 with the model dimensions we've specified (n_layers=6, d_model=104, d_filter=416), but you can tune these hyperparameters. Increasing d_model will yield a better model, at the cost of longer training time.
- You should try tuning the learning rate, as well as which optimizer you use.
- You might need to train for a while (up to 2 hours) to obtain our expected loss. Remember to tune your hyperparameters first; once you find ones that work well, let the model train for longer.
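
One possible way to follow the advice about not overfitting is to write a checkpoint only when the validation loss improves, so that the saved model is always the best one seen so far. The sketch below shows this bookkeeping in plain Python. The function name and the list of losses are hypothetical, and in the notebook the "save" step would be a call to `model.saver.save(...)`.

```python
def best_checkpoint_steps(val_losses, target=4.50):
    """Return the steps at which a checkpoint would be written,
    the best loss seen, and whether the target loss was reached."""
    best = float("inf")
    saved_at = []
    for step, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            saved_at.append(step)  # in the notebook: save the model weights here
    return saved_at, best, best <= target

# Hypothetical validation losses measured at successive evaluation points.
losses = [8.7, 6.9, 5.3, 5.4, 4.6, 4.45, 4.52]
saved_at, best, reached = best_checkpoint_steps(losses)
print(saved_at, best, reached)  # [0, 1, 2, 4, 5] 4.45 True
```

With this scheme, a later increase in validation loss (steps 3 and 6 in the example) never overwrites a better checkpoint.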

**Dataset**: as in the previous notebook, make sure the dataset files are in the `dataset` folder. These can be found on the Google Drive.


In [11]:
with open(root_folder+"dataset/summarization_dataset_preprocessed.json", "r") as f:

    dataset = json.load(f)

# We load the dataset, and split it into 2 sub-datasets based on if they are training or validation.
# Feel free to split this dataset another way, but remember, a validation set is important, to have an idea of 
# the amount of overfitting that has occurred!

d_train = [d for d in dataset if d['cut'] == 'training']
d_valid = [d for d in dataset if d['cut'] == 'evaluation']

len(d_train), len(d_valid)

(61055, 1558)

In [30]:
# An example (article, summary) pair in the training data:

print(d_train[145]['story'])
print("=======================\n=======================")
print(d_train[145]['summary'])

Tbilisi, Georgia (CNN)Police have shot and killed a white tiger that killed a man Wednesday in Tbilisi, Georgia, a Ministry of Internal Affairs representative said, after severe flooding allowed hundreds of wild animals to escape the city zoo. 
The tiger attack happened at a warehouse in the city center. The animal had been unaccounted for since the weekend floods destroyed the zoo premises.
The man killed, who was 43, worked in a company based in the warehouse, the Ministry of Internal Affairs said. Doctors said he was attacked in the throat and died before reaching the hospital. 
Experts are still searching the warehouse, the ministry said, adding that earlier reports that the tiger had injured a second man were unfounded. 
The zoo administration said Wednesday that another tiger was still missing. It was unable to confirm if the creature was dead or had escaped alive.
Georgian Prime Minister Irakli Garibashvili apologized to the public, saying he had been misinformed by the zoo's ma

Similarly to the previous assignment, we create a function to get a random batch to train on, given a dataset.

In [12]:
def build_batch(dataset, batch_size):
    indices = list(np.random.randint(0, len(dataset), size=batch_size))
    
    batch = [dataset[i] for i in indices]
    batch_input = np.array([a['input'] for a in batch])
    batch_input_mask = np.array([a['input_mask'] for a in batch])
    batch_output = np.array([a['output'] for a in batch])
    batch_output_mask = np.array([a['output_mask'] for a in batch])
    
    return batch_input, batch_input_mask, batch_output, batch_output_mask
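
To see what `build_batch` produces, here is a standalone sketch that runs it on a tiny made-up dataset. The toy examples only mimic the `input`, `input_mask`, `output` and `output_mask` fields used above; their values are hypothetical.

```python
import numpy as np

def build_batch(dataset, batch_size):
    # Sample `batch_size` indices uniformly at random, with replacement.
    indices = list(np.random.randint(0, len(dataset), size=batch_size))
    batch = [dataset[i] for i in indices]
    batch_input = np.array([a['input'] for a in batch])
    batch_input_mask = np.array([a['input_mask'] for a in batch])
    batch_output = np.array([a['output'] for a in batch])
    batch_output_mask = np.array([a['output_mask'] for a in batch])
    return batch_input, batch_input_mask, batch_output, batch_output_mask

# Hypothetical toy dataset: inputs padded to length 4, outputs padded to length 2.
toy = [
    {"input": [1, 2, 3, 0], "input_mask": [True, True, True, False],
     "output": [4, 0], "output_mask": [True, False]},
    {"input": [5, 6, 0, 0], "input_mask": [True, True, False, False],
     "output": [7, 8], "output_mask": [True, True]},
]

np.random.seed(0)
b_in, b_in_mask, b_out, b_out_mask = build_batch(toy, batch_size=3)
print(b_in.shape, b_in_mask.shape)    # (3, 4) (3, 4)
print(b_out.shape, b_out_mask.shape)  # (3, 2) (3, 2)
```

Because the indices are drawn with replacement, the same example can appear more than once in a batch, and a batch can be larger than the dataset it is drawn from.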

In [None]:
# Skeleton code, as in the previous notebook.
# Write your training code and save your best performing model on the
# validation set. We will be testing the loss on a held-out test dataset.

folder_path="models/final_transformer_summarization"
no_epochs=5
batch_size = 32
no_batches=len(d_train)//batch_size
val_size=len(d_valid)
val_loss=10

#Get Validation Data
val_input, val_input_mask, val_output, val_output_mask = build_batch(d_valid, val_size)
feed_val = {model.source_sequence: val_input, model.target_sequence: val_output,model.encoder_mask: val_input_mask, model.decoder_mask: val_output_mask}

with tf.Session() as sess:
    # This is how you randomly initialize the Transformer weights.
    sess.run(tf.global_variables_initializer())

    for e in range(no_epochs):
        print("-----Epoch: {}".format(e))
        if val_loss<4:
            break
        
        for b in range(no_batches):
    
            # Create a random mini-batch from the training dataset
            batch_input, batch_input_mask, batch_output, batch_output_mask = build_batch(d_train, batch_size)
            # Build the feed-dict connecting placeholders and mini-batch
            feed = {model.source_sequence: batch_input, model.target_sequence: batch_output,model.encoder_mask: batch_input_mask, model.decoder_mask: batch_output_mask}

            # Run one training step and obtain the loss. Be careful about when you run train_op (training) and when you do not (validation).
            train_loss, _, step = sess.run([model.loss, model.train_op, model.global_step], feed_dict=feed)
            
            if b%1==0:
                
                val_loss= sess.run(model.loss, feed_dict=feed_val)
                print("Iteration:  {} ----  Train Loss: {:.3f} ----  Validation Loss: {}".format(b,train_loss,val_loss))
                # This is how you save model weights into a file
                model.saver.save(sess, root_folder+folder_path)
                


-----Epoch: 0
Iteration:  0 ----  Train Loss: 10.007 ----  Validation Loss: 8.67977523803711
Iteration:  1 ----  Train Loss: 8.713 ----  Validation Loss: 8.52271556854248
Iteration:  2 ----  Train Loss: 8.512 ----  Validation Loss: 8.760713577270508
Iteration:  3 ----  Train Loss: 8.760 ----  Validation Loss: 9.195798873901367
Iteration:  4 ----  Train Loss: 9.106 ----  Validation Loss: 8.891547203063965
Iteration:  5 ----  Train Loss: 8.772 ----  Validation Loss: 8.423049926757812
Iteration:  6 ----  Train Loss: 8.494 ----  Validation Loss: 8.112584114074707
Iteration:  7 ----  Train Loss: 8.143 ----  Validation Loss: 7.998614311218262
Iteration:  8 ----  Train Loss: 8.083 ----  Validation Loss: 8.05123519897461
Iteration:  9 ----  Train Loss: 8.183 ----  Validation Loss: 7.986842632293701
Iteration:  10 ----  Train Loss: 7.901 ----  Validation Loss: 7.889002799987793
Iteration:  11 ----  Train Loss: 7.857 ----  Validation Loss: 7.921374797821045
Iteration:  12 ----  Train Loss: 7.957

Iteration:  103 ----  Train Loss: 6.982 ----  Validation Loss: 6.884770393371582
Iteration:  104 ----  Train Loss: 6.916 ----  Validation Loss: 6.792153835296631
Iteration:  105 ----  Train Loss: 6.798 ----  Validation Loss: 6.845345973968506
Iteration:  106 ----  Train Loss: 6.832 ----  Validation Loss: 6.755489349365234
Iteration:  107 ----  Train Loss: 6.666 ----  Validation Loss: 6.79281759262085
Iteration:  108 ----  Train Loss: 6.911 ----  Validation Loss: 6.751568794250488
Iteration:  109 ----  Train Loss: 6.674 ----  Validation Loss: 6.780922889709473
Iteration:  110 ----  Train Loss: 6.842 ----  Validation Loss: 6.751992702484131
Iteration:  111 ----  Train Loss: 6.702 ----  Validation Loss: 6.753074645996094
Iteration:  112 ----  Train Loss: 6.788 ----  Validation Loss: 6.733000755310059
Iteration:  113 ----  Train Loss: 6.716 ----  Validation Loss: 6.676428318023682
Iteration:  114 ----  Train Loss: 6.811 ----  Validation Loss: 6.662871360778809
Iteration:  115 ----  Train L

Iteration:  205 ----  Train Loss: 6.200 ----  Validation Loss: 6.19789457321167
Iteration:  206 ----  Train Loss: 6.112 ----  Validation Loss: 6.245452404022217
Iteration:  207 ----  Train Loss: 6.165 ----  Validation Loss: 6.214327335357666
Iteration:  208 ----  Train Loss: 6.278 ----  Validation Loss: 6.180044174194336
Iteration:  209 ----  Train Loss: 6.374 ----  Validation Loss: 6.227603912353516
Iteration:  210 ----  Train Loss: 6.171 ----  Validation Loss: 6.185586929321289
Iteration:  211 ----  Train Loss: 6.314 ----  Validation Loss: 6.1911845207214355
Iteration:  212 ----  Train Loss: 6.306 ----  Validation Loss: 6.157428741455078
Iteration:  213 ----  Train Loss: 6.252 ----  Validation Loss: 6.181612968444824
Iteration:  214 ----  Train Loss: 6.088 ----  Validation Loss: 6.191200256347656
Iteration:  215 ----  Train Loss: 6.183 ----  Validation Loss: 6.165416717529297
Iteration:  216 ----  Train Loss: 6.093 ----  Validation Loss: 6.161003112792969
Iteration:  217 ----  Train 

Iteration:  307 ----  Train Loss: 5.931 ----  Validation Loss: 5.947957515716553
Iteration:  308 ----  Train Loss: 5.691 ----  Validation Loss: 5.932313442230225
Iteration:  309 ----  Train Loss: 5.960 ----  Validation Loss: 5.938601016998291
Iteration:  310 ----  Train Loss: 5.939 ----  Validation Loss: 5.948243618011475
Iteration:  311 ----  Train Loss: 6.026 ----  Validation Loss: 5.920848846435547
Iteration:  312 ----  Train Loss: 5.662 ----  Validation Loss: 5.923020362854004
Iteration:  313 ----  Train Loss: 5.759 ----  Validation Loss: 5.940638065338135
Iteration:  314 ----  Train Loss: 5.985 ----  Validation Loss: 5.957841396331787
Iteration:  315 ----  Train Loss: 6.059 ----  Validation Loss: 5.933299541473389
Iteration:  316 ----  Train Loss: 5.915 ----  Validation Loss: 5.933446407318115
Iteration:  317 ----  Train Loss: 5.768 ----  Validation Loss: 5.933737277984619
Iteration:  318 ----  Train Loss: 5.801 ----  Validation Loss: 5.955179214477539
Iteration:  319 ----  Train 

Iteration:  409 ----  Train Loss: 5.921 ----  Validation Loss: 5.741721153259277
Iteration:  410 ----  Train Loss: 5.791 ----  Validation Loss: 5.750595569610596
Iteration:  411 ----  Train Loss: 5.488 ----  Validation Loss: 5.7445855140686035
Iteration:  412 ----  Train Loss: 5.759 ----  Validation Loss: 5.751798629760742
Iteration:  413 ----  Train Loss: 5.901 ----  Validation Loss: 5.752812385559082
Iteration:  414 ----  Train Loss: 5.698 ----  Validation Loss: 5.727957248687744
Iteration:  415 ----  Train Loss: 5.720 ----  Validation Loss: 5.731263160705566
Iteration:  416 ----  Train Loss: 5.879 ----  Validation Loss: 5.751122951507568
Iteration:  417 ----  Train Loss: 5.801 ----  Validation Loss: 5.71882438659668
Iteration:  418 ----  Train Loss: 5.622 ----  Validation Loss: 5.7288031578063965
Iteration:  419 ----  Train Loss: 5.850 ----  Validation Loss: 5.745349407196045
Iteration:  420 ----  Train Loss: 5.695 ----  Validation Loss: 5.727166652679443
Iteration:  421 ----  Train

Iteration:  511 ----  Train Loss: 5.601 ----  Validation Loss: 5.596654415130615
Iteration:  512 ----  Train Loss: 5.599 ----  Validation Loss: 5.611676216125488
Iteration:  513 ----  Train Loss: 5.906 ----  Validation Loss: 5.620067596435547
Iteration:  514 ----  Train Loss: 5.774 ----  Validation Loss: 5.599830627441406
Iteration:  515 ----  Train Loss: 5.663 ----  Validation Loss: 5.634026050567627
Iteration:  516 ----  Train Loss: 5.598 ----  Validation Loss: 5.606135845184326
Iteration:  517 ----  Train Loss: 5.479 ----  Validation Loss: 5.589108467102051
Iteration:  518 ----  Train Loss: 5.877 ----  Validation Loss: 5.608665943145752
Iteration:  519 ----  Train Loss: 5.725 ----  Validation Loss: 5.591723442077637
Iteration:  520 ----  Train Loss: 5.638 ----  Validation Loss: 5.594696521759033
Iteration:  521 ----  Train Loss: 5.694 ----  Validation Loss: 5.602048397064209
Iteration:  522 ----  Train Loss: 5.521 ----  Validation Loss: 5.585960865020752
Iteration:  523 ----  Train 

Iteration:  613 ----  Train Loss: 5.543 ----  Validation Loss: 5.530158519744873
Iteration:  614 ----  Train Loss: 5.526 ----  Validation Loss: 5.519693851470947
Iteration:  615 ----  Train Loss: 5.623 ----  Validation Loss: 5.5241804122924805
Iteration:  616 ----  Train Loss: 5.515 ----  Validation Loss: 5.529833793640137
Iteration:  617 ----  Train Loss: 5.547 ----  Validation Loss: 5.5327277183532715
Iteration:  618 ----  Train Loss: 5.549 ----  Validation Loss: 5.517020225524902
Iteration:  619 ----  Train Loss: 5.373 ----  Validation Loss: 5.503746509552002
Iteration:  620 ----  Train Loss: 5.622 ----  Validation Loss: 5.494482517242432
Iteration:  621 ----  Train Loss: 5.586 ----  Validation Loss: 5.505293846130371
Iteration:  622 ----  Train Loss: 5.564 ----  Validation Loss: 5.5291595458984375
Iteration:  623 ----  Train Loss: 5.335 ----  Validation Loss: 5.50836181640625
Iteration:  624 ----  Train Loss: 5.481 ----  Validation Loss: 5.512810230255127
Iteration:  625 ----  Trai

Iteration:  715 ----  Train Loss: 5.519 ----  Validation Loss: 5.45617151260376
Iteration:  716 ----  Train Loss: 5.586 ----  Validation Loss: 5.461826324462891
Iteration:  717 ----  Train Loss: 5.532 ----  Validation Loss: 5.44578742980957
Iteration:  718 ----  Train Loss: 5.437 ----  Validation Loss: 5.446972846984863
Iteration:  719 ----  Train Loss: 5.423 ----  Validation Loss: 5.457370758056641
Iteration:  720 ----  Train Loss: 5.505 ----  Validation Loss: 5.447963714599609
Iteration:  721 ----  Train Loss: 5.472 ----  Validation Loss: 5.456634998321533
Iteration:  722 ----  Train Loss: 5.495 ----  Validation Loss: 5.459018707275391
Iteration:  723 ----  Train Loss: 5.192 ----  Validation Loss: 5.455605983734131
Iteration:  724 ----  Train Loss: 5.480 ----  Validation Loss: 5.447632789611816
Iteration:  725 ----  Train Loss: 5.418 ----  Validation Loss: 5.472736835479736
Iteration:  726 ----  Train Loss: 5.675 ----  Validation Loss: 5.468730926513672
Iteration:  727 ----  Train Lo

Iteration:  817 ----  Train Loss: 5.367 ----  Validation Loss: 5.415581703186035
Iteration:  818 ----  Train Loss: 5.261 ----  Validation Loss: 5.41074275970459
Iteration:  819 ----  Train Loss: 5.330 ----  Validation Loss: 5.395697116851807
Iteration:  820 ----  Train Loss: 5.158 ----  Validation Loss: 5.393232822418213
Iteration:  821 ----  Train Loss: 5.503 ----  Validation Loss: 5.4011640548706055
Iteration:  822 ----  Train Loss: 5.450 ----  Validation Loss: 5.377888202667236
Iteration:  823 ----  Train Loss: 5.470 ----  Validation Loss: 5.375551223754883
Iteration:  824 ----  Train Loss: 5.397 ----  Validation Loss: 5.3856096267700195
Iteration:  825 ----  Train Loss: 5.396 ----  Validation Loss: 5.394066333770752
Iteration:  826 ----  Train Loss: 5.198 ----  Validation Loss: 5.377017021179199
Iteration:  827 ----  Train Loss: 5.472 ----  Validation Loss: 5.375180244445801
Iteration:  828 ----  Train Loss: 5.238 ----  Validation Loss: 5.376309394836426
Iteration:  829 ----  Train

Iteration:  919 ----  Train Loss: 5.501 ----  Validation Loss: 5.337430477142334
Iteration:  920 ----  Train Loss: 5.299 ----  Validation Loss: 5.340147495269775
Iteration:  921 ----  Train Loss: 5.348 ----  Validation Loss: 5.341284275054932
Iteration:  922 ----  Train Loss: 5.437 ----  Validation Loss: 5.352200508117676
Iteration:  923 ----  Train Loss: 5.341 ----  Validation Loss: 5.339975833892822
Iteration:  924 ----  Train Loss: 5.245 ----  Validation Loss: 5.325348377227783
Iteration:  925 ----  Train Loss: 5.379 ----  Validation Loss: 5.3216729164123535
Iteration:  926 ----  Train Loss: 5.406 ----  Validation Loss: 5.327038288116455
Iteration:  927 ----  Train Loss: 5.490 ----  Validation Loss: 5.343814373016357
Iteration:  928 ----  Train Loss: 5.237 ----  Validation Loss: 5.320337772369385
Iteration:  929 ----  Train Loss: 5.268 ----  Validation Loss: 5.321713447570801
Iteration:  930 ----  Train Loss: 5.255 ----  Validation Loss: 5.312236785888672
Iteration:  931 ----  Train

Iteration:  1020 ----  Train Loss: 5.226 ----  Validation Loss: 5.2962493896484375
Iteration:  1021 ----  Train Loss: 4.945 ----  Validation Loss: 5.294392108917236
Iteration:  1022 ----  Train Loss: 5.345 ----  Validation Loss: 5.297206401824951
Iteration:  1023 ----  Train Loss: 5.173 ----  Validation Loss: 5.287859916687012
Iteration:  1024 ----  Train Loss: 5.233 ----  Validation Loss: 5.278548240661621
Iteration:  1025 ----  Train Loss: 5.444 ----  Validation Loss: 5.278409004211426
Iteration:  1026 ----  Train Loss: 5.298 ----  Validation Loss: 5.282951354980469
Iteration:  1027 ----  Train Loss: 5.273 ----  Validation Loss: 5.288089752197266
Iteration:  1028 ----  Train Loss: 5.391 ----  Validation Loss: 5.291444778442383
Iteration:  1029 ----  Train Loss: 5.208 ----  Validation Loss: 5.289310455322266
Iteration:  1030 ----  Train Loss: 5.317 ----  Validation Loss: 5.2767486572265625
Iteration:  1031 ----  Train Loss: 5.267 ----  Validation Loss: 5.2798171043396
Iteration:  1032

Iteration:  1120 ----  Train Loss: 5.334 ----  Validation Loss: 5.269781589508057
Iteration:  1121 ----  Train Loss: 5.321 ----  Validation Loss: 5.261507987976074
Iteration:  1122 ----  Train Loss: 5.058 ----  Validation Loss: 5.257890701293945
Iteration:  1123 ----  Train Loss: 5.387 ----  Validation Loss: 5.245558738708496
Iteration:  1124 ----  Train Loss: 5.261 ----  Validation Loss: 5.253586769104004
Iteration:  1125 ----  Train Loss: 5.334 ----  Validation Loss: 5.259716510772705
Iteration:  1126 ----  Train Loss: 5.536 ----  Validation Loss: 5.262256145477295
Iteration:  1127 ----  Train Loss: 5.248 ----  Validation Loss: 5.250502586364746
Iteration:  1128 ----  Train Loss: 5.142 ----  Validation Loss: 5.245802402496338
Iteration:  1129 ----  Train Loss: 5.410 ----  Validation Loss: 5.2513427734375
Iteration:  1130 ----  Train Loss: 5.296 ----  Validation Loss: 5.256730079650879
Iteration:  1131 ----  Train Loss: 5.304 ----  Validation Loss: 5.249411106109619
Iteration:  1132 -

Iteration:  1220 ----  Train Loss: 4.881 ----  Validation Loss: 5.225401401519775
Iteration:  1221 ----  Train Loss: 5.282 ----  Validation Loss: 5.222836017608643
Iteration:  1222 ----  Train Loss: 4.932 ----  Validation Loss: 5.211705684661865
Iteration:  1223 ----  Train Loss: 5.314 ----  Validation Loss: 5.2142653465271
Iteration:  1224 ----  Train Loss: 5.258 ----  Validation Loss: 5.2068891525268555
Iteration:  1225 ----  Train Loss: 5.342 ----  Validation Loss: 5.210072040557861
Iteration:  1226 ----  Train Loss: 5.209 ----  Validation Loss: 5.222681999206543
Iteration:  1227 ----  Train Loss: 5.208 ----  Validation Loss: 5.209819793701172
Iteration:  1228 ----  Train Loss: 5.085 ----  Validation Loss: 5.196742534637451
Iteration:  1229 ----  Train Loss: 5.570 ----  Validation Loss: 5.203137397766113
Iteration:  1230 ----  Train Loss: 5.413 ----  Validation Loss: 5.223742485046387
Iteration:  1231 ----  Train Loss: 5.242 ----  Validation Loss: 5.212557315826416
Iteration:  1232 

Iteration:  1320 ----  Train Loss: 5.188 ----  Validation Loss: 5.188730716705322
Iteration:  1321 ----  Train Loss: 5.016 ----  Validation Loss: 5.174473762512207
Iteration:  1322 ----  Train Loss: 5.121 ----  Validation Loss: 5.185000896453857
Iteration:  1323 ----  Train Loss: 5.228 ----  Validation Loss: 5.191696643829346
Iteration:  1324 ----  Train Loss: 5.523 ----  Validation Loss: 5.19225549697876
Iteration:  1325 ----  Train Loss: 5.041 ----  Validation Loss: 5.185523509979248
Iteration:  1326 ----  Train Loss: 5.296 ----  Validation Loss: 5.176225662231445
Iteration:  1327 ----  Train Loss: 5.094 ----  Validation Loss: 5.1805949211120605
Iteration:  1328 ----  Train Loss: 4.957 ----  Validation Loss: 5.189404010772705
Iteration:  1329 ----  Train Loss: 5.228 ----  Validation Loss: 5.181819915771484
Iteration:  1330 ----  Train Loss: 5.228 ----  Validation Loss: 5.179991245269775
Iteration:  1331 ----  Train Loss: 4.945 ----  Validation Loss: 5.176795959472656
Iteration:  1332

Iteration:  1420 ----  Train Loss: 5.150 ----  Validation Loss: 5.1340532302856445
Iteration:  1421 ----  Train Loss: 4.792 ----  Validation Loss: 5.150972843170166
Iteration:  1422 ----  Train Loss: 5.052 ----  Validation Loss: 5.157068729400635
Iteration:  1423 ----  Train Loss: 5.118 ----  Validation Loss: 5.146186828613281
Iteration:  1424 ----  Train Loss: 5.143 ----  Validation Loss: 5.132877349853516
Iteration:  1425 ----  Train Loss: 5.234 ----  Validation Loss: 5.130870819091797
Iteration:  1426 ----  Train Loss: 5.120 ----  Validation Loss: 5.138670444488525
Iteration:  1427 ----  Train Loss: 5.029 ----  Validation Loss: 5.145425796508789
Iteration:  1428 ----  Train Loss: 4.802 ----  Validation Loss: 5.145691394805908
Iteration:  1429 ----  Train Loss: 5.174 ----  Validation Loss: 5.136600494384766
Iteration:  1430 ----  Train Loss: 4.975 ----  Validation Loss: 5.159383773803711
Iteration:  1431 ----  Train Loss: 5.203 ----  Validation Loss: 5.16884708404541
Iteration:  1432

Iteration:  1520 ----  Train Loss: 5.003 ----  Validation Loss: 5.103593349456787
Iteration:  1521 ----  Train Loss: 4.957 ----  Validation Loss: 5.109870910644531
Iteration:  1522 ----  Train Loss: 5.221 ----  Validation Loss: 5.141987323760986
Iteration:  1523 ----  Train Loss: 5.105 ----  Validation Loss: 5.129457950592041
Iteration:  1524 ----  Train Loss: 5.043 ----  Validation Loss: 5.104854106903076
Iteration:  1525 ----  Train Loss: 4.979 ----  Validation Loss: 5.110279083251953
Iteration:  1526 ----  Train Loss: 5.200 ----  Validation Loss: 5.120273590087891
Iteration:  1527 ----  Train Loss: 5.019 ----  Validation Loss: 5.115592956542969
Iteration:  1528 ----  Train Loss: 5.084 ----  Validation Loss: 5.128420829772949
Iteration:  1529 ----  Train Loss: 5.027 ----  Validation Loss: 5.139223098754883
Iteration:  1530 ----  Train Loss: 5.302 ----  Validation Loss: 5.1173787117004395
Iteration:  1531 ----  Train Loss: 5.133 ----  Validation Loss: 5.113808631896973
Iteration:  153

Iteration:  1620 ----  Train Loss: 4.830 ----  Validation Loss: 5.087459564208984
Iteration:  1621 ----  Train Loss: 5.089 ----  Validation Loss: 5.088533878326416
Iteration:  1622 ----  Train Loss: 5.172 ----  Validation Loss: 5.100046157836914
Iteration:  1623 ----  Train Loss: 5.117 ----  Validation Loss: 5.098257541656494
Iteration:  1624 ----  Train Loss: 5.080 ----  Validation Loss: 5.086877822875977
Iteration:  1625 ----  Train Loss: 4.963 ----  Validation Loss: 5.079644680023193
Iteration:  1626 ----  Train Loss: 4.942 ----  Validation Loss: 5.074357509613037
Iteration:  1627 ----  Train Loss: 5.019 ----  Validation Loss: 5.080835819244385
Iteration:  1628 ----  Train Loss: 4.984 ----  Validation Loss: 5.084448337554932
Iteration:  1629 ----  Train Loss: 5.039 ----  Validation Loss: 5.076990604400635
Iteration:  1630 ----  Train Loss: 4.762 ----  Validation Loss: 5.08550500869751
Iteration:  1631 ----  Train Loss: 5.054 ----  Validation Loss: 5.090381145477295
Iteration:  1632 

Iteration:  1720 ----  Train Loss: 4.980 ----  Validation Loss: 5.051407814025879
Iteration:  1721 ----  Train Loss: 4.962 ----  Validation Loss: 5.0629963874816895
Iteration:  1722 ----  Train Loss: 5.006 ----  Validation Loss: 5.0762505531311035
Iteration:  1723 ----  Train Loss: 4.863 ----  Validation Loss: 5.063940525054932
Iteration:  1724 ----  Train Loss: 4.868 ----  Validation Loss: 5.054327964782715
Iteration:  1725 ----  Train Loss: 5.145 ----  Validation Loss: 5.055893898010254
Iteration:  1726 ----  Train Loss: 5.269 ----  Validation Loss: 5.0555009841918945
Iteration:  1727 ----  Train Loss: 5.021 ----  Validation Loss: 5.061692237854004
Iteration:  1728 ----  Train Loss: 5.159 ----  Validation Loss: 5.060827255249023
Iteration:  1729 ----  Train Loss: 5.085 ----  Validation Loss: 5.0803961753845215
Iteration:  1730 ----  Train Loss: 4.914 ----  Validation Loss: 5.081442356109619
Iteration:  1731 ----  Train Loss: 5.159 ----  Validation Loss: 5.050912380218506
Iteration:  

Iteration:  1820 ----  Train Loss: 4.941 ----  Validation Loss: 5.033489227294922
Iteration:  1821 ----  Train Loss: 4.741 ----  Validation Loss: 5.03232479095459
Iteration:  1822 ----  Train Loss: 4.826 ----  Validation Loss: 5.032073020935059
Iteration:  1823 ----  Train Loss: 5.088 ----  Validation Loss: 5.03150749206543
Iteration:  1824 ----  Train Loss: 4.879 ----  Validation Loss: 5.0285491943359375
Iteration:  1825 ----  Train Loss: 4.910 ----  Validation Loss: 5.033068656921387
Iteration:  1826 ----  Train Loss: 4.800 ----  Validation Loss: 5.03103494644165
Iteration:  1827 ----  Train Loss: 5.215 ----  Validation Loss: 5.019521236419678
Iteration:  1828 ----  Train Loss: 4.844 ----  Validation Loss: 5.023561954498291
Iteration:  1829 ----  Train Loss: 5.002 ----  Validation Loss: 5.025839805603027
Iteration:  1830 ----  Train Loss: 5.157 ----  Validation Loss: 5.0215373039245605
Iteration:  1831 ----  Train Loss: 5.028 ----  Validation Loss: 5.023189067840576
Iteration:  1832 

Iteration:  14 ----  Train Loss: 4.892 ----  Validation Loss: 5.00086784362793
Iteration:  15 ----  Train Loss: 5.001 ----  Validation Loss: 5.011538982391357
Iteration:  16 ----  Train Loss: 5.012 ----  Validation Loss: 5.020787715911865
Iteration:  17 ----  Train Loss: 4.910 ----  Validation Loss: 5.014736175537109
Iteration:  18 ----  Train Loss: 4.894 ----  Validation Loss: 5.002664566040039
Iteration:  19 ----  Train Loss: 5.196 ----  Validation Loss: 5.003419399261475
Iteration:  20 ----  Train Loss: 4.912 ----  Validation Loss: 5.006008148193359
Iteration:  21 ----  Train Loss: 4.698 ----  Validation Loss: 5.007732391357422
Iteration:  22 ----  Train Loss: 5.037 ----  Validation Loss: 5.007424831390381
Iteration:  23 ----  Train Loss: 4.909 ----  Validation Loss: 5.012799263000488
Iteration:  24 ----  Train Loss: 4.549 ----  Validation Loss: 5.038649559020996
Iteration:  25 ----  Train Loss: 4.807 ----  Validation Loss: 5.059661388397217
Iteration:  26 ----  Train Loss: 5.025 --

Iteration:  117 ----  Train Loss: 4.877 ----  Validation Loss: 4.972674369812012
Iteration:  118 ----  Train Loss: 4.752 ----  Validation Loss: 4.970985412597656
Iteration:  119 ----  Train Loss: 4.759 ----  Validation Loss: 4.974315643310547
Iteration:  120 ----  Train Loss: 4.928 ----  Validation Loss: 4.98099422454834
Iteration:  121 ----  Train Loss: 5.055 ----  Validation Loss: 4.971135139465332
Iteration:  122 ----  Train Loss: 4.922 ----  Validation Loss: 4.969357490539551
Iteration:  123 ----  Train Loss: 4.948 ----  Validation Loss: 4.985050678253174
Iteration:  124 ----  Train Loss: 5.033 ----  Validation Loss: 5.00342321395874
Iteration:  125 ----  Train Loss: 4.960 ----  Validation Loss: 4.979394912719727
Iteration:  126 ----  Train Loss: 4.991 ----  Validation Loss: 4.980649948120117
Iteration:  127 ----  Train Loss: 4.759 ----  Validation Loss: 5.01153039932251
Iteration:  128 ----  Train Loss: 4.850 ----  Validation Loss: 5.01528263092041
Iteration:  129 ----  Train Loss

Iteration:  219 ----  Train Loss: 4.571 ----  Validation Loss: 4.948429584503174
Iteration:  220 ----  Train Loss: 4.876 ----  Validation Loss: 4.964869499206543
Iteration:  221 ----  Train Loss: 4.882 ----  Validation Loss: 4.967685699462891
Iteration:  222 ----  Train Loss: 4.874 ----  Validation Loss: 4.960124969482422
Iteration:  223 ----  Train Loss: 4.888 ----  Validation Loss: 4.944619655609131
Iteration:  224 ----  Train Loss: 4.850 ----  Validation Loss: 4.945191860198975
Iteration:  225 ----  Train Loss: 4.961 ----  Validation Loss: 4.948084354400635
Iteration:  226 ----  Train Loss: 4.895 ----  Validation Loss: 4.946781158447266
Iteration:  227 ----  Train Loss: 4.836 ----  Validation Loss: 4.946169376373291
Iteration:  228 ----  Train Loss: 4.614 ----  Validation Loss: 4.950279712677002
Iteration:  229 ----  Train Loss: 4.953 ----  Validation Loss: 4.946408271789551
Iteration:  230 ----  Train Loss: 4.836 ----  Validation Loss: 4.951852798461914
Iteration:  231 ----  Train 

Iteration:  321 ----  Train Loss: 4.988 ----  Validation Loss: 4.932236194610596
Iteration:  322 ----  Train Loss: 4.951 ----  Validation Loss: 4.924662113189697
Iteration:  323 ----  Train Loss: 4.779 ----  Validation Loss: 4.933304786682129
Iteration:  324 ----  Train Loss: 5.116 ----  Validation Loss: 4.94537878036499
Iteration:  325 ----  Train Loss: 4.762 ----  Validation Loss: 4.937424182891846
Iteration:  326 ----  Train Loss: 4.694 ----  Validation Loss: 4.943544387817383
Iteration:  327 ----  Train Loss: 4.835 ----  Validation Loss: 4.9546613693237305
Iteration:  328 ----  Train Loss: 5.039 ----  Validation Loss: 4.947123050689697
Iteration:  329 ----  Train Loss: 4.925 ----  Validation Loss: 4.943378925323486
Iteration:  330 ----  Train Loss: 4.785 ----  Validation Loss: 4.93367862701416
Iteration:  331 ----  Train Loss: 5.124 ----  Validation Loss: 4.9312920570373535
Iteration:  332 ----  Train Loss: 4.965 ----  Validation Loss: 4.935765743255615
Iteration:  333 ----  Train 

Iteration:  423 ----  Train Loss: 4.946 ----  Validation Loss: 4.919123649597168
Iteration:  424 ----  Train Loss: 5.001 ----  Validation Loss: 4.923211097717285
Iteration:  425 ----  Train Loss: 5.006 ----  Validation Loss: 4.925907135009766
Iteration:  426 ----  Train Loss: 4.949 ----  Validation Loss: 4.929439544677734
Iteration:  427 ----  Train Loss: 4.824 ----  Validation Loss: 4.924445629119873
Iteration:  428 ----  Train Loss: 4.872 ----  Validation Loss: 4.91565465927124
Iteration:  429 ----  Train Loss: 4.724 ----  Validation Loss: 4.913360118865967
Iteration:  430 ----  Train Loss: 4.938 ----  Validation Loss: 4.9123663902282715
Iteration:  431 ----  Train Loss: 4.798 ----  Validation Loss: 4.918203830718994
Iteration:  432 ----  Train Loss: 4.977 ----  Validation Loss: 4.926858901977539
Iteration:  433 ----  Train Loss: 5.076 ----  Validation Loss: 4.927017688751221
Iteration:  434 ----  Train Loss: 4.952 ----  Validation Loss: 4.920456886291504
Iteration:  435 ----  Train 

Iteration:  525 ----  Train Loss: 4.461 ----  Validation Loss: 4.9114089012146
Iteration:  526 ----  Train Loss: 4.722 ----  Validation Loss: 4.904717445373535
Iteration:  527 ----  Train Loss: 4.763 ----  Validation Loss: 4.90263557434082
Iteration:  528 ----  Train Loss: 4.860 ----  Validation Loss: 4.905561923980713
Iteration:  529 ----  Train Loss: 4.893 ----  Validation Loss: 4.909581184387207
Iteration:  530 ----  Train Loss: 4.991 ----  Validation Loss: 4.904873371124268
Iteration:  531 ----  Train Loss: 4.415 ----  Validation Loss: 4.890449523925781
Iteration:  532 ----  Train Loss: 4.708 ----  Validation Loss: 4.89394474029541
Iteration:  533 ----  Train Loss: 4.828 ----  Validation Loss: 4.900832653045654
Iteration:  534 ----  Train Loss: 4.598 ----  Validation Loss: 4.904253005981445
Iteration:  535 ----  Train Loss: 4.845 ----  Validation Loss: 4.904942035675049
Iteration:  536 ----  Train Loss: 4.666 ----  Validation Loss: 4.8995680809021
Iteration:  537 ----  Train Loss: 

Iteration:  627 ----  Train Loss: 4.986 ----  Validation Loss: 4.88356351852417
Iteration:  628 ----  Train Loss: 4.754 ----  Validation Loss: 4.889678001403809
Iteration:  629 ----  Train Loss: 4.580 ----  Validation Loss: 4.887837886810303
Iteration:  630 ----  Train Loss: 4.724 ----  Validation Loss: 4.893331050872803
Iteration:  631 ----  Train Loss: 4.909 ----  Validation Loss: 4.901705265045166
Iteration:  632 ----  Train Loss: 4.794 ----  Validation Loss: 4.908322811126709
Iteration:  633 ----  Train Loss: 4.773 ----  Validation Loss: 4.898626804351807
Iteration:  634 ----  Train Loss: 4.874 ----  Validation Loss: 4.887174129486084
Iteration:  635 ----  Train Loss: 4.966 ----  Validation Loss: 4.88170051574707
Iteration:  636 ----  Train Loss: 4.771 ----  Validation Loss: 4.880207061767578
Iteration:  637 ----  Train Loss: 4.620 ----  Validation Loss: 4.8789520263671875
Iteration:  638 ----  Train Loss: 4.719 ----  Validation Loss: 4.878148555755615
Iteration:  639 ----  Train L

Iteration:  729 ----  Train Loss: 4.652 ----  Validation Loss: 4.866940021514893
Iteration:  730 ----  Train Loss: 4.531 ----  Validation Loss: 4.866424560546875
Iteration:  731 ----  Train Loss: 4.934 ----  Validation Loss: 4.862237930297852
Iteration:  732 ----  Train Loss: 4.710 ----  Validation Loss: 4.859340190887451
Iteration:  733 ----  Train Loss: 4.814 ----  Validation Loss: 4.861793518066406
Iteration:  734 ----  Train Loss: 4.738 ----  Validation Loss: 4.873211860656738
Iteration:  735 ----  Train Loss: 4.806 ----  Validation Loss: 4.874600410461426
Iteration:  736 ----  Train Loss: 4.775 ----  Validation Loss: 4.8683905601501465
Iteration:  737 ----  Train Loss: 4.722 ----  Validation Loss: 4.879542350769043
Iteration:  738 ----  Train Loss: 4.856 ----  Validation Loss: 4.873431205749512
Iteration:  739 ----  Train Loss: 4.542 ----  Validation Loss: 4.859036445617676
Iteration:  740 ----  Train Loss: 4.868 ----  Validation Loss: 4.874239444732666
Iteration:  741 ----  Train

Iteration:  831 ----  Train Loss: 4.742 ----  Validation Loss: 4.849947929382324
Iteration:  832 ----  Train Loss: 4.816 ----  Validation Loss: 4.860151767730713
Iteration:  833 ----  Train Loss: 4.872 ----  Validation Loss: 4.859526634216309
Iteration:  834 ----  Train Loss: 4.695 ----  Validation Loss: 4.851210117340088
Iteration:  835 ----  Train Loss: 4.740 ----  Validation Loss: 4.853640079498291
Iteration:  836 ----  Train Loss: 5.017 ----  Validation Loss: 4.8536200523376465
Iteration:  837 ----  Train Loss: 4.921 ----  Validation Loss: 4.8474345207214355
Iteration:  838 ----  Train Loss: 4.924 ----  Validation Loss: 4.860219955444336
Iteration:  839 ----  Train Loss: 4.590 ----  Validation Loss: 4.863420486450195
Iteration:  840 ----  Train Loss: 4.806 ----  Validation Loss: 4.8530449867248535
Iteration:  841 ----  Train Loss: 4.903 ----  Validation Loss: 4.8479533195495605
Iteration:  842 ----  Train Loss: 4.622 ----  Validation Loss: 4.8405632972717285
Iteration:  843 ----  T

Iteration:  933 ----  Train Loss: 4.661 ----  Validation Loss: 4.841170310974121
Iteration:  934 ----  Train Loss: 4.746 ----  Validation Loss: 4.8368988037109375
Iteration:  935 ----  Train Loss: 5.015 ----  Validation Loss: 4.843055725097656
Iteration:  936 ----  Train Loss: 4.679 ----  Validation Loss: 4.85270881652832
Iteration:  937 ----  Train Loss: 4.769 ----  Validation Loss: 4.844640254974365
Iteration:  938 ----  Train Loss: 4.683 ----  Validation Loss: 4.840487480163574
Iteration:  939 ----  Train Loss: 4.794 ----  Validation Loss: 4.852359294891357
Iteration:  940 ----  Train Loss: 4.719 ----  Validation Loss: 4.865027904510498
Iteration:  941 ----  Train Loss: 4.804 ----  Validation Loss: 4.8553786277771
Iteration:  942 ----  Train Loss: 4.845 ----  Validation Loss: 4.850130558013916
Iteration:  943 ----  Train Loss: 4.789 ----  Validation Loss: 4.840876579284668
Iteration:  944 ----  Train Loss: 4.836 ----  Validation Loss: 4.831457138061523
Iteration:  945 ----  Train Lo

Iteration:  1034 ----  Train Loss: 4.627 ----  Validation Loss: 4.839349746704102
Iteration:  1035 ----  Train Loss: 4.664 ----  Validation Loss: 4.831002235412598
Iteration:  1036 ----  Train Loss: 4.701 ----  Validation Loss: 4.8185505867004395
Iteration:  1037 ----  Train Loss: 4.644 ----  Validation Loss: 4.821834564208984
Iteration:  1038 ----  Train Loss: 4.781 ----  Validation Loss: 4.824356555938721
Iteration:  1039 ----  Train Loss: 4.778 ----  Validation Loss: 4.822392463684082
Iteration:  1040 ----  Train Loss: 4.717 ----  Validation Loss: 4.823319435119629
Iteration:  1041 ----  Train Loss: 4.828 ----  Validation Loss: 4.8322625160217285
Iteration:  1042 ----  Train Loss: 4.658 ----  Validation Loss: 4.843292713165283
Iteration:  1043 ----  Train Loss: 4.759 ----  Validation Loss: 4.8276777267456055
Iteration:  1044 ----  Train Loss: 4.560 ----  Validation Loss: 4.817562580108643
Iteration:  1045 ----  Train Loss: 4.532 ----  Validation Loss: 4.819546222686768
Iteration:  1

# Using the Summarization model

Now that you have trained a Transformer to perform Summarization, we will use the model on news articles from the wild.

The three subsections below explore what the model has learned.

In [13]:
# Put the file path to your best performing model in the string below.

model_file = root_folder+"models/final_transformer_summarization"
#model_file = root_folder+"models/transformer_summarizer"

## The validation loss

Measure the validation loss of your model. As in our previous notebook, this loss can be used to decide whether a summary is likely or unlikely for a given article.

We will use the code here with the unreleased test-set to evaluate your model.

In [14]:
with tf.Session() as sess:
    model.saver.restore(sess, model_file)

    e_input, e_input_mask, e_output, e_output_mask = build_batch(d_valid, 200)
    feed = {model.source_sequence: e_input, model.target_sequence: e_output,
            model.encoder_mask: e_input_mask, model.decoder_mask: e_output_mask}
    valid_loss = sess.run(model.loss, feed_dict=feed)
    print("Validation loss:", valid_loss)

INFO:tensorflow:Restoring parameters from models/final_transformer_summarization
Validation loss: 7.0038767
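
A loss value is easier to interpret as a perplexity. The sketch below assumes that `model.loss` is the average per-token cross-entropy in natural-log units, which this notebook does not confirm. Under that assumption, `exp(loss)` is roughly the number of equally likely tokens the model is hesitating between at each step. The helper name `perplexity` is ours, not part of the assignment code.

```python
import math

def perplexity(mean_token_loss):
    """Convert an average per-token cross-entropy (in nats) into a perplexity."""
    return math.exp(mean_token_loss)

# Validation losses rounded from the training log and from the restored checkpoint above.
for loss in (5.296, 4.818, 7.004):
    print(f"loss {loss:.3f} -> perplexity {perplexity(loss):.0f}")
```

Because perplexity is exponential in the loss, a small drop in loss is a large drop in perplexity.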


## Generating an article's summary

The model we have built is meant to generate summaries for new articles that do not have one yet.
We got a [news article](https://www.chicagotribune.com/news/local/breaking/ct-met-officer-shot-20190309-story.html) from the Chicago Tribune about a police shooting, and want to use our model to produce a summary.

As you will see, our model is still limited and will most likely not produce a perfect summary. With more data and training, however, the same model would be able to produce good summaries.
The summary you produce will probably read like broken English, but it should roughly correspond to the article.

In [None]:
article_text = "A 34-year-old Chicago police officer has been shot in the shoulder during the execution of a search warrant in the Humboldt Park neighborhood, police say. The alleged shooter, a 19-year-old woman, was in custody. The shooting happened about 7:20 p.m. in the 2700 block of West Potomac Avenue, police said. The officer, part of the Grand Central District tactical unit, was taken to Stroger Hospital. While officers were serving a \"typical\" search warrant for \"narcotics and illegal weapons\" and were attempting to reach a rear door, \"a shot was fired,\" striking the tactical officer in the shoulder, said Chicago police Superintendent Eddie Johnson during a news briefing outside the hospital. He said the officer, who has about four or five years on the job, was \"stable\" but in critical condition. \"His family is here,\" Johnson said. \"He’s talking a lot and just wants the ordeal to be over.\" He said this incident serves as just another reminder of how dangerous a police officer’s job is. At the scene of the shooting, crime tape closed Potomac from Washtenaw Avenue to California Avenue and encompassed the alley west of the brick apartment building, south of Potomac. Dozens of officers stood in the alley, while even more walked up and down the street. Neighbors gathered at the edge of the yellow tape on the sidewalk along California and watched them work. Standing next to a man, a woman talked to police in the crime scene, across the street. \"We're not under arrest? We can go?\" the woman checked with officers. They told her she could go, and she and the man walked underneath the yellow tape and out of the crime scene."
# Seed text for the decoder: the generated summary continues from these tokens.
int_output = "the police"
input_length = 400
output_length = 100

# Process the capitalization with the preprocess_capitalization of the capita package.
article_text = capita.preprocess_capitalization(article_text)

# Numerize the tokens of the processed text using the loaded sentencepiece model.
numerized = sp.EncodeAsIds(article_text)
#numerized_out=sp.EncodeAsIds(int_output)
# Pad the sequence and keep the mask of the input
padded, mask = pad_sequence(numerized, pad_index, input_length)
#padded_out, mask_out = pad_sequence(numerized_out, pad_index, 100)


# Making the news article into a batch of size one, to be fed to the neural network.
encoder_input = np.array([padded])
encoder_mask = np.array([mask])
#decoder_input = np.array([padded_out])
#decoder_mask = np.array([mask_out])

with tf.Session() as sess:
    model.saver.restore(sess, model_file)

    # Seed the decoder with the token ids of the starting text.
    decoded_so_far = sp.EncodeAsIds(int_output)

    # Each step appends one token; stop once every decoder position has been used.
    for _ in range(output_length - len(decoded_so_far) + 1):
        padded_decoder_input, decoder_mask = pad_sequence(decoded_so_far, pad_index, output_length)
        padded_decoder_input = [padded_decoder_input]
        decoder_mask = [decoder_mask]
        print("========================")
        print(padded_decoder_input)
        # Use the model to find the distribution over the vocabulary for the next word
        feed = {model.source_sequence: encoder_input, model.target_sequence: padded_decoder_input,
                model.encoder_mask: encoder_mask, model.decoder_mask: decoder_mask}

        logits = sess.run(model.decoded_logits, feed_dict=feed)

        chosen_words = np.argmax(logits, axis=2)  # Take the argmax, getting the most likely token at every position
        print(logits.shape)
        # The prediction for the next token sits at the position of the last token decoded so far.
        next_position = len(decoded_so_far) - 1
        decoded_so_far.append(int(chosen_words[0, next_position]))  # We add it to the summary so far


print("The final summary:")
print("".join([vocab[i] for i in decoded_so_far]).replace("▁", " "))
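
The loop above is greedy decoding: at each step the model sees the tokens produced so far, and the highest-scoring token is appended. Below is a minimal, self-contained sketch of that idea. `toy_logits` is a hypothetical stand-in for `model.decoded_logits` (it simply prefers "previous token + 1"), and `greedy_decode` is an illustrative name, not part of the assignment code. The sketch assumes, as with the original `[0]` start token, that the logits at position `t` score the token that follows position `t`.

```python
import numpy as np

def toy_logits(decoder_input, vocab_size):
    """Hypothetical model: at position t, the best next token is decoder_input[t] + 1."""
    logits = np.zeros((1, len(decoder_input), vocab_size))
    for t, token in enumerate(decoder_input):
        logits[0, t, (token + 1) % vocab_size] = 1.0
    return logits

def greedy_decode(prefix, output_length, vocab_size, pad_index=0):
    """Extend `prefix` one token at a time until it holds `output_length` tokens."""
    decoded = list(prefix)
    while len(decoded) < output_length:
        # Pad to a fixed length, as the real model expects fixed-size inputs.
        padded = decoded + [pad_index] * (output_length - len(decoded))
        best = np.argmax(toy_logits(padded, vocab_size), axis=2)
        # Read the prediction at the position of the last real (non-padding) token.
        decoded.append(int(best[0, len(decoded) - 1]))
    return decoded

print(greedy_decode([3, 4], output_length=6, vocab_size=10))  # [3, 4, 5, 6, 7, 8]
```

Note that the index into the predictions depends on the length of the decoded prefix, not on the loop counter, which matters as soon as the decoder is seeded with more than one token.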

## Word vectors

The model we train learns a representation for each word in our vocabulary. A word representation is a vector of size **dim**.

It is common in NLP to inspect the word vectors, as some properties of language often appear in the embedding structure.


We are going to load the word embeddings learned by our model, and inspect them.
Because our network was not trained for long, we will only look for the simplest patterns. A network trained for longer learns more complex, semantic patterns.

In [None]:
# We help you load the matrix, as it is hidden within the Transformer structure.

with tf.Session() as sess:
    model.saver.restore(sess, model_file)
    E = sess.run(model.model.encoder.embedding_layer.embedding.embeddings)

print("The embedding matrix has shape:", E.shape)
print("The vocabulary has length:", len(vocab))

Pronouns serve very similar purposes, so we should expect the representations of "he" and "she" to be similar, that is, to have a high cosine similarity.

- **TODO**:  Find the cosine similarity between the vectors that represent words "she" and "he".
- **TODO**:  Find the cosine similarity between the vectors that represent words "more" and "less".

We can contrast that with the cosine similarity to a random, non-related word, like "ball", or "gorilla".
- **TODO**: Compute the cosine similarity between "she" and "ball".
- **TODO**: Compute the cosine similarity between "more" and "gorilla".



In [None]:
def cosine_sim(v1, v2):
    # Cosine similarity of 2 vectors. We divide by both norms because the word vectors might not have unit norm.
    v1 = np.array(v1)
    v2 = np.array(v2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

for w1, w2 in [("she", "he"), ("more", "less"), ("she", "ball"), ("more", "gorilla")]:
    w1_index = vocab.index('▁'+w1) # The index of the first  word in our vocabulary
    w2_index = vocab.index('▁'+w2) # The index of the second word in our vocabulary
    w1_vec = E[w1_index] # Get the embedding vector of the first  word
    w2_vec = E[w2_index] # Get the embedding vector of the second word
    
    print(w1, "vs.", w2, "similarity:", cosine_sim(w1_vec, w2_vec))
    

These effects are unfortunately small, as we have only trained the network for a few hours on a few thousand articles.
However, the same model trained for longer on more data exhibits many interesting semantic and syntactic patterns, such as:

- Word vectors with high cosine similarity usually represent words that are semantically similar (such as duck and pigeon).
- Analogies can occur. A famous case is woman - man + king ≈ queen; another is france - paris + rome ≈ italy.

- Looking at top-k similar words can help find synonyms.
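
The top-k lookup mentioned in the list above can be sketched as follows. This is an illustration on a made-up 2-dimensional embedding matrix, not on the trained `E`. `top_k_similar`, `toy_vocab` and `toy_E` are hypothetical names, and the sketch assumes no embedding row is all zeros. With the real model, you would pass `E` and `vocab` and prefix the word with `▁` as in the cell above.

```python
import numpy as np

def top_k_similar(E, vocab, word, k=3):
    """Return the k (word, cosine similarity) pairs closest to `word`, excluding itself."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)  # normalize every row to unit length
    scores = unit @ unit[vocab.index(word)]              # cosine similarity to every word
    order = np.argsort(-scores)                          # highest similarity first
    return [(vocab[i], float(scores[i])) for i in order if vocab[i] != word][:k]

# Made-up vectors: two "bird" words point one way, two "vehicle" words another.
toy_vocab = ["duck", "pigeon", "car", "truck"]
toy_E = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])

print(top_k_similar(toy_E, toy_vocab, "duck", k=2))
```

Normalizing the rows once turns every cosine similarity into a plain dot product, so the whole vocabulary is scored with a single matrix-vector product.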

To read examples of more complex patterns that appear in word embedding spaces, read [this blog](https://explosion.ai/blog/sense2vec-with-spacy). To play with a live demo and try similarities on rich word embeddings, [go here.](https://explosion.ai/demos/sense2vec)