# Introduction

In this assignment, you are going to train a neural network to summarize news articles.
Your neural network is going to learn from example, as we provide you with (article, summary) pairs.
We provide you with a **toy dataset** made of only articles about police related news.
Usual datasets can be 20x larger in size, but we have reduced it for computational purposes.

You will do this using a Transformer network, from the [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) paper.
In this assignment you will:
- Learn to process text into sub-word tokens, to avoid fixed vocabulary sizes, and UNK tokens.
- Implement the key conceptual blocks of a Transformer.
- Use a Transformer to read a news article, and produce a summary.
- Perform operations on learned word-vectors to examine what the model has learned.

**Before you start**

You should read the Attention Is All You Need paper.  
We are providing you with skeleton code for the Transformer, but you will implement 5 conceptual blocks of the transformer yourself:
- AttentionQKV: the Query, Key, Value attention mechanism at the center of the Transformer
- MultiHeadAttention: the multiple heads that enable each input to attend at many places at once.
- PositionEmbedding: the sinusoid-based position embedding of the Transformer.
- Encoder & Decoder: The encoder (that reads inputs, such as news articles), the decoder (that produces the output summary, one token at a time)
- Full Transformer: piecing it all together.

All dataset files should be placed in the `dataset/` folder of this assignment.

If you are using Google Colab, follow the instructions to mount your Google Drive onto the remote machine.

# Library imports

In [3]:
!pip install segtok
!pip install sentencepiece

Collecting segtok
  Downloading segtok-1.5.11-py3-none-any.whl (24 kB)
Installing collected packages: segtok
Successfully installed segtok-1.5.11
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m13.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


Run the first of the following two cells if you are running the homework locally, and run the second cell if you are running the homework in Colab

In [1]:
import sys

if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive')
    root_folder = "/content/drive/My Drive/cs182_hw3_anton/"
    public_folder = "/content/drive/My Drive/cs182_hw3_public/"
    dataset_folder = "/content/drive/My Drive/cs182_hw3_public/dataset/"
    print('Running on Colab')
else:
    DRIVE=False
    root_folder = ""
    public_folder = ""
    dataset_folder = "dataset/"
    print('Running on localhost')

Mounted at /content/drive
Running on Colab


In [4]:
import os
import sys
sys.path.append(root_folder)
from transformer import Transformer
import sentencepiece as spm
import torch
from torch import nn
from torch.nn import functional as F
from torch import optim
import numpy as np
import json
import capita
import os
from transformer_utils import set_device
import gc
from utils import validate_to_array, model_out_to_list

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
set_device(device)

list_to_device = lambda torch_obj: [tensor.to(device) for tensor in torch_obj]

# This cell autoreloads the notebook when you change you python file code.
# If you think the notebook did not reload, rerun this cell.
%load_ext autoreload
%autoreload 2

cpu


In [5]:
# Load the word piece model that will be used to tokenize the texts into
# word pieces with a vocabulary size of 10000
sp = spm.SentencePieceProcessor()
sp.Load(dataset_folder+"wp_vocab10000.model")

vocab = [line.split('\t')[0] for line in open(dataset_folder+"wp_vocab10000.vocab", "r")]
pad_index = vocab.index('#')

def pad_sequence(numerized, pad_index, to_length):
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    mask = [w != pad_index for w in padded]
    return padded, mask

# Building blocks of a Transformer


**TODO**:

Implement the 5 blocks of the Transformer. In order to finish this section, you should get very small error <1e-7 on each of the 5 checks in this section.


The Transformer is split into 3 files: transformer_attention.py, transformer_utils.py and transformer.py

Each section below gives you directions and a way to verify your code works properly.

You do not need to modify the rest of the code provided, but should read it to understand overall architecture.

Our Transformer is built as a Pytorch model, a standard that is good for you to get accustomed to.



## (1) Implementing the Query-Key-Value Attention (AttentionQKV)

Finish implementing the AttentionQKV class in `transformer_attention.py`. Specifically, you must implement the `forward` function of the class using the mathematical procedure described in the [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf).

In [21]:
from transformer_attention import AttentionQKV

batch_size = 2;
n_queries = 3;
n_keyval = 5;
depth_k = 2;
depth_v = 2

with open(public_folder+"transformer_checks/attention_qkv_io.json", "r") as f:
    io = json.load(f)
    queries = torch.tensor(io['queries'])
    keys = torch.tensor(io['keys'])
    values = torch.tensor(io['values'])
    expected_output  = torch.tensor(io['output'])
    expected_weights = torch.tensor(io['weights'])

attn_qkv = AttentionQKV()
output, weights = attn_qkv(queries, keys, values)

print("Total error on the output:",torch.sum(torch.abs(expected_output-output)).item(), "(should be 0.0 or close to 0.0)")
print("Total error on the weights:",torch.sum(torch.abs(expected_weights-weights)).item(), "(should be 0.0 or close to 0.0)")

Total error on the output: 3.2782554626464844e-07 (should be 0.0 or close to 0.0)
Total error on the weights: 1.816079020500183e-07 (should be 0.0 or close to 0.0)


## (2) Implementing Multi-head attention

Finish implementing the MultiHeadProjection class in `transformer_attention.py`. You must implement the `forward` call, `_split_heads`, and `_combine_heads` functions. Leverage the AttentionQKV class you already wrote.

Your input are the queries, keys, values as 3-d tensors (batch_size, sequence_length, feature_size). Split them into 4-d tensors (batch_size, n_heads, sequence_length, new_feature_size) where $feature\_size = n\_heads * new\_feature\_size$. You can then feed the split qkv to your implemented AttentionQKV, which will treat each head as an independent attention function. Then the output must be combined back into a 3-d tensor.  

You can test the validity of your implementation in the cell below.

In [23]:
from transformer_attention import MultiHeadProjection

batch_size = 2;
n_queries = 3;
n_heads = 4
n_keyval = 5;
depth_k = 8;
depth_v = 8;

with open(public_folder+"transformer_checks/multihead_io.json", "r") as f:
    io = json.load(f)
    queries = torch.tensor(io['queries'])
    keys = torch.tensor(io['keys'])
    values = torch.tensor(io['values'])
    expected_output  = torch.tensor(io['output'])

mhp = MultiHeadProjection(n_heads, (depth_k,depth_v))
multihead_output = mhp((queries, keys, values))

print("Total error on the output:",torch.sum(torch.abs(expected_output-multihead_output)).item(), "(should be 0.0 or close to 0.0)")

Total error on the output: 1.407228410243988e-06 (should be 0.0 or close to 0.0)


## (3) Position Embedding

Implement the FeedForward and PositionEmbedding classes in `transformer.py`.

You can test the validity of your implementation in the cell below.

In [24]:
# wrapping my head around how this code works...

d_model = 512

seq_pos = torch.arange(1, 10 + 1, dtype=torch.float32)
seq_pos_expanded = seq_pos[None,:,None]
index = seq_pos_expanded.repeat(*[1,1,d_model//2])
assert index.shape == (1, 10, d_model//2)

power = torch.arange(0, d_model, step=2, dtype=torch.float32)[:] / d_model
divisor = 10000 ** power
assert divisor.shape == (d_model // 2,)

assert (index / divisor).shape == index.shape
assert (index / divisor)[0,1,0] == 2*(index * divisor)[0,0,0]

In [26]:
from transformer import PositionEmbedding

batch_size = 2;
sequence_length = 3;
dim = 4;

with open(public_folder+"transformer_checks/position_embedding_io.json", "r") as f:
    io = json.load(f)
    inputs = torch.tensor(io['inputs'])
    expected_output  = torch.tensor(io['output'])

pos_emb = PositionEmbedding(dim)
(inputs,expected_output,pos_emb) = list_to_device((inputs,expected_output,pos_emb))
output_t = pos_emb(inputs)

print("Total error on the output:",torch.sum(torch.abs(expected_output-output_t)).item(), "(should be 0.0 or close to 0.0)")

Total error on the output: 2.980232238769531e-07 (should be 0.0 or close to 0.0)


## (4) Transformer Encoder / Transformer Decoder

You now have all the blocks needed to implement the Transformer.
For this part, you have to fill in 2 classes in the transformer.py file: TransformerEncoderBlock, TransformerDecoderBlock.

The code below will verify the accuracy of each block

In [27]:
from transformer import TransformerEncoderBlock

batch_size = 2
sequence_length = 5
hidden_size = 6
filter_size = 12
n_heads = 2

with open(public_folder+"transformer_checks/transformer_encoder_block_io_new.json", "r") as f:
    io = json.load(f)
    inputs = torch.tensor(io['inputs'])
    expected_output = torch.tensor(io['output'])

enc_block = TransformerEncoderBlock(input_size=6,
                                    n_heads=n_heads,
                                    filter_size=filter_size,
                                    hidden_size=hidden_size)
enc_block.load_state_dict(torch.load(public_folder+"transformer_checks/transformer_encoder_block"))

inputs, expected_output, enc_block = list_to_device((inputs,expected_output,enc_block))
output_t = enc_block(inputs)

print("Total error on the output:",torch.sum(torch.abs(expected_output-output_t)).item(), "(should be 0.0 or close to 0.0)")

Total error on the output: 4.999339580535889e-06 (should be 0.0 or close to 0.0)


In [29]:
from transformer import TransformerDecoderBlock
batch_size = 2
encoder_length = 5
decoder_length = 3
hidden_size = 6
filter_size = 12
n_heads = 2

with open(public_folder+"transformer_checks/transformer_decoder_block_io_new.json", "r") as f:
    io = json.load(f)
    decoder_inputs = torch.tensor(io['decoder_inputs'])
    encoder_output = torch.tensor(io['encoder_output'])
    expected_output = torch.tensor(io['output'])

dec_block = TransformerDecoderBlock(input_size=6,
                                    n_heads=n_heads,
                                    filter_size=filter_size,
                                    hidden_size=hidden_size)

# Added by Anton for compatibility with poor quality variable names
old_state_dict = torch.load(public_folder+"transformer_checks/transformer_decoder_block")
update = {'cross_norm_source.weight': 'cross_norm_encoder.weight',
          'cross_norm_source.bias':   'cross_norm_encoder.bias',
          'cross_norm_target.weight': 'cross_norm_decoder.weight',
          'cross_norm_target.bias':   'cross_norm_decoder.bias'}
state_dict = {update[k] if k in update.keys() else k: v for k, v in old_state_dict.items()}

dec_block.load_state_dict(state_dict)

decoder_inputs, encoder_output, expected_output, dec_block = list_to_device((decoder_inputs,encoder_output,expected_output,dec_block))
output_t = dec_block(decoder_inputs, encoder_output)

print("Total error on the output:",torch.sum(torch.abs(expected_output-output_t)).item(), "(should be 0.0 or close to 0.0)")

Total error on the output: 3.2186508178710938e-06 (should be 0.0 or close to 0.0)


## (5) Transformer

This is the final high-level function that pieces it all together.

You have to implement the call function of the Transformer class in the `transformer.py` file.

The block below verifies your implementation is correct.

In [30]:
from transformer import Transformer

batch_size = 2
vocab_size = 11
n_layers = 3
n_heads = 4
d_model = 8
d_filter = 16
input_length = 5
output_length = 3

with open(public_folder+"transformer_checks/transformer_io_new.json", "r") as f:
    io = json.load(f)
    enc_input = torch.tensor(io['enc_input'])
    dec_input = torch.tensor(io['dec_input'])
    enc_mask = torch.tensor(io['enc_mask'])
    dec_mask = torch.tensor(io['dec_mask'])
    expected_output = torch.tensor(io['output'])

transformer = Transformer(vocab_size=vocab_size, n_layers=n_layers, n_heads=n_heads, d_model=d_model, d_filter=d_filter)
old_state_dict = torch.load(public_folder+"transformer_checks/transformer")
state_dict = {k.replace('cross_norm_source', 'cross_norm_encoder').replace('cross_norm_target', 'cross_norm_decoder'): v
              for k, v in old_state_dict.items()}
transformer.load_state_dict(state_dict)

(enc_input,dec_input,enc_mask,dec_mask,expected_output,transformer) \
    = list_to_device((enc_input,dec_input,enc_mask,dec_mask,expected_output,transformer))
output_t = transformer(enc_input, target_sequence=dec_input, encoder_mask=enc_mask, decoder_mask=dec_mask)

print("Total error on the output:",torch.sum(torch.abs(expected_output-output_t)).item(), "(should be 0.0 or close to 0.0)")

Total error on the output: 5.602836608886719e-05 (should be 0.0 or close to 0.0)


# Training the model

Your objective is to train the Language on the dataset you are provided to reach a **validation loss <= 6.50**

Careful: we will be testing this loss on an unreleased test set, so make sure to evaluate properly on a validation set and not overfit.

You must save the model you want us to test under: models/final_transformer_summarization (the .index, .meta and .data files)

**Advice**:
- It should be possible to attain validation loss <= 6.50 with the model dimensions we've specified (n_layers=6, d_model=104, d_filter=416), but you can tune these hyperparameters. Increasing d_model will yield better model, at the cost of longer training time.
- You should try tuning the learning rate, as well as what optimizer you use.
- You might need to train for a few (up to 2 hours) to obtain our expected loss. Remember to tune your hyperparameters first, once you find ones that work well, let it train for longer.

**Dataset**: as in the previous notebook, make sure the dataset files are in the `dataset` folder. These can be found on the Google Drive.


In [14]:
with open(dataset_folder+"summarization_dataset_preprocessed.json", "r") as f:
    dataset = json.load(f)

# We load the dataset, and split it into 2 sub-datasets based on if they are training or validation.
# Feel free to split this dataset another way, but remember, a validation set is important, to have an idea of
# the amount of overfitting that has occurred!

d_train = [d for d in dataset if d['cut'] == 'training']
d_valid = [d for d in dataset if d['cut'] == 'evaluation']

len(d_train), len(d_valid)

(61055, 1558)

In [32]:
# An example (article, summary) pair in the training data:

print(d_train[145]['story'])
print("=======================\n=======================")
print(d_train[145]['summary'])

Tbilisi, Georgia (CNN)Police have shot and killed a white tiger that killed a man Wednesday in Tbilisi, Georgia, a Ministry of Internal Affairs representative said, after severe flooding allowed hundreds of wild animals to escape the city zoo. 
The tiger attack happened at a warehouse in the city center. The animal had been unaccounted for since the weekend floods destroyed the zoo premises.
The man killed, who was 43, worked in a company based in the warehouse, the Ministry of Internal Affairs said. Doctors said he was attacked in the throat and died before reaching the hospital. 
Experts are still searching the warehouse, the ministry said, adding that earlier reports that the tiger had injured a second man were unfounded. 
The zoo administration said Wednesday that another tiger was still missing. It was unable to confirm if the creature was dead or had escaped alive.
Georgian Prime Minister Irakli Garibashvili apologized to the public, saying he had been misinformed by the zoo's ma

Similarly to the previous assignment, we create a function to get a random batch to train on, given a dataset.

In [12]:
def build_batch(dataset, batch_size):
    """Create a random minibatch from the dataset"""

    indices = list(np.random.randint(0, len(dataset), size=batch_size))

    batch = [dataset[i] for i in indices]
    batch_input = np.array([a['input'] for a in batch])
    batch_input_mask = np.array([a['input_mask'] for a in batch])
    batch_output = np.array([a['output'] for a in batch])
    batch_output_mask = np.array([a['output_mask'] for a in batch])

    return batch_input, batch_input_mask, batch_output, batch_output_mask


def get_minibatch(dataset, batch_size):
    """Return minibatch in dictionary format, loaded to device"""

    batch = build_batch(dataset, batch_size)
    # Build the dict connecting placeholders and mini-batch
    batch_input, batch_input_mask, batch_output, batch_output_mask = [torch.tensor(tensor) for tensor in batch]
    batch_input, batch_input_mask, batch_output, batch_output_mask \
                = list_to_device([batch_input, batch_input_mask, batch_output, batch_output_mask])
    batch = {'source_sequence': batch_input, 'target_sequence': batch_output,
            'encoder_mask': batch_input_mask, 'decoder_mask': batch_output_mask}
    return batch

We now instantiate the Transformer with our sets of hyperparameters specific to the task of summarization.
In summarization, we are going to go from documents with up to 400 words, to documents with up to 100 words.
The vocabulary size is set for you, and is of 10,000 words (we are using WordPieces, [here is a paper about subword encoding](http://aclweb.org/anthology/P18-1007), if you are interested).

In [19]:
# Use this trainer to train a Transformer model

class TransformerTrainer(nn.Module):
    def __init__(self, model, optimizer):
        super().__init__()

        self.model = model
        self.optimizer = optimizer

        # Summarization loss
        criterion = nn.CrossEntropyLoss(reduce='none')
        self.loss_fn = lambda pred, target, mask: (criterion(pred.permute(0,2,1), target)*mask).sum()/mask.sum()


    def forward(self, batch, optimize=True):
        pred_logits = self.model(**batch)
        target, mask = batch['target_sequence'], batch['decoder_mask']
        loss = self.loss_fn(pred_logits, target, mask)
        accuracy = (torch.eq(pred_logits.argmax(dim=2, keepdim=False), target).float()*mask).sum()/mask.sum()

        if optimize:
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

        return loss, accuracy

In [36]:
# Dataset related parameters
vocab_size = len(vocab)
ilength = 400 # Length of the article
olength  = 100 # Length of the summaries

# Model related parameters
n_layers = 6
d_model  = 128
d_filter = 4*d_model  # feedforward projection space size
batch_size = 32
dropout = 0.1
model = Transformer(vocab_size=vocab_size, d_model=d_model, n_layers=n_layers, d_filter=d_filter, dropout=dropout)

# trainer parameters
lr = 1e-3
lr_decay = 0.9
epochs = 10
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=lr_decay)
trainer = TransformerTrainer(model, optimizer)

model_id = 'test1'
os.makedirs(root_folder+'models/part2/', exist_ok=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
set_device(device)

cuda




In [45]:
# Run a quick random parameter search to see what combination gives best initial training results.
# Then run for real in the next cell.

from tqdm import tqdm

gc.collect()
trainer.model.to(device)
trainer.model.train()

losses, accuracies = [], []
val_losses, val_accuracies = [], []

epochs = 1

max_count = 20
for count in range(max_count):
    lr = 10**np.random.uniform(-4, -1)
    dropout = np.random.uniform(0, 0.5)
    print(f"==================")
    print(f"Experiment {count+1} of {max_count}")
    print(f"lr={lr:e}, dropout={dropout:.4f}")

    model = Transformer(vocab_size=vocab_size, d_model=d_model, n_layers=n_layers, d_filter=d_filter, dropout=dropout)
    model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=lr_decay)
    trainer = TransformerTrainer(model, optimizer)

    for epoch in range(epochs):

        # for each epoch, get a random permutation of minibatches
        # t = tqdm(range(0,(len(d_train)//batch_size)+1))
        t = tqdm(range(0,101))

        for i in t:
            # Create a random mini-batch from the training dataset
            batch = get_minibatch(d_train, batch_size)

            # Obtain the training loss
            train_loss, accuracy = trainer(batch)
            losses.append(train_loss.item())
            accuracies.append(accuracy.item())

            if i % 10 == 0:
                t.set_description(f"Epoch {epoch}/{epochs}. Iteration: {i} Loss: {np.mean(losses[-10:])} Accuracy: {np.mean(accuracies[-10:])}")

            epoch_end = (i == len(t)-1)
            if epoch_end:
                for g in optimizer.param_groups:
                    g['lr'] *= lr_decay
                pass

            # at first iteration and after each epoch, output accuracies and losses on validation set.
            first_it = (epoch == 0 and i == 0)

            if first_it or epoch_end:
                # create validation minibatch (full size of val set)
                val_batch = get_minibatch(d_valid, batch_size)
                with torch.no_grad():
                    val_loss, val_acc = trainer(val_batch, optimize=False)
                    val_losses.append(val_loss)
                    val_accuracies.append(val_acc)

                    print(f'(Epoch {epoch+1}/{epochs}) val loss: {val_loss}; val acc: {val_acc}')


Experiment 1 of 20
lr=5.592691e-04, dropout=0.3569


Epoch 0/1. Iteration: 0 Loss: 86.89588928222656 Accuracy: 0.0009737098589539528:   2%|▏         | 2/101 [00:00<00:19,  5.02it/s]

(Epoch 1/1) val loss: 105.54676818847656; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 36.056027221679685 Accuracy: 0.09842775315046311: 100%|██████████| 101/101 [00:14<00:00,  6.87it/s]


(Epoch 1/1) val loss: 49.17786407470703; val acc: 0.10789679735898972
Experiment 2 of 20
lr=1.739770e-03, dropout=0.3692


Epoch 0/1. Iteration: 0 Loss: 39.2494571685791 Accuracy: 0.089481782913208:   2%|▏         | 2/101 [00:00<00:16,  5.94it/s]

(Epoch 1/1) val loss: 84.2785873413086; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 33.64618854522705 Accuracy: 0.11994030736386777: 100%|██████████| 101/101 [00:14<00:00,  6.86it/s]


(Epoch 1/1) val loss: 39.16978454589844; val acc: 0.13327205181121826
Experiment 3 of 20
lr=6.454894e-02, dropout=0.3167


Epoch 0/1. Iteration: 0 Loss: 37.711001777648924 Accuracy: 0.10545228980481625:   2%|▏         | 2/101 [00:00<00:17,  5.55it/s]

(Epoch 1/1) val loss: 798.796875; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 8.686459922790528 Accuracy: 0.0: 100%|██████████| 101/101 [00:14<00:00,  6.91it/s]


(Epoch 1/1) val loss: 9.0984525680542; val acc: 0.0
Experiment 4 of 20
lr=5.127627e-04, dropout=0.4793


Epoch 0/1. Iteration: 0 Loss: 14.481757164001465 Accuracy: 0.0:   2%|▏         | 2/101 [00:00<00:17,  5.68it/s]

(Epoch 1/1) val loss: 107.97550201416016; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 41.833504104614256 Accuracy: 0.0889276035130024: 100%|██████████| 101/101 [00:14<00:00,  6.84it/s]


(Epoch 1/1) val loss: 59.48481369018555; val acc: 0.07919366657733917
Experiment 5 of 20
lr=2.142013e-04, dropout=0.4361


Epoch 0/1. Iteration: 0 Loss: 45.015542602539064 Accuracy: 0.07953533828258515:   2%|▏         | 2/101 [00:00<00:16,  6.05it/s]

(Epoch 1/1) val loss: 114.12635803222656; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 42.688822174072264 Accuracy: 0.08371297493577004: 100%|██████████| 101/101 [00:14<00:00,  6.93it/s]


(Epoch 1/1) val loss: 50.89136505126953; val acc: 0.10192837566137314
Experiment 6 of 20
lr=4.438441e-04, dropout=0.3805


Epoch 0/1. Iteration: 0 Loss: 43.546159744262695 Accuracy: 0.07546012178063392:   2%|▏         | 2/101 [00:00<00:17,  5.53it/s]

(Epoch 1/1) val loss: 111.14080047607422; val acc: 0.0014054813655093312


Epoch 0/1. Iteration: 100 Loss: 34.088938522338864 Accuracy: 0.0861500084400177: 100%|██████████| 101/101 [00:14<00:00,  6.95it/s]


(Epoch 1/1) val loss: 50.48798370361328; val acc: 0.11486486345529556
Experiment 7 of 20
lr=3.533240e-02, dropout=0.1823


Epoch 0/1. Iteration: 0 Loss: 38.63574752807617 Accuracy: 0.0787299171090126:   2%|▏         | 2/101 [00:00<00:17,  5.52it/s]

(Epoch 1/1) val loss: 251.09066772460938; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 7.287444829940796 Accuracy: 0.0: 100%|██████████| 101/101 [00:14<00:00,  6.93it/s]


(Epoch 1/1) val loss: 8.34176254272461; val acc: 0.0
Experiment 8 of 20
lr=3.541083e-04, dropout=0.4036


Epoch 0/1. Iteration: 0 Loss: 15.105563163757324 Accuracy: 0.0:   2%|▏         | 2/101 [00:00<00:17,  5.55it/s]

(Epoch 1/1) val loss: 101.83775329589844; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 36.81281719207764 Accuracy: 0.08202475160360337: 100%|██████████| 101/101 [00:14<00:00,  6.89it/s]


(Epoch 1/1) val loss: 48.356014251708984; val acc: 0.0839826837182045
Experiment 9 of 20
lr=3.641326e-04, dropout=0.3814


Epoch 0/1. Iteration: 0 Loss: 40.60747890472412 Accuracy: 0.07394704222679138:   2%|▏         | 2/101 [00:00<00:18,  5.38it/s]

(Epoch 1/1) val loss: 88.93000793457031; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 37.93932571411133 Accuracy: 0.0912585698068142: 100%|██████████| 101/101 [00:14<00:00,  6.85it/s]


(Epoch 1/1) val loss: 54.93415069580078; val acc: 0.10078627616167068
Experiment 10 of 20
lr=2.419360e-03, dropout=0.0313


Epoch 0/1. Iteration: 0 Loss: 40.984622955322266 Accuracy: 0.0823413722217083:   2%|▏         | 2/101 [00:00<00:18,  5.47it/s]

(Epoch 1/1) val loss: 81.68582916259766; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 30.572784996032716 Accuracy: 0.10260354951024056: 100%|██████████| 101/101 [00:14<00:00,  6.89it/s]


(Epoch 1/1) val loss: 38.22150802612305; val acc: 0.11024844646453857
Experiment 11 of 20
lr=2.415331e-03, dropout=0.1556


Epoch 0/1. Iteration: 0 Loss: 35.36472911834717 Accuracy: 0.09086560606956481:   2%|▏         | 2/101 [00:00<00:16,  6.00it/s]

(Epoch 1/1) val loss: 72.52505493164062; val acc: 0.08579088747501373


Epoch 0/1. Iteration: 100 Loss: 28.877438354492188 Accuracy: 0.11458127573132515: 100%|██████████| 101/101 [00:14<00:00,  6.97it/s]


(Epoch 1/1) val loss: 37.176937103271484; val acc: 0.13116057217121124
Experiment 12 of 20
lr=1.105175e-04, dropout=0.0089


Epoch 0/1. Iteration: 0 Loss: 32.804124450683595 Accuracy: 0.10170169398188592:   1%|          | 1/101 [00:00<00:20,  4.91it/s]

(Epoch 1/1) val loss: 110.89585876464844; val acc: 0.0021834061481058598


Epoch 0/1. Iteration: 100 Loss: 34.47009105682373 Accuracy: 0.09987314566969871: 100%|██████████| 101/101 [00:15<00:00,  6.66it/s]


(Epoch 1/1) val loss: 44.30207443237305; val acc: 0.12128514051437378
Experiment 13 of 20
lr=6.307942e-02, dropout=0.2538


Epoch 0/1. Iteration: 0 Loss: 37.51595211029053 Accuracy: 0.0894819475710392:   2%|▏         | 2/101 [00:00<00:16,  5.96it/s]

(Epoch 1/1) val loss: 128.0049285888672; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 6.8480613231658936 Accuracy: 0.0: 100%|██████████| 101/101 [00:14<00:00,  6.97it/s]


(Epoch 1/1) val loss: 7.759701728820801; val acc: 0.0
Experiment 14 of 20
lr=1.724665e-03, dropout=0.4788


Epoch 0/1. Iteration: 0 Loss: 15.598576259613036 Accuracy: 0.0:   2%|▏         | 2/101 [00:00<00:16,  6.01it/s]

(Epoch 1/1) val loss: 132.7911376953125; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 38.44895439147949 Accuracy: 0.1129253163933754: 100%|██████████| 101/101 [00:14<00:00,  6.94it/s]


(Epoch 1/1) val loss: 50.02727127075195; val acc: 0.10215053707361221
Experiment 15 of 20
lr=4.403611e-04, dropout=0.4331


Epoch 0/1. Iteration: 0 Loss: 42.421321868896484 Accuracy: 0.10214895457029342:   2%|▏         | 2/101 [00:00<00:16,  6.06it/s]

(Epoch 1/1) val loss: 98.53439331054688; val acc: 0.002481389557942748


Epoch 0/1. Iteration: 100 Loss: 37.89188365936279 Accuracy: 0.08862417712807655: 100%|██████████| 101/101 [00:14<00:00,  6.94it/s]


(Epoch 1/1) val loss: 52.81816482543945; val acc: 0.07711038738489151
Experiment 16 of 20
lr=9.092520e-04, dropout=0.3273


Epoch 0/1. Iteration: 0 Loss: 39.69145832061768 Accuracy: 0.08142648041248321:   2%|▏         | 2/101 [00:00<00:16,  5.98it/s]

(Epoch 1/1) val loss: 92.34660339355469; val acc: 0.0032025619875639677


Epoch 0/1. Iteration: 100 Loss: 20.723396492004394 Accuracy: 0.10310936532914639: 100%|██████████| 101/101 [00:14<00:00,  6.82it/s]


(Epoch 1/1) val loss: 26.574037551879883; val acc: 0.12855006754398346
Experiment 17 of 20
lr=7.317861e-02, dropout=0.2738


Epoch 0/1. Iteration: 0 Loss: 25.150929260253907 Accuracy: 0.09215698428452015:   2%|▏         | 2/101 [00:00<00:17,  5.50it/s]

(Epoch 1/1) val loss: 610.203857421875; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 8.212886762619018 Accuracy: 0.12136920839548111: 100%|██████████| 101/101 [00:14<00:00,  6.85it/s]


(Epoch 1/1) val loss: 8.603231430053711; val acc: 0.12997746467590332
Experiment 18 of 20
lr=6.720953e-03, dropout=0.4224


Epoch 0/1. Iteration: 0 Loss: 13.247795724868775 Accuracy: 0.10858570337295533:   2%|▏         | 2/101 [00:00<00:16,  5.92it/s]

(Epoch 1/1) val loss: 119.03179931640625; val acc: 0.0036231884732842445


Epoch 0/1. Iteration: 100 Loss: 22.4865083694458 Accuracy: 0.05849347971379757: 100%|██████████| 101/101 [00:14<00:00,  6.91it/s]


(Epoch 1/1) val loss: 30.58523941040039; val acc: 0.034866467118263245
Experiment 19 of 20
lr=2.746106e-02, dropout=0.2538


Epoch 0/1. Iteration: 0 Loss: 27.26737251281738 Accuracy: 0.05110481567680836:   2%|▏         | 2/101 [00:00<00:16,  6.03it/s]

(Epoch 1/1) val loss: 99.73124694824219; val acc: 0.02994483895599842


Epoch 0/1. Iteration: 100 Loss: 6.147905731201172 Accuracy: 0.0: 100%|██████████| 101/101 [00:14<00:00,  6.93it/s]


(Epoch 1/1) val loss: 7.604753494262695; val acc: 0.0
Experiment 20 of 20
lr=6.405336e-04, dropout=0.3865


Epoch 0/1. Iteration: 0 Loss: 13.785102367401123 Accuracy: 0.0:   2%|▏         | 2/101 [00:00<00:16,  6.03it/s]

(Epoch 1/1) val loss: 91.82673645019531; val acc: 0.0


Epoch 0/1. Iteration: 100 Loss: 35.06514263153076 Accuracy: 0.09724149964749813: 100%|██████████| 101/101 [00:14<00:00,  6.91it/s]

(Epoch 1/1) val loss: 40.146663665771484; val acc: 0.09090909361839294





In [67]:
# full train cycle with best parameters.
d_model = 256
d_filter = 4*d_model

lr = 5e-4
dropout = 0.25
epochs = 2
batch_size = 64

gc.collect()

losses, accuracies = [], []
val_losses, val_accuracies = [], []

model = Transformer(vocab_size=vocab_size, d_model=d_model, n_layers=n_layers, d_filter=d_filter, dropout=dropout)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=lr_decay)
trainer = TransformerTrainer(model, optimizer)
trainer.model.train()

for epoch in range(epochs):

    # for each epoch, get a random permutation of minibatches
    t = tqdm(range(0,(len(d_train)//batch_size)+1))

    for i in t:
        # Create a random mini-batch from the training dataset
        batch = get_minibatch(d_train, batch_size)

        # Obtain the training loss
        train_loss, accuracy = trainer(batch)
        losses.append(train_loss.item())
        accuracies.append(accuracy.item())

        if i % 10 == 0:
            t.set_description(f"Epoch {epoch+1}/{epochs}. Iteration: {i} Loss: {np.mean(losses[-10:]):.4f} Accuracy: {np.mean(accuracies[-10:]):.4f}")

        # at first iteration and after each epoch, output accuracies and losses on validation set.
        first_it = (epoch == 0 and i == 0)
        epoch_end = (i == len(t)-1)

        if first_it or epoch_end:
            # create validation minibatch
            val_batch = get_minibatch(d_valid, 8*batch_size)
            with torch.no_grad():
                trainer.model.eval()
                val_loss, val_acc = trainer(val_batch, optimize=False)
                trainer.model.train()
                val_losses.append(val_loss)
                val_accuracies.append(val_acc)

                print(f'Epoch {epoch+1}/{epochs}. val loss: {val_loss:.4f}; val acc: {val_acc:.4f}')
    scheduler.step()

Epoch 1/2. Iteration: 0 Loss: 142.9693 Accuracy: 0.0000:   0%|          | 1/954 [00:02<46:36,  2.93s/it]

Epoch 1/2. val loss: 159.0312; val acc: 0.0002


Epoch 1/2. Iteration: 950 Loss: 4.2664 Accuracy: 0.2244: 100%|██████████| 954/954 [11:22<00:00,  1.40it/s]


Epoch 1/2. val loss: 5.0128; val acc: 0.2640


Epoch 2/2. Iteration: 950 Loss: 3.9047 Accuracy: 0.2600: 100%|██████████| 954/954 [11:19<00:00,  1.40it/s]

Epoch 2/2. val loss: 4.6912; val acc: 0.2872





In [68]:
model_dict = dict(
                kwargs = dict(
                    vocab_size=vocab_size,
                    d_model=d_model,
                    n_layers=n_layers,
                    d_filter=d_filter,
                    dropout=dropout
                ),
                model_state_dict = trainer.model.state_dict(),
                notes = ""
            )
torch.save(model_dict, root_folder+f'models/part2/model_{model_id}.pt')

# Using the Summarization model

On Google Colab, switch the runtime from GPU to CPU and rerun all the above cells that load Python modules, load the dataset, etc.

Now that you have trained a Transformer to perform Summarization, we will use the model on news articles from the wild.

The three subsections below explore what the model has learned.

## The validation loss

Measure the validation loss of your model. This part could be used, as in our previous notebook, in deciding what is a likely, vs. unlikely summary for an article.

We will use the code here with the unreleased test-set to evaluate your model.

In [20]:
model_id = "test1"
save_dict = torch.load(root_folder+'models/part2/'+f"model_{model_id}.pt", map_location='cpu')
model = Transformer(**save_dict['kwargs'])
model.load_state_dict(save_dict['model_state_dict'])
set_device('cpu')
model.eval()
trainer = TransformerTrainer(model=model, optimizer=None)



In [21]:
from tqdm import tqdm

gc.collect()
losses = []

for i in tqdm(range(100)):
    batch = build_batch(d_valid, 1)
    # Build the feed-dict connecting placeholders and mini-batch
    batch_input, batch_input_mask, batch_output, batch_output_mask = [torch.tensor(tensor) for tensor in batch]
    batch = {'source_sequence': batch_input, 'target_sequence': batch_output,
            'encoder_mask': batch_input_mask, 'decoder_mask': batch_output_mask}
    valid_loss, accuracy = trainer(batch, optimize=False)
    losses.append(float(valid_loss.cpu().item()))
print("Validation loss:", np.mean(losses))

100%|██████████| 100/100 [00:28<00:00,  3.49it/s]

Validation loss: 4.433573721051216





## Generating an article's summary

This model we have built is meant to be used to generate summaries for new articles we do not have summaries for.
We got a [news article](https://www.chicagotribune.com/news/local/breaking/ct-met-officer-shot-20190309-story.html) from the Chicago Tribune about a police shooting, and want to use our model to produce a summary.

As you will see, our model is still limited in its ability, and will most likely not produce an interpretible summary, however, with more data and training, this model would be able to produce good summaries.

In [22]:
article_text = "A 34-year-old Chicago police officer has been shot in the shoulder during the execution of a search warrant in the Humboldt Park neighborhood, police say. The alleged shooter, a 19-year-old woman, was in custody. The shooting happened about 7:20 p.m. in the 2700 block of West Potomac Avenue, police said. The officer, part of the Grand Central District tactical unit, was taken to Stroger Hospital. While officers were serving a \"typical\" search warrant for \"narcotics and illegal weapons\" and were attempting to reach a rear door, \"a shot was fired,\" striking the tactical officer in the shoulder, said Chicago police Superintendent Eddie Johnson during a news briefing outside the hospital. He said the officer, who has about four or five years on the job, was \"stable\" but in critical condition. \"His family is here,\" Johnson said. \"He’s talking a lot and just wants the ordeal to be over.\" He said this incident serves as just another reminder of how dangerous a police officer’s job is. At the scene of the shooting, crime tape closed Potomac from Washtenaw Avenue to California Avenue and encompassed the alley west of the brick apartment building, south of Potomac. Dozens of officers stood in the alley, while even more walked up and down the street. Neighbors gathered at the edge of the yellow tape on the sidewalk along California and watched them work. Standing next to a man, a woman talked to police in the crime scene, across the street. \"We're not under arrest? We can go?\" the woman checked with officers. They told her she could go, and she and the man walked underneath the yellow tape and out of the crime scene."
input_length = 400
output_length = 100

# Process the capitalization with the preprocess_capitalization of the capita package.
article_text = capita.preprocess_capitalization(article_text)

# Numerize the tokens of the processed text using the loaded sentencepiece model.
numerized = sp.EncodeAsIds(article_text)

# Pad the sequence and keep the mask of the input
padded, mask = pad_sequence(numerized, pad_index, input_length)

# Making the news article into a batch of size one, to be fed to the neural network.
encoder_input = np.array([padded])
encoder_mask = np.array([mask])

decoded_so_far = [0]

for j in range(output_length):
    padded_decoder_input, decoder_mask = pad_sequence(decoded_so_far, pad_index, output_length)
    padded_decoder_input = [padded_decoder_input]
    decoder_mask = [decoder_mask]
    print("========================")
    print(padded_decoder_input)
    # Use the model to find the distrbution over the vocabulary for the next word
    batch = (encoder_input, encoder_mask, padded_decoder_input, decoder_mask)
    batch_input, batch_input_mask, batch_output, batch_output_mask = [torch.tensor(tensor) for tensor in batch]
    batch = {'source_sequence': batch_input,      'target_sequence': batch_output,
             'encoder_mask':    batch_input_mask, 'decoder_mask':    batch_output_mask}
    logits = trainer.model(**batch).cpu().detach().numpy()

    chosen_words = np.argmax(logits, axis=2) # Take the argmax, getting the most likely next word
    decoded_so_far.append(int(chosen_words[0, j])) # We add it to the summary so far


print("The final summary:")
print("".join([vocab[i] for i in decoded_so_far]).replace("▁", " "))

[[0, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998]]
[[0, 3, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 9998, 99

## Word vectors

The model we train learns word representations for each word in our vocabulary. A word represention is a vector of **dim** size.

It is common in NLP to inspect the word vectors, as some properties of language often appear in the embedding structure.


We are going to load the word embeddings learned by our model, and inspect it.
Because our network was not trained for long, we are going for the simplest patterns, but if we let the network train longer, it learns more complex, semantic patterns.

In [23]:
# We help you load the matrix, as it is hidden within the Transformer structure.
E = trainer.model.encoder.embedding_layer.embedding.weight.cpu().detach().numpy()

print("The embedding matrix has shape:", E.shape)
print("The vocabulary has length:", len(vocab))

The embedding matrix has shape: (10000, 256)
The vocabulary has length: 10000


Pronouns serve very similar purposes, therefore we should expect the representation of "he" and "she" to be similar, and have cosine similarity.

- **TODO**:  Find the cosine similarity between the vectors that represent words "she" and "he".
- **TODO**:  Find the cosine similarity between the vectors that represent words "more" and "less".

We can contrast that with the cosine similarity to a random, non-related word, like "ball", or "gorilla".
- **TODO**: Compute the cosine similarity between "she" and "ball".
- **TODO**: Compute the cosine similarity between "more" and "protest".



In [47]:
def cosine_sim(v1, v2):
    # return F.cosine_similarity(torch.Tensor([v1]), torch.Tensor([v2]))
    return np.dot(v1, v2) / (np.sqrt(np.sum(v1**2)) * np.sqrt(np.sum(v2**2)))

for w1, w2 in [("gun", "knife"), ("she", "he"), ("more", "less"), ("she", "ball"), ("more", "gorilla")]:
    w1_index = vocab.index('▁'+w1) # The index of the first  word in our vocabulary
    w2_index = vocab.index('▁'+w2) # The index of the second word in our vocabulary
    w1_vec = E[w1_index] # Get the embedding vector of the first  word
    w2_vec = E[w2_index] # Get the embedding vector of the second word

    print(w1," vs. ", w2, "similarity:",cosine_sim(w1_vec, w2_vec))

gun  vs.  knife similarity: -0.0046270206
she  vs.  he similarity: 0.021105306
more  vs.  less similarity: -0.104122266
she  vs.  ball similarity: -0.11319418
more  vs.  gorilla similarity: -0.04415771


These effects are unfortunately small, as we have only trained the network on a few hours on a few thousand articles.
However, the same model trained for longer on more data exhibits many interesting semantic and syntactic patterns, such as:

- Words vectors with high cosine similarity usually represent words that have semantic similarity (such as duck and pigeon)
- Analogies can occur, a famous case is that of: woman - man + king ≈ queen. Or france - paris + rome ≈ italy.

- Looking at top-k similar words can help find synonyms.

To read examples of more complex patterns that appear in word embedding spaces, read [this blog](https://explosion.ai/blog/sense2vec-with-spacy). To play with a live demo and try similarities on rich word embeddings, [go here.](https://explosion.ai/demos/sense2vec)