# Text Summarisation with Transformer from Scratch

## 1.0. Introduction

In today's digital age, news flows in an endless stream from various sources. We have great amount of news articles everyday. But, there are a small amount of useful information in the articles and it is hard to extract useful information manually. As a result, there are lots of news articles but, it is hard to read all of articles and find informative news manually. One of solutions on this problem is to summarize texts in the article.

<p align='center'>
    <img src="https://blog.fpt-software.com/hs-fs/hubfs/image-8.png?width=376&name=image-8.png" alt="Text Summarisation Visual" />
</p>

### 1.1. Problem Statement
Text summarisation automatically gives the reader a summary containing important sentences and relevant information about an article. This is highly useful because it shortens the time needed to capture the meaning and main events of an article. Broadly, there are 2 ways of performing text summarisation - abstractive and extractive. 

**Abstractive.** Abstractive methods analyse input texts and generate new texts that capture the essence of the original text. If trained correctly, they convey the same meaning as the original text, yet are more concise.

**Extractive.** Extractive methods, on the other, take out the important texts from the original text and joins them to form a summary. Hence, they do not generate any new texts.

In this assignment, we'll use the abstractive method to solve the following problem - **given a news article, can we return a succinct summary of the article?**

### 1.2. Abstractive Text Summarisation
Abstractive text summarisation can be achieved with transformer models. In the notebook titled 'text-summarisation-abs-pretrained-training-N.ipynb', we had used a pretrained T5 transformer model to abstract news summaries from news articles. Here, I will be building a transformer model from scratch and I will attempt to see how this transformer model can generate news summaries. 

The architecture of this model is going to reference the transformer architecture described in "Attention is all you need" [[PDF](https://arxiv.org/pdf/1706.03762.pdf)]. This transformer will have an encoder and decoder.

Similar to the 'text-summarisation-abs-pretrained-training-N.ipynb', **we'll be using XSum.** 

**Why do we use the XSum dataset?** XSum stands for 'Extreme Summarisation' and it is a dataset for evaluating single-document summarisation systems. Each article summary follows the question of 'What is the article about?'. It comprises of 226,711 news articles accompanied with one-sentence summary, and they are collected from BBC (from 2010 to 2017) which cover a wide variety of genres such as general news, politics, sports, weather, business, technology, science, health, family, education, entertainment and arts. With a wide span of genre, it is the ideal dataset to use for our pre-trained models fine tuning exercise.

### 1.3. Environment
AWS EC2 Instance - Deep Learning AMI GPU TensorFlow 2.7.3 (Ubuntu 20.04). Instance type: c5.2xlarge

(Learn how to set up your deep learning workstation with AWS [here](https://medium.com/@bobbycxy/detailed-guide-to-connect-ec2-with-vscode-2c084c265e36?source=your_stories_page))


***

## 2.0. Data Preprocessing
### 2.1. Import Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from collections import defaultdict
import string
import tensorflow as tf
import re
import os 
import time
from tensorflow import keras
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

### 2.2. Prepare Key Parameters

In [2]:
ENCODER_LEN = 500
DECODER_LEN = 100
BATCH_SIZE = 64
BUFFER_SIZE = BATCH_SIZE*8

### 2.3. Import Data

As mentioned in Section 1.1., we will use the XSum dataset. This dataset is available with huggingface's datasets.

In [3]:
from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")

  from .autonotebook import tqdm as notebook_tqdm


In [4]:
raw_datasets = raw_datasets.train_test_split(
    train_size=0.1, test_size=0.1
)

In [5]:
train_df = pd.DataFrame(raw_datasets['train'])
train_df['train test'] = 'train'
# test_df = pd.DataFrame(raw_datasets['test'])
# test_df['train test'] = 'test'

# df = pd.concat([train_df, test_df], axis = 0).reset_index(drop = True)
df = pd.concat([train_df], axis = 0).reset_index(drop = True)

df



Unnamed: 0,document,summary,id,train test
0,13 September 2013 Last updated at 08:33 BST\nO...,The glass ceiling is still a problem for women...,24075638,train
1,The character is seen dying as a result of an ...,Marvel has killed off The Hulk's human alter e...,36781990,train
2,"The vaccine, which has been developed in India...",Health officials say they will begin the roll ...,11813238,train
3,He said he had sent the US evidence of Fethull...,"The Turkish Prime Minister, Binali Yildirim, h...",36833972,train
4,The Multiple Sclerosis Society's report sugges...,People with the most common form of multiple s...,34392429,train
...,...,...,...,...
20399,They are the postal voters - and he is one of ...,"""For 20% of my patch, the election is over,"" s...",32405988,train
20400,Her lawyer successfully argued that she might ...,A teenager who hurled abuse at child murderers...,36439890,train
20401,It happened as the man cycled on the B800 Kirk...,A 78-year-old cyclist has died after being inv...,26219323,train
20402,Mohammad Reza Mahdavi Kani was considered to b...,"The head of the Assembly of Experts, the body ...",29685856,train


In [6]:
len(df.loc[0,'document'].split())

59

### 2.4. Prepare the Training Data

In [7]:
article = df['document']
summary = df['summary']
article = article.apply(lambda x: '<SOS> ' + x + ' <EOS>')
summary = summary.apply(lambda x: '<SOS> ' + x + ' <EOS>')

## 3. Data Preprocessing

Before we train our model, we need to pre-process our inputs.

In [8]:
def preprocess(text):
    text = re.sub(r"&.[1-9]+;"," ",text)
    return text
article = article.apply(lambda x: preprocess(x))
summary = summary.apply(lambda x: preprocess(x))

In [9]:
filters = '!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n'
oov_token = '<unk>'
article_tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token=oov_token)
summary_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters=filters, oov_token=oov_token)
article_tokenizer.fit_on_texts(article)
summary_tokenizer.fit_on_texts(summary)
inputs = article_tokenizer.texts_to_sequences(article)
targets = summary_tokenizer.texts_to_sequences(summary)

In [10]:
ENCODER_VOCAB = len(article_tokenizer.word_index) + 1
DECODER_VOCAB = len(summary_tokenizer.word_index) + 1
print(ENCODER_VOCAB, DECODER_VOCAB)

112523 27350


In [11]:
inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs, maxlen=ENCODER_LEN, padding='post', truncating='post')
targets = tf.keras.preprocessing.sequence.pad_sequences(targets, maxlen=DECODER_LEN, padding='post', truncating='post')
inputs = tf.cast(inputs, dtype=tf.int64)
targets = tf.cast(targets, dtype=tf.int64)

2023-11-01 02:22:25.946720: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected


In [12]:
dataset = tf.data.Dataset.from_tensor_slices((inputs, targets)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)


In [13]:
len(inputs[0].numpy())

500

## 4. Building the Transformer Architecture

### 4.1. Positional Encoding
Training sequential models like RNNs meant that our inputs are fed into the model in an order / sequence. However, that is very time consuming and compute expensive. When training a transformer network, we leave the model to learn their positions not as integers, or a range of 0 to 1. Instead, we let them learn positional vector embeddings.

The values of the sine and cosine equations are small enough that when you add the positional encoding to a word embedding, the word embedding is not significantly distorted. The sum of the positional encoding and word embedding is then fed into the model. This then allows the Transformer network to attend to the relative positions of your input data. 

It it with these functions that we insert them into the positional encoding function - positional_encoding().

In [14]:
def get_angles(position, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return position * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(
        np.arange(position)[:, np.newaxis],
        np.arange(d_model)[np.newaxis, :],
        d_model
    )

    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

### 4.2. Masking
There are 2 types of masks that are useful when building a Transformer Network - the padding mask and look-ahead mask. Both help the softmax computation give the appropriate weights to the words in your input sentence.

#### 4.2.1. Padding Mask
The padding mask is meant to mask out the zeros by setting them close to negative infinity. If we don’t mask it, then the zeros will affect the softmax calculation. For example [1,2,3,0,0] → [1,2,3,-1e9,-1e9]. This way, the zeros don’t affect the score.

#### 4.2.2. Look-Ahead Mask
In training, we will have access to the complete correct output for the training example. The look-ahead mask helps the model pretend it had correctly predicted a part of the output and see it, without looking ahead, it can correctly predict the next output. 

For example, if the expected correct output is [1,2,3] and we wanted to see if given the model correctly predicted the first value it could predict the second value, then we’d mask out the second and the third values. As such, we would input the masked sequence [1,-1e9,-1e9] and see if it could generate [1,2,-1e9].

In [15]:
def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, tf.newaxis, :]

def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask


### 4.3. Self-Attention via Scaled Dot Product Attention
Self attention allows for the parallization in operations which speeds up training. We can implement it with the scaled dot product attention which takes in the quey and key and value. 

In [16]:
def scaled_dot_product_attention(q, k, v, mask):
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        scaled_attention_logits += (mask * -1e9)  

    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

    output = tf.matmul(attention_weights, v)
    return output, attention_weights

### 4.4. Multi-head Attention

<p align='center'>
    <img src="https://miro.medium.com/v2/resize:fit:1400/1*UxtH2qdJAmPP0F6dxiawUg.png" />
</p>

Multi-head attention consists of four parts:
*    Linear layers and split into heads.
*    Scaled dot-product attention.
*    Concatenation of heads.
*    Final linear layer.

Each multi-head attention block gets three inputs; Q (query), K (key), V (value). These are put through linear (Dense) layers and split up into multiple heads. 

The scaled_dot_product_attention defined above is applied to each head (broadcasted for efficiency). An appropriate mask must be used in the attention step.  The attention output for each head is then concatenated (using tf.transpose, and tf.reshape) and put through a final Dense layer.

Instead of one single attention head, Q, K, and V are split into multiple heads because it allows the model to jointly attend to information at different positions from different representational spaces. After the split each head has a reduced dimensionality, so the total computation cost is the same as a single head attention with full dimensionality.


In [17]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)
        
    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)
            
        return output, attention_weights
    
def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),
        tf.keras.layers.Dense(d_model)
    ])

### 4.5. Encoder and Decoder Layer

<p align='center'>
    <img src="https://miro.medium.com/v2/resize:fit:863/0*jKqypwGzmDv7KDUZ.png" />
</p>

#### 4.5.1. Encoder Layer
Each encoder layer consists of sublayers:

1.   Multi-head attention (with padding mask) 
2.    Point wise feed forward networks. 

Each of these sublayers has a residual connection around it followed by a layer normalization. Residual connections help in avoiding the vanishing gradient problem in deep networks.

In [18]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
    
    def call(self, x, training, mask):
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)

        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)

        return out2

### 4.5.2. Decoder Layer
Each decoder layer consists of sublayers:

1.   Masked multi-head attention (with look ahead mask and padding mask)
2.   Multi-head attention (with padding mask). V (value) and K (key) receive the *encoder output* as inputs. Q (query) receives the output from the masked multi-head attention sublayer.
3.   Point wise feed forward networks

Each of these sublayers has a residual connection around it followed by a layer normalization. The output of each sublayer is LayerNorm(x + Sublayer(x)). The normalization is done on the d_model (last) axis.


In [19]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)
    
    
    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        attn2, attn_weights_block2 = self.mha2(enc_output, enc_output, out1, padding_mask)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)

        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)

        return out3, attn_weights_block1, attn_weights_block2

### 4.5.3 Encoder and Decoder
The Encoder here will consist of 1) input embeddings, 2) positional encoding and 3) `num_layers` encoder layers.

The Decoder here will consist of 1) output embeddings, 2) positional encoding and 3) `num_layers` decoder layers

In [20]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.d_model)

        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)
        
    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]

        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)
    
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)
    
        return x
    
class Decoder(tf.keras.layers.Layer):
        
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size, maximum_position_encoding, rate=0.1):
        super(Decoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)
    
    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        attention_weights = {}

        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)

            attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i+1)] = block2
    
        return x, attention_weights

### 4.6. Putting together the Transformer Model

<p align='center'>
    <img src="https://miro.medium.com/v2/resize:fit:856/1*ZCFSvkKtppgew3cc7BIaug.png" />
</p>

The transformer will consist of the encoder, decoder and a final linear layer that outputs the probabilities of the words. The output of the decoder is this input to this final linear layer.

In [21]:
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, target_vocab_size, pe_input, pe_target, rate=0.1):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, pe_input, rate)

        self.decoder = Decoder(num_layers, d_model, num_heads, dff, target_vocab_size, pe_target, rate)

        self.final_layer = tf.keras.layers.Dense(target_vocab_size)
    
    def call(self, inp, tar, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(inp, training, enc_padding_mask)

        dec_output, attention_weights = self.decoder(tar, enc_output, training, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)

        return final_output, attention_weights

## 5. Model Training

### 5.1. Hyperparameters

In [22]:
num_layers = 4 # number of layers for the encoder and decoder
d_model = 128 # dimension of the model
dff = 512 # dimension of the feed-forward network
num_heads = 4 # number of heads for the multihead attention 
dropout_rate = 0.2 # for regularization
EPOCHS = 15 # how many times to train over the input

### 5.2. Preparing Custom Optimizer
A learning rate schedule helps to modulate how the learning rate of your optimizer changes over time. From the 'Attention Is All You Need' paper, they used the following formula. Such a learning rate increases linearly for the first `warmup_steps` training steps, and decreases thereafter proportionally to the inverse square root of the step number.

<p align='center'>
    <img src="https://i.stack.imgur.com/GQurA.png" alt="Text Summarisation Visual" />
</p>

In [23]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()

        self.d_model = d_model
        self.d_model = tf.cast(self.d_model, tf.float32)

        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, dtype=tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)

        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

In [24]:
learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

### 5.3. Loss and Metrics
Since the target sequences are padded, it is important to apply a padding mask when calculating the loss.

In [25]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_sum(loss_)/tf.reduce_sum(mask)


def accuracy_function(real, pred):
    accuracies = tf.equal(real, tf.argmax(pred, axis=2))
    #accuracies = tf.cast(accuracies, dtype= tf.float32)

    mask = tf.math.logical_not(tf.math.equal(real, 0))
    accuracies = tf.math.logical_and(mask, accuracies)

    accuracies = tf.cast(accuracies, dtype=tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)
    return tf.reduce_sum(accuracies)/tf.reduce_sum(mask)

In [26]:
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.Mean(name='train_accuracy')

### 5.4. Training And Checkpointing

In [27]:
transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=ENCODER_VOCAB,
    target_vocab_size=DECODER_VOCAB,
    pe_input=1000,
    pe_target=1000,
    rate=dropout_rate)

In [28]:
def create_masks(inp, tar):
    enc_padding_mask = create_padding_mask(inp)
    dec_padding_mask = create_padding_mask(inp)

    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)
  
    return enc_padding_mask, combined_mask, dec_padding_mask

In [29]:
checkpoint_path = "checkpoints"

ckpt = tf.train.Checkpoint(transformer=transformer, optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# # if a checkpoint exists, restore the latest checkpoint.
# if ckpt_manager.latest_checkpoint:
#     ckpt.restore(ckpt_manager.latest_checkpoint)
#     print ('Latest checkpoint restored!!')

In [30]:
@tf.function
def train_step(inp, tar):
    tar_inp = tar[:, :-1]
    tar_real = tar[:, 1:]

    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)

    with tf.GradientTape() as tape:
        predictions, _ = transformer(
            inp, tar_inp, 
            True, 
            enc_padding_mask, 
            combined_mask, 
            dec_padding_mask
        )
        loss = loss_function(tar_real, predictions)

    gradients = tape.gradient(loss, transformer.trainable_variables)    
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

    train_loss(loss)
    train_accuracy(accuracy_function(tar_real, predictions))

In [31]:
for epoch in range(EPOCHS):
    start = time.time()

    train_loss.reset_states()
  
    for (batch, (inp, tar)) in enumerate(dataset):
        train_step(inp, tar)
    
        if batch % 100 == 0:
            print(f'Epoch {epoch + 1} Batch {batch} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
      
    if (epoch + 1) % 5 == 0:
        ckpt_save_path = ckpt_manager.save()
        print ('Saving checkpoint for epoch {} at {}'.format(epoch+1, ckpt_save_path))
   
    print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
    print ('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 10.2292 Accuracy 0.0000
Epoch 1 Batch 100 Loss 10.1242 Accuracy 0.0166
Epoch 1 Batch 200 Loss 9.8911 Accuracy 0.0290
Epoch 1 Batch 300 Loss 9.5489 Accuracy 0.0328
Epoch 1 Loss 9.4792 Accuracy 0.0333
Time taken for 1 epoch: 1349.9867615699768 secs

Epoch 2 Batch 0 Loss 8.2002 Accuracy 0.0333
Epoch 2 Batch 100 Loss 7.8102 Accuracy 0.0374
Epoch 2 Batch 200 Loss 7.5549 Accuracy 0.0438
Epoch 2 Batch 300 Loss 7.4271 Accuracy 0.0495
Epoch 2 Loss 7.4066 Accuracy 0.0506
Time taken for 1 epoch: 1319.7803394794464 secs

Epoch 3 Batch 0 Loss 7.1571 Accuracy 0.0506
Epoch 3 Batch 100 Loss 6.9521 Accuracy 0.0582
Epoch 3 Batch 200 Loss 6.8520 Accuracy 0.0662
Epoch 3 Batch 300 Loss 6.7568 Accuracy 0.0736
Epoch 3 Loss 6.7386 Accuracy 0.0750
Time taken for 1 epoch: 1317.3954560756683 secs

Epoch 4 Batch 0 Loss 6.4537 Accuracy 0.0750
Epoch 4 Batch 100 Loss 6.3494 Accuracy 0.0820
Epoch 4 Batch 200 Loss 6.2739 Accuracy 0.0884
Epoch 4 Batch 300 Loss 6.2064 Accuracy 0.0943
Epoch 4 Loss 6.

## 6. Inferencing

In [32]:
def evaluate(input_article):
    input_article = article_tokenizer.texts_to_sequences([input_article])
    input_article = tf.keras.preprocessing.sequence.pad_sequences(input_article, maxlen=ENCODER_LEN, 
                                                                   padding='post', truncating='post')

    encoder_input = tf.expand_dims(input_article[0], 0)

    decoder_input = [summary_tokenizer.word_index['<sos>']]
    output = tf.expand_dims(decoder_input, 0)
    
    for i in range(DECODER_LEN):
        enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, output)

        predictions, attention_weights = transformer(
            encoder_input, 
            output,
            False,
            enc_padding_mask,
            combined_mask,
            dec_padding_mask
        )

        predictions = predictions[: ,-1:, :]
        predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

        if predicted_id == summary_tokenizer.word_index['<eos>']:
            return tf.squeeze(output, axis=0), attention_weights

        output = tf.concat([output, predicted_id], axis=-1)

    return tf.squeeze(output, axis=0), attention_weights

In [33]:
def summarize(input_article):
    summarized = evaluate(input_article=input_article)[0].numpy()
    summarized = np.expand_dims(summarized[1:], 0)  
    return summary_tokenizer.sequences_to_texts(summarized)[0]

In [34]:
print(article[0])

<SOS> 13 September 2013 Last updated at 08:33 BST
One woman who has well and truly broken through that ceiling in one of the toughest industries around is Dr Marlene Kanga.
She is the head of Engineers Australia and is in Singapore for a global industry conference.
She spoke to Ali Moore about what inspired her to become an engineer. <EOS>


In [35]:
print("Real Headline : ", summary[0],"\n Predicted Summary : ", summarize(article[0]))


Real Headline :  <SOS> The glass ceiling is still a problem for women trying to rise through the leadership ranks in business. <EOS> 
 Predicted Summary :  it's a good issue of the year for the uk and the uk the uk is a good issue


In [36]:
print(article[1])

<SOS> The character is seen dying as a result of an arrow to the head from Hawkeye, his Avengers teammate, in the third issue of Civil War II.
Banner has been the Hulk's alter ego since the character's creation in 1962.
However, for the last year, readers have seen Banner medicating himself to keep his anger management issues under control.
During that time, a Korean-American teenage genius named Amadeus Cho has taken over as the new human alter-ego of The Hulk.
"This is uncharted territory for us," Marvel's editor in chief Axel Alonso told the New York Daily News.
"Only two things are for certain: It will take a long, long time for our heroes to come to terms with his loss, and the circumstance surrounding his death will leave a huge scar on the superhero community."
In the latest edition of Civil War II, Hawkeye is seen killing his friend on the belief Banner is about to turn into the Hulk and unleash massive death and destruction.
Banner had recently asked him for a mercy killing in

In [37]:
print("Real Headline : ", summary[1],"\n Predicted Summary : ", summarize(article[1]))

Real Headline :  <SOS> Marvel has killed off The Hulk's human alter ego Bruce Banner in its latest comic. <EOS> 
 Predicted Summary :  the us space agency nasa has said it is a similar to the us space agency nasa says


In [38]:
print(article[2])

<SOS> The vaccine, which has been developed in India, costs less than fifty US cents a dose and clinical tests suggest it could offer protection for between 10 and 15 years.
Seasonal epidemics of meningitis kill thousands in Africa every year.
Vaccinations will start in Burkina Faso, then move to Niger and Mali.
Officials say clinical trials of the new vaccine have shown it to be highly effective in protecting against Meningitis A - a form of meningitis which kills thousands of young people each year across a swathe of sub-Saharan Africa dubbed the "meningitis belt".
The vaccine is similar in concept to the one used successfully in Britain to tackle Meningitis C.
If all goes to plan, it will first be offered to anyone aged between one and 29 years across 25 African states from Senegal in the west to Somalia in the east.
The drug was developed in India at the cost of the Bill & Melinda Gates Foundation, but much more money will be required to complete the initial vaccination programme -

In [39]:
print("Real Headline : ", summary[2],"\n Predicted Summary : ", summarize(article[2]))

Real Headline :  <SOS> Health officials say they will begin the roll out of a new meningitis vaccine for sub-Saharan Africa on 6 December. <EOS> 
 Predicted Summary :  the uk is to be offered the ebola virus in the uk and the uk to be scrapped


In [40]:
print(article[3])

<SOS> He said he had sent the US evidence of Fethullah Gulen's criminal activities - allegations the cleric denies - in support of an extradition bid.
Mr Yildirim insisted that his country was governed by the rule of law.
Thousands of soldiers, police and officials have been detained or sacked since Friday's coup attempt.
President Recep Tayyip Erdogan has again refused to rule out reinstating the death penalty for coup plotters if it is approved by parliament.
The EU has warned that such a move would see the end of accession talks to the bloc.
For now, at least, that seems not to worry President Erdogan, who is seizing the opportunity to tighten his grip, reports the BBC's Turkey correspondent, Mark Lowen.
Prime Minister Yildirim was speaking after meeting the leader of the main opposition CHP party.
He warned people not to act out of a spirit of revenge in the wake of Friday's failed military takeover, saying that would be "unacceptable" but whoever had acted against the law would be

In [41]:
print("Real Headline : ", summary[3],"\n Predicted Summary : ", summarize(article[3]))

Real Headline :  <SOS> The Turkish Prime Minister, Binali Yildirim, has vowed to purge supporters of an exiled cleric "by the roots" in the aftermath of the failed coup. <EOS> 
 Predicted Summary :  the us president of the russian president donald trump has said he will be a coup in the capital kiev


In [44]:
article1 = """Get ready to look perfect as you're thinking out loud on Feb 16, 2024 at Ed Sheeran's concert. The Grammy Award-winning singer is heading to Singapore for a one-night show at the National Stadium. Plus, he's bringing along English singer Calum Scott as a guest.

Tickets for the concert will cost between S$88 and S$488 and can be purchased via Ticketmaster and at SingPost outlets.

If you signed up for a UOB card for Taylor Swift's concert and didn't cancel your membership afterward, here's some great news. UOB cardholders can enjoy a presale from 10am on Oct 27 till 9.59 am on Oct 29.

A second presale will be held for KrisFlyer members from 10am on Oct 30 to 9.59am on Oct 31. To get in on this presale, KrisFlyer UOB credit and debit cardholders will need to subscribe to receive KrisFlyer and SIA Group promotional emails via their KrisFlyer account preferences. They will then receive a unique access code from KrisFlyer via email on Oct 27. 

Members who are not KrisFlyer UOB credit or debit cardholders can download Kris+, the SIA Group’s lifestyle rewards app and spend 150 miles between Oct 20 and 25 to redeem a unique access code. Do note that redemptions are limited to the first 110,000 customers.

Alternatively, those with loads of miles to spare can opt to redeem Categories 1 to 4 concert tickets using their miles via KrisFlyer Experiences from Oct 30. Tickets from Categories 1 to 4 may be redeemed with 49,000; 38,000; 29,000 and 19,000 miles, respectively.

General sale will commence from 11am on Oct 31."""

summarize(article1)

'the new album of the year has been mocked after a rare sign of a landmark album of the year'

## 7. Conclusion
In this method, I wanted to build my own transformer model using the 'Attention Is All You Need' paper's architecture to perform text abstraction. 

However, a big restraint I faced was cost. My AWS EC2 instance of c5.2xLarge was apparently not big enough to train the transformer to encode article lengths of up to 1024 words. As a result, I was only able to design the transformer model to encode article lengths up to 500 words and decode 100 words for headers, with a transformer model that has 4 heads (for the multi head attention) and has 5 layers of the encoder and decoder blocks. 

This limited the model from getting all the needed inputs from an article text to perform the needed training on summary targets. In other words, the model was insufficient to begin training with. I could increase it but it will be costly. Hence, I have refrained from further training my model, and have excluded it from being used as a method of extraction in the `event-extraction-workbook.ipynb` and `event-extraction-script.py`.

However, I have decided to still include this notebook into my submission because I wanted to highlight that I know how to build transformer models and am currently inhibited by cost. 