# Text Summarization

---
## Content
1) **Project Overview**

2) **Import Dataset**
- In `.parquet` format

3) **Text Preprocessing**
- Tokenize the text data and convert it to sequences.
- Pad or truncate sequences to ensure uniform input length.
- Handle special tokens (e.g., start-of-sequence `<sos>` and end-of-sequence `<eos>`).

4) **Model Development**
- Build a Seq2Seq model in TensorFlow with the following components:
    - Encoder (RNN/LSTM/GRU).
    - Decoder with attention mechanism.
    - Attention layer to enhance summary quality.

5) **Training**
- Split the dataset:
    - 80% training data
    - 10% validation data
    - 10% test data
- Train the model on the training set.
- Monitor performance using the validation set.
- Adjust hyperparameters as necessary.

6) **Evaluation**
- Use the test set to generate summaries.
- Evaluate the generated summaries using the ROUGE metric.

7) **Analysis**
- Compare generated summaries to reference summaries and discuss performance.
- Suggest potential improvements or extensions for better results.

8) **Contributors**

---
## Project Overview
This project focuses on developing a text summarization system for news articles using Sequence-to-Sequence (Seq2Seq) models enhanced with attention mechanisms. By utilizing a custom dataset of news content, the model is trained to generate concise, coherent, and informative summaries that capture the key points of each article.

The Seq2Seq architecture, paired with attention, allows the model to dynamically focus on relevant parts of the input text during the decoding process, improving the quality and accuracy of the summaries. This approach addresses the challenge of long and complex news articles by effectively reducing redundancy and preserving critical information.

The project includes dataset preprocessing, model training, and evaluation using metrics such as ROUGE, with the goal of producing high-quality, human-like summaries. This work aims to contribute to automated news aggregation, efficient information retrieval, and content generation.

---
## Import Dataset

Since we work in Google Colab, first mount Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import os

# Current directory
print("You are currently in", os.getcwd(), "directory")

# Path to dataset
!ls /content/drive/My\ Drive/Colab\ Notebooks

You are currently in /content directory
 ds1.parquet	   mert.ipynb	  text_summarization.ipynb	   text_summarization_v3.ipynb
'mert (1).ipynb'   mutant.ipynb   text_summarization_model.keras   Untitled0.ipynb


In [1]:
# Read the Parquet file
import os
import pandas as pd
df = pd.read_parquet('ds1.parquet')

df.head(3)

| | text | prediction | prediction_agent | annotation | annotation_agent | id | metadata | status | event_timestamp | metrics |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WASHINGTON (Reuters) - President Donald Trump ... | [{'score': 1.0, 'text': 'Trump ends 'Dreamer' ... | Argilla | | | 04de325a-1fbf-41a9-977b-ec7892ef86f0 | | Default | 2017-09-05 | {'text_length': 6904} |
| 1 | MOSCOW (Reuters) - Russian property developer ... | [{'score': 1.0, 'text': 'Russian tycoon, fresh... | Argilla | | | 97c7f5e7-ae32-44af-ad0c-e6b17ce31e54 | | Default | 2017-11-08 | {'text_length': 1527} |
| 2 | WASHINGTON (Reuters) - The U.S. intelligence c... | [{'score': 1.0, 'text': 'U.S. not started asse... | Argilla | | | 90894659-b843-4817-9df8-bb34d6219cdf | | Default | 2017-05-23 | {'text_length': 677} |


---
## Text Preprocessing

In [2]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
import re

Extract the `text` field from the `prediction` column and wrap each target summary in the special tokens `<sos>` and `<eos>`.

In [3]:
# Extract text
def extract_text(x):
    if isinstance(x, np.ndarray) and len(x) > 0 and 'text' in x[0]:
        return x[0]['text']
    return ''

# Extract texts for summaries
df['text_prediction'] = df['prediction'].apply(extract_text)

# Handle special tokens
df['text_prediction'] = df['text_prediction'].apply(lambda x: '<sos> ' + x + ' <eos>')

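Although `Tokenizer` applies its own filtering, we also apply basic text cleaning to the input and target text. Note that this cleaning removes the angle brackets, so `<sos>` and `<eos>` become the plain tokens `sos` and `eos`.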

In [4]:
def clean_text(txt):
    txt = str(txt).lower()                      # Convert to lowercase
    txt = re.sub(r'[^a-z0-9\s.,!?]', '', txt)   # Remove special characters except basic punctuation
    txt = re.sub(r'\s+', ' ', txt).strip()      # Replace multiple spaces with single space
    return txt

# Clean input text
df['text'] = df['text'].apply(clean_text)

# Clean target text
df['text_prediction'] = df['text_prediction'].apply(clean_text)
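
The effect of the cleaning step on the special tokens is easy to check. The sketch below repeats `clean_text` on a made-up sentence (the sample string is illustrative, not a row from the dataset) and shows what the target text looks like afterwards.

```python
import re

def clean_text(txt):
    txt = str(txt).lower()                      # Convert to lowercase
    txt = re.sub(r'[^a-z0-9\s.,!?]', '', txt)   # Keep letters, digits, whitespace and .,!?
    txt = re.sub(r'\s+', ' ', txt).strip()      # Collapse repeated whitespace
    return txt

# Hypothetical target text, already wrapped in the special tokens
sample = "<sos> Trump ends 'Dreamer' program <eos>"
cleaned = clean_text(sample)
print(cleaned)
```

The angle brackets and quotes are gone, so the markers reach the tokenizer as the ordinary words `sos` and `eos`. This matters wherever they are later looked up by name.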

Tokenize the text data and convert it to sequences using TensorFlow/Keras Tokenizer.

In [5]:
# Define max vocabulary size
max_vocab_size = 20000

# Initialize Tokenizer, handle out-of-vocabulary words
tokenizer = Tokenizer(num_words=max_vocab_size, oov_token='<UNK>')

# Convert texts to list
all_texts = df['text'].tolist() + df['text_prediction'].tolist()

# Fit the tokenizer
tokenizer.fit_on_texts(all_texts)

# Convert a given text into a sequence of integer IDs
def text_to_ids(txt):
    return tokenizer.texts_to_sequences([txt])[0]

# Apply text_to_ids function
df['enc_ids'] = df['text'].apply(text_to_ids)
df['dec_ids'] = df['text_prediction'].apply(text_to_ids)
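
To make the mapping from words to IDs concrete, here is a minimal pure-Python sketch of a frequency-ranked vocabulary with an out-of-vocabulary token. It is a simplified stand-in for illustration, not the Keras implementation: `build_word_index` and `encode` are hypothetical helpers, and the assumed conventions are that index 0 is reserved for padding, index 1 for `<UNK>`, and only IDs below `num_words` are kept.

```python
from collections import Counter

OOV = "<UNK>"

def build_word_index(texts):
    """Rank words by frequency; ID 0 is left for padding, ID 1 for OOV."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {OOV: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, num_words):
    """Map words to IDs; unknown or too-rare words fall back to the OOV ID."""
    ids = []
    for word in text.split():
        idx = word_index.get(word)
        if idx is None or idx >= num_words:
            idx = word_index[OOV]
        ids.append(idx)
    return ids

texts = ["the cat sat", "the dog sat", "the end"]
word_index = build_word_index(texts)
encoded = encode("the cat sat down", word_index, num_words=4)
print(encoded)
```

With `num_words=4` only the two most frequent words keep their own IDs, so both the rare word `cat` and the unseen word `down` map to the OOV ID.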

Pad and truncate sequences to ensure uniform input length.

In [6]:
# Define max input and target lengths
max_input_length = 1000
max_summary_length = 50

# Post-pad sequences shorter than the maximum length; truncate longer ones at the end
enc_in = pad_sequences(df['enc_ids'], maxlen=max_input_length, padding='post', truncating='post')
dec_in_out = pad_sequences(df['dec_ids'], maxlen=max_summary_length, padding='post', truncating='post')

# Identify rows with non-empty sequences by summing each row
valid_idx = np.where(enc_in.sum(axis=1) != 0)[0]

# Retain only non-empty sequences for model input and output
enc_in = enc_in[valid_idx]
dec_in_out = dec_in_out[valid_idx]
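
The post-padding and post-truncation behaviour can be sketched for a single sequence in plain Python. `pad_or_truncate` is a hypothetical helper written for illustration; it mirrors what `padding='post', truncating='post'` is expected to do for one row, and the last lines show why an all-zero row sum identifies an empty sequence.

```python
def pad_or_truncate(ids, maxlen, pad_id=0):
    """Cut the sequence at maxlen, then fill the tail with pad_id."""
    ids = list(ids)[:maxlen]
    return ids + [pad_id] * (maxlen - len(ids))

short_row = pad_or_truncate([5, 8, 3], maxlen=5)
long_row = pad_or_truncate([5, 8, 3, 9, 4, 7], maxlen=4)
empty_row = pad_or_truncate([], maxlen=4)

print(short_row, long_row, empty_row)

# A row that is all padding sums to zero, which is what the filter above relies on
is_empty = sum(empty_row) == 0
```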

---
## Model Development

In [7]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, Model
import tensorflow as tf

Build a Seq2Seq model in TensorFlow with the following components:
- Encoder (RNN/LSTM/GRU)
- Decoder with attention mechanism.
- Attention layer to enhance summary quality.

This custom layer implements Bahdanau (additive) attention extended with a coverage vector, the running sum of past attention weights, which discourages the decoder from attending to the same source positions repeatedly.
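
As a reading aid, the computation performed by the attention layer in this section can be written as follows. The notation is an assumption introduced here: $h_i$ is the encoder output at position $i$, $s_t$ the decoder hidden state at step $t$, $c^t$ the coverage vector, $W_1$, $W_2$, $v$ correspond to `W1`, `W2`, `V`, and $\phi$ to `coverage_dense`. Bias terms of the other dense layers are omitted.

$$
\begin{aligned}
e_i^t &= v^\top \tanh\big(W_1 h_i + W_2 s_t + \phi(c_i^t)\big), \qquad \phi(c) = \mathrm{ReLU}(w_c\, c + b_c) \\
a^t &= \mathrm{softmax}(e^t) \\
h_t^* &= \sum_i a_i^t\, h_i \\
c^{t+1} &= c^t + a^t, \qquad c^0 = 0
\end{aligned}
$$

Here $a^t$ is the attention distribution over source positions and $h_t^*$ is the context vector passed on to the decoder.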

In [8]:
class CoverageAttention(layers.Layer):
    """
    Implements coverage-based Bahdanau attention to address the issue of repetitive attention.
    This layer enhances the attention mechanism by incorporating past attention distribution (coverage).
    """
    def __init__(self, units):
        """
        Initializes the CoverageAttention layer.

        Args:
            units (int): Number of hidden units in the attention mechanism.
        """
        super(CoverageAttention, self).__init__()
        self.W1 = layers.Dense(units)  # Weight for encoder outputs
        self.W2 = layers.Dense(units)  # Weight for decoder hidden state
        self.V = layers.Dense(1)        # Final dense layer to produce attention scores
        self.coverage_dense = layers.Dense(units, activation='relu')  # Coverage dense layer

    def call(self, dec_h, enc_outputs, coverage):
        """
        Compute the context vector and attention weights.

        Args:
            dec_h (Tensor): Decoder hidden state (batch_size, dec_units).
            enc_outputs (Tensor): Encoder outputs (batch_size, max_enc_len, enc_units).
            coverage (Tensor): Accumulated attention scores (batch_size, max_enc_len).

        Returns:
            context_vector (Tensor): Weighted sum of encoder outputs (batch_size, enc_units).
            attn_weights (Tensor): Attention weights (batch_size, max_enc_len, 1).
            coverage (Tensor): Updated coverage vector (batch_size, max_enc_len).
        """
        # Expand dimensions to match encoder outputs
        dec_h_expanded = tf.expand_dims(dec_h, 1)  # Shape: (batch_size, 1, dec_units)
        coverage_expanded = tf.expand_dims(coverage, -1)  # Shape: (batch_size, max_enc_len, 1)

        # Compute coverage feature
        coverage_feat = self.coverage_dense(coverage_expanded)  # Shape: (batch_size, max_enc_len, units)

        # Calculate attention scores
        score = self.V(tf.nn.tanh(
            self.W1(enc_outputs) + self.W2(dec_h_expanded) + coverage_feat
        ))  # Shape: (batch_size, max_enc_len, 1)

        # Compute attention weights
        attn_weights = tf.nn.softmax(score, axis=1)  # Shape: (batch_size, max_enc_len, 1)

        # Compute context vector as the weighted sum of encoder outputs
        context_vector = attn_weights * enc_outputs  # Shape: (batch_size, max_enc_len, enc_units)
        context_vector = tf.reduce_sum(context_vector, axis=1)  # Shape: (batch_size, enc_units)

        # Update coverage vector by adding current attention weights
        coverage += tf.squeeze(attn_weights, axis=-1)  # Shape: (batch_size, max_enc_len)

        return context_vector, attn_weights, coverage

# Function to combine the vocabulary distribution and attention distribution using the pointer-generator mechanism
def calc_final_dist(vocab_dist, attn_dist, p_gen, enc_inputs, vocab_size):
    """
    Calculate the final probability distribution by combining vocab distribution
    and attention distribution based on the pointer-generator network mechanism.

    Args:
        vocab_dist (Tensor): Vocabulary distribution (batch_size, vocab_size).
        attn_dist (Tensor): Attention distribution (batch_size, max_enc_len).
        p_gen (Tensor): Generation probability scalar (batch_size, 1).
        enc_inputs (Tensor): Encoder input IDs (batch_size, max_enc_len).
        vocab_size (int): Total vocabulary size.

    Returns:
        final_dist (Tensor): Final combined distribution (batch_size, vocab_size).
    """
    # Calculate the weighted distributions
    vocab_part = p_gen * vocab_dist          # Probability of generating from vocab
    copy_part = (1.0 - p_gen) * attn_dist    # Probability of copying from input

    # Scatter to combine attention and vocabulary distributions
    def scatter_one(args):
        v_b, c_b, e_b = args
        indices = tf.expand_dims(e_b, axis=-1)  # Shape: (max_enc_len, 1)
        return tf.tensor_scatter_nd_add(v_b, indices, c_b)  # Shape: (vocab_size,)

    final_dist = tf.map_fn(
        scatter_one,
        (vocab_part, copy_part, enc_inputs),
        fn_output_signature=tf.float32  # Final output shape: (batch_size, vocab_size)
    )
    return final_dist
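
The pointer-generator mixture is easier to follow on a tiny example. The sketch below is a pure-Python version for a single example with made-up numbers; `final_distribution` is a hypothetical helper, not part of the notebook. It scales the vocabulary distribution by `p_gen` and adds the remaining `1 - p_gen` mass to the IDs of the source tokens, in proportion to their attention weights.

```python
def final_distribution(vocab_dist, attn_dist, src_ids, p_gen):
    """Mix generation and copy probabilities for one decoding step."""
    # Generation part: scale the vocabulary distribution by p_gen
    final = [p_gen * p for p in vocab_dist]
    # Copy part: add attention mass to the vocabulary slot of each source token
    for weight, token_id in zip(attn_dist, src_ids):
        final[token_id] += (1.0 - p_gen) * weight
    return final

vocab_dist = [0.1, 0.2, 0.3, 0.4]   # distribution over a 4-word vocabulary
attn_dist = [0.5, 0.3, 0.2]         # attention over 3 source positions
src_ids = [2, 3, 2]                 # token 2 appears twice in the source
final = final_distribution(vocab_dist, attn_dist, src_ids, p_gen=0.6)
print(final)
```

Because token 2 occurs at two source positions, it receives the attention mass of both, and the result still sums to one.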

This model is a pointer-generator network: a Seq2Seq encoder-decoder with attention, a copy mechanism, and a coverage vector that tracks past attention while generating summaries.

In [9]:
class PointerGenCoverage(Model):
    """
    Pointer-Generator Network with Coverage Mechanism.
    This model combines the power of sequence-to-sequence models with attention,
    and introduces coverage to reduce repetition in generated text.
    """
    def __init__(self, vocab_size, embed_dim, enc_units, dec_units, max_enc_len):
        """
        Initializes the PointerGenCoverage model.

        Args:
            vocab_size (int): Size of the vocabulary.
            embed_dim (int): Dimension of the embedding vectors.
            enc_units (int): Number of units in the encoder LSTM.
            dec_units (int): Number of units in the decoder LSTM.
            max_enc_len (int): Maximum length of encoder input sequences.
        """
        super(PointerGenCoverage, self).__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.enc_units = enc_units
        self.dec_units = dec_units
        self.max_enc_len = max_enc_len

        # Embedding layer to encode input tokens
        self.embedding = Embedding(
            input_dim=vocab_size,
            output_dim=embed_dim,
            mask_zero=False,
            name="embedding"
        )

        # Encoder LSTM to process input sequences
        self.encoder_lstm = LSTM(
            enc_units,
            return_sequences=True,
            return_state=True,
            name="encoder_lstm"
        )

        # Decoder LSTM for generating target sequences
        self.decoder_lstm = LSTM(
            dec_units,
            return_sequences=True,
            return_state=True,
            name="decoder_lstm"
        )

        # Attention layer with coverage
        self.attention = CoverageAttention(dec_units)

        # Dense layer for output vocabulary distribution
        self.vocab_out = Dense(vocab_size, activation='softmax', name="vocab_out")

        # Pointer network gate to control copy/generate mechanism
        self.pointer_gate = Dense(1, activation='sigmoid', name="pointer_gate")

    def call_encoder(self, enc_inputs):
        """
        Encodes the input sequence into hidden states.

        Args:
            enc_inputs (Tensor): Encoder input tensor (batch_size, max_enc_len).

        Returns:
            enc_outputs (Tensor): Encoder outputs (batch_size, max_enc_len, enc_units).
            enc_h (Tensor): Final hidden state of the encoder (batch_size, enc_units).
            enc_c (Tensor): Final cell state of the encoder (batch_size, enc_units).
        """
        embedded = self.embedding(enc_inputs)  # Shape: (batch_size, max_enc_len, embed_dim)
        enc_outputs, enc_h, enc_c = self.encoder_lstm(embedded)
        return enc_outputs, enc_h, enc_c

    def decode_step(self, dec_input, dec_h, dec_c, coverage, enc_outputs, enc_inputs):
        """
        Perform a single decoding step.

        Args:
            dec_input (Tensor): Current decoder input (batch_size,).
            dec_h (Tensor): Decoder hidden state (batch_size, dec_units).
            dec_c (Tensor): Decoder cell state (batch_size, dec_units).
            coverage (Tensor): Coverage vector (batch_size, max_enc_len).
            enc_outputs (Tensor): Encoder outputs (batch_size, max_enc_len, enc_units).
            enc_inputs (Tensor): Encoder input IDs (batch_size, max_enc_len).

        Returns:
            final_dist (Tensor): Final vocabulary distribution (batch_size, vocab_size).
            new_h (Tensor): Updated decoder hidden state (batch_size, dec_units).
            new_c (Tensor): Updated decoder cell state (batch_size, dec_units).
            new_cov (Tensor): Updated coverage vector (batch_size, max_enc_len).
        """
        embedded = self.embedding(dec_input)  # Shape: (batch_size, embed_dim)
        embedded = tf.expand_dims(embedded, axis=1)  # Shape: (batch_size, 1, embed_dim)

        # Pass through decoder LSTM
        dec_outputs, new_h, new_c = self.decoder_lstm(embedded, initial_state=[dec_h, dec_c])
        dec_outputs = tf.squeeze(dec_outputs, axis=1)  # Shape: (batch_size, dec_units)

        # Calculate attention with coverage
        context_vector, attn_weights, new_cov = self.attention(new_h, enc_outputs, coverage)

        # Concatenate decoder output and context vector
        concat_vector = tf.concat([dec_outputs, context_vector], axis=-1)  # Shape: (batch_size, dec_units + enc_units)

        # Compute vocabulary distribution
        vocab_distribution = self.vocab_out(concat_vector)  # Shape: (batch_size, vocab_size)

        # Compute pointer gate value
        p_gen = self.pointer_gate(concat_vector)  # Shape: (batch_size, 1)

        # Squeeze attention weights for final distribution calculation
        attn_weights_squeezed = tf.squeeze(attn_weights, axis=-1)  # Shape: (batch_size, max_enc_len)

        # Calculate final distribution by combining vocabulary and attention distributions
        final_distribution = calc_final_dist(
            vocab_distribution,
            attn_weights_squeezed,
            p_gen,
            enc_inputs,
            self.vocab_size
        )  # Shape: (batch_size, vocab_size)

        return final_distribution, new_h, new_c, new_cov

    def call(self, enc_inputs, dec_inputs):
        """
        Forward pass for the Pointer-Generator model.

        Args:
            enc_inputs (Tensor): Encoder input tensor (batch_size, max_enc_len).
            dec_inputs (Tensor): Decoder input tensor (batch_size, max_dec_len).

        Returns:
            final_dists (Tensor): Final distributions for all decoding steps (batch_size, max_dec_len, vocab_size).
        """
        # Encode the input sequences
        enc_outputs, enc_h, enc_c = self.call_encoder(enc_inputs)

        # Initialize decoder states with encoder's final states
        dec_h, dec_c = enc_h, enc_c

        # Initialize coverage vector to zeros
        batch_size = tf.shape(enc_inputs)[0]
        coverage = tf.zeros((batch_size, self.max_enc_len))

        # Initialize list to collect final distributions
        final_dists = []

        # Iterate over each time step in the decoder input
        for t in range(dec_inputs.shape[1]):
            # Get the decoder input for the current time step
            dec_input_t = dec_inputs[:, t]

            # Perform a decoding step
            final_dist, dec_h, dec_c, coverage = self.decode_step(
                dec_input_t, dec_h, dec_c, coverage, enc_outputs, enc_inputs
            )

            # Append the final distribution to the list
            final_dists.append(final_dist)

        # Stack the final distributions along the time axis
        final_dists = tf.stack(final_dists, axis=1)  # Shape: (batch_size, max_dec_len, vocab_size)

        return final_dists
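
The decoding loop threads a coverage vector through every step. The sketch below shows, in plain Python with made-up attention weights, how that vector accumulates. It also computes the coverage penalty from the pointer-generator literature (See et al., 2017), the sum over steps of `min(attention, coverage)`. That penalty is shown only as an illustration; the training code in this notebook does not add it to the loss, and `coverage_penalty` is a hypothetical helper.

```python
def coverage_penalty(attn_steps):
    """Accumulate coverage and penalise attention that revisits covered positions."""
    coverage = [0.0] * len(attn_steps[0])
    penalty = 0.0
    for attn in attn_steps:
        # Overlap between current attention and what was already attended to
        penalty += sum(min(a, c) for a, c in zip(attn, coverage))
        # Coverage is the running sum of attention distributions
        coverage = [c + a for c, a in zip(coverage, attn)]
    return penalty, coverage

# Repetitive attention: both steps look at the same two positions
repeat_penalty, repeat_cov = coverage_penalty([[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]])

# Non-repetitive attention: each step looks at a new position
fresh_penalty, fresh_cov = coverage_penalty([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

print(repeat_penalty, fresh_penalty)
```

Both cases end with the same coverage vector, but only the repetitive one is penalised.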

---
## Training

With the model defined, we split the data into 80% training, 10% validation, and 10% test sets.

In [10]:
# First split: 80% training and 20% temporary (to be further split)
X_train, X_temp, Y_train, Y_temp = train_test_split(
    enc_in, dec_in_out, test_size=0.2, random_state=42
)

# Second split: Split the 20% temporary data into 10% validation and 10% test
X_val, X_test, Y_val, Y_test = train_test_split(
    X_temp, Y_temp, test_size=0.5, random_state=42
)

# Verify the shapes
print(f"Training set: {X_train.shape}, {Y_train.shape}")
print(f"Validation set: {X_val.shape}, {Y_val.shape}")
print(f"Test set: {X_test.shape}, {Y_test.shape}")

Training set: (16332, 1000), (16332, 50)
Validation set: (2042, 1000), (2042, 50)
Test set: (2042, 1000), (2042, 50)
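
The set sizes printed above can be reproduced with a little arithmetic. The sketch assumes that the held-out part of each split is rounded up and the remainder goes to the other part, which is my understanding of how `train_test_split` handles fractional sizes. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) sizes, rounding the held-out part up."""
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 16332 + 2042 + 2042                 # total rows implied by the printed shapes
n_train, n_temp = split_sizes(n_total, 0.2)   # 80% train, 20% temporary
n_val, n_test = split_sizes(n_temp, 0.5)      # temporary split in half

print(n_train, n_val, n_test)
```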


We decode step by step and compute the loss at each time step.

In [11]:
# Use sparse categorical cross entropy as loss function
loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False, reduction='none'
)

def pgn_coverage_loss(final_dist, target_token):
    """
    Computes the per-example cross-entropy loss on the pointer-generator final distribution.
    No separate coverage penalty term is added here.

    Args:
        final_dist (Tensor): Final distribution over the vocabulary (batch_size, vocab_size).
        target_token (Tensor): Ground-truth token IDs (batch_size,).

    Returns:
        Tensor: Per-example loss (batch_size,).
    """
    # Compute the loss for each example in the batch
    loss = loss_obj(target_token, final_dist)  # Shape: (batch_size,)
    return loss

def pgn_accuracy(final_dist, target_token):
    """
    Computes the token-level accuracy.

    Args:
        final_dist (Tensor): Final distribution over the vocabulary (batch_size, vocab_size).
        target_token (Tensor): Ground-truth token IDs (batch_size,).

    Returns:
        Tensor: Accuracy for the batch (scalar).
    """
    # Predicted token is the one with the highest probability
    pred_token = tf.argmax(final_dist, axis=-1, output_type=tf.int32)  # Shape: (batch_size,)

    # Compare with target tokens
    correct = tf.cast(tf.equal(pred_token, target_token), tf.float32)  # Shape: (batch_size,)

    # Compute average accuracy
    accuracy = tf.reduce_mean(correct)  # Scalar
    return accuracy

@tf.function
def train_step(model, optimizer, enc_inp, dec_inp, dec_out):
    """
    Performs a single training step.

    Args:
        model (PointerGenCoverage): The model to train.
        optimizer (tf.keras.optimizers.Optimizer): Optimizer for training.
        enc_inp (Tensor): Encoder input tensor (batch_size, max_enc_len).
        dec_inp (Tensor): Decoder input tensor (batch_size, max_dec_len-1).
        dec_out (Tensor): Decoder output tensor (batch_size, max_dec_len-1).

    Returns:
        Tuple[Tensor, Tensor]: Batch loss and batch accuracy.
    """
    with tf.GradientTape() as tape:
        # Encode the input sequences
        enc_outputs, enc_h, enc_c = model.call_encoder(enc_inp)

        # Initialize decoder states with encoder's final states
        dec_h, dec_c = enc_h, enc_c

        # Initialize coverage vector to zeros
        coverage = tf.zeros((tf.shape(enc_inp)[0], model.max_enc_len))

        # Initialize loss and accuracy
        total_loss = tf.constant(0.0)
        total_accuracy = tf.constant(0.0)

        # Iterate over each time step in the decoder input
        for t in range(dec_inp.shape[1]):
            # Get the decoder input and target for the current time step
            current_token_in = dec_inp[:, t]
            current_token_out = dec_out[:, t]

            # Perform a decoding step
            final_dist, dec_h, dec_c, coverage = model.decode_step(
                current_token_in, dec_h, dec_c, coverage, enc_outputs, enc_inp
            )

            # Compute loss and accuracy
            loss = pgn_coverage_loss(final_dist, current_token_out)  # Shape: (batch_size,)
            acc = pgn_accuracy(final_dist, current_token_out)        # Scalar

            # Accumulate loss and accuracy
            total_loss += tf.reduce_mean(loss)
            total_accuracy += acc

        # Average loss and accuracy over all time steps
        batch_loss = total_loss / tf.cast(dec_inp.shape[1], tf.float32)
        batch_accuracy = total_accuracy / tf.cast(dec_inp.shape[1], tf.float32)

    # Compute gradients and apply them
    variables = model.trainable_variables
    grads = tape.gradient(batch_loss, variables)
    optimizer.apply_gradients(zip(grads, variables))

    return batch_loss, batch_accuracy

def train_epochs(model, optimizer, X_train, Y_train, X_val, Y_val, epochs=5, batch_size=16):
    """
    Trains the model for a specified number of epochs.

    Args:
        model (PointerGenCoverage): The model to train.
        optimizer (tf.keras.optimizers.Optimizer): Optimizer for training.
        X_train (ndarray): Training encoder inputs (num_train, max_enc_len).
        Y_train (ndarray): Training decoder outputs (num_train, max_dec_len).
        X_val (ndarray): Validation encoder inputs (num_val, max_enc_len).
        Y_val (ndarray): Validation decoder outputs (num_val, max_dec_len).
        epochs (int): Number of epochs to train.
        batch_size (int): Size of each training batch.
    """
    # Prepare decoder inputs and outputs by shifting
    dec_in_train = Y_train[:, :-1]  # Decoder input (start with <sos>)
    dec_out_train = Y_train[:, 1:]  # Decoder output (end with <eos>)
    dec_in_val   = Y_val[:, :-1]
    dec_out_val  = Y_val[:, 1:]

    steps_per_epoch = len(X_train) // batch_size
    val_steps       = len(X_val) // batch_size

    for ep in range(1, epochs + 1):
        print(f"=== Epoch {ep}/{epochs} ===")
        total_train_loss = 0.0
        total_train_accuracy = 0.0

        # Shuffle the training data
        idxs = np.random.permutation(len(X_train))
        X_train_shuff = X_train[idxs]
        di_train_shuff = dec_in_train[idxs]
        do_train_shuff = dec_out_train[idxs]

        for step in range(steps_per_epoch):
            start = step * batch_size
            end = (step + 1) * batch_size
            enc_inp_batch = X_train_shuff[start:end]
            dec_inp_batch = di_train_shuff[start:end]
            dec_out_batch = do_train_shuff[start:end]

            # Convert to tensors
            enc_inp_batch = tf.convert_to_tensor(enc_inp_batch, dtype=tf.int32)
            dec_inp_batch = tf.convert_to_tensor(dec_inp_batch, dtype=tf.int32)
            dec_out_batch = tf.convert_to_tensor(dec_out_batch, dtype=tf.int32)

            # Perform a training step
            batch_loss, batch_accuracy = train_step(model, optimizer, enc_inp_batch, dec_inp_batch, dec_out_batch)

            # Accumulate loss and accuracy
            total_train_loss += batch_loss.numpy()
            total_train_accuracy += batch_accuracy.numpy()

            if (step + 1) % 10 == 0:
                print(f"  step {step + 1}/{steps_per_epoch}, loss={batch_loss.numpy():.4f}, accuracy={batch_accuracy.numpy():.4f}")

        # Compute average training loss and accuracy
        avg_train_loss = total_train_loss / steps_per_epoch
        avg_train_accuracy = total_train_accuracy / steps_per_epoch
        print(f"  >> Epoch {ep} train loss: {avg_train_loss:.4f}, train accuracy: {avg_train_accuracy:.4f}")

        # Validation
        total_val_loss = 0.0
        total_val_accuracy = 0.0
        for step in range(val_steps):
            start = step * batch_size
            end = (step + 1) * batch_size
            enc_inp_batch = X_val[start:end]
            dec_inp_batch = dec_in_val[start:end]
            dec_out_batch = dec_out_val[start:end]

            # Convert to tensors
            enc_inp_batch = tf.convert_to_tensor(enc_inp_batch, dtype=tf.int32)
            dec_inp_batch = tf.convert_to_tensor(dec_inp_batch, dtype=tf.int32)
            dec_out_batch = tf.convert_to_tensor(dec_out_batch, dtype=tf.int32)

            # Perform a forward pass without gradient computation
            enc_outputs, enc_h, enc_c = model.call_encoder(enc_inp_batch)
            coverage = tf.zeros((tf.shape(enc_inp_batch)[0], model.max_enc_len))  # Match the actual batch size
            dec_h, dec_c = enc_h, enc_c
            batch_val_loss = 0.0
            batch_val_accuracy = 0.0

            for t in range(dec_inp_batch.shape[1]):
                fin_dist, dec_h, dec_c, coverage = model.decode_step(
                    dec_inp_batch[:, t], dec_h, dec_c, coverage, enc_outputs, enc_inp_batch
                )
                loss = pgn_coverage_loss(fin_dist, dec_out_batch[:, t])
                acc = pgn_accuracy(fin_dist, dec_out_batch[:, t])

                batch_val_loss += tf.reduce_mean(loss).numpy()
                batch_val_accuracy += acc.numpy()

            # Average over time steps
            batch_val_loss /= dec_inp_batch.shape[1]
            batch_val_accuracy /= dec_inp_batch.shape[1]
            total_val_loss += batch_val_loss
            total_val_accuracy += batch_val_accuracy

        # Compute average validation loss and accuracy
        avg_val_loss = total_val_loss / val_steps if val_steps > 0 else 0.0
        avg_val_accuracy = total_val_accuracy / val_steps if val_steps > 0 else 0.0
        print(f"  >> Epoch {ep} val loss: {avg_val_loss:.4f}, val accuracy: {avg_val_accuracy:.4f}\n")
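
Target sequences are padded to a fixed length and the loop above averages over every position, so padding positions count toward both loss and accuracy. A possible refinement, sketched here in plain Python for a single sequence, is to skip positions whose target is the padding ID. `masked_metrics` is a hypothetical helper and the numbers are made up; this is not what the notebook's training loop does.

```python
import math

def masked_metrics(dists, targets, pad_id=0):
    """Mean negative log-likelihood and accuracy over non-padding targets only."""
    total_nll, correct, counted = 0.0, 0, 0
    for dist, target in zip(dists, targets):
        if target == pad_id:
            continue  # Ignore padding positions
        total_nll += -math.log(dist[target] + 1e-9)
        predicted = max(range(len(dist)), key=dist.__getitem__)
        correct += int(predicted == target)
        counted += 1
    if counted == 0:
        return 0.0, 0.0
    return total_nll / counted, correct / counted

# Four decoding steps over a 3-token vocabulary; the last two targets are padding
dists = [
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
    [0.9, 0.05, 0.05],
    [0.9, 0.05, 0.05],
]
targets = [1, 1, 0, 0]
nll, accuracy = masked_metrics(dists, targets)
print(round(nll, 4), accuracy)
```

Counting the two padding steps as well would give an accuracy of 0.75 instead of 0.5, which illustrates how unmasked metrics may look better than the model's behaviour on real tokens.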


Build and summarize the model

In [12]:
def build_and_summarize_model(vocab_size, embed_dim, enc_units, dec_units, max_enc_len, max_dec_len):
    """
    Builds the Pointer-Generator Coverage model and prints its summary.

    Args:
        vocab_size (int): Size of the vocabulary.
        embed_dim (int): Dimension of the embedding vectors.
        enc_units (int): Number of units in the encoder LSTM.
        dec_units (int): Number of units in the decoder LSTM.
        max_enc_len (int): Maximum length of encoder input sequences.
        max_dec_len (int): Maximum length of decoder input sequences.

    Returns:
        model (PointerGenCoverage): Instantiated Pointer-Generator Coverage model.
    """
    # Instantiate the model
    model = PointerGenCoverage(
        vocab_size=vocab_size,
        embed_dim=embed_dim,
        enc_units=enc_units,
        dec_units=dec_units,
        max_enc_len=max_enc_len
    )

    # Define inputs with integer dtype
    enc_inputs = tf.keras.Input(shape=(max_enc_len,), dtype='int32', name='enc_inputs')
    dec_inputs = tf.keras.Input(shape=(max_dec_len,), dtype='int32', name='dec_inputs')

    # Get the model outputs
    outputs = model(enc_inputs, dec_inputs)

    # Create the Keras model
    keras_model = Model(inputs=[enc_inputs, dec_inputs], outputs=outputs, name="PointerGenCoverage")

    # Print the model summary
    keras_model.summary()

    return model


Finally, we can train our model

In [13]:
# Hyperparameters
embedding_dimension = 128
encoder_units = 256
decoder_units = 256
learning_rate = 1e-3
batch = 32
epoch = 5

# Build and summarize the model
pgn_model = build_and_summarize_model(
    vocab_size=max_vocab_size,
    embed_dim=embedding_dimension,
    enc_units=encoder_units,
    dec_units=decoder_units,
    max_enc_len=max_input_length,
    max_dec_len=max_summary_length
)

# Initialize the optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

# Start training
print("\nStarting training with Pointer-Generator + Coverage (custom loop) ...\n")
train_epochs(
    model=pgn_model,
    optimizer=optimizer,
    X_train=X_train,
    Y_train=Y_train,
    X_val=X_val,
    Y_val=Y_val,
    epochs=epoch,
    batch_size=batch
)


Starting training with Pointer-Generator + Coverage (custom loop) ...

=== Epoch 1/5 ===
  step 10/510, loss=2.0172, accuracy=0.7730
  step 20/510, loss=2.1097, accuracy=0.7290
  step 30/510, loss=2.0177, accuracy=0.7073
  step 40/510, loss=1.1709, accuracy=0.7838
  step 50/510, loss=1.2844, accuracy=0.7864
  step 60/510, loss=1.1564, accuracy=0.7927
  step 70/510, loss=1.1718, accuracy=0.7972
  step 80/510, loss=1.1759, accuracy=0.7966
  step 90/510, loss=1.1689, accuracy=0.7997
  step 100/510, loss=1.2015, accuracy=0.7966
  step 110/510, loss=1.1561, accuracy=0.8042
  step 120/510, loss=1.2022, accuracy=0.8042
  step 130/510, loss=1.0816, accuracy=0.8157
  step 140/510, loss=1.0978, accuracy=0.8112
  step 150/510, loss=1.1183, accuracy=0.8131
  step 160/510, loss=1.1392, accuracy=0.8099
  step 170/510, loss=1.1190, accuracy=0.8131
  step 180/510, loss=1.0875, accuracy=0.8080
  step 190/510, loss=1.1424, accuracy=0.8080
  step 200/510, loss=1.1015, accuracy=0.8087
  step 210/510, los

---
## Evaluation
Let's evaluate the model. For decoding, we use beam search.

In [14]:
# Beam search decoding
def beam_search_decode(model, enc_input, tokenizer, beam_width=4, max_dec_steps=60):
    """
    enc_input: shape (1, max_enc_len)
    Return list of token IDs for best summary
    """
    # Encode
    enc_outputs, enc_h, enc_c = model.call_encoder(enc_input)
    coverage = tf.zeros(shape=(1, model.max_enc_len))

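    # Start/end tokens. clean_text and the Tokenizer's default filters strip the
    # angle brackets, so the markers are stored as 'sos' and 'eos'.
    sos_id = tokenizer.word_index.get('<sos>', tokenizer.word_index.get('sos', 1))
    eos_id = tokenizer.word_index.get('<eos>', tokenizer.word_index.get('eos', 2))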

    initial_beam = (0.0, enc_h, enc_c, coverage, [sos_id])
    beams = [initial_beam]

    for _ in range(max_dec_steps):
        new_beams = []
        for logp, h, c, cov, tokens in beams:
            if tokens[-1] == eos_id:
                # Already ended
                new_beams.append((logp, h, c, cov, tokens))
                continue

            x_t = tf.constant(tokens[-1], shape=(1,))  # (1,)
            final_dist, new_h, new_c, new_cov = model.decode_step(
                x_t, h, c, cov, enc_outputs, enc_input
            )
            final_dist = final_dist[0].numpy()  # => shape (vocab_size,)

            # topk
            top_ids = np.argsort(final_dist)[-beam_width:]
            for tid in top_ids:
                prob = final_dist[tid]
                new_logp = logp + np.log(prob + 1e-9)
                new_beams.append((new_logp, new_h, new_c, new_cov, tokens + [int(tid)]))

        # Sort
        new_beams.sort(key=lambda x: x[0], reverse=True)
        beams = new_beams[:beam_width]

    return beams[0][-1]  # best tokens

def ids_to_text(ids_, tokenizer):
    rev_dict = {v:k for k,v in tokenizer.word_index.items()}
    words = []
    for i in ids_:
        if i==0: break
        w = rev_dict.get(i, '<UNK>')
        if w in ['<sos>', '<eos>', '<UNK>']:
            continue
        words.append(w)
    return " ".join(words)
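
To see the expand, sort, and truncate loop of `beam_search_decode` in isolation, here is a minimal sketch that runs the same idea over a small hand-written next-token table instead of the trained model. The table `NEXT_TOKEN_PROBS`, the token names, and the helper `toy_beam_search` are hypothetical and exist only for illustration; they are not part of the project code.

```python
import math

# Hypothetical next-token distributions keyed by the previous token
# (an assumption for illustration, standing in for model.decode_step).
NEXT_TOKEN_PROBS = {
    "<sos>": {"trump": 0.6, "white": 0.4},
    "trump": {"fires": 0.5, "says": 0.5},
    "white": {"house": 0.9, "<eos>": 0.1},
    "fires": {"<eos>": 1.0},
    "says": {"<eos>": 1.0},
    "house": {"<eos>": 1.0},
}

def toy_beam_search(beam_width=2, max_steps=5):
    """Return the highest-scoring token sequence found with the given beam width."""
    beams = [(0.0, ["<sos>"])]  # (cumulative log-probability, tokens)
    for _ in range(max_steps):
        candidates = []
        for logp, tokens in beams:
            if tokens[-1] == "<eos>":
                # Finished hypotheses are carried over unchanged
                candidates.append((logp, tokens))
                continue
            for tok, p in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((logp + math.log(p), tokens + [tok]))
        # Keep only the beam_width best hypotheses
        candidates.sort(key=lambda b: b[0], reverse=True)
        beams = candidates[:beam_width]
        if all(t[-1] == "<eos>" for _, t in beams):
            break
    return beams[0][1]

print(toy_beam_search(beam_width=1))  # greedy: follows "trump" (0.6) first
print(toy_beam_search(beam_width=2))  # keeps "white" alive and finds the likelier path
```

With a beam width of 1 the search is greedy and commits to the locally best first word, ending with total probability 0.30. With a width of 2 it keeps the second-best first word alive and finds a path with total probability 0.36, which is the reason for using a beam instead of greedy decoding.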

Finally, evaluate the model with the ROUGE metric.

In [15]:
pip install rouge-score

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=54361ece3ed788cef0f7637a4d66508951f091a8bf84f24420ee12fb2a39f3ac
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [16]:
from rouge_score import rouge_scorer

In [17]:
BEAM_WIDTH        = 4
MAX_DEC_STEPS     = 60   # for beam search

Evaluate with the ROUGE metric on randomly sampled test examples:

In [19]:
def evaluate_rouge(model, X_data, Y_data, tokenizer, num_samples=5):
    scorer = rouge_scorer.RougeScorer(['rouge1','rouge2','rougeL'], use_stemmer=True)
    idxs = np.random.choice(len(X_data), size=min(num_samples, len(X_data)), replace=False)

    r1_list, r2_list, rl_list = [], [], []
    for idx in idxs:
        enc_input = X_data[idx:idx+1]
        ref_ids   = Y_data[idx]
        ref_text  = ids_to_text(ref_ids, tokenizer)

        pred_ids  = beam_search_decode(model, enc_input, tokenizer, beam_width=BEAM_WIDTH, max_dec_steps=MAX_DEC_STEPS)
        pred_text = ids_to_text(pred_ids, tokenizer)

        scores = scorer.score(ref_text, pred_text)
        r1_list.append(scores['rouge1'].fmeasure)
        r2_list.append(scores['rouge2'].fmeasure)
        rl_list.append(scores['rougeL'].fmeasure)

        print("\nSample Reference:", ref_text)
        print("Sample Prediction:", pred_text)

    print(f"\nAverage over {len(r1_list)} samples:")
    print("  ROUGE-1:", np.mean(r1_list))
    print("  ROUGE-2:", np.mean(r2_list))
    print("  ROUGE-L:", np.mean(rl_list))

# Let’s do a small test
print("\nEvaluating on a small sample from test data with beam search ...")
evaluate_rouge(pgn_model, X_test, Y_test, tokenizer, num_samples=100)


Evaluating on a small sample from test data with beam search ...

Sample Reference: sos riot police hooded youths clash in paris at labor reform protest eos
Sample Prediction: protest with protest in paris protest reforms eos

Sample Reference: sos trump fires campaign manager in shakeup for election push eos
Sample Prediction: trump fired campaign manager who helped appeal to appeal eos

Sample Reference: sos venezuela vp makes clearest indication yet that maduro will run in 2018 eos
Sample Prediction: venezuela says hopes nicolas maduro will be reelected as president eos

Sample Reference: sos u s lawmakers say afghanistan corruption threatens future spending eos
Sample Prediction: u s senators seek corruption in afghanistan eos

Sample Reference: sos white house names new energy climate adviser at national security council eos
Sample Prediction: white house named new adviser to obama climate adviser spokesman eos

Sample Reference: sos white house says trump will announce fed chair

---
## Analysis
**ROUGE-1 (0.4363):**
- The average unigram F-measure is about 0.436. It balances how many reference words the generated summaries recover (recall) against how many generated words also appear in the reference (precision).
- This indicates a moderate level of word overlap. The model captures a good share of the key terms in the reference summaries, but it still misses or substitutes important words.

**ROUGE-2 (0.1674):**
- The average bigram F-measure is about 0.167.
- Bigram overlap is typically lower than unigram overlap because two consecutive words must match. A ROUGE-2 score of 0.1674 suggests that the model often picks the right words but rarely reproduces the reference phrasing, which is consistent with the repeated or misplaced words in some sample predictions (for example "protest with protest in paris protest reforms").

**ROUGE-L (0.4137):**
- The average F-measure based on the longest common subsequence (LCS) between the generated and reference summaries is about 0.414.
- ROUGE-L rewards words that appear in the same order as in the reference, even when they are not adjacent. A score of 0.4137, close to the ROUGE-1 score, suggests that most of the overlapping words also follow the reference order. It does not measure fluency or coherence directly.

**The best averages we obtained were:**

  ROUGE-1: 0.48847994111152004

  ROUGE-2: 0.16705882352941176

  ROUGE-L: 0.45491350754508647
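
The scores above are F-measures, so it can help to see how one is computed for a single pair. The sketch below is a simplified re-implementation, assuming whitespace tokenization and no stemming, so its numbers will not exactly match `rouge_score` with `use_stemmer=True`. The helpers `rouge1_f`, `lcs_length`, `rougeL_f`, and `strip_markers` are hypothetical names introduced here. It also checks one thing the printed samples suggest: the marker tokens appear as plain `sos` and `eos` in the scored strings, so they seem to count as matching words.

```python
from collections import Counter

def rouge1_f(reference, prediction):
    """Unigram F-measure with clipped counts (no stemming)."""
    ref, pred = reference.split(), prediction.split()
    overlap = sum((Counter(ref) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rougeL_f(reference, prediction):
    """LCS-based F-measure (no stemming)."""
    ref, pred = reference.split(), prediction.split()
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def strip_markers(text, markers=("sos", "eos")):
    """Drop the sequence marker words before scoring."""
    return " ".join(w for w in text.split() if w not in markers)

# First sample pair from the evaluation output above
ref = "sos riot police hooded youths clash in paris at labor reform protest eos"
pred = "protest with protest in paris protest reforms eos"

print(rouge1_f(ref, pred))                                # 8/21, about 0.381
print(rouge1_f(strip_markers(ref), strip_markers(pred)))  # 1/3, about 0.333
print(rougeL_f(ref, pred))                                # 8/21, about 0.381
```

In this sketch the shared `eos` word raises the score of the pair from about 0.333 to about 0.381. If the same happens in the real evaluation, removing the marker words before scoring would give a slightly lower but more faithful estimate.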



---
## Contributors
Faruk KAPLAN - 21050111026

Mert ALTEKİN - 21050111065