# Deep Learning: Summary Generation

## Transformers

Transformers are a neural network architecture introduced in 2017 that uses self-attention to process all positions of a sequence in parallel rather than step by step. This lets them capture long-range relationships within sequences, and they serve as the foundation for state-of-the-art models like **BERT** and **GPT**.
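To make the self-attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention for a single sequence. It is an illustration only: real Transformer layers add learned query/key/value projections, multiple heads, and masking, and the toy vectors below are made-up numbers.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this position to every position, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all value vectors
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values))
                        for i in range(len(values[0]))])
    return outputs, weights

# Toy sequence of three 2-dimensional token vectors (hypothetical values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
```

Each row of `weights` sums to 1 and says how strongly one position attends to every other position. Every row can be computed independently, which is what allows the parallel processing described above.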


## T5 Model

The **Text-to-Text Transfer Transformer (T5)** treats every NLP task as a text-to-text problem, representing both inputs and outputs as text strings. Key features include:

- **Unified Framework**: All tasks are framed as text generation, simplifying model usage.
- **Pretraining**: The model is pretrained on a large, diverse text corpus (the Colossal Clean Crawled Corpus, C4) before being adapted to downstream tasks.
- **Fine-Tuning**: It can be adapted to specific datasets to improve performance.
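The unified framing can be sketched as a small helper that turns different tasks into plain input strings by prepending a task prefix. The helper name `to_text_to_text` and its task keys are hypothetical; the prefix strings follow the conventions used by the released T5 checkpoints.

```python
def to_text_to_text(task, **fields):
    """Format a task as a single input string in T5's text-to-text style."""
    if task == "summarize":
        return f"summarize: {fields['text']}"
    if task == "translate_en_de":
        return f"translate English to German: {fields['text']}"
    if task == "qa":
        return f"question: {fields['question']} context: {fields['context']}"
    raise ValueError(f"Unknown task: {task}")

summarization_input = to_text_to_text("summarize", text="A long news article ...")
qa_input = to_text_to_text("qa", question="Who wrote it?", context="It was written by Ada.")
```

In every case the model receives a string and is expected to produce a string, so the same model, loss, and decoding procedure can serve all tasks.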

## Summary Generation

**Summary generation** refers to a model's ability to create coherent summaries from various texts, including unseen content. This is crucial for applications such as document summarization and content curation.

In this notebook, we will explore how to leverage the **T5 model** for summary generation using the **T5-SQuAD** and **CNN/DailyMail** datasets.


## Data Understanding

This notebook uses two datasets to test summary generation.

### T5-SQuAD Dataset

The **T5-SQuAD dataset** is an adaptation of the Stanford Question Answering Dataset (SQuAD) tailored for training T5 models in summary generation tasks. It includes:

- **Question-Answer Pairs**: Each example consists of a question and its corresponding answer derived from a passage.
- **Text-to-Text Format**: Questions serve as inputs while answers are outputs, aligning with T5’s text-to-text framework.

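In this notebook, the passage (`context`) of each example serves as the text to summarize. SQuAD does not include reference summaries, so it is mainly useful for exploring T5’s summarization behavior qualitatively, as well as for question-answering tasks.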

### CNN/DailyMail:3.0.0 Dataset

The **CNN/DailyMail:3.0.0** dataset is widely used in natural language processing (NLP), particularly for text summarization. It consists of over 300,000 news articles paired with human-written summaries from CNN and Daily Mail news sites.
Each record includes:
  - **article**: The full news article text.
  - **highlights**: A human-written summary, typically a few sentences.
  - **id**: A unique identifier for each article.

This dataset is primarily used for training and evaluating models on both abstractive and extractive summarization tasks.

## Summary Generation with T5-SQuAD

In [2]:
!pip install trax jax jaxlib tensorflow-datasets
!pip install datasets
!pip install transformers
!pip install evaluate
!pip install rouge_score
!pip install t5
!pip install torch

Collecting trax
  Downloading trax-1.4.1-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting funcsigs (from trax)
  Downloading funcsigs-1.0.2-py2.py3-none-any.whl.metadata (14 kB)
Collecting tensorflow-text (from trax)
  Downloading tensorflow_text-2.18.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.8 kB)
Collecting tensorflow<2.19,>=2.18.0 (from tensorflow-text->trax)
  Downloading tensorflow-2.18.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting tensorboard<2.19,>=2.18 (from tensorflow<2.19,>=2.18.0->tensorflow-text->trax)
  Downloading tensorboard-2.18.0-py3-none-any.whl.metadata (1.6 kB)
Collecting keras>=3.5.0 (from tensorflow<2.19,>=2.18.0->tensorflow-text->trax)
  Downloading keras-3.6.0-py3-none-any.whl.metadata (5.8 kB)
Downloading trax-1.4.1-py2.py3-none-any.whl (637 kB)
Downloading func

In [3]:
import trax
import trax.layers as tl
import trax.models.transformer as transformer
import tensorflow_datasets as tfds

In [4]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


Define the tokenizer and detokenizer

In [5]:
from transformers import AutoTokenizer

# Load the tokenizer (assuming T5-small)
tokenizer = AutoTokenizer.from_pretrained("t5-small")

def tokenize(text):
  """Tokenizes the input text using the T5-small tokenizer.

  Args:
    text: The input text to be tokenized.

  Returns:
    A list of token IDs.
  """

  # encode() already returns a plain list of token IDs, so no tensor round-trip is needed
  return tokenizer.encode(text)

def detokenize(token_ids):
  """Detokenizes a list of token IDs into text.

  Args:
    token_ids: A list of token IDs.

  Returns:
    The decoded text.
  """

  return tokenizer.decode(token_ids, skip_special_tokens=True)


Load the SQuAD dataset

In [6]:
data_dir = "/content/drive/MyDrive/T5_SquaD"
(train_ds, validation_ds), ds_info = tfds.load(
    "squad",
    data_dir=data_dir,
    split=["train", "validation"],
    shuffle_files=True,
    with_info=True,
)

Downloading and preparing dataset 33.51 MiB (download: 33.51 MiB, generated: 94.06 MiB, total: 127.58 MiB) to /content/drive/MyDrive/T5_SquaD/squad/v1.1/3.0.0...



Dataset squad downloaded and prepared to /content/drive/MyDrive/T5_SquaD/squad/v1.1/3.0.0. Subsequent calls will reuse this data.


Load Model

In [13]:
import torch
import tensorflow_datasets as tfds
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "t5-small"  # Smallest T5 checkpoint; t5-base is tried in a later cell
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)


Define the summarization function

In [14]:
# Set the device to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

In [15]:
def summarize_text(text):
    """Generate a more informative summary for a given input text."""
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", max_length=512, truncation=True).to(device)

    # Generate the summary with enhanced parameters
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=80,          # Allowing for a more detailed summary
        num_beams=6,            # Higher beam width for better quality
        early_stopping=True,
        length_penalty=1.2      # Favor longer summaries slightly
    )

    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
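The effect of `length_penalty` can be illustrated with a small sketch. It assumes that beam search ranks finished hypotheses by the sum of token log-probabilities divided by `length ** length_penalty`, which is my understanding of how Hugging Face's beam search normalizes scores; the exact length term may differ between library versions, and the log-probabilities below are made up.

```python
def beam_score(token_log_probs, length_penalty=1.0):
    """Length-normalized score of one hypothesis (assumed formula, see lead-in)."""
    return sum(token_log_probs) / (len(token_log_probs) ** length_penalty)

# Two hypothetical hypotheses with the same per-token log-probability
short_hyp = [-0.5] * 2   # 2 tokens, total log-prob -1.0
long_hyp = [-0.5] * 6    # 6 tokens, total log-prob -3.0

for penalty in (0.0, 1.0, 1.2):
    s = beam_score(short_hyp, penalty)
    l = beam_score(long_hyp, penalty)
    print(f"length_penalty={penalty}: short={s:.3f}, long={l:.3f}")
```

Because log-probabilities are negative, a larger exponent divides the total by a larger number and moves long hypotheses closer to zero. Under this assumption, `length_penalty=0.0` favors the short hypothesis, `1.0` treats the two equally, and `1.2` slightly favors the long one.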

Generate summary example

In [16]:
squad_data, ds_info = tfds.load("squad", split="train", with_info=True)

# Extract one example and preprocess it
example = next(iter(squad_data))
context = example['context'].numpy().decode("utf-8")
target = example['answers']['text'][0].numpy().decode("utf-8")

print("Context:", context)
print("Expected Summary (Target):", target)

# Generate and print the summary using the new function
generated_summary = summarize_text(context)
print("\nGenerated Summary:")
print(generated_summary)

Context: The difference in the above factors for the case of θ=0 is the reason that most broadcasting (transmissions intended for the public) uses vertical polarization. For receivers near the ground, horizontally polarized transmissions suffer cancellation. For best reception the receiving antennas for these signals are likewise vertically polarized. In some applications where the receiving antenna must work in any position, as in mobile phones, the base station antennas use mixed polarization, such as linear polarization at an angle (with both vertical and horizontal components) or circular polarization.
Expected Summary (Target): mobile phones

Generated Summary:
the difference in the above factors for the case of =0 is the reason that most broadcasting (transmissions intended for the public) uses vertical polarization. for best reception the receiving antennas for these signals suffer cancellation.


Evaluate the summary

In [17]:
from rouge_score import rouge_scorer

# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

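# Calculate ROUGE scores. score() takes (target, prediction); SQuAD has no
# reference summary, so the source context stands in as the target here and the
# scores measure overlap with the source text rather than summary quality.
scores = scorer.score(context, generated_summary)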
print("\nROUGE Scores:")
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.4f}")
print(f"ROUGE-2: {scores['rouge2'].fmeasure:.4f}")
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.4f}")


ROUGE Scores:
ROUGE-1: 0.5854
ROUGE-2: 0.5620
ROUGE-L: 0.5528
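To show what these numbers measure, here is a simplified sketch of ROUGE-1 as unigram overlap between a reference and a prediction. It is not the `rouge_score` implementation: it splits on whitespace and skips stemming, and the example sentences are made up.

```python
from collections import Counter

def rouge1(reference, prediction):
    """Simplified ROUGE-1: precision, recall, and F1 over unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the cat sat on the mat"
prediction = "the cat sat"
precision, recall, f1 = rouge1(reference, prediction)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=1.00 recall=0.50 f1=0.67
```

A short, mostly extractive summary scored against its own source text behaves like this example: nearly every predicted word appears in the reference, so precision is high, while recall falls as the summary gets shorter.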


Summarize with t5-base and the "summarize: " prefix

In [18]:
import torch
import tensorflow_datasets as tfds
from transformers import T5Tokenizer, T5ForConditionalGeneration
from rouge_score import rouge_scorer

# Load the T5 model and tokenizer
model_name = "t5-base"  # Use t5-base instead of t5-small
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Set the device to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load one raw example directly from the SQuAD dataset for testing
squad_data, ds_info = tfds.load("squad", split="train", with_info=True)

# Extract one example and preprocess it
example = next(iter(squad_data))
context = example['context'].numpy().decode("utf-8")
target = example['answers']['text'][0].numpy().decode("utf-8")

# Tokenize manually for T5
input_text = f"summarize: {context}"
input_ids = tokenizer(input_text, return_tensors="pt", padding="max_length", max_length=512, truncation=True).input_ids
target_ids = tokenizer(target, return_tensors="pt", padding="max_length", max_length=128, truncation=True).input_ids

# Move input_ids to the same device as the model
input_ids = input_ids.to(device)

print("Context:", context)
#print("Expected Summary (Target):", target)
#print("\nTokenized Inputs:", input_ids)
#print("Tokenized Targets:", target_ids)

# Generate a summary from the model
try:
    generated_summary_ids = model.generate(input_ids)
    generated_summary = tokenizer.decode(generated_summary_ids[0], skip_special_tokens=True)
    print("\nGenerated Summary:")
    print(generated_summary)

    # Evaluate using ROUGE
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(context, generated_summary)

    print("\nROUGE Scores:")
    print(f"ROUGE-1: {scores['rouge1'].fmeasure:.4f}")
    print(f"ROUGE-2: {scores['rouge2'].fmeasure:.4f}")
    print(f"ROUGE-L: {scores['rougeL'].fmeasure:.4f}")

except Exception as e:
    print(f"Error during summary generation: {e}")


Context: The difference in the above factors for the case of θ=0 is the reason that most broadcasting (transmissions intended for the public) uses vertical polarization. For receivers near the ground, horizontally polarized transmissions suffer cancellation. For best reception the receiving antennas for these signals are likewise vertically polarized. In some applications where the receiving antenna must work in any position, as in mobile phones, the base station antennas use mixed polarization, such as linear polarization at an angle (with both vertical and horizontal components) or circular polarization.

Generated Summary:
most broadcasting uses vertical polarization. horizontally polarized transmissions suffer cancellation

ROUGE Scores:
ROUGE-1: 0.2062
ROUGE-2: 0.1474
ROUGE-L: 0.2062


## Summary Generation with CNN/DailyMail

### With T5-small Model

Load dataset

In [None]:
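from datasets import DatasetDict, load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")

# Keep a reproducible 10% sample of each split to shorten run time
train_subset = dataset["train"].shuffle(seed=42).select(range(int(0.1 * len(dataset["train"]))))
val_subset = dataset["validation"].shuffle(seed=42).select(range(int(0.1 * len(dataset["validation"]))))
test_subset = dataset["test"].shuffle(seed=42).select(range(int(0.1 * len(dataset["test"]))))

# Bundle the subsets into the DatasetDict used in the following cells
small_dataset = DatasetDict({"train": train_subset, "validation": val_subset, "test": test_subset})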


In [None]:
small_dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 28711
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 1336
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 1149
    })
})

Preprocess the Data

In [None]:
from transformers import T5Tokenizer

# Load the T5 tokenizer that matches the t5-small model used in this section
tokenizer = T5Tokenizer.from_pretrained("t5-small")


def preprocess_data(example):
    inputs = ["summarize: " + article for article in example["article"]]
    targets = example["highlights"]

    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=128, truncation=True, padding="max_length")

    # Attach the target token IDs as labels; values are plain Python lists, not tensors
    model_inputs["labels"] = labels["input_ids"]
    return {"input_ids": model_inputs["input_ids"], "attention_mask": model_inputs["attention_mask"], "labels": model_inputs["labels"]}

tokenized_small_dataset = small_dataset.map(preprocess_data, batched=True, remove_columns=["id", "article", "highlights"])
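If the tokenized dataset is later used for fine-tuning, padded label positions are usually excluded from the loss. The sketch below assumes the common convention of replacing the pad token ID (0 for T5) with -100, the label value that PyTorch's cross-entropy loss ignores by default. The helper name `mask_padding` and the token IDs are hypothetical.

```python
PAD_TOKEN_ID = 0      # T5's pad token ID
IGNORE_INDEX = -100   # label value ignored by PyTorch's cross-entropy loss

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding positions in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]

# Two hypothetical padded label sequences (1 is T5's end-of-sequence ID)
labels = [[37, 1712, 1, 0, 0], [8, 1, 0, 0, 0]]
masked_labels = mask_padding(labels)
```

Without this step, a model trained on labels padded to `max_length=128` would also be rewarded for predicting padding tokens. This notebook only runs inference, so the step is not needed for the results shown here.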




Initialize the model

In [None]:
from transformers import T5ForConditionalGeneration

# Load the T5 model
model = T5ForConditionalGeneration.from_pretrained("t5-small")

Define the summary generation function

In [None]:
def generate_summary(example):
    # Add "summarize:" prefix to indicate the task for T5
    inputs = "summarize: " + example["article"]

    # Tokenize the input text
    input_ids = tokenizer(inputs, return_tensors="pt", max_length=512, truncation=True).input_ids

    # Generate summary using model
    summary_ids = model.generate(input_ids, max_length=128, num_beams=4, length_penalty=2.0, early_stopping=True)

    # Decode the generated summary back to text
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return {"article": example["article"], "generated_summary": summary, "reference_summary": example["highlights"]}

# Apply generate_summary to the 10% validation subset (1,336 articles)
sample_summaries = small_dataset["validation"].map(generate_summary)

# Display some example summaries
for i in range(3):
    print(f"Article:\n{sample_summaries[i]['article']}\n")
    print(f"Generated Summary:\n{sample_summaries[i]['generated_summary']}\n")
    print(f"Reference Summary:\n{sample_summaries[i]['reference_summary']}\n")
    print("-" * 50)


Article:
Jarryd Hayne's move to the NFL is a boost for rugby league in the United States, it has been claimed. The Australia international full-back or centre quit the National Rugby League in October to try his luck in American football and was this week given a three-year contract with the San Francisco 49ers. Peter Illfield, chairman of US Association of Rugby League, said: 'Jarryd, at 27, is one of the most gifted and talented rugby league players in Australia. He is an extraordinary athlete. Jarryd Hayne (right) has signed with the San Francisco 49ers after quitting the NRL in October . Hayne, who played rugby league for Australia, has signed a three year contract with the 49ers . 'His three-year deal with the 49ers, as an expected running back, gives the USA Rugby League a connection with the American football lover like never before. 'Jarryd's profile and playing ability will bring our sport to the attention of many. It also has the possibility of showing the American college at

Evaluate results

In [None]:
import evaluate
rouge = evaluate.load("rouge")

results = small_dataset["validation"].map(generate_summary)

generated_summaries = [result["generated_summary"] for result in results]
reference_summaries = [result["reference_summary"] for result in results]

# Compute ROUGE scores
rouge_output = rouge.compute(predictions=generated_summaries, references=reference_summaries)

# Print ROUGE scores. evaluate's "rouge" metric returns one aggregated
# F-measure per ROUGE type, not separate precision and recall values.
print("ROUGE Scores:")
print(f"ROUGE-1 F1: {rouge_output['rouge1']:.4f}")
print(f"ROUGE-2 F1: {rouge_output['rouge2']:.4f}")
print(f"ROUGE-L F1: {rouge_output['rougeL']:.4f}")

ROUGE Scores:
ROUGE-1 F1: 0.3955
ROUGE-2 F1: 0.1834
ROUGE-L F1: 0.2815


These results (ROUGE-1 ≈ 0.40, ROUGE-2 ≈ 0.18, ROUGE-L ≈ 0.28 on the 10% validation subset) indicate that the pretrained model, used here without any fine-tuning, captures much of the relevant content but leaves significant room for improvement in generating contextually rich and coherent summaries. Future work may involve fine-tuning the model on the training subset or adjusting generation hyperparameters to enhance summarization quality.

### With BART Model