#Code by - Vanshika Gupta

# Dataset Metadata
- Dataset Name: Newspaper Text Summarization (CNN/DailyMail)

- Source: Kaggle

- Description: A dataset containing news articles from CNN and DailyMail, paired with human-written summaries. Commonly used for text summarization tasks.

- Features:

  - Articles: Full news articles.

  - Summaries: Concise summaries of the articles.

  - Use Case: Training and evaluating summarization models (e.g., extractive or abstractive summarization).

In [None]:
pip install transformers datasets torch rouge-score nltk evaluate

In [4]:
#Importing required libraries and packages

#Data manipulation, numerical calculations and handle randomness
import random
import numpy as np
import pandas as pd

#Loading datasets from Hugging Face (CNN/DailyMail dataset)
from datasets import load_dataset

#Importing PyTorch for tensor operations and model training
import torch

#Importing evaluation metrics (ROUGE, etc.)
import evaluate

#Importing DataLoader for batch processing
from torch.utils.data import DataLoader

#Importing NLTK for text processing and sentence tokenization
import nltk
nltk.download('punkt')  # Download necessary tokenizer data for NLTK

#Importing T5 and BART tokenizers & models for fine-tuning
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import BartTokenizer, BartForConditionalGeneration

#Trainer and TrainingArguments for fine-tuning models using Hugging Face's Trainer API
from transformers import Trainer, TrainingArguments

#DataCollator to handle padding dynamically during batch training
from transformers import DataCollatorForSeq2Seq

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [5]:
rouge = evaluate.load("rouge")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [6]:
#Loading the dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")

# Reduce dataset size for faster training
train_data = dataset["train"].shuffle(seed=42).select(range(len(dataset["train"]) // 10))
test_data = dataset["test"].shuffle(seed=42).select(range(len(dataset["test"]) // 10))


README.md:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [7]:
#Loading Tokenizers
t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")
bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

In [8]:
print(train_data.column_names)

['article', 'highlights', 'id']


In [9]:
#Preprocessing Function
def preprocess_function(examples, tokenizer, model_type):
    prefix = "summarize: " if model_type == "t5" else ""  # Prefix for T5
    inputs = [prefix + text for text in examples["article"]]  # Use "article"

    #Tokenizing inputs
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

    #Tokenizing labels (highlights)
    labels = tokenizer(examples["highlights"], max_length=150, truncation=True, padding="max_length")

    #Handling padding for T5 model (replace pad tokens with -100)
    if model_type == "t5":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


Tokenizing the dataset for both T5 and BART models using the preprocess_function. It applies the function to both training and test datasets.

In [10]:
# Tokenizing datasets
train_dataset_t5 = train_data.map(lambda x: preprocess_function(x, t5_tokenizer, "t5"), batched=True,batch_size=1000)

Map:   0%|          | 0/28711 [00:00<?, ? examples/s]

In [11]:
test_dataset_t5 = test_data.map(lambda x: preprocess_function(x, t5_tokenizer, "t5"), batched=True,batch_size=1000)

Map:   0%|          | 0/1149 [00:00<?, ? examples/s]

In [12]:
train_dataset_bart = train_data.map(lambda x: preprocess_function(x, bart_tokenizer, "bart"), batched=True,batch_size=1000)

Map:   0%|          | 0/28711 [00:00<?, ? examples/s]

In [13]:
test_dataset_bart = test_data.map(lambda x: preprocess_function(x, bart_tokenizer, "bart"), batched=True,batch_size=1000)

Map:   0%|          | 0/1149 [00:00<?, ? examples/s]

In [14]:
#Data Collator
t5_data_collator = DataCollatorForSeq2Seq(tokenizer=t5_tokenizer)
bart_data_collator = DataCollatorForSeq2Seq(tokenizer=bart_tokenizer)

The collator handles batching and padding dynamically during training.

#T5 Model

In [16]:
#Loading the Model
t5_model = T5ForConditionalGeneration.from_pretrained("t5-small")

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [26]:
#Optimized Training Arguments
t5_training_args = TrainingArguments(
    output_dir="./t5_summarization",  # Save model outputs
    evaluation_strategy="no",          # Disable evaluation to save time
    save_strategy="no",                # Don't save checkpoints (avoids I/O delays)
    learning_rate=5e-4,                 # Higher learning rate for faster convergence
    per_device_train_batch_size=16,     # Increase batch size to utilize GPU power
    weight_decay=0.01,
    num_train_epochs=1,                 # Single pass over dataset
    logging_dir="./logs",
    logging_steps=1000,                 # Log less frequently to reduce overhead
    report_to="none",                   # Disable logging to W&B or other tools
    fp16=True,                          # Use mixed precision for faster training
    dataloader_num_workers=4,           # Speed up data loading
)




In [27]:
#Trainer
t5_trainer = Trainer(
    model=t5_model,                   # The T5 model we loaded
    args=t5_training_args,             # Training arguments
    train_dataset=train_dataset_t5,       # Training data
    eval_dataset=test_dataset_t5,         # Evaluation data
    tokenizer=t5_tokenizer,            # Tokenizer for text processing
    data_collator=t5_data_collator,    # Collator to handle padding/batching
    compute_metrics=lambda pred: {"generated_text": pred.predictions}  # Use generation in evaluation
)


  t5_trainer = Trainer(


In [28]:
import os
os.environ["WANDB_DISABLED"] = "true"


In [29]:
!pip uninstall -y wandb


[0m

In [30]:
from transformers.integrations import WandbCallback
t5_trainer.remove_callback(WandbCallback)

In [31]:
#Train
t5_trainer.train()



Step,Training Loss
1000,2.0974


TrainOutput(global_step=1795, training_loss=2.0845696324425487, metrics={'train_runtime': 658.3583, 'train_samples_per_second': 43.61, 'train_steps_per_second': 2.726, 'total_flos': 3885798462062592.0, 'train_loss': 2.0845696324425487, 'epoch': 1.0})

#Results and Inferences
1. Training Progress:

- The model completed 1 epoch of training, consisting of 1795 steps, in approximately 10 minutes and 57 seconds.

- The training process was stable, and the model successfully processed the dataset.

2. Training Loss:

- The final training loss was 2.0845, which indicates that the model is learning and improving over time.

- At step 1000, the training loss was 2.0974, showing a gradual decrease in loss as training progressed.

3. Training Efficiency:

- The model achieved an average training speed of 43.61 samples per second and 2.726 steps per second.

- The total computational cost of training was 3.8858e+15 FLOPs, reflecting the complexity of the model and dataset.

4. Key Observations:

- The model demonstrated consistent learning, with the training loss decreasing steadily over the course of the epoch.

- The training speed and efficiency are within expected ranges for a model of this complexity and dataset size.

## BART Model

In [32]:
#Loading the Model
bart_model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]

In [34]:
#Optimized Training Arguments
bart_training_args = TrainingArguments(
    output_dir="./bart_summarization",
    evaluation_strategy="no",
    save_strategy="no",
    learning_rate=5e-4,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    num_train_epochs=1,
    logging_dir="./logs",
    logging_steps=1000,
    report_to="none",
    fp16=True,
    dataloader_num_workers=4,
)



In [35]:
#Trainer
bart_trainer = Trainer(
    model=bart_model,
    args=bart_training_args,
    train_dataset=train_dataset_bart,
    eval_dataset=test_dataset_bart,
    tokenizer=bart_tokenizer,
    data_collator=bart_data_collator,
    compute_metrics=lambda pred: {"generated_text": pred.predictions}
)

  bart_trainer = Trainer(


In [36]:
#Train
bart_trainer.train()



Step,Training Loss
1000,1.6142


TrainOutput(global_step=1795, training_loss=1.4074836242165738, metrics={'train_runtime': 750.7513, 'train_samples_per_second': 38.243, 'train_steps_per_second': 2.391, 'total_flos': 8753071726264320.0, 'train_loss': 1.4074836242165738, 'epoch': 1.0})

#Results and Inferences

1. Training Loss:

- The training loss decreased steadily over the course of the training process.

- At step 1000, the training loss was recorded at 1.614200.

- By the end of training (step 1795), the final training loss was 1.4075, indicating that the model was learning effectively and converging.

2. Training Efficiency:

- The total training runtime was 750.75 seconds (~12.5 minutes).

- The training speed was 38.24 samples per second and 2.39 steps per second, which demonstrates efficient utilization of computational resources.

3. Computational Effort:

- The model processed a total of 8.75 x 10¹⁵ floating-point operations (FLOPs) during training, highlighting the complexity of the task and the computational effort required.

4. Convergence:

- The steady decrease in training loss suggests that the model was able to learn meaningful patterns from the dataset.

- The final training loss of 1.4075 indicates that the model achieved reasonable performance, though further training or hyperparameter tuning could potentially improve results.

5. Key Takeaways:
The model successfully completed training and showed consistent improvement in reducing the training loss.

- The training process was computationally intensive, as evidenced by the high number of FLOPs and the runtime.

- While the model performed well, there is room for further optimization, such as:

  - Training for additional epochs to achieve better convergence
  - Fine-tuning hyperparameters (e.g., learning rate, batch size) to improve performance.
  - Exploring techniques like learning rate scheduling or gradient clipping to stabilize training.

In [38]:
print(test_data[0])


{'article': '(CNN) I see signs of a revolution everywhere. I see it in the op-ed pages of the newspapers, and on the state ballots in nearly half the country. I see it in politicians who once preferred to play it safe with this explosive issue but are now willing to stake their political futures on it. I see the revolution in the eyes of sterling scientists, previously reluctant to dip a toe into this heavily stigmatized world, who are diving in head first. I see it in the new surgeon general who cites data showing just how helpful it can be. I see a revolution in the attitudes of everyday Americans. For the first time a majority, 53%, favor its legalization, with 77% supporting it for medical purposes. Support for legalization has risen 11 points in the past few years alone. In 1969, the first time Pew asked the question about legalization, only 12% of the nation was in favor. I see a revolution that is burning white hot among young people, but also shows up among the parents and gran

In [40]:
print(test_data[0].keys())


dict_keys(['article', 'highlights', 'id'])


In [47]:
test_data = test_data.to_list()  # Convert Dataset to a list of dictionaries
# OR
#test_data = test_data.to_pandas().to_dict(orient="records")  # Convert via Pandas if available


In [48]:
print(type(test_data))  # Check data type
print(test_data)        # Print some data


<class 'list'>


In [49]:
import torch

# Load ROUGE metric
rouge = evaluate.load("rouge")

# Function to Evaluate Model with Batch Processing
def evaluate_model(model, tokenizer, test_data, batch_size=4):  # Reduce batch size if needed
    model.to("cuda")  # Move model to GPU
    predictions, references = [], []

    for i in range(0, len(test_data), batch_size):
        batch = test_data[i : i + batch_size]
        inputs = [example["article"] for example in batch]
        refs = [example["highlights"] for example in batch]

        # Tokenize inputs in batch
        inputs_tokenized = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True, max_length=512)
        input_ids = inputs_tokenized.input_ids.to("cuda")

        # Generate summaries
        with torch.no_grad():  # Disable gradient calculation for inference
            summary_ids = model.generate(input_ids, max_length=150, num_beams=4)

        preds = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

        predictions.extend(preds)
        references.extend(refs)

        # Free up memory
        del input_ids, summary_ids
        torch.cuda.empty_cache()

    # Compute ROUGE scores
    results = rouge.compute(predictions=predictions, references=references)

    return results

# Evaluate BART model
bart_results = evaluate_model(bart_model, bart_tokenizer, test_data, batch_size=2)  # Adjust batch_size as needed
print("BART Model Performance:", bart_results)


BART Model Performance: {'rouge1': 0.3721136539078991, 'rouge2': 0.1618540348657509, 'rougeL': 0.2590528163136441, 'rougeLsum': 0.3461126765334377}


In [50]:
t5_results = evaluate_model(t5_model, t5_tokenizer, test_data, batch_size=2)  # Adjust batch_size as needed
print("T5 Model Performance:", t5_results)

T5 Model Performance: {'rouge1': 0.373029525203216, 'rouge2': 0.16711167323607173, 'rougeL': 0.2591316945418999, 'rougeLsum': 0.32396980182855317}


In [51]:
from tabulate import tabulate

# Model performance results
t5_results = {'rouge1': 0.373029525203216, 'rouge2': 0.16711167323607173, 'rougeL': 0.2591316945418999, 'rougeLsum': 0.32396980182855317}
bart_results = {'rouge1': 0.3721136539078991, 'rouge2': 0.1618540348657509, 'rougeL': 0.2590528163136441, 'rougeLsum': 0.3461126765334377}

# Create a list of rows for the table
table_data = [
    ["ROUGE-1", t5_results["rouge1"], bart_results["rouge1"], "T5" if t5_results["rouge1"] > bart_results["rouge1"] else "BART"],
    ["ROUGE-2", t5_results["rouge2"], bart_results["rouge2"], "T5" if t5_results["rouge2"] > bart_results["rouge2"] else "BART"],
    ["ROUGE-L", t5_results["rougeL"], bart_results["rougeL"], "Tie" if t5_results["rougeL"] == bart_results["rougeL"] else ("T5" if t5_results["rougeL"] > bart_results["rougeL"] else "BART")],
    ["ROUGE-Lsum", t5_results["rougeLsum"], bart_results["rougeLsum"], "T5" if t5_results["rougeLsum"] > bart_results["rougeLsum"] else "BART"],
]

# Define table headers
headers = ["Metric", "T5 Model Performance", "BART Model Performance", "Winner"]

# Display the table
print(tabulate(table_data, headers=headers, tablefmt="pretty"))

+------------+----------------------+------------------------+--------+
|   Metric   | T5 Model Performance | BART Model Performance | Winner |
+------------+----------------------+------------------------+--------+
|  ROUGE-1   |  0.373029525203216   |   0.3721136539078991   |   T5   |
|  ROUGE-2   | 0.16711167323607173  |   0.1618540348657509   |   T5   |
|  ROUGE-L   |  0.2591316945418999  |   0.2590528163136441   |   T5   |
| ROUGE-Lsum | 0.32396980182855317  |   0.3461126765334377   |  BART  |
+------------+----------------------+------------------------+--------+


#Analysis of Results
1. ROUGE-1:

- T5 (0.3730) performs slightly better than BART (0.3721).

- This indicates that T5 is marginally better at capturing unigram (single-word) overlap between the generated summaries and the reference summaries.

2. ROUGE-2:

- T5 (0.1671) outperforms BART (0.1619).

- This suggests that T5 is better at capturing bigram (two-word) overlap, which is important for understanding the context and coherence of the summary.

3. ROUGE-L:

- Both models perform identically (0.2591).

- This metric measures the longest common subsequence (LCS) between the generated and reference summaries, indicating that both models are equally good at capturing the overall structure and flow of the summary.

4. ROUGE-Lsum:

- BART (0.3461) outperforms T5 (0.3240).

- This metric is similar to ROUGE-L but is calculated at the summary level. BART’s higher score suggests it is better at generating summaries that are more aligned with the reference summaries in terms of overall content.

5. Conclusion
- Overall Performance:

  - T5 performs slightly better on ROUGE-1 and ROUGE-2, indicating it is better at capturing word-level and context-level overlap.

  - BART performs better on ROUGE-Lsum, suggesting it generates summaries that are more aligned with the reference summaries in terms of overall content.