# CNN-DailyMail News Text Summarization

A T5 model is fine-tuned on the sequence-to-sequence (Seq2Seq) task of summarizing news articles.

<img src="images/31.png">

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv. https://doi.org/10.48550/ARXIV.1910.10683

The text summarization code from the Hugging Face course is used as a reference:
https://huggingface.co/course/chapter7/5?fw=pt#finetuning-mt5-with-keras

In [None]:
# Install the Hugging Face libraries, the ROUGE scorer, and NLTK (used for sentence splitting)
!pip install datasets transformers[sentencepiece]
!pip install rouge_score nltk

In [2]:
# Import libraries
import nltk
import numpy as np
import pandas as pd
from datasets import load_dataset, load_metric
from nltk.tokenize import sent_tokenize  # used in compute_metrics
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Sentence tokenizer data for sent_tokenize (newer NLTK releases may need "punkt_tab" instead)
nltk.download("punkt")

In [9]:
# Load the Kaggle dataset https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail
dataset = load_dataset('csv', data_files={'train': '../input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/train.csv', 
                                          'validation': '../input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/validation.csv' , 
                                          'test': '../input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/test.csv'})

In [11]:
# Use the small T5 v1.1 checkpoint and load its tokenizer
model_checkpoint = "google/t5-v1_1-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [82]:
# Maximum input length for the news article and maximum output length for the summary.
max_input_length = 512
max_target_length = 75

# Preprocessing function: tokenize articles as inputs and highlights as labels.
def preprocess_function(examples):
    # No padding here: DataCollatorForSeq2Seq pads each batch dynamically
    # and fills label padding with -100 so the loss ignores it.
    model_inputs = tokenizer(
        examples["article"], max_length=max_input_length, truncation=True
    )
    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["highlights"], max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [13]:
# Use map to apply the preprocessing function on the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)

### ROUGE Score

<img src="images/32.png">

ROUGE computes precision and recall from the overlap between the model's generated summary and the reference summary (the label).


<img src="images/33.png">

Precision and recall are combined with the F1 metric into a single ROUGE F1 score.

https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460
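
As a rough illustration of the description above, the following sketch computes ROUGE-1 precision, recall, and F1 by hand from unigram overlap. The helper name `rouge1` and the lowercase whitespace tokenization are simplifying assumptions; the `rouge_score` package used in this notebook applies its own tokenization and optional stemming, so its numbers can differ.

```python
from collections import Counter

def rouge1(prediction, reference):
    """Hypothetical helper: ROUGE-1 precision, recall and F1 from unigram overlap."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Five of the six words in each sentence match, so precision, recall, and F1 all come out at 5/6 here.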

In [16]:
# Load the ROUGE metric (load_metric is deprecated in newer datasets releases in favor of the evaluate library)
rouge_score = load_metric("rouge")

In [17]:
# Load T5 Model
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [167]:
batch_size = 8
num_train_epochs = 1

# Log the training loss roughly once per epoch
logging_steps = len(tokenized_dataset["train"]) // batch_size

# Load the Seq2Seq model training arguments. 
# The model is called Seq2Seq because the input data is a sequence of text and
# the output data is also a sequence of text.

args = Seq2SeqTrainingArguments(
    output_dir="news summarizer",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
)
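
To make the `logging_steps` arithmetic concrete, here is a small sketch. The training set size of 287,113 rows is an assumption based on the standard CNN/DailyMail split; the real value comes from `len(tokenized_dataset["train"])`. The sketch also assumes a single device and no gradient accumulation.

```python
import math

num_train_examples = 287_113  # assumed size of the standard CNN/DailyMail training split
batch_size = 8

# Floor division, as in the cell above: log once every this many optimizer steps
logging_steps = num_train_examples // batch_size
# Steps actually run per epoch (the last, smaller batch is kept by default)
steps_per_epoch = math.ceil(num_train_examples / batch_size)

print(logging_steps, steps_per_epoch)
```

Under these assumptions the loss is logged once, one step before the end of the single training epoch.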

In [168]:
# Function for Computing ROUGE score metric.
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Decode generated summaries into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    # Decode reference summaries into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    # Compute ROUGE scores
    result = rouge_score.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Extract the median scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    return {k: round(v, 4) for k, v in result.items()}
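
The newline step matters because the `rougeLsum` variant treats each line as one sentence. The sketch below shows the formatting on a toy string; `one_sentence_per_line` is a hypothetical stand-in that uses a simple regular expression instead of NLTK's `sent_tokenize`, so it will mis-split abbreviations that the real tokenizer handles.

```python
import re

def one_sentence_per_line(text):
    """Hypothetical stand-in for sent_tokenize: split after ., ! or ? followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(sentences)

formatted = one_sentence_per_line("Police arrested a suspect. The trial starts Monday.")
print(formatted)
```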

In [169]:
# Data collator forms batches of input data.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
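
Conceptually, the collator pads every sequence in a batch to the length of the longest one and pads labels with -100 so that padded positions are ignored by the loss. The sketch below imitates that behaviour in plain Python; `collate`, the token ids, and the pad id of 0 are illustrative assumptions, not the library's implementation.

```python
def collate(features, pad_token_id=0, label_pad_token_id=-100):
    """Hypothetical stand-in for dynamic padding of one batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Label padding uses -100, which the cross-entropy loss ignores
        l_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * l_pad)
    return batch

batch = collate([
    {"input_ids": [11, 12, 13, 1], "labels": [21, 1]},
    {"input_ids": [14, 1], "labels": [22, 23, 24, 1]},
])
print(batch["labels"])
```

This is also why `compute_metrics` swaps -100 back to the pad token id before decoding the labels.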

In [170]:
# Seq2SeqTrainer handles training and generation-based evaluation.
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [171]:
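# Disable Weights & Biases (wandb) logging.
# Setting this before the training arguments are created is the safer order.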
%env WANDB_DISABLED=True

In [174]:
# Train the model.
trainer.train()

In [175]:
# Evaluate the trained model on the validation set.
trainer.evaluate()